CN116091574A - 3D target detection method and system based on plane constraint and position constraint - Google Patents

3D target detection method and system based on plane constraint and position constraint

Info

Publication number
CN116091574A
CN116091574A
Authority
CN
China
Prior art keywords
point cloud
detection
frame
training
constraint
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310028861.XA
Other languages
Chinese (zh)
Inventor
杨勐
周祥
丁瑞
郑南宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202310028861.XA priority Critical patent/CN116091574A/en
Publication of CN116091574A publication Critical patent/CN116091574A/en
Pending legal-status Critical Current

Classifications

    • G06T 7/50: Image analysis; Depth or shape recovery
    • G06T 7/11: Image analysis; Segmentation; Edge detection; Region-based segmentation
    • G06T 7/194: Image analysis; Segmentation; Edge detection; involving foreground-background segmentation
    • G06T 7/66: Image analysis; Analysis of geometric attributes; of image moments or centre of gravity
    • G06T 7/70: Image analysis; Determining position or orientation of objects or cameras
    • G06V 20/64: Scenes; Scene-specific elements; Type of objects; Three-dimensional objects
    • G06T 2207/10028: Indexing scheme for image analysis or image enhancement; Image acquisition modality; Range image; Depth image; 3D point clouds
    • G06T 2207/20081: Indexing scheme for image analysis or image enhancement; Special algorithmic details; Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 3D target detection method and system based on plane constraints and position constraints. An RGB image is input, and a depth map is obtained by training the depth estimation model ForeSeE; the depth map is segmented with an instance segmentation mask, and the resulting foreground part is converted into a foreground point cloud; a pseudo point cloud frame label of the same size as the GT detection frame is generated with the foreground point cloud as its center; the parameters of the depth estimation model are frozen and the 3D detection network is trained with the pseudo point cloud frame label as the training label, completing the first training stage; the parameters of the 3D detection network F-PointNet are then frozen and the depth estimation model is trained, using the GT detection frame as the label of the 3D detector, completing the second training stage; the two stages are trained alternately so that the 3D detection network F-PointNet can correctly predict the pseudo point cloud position at all times. The method and system markedly improve the depth estimation results, produce more prominent object contours, and improve the performance of the 3D detection model.

Description

3D target detection method and system based on plane constraint and position constraint
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a 3D target detection method and system based on plane constraint and position constraint.
Background
3D target detection is an important task in fields such as autonomous driving and robot obstacle avoidance. Its purpose is to obtain the position and volume information of surrounding objects in three-dimensional space. According to the form of the input data, 3D target detection can be classified into point cloud-based algorithms and image-based detection algorithms. Although the accuracy of current image-based detection algorithms lags behind that of pure point cloud detection algorithms, image-based detection algorithms such as monocular 3D detection remain a research hotspot in academia and industry thanks to their advantages of high resolution, low cost, and easy deployment. In recent years, researchers have proposed monocular detection methods based on pseudo point clouds. A pseudo point cloud detection algorithm decouples monocular detection into two separate modules: depth estimation and pure point cloud 3D detection. Depth is estimated first, the depth map is then converted into a pseudo point cloud, and finally a pure point cloud detection model is trained with the pseudo point cloud as input. The pseudo point cloud detection algorithm can thus improve monocular detection accuracy by leveraging high-accuracy pure point cloud detection algorithms. The depth estimation module can be pre-trained on large-scale datasets, which improves generalization and suits more complex and variable scenes, while the 3D detection module can flexibly select a high-accuracy detection model according to the requirements of the actual scene.
The difficulty of the pseudo point cloud detection method lies in depth estimation. Monocular depth estimation is itself an ill-posed problem: the predicted depth map is often inaccurate, and compared with the real point cloud, objects in the pseudo point cloud are often severely distorted in shape and accompanied by position shifts, which degrades 3D detection performance. Based on the above, the depth estimation problems in the pseudo point cloud detection method can be summarized as follows:
1. Depth estimation blur causes severe distortion of object shapes in the pseudo point cloud. Conventional depth estimation typically focuses on reducing pixel-level errors rather than optimizing the depth structure, so the predicted depth map is usually blurred inside objects and around their contours. Blur in the predicted depth map leads to distortion of the pseudo point cloud: objects exhibit shape distortion and trailing around their contours. This makes it difficult for the 3D detection network to learn valid features from the distorted pseudo point cloud during training, and the network may therefore produce a large number of false detections in the prediction phase. In recent years, some post-processing methods have been proposed to address pseudo point cloud distortion. These methods typically use instance segmentation or redesign the pseudo point cloud sparsification scheme to reduce trailing points. However, the additional processing complicates the overall model and is unsuitable for real-time applications, and distortion inside the object still cannot be handled well.
2. Depth estimation error causes deviation in the predicted position of the object. Estimating the absolute distance of an object from a single RGB image is very difficult; in particular, as the distance increases, depth labels become sparser and the depth estimation error becomes more severe. The depth estimation error shifts the position of the object in the pseudo point cloud, thereby disturbing the 3D detection result. Some recent approaches propose to solve this problem by joint training of the depth estimation model and the 3D detection network. However, because of the position offset problem, some GT labels may not accurately correspond to the predicted positions of objects in the pseudo point cloud. Using these GT labels in joint training interferes with the training of the 3D detection network, preventing it from learning the correct positions of objects in the pseudo point cloud and thus degrading 3D detection performance.
Therefore, how to solve the two problems of object shape distortion and position offset in the pseudo point cloud under monocular depth estimation has become the key to the pseudo point cloud detection method.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a 3D target detection method and system based on plane constraints and position constraints that address the technical problems of pseudo point cloud shape distortion and predicted position error caused by depth estimation errors, thereby improving the performance of pseudo point cloud 3D target detection.
The invention adopts the following technical scheme:
A 3D target detection method based on plane constraints and position constraints comprises the following steps:
S1, inputting an RGB image, and obtaining a depth map by training the depth estimation model ForeSeE;
S2, segmenting the depth map obtained in step S1 with an instance segmentation mask, and converting the resulting foreground part into a foreground point cloud;
S3, generating a pseudo point cloud frame label centered on the foreground point cloud obtained in step S2, the pseudo point cloud frame label having the same size as the GT detection frame;
S4, freezing the parameters of the depth estimation model and training the 3D detection network with the pseudo point cloud frame label obtained in step S3 as the training label, completing the first training stage; then freezing the parameters of the 3D detection network F-PointNet and training the depth estimation model, using the GT detection frame as the label of the 3D detector, completing the second training stage; the two stages are trained alternately so that the 3D detection network F-PointNet can correctly predict the pseudo point cloud position at all times.
Specifically, in step S1, the depth estimation model ForeSeE is trained to predict depth with the loss function loss_wcel defined as follows:
loss_wcel = α * wcel_loss_fg + (1 - α) * wcel_loss_bg
where wcel_loss_fg and wcel_loss_bg are the pixel-level cross entropy loss functions of the foreground and background respectively, and α is the weight of the foreground loss.
Specifically, in step S1, the normal vector constraint loss_normal is calculated as follows:
loss_normal = (1/N) * Σ_{i=1..N} ‖n_i^pred - n_i^gt‖
where N is the number of valid point groups within the 2D frame of the object, n_i^pred is the predicted normal vector, and n_i^gt is the true normal vector.
Specifically, in step S1, the gradient constraint is calculated as follows:
loss_gradient = (1/N) * Σ_{i=1..N} (|g_u,i^pred - g_u,i^gt| + |g_v,i^pred - g_v,i^gt|)
where N is the number of valid point groups within the 2D frame of the object, g_u^pred is the predicted horizontal gradient difference, g_u^gt is the true horizontal gradient difference, g_v^pred is the predicted vertical gradient difference, and g_v^gt is the true vertical gradient difference.
Specifically, in step S2, the foreground depth is converted into a foreground point cloud using the following conversion formulas:
x = (u - c_x) * z / f_x
y = (v - c_y) * z / f_y
where (u, v) are the pixel coordinates, f_x and f_y are the focal lengths of the camera, (c_x, c_y) are the pixel coordinates of the image center point, and z is the predicted depth.
Specifically, in step S3, the pseudo point cloud frame is a detection frame of the same size as the GT detection frame, centered on the pseudo point cloud; when the GT frame deviates, the pseudo point cloud frame still represents the position of the pseudo point cloud. When the ratio value (the fraction of the object's pseudo point cloud points that fall inside the GT frame) is greater than 0.25, the center point of the GT frame is used as the center point of the pseudo point cloud frame; when the ratio value is less than 0.25, the mean of all pseudo point cloud positions of the object is used as the center point of the pseudo point cloud frame.
Further, the center position of the pseudo point cloud frame, center_pseudo, is calculated as follows:
ratio = Num_GT / Num_all
center_pseudo = center_GT, if ratio > thresh
center_pseudo = mean, otherwise
where Num_GT is the number of pseudo point cloud points of the object inside the GT frame, Num_all is the total number of pseudo point cloud points of the object, center_GT is the GT frame center, mean is the foreground point cloud center, and thresh is the ratio threshold.
Specifically, in step S4, the loss function loss1_det of the first stage is as follows:
loss1_det = FPointNetLoss(Box_pred, Pseudo_label)
where FPointNetLoss is the original loss function of the detection network F-PointNet, Box_pred is the predicted 3D frame, and Pseudo_label is the pseudo point cloud frame label.
Specifically, in step S4, the loss function loss_all of the second stage is as follows:
loss_all = λ_1 * loss_wcel + λ_2 * loss_normal + λ_3 * loss_gradient + λ_4 * loss2_det
where λ_1 = 6, λ_2 = λ_3 = 1, λ_4 = 0.001; loss_wcel is the pixel-level loss function, loss_normal is the normal vector constraint, loss_gradient is the gradient constraint, and loss2_det is the 3D detection loss.
In a second aspect, an embodiment of the present invention provides a 3D object detection system based on plane constraint and position constraint, including:
an estimation module, which inputs an RGB image and obtains a depth map by training the depth estimation model ForeSeE;
a conversion module, which segments the depth map obtained by the estimation module with an instance segmentation mask and converts the resulting foreground part into a foreground point cloud;
a label module, which generates a pseudo point cloud frame label of the same size as the GT detection frame, centered on the foreground point cloud obtained by the conversion module;
a prediction module, which freezes the parameters of the depth estimation model and trains the 3D detection network with the pseudo point cloud frame label obtained by the label module as the training label, completing the first training stage; it then freezes the parameters of the 3D detection network F-PointNet and trains the depth estimation model, using the GT detection frame as the label of the 3D detector, completing the second training stage; the two stages are trained alternately so that the 3D detection network F-PointNet can correctly predict the pseudo point cloud position at all times.
Compared with the prior art, the invention has at least the following beneficial effects:
The 3D target detection method based on plane constraints and position constraints provides plane constraints (normal vector and gradient constraints) for the problem of pseudo point cloud shape distortion, which markedly improve the shape of the pseudo point cloud. The normal vector constraint is chosen to enhance the shape and structural characteristics of the object in flat regions; the gradient constraint is chosen to highlight object edges in the depth map and reduce the trailing phenomenon in the pseudo point cloud. Because the depth labels used in training are sparse and irregularly distributed, the normal vector and gradient constraints are constructed by randomly sampling points within the 2D frame of the object. After the normal vector and gradient constraints are adopted, the depth estimation results improve markedly: object contours are more pronounced in the depth map, the shape and structural characteristics of the object in the pseudo point cloud are enhanced, and the trailing at the edges is reduced. For the problem of depth estimation error, position constraints (end-to-end training + pseudo point cloud frame labels + a two-stage training strategy) are provided, which markedly reduce the pseudo point cloud position prediction error. A 3D detection network is appended to the depth estimation model for end-to-end joint training, so that the depth estimation model is optimized for 3D detection with the additional information of the 3D detection frames. Since the position deviation of the pseudo point cloud also interferes with the training of the 3D detection network, a pseudo point cloud frame (a detection frame generated from the center of the pseudo point cloud) is proposed as the training label, and a two-stage training method is further provided around this pseudo point cloud label. In the first stage, the depth estimation model is frozen and the 3D detection network is trained using only the pseudo point cloud frame labels so that it correctly identifies objects in the scene. In the second stage, the 3D detection network is frozen and only the GT labels are used to train the depth estimation model. After the two-stage training strategy is adopted, the depth estimation error of middle- and far-range objects is markedly reduced, which reduces the interference with the training of the 3D detection model and improves detection performance.
Further, a high-precision pixel-level depth map can be obtained using the depth estimation model ForeSeE. The loss function used computes the foreground objects separately from the background and gives the foreground a high weight, so the model focuses on foreground object prediction and the prediction accuracy of foreground objects is improved.
Further, the normal vector constraint loss_normal enhances the prediction accuracy of the internal structure of objects in the depth map, thereby enhancing the structural characteristics of the foreground point cloud and improving 3D detection accuracy.
Further, the gradient constraint loss_gradient enhances object contour prediction accuracy, thereby reducing the trailing phenomenon at foreground point cloud edges and improving 3D detection accuracy.
Furthermore, the foreground depth can be accurately converted into the foreground point cloud using the above conversion formulas; the background depth is not needed, which improves computational efficiency.
Further, a pseudo point cloud frame is generated centered on the foreground point cloud. When the foreground depth prediction has a large error, the GT label deviates from the foreground point cloud; the pseudo point cloud frame still accurately represents the position of the foreground point cloud, thereby reducing the interference with subsequent 3D detection training.
Further, the pseudo point cloud frame center center_pseudo is located according to the ratio value, which reflects the accuracy of the foreground depth prediction. When the ratio is greater than 0.25, the GT label center is used as the pseudo point cloud label center; otherwise, the foreground point cloud center is used as the pseudo point cloud label center. This arrangement is more accurate.
Further, the first-stage loss function loss1_det is used to train the 3D detection network F-PointNet; the pseudo point cloud frame label is used in the loss function so that F-PointNet can learn the correct position of the foreground pseudo point cloud.
Further, the second-stage loss function loss_all optimizes the depth estimation model ForeSeE with the help of F-PointNet. One of its component losses, loss2_det, uses the GT label, so that when an error arises between the position of the foreground point cloud and the GT label, the loss value of the 3D detector F-PointNet becomes large. During error back-propagation this affects the parameter update of the depth estimation network, pulling the foreground point cloud closer to the GT label and thus optimizing the depth estimation model.
It will be appreciated that the advantages of the second aspect may be found in the relevant description of the first aspect, and will not be described in detail herein.
In conclusion, the invention markedly improves the depth estimation results and makes object contours more prominent; the shape characteristics of the pseudo point cloud become more distinct, and the depth estimation error of middle- and far-range objects is markedly reduced, thereby improving the performance of the 3D detection model.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is the overall framework diagram of the present invention;
FIG. 2 is a depth estimation input-output diagram of the present invention, wherein (a) is an input RGB image and (b) is an output depth map;
FIG. 3 illustrates the plane constraints (normal vector and gradient constraints) employed in the present invention, wherein (a) is a schematic diagram of constructing the normal vector constraint from randomly sampled points and (b) is a schematic diagram of constructing the gradient constraint from randomly sampled points;
FIG. 4 shows the instance segmentation used by the present invention;
FIG. 5 is a schematic diagram of a pseudo point cloud frame tag of the present invention;
FIG. 6 is a schematic diagram of a two-stage training strategy according to the present invention, wherein (a) is a first-stage training schematic diagram and (b) is a second-stage training schematic diagram;
FIG. 7 is a data comparison of the 3D detection results of the present invention with various other methods;
FIG. 8 is an objective comparison on the depth map and point cloud before and after adding the normal vector constraint of the present invention, wherein (a) is the depth map before adding the normal vector constraint and (b) is the depth map after adding the normal vector constraint;
FIG. 9 is an objective comparison on the point cloud before and after adding the gradient constraint of the present invention, wherein (a) is the depth map before adding the gradient constraint and (b) is the depth map after adding the gradient constraint;
FIG. 10 is an objective comparison of the 3D detection results as the normal vector and gradient constraints of the method are added in turn, wherein (a) is the depth map before adding the normal vector and gradient constraints, (b) is the depth map after adding the normal vector constraint, and (c) is the depth map after adding both the normal vector and gradient constraints;
FIG. 11 is an objective comparison of the point cloud position prediction results before and after the two-stage training of the present invention, wherein (a) is the point cloud without the two-stage training strategy and (b) is the point cloud with the two-stage training strategy.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the description of the present invention, it will be understood that the terms "comprises" and "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations; e.g., A and/or B may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship.
It should be understood that although the terms first, second, third, etc. may be used to describe the preset ranges, etc. in the embodiments of the present invention, these preset ranges should not be limited to these terms. These terms are only used to distinguish one preset range from another. For example, a first preset range may also be referred to as a second preset range, and similarly, a second preset range may also be referred to as a first preset range without departing from the scope of embodiments of the present invention.
Depending on the context, the word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining" or "in response to detecting". Similarly, the phrase "if determined" or "if (a stated condition or event) is detected" may be interpreted as "when determined" or "in response to determining" or "when (the stated condition or event) is detected" or "in response to detecting (the stated condition or event)", depending on the context.
Various structural schematic diagrams according to the disclosed embodiments of the present invention are shown in the accompanying drawings. The figures are not drawn to scale, wherein certain details are exaggerated for clarity of presentation and may have been omitted. The shapes of the various regions, layers and their relative sizes, positional relationships shown in the drawings are merely exemplary, may in practice deviate due to manufacturing tolerances or technical limitations, and one skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions as actually required.
The traditional depth estimation model is not optimized for 3D detection, which leads to the existing pseudo point cloud defects of severe shape distortion (deformation, trailing) and position deviation. The invention provides a 3D target detection method based on plane constraints and position constraints, adding plane constraints (normal vector and gradient constraints) and position constraints (end-to-end training + pseudo point cloud frame labels + a two-stage training strategy) to the training of the pseudo point cloud detection method. This markedly improves the shape of the pseudo point cloud, reduces its position deviation, and ultimately improves 3D detection accuracy, thereby alleviating the two problems of pseudo point cloud shape distortion and predicted position error caused by depth estimation errors in the pseudo point cloud detection method.
For the problem of pseudo point cloud shape distortion, the method adds normal vector and gradient constraints when training the depth estimation model, which improves the shape characteristics of the pseudo point cloud and reduces the trailing phenomenon. The normal vector constraint is applied to flat regions of the object to enhance the shape and structural characteristics of the object in the pseudo point cloud; the gradient constraint is applied to object edge regions to highlight object edges in the depth map and reduce the trailing phenomenon in the pseudo point cloud. Because the valid values in the actual depth labels are very sparse and irregularly distributed, conventional gradient operators are not suitable for such data, so a random point sampling method is adopted to compute the normal vector and gradient constraints.
Aiming at the problem of pseudo point cloud prediction position errors, a 3D detection network is added behind the depth estimation model to perform end-to-end joint training. The depth estimation model for 3D detection is optimized by adding information of the 3D detection frame additionally. And it is proposed to better perform end-to-end training by combining a two-stage training strategy with a pseudo point cloud frame tag (centered on the pseudo point cloud). In the first stage, the depth estimation model is frozen and the 3D detection network is trained using only the pseudo point cloud frame tags to correctly identify objects in the pseudo point cloud. In the second stage, the 3D detection network is frozen and only the GT-tag is used to train the depth estimation model.
Referring to fig. 1, the 3D object detection method based on plane constraint and position constraint of the present invention includes the following steps:
S1, monocular depth estimation
S101, an RGB image, as shown in FIG. 2(a), is input, and depth is predicted by training the depth estimation model ForeSeE; the predicted depth map is shown in FIG. 2(b);
The pixel-level loss function employed in training is a weighted cross entropy loss (wcel). The losses of the foreground and the background are computed separately using the 2D detection frame, and the final pixel-level loss function is obtained by weighted summation:
loss_wcel = α * wcel_loss_fg + (1 - α) * wcel_loss_bg
where wcel_loss_fg and wcel_loss_bg are the pixel-level cross entropy loss functions of the foreground and background respectively, and α is the weight of the foreground loss. Setting α to 0.7 in the experiments makes the depth estimation model focus on foreground prediction.
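As a concrete illustration of this weighted foreground/background objective, the following sketch (Python/PyTorch, not taken from the patent) combines separately computed foreground and background pixel losses with foreground weight α; the per-pixel term is stood in for by an L1 error, and the (1 - α) background weighting and the mask handling are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_loss(pred, gt):
    # Stand-in for the per-pixel cross-entropy term of wcel; an L1 error is
    # used here only so that the sketch runs end to end.
    return F.l1_loss(pred, gt)

def loss_wcel(pred_depth, gt_depth, fg_mask, alpha=0.7):
    """Foreground and background losses are computed separately (the foreground
    mask comes from the 2D detection frames) and combined with foreground
    weight alpha, 0.7 in the described experiments."""
    valid = gt_depth > 0                      # depth labels are sparse
    fg = valid & fg_mask
    bg = valid & ~fg_mask
    loss_fg = pixel_loss(pred_depth[fg], gt_depth[fg])
    loss_bg = pixel_loss(pred_depth[bg], gt_depth[bg])
    return alpha * loss_fg + (1.0 - alpha) * loss_bg
```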
S102, the structural features of the pseudo point cloud object are enhanced by adding a normal vector constraint in the depth estimation stage;
In addition to the pixel-level constraint loss_wcel, a normal vector constraint is added to enhance the shape of the pseudo point cloud object and reduce its distortion. As shown in FIG. 3(a), virtual normal vector features are computed from randomly sampled points inside the 2D frame of the object.
The normal vector constraint is calculated as follows:
loss_normal = (1/N) * Σ_{i=1..N} ‖n_i^pred - n_i^gt‖
where N is the number of valid point groups within the 2D frame of the object, each group consisting of 3 randomly sampled points, and n is the normal vector of the plane formed by these 3 points.
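One way to realize this random-triple normal vector constraint is sketched below (illustrative Python/PyTorch, not the patent's code): triples of labelled pixels inside the object's 2D frame are back-projected with the predicted and ground-truth depths, and the unit normals of the resulting planes are compared; the number of sampled groups and the camera-intrinsics interface are assumptions.

```python
import torch

def backproject(u, v, z, fx, fy, cx, cy):
    # Pinhole back-projection of pixels (u, v) with depth z into camera space.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return torch.stack([x, y, z], dim=-1)

def loss_normal(pred_depth, gt_depth, box_pixels, intrinsics, n_groups=100):
    """Normal vector constraint from random point triples.  box_pixels is a
    pair (u, v) of long tensors holding the valid labelled pixels inside the
    object's 2D frame; intrinsics is (fx, fy, cx, cy)."""
    fx, fy, cx, cy = intrinsics
    u, v = box_pixels
    idx = torch.randint(0, u.numel(), (n_groups, 3))        # random triples
    pu, pv = u[idx].float(), v[idx].float()
    p_pred = backproject(pu, pv, pred_depth[v[idx], u[idx]], fx, fy, cx, cy)
    p_gt = backproject(pu, pv, gt_depth[v[idx], u[idx]], fx, fy, cx, cy)

    def unit_normal(p):
        # Normal of the plane spanned by each triple of 3D points.
        n = torch.cross(p[:, 1] - p[:, 0], p[:, 2] - p[:, 0], dim=-1)
        return n / (n.norm(dim=-1, keepdim=True) + 1e-8)

    return (unit_normal(p_pred) - unit_normal(p_gt)).abs().sum(dim=-1).mean()
```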
S103, the edge trailing phenomenon of the pseudo point cloud object is reduced by adding a gradient constraint.
Adding the gradient constraint highlights the object edges in the depth map and reduces the trailing and deformation phenomena at the edges of the pseudo point cloud.
As shown in FIG. 3(b), gradients are computed from randomly sampled points inside the 2D frame of the object and at its edges. The gradient constraint is calculated as follows:
loss_gradient = (1/N) * Σ_{i=1..N} (|g_u,i^pred - g_u,i^gt| + |g_v,i^pred - g_v,i^gt|)
where N is the number of valid point pairs within the 2D frame of the object, each pair consisting of 2 randomly sampled points. g_u and g_v are the horizontal and vertical image gradients of the depth computed from these 2 points:
g_u = (z_1 - z_2) / (u_1 - u_2),  g_v = (z_1 - z_2) / (v_1 - v_2)
where u and v are the pixel coordinates in the horizontal and vertical directions and z is the depth value obtained by depth estimation.
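A similar sketch for the random-pair gradient constraint (again illustrative, under the same assumed pixel-list interface): each pair of labelled pixels gives finite-difference depth gradients in the horizontal and vertical directions, and the predicted and ground-truth gradients are compared.

```python
import torch

def loss_gradient(pred_depth, gt_depth, box_pixels, n_pairs=100):
    """Gradient constraint from random point pairs inside (and around) the
    object's 2D frame; box_pixels is a pair (u, v) of long tensors of the
    valid labelled pixel coordinates."""
    u, v = box_pixels
    idx = torch.randint(0, u.numel(), (n_pairs, 2))          # random pairs
    u1, u2 = u[idx[:, 0]].float(), u[idx[:, 1]].float()
    v1, v2 = v[idx[:, 0]].float(), v[idx[:, 1]].float()

    def grads(depth):
        z1 = depth[v[idx[:, 0]], u[idx[:, 0]]]
        z2 = depth[v[idx[:, 1]], u[idx[:, 1]]]
        gu = (z1 - z2) / (u1 - u2 + 1e-8)                    # horizontal gradient
        gv = (z1 - z2) / (v1 - v2 + 1e-8)                    # vertical gradient
        return gu, gv

    gu_p, gv_p = grads(pred_depth)
    gu_g, gv_g = grads(gt_depth)
    return ((gu_p - gu_g).abs() + (gv_p - gv_g).abs()).mean()
```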
S2, generation of pseudo point cloud
S201, foreground objects in the RGB image are accurately distinguished using an instance segmentation network;
The input RGB image is shown in FIG. 2(a). The foreground and background of the RGB image are precisely separated with an instance segmentation network, yielding the instance segmentation mask of the foreground. The instance segmentation result is shown schematically in FIG. 4.
S202, the corresponding foreground depth is extracted using the foreground instance segmentation mask;
The foreground instance segmentation mask is used to extract the corresponding foreground part from the full-image depth prediction map obtained in step S101. The subsequent 3D detection model only needs the depth corresponding to the foreground objects.
S203, the foreground depth is converted into a foreground point cloud using the following conversion formulas:
x = (u - c_x) * z / f_x
y = (v - c_y) * z / f_y
where (u, v) are the pixel coordinates, f_x and f_y are the focal lengths of the camera, (c_x, c_y) are the pixel coordinates of the image center point, and z is the predicted depth.
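Steps S202 and S203 can be condensed into a few lines (illustrative NumPy sketch, not the patent's implementation): the instance mask selects the foreground pixels of the predicted depth map, which are then back-projected through the pinhole model above into the foreground pseudo point cloud.

```python
import numpy as np

def depth_to_foreground_cloud(depth, instance_mask, fx, fy, cx, cy):
    """Back-projects only the masked foreground pixels of the predicted depth
    map into an (N, 3) point cloud in camera coordinates."""
    v, u = np.nonzero(instance_mask & (depth > 0))   # foreground pixels with valid depth
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```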
S3, pseudo point cloud frame label generation
The pseudo point cloud frame is a detection frame of the same size as the GT detection frame, centered on the pseudo point cloud. As shown in FIG. 5, when the GT frame deviates, the pseudo point cloud frame still correctly represents the position of the pseudo point cloud; using it as the training label therefore reduces the interference with training.
The center position of the pseudo point cloud frame is calculated as follows:
ratio = Num_GT / Num_all
center_pseudo = center_GT, if ratio > thresh
center_pseudo = mean of all pseudo points of the object, otherwise
where Num_GT represents the number of pseudo point cloud points of the object inside the GT frame and Num_all represents the total number of pseudo point cloud points of the object.
When the ratio value is large, the deviation between the object and the GT frame is small, and the center point of the GT frame can be used directly as the center point of the pseudo point cloud frame. When the ratio value is small, the deviation between the object and the GT frame is large, and the mean of all pseudo point cloud positions of the object is used as the center point of the pseudo point cloud frame.
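The pseudo point cloud frame center selection can be summarized as follows (illustrative Python; the 0.25 threshold follows the description above, while the array interface is an assumption):

```python
import numpy as np

def pseudo_box_center(points_in_gt, points_all, gt_center, thresh=0.25):
    """Pseudo point cloud frame center: if enough of the object's pseudo points
    fall inside the GT frame (ratio > thresh), reuse the GT frame center;
    otherwise use the mean of all the object's pseudo points.  The frame size
    is copied from the GT detection frame either way."""
    ratio = len(points_in_gt) / max(len(points_all), 1)
    if ratio > thresh:
        return np.asarray(gt_center, dtype=float)
    return np.asarray(points_all, dtype=float).mean(axis=0)
```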
S4, two-stage training strategy
S401, the first stage of the two-stage training strategy;
As shown in FIG. 6(a), in the first stage of training the parameters of the depth estimation model are frozen and only the 3D detection network is trained. The first stage trains the 3D detection network F-PointNet using the pseudo point cloud frame as the label, which enables the 3D detector to learn the correct position of the pseudo point cloud. The first stage directly adopts the loss function of F-PointNet, with only the GT labels replaced by the pseudo point cloud frame labels:
loss1_det = FPointNetLoss(Box_pred, Pseudo_label)
S402, the second stage of the two-stage training strategy;
As shown in FIG. 6(b), in the second stage the parameters of the 3D detection network F-PointNet are frozen, only the depth estimation model is trained, and the depth estimation network is trained using the GT detection frames as the labels of the 3D detector. Since the 3D detector parameters are frozen, the detector can only predict the true position of the pseudo point cloud. When an error arises between the pseudo point cloud position and the GT detection frame, the 3D detection loss value becomes larger; during error back-propagation this affects the parameter update of the depth estimation network, pulling the pseudo point cloud closer to the GT detection frame.
The overall loss function of the second stage is as follows:
loss_all = λ_1 * loss_wcel + λ_2 * loss_normal + λ_3 * loss_gradient + λ_4 * loss2_det
where λ_1 = 6, λ_2 = λ_3 = 1, λ_4 = 0.001, and loss2_det is defined as:
loss2_det = FPointNetLoss(Box_pred, GT_label)
S403, the two stages are trained alternately so that the 3D detection network F-PointNet can always predict the pseudo point cloud position correctly.
Through this alternating training of the pseudo point cloud frame labels, the 3D detection model, and the depth estimation model, depth estimation is optimized with the help of 3D detection.
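The alternating schedule can be sketched as follows (illustrative Python/PyTorch; the data-loader keys, the pseudo_cloud_fn helper, and the det_net.loss interface are assumed for illustration and are not the patent's actual code; the λ weights follow the values given above):

```python
def set_requires_grad(module, flag):
    # Freeze or unfreeze all parameters of a network.
    for p in module.parameters():
        p.requires_grad = flag

def alternate_training(depth_net, det_net, loader, opt_depth, opt_det,
                       pseudo_cloud_fn, depth_losses, rounds=10,
                       lambdas=(6.0, 1.0, 1.0, 0.001)):
    """Alternates the two stages: (1) depth model frozen, F-PointNet trained
    with pseudo point cloud frame labels; (2) F-PointNet frozen, depth model
    trained with the GT frames through the combined loss."""
    l1, l2, l3, l4 = lambdas
    for _ in range(rounds):
        # Stage 1: freeze depth estimation, train the 3D detector on pseudo labels.
        set_requires_grad(depth_net, False)
        set_requires_grad(det_net, True)
        for batch in loader:
            cloud = pseudo_cloud_fn(depth_net(batch["image"]), batch)
            loss1_det = det_net.loss(det_net(cloud), batch["pseudo_box_labels"])
            opt_det.zero_grad(); loss1_det.backward(); opt_det.step()

        # Stage 2: freeze the 3D detector, train depth estimation against GT frames.
        set_requires_grad(depth_net, True)
        set_requires_grad(det_net, False)
        for batch in loader:
            depth = depth_net(batch["image"])
            cloud = pseudo_cloud_fn(depth, batch)
            loss2_det = det_net.loss(det_net(cloud), batch["gt_box_labels"])
            wcel, normal, gradient = (f(depth, batch) for f in depth_losses)
            loss_all = l1 * wcel + l2 * normal + l3 * gradient + l4 * loss2_det
            opt_depth.zero_grad(); loss_all.backward(); opt_depth.step()
```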
In still another embodiment of the present invention, a 3D object detection system based on plane constraint and position constraint is provided, where the system can be used to implement the above 3D object detection method based on plane constraint and position constraint, and specifically, the 3D object detection system based on plane constraint and position constraint includes an estimation module, a conversion module, a tag module, and a prediction module.
The estimation module inputs an RGB image and obtains a depth map by training the depth estimation model ForeSeE;
the conversion module segments the depth map obtained by the estimation module with an instance segmentation mask and converts the resulting foreground part into a foreground point cloud;
the label module generates a pseudo point cloud frame label of the same size as the GT detection frame, centered on the foreground point cloud obtained by the conversion module;
the prediction module freezes the parameters of the depth estimation model and trains the 3D detection network with the pseudo point cloud frame label obtained by the label module as the training label, completing the first training stage; it then freezes the parameters of the 3D detection network F-PointNet and trains the depth estimation model, using the GT detection frame as the label of the 3D detector, completing the second training stage; the two stages are trained alternately so that the 3D detection network F-PointNet can correctly predict the pseudo point cloud position at all times.
In yet another embodiment of the present invention, a terminal device is provided. The terminal device includes a processor and a memory, the memory being used to store a computer program that includes program instructions, and the processor being used to execute the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc.; it is the computational and control core of the terminal, adapted to load and execute one or more instructions to implement the corresponding method flow or functions. The processor according to the embodiment of the invention can be used to perform the 3D object detection method based on plane constraints and position constraints, which comprises the following steps:
inputting an RGB image, and obtaining a depth map by training the depth estimation model ForeSeE; segmenting the obtained depth map with an instance segmentation mask, and converting the resulting foreground part into a foreground point cloud; generating a pseudo point cloud frame label of the same size as the GT detection frame, centered on the obtained foreground point cloud; freezing the parameters of the depth estimation model and training the 3D detection network with the pseudo point cloud frame label as the training label, completing the first training stage; freezing the parameters of the 3D detection network F-PointNet and training the depth estimation model, using the GT detection frame as the label of the 3D detector, completing the second training stage; the two stages are trained alternately so that the 3D detection network F-PointNet can correctly predict the pseudo point cloud position at all times.
In a further embodiment of the present invention, the present invention also provides a storage medium, in particular, a computer readable storage medium (Memory), which is a Memory device in a terminal device, for storing programs and data. It will be appreciated that the computer readable storage medium herein may include both a built-in storage medium in the terminal device and an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium may be a high-speed RAM Memory or a Non-Volatile Memory (Non-Volatile Memory), such as at least one magnetic disk Memory.
One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the respective steps of the above-described embodiments with respect to a 3D object detection method based on planar constraints and position constraints; one or more instructions in a computer-readable storage medium are loaded by a processor and perform the steps of:
inputting an RGB image, and obtaining a depth map by training the depth estimation model ForeSeE; segmenting the obtained depth map with an instance segmentation mask, and converting the resulting foreground part into a foreground point cloud; generating a pseudo point cloud frame label of the same size as the GT detection frame, centered on the obtained foreground point cloud; freezing the parameters of the depth estimation model and training the 3D detection network with the pseudo point cloud frame label as the training label, completing the first training stage; freezing the parameters of the 3D detection network F-PointNet and training the depth estimation model, using the GT detection frame as the label of the 3D detector, completing the second training stage; the two stages are trained alternately so that the 3D detection network F-PointNet can correctly predict the pseudo point cloud position at all times.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The advantages of the invention are illustrated below by comparison of the results.
The main effects of the present invention are reflected in two aspects: enhancing the structural characteristics of the pseudo point cloud and improving the position prediction accuracy of the pseudo point cloud, thereby improving 3D target detection performance.
Referring to FIG. 8, after the normal vector constraint is added, the shape features of the point cloud of the nearby vehicle become more distinct and very close to those of the real point cloud.
Referring to FIG. 9, before the gradient constraint is added, the outline of the object in the depth map is unclear and the transitions between the object, the ground, and the background are blurred. After the gradient constraint is added, the object outline in the depth map is more prominent, the transitions to the ground and background are distinct, and the trailing point cloud at the object edges disappears in the BEV point cloud map.
Referring to FIG. 10, as the normal vector and gradient constraints are added in turn, the number of false detection frames in the 3D detection results gradually decreases.
The results in FIG. 7 show that the detection results of the present invention exceed several recent monocular detection methods under the moderate and hard settings of the 3D AP metric.
Referring to FIG. 11, after the two-stage training, the large deviation between the pseudo point cloud position and the actual position (GT frame) is markedly reduced, showing that the two-stage training optimizes the depth estimation model more effectively.
In summary, with the 3D target detection method and system based on plane constraints and position constraints, the prediction quality of the depth estimation model is markedly improved and contours become more prominent; the structural characteristics of the point cloud are thereby enhanced, the trailing phenomenon of the point cloud is reduced, and the prediction accuracy of the point cloud position is improved; finally, 3D target detection performance is improved.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or illustrated in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/terminal and method may be implemented in other manners. For example, the apparatus/terminal embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, or may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a U disk, a removable hard disk, a magnetic disk, an optical disk, a computer Memory, a Read-Only Memory (ROM), a random access Memory (RandomAccess Memory, RAM), an electrical carrier wave signal, a telecommunications signal, a software distribution medium, etc., it should be noted that the computer readable medium may contain content that is appropriately increased or decreased according to the requirements of jurisdictions and patent practices, such as in certain jurisdictions, according to the jurisdictions and patent practices, the computer readable medium does not contain electrical carrier wave signals and telecommunications signals.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited by this, and any modification made on the basis of the technical scheme according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A 3D target detection method based on plane constraints and position constraints, characterized by comprising the following steps:
S1, inputting an RGB image, and obtaining a depth map by training the depth estimation model ForeSeE;
S2, segmenting the depth map obtained in step S1 with an instance segmentation mask, and converting the resulting foreground part into a foreground point cloud;
S3, generating a pseudo point cloud frame label centered on the foreground point cloud obtained in step S2, the pseudo point cloud frame label having the same size as the GT detection frame;
S4, freezing the parameters of the depth estimation model and training the 3D detection network with the pseudo point cloud frame label obtained in step S3 as the training label, completing the first training stage; then freezing the parameters of the 3D detection network F-PointNet and training the depth estimation model, using the GT detection frame as the label of the 3D detector, completing the second training stage; the two stages are trained alternately so that the 3D detection network F-PointNet can correctly predict the pseudo point cloud position at all times.
2. The 3D target detection method based on plane constraints and position constraints according to claim 1, wherein in step S1, the depth estimation model ForeSeE is trained to predict depth with the loss function loss_wcel defined as follows:
loss_wcel = α * wcel_loss_fg + (1 - α) * wcel_loss_bg
where wcel_loss_fg and wcel_loss_bg are the pixel-level cross entropy loss functions of the foreground and background respectively, and α is the weight of the foreground loss.
3. The 3D target detection method based on plane constraints and position constraints according to claim 1, wherein in step S1, the normal vector constraint loss_normal is calculated as follows:
loss_normal = (1/N) * Σ_{i=1..N} ‖n_i^pred - n_i^gt‖
where N is the number of valid point groups within the 2D frame of the object, n_i^pred is the predicted normal vector, and n_i^gt is the true normal vector.
4. The 3D object detection method based on plane constraint and position constraint according to claim 1, wherein in step S1, the gradient constraint loss_gradient is calculated as follows:
loss_gradient = (1/N) Σ_{i=1}^{N} ( ‖g_{x,i}^pred − g_{x,i}^gt‖ + ‖g_{y,i}^pred − g_{y,i}^gt‖ )
wherein N is the number of effective point groups in the 2D frame of the object, g_{x,i}^pred is the predicted horizontal gradient difference, g_{x,i}^gt is the true horizontal gradient difference, g_{y,i}^pred is the predicted vertical gradient difference, and g_{y,i}^gt is the true vertical gradient difference.
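The filed formula for the gradient constraint is likewise an image; the sketch below is one plausible instantiation that compares horizontal and vertical depth differences of the predicted and true depth, averaged over foreground pixels rather than over explicit point groups, which is an assumption of this sketch.

```python
import torch

def gradient_constraint_loss(pred_depth, gt_depth, fg_mask):
    """Gradient constraint, claim 4 - illustrative reconstruction.

    pred_depth, gt_depth: (B, H, W) depth maps; fg_mask: (B, H, W) boolean foreground mask.
    Foreground regions are assumed non-empty. Point-group averaging is approximated
    by averaging over valid foreground pixel pairs.
    """
    dx_pred = pred_depth[:, :, 1:] - pred_depth[:, :, :-1]   # horizontal differences
    dx_gt = gt_depth[:, :, 1:] - gt_depth[:, :, :-1]
    dy_pred = pred_depth[:, 1:, :] - pred_depth[:, :-1, :]   # vertical differences
    dy_gt = gt_depth[:, 1:, :] - gt_depth[:, :-1, :]

    mask_x = fg_mask[:, :, 1:] & fg_mask[:, :, :-1]          # both pixels in the foreground
    mask_y = fg_mask[:, 1:, :] & fg_mask[:, :-1, :]

    loss_x = (dx_pred - dx_gt).abs()[mask_x].mean()
    loss_y = (dy_pred - dy_gt).abs()[mask_y].mean()
    return loss_x + loss_y
```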
5. The 3D object detection method based on plane constraint and position constraint according to claim 1, wherein in step S2, the foreground depth is converted into a foreground point cloud using the following conversion formulas:
x = (u − c_x) · z / f_x
y = (v − c_y) · z / f_y
wherein u, v are the pixel coordinates, f_x, f_y are the focal lengths of the camera, c_x, c_y are the pixel coordinates of the image center point, and z is the predicted depth.
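The back-projection in claim 5 is the standard pinhole model; a minimal NumPy sketch follows, with the function name and array shapes chosen for illustration.

```python
import numpy as np

def foreground_depth_to_points(depth, fg_mask, fx, fy, cx, cy):
    """Back-project foreground depth pixels to 3D camera coordinates (claim 5).

    depth:   (H, W) predicted depth map
    fg_mask: (H, W) boolean instance-segmentation foreground mask
    fx, fy:  camera focal lengths; cx, cy: principal point (image center pixel)
    Returns an (N, 3) foreground point cloud.
    """
    v, u = np.nonzero(fg_mask)      # pixel rows (v) and columns (u) of the foreground
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```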
6. The 3D object detection method based on plane constraint and position constraint according to claim 1, wherein in step S3, the pseudo point cloud frame is a detection frame of the same size as the GT detection frame centered on the pseudo point cloud, so that when the GT frame deviates, the pseudo point cloud frame still represents the position of the pseudo point cloud; when the ratio of the number of pseudo point cloud points inside the GT frame to the total number of pseudo point cloud points of the object is greater than 0.25, the center point of the GT frame is used as the center point of the pseudo point cloud frame; when the ratio is less than 0.25, the mean of all pseudo point cloud positions of the object is used as the center point of the pseudo point cloud frame.
7. The 3D object detection method based on plane constraint and position constraint according to claim 6, wherein the pseudo point cloud frame center position center_pseudo_label is calculated as follows:
center_pseudo_label = center_GT, if Num_GT / Num_all > thresh
center_pseudo_label = mean_value, otherwise
wherein Num_GT is the number of pseudo point cloud points of the object inside the GT frame, Num_all is the total number of pseudo point cloud points of the object, center_GT is the GT frame center position, mean_value is the foreground point cloud center position, and thresh is the threshold of the ratio.
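Claims 6 and 7 select the pseudo point cloud frame center from either the GT frame or the pseudo point cloud mean, depending on how many pseudo points fall inside the GT frame; a minimal sketch of that rule follows (function and argument names are illustrative).

```python
import numpy as np

def pseudo_box_center(fg_points, gt_box_center, points_in_gt_mask, thresh=0.25):
    """Select the pseudo point cloud frame center (claims 6-7, illustrative).

    fg_points:         (N, 3) pseudo point cloud of one object
    gt_box_center:     (3,) GT detection frame center
    points_in_gt_mask: (N,) boolean, True where a pseudo point falls inside the GT frame
    thresh:            ratio threshold (0.25 in claim 6)
    """
    ratio = points_in_gt_mask.sum() / max(len(fg_points), 1)   # Num_GT / Num_all
    if ratio > thresh:
        return np.asarray(gt_box_center)     # GT frame is trusted as the center
    return fg_points.mean(axis=0)            # fall back to the pseudo point cloud mean
```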
8. The method for 3D object detection based on plane constraint and position constraint according to claim 1, wherein in step S4, the loss function loss1_det of the first stage is as follows:
loss1_det = FPointNetLoss(Box_pred, Pseudo_label)
wherein FPointNetLoss is the original loss function of the detection network F-PointNet, Box_pred is the predicted 3D frame, and Pseudo_label is the pseudo point cloud label.
9. The method for 3D object detection based on plane constraint and position constraint according to claim 1, wherein in step S4, the loss function loss_all of the second stage is as follows:
loss_all = λ1 · loss_wcel + λ2 · loss_normal + λ3 · loss_gradient + λ4 · loss2_det
wherein λ1 = 6, λ2 = λ3 = 1, λ4 = 0.001; loss_wcel is the weighted cross-entropy depth loss, loss_normal is the normal vector constraint, loss_gradient is the gradient constraint, and loss2_det is the 3D detection loss.
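The second-stage loss of claim 9 is a plain weighted sum; a minimal sketch with the claimed default weights follows (the function name is illustrative and the individual loss terms are assumed to be computed elsewhere, e.g. by the sketches above).

```python
def combine_stage2_losses(loss_wcel, loss_normal, loss_gradient, loss2_det,
                          lambdas=(6.0, 1.0, 1.0, 0.001)):
    """Second-stage loss of claim 9: weighted sum of the depth (wcel), normal,
    gradient and 3D detection terms; defaults match the claimed weights."""
    l1, l2, l3, l4 = lambdas
    return l1 * loss_wcel + l2 * loss_normal + l3 * loss_gradient + l4 * loss2_det
```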
10. A 3D object detection system based on plane constraints and position constraints, comprising:
the estimation module inputs an RGB image and obtains a depth map by training the depth estimation model ForeSeE;
the conversion module segments the depth map obtained by the estimation module with an instance segmentation mask and converts the obtained foreground part into a foreground point cloud;
the tag module generates a pseudo point cloud frame label centered on the foreground point cloud obtained by the conversion module, the pseudo point cloud frame label having the same size as the GT detection frame;
the prediction module freezes the parameters of the depth estimation model and trains the 3D detection network, using the pseudo point cloud label obtained by the tag module as the training label, to complete the first-stage training; it then freezes the parameters of the 3D detection network F-PointNet and trains the depth estimation model, using the GT detection frame of the 3D detector as the label to train the depth estimation network, to complete the second-stage training; the first stage and the second stage are trained alternately, so that the 3D detection network F-PointNet can accurately predict the position of the pseudo point cloud.
CN202310028861.XA 2023-01-09 2023-01-09 3D target detection method and system based on plane constraint and position constraint Pending CN116091574A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310028861.XA CN116091574A (en) 2023-01-09 2023-01-09 3D target detection method and system based on plane constraint and position constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310028861.XA CN116091574A (en) 2023-01-09 2023-01-09 3D target detection method and system based on plane constraint and position constraint

Publications (1)

Publication Number Publication Date
CN116091574A true CN116091574A (en) 2023-05-09

Family

ID=86203993

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310028861.XA Pending CN116091574A (en) 2023-01-09 2023-01-09 3D target detection method and system based on plane constraint and position constraint

Country Status (1)

Country Link
CN (1) CN116091574A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740498A (en) * 2023-06-13 2023-09-12 北京百度网讯科技有限公司 Model pre-training method, model training method, object processing method and device
CN117456144A (en) * 2023-11-10 2024-01-26 中国人民解放军海军航空大学 Target building three-dimensional model optimization method based on visible light remote sensing image
CN117456144B (en) * 2023-11-10 2024-05-07 中国人民解放军海军航空大学 Target building three-dimensional model optimization method based on visible light remote sensing image
CN117689718A (en) * 2024-02-02 2024-03-12 北京友友天宇系统技术有限公司 Visual quick three-dimensional positioning method and device for target object

Similar Documents

Publication Publication Date Title
US11954813B2 (en) Three-dimensional scene constructing method, apparatus and system, and storage medium
CN116091574A (en) 3D target detection method and system based on plane constraint and position constraint
CN113159151B (en) Multi-sensor depth fusion 3D target detection method for automatic driving
CN113468967B (en) Attention mechanism-based lane line detection method, attention mechanism-based lane line detection device, attention mechanism-based lane line detection equipment and attention mechanism-based lane line detection medium
CN113052109A (en) 3D target detection system and 3D target detection method thereof
CN113076871A (en) Fish shoal automatic detection method based on target shielding compensation
EP3293700B1 (en) 3d reconstruction for vehicle
CN106952304B (en) A kind of depth image calculation method using video sequence interframe correlation
CN112734931B (en) Method and system for assisting point cloud target detection
CN112907573B (en) Depth completion method based on 3D convolution
US11657485B2 (en) Method for expanding image depth and electronic device
CN111915657A (en) Point cloud registration method and device, electronic equipment and storage medium
CN111105452A (en) High-low resolution fusion stereo matching method based on binocular vision
CN116486288A (en) Aerial target counting and detecting method based on lightweight density estimation network
CN116012817A (en) Real-time panoramic parking space detection method and device based on double-network deep learning
CN115511759A (en) Point cloud image depth completion method based on cascade feature interaction
CN114119749A (en) Monocular 3D vehicle detection method based on dense association
CN117726747A (en) Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene
CN116977671A (en) Target tracking method, device, equipment and storage medium based on image space positioning
CN111862208A (en) Vehicle positioning method and device based on screen optical communication and server
CN115239559A (en) Depth map super-resolution method and system for fusion view synthesis
CN113066165B (en) Three-dimensional reconstruction method and device for multi-stage unsupervised learning and electronic equipment
CN115049976A (en) Method, system, equipment and medium for predicting wind direction and wind speed of power transmission line
CN114266900B (en) Monocular 3D target detection method based on dynamic convolution
CN118097475B (en) Low-speed small target detection method, electronic equipment and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination