CN111695403A - 2D and 3D image synchronous detection method based on depth perception convolutional neural network - Google Patents

2D and 3D image synchronous detection method based on depth perception convolutional neural network

Info

Publication number
CN111695403A
Authority
CN
China
Prior art keywords
frame
neural network
convolutional neural
anchor
anchor point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010308948.9A
Other languages
Chinese (zh)
Other versions
CN111695403B (en)
Inventor
吴明瞭
付智俊
郭启翔
尹思维
谢斌
何薇
焦红波
王晨阳
白世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongfeng Automobile Co Ltd
Original Assignee
Dongfeng Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongfeng Automobile Co Ltd filed Critical Dongfeng Automobile Co Ltd
Priority to CN202010308948.9A priority Critical patent/CN111695403B/en
Publication of CN111695403A publication Critical patent/CN111695403A/en
Application granted granted Critical
Publication of CN111695403B publication Critical patent/CN111695403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/08 Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G06T7/90 Determination of colour characteristics
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining a target anchor point formula, introducing a preset depth information parameter, and specifying a shared center pixel position; step 2, generating preset anchor frames according to the anchor point template of the target object, the visual anchor point generation formula and the 3D prior anchor points; step 3, checking the intersection-over-union of the anchor frames with the ground truth; step 4, analyzing the network loss function of the target object; step 5, establishing a depth perception convolution region proposal network: introducing a DenseNet convolutional neural network to obtain a feature map, feeding the feature map into global feature extraction and local feature extraction branches, and finally combining the two according to learned weights; step 6, forward optimization: introducing a parameter step size σ, setting a loop termination parameter β, and optimizing the parameters; and step 7, outputting the 3D parameters. The invention achieves higher safety for automatic driving and can be widely applied in the field of computer vision.

Description

2D and 3D image synchronous detection method based on depth perception convolutional neural network
Technical Field
The invention relates to methods for detecting targets of interest in computer-vision applications such as unmanned driving and assisted driving, and in particular to a method for synchronously detecting 2D and 3D images based on a depth perception convolutional neural network.
Background
Object detection refers to detecting and identifying the category and position of targets of interest (such as vehicles, pedestrians and obstacles) in an image or video by computer technology, and is one of the important research directions in computer vision. With the continuous improvement and development of deep learning technology, object detection based on deep learning has wide application in many practical fields, for example unmanned driving, assisted driving, face recognition, unmanned security, human-machine interaction and behavior recognition.
As one of the important research directions in deep learning, the deep convolutional neural network has achieved significant results on object detection tasks and can detect and identify targets of interest in 2D image data in real time. In unmanned-driving research, however, the system must obtain the position of the target of interest in 3D space in order to realize the corresponding functions and to improve system stability and safety.
The hardware currently used for 3D image recognition relies on cameras, which can be divided into monocular cameras and multi-view cameras according to their function. A monocular camera has a fixed focal length and is mostly used for road-condition judgment in automatic driving, but it suffers from an irreconcilable trade-off between field of view and ranging distance: the wider the camera's viewing angle, the shorter the distance that can be measured accurately, while the narrower the viewing angle, the longer the measurable distance. This is similar to how human eyes see the world: the farther one looks, the narrower the covered range, and the closer one looks, the wider the range. A binocular or multi-view camera combines cameras with different focal lengths, and the focal length is related to imaging sharpness, but a conventional vehicle-mounted camera cannot easily zoom frequently, the cost of a multi-view camera is high, and its algorithmic complexity is greater than that of a monocular camera, so it is not yet suitable for unmanned-driving systems.
To improve the accuracy of 3D image detection, existing 3D detection methods also rely on expensive lidar sensors, which provide sparse depth data as input. However, when such a lidar-based method is combined with a monocular camera, the sparse depth data still lacks dense depth information, which makes 3D image detection difficult to realize.
Taking an automatic driving system as an example, the conventional 2D target detection approach for this scenario is to acquire the real-time road scene during driving with a vehicle-mounted camera, feed it into an existing algorithm, detect the targets of interest in the image with a trained detection model, output their position and category information to the decision layer of the control end, and plan how the vehicle should drive. The problem is that the 3D spatial position information of the detected target acquired by a monocular camera is unstable, and many influencing factors reduce the accuracy of this method.
Disclosure of Invention
The invention aims to overcome the defects of the background art and to provide a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, so that the camera retains more detailed semantic information while keeping depth information as accurate as that of a laser scanner, enabling higher driving performance and safety during automatic driving.
The invention provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining an anchor point template of the target object: respectively defining the specific formulas of the 2D target anchor point and the 3D target anchor point, introducing a preset depth information parameter, and specifying a shared central pixel position; step 2, generating the anchor frames of the model prediction feature map: generating preset anchor frames according to the anchor point template of the target object, the visual anchor point generation formula and the pre-calculated 3D prior anchor points; step 3, checking the intersection-over-union (IoU) with the ground truth (GT) of the anchor frame: checking, for each generated anchor frame, whether its IoU with the GT is greater than or equal to 0.5; step 4, analyzing the network loss function of the target object: comprising analysis of the classification loss function LC, the 2D frame regression loss function and the 3D frame regression loss function; step 5, establishing a depth perception convolution region proposal network: introducing a DenseNet convolutional neural network to obtain h × w feature maps, feeding the feature maps into two branches, one for global feature extraction and the other for local feature extraction, and finally combining the features of the two branches with a certain weight; step 6, forward optimization: projecting the 3D information onto the 2D information and performing forward optimization, introducing a parameter step size σ for updating θ, setting a loop termination parameter β, and optimizing the parameters while σ is larger than β; and step 7, performing 3D target detection according to the 3D output parameters.
In the above technical solution, in step 1, the specific formula of the 2D target anchor point is [w, h]2D and the specific formula of the 3D target anchor point is [w, h, l, θ]3D, where w, h and l respectively represent given values of the width, height and length of the target detection object, and θ represents the observation angle of the camera towards the target detection object; the introduced preset depth information parameter is zP, and the shared central pixel position is designated [x, y]P, where the 2D parameters are expressed in pixel coordinates as [x, y]2D = P · [w, h]2D, and P represents the known projection matrix used to project the target object; the 3D central position [x, y, z]3D in the camera coordinate system is projected into the image with the given known projection matrix P, and the depth information parameter zP is encoded, according to the following formula:
zP · [x, y, 1]P = P · [x, y, z, 1]3D    (1)
in the above technical solution, in step 2, each anchor point in the model prediction output feature map is defined as C, and each anchor point corresponds to [tx, ty, tw, th]2D, [tx, ty, tz]P and [tw, th, tl, tθ]3D; the total number of anchor points for a single pixel on the feature map of each target detection object is na, the preset number of training model classes is nc, h × w is the resolution of the feature map, and the total number of output frames is nb = w × h × na; each anchor point is distributed at each pixel position [x, y]P ∈ R^(w×h). The first output anchor C represents a shared class prediction with dimension na × nc × h × w, where the output dimension of each class is na × h × w.
In the above technical solution, in step 2, the [ tx, ty, tw, th ]2D representing the 2D bounding box transformation is collectively referred to as b2D, where the bounding box transformation formula is as follows:
x'2D = xP + tx2D · w2D ,    y'2D = yP + ty2D · h2D
w'2D = w2D · exp(tw2D) ,    h'2D = h2D · exp(th2D)
where xP and yP denote the spatial center position of each frame, and the transformed frame b'2D is defined as [x, y, w, h]'2D. The remaining 7 output variables, namely the projection center transformation [tx, ty, tz]P, the scale transformation [tw, th, tl]3D and the direction change tθ3D, are collectively referred to as b3D. The transformation b3D is applied to an anchor with parameters [w, h]2D, zP, [w, h, l, θ]3D:
x'P = xP + txP · w2D ,    y'P = yP + tyP · h2D
z'P = zP + tzP
w'3D = w3D · exp(tw3D) ,    h'3D = h3D · exp(th3D) ,    l'3D = l3D · exp(tl3D)
θ'3D = θ3D + tθ3D
similarly, the inverse of equation (1) is used to transform the projected 3D center position [x, y, z]'P in image space back to its camera coordinates [x, y, z]'3D; b'3D denotes [x, y, z]'P together with [w, h, l, θ]'3D.
In the above technical solution, in the step 3, if the intersection ratio of GT of the anchor frame is less than 0.5, the category of the target object is set as the background category, and the boundary anchor frame is ignored or deleted; if the intersection ratio IOU of the GT of the anchor frame is more than or equal to 0.5, generating the category index tau of the target object and the 2D frame according to the generated GT of the anchor frame
b̂2D and the 3D frame b̂3D.
In the above technical solution, in the step 4, the classification loss function LC adopts a softmax-based polynomial logistic loss function, and its formula is:
LC = -log( exp(Cτ) / Σi exp(Ci) )
The 2D frame regression loss function Lb2D is analyzed over the intersection-over-union (IoU) between the matched GT frame b̂2D before transformation and the transformed frame b'2D:
Lb2D = -log( IoU(b'2D, b̂2D) )
The 3D frame regression loss function Lb3D is analyzed for optimizing each of the remaining 3D bounding box parameters with a smooth L1 regression loss function, whose formula is:
Lb3D = Σ SmoothL1( b'3D , b̂3D ), where the sum runs over the 3D bounding box parameters.
In the above technical solution, in step 4, an overall multitask network loss function L is further introduced, which includes regularization weights λ1 and λ2 and is defined as follows:
L = LC + λ1 · Lb2D + λ2 · Lb3D
in the above technical solution, in the step 5, the specific process is as follows:
step 5-1, obtain h × w feature maps with the convolutional neural network DenseNet: introduce a hyperparameter b, where b represents the number of row-level bins, indicating that the feature map is divided into b bins along the transverse direction, each bin corresponding to a specific convolution kernel k; step 5-2, perform global/local feature extraction, where step 5-2 is divided into two branches as follows: step 5-2-1, global feature extraction: the global feature extraction adopts conventional convolution, which introduces the global feature Fglobal in the convolution process; for Fglobal, 3x3 convolution kernels with padding of 1 are introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, and conventional 3x3 and 1x1 convolutions act on the whole feature map; then C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1x1 convolution kernel Oglobal; step 5-2-2, local feature extraction: local feature extraction adopts depth perception convolution, which introduces the local feature Flocal in the convolution process; for Flocal, 3x3 convolution kernels with padding of 1 are introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, with different 3x3 kernels acting on different bins, the feature map being divided into b bins along the longitudinal direction; then C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1x1 convolution kernel Olocal; step 5-3, weight the outputs of the global feature extraction and the local feature extraction, that is, introduce a learned weight α, which uses the spatial invariance of the convolutional neural network as an index over the 1st to 13th outputs; the specific output function is as follows:
Oi=Oglobal i·αi+Olocal i·(1-αi) (8).
in the above technical solution, step 5 further includes step 5-4: the backbone network of this 3D target detection method is built on DenseNet-121, which proposes a dense connection mechanism connecting all layers: each layer accepts all preceding layers as its additional input. ResNet connects each layer with the 2-3 layers before it by element-level addition, whereas in DenseNet each layer is concatenated (concat) with all preceding layers in the channel dimension and serves as the input of the next layer; for a network of L layers, DenseNet contains L(L+1)/2 connections in total, and DenseNet directly concatenates the feature maps from different layers.
In the above technical solution, in step 6, the iteration step of the algorithm is as follows: the L1 loss between the projection of the 3D frame and the estimated 2D frame b'2D is used while θ is continuously adjusted, and the formula of the step of projecting the 3D frame onto the 2D frame is as follows:
γ3D denotes the corner coordinates of the 3D frame constructed from [x, y, z]'3D and [w, h, l, θ]'3D in camera coordinates, and
γP = P · γ3D ,    γ2D = γP / γP[z],
xmin = min(γ2D[x]),    ymin = min(γ2D[y]),
xmax = max(γ2D[x]),    ymax = max(γ2D[y])
(9),
where φ denotes the index of the axes [x, y, z]. The L1 loss is calculated between the 2D frame parameters [xmin, ymin, xmax, ymax] obtained by projecting the 3D frame and the originally estimated 2D frame b'2D; when the loss is not improved within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the operation is repeated while σ > β. In step 7, 13 parameters in total are output for the 3D result, namely C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D.
The 2D and 3D image synchronous detection method based on the depth perception convolutional neural network has the following beneficial effects: the scheme of the invention provides an algorithm for fusing laser radar point clouds and RGB (red (R), green (G), blue (B) three-channel color) images. 3D visual analysis of targets plays an important role in the visual perception system of an autonomous driving automobile, and modern autonomous vehicles are often equipped with sensors such as lidar and cameras. With respect to the application characteristics of the two sensors, both the camera and the lidar can be used for target detection: the laser scanner has the advantage of accurate depth information, while the camera preserves more detailed semantic information, so that fusing the lidar point cloud with the RGB image can realize an autonomous driving automobile with higher performance and safety. Highly accurate localization and identification of objects in a road scene is achieved by detecting objects in three-dimensional space using lidar and image data.
Drawings
FIG. 1 is a flow chart of a basic idea of a method for synchronously detecting 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 2 is a detailed flowchart of the method for synchronously detecting 2D and 3D images based on the depth-sensing convolutional neural network according to the present invention;
FIG. 3 is a schematic diagram illustrating the parameter definition of an anchor point template in the method for synchronously detecting 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 4 is a block diagram of a three-dimensional anchor of a 3D target object in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 5 is a bird's eye view of a three-dimensional anchor frame of a 3D target object in the depth-aware convolutional neural network-based 2D and 3D image synchronous detection method of the present invention;
FIG. 6 is a diagram of RPN network architecture in the method for synchronous detection of 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 7 is a schematic diagram of extracting local features of horizontal segmentation in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 8 is a schematic diagram of extracting longitudinal segmentation local features in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 9 is a network architecture diagram of DenseNet in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and examples, which should not be construed as limiting the invention.
Referring to fig. 1, the basic idea of the method for synchronously detecting 2D and 3D images based on the depth-aware convolutional neural network of the present invention is as follows: input image → simultaneous detection processing of 2D and 3D images → projection of 3D information to 2D information and forward optimization processing → detection of 3D objects according to 3D output parameters.
Referring to fig. 2, the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network of the present invention specifically comprises the following steps:
step 1: an anchor template for the target object is defined. In order to predict the 2D frame and the 3D frame simultaneously, anchor templates need to be defined in respective dimensional spaces, and it should be noted that the 2D frame herein is the maximum length and width observed by the 3D target object. Specifically, taking an automobile as an example, referring to fig. 3, specific formulas of a 2D target anchor point and an anchor point template of a 3D target are [ w, h ]2D and [ w, h, l, θ ]3D, respectively, where w, h, and l represent the width, height, and length of a target detection object, respectively, and w, h, and l are given values in a detection camera coordinate system; in addition, since the 3D object is different from the 2D object and has rotation, its θ represents the viewing angle of the camera to the object to be detected, which is equivalent to the camera rotating around the Y axis of its camera coordinate system, and the viewing angle takes into account the relative orientation of the object with respect to the viewing angle of the camera, rather than the ground's Bird's Eye View (BEV), where introducing θ makes it more meaningful to intuitively estimate the viewing angle when processing 3D image features.
In order to define the position of the complete 2D/3D frame of the target object, a preset depth information parameter zP is introduced and a shared central pixel position [x, y]P is specified, where the 2D parameters are expressed in pixel coordinates, namely [x, y]2D = P · [w, h]2D, and P represents the known projection matrix used to project the target object; in 3D object detection, the 3D center position [x, y, z]3D in the camera coordinate system is projected into the image with the given known projection matrix P, and the depth information parameter zP is encoded according to the following formula:
zP · [x, y, 1]P = P · [x, y, z, 1]3D    (1)
Here, the mean statistics of the preset depth information parameter zP and of [w, h, l, θ]3D of the 3D target object are calculated separately for each anchor point in advance, and parameters of this kind serve as strong prior information to ease the difficulty of 3D parameter estimation. Specifically, for each anchor point, the statistics of zP and [w, h, l, θ]3D are computed over the samples whose IoU (intersection-over-union) with the anchor exceeds 0.5; the anchor thus represents a discrete template in which the 3D prior can be used as a strong initial guess under the assumption of reasonably consistent scene geometry.
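By way of illustration only, the following Python (NumPy) sketch shows how an anchor template of step 1 and the depth encoding of equation (1) could be represented; the projection matrix values, the anchor sizes and the helper names (project_center and the anchor dictionary keys) are assumptions made for this example and do not reproduce the claimed implementation.

    import numpy as np

    # Illustrative 3x4 camera projection matrix P (KITTI-like values, assumed for the example).
    P = np.array([[721.5,   0.0, 609.6,  44.9],
                  [  0.0, 721.5, 172.9,   0.2],
                  [  0.0,   0.0,   1.0,   0.003]])

    def project_center(center_3d, P):
        """Equation (1): zP * [x, y, 1]P = P * [x, y, z, 1]3D.
        Returns the shared pixel-space center [x, y]P and the encoded depth zP."""
        xyz1 = np.append(center_3d, 1.0)   # homogeneous 3D center [x, y, z, 1]
        proj = P @ xyz1                    # [x * zP, y * zP, zP]
        z_p = proj[2]
        return proj[:2] / z_p, z_p

    # One anchor template: 2D size [w, h]2D plus the 3D prior [w, h, l, theta]3D and depth prior zP.
    anchor = {
        "wh_2d":    np.array([80.0, 60.0]),     # assumed pixel width/height
        "whl_3d":   np.array([1.6, 1.5, 3.9]),  # assumed car width/height/length in metres
        "theta_3d": 0.0,                        # viewing-angle prior
        "z_p":      35.0,                       # assumed mean depth statistic for this anchor
    }

    center_2d, z_p = project_center(np.array([2.0, 1.5, 35.0]), P)
    print("shared center [x, y]P:", center_2d, "encoded depth zP:", z_p)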
Step 2: generate the anchor frames of the model prediction feature map according to the anchor point template defining the target object. Specifically, the preset anchor frames are generated from the anchor point template of the target object, expressed by the visual anchor point generation formula, and from the pre-calculated 3D prior anchor points; the generated three-dimensional anchor frames are shown in fig. 4 and the bird's-eye view in fig. 5.
Furthermore, each anchor point in the model prediction output feature map is defined as C, each anchor point corresponding to [tx, ty, tw, th]2D, [tx, ty, tz]P and [tw, th, tl, tθ]3D; the total number of anchor points is na (the number of anchor points of a single pixel on the feature map of each target detection object), the number of (preset training model) categories is nc, and h × w is the resolution of the feature map.
Therefore, the total number of output frames is nb = w × h × na;
each anchor point is distributed at each pixel position [x, y]P ∈ R^(w×h);
the first output anchor point C represents a shared class prediction with dimension na × nc × h × w, where the output dimension of each class is na × h × w.
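For concreteness, a small Python sketch of the output layout implied above (na anchors per pixel, nc classes, an h × w feature map, nb = w × h × na output frames); the numeric values of na, nc, h and w are purely illustrative assumptions.

    import numpy as np

    na, nc, h, w = 36, 4, 32, 106            # anchors per pixel, classes, feature-map resolution (assumed)

    cls_output = np.zeros((na * nc, h, w))   # shared class prediction C: na x nc x h x w
    t2d_output = np.zeros((na * 4, h, w))    # [tx, ty, tw, th]2D per anchor
    tP_output  = np.zeros((na * 3, h, w))    # [tx, ty, tz]P per anchor
    t3d_output = np.zeros((na * 4, h, w))    # [tw, th, tl, ttheta]3D per anchor

    nb = w * h * na                          # total number of output frames
    print("total output frames nb =", nb)    # every pixel position [x, y]P carries na anchors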
Further, [tx, ty, tw, th]2D represents the 2D bounding box transform, which we collectively refer to as b2D; the bounding box transform formula is as follows:
x'2D = xP + tx2D · w2D ,    y'2D = yP + ty2D · h2D
w'2D = w2D · exp(tw2D) ,    h'2D = h2D · exp(th2D)
where xP and yP represent the spatial center position of each box. The transformed frame b'2D is defined as [x, y, w, h]'2D. The following 7 outputs, namely the projection center transformation [tx, ty, tz]P, the scale transformation [tw, th, tl]3D and the direction change tθ3D, are collectively referred to as b3D. Similar to 2D, the transformation is applied to an anchor with parameters [w, h]2D, zP, [w, h, l, θ]3D:
x'P = xP + txP · w2D ,    y'P = yP + tyP · h2D
z'P = zP + tzP
w'3D = w3D · exp(tw3D) ,    h'3D = h3D · exp(th3D) ,    l'3D = l3D · exp(tl3D)
θ'3D = θ3D + tθ3D
similarly, b'3DRepresents [ x, y, z ]]′PAnd [ w, h, l, θ ]]′3D. As previously described, the authors estimate the 3D center of the projection rather than the camera coordinates to better handle the image space based convolution features. In the inference process, the 3D center position [ x, y, z ] obtained after projection in image space is used by the inverse transform of equation (1)]′PTo calculate its camera coordinates x, y, z]′3D
Step 3: according to the generated anchor frames, check whether the intersection-over-union (IoU) with the ground truth (GT) of each anchor frame is greater than or equal to 0.5.
If the intersection ratio of GT of the anchor frame is less than 0.5, setting the type of the target object as a background type, and neglecting or deleting the boundary anchor frame;
if the intersection ratio IOU of the GT of the anchor frame is more than or equal to 0.5, generating a category index tau of the target object and a 2D frame according to the generated GT (ground truth) of the anchor frame
Figure BDA0002456897710000115
And 3D frame
Figure BDA0002456897710000116
And the following step 4 is performed.
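The IoU check of step 3 can be sketched as follows in Python, assuming axis-aligned boxes in [x1, y1, x2, y2] form; the 0.5 threshold follows the text, while the helper names iou and assign_anchor are assumptions for the example.

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def assign_anchor(anchor_box, gt_box, gt_class):
        """Step 3: an anchor with IoU >= 0.5 keeps the GT class index tau and the GT 2D/3D targets;
        otherwise it is treated as background and ignored for box regression."""
        if iou(anchor_box, gt_box) >= 0.5:
            return {"class": gt_class, "regress": True}
        return {"class": "background", "regress": False}

    print(assign_anchor([100, 100, 200, 180], [110, 105, 210, 190], gt_class="car"))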
Step 4: analyze the network loss function of the target object. This step includes classification loss function LC analysis, 2D frame regression loss function analysis and 3D frame regression loss function analysis.
The classification loss function LC adopts a softmax-based multinomial logistic loss function with the formula:
LC = -log( exp(Cτ) / Σi exp(Ci) )
A 2D frame regression loss Lb2D is also introduced, which matches the GT frame b̂2D before transformation against the transformed frame b'2D through their intersection-over-union (IoU):
Lb2D = -log( IoU(b'2D, b̂2D) )
The 3D frame regression loss function Lb3D is analyzed for optimizing each of the remaining 3D bounding box parameters with a smooth L1 regression loss function, whose formula is:
Lb3D = Σ SmoothL1( b'3D , b̂3D )
Further, for the whole network framework, an overall multitask network loss function L is also introduced, which includes regularization weights λ1 and λ2 and is defined as follows:
L = LC + λ1 · Lb2D + λ2 · Lb3D
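A compact Python sketch of how the multi-task loss of step 4 could be assembled from the three terms above (softmax classification, negative-log-IoU for the 2D frame, smooth L1 for the 3D parameters, combined with the regularization weights λ1 and λ2); the helper names and the example values are assumptions for illustration only.

    import numpy as np

    def softmax_loss(logits, tau):
        """Classification loss LC: multinomial logistic loss over the class scores C."""
        e = np.exp(logits - logits.max())
        return -np.log(e[tau] / e.sum())

    def smooth_l1(pred, target, beta=1.0):
        d = np.abs(pred - target)
        return np.where(d < beta, 0.5 * d * d / beta, d - 0.5 * beta).sum()

    def multitask_loss(cls_logits, tau, iou_2d, b3d_pred, b3d_gt, lam1=1.0, lam2=1.0):
        """L = LC + lam1 * Lb2D + lam2 * Lb3D, with Lb2D = -log(IoU(b'2D, GT))."""
        l_c   = softmax_loss(cls_logits, tau)
        l_b2d = -np.log(np.clip(iou_2d, 1e-6, 1.0))
        l_b3d = smooth_l1(b3d_pred, b3d_gt)
        return l_c + lam1 * l_b2d + lam2 * l_b3d

    loss = multitask_loss(np.array([1.2, 0.3, -0.5]), tau=0, iou_2d=0.72,
                          b3d_pred=np.array([0.1, -0.2, 0.05, 0.3]),
                          b3d_gt=np.array([0.0, -0.1, 0.0, 0.25]))
    print("multitask loss:", loss)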
and 5: and establishing a deep perception convolution area proposal network to improve the ability of high-order feature space perception in the area proposal network.
A hyperparameter b is introduced, where b represents the number of row-level bins, i.e. the feature map is divided laterally into b bins, each bin representing a particular convolution kernel k.
Step 5-1: introduce the DenseNet convolutional neural network. DenseNet (a deeper convolutional neural network) is used as the basic feature extractor to obtain h × w feature maps, which are then sent to two branches, one for global feature extraction and the other for local feature extraction; finally the features of the two branches are combined with a certain weight. The global block uses conventional 3x3 and 1x1 convolutions acting on the whole feature map, while the local block uses different 3x3 kernels acting on different bins, the feature map being divided into b bins along the longitudinal direction, as shown in fig. 6.
It should be noted that, for the local feature extraction, the present technology also adopts two feature extraction methods, as shown in fig. 7.
The b horizontal bars generated by dividing the feature map into b bins along the longitudinal direction act as a random function when local feature 1 is extracted, which increases the randomness of image extraction during convolution and therefore improves the recognition rate.
Further, in order to more accurately identify the 3D target image, the present technology further provides a longitudinal segmentation method, and a specific segmentation method thereof is shown in fig. 8.
Because the longitudinal segmentation method yields more local features during feature extraction, the recognition rate is further improved.
In addition, the backbone network of this 3D target detection method is built on DenseNet-121; the network architecture of DenseNet is shown in fig. 9. DenseNet proposes a more aggressive dense connection mechanism: all layers are interconnected, and each layer accepts all preceding layers as its additional input. By comparison, ResNet short-circuits each layer with some previous layer (typically 2-3 layers back) and the connection is made by element-level addition, whereas in DenseNet each layer is concatenated (concat) with all previous layers in the channel dimension (the feature maps of the individual layers having the same size) and used as the input of the next layer. For an L-layer network, DenseNet contains L(L+1)/2 connections in total, which is dense compared with ResNet. Moreover, DenseNet directly concatenates feature maps from different layers, which enables feature reuse and improves efficiency.
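The dense connection mechanism described above can be sketched with the following simplified PyTorch module, in which each layer concatenates all preceding feature maps along the channel dimension; the layer widths and growth rate are arbitrary assumptions, and this is not the full DenseNet-121 backbone.

    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        """Each layer receives the channel-wise concatenation of all previous feature maps."""
        def __init__(self, in_channels, growth_rate, num_layers):
            super().__init__()
            self.layers = nn.ModuleList()
            for i in range(num_layers):
                self.layers.append(nn.Sequential(
                    nn.BatchNorm2d(in_channels + i * growth_rate),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1)))

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                out = layer(torch.cat(features, dim=1))   # concat with all preceding layers
                features.append(out)
            return torch.cat(features, dim=1)

    block = DenseBlock(in_channels=64, growth_rate=32, num_layers=4)
    y = block(torch.randn(1, 64, 32, 106))   # for L = 4 layers there are L(L+1)/2 = 10 connections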
Step 5-2: carry out global/local feature extraction. Step 5-2 is divided into two branches, step 5-2-1 and step 5-2-2.
Step 5-2-1: extract global features. Global feature extraction adopts conventional convolution, whose kernels act as a global convolution over the whole space; it introduces the global feature Fglobal in the convolution process. For Fglobal, 3x3 convolution kernels with padding (gap filling) of 1 are introduced, followed by nonlinear activation with the ReLU function (Rectified Linear Unit) to generate 512 feature maps.
Then 13 outputs are produced on each feature map F (as stated above, the 13 outputs are C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D), and each output is connected to a 1x1 convolution kernel Oglobal.
Step 5-2-2: extract local features. Local feature extraction adopts depth-aware convolution, i.e. local convolution, which introduces the local feature Flocal in the convolution process. For Flocal, 3x3 convolution kernels with padding (gap filling) of 1 are introduced, followed by nonlinear activation with the ReLU function to generate 512 feature maps.
Then 13 outputs are produced on each feature map F (as stated above, the 13 outputs are C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D), and each output is connected to a 1x1 convolution kernel Olocal.
Step 5-3: weight the outputs of the global and local feature extraction. A learned weight α is introduced, which uses the spatial invariance of the convolutional neural network as an index over the 1st to 13th outputs; the specific output function is as follows:
Oi=Oglobal i·αi+Olocal i·(1-αi) (8)
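A simplified PyTorch sketch of the global/local combination of equation (8): a regular 1x1 convolution produces Oglobal, b bin-specific 3x3 convolutions produce Olocal, and a learned weight α mixes the two; the channel count, the number of bins and the reduction of the 13 outputs to a single tensor are simplifying assumptions made for this example.

    import torch
    import torch.nn as nn

    class DepthAwareHead(nn.Module):
        """O_i = Oglobal_i * alpha_i + Olocal_i * (1 - alpha_i), equation (8)."""
        def __init__(self, channels=512, outputs=13, bins=4):
            super().__init__()
            self.bins = bins
            self.global_conv = nn.Conv2d(channels, outputs, kernel_size=1)
            # One 3x3 kernel per bin for the depth-aware (local) branch.
            self.local_convs = nn.ModuleList(
                [nn.Conv2d(channels, outputs, kernel_size=3, padding=1) for _ in range(bins)])
            self.alpha = nn.Parameter(torch.full((outputs,), 0.5))   # learned mixing weight per output

        def forward(self, feat):
            o_global = self.global_conv(feat)
            # Split the feature map into b row-level bins and apply a bin-specific convolution to each.
            chunks = torch.chunk(feat, self.bins, dim=2)
            o_local = torch.cat([conv(c) for conv, c in zip(self.local_convs, chunks)], dim=2)
            a = self.alpha.view(1, -1, 1, 1)
            return o_global * a + o_local * (1.0 - a)

    head = DepthAwareHead()
    out = head(torch.randn(1, 512, 32, 106))   # 13 output maps: C, theta and the box transforms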
and 6, projecting the 3D information to the 2D information and carrying out forward optimization processing. Here, a parameter step size σ is introduced (for updating θ), and a loop termination parameter β is set, and when α is larger than β, the input of the optimization parameter is performed.
In each iteration of the algorithm, the L1 loss between the projection of the 3D frame and the estimated 2D frame b'2D is computed while θ is continuously adjusted. The formula of the step of projecting the 3D frame onto the 2D frame is as follows:
γ3D denotes the corner coordinates of the 3D frame constructed from [x, y, z]'3D and [w, h, l, θ]'3D in camera coordinates, and
γP = P · γ3D ,    γ2D = γP / γP[z],
xmin = min(γ2D[x]),    ymin = min(γ2D[y]),
xmax = max(γ2D[x]),    ymax = max(γ2D[y])
(9)
where φ denotes the index of the axes [x, y, z].
The L1 loss is calculated between the 2D frame parameters [xmin, ymin, xmax, ymax] obtained by projecting the 3D frame and the originally estimated 2D frame b'2D; when the loss is not improved within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the above operation is repeated while σ > β.
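An illustrative Python sketch of the θ post-optimization loop of step 6: the corners of the 3D frame are projected with P, the resulting [xmin, ymin, xmax, ymax] is compared with the estimated 2D frame b'2D under an L1 loss, and the step size σ decays by the factor γ until it falls below β; the corner ordering, the numeric defaults and the max_steps guard are assumptions made for this example.

    import numpy as np

    def corners_3d(center, dims, theta):
        """Eight corners of a 3D frame with size (w, h, l) and yaw theta, in camera coordinates."""
        w, h, l = dims
        x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
        y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
        z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
        rot = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                        [ 0.0,           1.0, 0.0          ],
                        [-np.sin(theta), 0.0, np.cos(theta)]])
        return rot @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)

    def projected_2d_box(center, dims, theta, P):
        """gammaP = P * gamma3D, gamma2D = gammaP / gammaP[z]; box = [xmin, ymin, xmax, ymax]."""
        g3d = np.vstack([corners_3d(center, dims, theta), np.ones(8)])
        gP = P @ g3d
        g2d = gP[:2] / gP[2]
        return np.array([g2d[0].min(), g2d[1].min(), g2d[0].max(), g2d[1].max()])

    def optimize_theta(center, dims, theta, b2d_est, P,
                       sigma=0.3, gamma=0.5, beta=1e-3, max_steps=1000):
        """Adjust theta while sigma > beta; decay sigma by gamma when theta +/- sigma brings no improvement."""
        for _ in range(max_steps):
            if sigma <= beta:
                break
            candidates = [theta - sigma, theta, theta + sigma]
            losses = [np.abs(projected_2d_box(center, dims, t, P) - b2d_est).sum() for t in candidates]
            best = int(np.argmin(losses))
            if losses[best] < losses[1]:   # strict improvement within theta +/- sigma
                theta = candidates[best]
            else:                          # no improvement: decay the step size
                sigma *= gamma
        return theta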
Step 7: output the 13 parameters, namely C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, and finally perform 3D target detection.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Those not described in detail in this specification are within the skill of the art.

Claims (10)

1. A 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, characterized by comprising the following steps:
step 1, defining an anchor point template of a target object: respectively defining specific formulas of a 2D target anchor point and a 3D target anchor point, introducing a preset depth information parameter, and specifying a shared central pixel position;
step 2, generating an anchor frame of the model prediction characteristic diagram: generating a preset anchor frame according to an anchor point template defining a target object and a visual anchor point generation formula and a pre-calculated 3D prior anchor point;
step 3, checking the intersection ratio of GT of the anchor frame: checking whether the intersection ratio of GT of the anchor frame is more than or equal to 0.5 or not according to the generated anchor frame;
step 4, analyzing a network loss function of the target object: the method comprises the steps of classification loss function LC analysis, 2D frame regression loss function analysis and 3D frame regression loss function analysis;
step 5, establishing a depth perception convolution area suggestion network: introducing a Densenet convolutional neural network to obtain h x w dimension feature maps, then respectively sending the feature maps into two branches, wherein one branch is global feature extraction, the other branch is local feature extraction, and finally combining the features of the two branches according to a certain weight;
step 6, forward optimization: projecting the 3D information onto the 2D information and performing forward optimization processing, introducing a parameter step size σ for updating θ, setting a loop termination parameter β, and optimizing the parameters while σ is larger than β;
and 7, carrying out 3D target detection according to the 3D output parameters.
2. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 1, wherein: in step 1, the specific formula of the 2D target anchor point is [w, h]2D and the specific formula of the 3D target anchor point is [w, h, l, θ]3D, where w, h and l respectively represent given values of the width, height and length of the target detection object, and θ represents the observation angle of the camera towards the target detection object; the introduced preset depth information parameter is zP and the shared central pixel position is designated [x, y]P, where the 2D parameters are expressed in pixel coordinates as [x, y]2D = P · [w, h]2D, and P represents the known projection matrix used to project the target object; the 3D central position [x, y, z]3D in the camera coordinate system is projected into the image with the given known projection matrix P, and the depth information parameter zP is encoded according to the following formula:
zP · [x, y, 1]P = P · [x, y, z, 1]3D    (1)
3. the method for synchronously detecting the 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 2, wherein: in the step 2, each anchor point in the model prediction output characteristic diagram is defined as C, and each anchor point corresponds to [ tx, ty, tw, th]2D、[tx,ty,tz]P、[tw,th,tl,tθ]And 3D, setting the total number of anchor points of a single pixel on the feature map of each target detection object as na, presetting the number of training model classes as nc, hxw as the resolution of the feature map, setting the total number of output frames as nb as w × h × na, and distributing each anchor point at each pixel position [ x, y]P∈Rw×hThe first output anchor C represents a shared class prediction with dimension na × nc × h × w, where the output dimension of each class is na × h × w.
4. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 3, wherein: in step 2, the [ tx, ty, tw, th ]2D representing the 2D bounding box transformation is collectively referred to as b2D, wherein the bounding box transformation formula is as follows:
x'2D = xP + tx2D · w2D ,    y'2D = yP + ty2D · h2D
w'2D = w2D · exp(tw2D) ,    h'2D = h2D · exp(th2D)
where xP and yP denote the spatial center position of each frame, and the transformed frame b'2D is defined as [x, y, w, h]'2D; the remaining 7 output variables, namely the projection center transformation [tx, ty, tz]P, the scale transformation [tw, th, tl]3D and the direction change tθ3D, are collectively referred to as b3D, and the transformation b3D is applied to an anchor with parameters [w, h]2D, zP, [w, h, l, θ]3D:
x'P = xP + txP · w2D ,    y'P = yP + tyP · h2D
z'P = zP + tzP
w'3D = w3D · exp(tw3D) ,    h'3D = h3D · exp(th3D) ,    l'3D = l3D · exp(tl3D)
θ'3D = θ3D + tθ3D
similarly, the inverse of equation (1) is used to transform the projected 3D center position [x, y, z]'P in image space back to its camera coordinates [x, y, z]'3D; b'3D denotes [x, y, z]'P together with [w, h, l, θ]'3D.
5. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 4, wherein: in step 3, if the intersection-over-union (IoU) of the anchor frame with its GT is less than 0.5, the category of the target object is set to the background category and the boundary anchor frame is ignored or deleted; if the IoU of the anchor frame with its GT is greater than or equal to 0.5, the category index τ of the target object, the GT 2D frame b̂2D and the GT 3D frame b̂3D are generated according to the generated GT of the anchor frame.
6. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 5, wherein: in step 4, the classification loss function LC adopts a softmax-based multinomial logistic loss function with the formula:
LC = -log( exp(Cτ) / Σi exp(Ci) )
the 2D frame regression loss function Lb2D is analyzed for matching the GT frame b̂2D before transformation against the transformed frame b'2D through their intersection-over-union (IoU):
Lb2D = -log( IoU(b'2D, b̂2D) )
and the 3D frame regression loss function Lb3D is analyzed for optimizing each of the remaining 3D bounding box parameters with a smooth L1 regression loss function, whose formula is:
Lb3D = Σ SmoothL1( b'3D , b̂3D )
7. the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 6, wherein: in the step 4, a whole multitask network loss function L is also introduced, wherein the whole multitask network loss function L also comprises a regularization weight lambda1And λ2It defines the formula as follows:
Figure FDA0002456897700000042
8. the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 7, wherein: in the step 5, the specific process is as follows:
step 5-1, obtaining h × w feature maps by using the convolutional neural network DenseNet: introducing a hyperparameter b, where b represents the number of row-level bins, indicating that the feature map is divided into b bins along the transverse direction, each bin representing a specific convolution kernel k;
step 5-2, global/local feature extraction is carried out, wherein the step 5-2 is divided into two branches, and the flow is as follows:
step 5-2-1, global feature extraction: the global feature extraction adopts conventional convolution, which introduces the global feature Fglobal in the convolution process; for Fglobal, 3x3 convolution kernels with padding of 1 are introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, and conventional 3x3 and 1x1 convolutions act on the whole feature map,
then C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1x1 convolution kernel Oglobal;
Step 5-2-2, local feature extraction: for local feature extraction, a depth perception convolution is adopted, and the depth perception convolution introduces a global feature F in the convolution processlocalThe global feature FlocalIntroducing convolution kernels with the number of padding being 1 and 3x3, then performing nonlinear activation by a Relu function to generate 512 feature maps, using different 3x3 kernels to act on different bins, and dividing the bins into b bins along the longitudinal direction,
then, C, theta, [ t ] is output on each feature map Fx,ty,tw,th]2D,[tx,ty,tz]P,[tw,th,tl,tθ]3DA total of 13 outputs, each of which is connected to a 1x1 convolution kernel Olocal
Step 5-3, the output of the global feature and the local feature extraction is weighted: introducing a weighting number alpha obtained by neural network learning, wherein the weighting number alpha utilizes the space invariance of the convolutional neural network as an index of the 1 st to 13 th outputs, and the specific output function is as follows:
Oi=Oglobal i·αi+Olocal i·(1-αi) (8).
9. the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 8, wherein: in the step 5, the method further comprises the step 5-4: the backbone network of the 3D target detection method is established on the basis of DenseNet-121, and a dense connection mechanism for connecting all layers is proposed: that is, each layer will accept all layers before it as its extra input, ResNet will connect each layer with the 2-3 layers before by way of element-level addition, while in DenseNet, each layer will concat with all layers before in the channel dimension and serve as the input for the next layer, and for a network of L layers, DenseNet contains L (L +1)/2 connections in total, and DenseNet links the signatures from the various layers through concat connectors.
10. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 9, wherein: in step 6, the iteration steps of the algorithm are as follows:
the L1 loss between the projection of the 3D frame and the estimated 2D frame b'2D is used while θ is continuously adjusted, and the formula of the step of projecting the 3D frame onto the 2D frame is as follows:
γ3D denotes the corner coordinates of the 3D frame constructed from [x, y, z]'3D and [w, h, l, θ]'3D in camera coordinates, and
γP = P · γ3D ,    γ2D = γP / γP[z],
xmin = min(γ2D[x]),    ymin = min(γ2D[y]),
xmax = max(γ2D[x]),    ymax = max(γ2D[y])    (9),
where φ represents the index of the axis [ x, y, z ],
the L1 loss is calculated between the 2D frame parameters [xmin, ymin, xmax, ymax] obtained by projecting the 3D frame and the originally estimated 2D frame b'2D; when the loss is not improved within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the above operation is repeated while σ is larger than β;
in step 7, 13 parameters are output in total for the 3D result, namely: C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D.
CN202010308948.9A 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method Active CN111695403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Publications (2)

Publication Number Publication Date
CN111695403A true CN111695403A (en) 2020-09-22
CN111695403B CN111695403B (en) 2024-03-22

Family

ID=72476391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010308948.9A Active CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Country Status (1)

Country Link
CN (1) CN111695403B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07220084A (en) * 1994-02-04 1995-08-18 Canon Inc Arithmetic system, semiconductor device, and image information processor
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106886755A (en) * 2017-01-19 2017-06-23 北京航空航天大学 A kind of intersection vehicles system for detecting regulation violation based on Traffic Sign Recognition
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
US20200026953A1 (en) * 2018-07-23 2020-01-23 Wuhan University Method and system of extraction of impervious surface of remote sensing image
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
CN110555407A (en) * 2019-09-02 2019-12-10 东风汽车有限公司 pavement vehicle space identification method and electronic equipment
CN110942000A (en) * 2019-11-13 2020-03-31 南京理工大学 Unmanned vehicle target detection method based on deep learning
CN110852314A (en) * 2020-01-16 2020-02-28 江西高创保安服务技术有限公司 Article detection network method based on camera projection model

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266900A (en) * 2021-12-20 2022-04-01 河南大学 Monocular 3D target detection method based on dynamic convolution

Also Published As

Publication number Publication date
CN111695403B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN110942449B (en) Vehicle detection method based on laser and vision fusion
CN113111974B (en) Vision-laser radar fusion method and system based on depth canonical correlation analysis
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN111428765B (en) Target detection method based on global convolution and local depth convolution fusion
Vaudrey et al. Differences between stereo and motion behaviour on synthetic and real-world stereo sequences
JP2022515895A (en) Object recognition method and equipment
JP6574611B2 (en) Sensor system for obtaining distance information based on stereoscopic images
CN110765922A (en) AGV is with two mesh vision object detection barrier systems
EP3992908A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
Lore et al. Generative adversarial networks for depth map estimation from RGB video
CN116258817B (en) Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN114648758A (en) Object detection method and device, computer readable storage medium and unmanned vehicle
CN116129233A (en) Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
CN115937819A (en) Three-dimensional target detection method and system based on multi-mode fusion
EP3992909A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
CN115115917A (en) 3D point cloud target detection method based on attention mechanism and image feature fusion
CN114155414A (en) Novel unmanned-driving-oriented feature layer data fusion method and system and target detection method
CN111695403B (en) Depth perception convolutional neural network-based 2D and 3D image synchronous detection method
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
CN112990049A (en) AEB emergency braking method and device for automatic driving of vehicle
Xiao et al. Research on uav multi-obstacle detection algorithm based on stereo vision
CN114648639B (en) Target vehicle detection method, system and device
CN116468950A (en) Three-dimensional target detection method for neighborhood search radius of class guide center point
Itu et al. MONet-Multiple Output Network for Driver Assistance Systems Based on a Monocular Camera
Berrio et al. Semantic sensor fusion: From camera to sparse LiDAR information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant