CN111695403A - 2D and 3D image synchronous detection method based on depth perception convolutional neural network - Google Patents
- Publication number
- CN111695403A (application CN202010308948.9A)
- Authority
- CN
- China
- Prior art keywords
- frame
- neural network
- convolutional neural
- anchor
- anchor point
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06F18/25: Pattern recognition; analysing; fusion techniques
- G06T3/08: Geometric image transformations; projecting images onto non-planar surfaces, e.g. geodetic screens
- G06T7/593: Image analysis; depth or shape recovery from stereo images
- G06T7/90: Image analysis; determination of colour characteristics
- G06V2201/07: Indexing scheme relating to image or video recognition or understanding; target detection
Abstract
The invention discloses a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining the target anchor formula, introducing a preset depth information parameter, and specifying a shared center pixel position; step 2, generating preset anchor boxes from the anchor template defining the target object, the visual anchor generation formula, and the 3D prior anchors; step 3, checking the intersection-over-union of the anchor boxes; step 4, analyzing the network loss function of the target object; step 5, establishing a depth perception convolutional region proposal network: a DenseNet convolutional neural network is introduced to obtain a feature map, the feature map is fed into global feature extraction and local feature extraction, and the two are finally combined with a certain weight; step 6, forward optimization: introducing a parameter step size σ, setting a loop-termination parameter β, and optimizing the parameters; and step 7, outputting the 3D parameters. The invention enables higher safety in automatic driving and can be widely applied in the field of computer vision.
Description
Technical Field
The invention relates to a method for detecting targets of interest in computer-vision applications such as unmanned and assisted driving, and in particular to a method for synchronously detecting 2D and 3D images based on a depth perception convolutional neural network.
Background
Object detection uses computer technology to detect and identify the category and position of targets of interest (such as vehicles, pedestrians, and obstacles) in images or video, and is one of the important research areas of computer vision. With the continuous improvement and development of deep-learning technology, deep-learning-based object detection has wide application in many practical fields, including unmanned driving, assisted driving, face recognition, unmanned security, human-computer interaction, and behavior recognition.
As one of the important research directions in deep learning, deep convolutional neural networks have achieved significant results on object-detection tasks and can detect and identify targets of interest in 2D image data in real time. In unmanned-driving applications, however, the system must also obtain the position of the target of interest in 3D space in order to guarantee stability and safety while realizing the corresponding function.
The hardware currently used for 3D image recognition relies on cameras, which can be divided by function into monocular and multi-lens cameras. A monocular camera has a fixed focal length and is mostly applied to road-condition judgment in automatic driving, but it faces an irreconcilable contradiction between ranging range and distance: the wider the camera's viewing angle, the shorter the distance it can measure accurately, while the narrower the viewing angle, the longer the detectable distance. This is similar to the way human eyes see the world: the farther one looks, the narrower the range that can be covered, and the closer one looks, the wider the range. A binocular camera combines lenses of different focal lengths, and focal length is related to imaging sharpness, but it is difficult for a conventional vehicle-mounted camera to zoom frequently; moreover, multi-lens cameras are costly and their algorithmic complexity is higher than that of a monocular camera, so they are not currently suitable for unmanned systems.
To improve the accuracy of 3D image detection, existing 3D detection methods also rely on expensive lidar sensors, which provide sparse depth data as input. However, when such a lidar-based method is combined with a monocular camera, the sparse depth data lacks dense depth information, which makes 3D image detection difficult to realize.
Taking an automatic driving system as an example, for the object-detection task in this scene a conventional 2D target detection method acquires the real-time road scene during driving through a vehicle-mounted camera, inputs it into an existing algorithm, detects the targets of interest in the image with a trained detection model, outputs their position and category information to the decision layer of the control end, and plans how the vehicle should run. The problem, however, is that the 3D spatial position information of the detected target acquired by a monocular camera is unstable, and many influencing factors reduce the accuracy of this method.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which retains the accurate depth information of a laser scanner while the camera preserves richer semantic information, enabling higher driving performance and safety during automatic driving.
The invention provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining an anchor template for the target object: defining the specific formulas of the 2D target anchor and the 3D target anchor respectively, introducing a preset depth information parameter, and specifying a shared center pixel position; step 2, generating the anchor boxes of the model's predicted feature map: generating preset anchor boxes according to the anchor template defining the target object, the visual anchor generation formula, and the pre-computed 3D prior anchors; step 3, checking the intersection-over-union (IoU) of each anchor box with its ground truth (GT): checking whether the IoU with the GT is greater than or equal to 0.5; step 4, analyzing the network loss function of the target object, comprising analysis of the classification loss function L_C, the 2D box regression loss function, and the 3D box regression loss function; step 5, establishing the depth perception convolutional region proposal network: introducing a DenseNet convolutional neural network to obtain h×w feature maps, sending the feature maps into two branches, one for global feature extraction and the other for local feature extraction, and finally combining the features of the two branches with a certain weight; step 6, forward optimization: projecting the 3D information onto the 2D information and performing forward optimization, introducing a parameter step size σ for updating θ and setting a loop-termination parameter β, optimizing the parameters while the step size remains greater than β; and step 7, performing 3D target detection according to the 3D output parameters.
In the above technical solution, in step 1, the specific formula of the 2D target anchor point is [ w, h ]2D, and the specific formula of the 3D target anchor point is [ w, h, l, θ ]3D, where w, h, and l respectively represent given values of the width, height, and length of the target detection object, and θ represents an observation angle of the camera to the target detection object; the introduced preset depth information parameter is Zp, the shared central pixel position is designated as [ x, y ] P, wherein the parameter expressed by 2D is expressed as [ x, y ]2D ═ P · [ w, h ]2D according to the pixel coordinate, P represents the coordinate point of the known projection matrix which needs to project the target object, the 3D central position [ x, y, z ]3D under the camera coordinate system is projected into the image of the given known projection matrix P in a three-dimensional manner, and the depth information parameter Zp is encoded, and the formula is as follows:
in the above technical solution, in step 2, each anchor in the model's predicted output feature map is defined as C, and each anchor corresponds to [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, and [t_w, t_h, t_l, t_θ]_3D; with the total number of anchors at a single pixel of the feature map of each detected object set to n_a, the preset number of training-model classes to n_c, and h×w the resolution of the feature map, the total number of output boxes is n_b = w × h × n_a, and each anchor is distributed at every pixel position [x, y]_P ∈ R^(w×h); the first output, anchor C, represents a shared class prediction of dimension n_a × n_c × h × w, where the output dimension of each class is n_a × h × w.
In the above technical solution, in step 2, the [ tx, ty, tw, th ]2D representing the 2D bounding box transformation is collectively referred to as b2D, where the bounding box transformation formula is as follows:
wherein x_P and y_P denote the spatial center position of each box, and the transformed box b′_2D is defined as [x, y, w, h]′_2D. The seven remaining output variables, namely the projection-center transformation [t_x, t_y, t_z]_P, the scale transformation [t_w, t_h, t_l]_3D, and the direction change t_θ3D, are collectively referred to as b_3D; the b_3D transformation is applied to the anchor with parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
similarly, the 3D center position [x, y, z]′_P obtained after projection in image space is transformed using the inverse of equation (1) to calculate its camera coordinates [x, y, z]′_3D, and b′_3D represents [x, y, z]′_P and [w, h, l, θ]′_3D.
In the above technical solution, in the step 3, if the intersection ratio of GT of the anchor frame is less than 0.5, the category of the target object is set as the background category, and the boundary anchor frame is ignored or deleted; if the intersection ratio IOU of the GT of the anchor frame is more than or equal to 0.5, generating the category index tau of the target object and the 2D frame according to the generated GT of the anchor frameAnd 3D frame
In the above technical solution, in the step 4, the classification loss function LC adopts a softmax-based polynomial logistic loss function, and its formula is:
The 2D box regression loss function is analyzed as the intersection-over-union (IoU) between the ground-truth box and the transformed box b′_2D:
The 3D box regression loss function optimizes each of the remaining 3D bounding-box parameters with a smooth-L1 regression loss function, whose formula is:
in the above technical solution, in step 4, an overall multi-task network loss function L is further introduced, which includes regularization weights λ_1 and λ_2 and is defined as follows:
in the above technical solution, in the step 5, the specific process is as follows:
step 5-1, obtain h×w feature maps using the convolutional neural network DenseNet: introduce a hyperparameter b, where b represents the number of row-level bins and denotes that the feature map is divided laterally into b bins, each bin representing a specific convolution kernel k; step 5-2, perform global/local feature extraction, which is divided into two branches as follows: step 5-2-1, global feature extraction: the global branch adopts conventional convolution, which introduces the global feature F_global during convolution; 3×3 convolution kernels with padding 1 are introduced and then non-linearly activated by the ReLU function to generate 512 feature maps, conventional 3×3 and 1×1 convolutions act on the whole feature map, and on each feature map F the branch then outputs C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, and [t_w, t_h, t_l, t_θ]_3D, 13 outputs in total, each connected to a 1×1 convolution kernel O_global; step 5-2-2, local feature extraction: the local branch adopts depth-aware convolution, which introduces the local feature F_local during convolution; 3×3 convolution kernels with padding 1 are introduced and then non-linearly activated by the ReLU function to generate 512 feature maps, different 3×3 kernels act on different bins (the feature map being divided into b bins along the longitudinal direction), and on each feature map F the branch then outputs the same 13 quantities, each connected to a 1×1 convolution kernel O_local; step 5-3, weight the outputs of global and local feature extraction: introduce a learned weight α, which uses the spatial invariance of the convolutional neural network as the index of the 1st to 13th outputs; the specific output function is:
O_i = O_global,i · α_i + O_local,i · (1 − α_i)    (8).
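As a numeric illustration of output equation (8), the sketch below (hypothetical names, not the patent's code) blends per-output global and local branch responses with a learned weight α_i:

```python
# Illustrative sketch of equation (8): each of the 13 outputs O_i is a blend
# of the global and local branch outputs weighted by a learned alpha_i.
# Scalars stand in for the actual feature-map tensors.
def blend_outputs(o_global, o_local, alphas):
    return [g * a + l * (1.0 - a)
            for g, l, a in zip(o_global, o_local, alphas)]

blended = blend_outputs([1.0, 2.0], [3.0, 4.0], [0.25, 0.5])  # -> [2.5, 3.0]
```

With α_i = 1 the output is purely global, with α_i = 0 purely local; the network learns a value per output map.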
in the above technical solution, step 5 further includes step 5-4: the backbone network of the 3D target detection method is built on DenseNet-121, which proposes a dense connection mechanism connecting all layers: each layer accepts all preceding layers as its additional input. ResNet connects each layer with the preceding 2-3 layers by element-wise addition, while in DenseNet each layer is concatenated with all preceding layers along the channel dimension and serves as the input of the next layer; for an L-layer network, DenseNet therefore contains L(L+1)/2 connections in total, and DenseNet directly concatenates the feature maps from different layers.
In the above technical solution, in step 6, the iteration step of the algorithm is as follows: by projecting the 3D frame and the 2D estimated frame b'2DAs L1lossAnd, while continuously adjusting θ, the formula of the step of projecting 3D to the 2D frame is as follows:
γP=P·γ3D,γ2D=γP/γP[φz],
xmin=min(γ3D[φx]),ymin=min(γ3D[γ3D[φy]])
xmax=max(γ3D[φx]),ymax=max(γ3D[γ3D[φy]])
(9),
where φ represents the index of the axes [x, y, z]; the L1 loss is computed between the 2D box parameters [x_min, y_min, x_max, y_max] projected from the 3D box and the originally estimated 2D box b′_2D. When the loss does not improve within the range θ ± σ, the step size σ is reduced by the attenuation factor γ, and the operation is repeated while σ remains greater than β. In step 7, 13 parameters in total are output according to the 3D output, namely C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, and [t_w, t_h, t_l, t_θ]_3D.
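The step-6 refinement of θ can be sketched as a simple hill climb. This is a hedged reconstruction under stated assumptions: `loss_fn` stands in for the 3D-to-2D projection L1 loss described above, and the default values of σ, γ, and β are illustrative, not from the patent:

```python
# Hedged sketch of the step-6 hill climb over the orientation theta: try
# theta +/- sigma, keep improvements, and shrink the step size by the
# attenuation factor gamma whenever no update occurs, stopping once the
# step falls below the termination parameter beta.
def optimize_theta(theta, loss_fn, sigma=0.3, gamma=0.5, beta=1e-3):
    best = loss_fn(theta)
    while sigma > beta:
        improved = False
        for cand in (theta + sigma, theta - sigma):
            if loss_fn(cand) < best:
                theta, best = cand, loss_fn(cand)
                improved = True
        if not improved:
            sigma *= gamma  # no update within theta +/- sigma: decay the step
    return theta
```

The loop only ever accepts strict improvements, so the returned θ minimizes the projection loss within the final step size.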
The 2D and 3D image synchronous detection method based on the depth perception convolutional neural network has the following beneficial effects. The scheme of the invention provides an algorithm that fuses lidar point clouds with RGB (red (R), green (G), blue (B) three-channel color) images. 3D visual target analysis plays an important role in the visual perception system of an autonomous vehicle, and modern autonomous vehicles are usually equipped with sensors such as lidar and cameras. Regarding the application characteristics of these two sensors, both the camera and the lidar can be used for target detection: the laser scanner has the advantage of accurate depth information, while the camera preserves richer semantic information, so integrating lidar point clouds with RGB images can realize an autonomous vehicle with higher performance and safety. Using lidar together with object detection in the three-dimensional space of image data achieves highly accurate localization and identification of objects in a road scene.
Drawings
FIG. 1 is a flow chart of a basic idea of a method for synchronously detecting 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 2 is a detailed flowchart of the method for synchronously detecting 2D and 3D images based on the depth-sensing convolutional neural network according to the present invention;
FIG. 3 is a schematic diagram illustrating the parameter definition of an anchor point template in the method for synchronously detecting 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 4 is a block diagram of a three-dimensional anchor of a 3D target object in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 5 is a bird's eye view of a three-dimensional anchor frame of a 3D target object in the depth-aware convolutional neural network-based 2D and 3D image synchronous detection method of the present invention;
FIG. 6 is a diagram of RPN network architecture in the method for synchronous detection of 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 7 is a schematic diagram of extracting local features of horizontal segmentation in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 8 is a schematic diagram of extracting longitudinal segmentation local features in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
fig. 9 is a network architecture diagram of DenseNet in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and examples, which should not be construed as limiting the invention.
Referring to fig. 1, the basic idea of the method for synchronously detecting 2D and 3D images based on the depth-aware convolutional neural network of the present invention is as follows: input image → simultaneous detection processing of 2D and 3D images → projection of 3D information to 2D information and forward optimization processing → detection of 3D objects according to 3D output parameters.
Referring to fig. 2, the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network of the present invention specifically comprises the following steps:
step 1: an anchor template for the target object is defined. In order to predict the 2D frame and the 3D frame simultaneously, anchor templates need to be defined in respective dimensional spaces, and it should be noted that the 2D frame herein is the maximum length and width observed by the 3D target object. Specifically, taking an automobile as an example, referring to fig. 3, specific formulas of a 2D target anchor point and an anchor point template of a 3D target are [ w, h ]2D and [ w, h, l, θ ]3D, respectively, where w, h, and l represent the width, height, and length of a target detection object, respectively, and w, h, and l are given values in a detection camera coordinate system; in addition, since the 3D object is different from the 2D object and has rotation, its θ represents the viewing angle of the camera to the object to be detected, which is equivalent to the camera rotating around the Y axis of its camera coordinate system, and the viewing angle takes into account the relative orientation of the object with respect to the viewing angle of the camera, rather than the ground's Bird's Eye View (BEV), where introducing θ makes it more meaningful to intuitively estimate the viewing angle when processing 3D image features.
In order to fully define the position of the 2D/3D box of the target object, a preset depth information parameter z_P is introduced and a shared center pixel position [x, y]_P is specified, where the 2D parameters are expressed in pixel coordinates, i.e. [x, y]_2D = P · [w, h]_2D, with P representing the known projection matrix used to project the target object; in 3D object detection, the 3D center position [x, y, z]_3D in the camera coordinate system is projected into the image with the given known projection matrix P, and the depth information parameter z_P is encoded by the following formula:
wherein the mean statistics of each preset depth information parameter z_P and of [w, h, l, θ]_3D of the 3D target object are computed separately for each anchor in advance; parameters of this kind serve as strong prior information to ease the difficulty of 3D parameter estimation. Specifically, for each anchor, the statistics for z_P and [w, h, l, θ]_3D use matched samples whose IoU (intersection-over-union) with the 3D object exceeds 0.5; the anchor thus represents a discrete template in which the 3D prior can be used as a strong initial guess under the assumption of reasonably consistent scene geometry.
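As an illustration of the center encoding described above, the sketch below projects a 3D camera-frame center with a known 3×4 projection matrix P and recovers the pixel position [x, y]_P and the depth parameter z_P. The matrix values are hypothetical KITTI-style intrinsics, not taken from the patent:

```python
import numpy as np

# Hypothetical 3x4 projection matrix P (KITTI-style values, illustrative only).
P = np.array([
    [721.5,   0.0, 609.6, 44.9],
    [  0.0, 721.5, 172.9,  0.2],
    [  0.0,   0.0,   1.0,  0.003],
])

def encode_center(center_3d):
    """Project a 3D camera-frame center [x, y, z] to the shared pixel
    position [x, y]_P and the encoded depth parameter z_P."""
    hom = P @ np.append(center_3d, 1.0)   # homogeneous image-space point
    z_p = hom[2]                          # encoded depth parameter
    x_p, y_p = hom[0] / z_p, hom[1] / z_p # shared center pixel position
    return np.array([x_p, y_p]), z_p

center_px, z_p = encode_center(np.array([2.0, 1.5, 30.0]))
```

The inverse of this mapping is what later recovers camera coordinates from the projected center, as described for equation (1).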
Step 2: and generating an anchor frame of the model prediction characteristic diagram according to the anchor point template defining the target object. Specifically, the preset anchor frame is generated according to the anchor point template of the target object and expressed by a visual anchor point generation formula and a pre-calculated 3D prior anchor point, specifically, the generated three-dimensional anchor frame can be referred to as fig. 4, and the aerial view is referred to as fig. 5.
Furthermore, each anchor in the model's predicted output feature map is defined as C; the anchors correspond to [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, and [t_w, t_h, t_l, t_θ]_3D, the total number of anchors at a single pixel of the feature map of each detected object is n_a, the number of preset training-model classes is n_c, and h×w is the resolution of the feature map.
Therefore, the total number of output boxes is n_b = w × h × n_a;
each anchor point is distributed at each pixel position [ x, y ]]P∈Rw×h,
The first output anchor point C represents a shared class prediction with dimensions na × nc × h × w, where the output dimensions of each other (per class) are na × h × w.
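A quick numeric check of the output-box count n_b = w × h × n_a and the shared class-prediction dimensions, using hypothetical values for the feature-map resolution, anchor count, and class count:

```python
# Hypothetical sizes, for illustration only.
h, w = 32, 110                    # feature-map resolution h x w (assumed)
na = 36                           # anchors per pixel position (assumed)
nc = 4                            # preset training-model classes (assumed)

nb = w * h * na                   # total number of output boxes
class_pred_dim = (na * nc, h, w)  # shared class prediction C: na*nc x h x w
per_class_dim = (na, h, w)        # each class occupies na x h x w
```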
Further, [t_x, t_y, t_w, t_h]_2D represents the 2D bounding-box transformation, collectively referred to as b_2D, where the bounding-box transformation formula is as follows:
where x_P and y_P represent the spatial center position of each box. The transformed box b′_2D is defined as [x, y, w, h]′_2D. The following seven outputs represent the projection-center transformation [t_x, t_y, t_z]_P, the scale transformation [t_w, t_h, t_l]_3D, and the direction change t_θ3D, collectively referred to as b_3D. As in 2D, the transformation is applied to the anchor with parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
similarly, b'3DRepresents [ x, y, z ]]′PAnd [ w, h, l, θ ]]′3D. As previously described, the authors estimate the 3D center of the projection rather than the camera coordinates to better handle the image space based convolution features. In the inference process, the 3D center position [ x, y, z ] obtained after projection in image space is used by the inverse transform of equation (1)]′PTo calculate its camera coordinates x, y, z]′3D。
And step 3: according to the generated anchor boxes, check whether the intersection-over-union (IoU) of each anchor box with its ground truth (GT) is greater than or equal to 0.5.
If the IoU of the anchor box with its GT is less than 0.5, the category of the target object is set to the background category and the boundary anchor box is ignored or deleted;
if the IoU with the GT is greater than or equal to 0.5, the category index τ of the target object, the 2D GT box, and the 3D GT box are generated from the anchor's ground truth (GT), and the following step 4 is performed.
And 4, step 4: analyzing the network loss function of the target. Further, this step includes classification loss function LC analysis, 2D frame regression loss function analysis, and 3D frame regression loss function analysis.
The classification loss function L_C adopts a softmax-based multinomial logistic loss function, and its formula is:
A 2D box regression loss is also introduced, matching the IoU between the ground-truth box and the transformed box b′_2D:
The 3D box regression loss function optimizes each of the remaining 3D bounding-box parameters with a smooth-L1 regression loss function, whose formula is:
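The smooth-L1 formula image is not reproduced in this text; the sketch below uses the standard definition (quadratic near zero, linear in the tails) as an assumption:

```python
# Standard smooth-L1 regression loss (assumed form; the patent's formula
# image is missing here). `thresh` is the quadratic/linear switch point,
# unrelated to the termination parameter beta of step 6.
def smooth_l1(pred, target, thresh=1.0):
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        # quadratic for small residuals, linear for large ones
        total += 0.5 * d * d / thresh if d < thresh else d - 0.5 * thresh
    return total / len(pred)
```

The quadratic region keeps gradients small near the optimum while the linear region limits the influence of outlier boxes.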
Further, for the overall network framework, an overall multi-task network loss function L is also introduced, which includes regularization weights λ_1 and λ_2 and is defined as follows:
and 5: and establishing a deep perception convolution area proposal network to improve the ability of high-order feature space perception in the area proposal network.
A hyperparameter b is introduced, where b represents the number of row-level bins, denoting the lateral division of the feature map into b bins, each bin representing a particular convolution kernel k.
And step 5-1, introducing a DenseNet convolutional neural network. DenseNet (a deeper convolutional neural network) is used as the basic feature extractor to obtain h×w feature maps, which are then sent into two branches, one for global feature extraction and the other for local feature extraction; the features of the two branches are finally combined with a certain weight. The global block uses conventional 3×3 and 1×1 convolutions acting on the whole feature map, while the local block uses different 3×3 kernels acting on different bins, the feature map being divided into b bins along the longitudinal direction, as shown in fig. 6.
It should be noted that, for the local feature extraction, the present technology also adopts two feature extraction methods, as shown in fig. 7.
In local feature extraction scheme 1, the b horizontal bars produced by dividing the map into b bins along the longitudinal direction are selected by a random function, which increases the randomness of feature extraction during convolution and thereby improves the recognition rate.
Further, in order to identify the 3D target image more accurately, the present technique also provides a longitudinal segmentation method, the specific segmentation of which is shown in fig. 8.
Because this longitudinal cutting method yields more local features from feature extraction, the recognition rate is improved.
In addition, the backbone network of the present 3D target detection method is built on DenseNet-121; the DenseNet network architecture is shown in fig. 9. DenseNet proposes a more aggressive dense connection mechanism: all layers are interconnected, and each layer accepts all preceding layers as its additional input. By contrast, ResNet adds a shortcut between each layer and an earlier layer (typically 2-3 layers back), with the connection realized by element-wise addition. In DenseNet, each layer is concatenated (concat) with all preceding layers along the channel dimension (the feature-map sizes of these layers being the same) and used as input to the next layer. For an L-layer network, DenseNet therefore contains L(L+1)/2 connections in total, which is dense compared with ResNet. Because DenseNet concatenates feature maps from different layers directly, it enables feature reuse and improves efficiency.
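The dense connection mechanism described above, where each layer receives the channel-wise concatenation of all earlier outputs, can be sketched as follows. The random 1×1 projection standing in for DenseNet's BN-ReLU-Conv composite layer is an illustrative simplification, not the actual DenseNet-121 layer.

```python
import numpy as np

def conv_layer(x, out_ch, rng):
    # Stand-in for a BN-ReLU-Conv composite layer: a random linear
    # projection over channels followed by ReLU keeps the sketch runnable.
    w = rng.standard_normal((x.shape[0], out_ch)) * 0.01
    return np.maximum(x.T @ w, 0.0).T          # shape (out_ch, H*W)

def dense_block(x, num_layers, growth, rng):
    # Each layer takes the channel-wise concat of ALL earlier feature maps
    # as input and contributes `growth` new channels to the running stack.
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)  # concat along channel axis
        features.append(conv_layer(inp, growth, rng))
    return np.concatenate(features, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8 * 8))            # 16 channels, flattened 8x8 map
y = dense_block(x, num_layers=4, growth=12, rng=rng)
# Channels grow as 16 + 4*12 = 64; a 4-layer block has 4*5/2 = 10 connections.
print(y.shape[0])
```

Note how the input `x` survives unchanged as the first 16 channels of the output, which is exactly the feature-reuse property the text attributes to DenseNet.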
And 5-2, carrying out global/local feature extraction. The step 5-2 is divided into two branches, step 5-2-1 and step 5-2-2, respectively.
Step 5-2-1: global feature extraction. Global feature extraction adopts conventional convolution, whose kernels act as a global convolution over the whole space; it introduces the global feature F_global in the convolution process. For F_global, a 3×3 convolution kernel with padding 1 (gap filling) is introduced and then nonlinearly activated by the ReLU function (Rectified Linear Unit) to generate 512 feature maps.
Then 13 outputs are produced on each feature map F (from the above, the 13 outputs are respectively C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D), and each output is connected to a 1×1 convolution kernel O_global.
Step 5-2-2: local feature extraction. Local feature extraction adopts depth-aware convolution, i.e. local convolution. The depth-aware convolution introduces the local feature F_local in the convolution process; for F_local, a 3×3 convolution kernel with padding 1 is introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps.
Then 13 outputs are produced on each feature map F (from the above, the 13 outputs are respectively C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D), and each output is connected to a 1×1 convolution kernel O_local.
Step 5-3: weight the extracted global and local feature outputs. A weighting coefficient α (learned by the network) is introduced; exploiting the spatial invariance of the convolutional neural network, α indexes the 1st to 13th outputs, with the specific output function:
O_i = O_global,i · α_i + O_local,i · (1 − α_i)    (8)
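The two branches and the weighted fusion of equation (8) can be sketched on a single-channel map as follows. The bin boundaries, the fixed α value, and the helper names are illustrative assumptions; in the patented method α is learned and the convolutions produce 512 channels.

```python
import numpy as np

def conv3x3(x, k):
    # 'same' 3x3 convolution on a single-channel map (padding 1).
    h, w = x.shape
    p = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = (p[i:i+3, j:j+3] * k).sum()
    return out

def depth_aware_conv(x, kernels):
    # Rows are split into b horizontal bins; bin i uses its own 3x3 kernel,
    # mimicking the local (depth-aware) branch of the network.
    b = len(kernels)
    h = x.shape[0]
    full = [conv3x3(x, k) for k in kernels]
    out = np.zeros_like(x)
    edges = np.linspace(0, h, b + 1).astype(int)
    for i in range(b):
        out[edges[i]:edges[i+1]] = full[i][edges[i]:edges[i+1]]
    return out

rng = np.random.default_rng(1)
x = rng.standard_normal((12, 12))
k_global = rng.standard_normal((3, 3))
k_local = [rng.standard_normal((3, 3)) for _ in range(4)]   # b = 4 bins

o_global = conv3x3(x, k_global)          # global branch: one kernel everywhere
o_local = depth_aware_conv(x, k_local)   # local branch: per-bin kernels
alpha = 0.7                              # learned in the real network; fixed here
o = alpha * o_global + (1 - alpha) * o_local   # equation (8)
print(o.shape)
```

The design choice the text describes is visible here: rows near the top of the image (far depths in a driving scene) are processed by a different kernel than rows near the bottom, while the global branch keeps full spatial weight sharing.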
Step 6: project the 3D information to the 2D information and carry out forward optimization processing. Here a parameter step size σ is introduced (for updating θ) and a loop termination parameter β is set; the optimization is performed while σ is larger than β.
The algorithm iterates by taking the L1 loss between the projection of the 3D box and the estimated 2D box b′_2D while continuously adjusting θ. The formula of the step projecting the 3D box to the 2D box is as follows:
γ_P = P · γ_3D,   γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),   y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),   y_max = max(γ_2D[φ_y])    (9)
where φ represents the index of the axis [ x, y, z ].
The 2D frame parameters [x_min, y_min, x_max, y_max] projected from the 3D frame and the originally estimated 2D frame b′_2D are used to calculate the L1 loss. When the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the above operation is repeated while σ > β.
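This θ hill-climbing step can be sketched as follows. The 3×4 projection matrix, the [x, y, z, w, h, l, θ] box layout, the y-axis corner rotation, and all helper names are assumptions introduced for illustration; only the projection formula (9) and the accept/decay loop follow the text above.

```python
import numpy as np

def corners_3d(x, y, z, w, h, l, theta):
    # Eight corners of a 3D box rotated by theta about the camera y-axis.
    c, s = np.cos(theta), np.sin(theta)
    dx, dy, dz = np.meshgrid([-w/2, w/2], [-h/2, h/2], [-l/2, l/2])
    pts = np.stack([dx.ravel(), dy.ravel(), dz.ravel()])
    rot = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return rot @ pts + np.array([[x], [y], [z]])

def project_box(box3d, P):
    # gamma_P = P . gamma_3D ; gamma_2D = gamma_P / gamma_P[phi_z]  (eq. 9)
    g3d = corners_3d(*box3d)
    gP = P @ np.vstack([g3d, np.ones((1, 8))])
    g2d = gP[:2] / gP[2]
    return np.array([g2d[0].min(), g2d[1].min(), g2d[0].max(), g2d[1].max()])

def refine_theta(box3d, b2d_est, P, sigma=0.3, beta=0.01, gamma_decay=0.5):
    # Hill-climb theta: accept theta +/- sigma if it lowers the L1 distance
    # between the projected 3D box and the estimated 2D box b'_2D;
    # otherwise decay sigma by gamma and repeat while sigma > beta.
    box = list(box3d)
    loss = np.abs(project_box(box, P) - b2d_est).sum()
    while sigma > beta:
        improved = False
        for cand in (box[6] - sigma, box[6] + sigma):
            trial = box[:6] + [cand]
            l = np.abs(project_box(trial, P) - b2d_est).sum()
            if l < loss:
                box, loss, improved = trial, l, True
        if not improved:
            sigma *= gamma_decay
    return box[6]

# Usage with an assumed pinhole camera and a car-sized box 10 m ahead:
P = np.array([[700., 0, 320, 0], [0, 700, 240, 0], [0, 0, 1, 0]])
true_box = [0.5, 1.0, 10.0, 1.6, 1.5, 3.9, 0.8]
target_2d = project_box(true_box, P)          # stands in for b'_2D
theta_hat = refine_theta(true_box[:6] + [0.2], target_2d, P)
```

By construction the loop only ever accepts a θ that lowers the L1 loss, so the refined orientation is never worse than the starting estimate.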
Step 7: output 13 parameters, respectively: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D, and finally carry out 3D target detection.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Those not described in detail in this specification are within the skill of the art.
Claims (10)
1. A 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, characterized by comprising the following steps:
step 1, defining an anchor point template of a target object: respectively defining specific formulas of a 2D target anchor point and a 3D target anchor point, introducing a preset depth information parameter, and specifying a shared central pixel position;
step 2, generating an anchor frame of the model prediction characteristic diagram: generating a preset anchor frame according to an anchor point template defining a target object and a visual anchor point generation formula and a pre-calculated 3D prior anchor point;
step 3, checking the intersection ratio of GT of the anchor frame: checking whether the intersection ratio of GT of the anchor frame is more than or equal to 0.5 or not according to the generated anchor frame;
step 4, analyzing a network loss function of the target object: the method comprises the steps of classification loss function LC analysis, 2D frame regression loss function analysis and 3D frame regression loss function analysis;
step 5, establishing a depth perception convolution area suggestion network: introducing a DenseNet convolutional neural network to obtain an h×w-dimensional feature map, then sending the feature map into two branches respectively, one branch being global feature extraction and the other local feature extraction, and finally combining the features of the two branches according to a certain weight;
step 6, forward optimization: projecting the 3D information to the 2D information and carrying out forward optimization processing, introducing a parameter step size σ for updating θ, setting a loop termination parameter β, and performing the input of the optimization parameter while σ is larger than the parameter β;
and 7, carrying out 3D target detection according to the 3D output parameters.
2. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 1, wherein: in the step 1, the specific formula of the 2D target anchor point is [w, h]_2D and the specific formula of the 3D target anchor point is [w, h, l, θ]_3D, where w, h and l respectively represent given values of the width, height and length of a target detection object, and θ represents the observation angle of the camera toward the target detection object; the introduced preset depth information parameter is z_P, and the shared central pixel position is designated [x, y]_P, wherein the 2D parameter is expressed in pixel coordinates as [x, y]_2D = P · [w, h]_2D, P representing the known projection matrix for the coordinate points of the target object to be projected; the 3D central position [x, y, z]_3D in the camera coordinate system is projected into the image through the given known projection matrix P, and the depth information parameter z_P is encoded, with the formula: [x · z, y · z, z]_P^T = P · [x, y, z, 1]_3D^T.
3. The method for synchronously detecting the 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 2, wherein: in the step 2, each anchor point in the model prediction output feature map is defined as C, and each anchor point corresponds to [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P and [t_w, t_h, t_l, t_θ]_3D; the total number of anchor points of a single pixel on the feature map of each target detection object is set as n_a, the preset number of training model classes as n_c, and h×w as the resolution of the feature map, so that the total number of output frames is n_b = w × h × n_a, with anchor points distributed at each pixel position [x, y]_P ∈ R^{w×h}; the first output anchor C represents a shared class prediction of dimension n_a × n_c × h × w, where the output dimension of each class is n_a × h × w.
4. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 3, wherein: in step 2, [t_x, t_y, t_w, t_h]_2D, representing the 2D bounding-box transformation, is collectively denoted b_2D, wherein the bounding-box transformation formula is as follows: x′_2D = x_P + t_x · w_2D, y′_2D = y_P + t_y · h_2D, w′_2D = w_2D · exp(t_w), h′_2D = h_2D · exp(t_h),
wherein x_P and y_P denote the spatial center position of each frame, and the transformed frame b′_2D is defined as [x, y, w, h]′_2D; the seven output variables, namely the projection center [t_x, t_y, t_z]_P, the scale transformation [t_w, t_h, t_l]_3D and the orientation transformation t_θ, are collectively denoted b_3D, the conversion b_3D being applied to anchor points with parameters [w, h]_2D, z_P and [w, h, l, θ]_3D: x′_P = x_P + t_x · w_2D, y′_P = y_P + t_y · h_2D, z′_P = z_P + t_z, w′_3D = w_3D · exp(t_w), h′_3D = h_3D · exp(t_h), l′_3D = l_3D · exp(t_l), θ′_3D = θ_3D + t_θ;
similarly, the 3D center position [x, y, z]′_P obtained after projection in image space is transformed using the inverse of equation (1) to calculate its camera coordinates [x, y, z]′_3D, and b′_3D represents [x, y, z]′_P together with [w, h, l, θ]′_3D.
5. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 4, wherein: in the step 3, if the intersection-over-union of the anchor frame with its GT is less than 0.5, the category of the target object is set as a background category and the boundary anchor frame is ignored or deleted; if the intersection-over-union IoU of the anchor frame with its GT is greater than or equal to 0.5, the category index τ of the target object, the 2D ground-truth frame ĝ_2D and the 3D ground-truth frame ĝ_3D are generated from the GT of the anchor frame.
6. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 5, wherein: in the step 4, the classification loss function L_C adopts a softmax-based multinomial logistic loss function, formulated as: L_C = −log(softmax(C)_τ);
the 2D frame regression loss function L_b2D analysis is used for matching the pre-transformation ground truth ĝ_2D and the transformed frame b′_2D via the intersection-over-union between them: L_b2D = −log(IoU(b′_2D, ĝ_2D));
the 3D frame regression loss function L_b3D analysis is used for optimizing each of the remaining 3D bounding-box parameters with a Smooth L1 regression loss function, formulated as: L_b3D = SmoothL1(b′_3D, ĝ_3D).
7. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 6, wherein: in the step 4, an overall multitask network loss function L is also introduced, containing regularization weights λ_1 and λ_2 and defined as: L = L_C + λ_1 · L_b2D + λ_2 · L_b3D.
8. the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 7, wherein: in the step 5, the specific process is as follows:
step 5-1, obtaining an h×w-dimensional feature map by using the convolutional neural network DenseNet: introducing a hyperparameter b, wherein b represents the number of row-level bins and indicates that the feature map is divided into b bins along the transverse direction, each bin corresponding to a specific convolution kernel k;
step 5-2, global/local feature extraction is carried out, wherein the step 5-2 is divided into two branches, and the flow is as follows:
step 5-2-1, global feature extraction: the global feature extraction adopts conventional convolution, which introduces the global feature F_global in the convolution process; for F_global, a 3×3 convolution kernel with padding 1 is introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, with conventional 3×3 and 1×1 convolutions applied over the whole feature map;
then C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P and [t_w, t_h, t_l, t_θ]_3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1×1 convolution kernel O_global;
step 5-2-2, local feature extraction: the local feature extraction adopts depth-aware convolution, which introduces the local feature F_local in the convolution process; for F_local, a 3×3 convolution kernel with padding 1 is introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, with different 3×3 kernels acting on different bins, the map being divided into b bins along the longitudinal direction;
then C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P and [t_w, t_h, t_l, t_θ]_3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1×1 convolution kernel O_local;
step 5-3, weighting the outputs of the global and local feature extraction: introducing a weighting coefficient α obtained by neural network learning, wherein α exploits the spatial invariance of the convolutional neural network as an index of the 1st to 13th outputs, with the specific output function:
O_i = O_global,i · α_i + O_local,i · (1 − α_i)    (8).
9. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 8, wherein: the step 5 further comprises a step 5-4: the backbone network of the 3D target detection method is built on DenseNet-121, which adopts a dense connection mechanism connecting all layers, i.e. each layer accepts all preceding layers as its additional input; ResNet connects each layer with a layer 2-3 layers earlier by element-wise addition, while in DenseNet each layer is concatenated with all preceding layers in the channel dimension and serves as the input of the next layer; for an L-layer network, DenseNet contains L(L+1)/2 connections in total and links the feature maps from the various layers through concat connectors.
10. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 9, wherein: in step 6, the iteration steps of the algorithm are as follows:
the projection of the 3D frame and the estimated 2D frame b′_2D are taken as an L1 loss while θ is continuously adjusted, and the formula of the step projecting the 3D frame to the 2D frame is as follows:
γ_P = P · γ_3D,   γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),   y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),   y_max = max(γ_2D[φ_y])    (9),
where φ represents the index of the axis [ x, y, z ],
the 2D frame parameters [x_min, y_min, x_max, y_max] projected from the 3D frame and the originally estimated 2D frame b′_2D are used to calculate the L1 loss; when the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the above operation is repeatedly executed while σ > β;
in the step 7, a total of 13 parameters are output for the 3D detection, the 13 parameters being respectively: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010308948.9A CN111695403B (en) | 2020-04-19 | 2020-04-19 | Depth perception convolutional neural network-based 2D and 3D image synchronous detection method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111695403A true CN111695403A (en) | 2020-09-22 |
CN111695403B CN111695403B (en) | 2024-03-22 |
Family
ID=72476391
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010308948.9A Active CN111695403B (en) | 2020-04-19 | 2020-04-19 | Depth perception convolutional neural network-based 2D and 3D image synchronous detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111695403B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114266900A (en) * | 2021-12-20 | 2022-04-01 | 河南大学 | Monocular 3D target detection method based on dynamic convolution |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH07220084A (en) * | 1994-02-04 | 1995-08-18 | Canon Inc | Arithmetic system, semiconductor device, and image information processor |
CN106599939A (en) * | 2016-12-30 | 2017-04-26 | 深圳市唯特视科技有限公司 | Real-time target detection method based on region convolutional neural network |
CN106886755A (en) * | 2017-01-19 | 2017-06-23 | 北京航空航天大学 | A kind of intersection vehicles system for detecting regulation violation based on Traffic Sign Recognition |
CN109543601A (en) * | 2018-11-21 | 2019-03-29 | 电子科技大学 | A kind of unmanned vehicle object detection method based on multi-modal deep learning |
WO2019144575A1 (en) * | 2018-01-24 | 2019-08-01 | 中山大学 | Fast pedestrian detection method and device |
EP3525131A1 (en) * | 2018-02-09 | 2019-08-14 | Bayerische Motoren Werke Aktiengesellschaft | Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera |
CN110555407A (en) * | 2019-09-02 | 2019-12-10 | 东风汽车有限公司 | pavement vehicle space identification method and electronic equipment |
US20200026953A1 (en) * | 2018-07-23 | 2020-01-23 | Wuhan University | Method and system of extraction of impervious surface of remote sensing image |
CN110852314A (en) * | 2020-01-16 | 2020-02-28 | 江西高创保安服务技术有限公司 | Article detection network method based on camera projection model |
CN110942000A (en) * | 2019-11-13 | 2020-03-31 | 南京理工大学 | Unmanned vehicle target detection method based on deep learning |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||