CN111695403A - 2D and 3D image synchronous detection method based on depth perception convolutional neural network - Google Patents

2D and 3D image synchronous detection method based on depth perception convolutional neural network

Info

Publication number
CN111695403A
Authority
CN
China
Prior art keywords
frame
neural network
convolutional neural
anchor
anchor point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010308948.9A
Other languages
Chinese (zh)
Other versions
CN111695403B (en)
Inventor
吴明瞭
付智俊
郭启翔
尹思维
谢斌
何薇
焦红波
王晨阳
白世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongfeng Automobile Co Ltd
Original Assignee
Dongfeng Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongfeng Automobile Co Ltd filed Critical Dongfeng Automobile Co Ltd
Priority to CN202010308948.9A priority Critical patent/CN111695403B/en
Publication of CN111695403A publication Critical patent/CN111695403A/en
Application granted granted Critical
Publication of CN111695403B publication Critical patent/CN111695403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/08 Projecting images onto non-planar surfaces, e.g. geodetic screens
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G06T7/90 Determination of colour characteristics
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining a target anchor point formula, introducing a preset depth information parameter, and specifying a shared center pixel position; step 2, generating preset anchor frames according to the anchor point template of the target object, the visual anchor point generation formula and the 3D prior anchor points; step 3, checking the intersection-over-union of the anchor frames with the ground truth; step 4, analyzing the network loss function of the target object; step 5, establishing a depth perception convolution region proposal network: introducing a DenseNet convolutional neural network to obtain a feature map, feeding the feature map into global feature extraction and local feature extraction branches, and finally combining the two according to learned weights; step 6, forward optimization: introducing a parameter step size σ, setting a loop termination parameter β, and optimizing the parameters; and step 7, outputting the 3D parameters. The invention achieves higher safety for automatic driving and can be widely applied in the field of computer vision.

Description

2D and 3D image synchronous detection method based on depth perception convolutional neural network
Technical Field
The invention relates to methods for detecting targets of interest in computer-vision applications such as unmanned driving and assisted driving, and in particular to a method for synchronously detecting 2D and 3D images based on a depth perception convolutional neural network.
Background
Object detection refers to detecting and identifying the category and position of targets of interest (such as vehicles, pedestrians and obstacles) in an image or video by computer technology, and is one of the important research directions in computer vision. With the continuous improvement and development of deep learning technology, object detection based on deep learning has wide application in many practical fields, for example unmanned driving, assisted driving, face recognition, unmanned security, human-machine interaction and behavior recognition.
As one of the important research directions in deep learning, the deep convolutional neural network has achieved significant results on object detection tasks and can detect and identify targets of interest in 2D image data in real time. In unmanned-driving research, however, the system must obtain the position of the target of interest in 3D space in order to realize the corresponding functions and to improve system stability and safety.
The hardware currently used for 3D image recognition relies on cameras, which can be divided into monocular cameras and multi-view cameras according to their function. A monocular camera has a fixed focal length and is mostly used for road-condition judgment in automatic driving, but it suffers from an irreconcilable trade-off between field of view and ranging distance: the wider the camera's viewing angle, the shorter the distance that can be measured accurately, while the narrower the viewing angle, the longer the measurable distance. This is similar to how human eyes see the world: the farther one looks, the narrower the covered range, and the closer one looks, the wider the range. A binocular or multi-view camera combines cameras with different focal lengths, and the focal length is related to imaging sharpness, but a conventional vehicle-mounted camera cannot easily zoom frequently, the cost of a multi-view camera is high, and its algorithmic complexity is greater than that of a monocular camera, so it is not yet suitable for unmanned-driving systems.
To improve the accuracy of 3D image detection, existing 3D detection methods also rely on expensive lidar sensors, which provide sparse depth data as input. However, when such a lidar-based method is combined with a monocular camera, the sparse depth data still lacks dense depth information, which makes 3D image detection difficult to realize.
Taking an automatic driving system as an example, the conventional 2D target detection approach for this scenario is to acquire the real-time road scene during driving with a vehicle-mounted camera, feed it into an existing algorithm, detect the targets of interest in the image with a trained detection model, output their position and category information to the decision layer of the control end, and plan how the vehicle should drive. The problem is that the 3D spatial position information of the detected target acquired by a monocular camera is unstable, and many influencing factors reduce the accuracy of this method.
Disclosure of Invention
The invention aims to overcome the defects of the background art and to provide a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, so that the camera retains more detailed semantic information while keeping depth information as accurate as that of a laser scanner, enabling higher driving performance and safety during automatic driving.
The invention provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining an anchor point template of the target object: respectively defining the specific formulas of the 2D target anchor point and the 3D target anchor point, introducing a preset depth information parameter, and specifying a shared central pixel position; step 2, generating the anchor frames of the model prediction feature map: generating preset anchor frames according to the anchor point template of the target object, the visual anchor point generation formula and the pre-calculated 3D prior anchor points; step 3, checking the intersection-over-union (IoU) with the ground truth (GT) of the anchor frame: checking, for each generated anchor frame, whether its IoU with the GT is greater than or equal to 0.5; step 4, analyzing the network loss function of the target object: comprising analysis of the classification loss function LC, the 2D frame regression loss function and the 3D frame regression loss function; step 5, establishing a depth perception convolution region proposal network: introducing a DenseNet convolutional neural network to obtain h × w feature maps, feeding the feature maps into two branches, one for global feature extraction and the other for local feature extraction, and finally combining the features of the two branches with a certain weight; step 6, forward optimization: projecting the 3D information onto the 2D information and performing forward optimization, introducing a parameter step size σ for updating θ, setting a loop termination parameter β, and optimizing the parameters while σ is larger than β; and step 7, performing 3D target detection according to the 3D output parameters.
In the above technical solution, in step 1, the specific formula of the 2D target anchor point is [w, h]2D and the specific formula of the 3D target anchor point is [w, h, l, θ]3D, where w, h and l respectively represent given values of the width, height and length of the target detection object, and θ represents the observation angle of the camera towards the target detection object; the introduced preset depth information parameter is zP, and the shared central pixel position is designated [x, y]P, where the 2D parameters are expressed in pixel coordinates as [x, y]2D = P · [w, h]2D, and P represents the known projection matrix used to project the target object; the 3D central position [x, y, z]3D in the camera coordinate system is projected into the image with the given known projection matrix P, and the depth information parameter zP is encoded, according to the following formula:
zP · [x, y, 1]P = P · [x, y, z, 1]3D    (1)
in the above technical solution, in step 2, each anchor point in the model prediction output feature map is defined as C, and each anchor point corresponds to [tx, ty, tw, th]2D, [tx, ty, tz]P and [tw, th, tl, tθ]3D; the total number of anchor points for a single pixel on the feature map of each target detection object is na, the preset number of training model classes is nc, h × w is the resolution of the feature map, and the total number of output frames is nb = w × h × na; each anchor point is distributed at each pixel position [x, y]P ∈ R^(w×h). The first output anchor C represents a shared class prediction with dimension na × nc × h × w, where the output dimension of each class is na × h × w.
In the above technical solution, in step 2, the [ tx, ty, tw, th ]2D representing the 2D bounding box transformation is collectively referred to as b2D, where the bounding box transformation formula is as follows:
x'2D = xP + tx2D · w2D ,    y'2D = yP + ty2D · h2D
w'2D = w2D · exp(tw2D) ,    h'2D = h2D · exp(th2D)
where xP and yP denote the spatial center position of each frame, and the transformed frame b'2D is defined as [x, y, w, h]'2D. The remaining 7 output variables, namely the projection center transformation [tx, ty, tz]P, the scale transformation [tw, th, tl]3D and the direction change tθ3D, are collectively referred to as b3D. The transformation b3D is applied to an anchor with parameters [w, h]2D, zP, [w, h, l, θ]3D:
x'P = xP + txP · w2D ,    y'P = yP + tyP · h2D
z'P = zP + tzP
w'3D = w3D · exp(tw3D) ,    h'3D = h3D · exp(th3D) ,    l'3D = l3D · exp(tl3D)
θ'3D = θ3D + tθ3D
similarly, the inverse of equation (1) is used to transform the projected 3D center position [x, y, z]'P in image space back to its camera coordinates [x, y, z]'3D; b'3D denotes [x, y, z]'P together with [w, h, l, θ]'3D.
In the above technical solution, in the step 3, if the intersection ratio of GT of the anchor frame is less than 0.5, the category of the target object is set as the background category, and the boundary anchor frame is ignored or deleted; if the intersection ratio IOU of the GT of the anchor frame is more than or equal to 0.5, generating the category index tau of the target object and the 2D frame according to the generated GT of the anchor frame
b̂2D and the 3D frame b̂3D.
In the above technical solution, in the step 4, the classification loss function LC adopts a softmax-based polynomial logistic loss function, and its formula is:
LC = -log( exp(Cτ) / Σi exp(Ci) )
The 2D frame regression loss function Lb2D is analyzed over the intersection-over-union (IoU) between the matched GT frame b̂2D before transformation and the transformed frame b'2D:
Lb2D = -log( IoU(b'2D, b̂2D) )
The 3D frame regression loss function Lb3D is analyzed for optimizing each of the remaining 3D bounding box parameters with a smooth L1 regression loss function, whose formula is:
Lb3D = Σ SmoothL1( b'3D , b̂3D ), where the sum runs over the 3D bounding box parameters.
In the above technical solution, in step 4, an overall multitask network loss function L is further introduced, which includes regularization weights λ1 and λ2 and is defined as follows:
L = LC + λ1 · Lb2D + λ2 · Lb3D
in the above technical solution, in the step 5, the specific process is as follows:
step 5-1, obtain h × w feature maps with the convolutional neural network DenseNet: introduce a hyperparameter b, where b represents the number of row-level bins, indicating that the feature map is divided into b bins along the transverse direction, each bin corresponding to a specific convolution kernel k; step 5-2, perform global/local feature extraction, where step 5-2 is divided into two branches as follows: step 5-2-1, global feature extraction: the global feature extraction adopts conventional convolution, which introduces the global feature Fglobal in the convolution process; for Fglobal, 3x3 convolution kernels with padding of 1 are introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, and conventional 3x3 and 1x1 convolutions act on the whole feature map; then C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1x1 convolution kernel Oglobal; step 5-2-2, local feature extraction: local feature extraction adopts depth perception convolution, which introduces the local feature Flocal in the convolution process; for Flocal, 3x3 convolution kernels with padding of 1 are introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, with different 3x3 kernels acting on different bins, the feature map being divided into b bins along the longitudinal direction; then C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1x1 convolution kernel Olocal; step 5-3, weight the outputs of the global feature extraction and the local feature extraction, that is, introduce a learned weight α, which uses the spatial invariance of the convolutional neural network as an index over the 1st to 13th outputs; the specific output function is as follows:
Oi=Oglobal i·αi+Olocal i·(1-αi) (8).
in the above technical solution, step 5 further includes step 5-4: the backbone network of this 3D target detection method is built on DenseNet-121, which proposes a dense connection mechanism connecting all layers: each layer accepts all preceding layers as its additional input. ResNet connects each layer with the 2-3 layers before it by element-level addition, whereas in DenseNet each layer is concatenated (concat) with all preceding layers in the channel dimension and serves as the input of the next layer; for a network of L layers, DenseNet contains L(L+1)/2 connections in total, and DenseNet directly concatenates the feature maps from different layers.
In the above technical solution, in step 6, the iteration step of the algorithm is as follows: the L1 loss between the projection of the 3D frame and the estimated 2D frame b'2D is used while θ is continuously adjusted, and the formula of the step of projecting the 3D frame onto the 2D frame is as follows:
γ3D denotes the corner coordinates of the 3D frame constructed from [x, y, z]'3D and [w, h, l, θ]'3D in camera coordinates, and
γP = P · γ3D ,    γ2D = γP / γP[z],
xmin = min(γ2D[x]),    ymin = min(γ2D[y]),
xmax = max(γ2D[x]),    ymax = max(γ2D[y])
(9),
where φ denotes the index of the axes [x, y, z]. The L1 loss is calculated between the 2D frame parameters [xmin, ymin, xmax, ymax] obtained by projecting the 3D frame and the originally estimated 2D frame b'2D; when the loss is not improved within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the operation is repeated while σ > β. In step 7, 13 parameters in total are output for the 3D result, namely C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D.
The 2D and 3D image synchronous detection method based on the depth perception convolutional neural network has the following beneficial effects: the scheme of the invention provides an algorithm for fusing laser radar point clouds and RGB (red (R), green (G), blue (B) three-channel color) images. 3D visual analysis of targets plays an important role in the visual perception system of an autonomous driving automobile, and modern autonomous vehicles are often equipped with sensors such as lidar and cameras. With respect to the application characteristics of the two sensors, both the camera and the lidar can be used for target detection: the laser scanner has the advantage of accurate depth information, while the camera preserves more detailed semantic information, so that fusing the lidar point cloud with the RGB image can realize an autonomous driving automobile with higher performance and safety. Highly accurate localization and identification of objects in a road scene is achieved by detecting objects in three-dimensional space using lidar and image data.
Drawings
FIG. 1 is a flow chart of a basic idea of a method for synchronously detecting 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 2 is a detailed flowchart of the method for synchronously detecting 2D and 3D images based on the depth-sensing convolutional neural network according to the present invention;
FIG. 3 is a schematic diagram illustrating the parameter definition of an anchor point template in the method for synchronously detecting 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 4 is a block diagram of a three-dimensional anchor of a 3D target object in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 5 is a bird's eye view of a three-dimensional anchor frame of a 3D target object in the depth-aware convolutional neural network-based 2D and 3D image synchronous detection method of the present invention;
FIG. 6 is a diagram of RPN network architecture in the method for synchronous detection of 2D and 3D images based on a depth-aware convolutional neural network according to the present invention;
FIG. 7 is a schematic diagram of extracting local features of horizontal segmentation in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 8 is a schematic diagram of extracting longitudinal segmentation local features in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention;
FIG. 9 is a network architecture diagram of DenseNet in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following figures and examples, which should not be construed as limiting the invention.
Referring to fig. 1, the basic idea of the method for synchronously detecting 2D and 3D images based on the depth-aware convolutional neural network of the present invention is as follows: input image → simultaneous detection processing of 2D and 3D images → projection of 3D information to 2D information and forward optimization processing → detection of 3D objects according to 3D output parameters.
Referring to fig. 2, the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network of the present invention specifically comprises the following steps:
step 1: an anchor template for the target object is defined. In order to predict the 2D frame and the 3D frame simultaneously, anchor templates need to be defined in respective dimensional spaces, and it should be noted that the 2D frame herein is the maximum length and width observed by the 3D target object. Specifically, taking an automobile as an example, referring to fig. 3, specific formulas of a 2D target anchor point and an anchor point template of a 3D target are [ w, h ]2D and [ w, h, l, θ ]3D, respectively, where w, h, and l represent the width, height, and length of a target detection object, respectively, and w, h, and l are given values in a detection camera coordinate system; in addition, since the 3D object is different from the 2D object and has rotation, its θ represents the viewing angle of the camera to the object to be detected, which is equivalent to the camera rotating around the Y axis of its camera coordinate system, and the viewing angle takes into account the relative orientation of the object with respect to the viewing angle of the camera, rather than the ground's Bird's Eye View (BEV), where introducing θ makes it more meaningful to intuitively estimate the viewing angle when processing 3D image features.
In order to define the position of the complete 2D/3D frame of the target object, a preset depth information parameter zP is introduced and a shared central pixel position [x, y]P is specified, where the 2D parameters are expressed in pixel coordinates, namely [x, y]2D = P · [w, h]2D, and P represents the known projection matrix used to project the target object; in 3D object detection, the 3D center position [x, y, z]3D in the camera coordinate system is projected into the image with the given known projection matrix P, and the depth information parameter zP is encoded according to the following formula:
zP · [x, y, 1]P = P · [x, y, z, 1]3D    (1)
Here, the mean statistics of the preset depth information parameter zP and of [w, h, l, θ]3D of the 3D target object are calculated separately for each anchor point in advance, and parameters of this kind serve as strong prior information to ease the difficulty of 3D parameter estimation. Specifically, for each anchor point, the statistics of zP and [w, h, l, θ]3D are computed over the samples whose IoU (intersection-over-union) with the anchor exceeds 0.5; the anchor thus represents a discrete template in which the 3D prior can be used as a strong initial guess under the assumption of reasonably consistent scene geometry.
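By way of illustration only, the following Python (NumPy) sketch shows how an anchor template of step 1 and the depth encoding of equation (1) could be represented; the projection matrix values, the anchor sizes and the helper names (project_center and the anchor dictionary keys) are assumptions made for this example and do not reproduce the claimed implementation.

    import numpy as np

    # Illustrative 3x4 camera projection matrix P (KITTI-like values, assumed for the example).
    P = np.array([[721.5,   0.0, 609.6,  44.9],
                  [  0.0, 721.5, 172.9,   0.2],
                  [  0.0,   0.0,   1.0,   0.003]])

    def project_center(center_3d, P):
        """Equation (1): zP * [x, y, 1]P = P * [x, y, z, 1]3D.
        Returns the shared pixel-space center [x, y]P and the encoded depth zP."""
        xyz1 = np.append(center_3d, 1.0)   # homogeneous 3D center [x, y, z, 1]
        proj = P @ xyz1                    # [x * zP, y * zP, zP]
        z_p = proj[2]
        return proj[:2] / z_p, z_p

    # One anchor template: 2D size [w, h]2D plus the 3D prior [w, h, l, theta]3D and depth prior zP.
    anchor = {
        "wh_2d":    np.array([80.0, 60.0]),     # assumed pixel width/height
        "whl_3d":   np.array([1.6, 1.5, 3.9]),  # assumed car width/height/length in metres
        "theta_3d": 0.0,                        # viewing-angle prior
        "z_p":      35.0,                       # assumed mean depth statistic for this anchor
    }

    center_2d, z_p = project_center(np.array([2.0, 1.5, 35.0]), P)
    print("shared center [x, y]P:", center_2d, "encoded depth zP:", z_p)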
Step 2: generate the anchor frames of the model prediction feature map according to the anchor point template defining the target object. Specifically, the preset anchor frames are generated from the anchor point template of the target object, expressed by the visual anchor point generation formula, and from the pre-calculated 3D prior anchor points; the generated three-dimensional anchor frames are shown in fig. 4 and the bird's-eye view in fig. 5.
Furthermore, each anchor point in the model prediction output feature map is defined as C, each anchor point corresponding to [tx, ty, tw, th]2D, [tx, ty, tz]P and [tw, th, tl, tθ]3D; the total number of anchor points is na (the number of anchor points of a single pixel on the feature map of each target detection object), the number of (preset training model) categories is nc, and h × w is the resolution of the feature map.
Therefore, the total number of output frames is nb = w × h × na;
each anchor point is distributed at each pixel position [x, y]P ∈ R^(w×h);
the first output anchor point C represents a shared class prediction with dimension na × nc × h × w, where the output dimension of each class is na × h × w.
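For concreteness, a small Python sketch of the output layout implied above (na anchors per pixel, nc classes, an h × w feature map, nb = w × h × na output frames); the numeric values of na, nc, h and w are purely illustrative assumptions.

    import numpy as np

    na, nc, h, w = 36, 4, 32, 106            # anchors per pixel, classes, feature-map resolution (assumed)

    cls_output = np.zeros((na * nc, h, w))   # shared class prediction C: na x nc x h x w
    t2d_output = np.zeros((na * 4, h, w))    # [tx, ty, tw, th]2D per anchor
    tP_output  = np.zeros((na * 3, h, w))    # [tx, ty, tz]P per anchor
    t3d_output = np.zeros((na * 4, h, w))    # [tw, th, tl, ttheta]3D per anchor

    nb = w * h * na                          # total number of output frames
    print("total output frames nb =", nb)    # every pixel position [x, y]P carries na anchors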
Further, [tx, ty, tw, th]2D represents the 2D bounding box transform, which we collectively refer to as b2D; the bounding box transform formula is as follows:
x'2D = xP + tx2D · w2D ,    y'2D = yP + ty2D · h2D
w'2D = w2D · exp(tw2D) ,    h'2D = h2D · exp(th2D)
where xP and yP represent the spatial center position of each box. The transformed frame b'2D is defined as [x, y, w, h]'2D. The following 7 outputs, namely the projection center transformation [tx, ty, tz]P, the scale transformation [tw, th, tl]3D and the direction change tθ3D, are collectively referred to as b3D. Similar to 2D, the transformation is applied to an anchor with parameters [w, h]2D, zP, [w, h, l, θ]3D:
x'P = xP + txP · w2D ,    y'P = yP + tyP · h2D
z'P = zP + tzP
w'3D = w3D · exp(tw3D) ,    h'3D = h3D · exp(th3D) ,    l'3D = l3D · exp(tl3D)
θ'3D = θ3D + tθ3D
similarly, b'3DRepresents [ x, y, z ]]′PAnd [ w, h, l, θ ]]′3D. As previously described, the authors estimate the 3D center of the projection rather than the camera coordinates to better handle the image space based convolution features. In the inference process, the 3D center position [ x, y, z ] obtained after projection in image space is used by the inverse transform of equation (1)]′PTo calculate its camera coordinates x, y, z]′3D
Step 3: according to the generated anchor frames, check whether the intersection-over-union (IoU) with the ground truth (GT) of each anchor frame is greater than or equal to 0.5.
If the intersection ratio of GT of the anchor frame is less than 0.5, setting the type of the target object as a background type, and neglecting or deleting the boundary anchor frame;
if the intersection ratio IOU of the GT of the anchor frame is more than or equal to 0.5, generating a category index tau of the target object and a 2D frame according to the generated GT (ground truth) of the anchor frame
Figure BDA0002456897710000115
And 3D frame
Figure BDA0002456897710000116
And the following step 4 is performed.
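The IoU check of step 3 can be sketched as follows in Python, assuming axis-aligned boxes in [x1, y1, x2, y2] form; the 0.5 threshold follows the text, while the helper names iou and assign_anchor are assumptions for the example.

    def iou(box_a, box_b):
        """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter)

    def assign_anchor(anchor_box, gt_box, gt_class):
        """Step 3: an anchor with IoU >= 0.5 keeps the GT class index tau and the GT 2D/3D targets;
        otherwise it is treated as background and ignored for box regression."""
        if iou(anchor_box, gt_box) >= 0.5:
            return {"class": gt_class, "regress": True}
        return {"class": "background", "regress": False}

    print(assign_anchor([100, 100, 200, 180], [110, 105, 210, 190], gt_class="car"))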
Step 4: analyze the network loss function of the target object. This step includes classification loss function LC analysis, 2D frame regression loss function analysis and 3D frame regression loss function analysis.
The classification loss function LC adopts a softmax-based multinomial logistic loss function with the formula:
LC = -log( exp(Cτ) / Σi exp(Ci) )
A 2D frame regression loss Lb2D is also introduced, which matches the GT frame b̂2D before transformation against the transformed frame b'2D through their intersection-over-union (IoU):
Lb2D = -log( IoU(b'2D, b̂2D) )
The 3D frame regression loss function Lb3D is analyzed for optimizing each of the remaining 3D bounding box parameters with a smooth L1 regression loss function, whose formula is:
Lb3D = Σ SmoothL1( b'3D , b̂3D )
Further, for the whole network framework, an overall multitask network loss function L is also introduced, which includes regularization weights λ1 and λ2 and is defined as follows:
L = LC + λ1 · Lb2D + λ2 · Lb3D
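A compact Python sketch of how the multi-task loss of step 4 could be assembled from the three terms above (softmax classification, negative-log-IoU for the 2D frame, smooth L1 for the 3D parameters, combined with the regularization weights λ1 and λ2); the helper names and the example values are assumptions for illustration only.

    import numpy as np

    def softmax_loss(logits, tau):
        """Classification loss LC: multinomial logistic loss over the class scores C."""
        e = np.exp(logits - logits.max())
        return -np.log(e[tau] / e.sum())

    def smooth_l1(pred, target, beta=1.0):
        d = np.abs(pred - target)
        return np.where(d < beta, 0.5 * d * d / beta, d - 0.5 * beta).sum()

    def multitask_loss(cls_logits, tau, iou_2d, b3d_pred, b3d_gt, lam1=1.0, lam2=1.0):
        """L = LC + lam1 * Lb2D + lam2 * Lb3D, with Lb2D = -log(IoU(b'2D, GT))."""
        l_c   = softmax_loss(cls_logits, tau)
        l_b2d = -np.log(np.clip(iou_2d, 1e-6, 1.0))
        l_b3d = smooth_l1(b3d_pred, b3d_gt)
        return l_c + lam1 * l_b2d + lam2 * l_b3d

    loss = multitask_loss(np.array([1.2, 0.3, -0.5]), tau=0, iou_2d=0.72,
                          b3d_pred=np.array([0.1, -0.2, 0.05, 0.3]),
                          b3d_gt=np.array([0.0, -0.1, 0.0, 0.25]))
    print("multitask loss:", loss)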
and 5: and establishing a deep perception convolution area proposal network to improve the ability of high-order feature space perception in the area proposal network.
A hyperparameter b is introduced, where b represents the number of row-level bins, i.e. the feature map is divided laterally into b bins, each bin representing a particular convolution kernel k.
Step 5-1: introduce the DenseNet convolutional neural network. DenseNet (a deeper convolutional neural network) is used as the basic feature extractor to obtain h × w feature maps, which are then sent to two branches, one for global feature extraction and the other for local feature extraction; finally the features of the two branches are combined with a certain weight. The global block uses conventional 3x3 and 1x1 convolutions acting on the whole feature map, while the local block uses different 3x3 kernels acting on different bins, the feature map being divided into b bins along the longitudinal direction, as shown in fig. 6.
It should be noted that, for the local feature extraction, the present technology also adopts two feature extraction methods, as shown in fig. 7.
The b horizontal bars generated by dividing the feature map into b bins along the longitudinal direction act as a random function when local feature 1 is extracted, which increases the randomness of image extraction during convolution and therefore improves the recognition rate.
Further, in order to more accurately identify the 3D target image, the present technology further provides a longitudinal segmentation method, and a specific segmentation method thereof is shown in fig. 8.
Because the longitudinal segmentation method yields more local features during feature extraction, the recognition rate is further improved.
In addition, the backbone network of this 3D target detection method is built on DenseNet-121; the network architecture of DenseNet is shown in fig. 9. DenseNet proposes a more aggressive dense connection mechanism: all layers are interconnected, and each layer accepts all preceding layers as its additional input. By comparison, ResNet short-circuits each layer with some previous layer (typically 2-3 layers back) and the connection is made by element-level addition, whereas in DenseNet each layer is concatenated (concat) with all previous layers in the channel dimension (the feature maps of the individual layers having the same size) and used as the input of the next layer. For an L-layer network, DenseNet contains L(L+1)/2 connections in total, which is dense compared with ResNet. Moreover, DenseNet directly concatenates feature maps from different layers, which enables feature reuse and improves efficiency.
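The dense connection mechanism described above can be sketched with the following simplified PyTorch module, in which each layer concatenates all preceding feature maps along the channel dimension; the layer widths and growth rate are arbitrary assumptions, and this is not the full DenseNet-121 backbone.

    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        """Each layer receives the channel-wise concatenation of all previous feature maps."""
        def __init__(self, in_channels, growth_rate, num_layers):
            super().__init__()
            self.layers = nn.ModuleList()
            for i in range(num_layers):
                self.layers.append(nn.Sequential(
                    nn.BatchNorm2d(in_channels + i * growth_rate),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(in_channels + i * growth_rate, growth_rate, kernel_size=3, padding=1)))

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                out = layer(torch.cat(features, dim=1))   # concat with all preceding layers
                features.append(out)
            return torch.cat(features, dim=1)

    block = DenseBlock(in_channels=64, growth_rate=32, num_layers=4)
    y = block(torch.randn(1, 64, 32, 106))   # for L = 4 layers there are L(L+1)/2 = 10 connections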
Step 5-2: carry out global/local feature extraction. Step 5-2 is divided into two branches, step 5-2-1 and step 5-2-2.
Step 5-2-1: extract global features. Global feature extraction adopts conventional convolution, whose kernels act as a global convolution over the whole space; it introduces the global feature Fglobal in the convolution process. For Fglobal, 3x3 convolution kernels with padding (gap filling) of 1 are introduced, followed by nonlinear activation with the ReLU function (Rectified Linear Unit) to generate 512 feature maps.
Then 13 outputs are produced on each feature map F (as stated above, the 13 outputs are C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D), and each output is connected to a 1x1 convolution kernel Oglobal.
Step 5-2-2: extract local features. Local feature extraction adopts depth-aware convolution, i.e. local convolution, which introduces the local feature Flocal in the convolution process. For Flocal, 3x3 convolution kernels with padding (gap filling) of 1 are introduced, followed by nonlinear activation with the ReLU function to generate 512 feature maps.
Then 13 outputs are produced on each feature map F (as stated above, the 13 outputs are C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D), and each output is connected to a 1x1 convolution kernel Olocal.
Step 5-3: weight the outputs of the global and local feature extraction. A learned weight α is introduced, which uses the spatial invariance of the convolutional neural network as an index over the 1st to 13th outputs; the specific output function is as follows:
Oi=Oglobal i·αi+Olocal i·(1-αi) (8)
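A simplified PyTorch sketch of the global/local combination of equation (8): a regular 1x1 convolution produces Oglobal, b bin-specific 3x3 convolutions produce Olocal, and a learned weight α mixes the two; the channel count, the number of bins and the reduction of the 13 outputs to a single tensor are simplifying assumptions made for this example.

    import torch
    import torch.nn as nn

    class DepthAwareHead(nn.Module):
        """O_i = Oglobal_i * alpha_i + Olocal_i * (1 - alpha_i), equation (8)."""
        def __init__(self, channels=512, outputs=13, bins=4):
            super().__init__()
            self.bins = bins
            self.global_conv = nn.Conv2d(channels, outputs, kernel_size=1)
            # One 3x3 kernel per bin for the depth-aware (local) branch.
            self.local_convs = nn.ModuleList(
                [nn.Conv2d(channels, outputs, kernel_size=3, padding=1) for _ in range(bins)])
            self.alpha = nn.Parameter(torch.full((outputs,), 0.5))   # learned mixing weight per output

        def forward(self, feat):
            o_global = self.global_conv(feat)
            # Split the feature map into b row-level bins and apply a bin-specific convolution to each.
            chunks = torch.chunk(feat, self.bins, dim=2)
            o_local = torch.cat([conv(c) for conv, c in zip(self.local_convs, chunks)], dim=2)
            a = self.alpha.view(1, -1, 1, 1)
            return o_global * a + o_local * (1.0 - a)

    head = DepthAwareHead()
    out = head(torch.randn(1, 512, 32, 106))   # 13 output maps: C, theta and the box transforms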
and 6, projecting the 3D information to the 2D information and carrying out forward optimization processing. Here, a parameter step size σ is introduced (for updating θ), and a loop termination parameter β is set, and when α is larger than β, the input of the optimization parameter is performed.
In each iteration of the algorithm, the L1 loss between the projection of the 3D frame and the estimated 2D frame b'2D is computed while θ is continuously adjusted. The formula of the step of projecting the 3D frame onto the 2D frame is as follows:
γ3D denotes the corner coordinates of the 3D frame constructed from [x, y, z]'3D and [w, h, l, θ]'3D in camera coordinates, and
γP = P · γ3D ,    γ2D = γP / γP[z],
xmin = min(γ2D[x]),    ymin = min(γ2D[y]),
xmax = max(γ2D[x]),    ymax = max(γ2D[y])
(9)
where φ denotes the index of the axes [x, y, z].
The L1 loss is calculated between the 2D frame parameters [xmin, ymin, xmax, ymax] obtained by projecting the 3D frame and the originally estimated 2D frame b'2D; when the loss is not improved within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the above operation is repeated while σ > β.
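An illustrative Python sketch of the θ post-optimization loop of step 6: the corners of the 3D frame are projected with P, the resulting [xmin, ymin, xmax, ymax] is compared with the estimated 2D frame b'2D under an L1 loss, and the step size σ decays by the factor γ until it falls below β; the corner ordering, the numeric defaults and the max_steps guard are assumptions made for this example.

    import numpy as np

    def corners_3d(center, dims, theta):
        """Eight corners of a 3D frame with size (w, h, l) and yaw theta, in camera coordinates."""
        w, h, l = dims
        x = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
        y = np.array([ h,  h,  h,  h, -h, -h, -h, -h]) / 2.0
        z = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
        rot = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
                        [ 0.0,           1.0, 0.0          ],
                        [-np.sin(theta), 0.0, np.cos(theta)]])
        return rot @ np.vstack([x, y, z]) + np.asarray(center).reshape(3, 1)

    def projected_2d_box(center, dims, theta, P):
        """gammaP = P * gamma3D, gamma2D = gammaP / gammaP[z]; box = [xmin, ymin, xmax, ymax]."""
        g3d = np.vstack([corners_3d(center, dims, theta), np.ones(8)])
        gP = P @ g3d
        g2d = gP[:2] / gP[2]
        return np.array([g2d[0].min(), g2d[1].min(), g2d[0].max(), g2d[1].max()])

    def optimize_theta(center, dims, theta, b2d_est, P,
                       sigma=0.3, gamma=0.5, beta=1e-3, max_steps=1000):
        """Adjust theta while sigma > beta; decay sigma by gamma when theta +/- sigma brings no improvement."""
        for _ in range(max_steps):
            if sigma <= beta:
                break
            candidates = [theta - sigma, theta, theta + sigma]
            losses = [np.abs(projected_2d_box(center, dims, t, P) - b2d_est).sum() for t in candidates]
            best = int(np.argmin(losses))
            if losses[best] < losses[1]:   # strict improvement within theta +/- sigma
                theta = candidates[best]
            else:                          # no improvement: decay the step size
                sigma *= gamma
        return theta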
Step 7: output the 13 parameters, namely C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, and finally perform 3D target detection.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.
Those not described in detail in this specification are within the skill of the art.

Claims (10)

1. A 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, characterized by comprising the following steps:
step 1, defining an anchor point template of a target object: respectively defining specific formulas of a 2D target anchor point and a 3D target anchor point, introducing a preset depth information parameter, and specifying a shared central pixel position;
step 2, generating an anchor frame of the model prediction characteristic diagram: generating a preset anchor frame according to an anchor point template defining a target object and a visual anchor point generation formula and a pre-calculated 3D prior anchor point;
step 3, checking the intersection ratio of GT of the anchor frame: checking whether the intersection ratio of GT of the anchor frame is more than or equal to 0.5 or not according to the generated anchor frame;
step 4, analyzing a network loss function of the target object: the method comprises the steps of classification loss function LC analysis, 2D frame regression loss function analysis and 3D frame regression loss function analysis;
step 5, establishing a depth perception convolution area suggestion network: introducing a Densenet convolutional neural network to obtain h x w dimension feature maps, then respectively sending the feature maps into two branches, wherein one branch is global feature extraction, the other branch is local feature extraction, and finally combining the features of the two branches according to a certain weight;
step 6, forward optimization: projecting the 3D information onto the 2D information and performing forward optimization processing, introducing a parameter step size σ for updating θ, setting a loop termination parameter β, and optimizing the parameters while σ is larger than β;
and 7, carrying out 3D target detection according to the 3D output parameters.
2. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 1, wherein: in step 1, the specific formula of the 2D target anchor point is [w, h]2D and the specific formula of the 3D target anchor point is [w, h, l, θ]3D, where w, h and l respectively represent given values of the width, height and length of the target detection object, and θ represents the observation angle of the camera towards the target detection object; the introduced preset depth information parameter is zP and the shared central pixel position is designated [x, y]P, where the 2D parameters are expressed in pixel coordinates as [x, y]2D = P · [w, h]2D, and P represents the known projection matrix used to project the target object; the 3D central position [x, y, z]3D in the camera coordinate system is projected into the image with the given known projection matrix P, and the depth information parameter zP is encoded according to the following formula:
zP · [x, y, 1]P = P · [x, y, z, 1]3D    (1)
3. the method for synchronously detecting the 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 2, wherein: in the step 2, each anchor point in the model prediction output characteristic diagram is defined as C, and each anchor point corresponds to [ tx, ty, tw, th]2D、[tx,ty,tz]P、[tw,th,tl,tθ]And 3D, setting the total number of anchor points of a single pixel on the feature map of each target detection object as na, presetting the number of training model classes as nc, hxw as the resolution of the feature map, setting the total number of output frames as nb as w × h × na, and distributing each anchor point at each pixel position [ x, y]P∈Rw×hThe first output anchor C represents a shared class prediction with dimension na × nc × h × w, where the output dimension of each class is na × h × w.
4. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 3, wherein: in step 2, the [ tx, ty, tw, th ]2D representing the 2D bounding box transformation is collectively referred to as b2D, wherein the bounding box transformation formula is as follows:
x'2D = xP + tx2D · w2D ,    y'2D = yP + ty2D · h2D
w'2D = w2D · exp(tw2D) ,    h'2D = h2D · exp(th2D)
where xP and yP denote the spatial center position of each frame, and the transformed frame b'2D is defined as [x, y, w, h]'2D; the remaining 7 output variables, namely the projection center transformation [tx, ty, tz]P, the scale transformation [tw, th, tl]3D and the direction change tθ3D, are collectively referred to as b3D, and the transformation b3D is applied to an anchor with parameters [w, h]2D, zP, [w, h, l, θ]3D:
x'P = xP + txP · w2D ,    y'P = yP + tyP · h2D
z'P = zP + tzP
w'3D = w3D · exp(tw3D) ,    h'3D = h3D · exp(th3D) ,    l'3D = l3D · exp(tl3D)
θ'3D = θ3D + tθ3D
similarly, the inverse of equation (1) is used to transform the projected 3D center position [x, y, z]'P in image space back to its camera coordinates [x, y, z]'3D; b'3D denotes [x, y, z]'P together with [w, h, l, θ]'3D.
5. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 4, wherein: in step 3, if the intersection-over-union (IoU) of the anchor frame with its GT is less than 0.5, the category of the target object is set to the background category and the boundary anchor frame is ignored or deleted; if the IoU of the anchor frame with its GT is greater than or equal to 0.5, the category index τ of the target object, the GT 2D frame b̂2D and the GT 3D frame b̂3D are generated according to the generated GT of the anchor frame.
6. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 5, wherein: in step 4, the classification loss function LC adopts a softmax-based multinomial logistic loss function with the formula:
LC = -log( exp(Cτ) / Σi exp(Ci) )
the 2D frame regression loss function Lb2D is analyzed for matching the GT frame b̂2D before transformation against the transformed frame b'2D through their intersection-over-union (IoU):
Lb2D = -log( IoU(b'2D, b̂2D) )
and the 3D frame regression loss function Lb3D is analyzed for optimizing each of the remaining 3D bounding box parameters with a smooth L1 regression loss function, whose formula is:
Lb3D = Σ SmoothL1( b'3D , b̂3D )
7. the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 6, wherein: in the step 4, a whole multitask network loss function L is also introduced, wherein the whole multitask network loss function L also comprises a regularization weight lambda1And λ2It defines the formula as follows:
Figure FDA0002456897700000042
8. the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 7, wherein: in the step 5, the specific process is as follows:
step 5-1, obtaining h × w feature maps by using the convolutional neural network DenseNet: introducing a hyperparameter b, where b represents the number of row-level bins, indicating that the feature map is divided into b bins along the transverse direction, each bin representing a specific convolution kernel k;
step 5-2, global/local feature extraction is carried out, wherein the step 5-2 is divided into two branches, and the flow is as follows:
step 5-2-1, global feature extraction: the global feature extraction adopts conventional convolution, which introduces the global feature Fglobal in the convolution process; for Fglobal, 3x3 convolution kernels with padding of 1 are introduced and then nonlinearly activated by the ReLU function to generate 512 feature maps, and conventional 3x3 and 1x1 convolutions act on the whole feature map,
then C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D, a total of 13 outputs, are produced on each feature map F, each output being connected to a 1x1 convolution kernel Oglobal;
Step 5-2-2, local feature extraction: for local feature extraction, a depth perception convolution is adopted, and the depth perception convolution introduces a global feature F in the convolution processlocalThe global feature FlocalIntroducing convolution kernels with the number of padding being 1 and 3x3, then performing nonlinear activation by a Relu function to generate 512 feature maps, using different 3x3 kernels to act on different bins, and dividing the bins into b bins along the longitudinal direction,
then, C, theta, [ t ] is output on each feature map Fx,ty,tw,th]2D,[tx,ty,tz]P,[tw,th,tl,tθ]3DA total of 13 outputs, each of which is connected to a 1x1 convolution kernel Olocal
Step 5-3, the output of the global feature and the local feature extraction is weighted: introducing a weighting number alpha obtained by neural network learning, wherein the weighting number alpha utilizes the space invariance of the convolutional neural network as an index of the 1 st to 13 th outputs, and the specific output function is as follows:
Oi=Oglobal i·αi+Olocal i·(1-αi) (8).
9. the method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 8, wherein: in the step 5, the method further comprises the step 5-4: the backbone network of the 3D target detection method is established on the basis of DenseNet-121, and a dense connection mechanism for connecting all layers is proposed: that is, each layer will accept all layers before it as its extra input, ResNet will connect each layer with the 2-3 layers before by way of element-level addition, while in DenseNet, each layer will concat with all layers before in the channel dimension and serve as the input for the next layer, and for a network of L layers, DenseNet contains L (L +1)/2 connections in total, and DenseNet links the signatures from the various layers through concat connectors.
10. The method for synchronously detecting 2D and 3D images based on the depth perception convolutional neural network as claimed in claim 9, wherein: in step 6, the iteration steps of the algorithm are as follows:
the L1 loss between the projection of the 3D frame and the estimated 2D frame b'2D is used while θ is continuously adjusted, and the formula of the step of projecting the 3D frame onto the 2D frame is as follows:
γ3D denotes the corner coordinates of the 3D frame constructed from [x, y, z]'3D and [w, h, l, θ]'3D in camera coordinates, and
γP = P · γ3D ,    γ2D = γP / γP[z],
xmin = min(γ2D[x]),    ymin = min(γ2D[y]),
xmax = max(γ2D[x]),    ymax = max(γ2D[y])    (9),
where φ represents the index of the axis [ x, y, z ],
the L1 loss is calculated between the 2D frame parameters [xmin, ymin, xmax, ymax] obtained by projecting the 3D frame and the originally estimated 2D frame b'2D; when the loss is not improved within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and the above operation is repeated while σ is larger than β;
in step 7, 13 parameters are output in total for the 3D result, namely: C, θ, [tx, ty, tw, th]2D, [tx, ty, tz]P, [tw, th, tl, tθ]3D.
CN202010308948.9A 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method Active CN111695403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Publications (2)

Publication Number Publication Date
CN111695403A true CN111695403A (en) 2020-09-22
CN111695403B CN111695403B (en) 2024-03-22

Family

ID=72476391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010308948.9A Active CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Country Status (1)

Country Link
CN (1) CN111695403B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07220084A (en) * 1994-02-04 1995-08-18 Canon Inc Arithmetic system, semiconductor device, and image information processor
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106886755A (en) * 2017-01-19 2017-06-23 北京航空航天大学 A kind of intersection vehicles system for detecting regulation violation based on Traffic Sign Recognition
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
US20200026953A1 (en) * 2018-07-23 2020-01-23 Wuhan University Method and system of extraction of impervious surface of remote sensing image
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
CN110555407A (en) * 2019-09-02 2019-12-10 东风汽车有限公司 pavement vehicle space identification method and electronic equipment
CN110942000A (en) * 2019-11-13 2020-03-31 南京理工大学 Unmanned vehicle target detection method based on deep learning
CN110852314A (en) * 2020-01-16 2020-02-28 江西高创保安服务技术有限公司 Article detection network method based on camera projection model

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266900A (en) * 2021-12-20 2022-04-01 河南大学 Monocular 3D target detection method based on dynamic convolution

Also Published As

Publication number Publication date
CN111695403B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
CN110942449B (en) Vehicle detection method based on laser and vision fusion
CN113111974B (en) Vision-laser radar fusion method and system based on depth canonical correlation analysis
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN111428765B (en) Target detection method based on global convolution and local depth convolution fusion
Vaudrey et al. Differences between stereo and motion behaviour on synthetic and real-world stereo sequences
JP2022515895A (en) Object recognition method and equipment
JP6574611B2 (en) Sensor system for obtaining distance information based on stereoscopic images
CN110765922A (en) AGV is with two mesh vision object detection barrier systems
EP3992908A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
Lore et al. Generative adversarial networks for depth map estimation from RGB video
CN116258817B (en) Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
CN114648758A (en) Object detection method and device, computer readable storage medium and unmanned vehicle
CN116129233A (en) Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
CN115937819A (en) Three-dimensional target detection method and system based on multi-mode fusion
EP3992909A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
CN115115917A (en) 3D point cloud target detection method based on attention mechanism and image feature fusion
CN114155414A (en) Novel unmanned-driving-oriented feature layer data fusion method and system and target detection method
CN111695403B (en) Depth perception convolutional neural network-based 2D and 3D image synchronous detection method
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
CN112990049A (en) AEB emergency braking method and device for automatic driving of vehicle
Xiao et al. Research on uav multi-obstacle detection algorithm based on stereo vision
CN114648639B (en) Target vehicle detection method, system and device
CN116468950A (en) Three-dimensional target detection method for neighborhood search radius of class guide center point
Itu et al. MONet-Multiple Output Network for Driver Assistance Systems Based on a Monocular Camera
Berrio et al. Semantic sensor fusion: From camera to sparse LiDAR information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant