CN112990050A - Monocular 3D target detection method based on lightweight characteristic pyramid structure - Google Patents

Monocular 3D target detection method based on lightweight characteristic pyramid structure

Info

Publication number
CN112990050A
Authority
CN
China
Prior art keywords
sampling
feature map
candidate key
outputting
down-sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110326713.7A
Other languages
Chinese (zh)
Other versions
CN112990050B (en)
Inventor
李骏 (Li Jun)
张新钰 (Zhang Xinyu)
杨磊 (Yang Lei)
王力 (Wang Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110326713.7A priority Critical patent/CN112990050B/en
Publication of CN112990050A publication Critical patent/CN112990050A/en
Application granted granted Critical
Publication of CN112990050B publication Critical patent/CN112990050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular 3D target detection method based on a lightweight feature pyramid structure, comprising: collecting RGB images from a vehicle-mounted camera; and inputting the RGB images into a pre-established and trained monocular 3D target detection network, which outputs the target detection result. The monocular 3D target detection network comprises a feature extraction network, a detection head and a post-processing module. The feature extraction network down-samples the RGB image to extract high-level semantic features, generates 4-time, 8-time and 16-time down-sampling feature maps, and inputs them to the detection head. The detection head generates candidate keypoint category vectors and candidate keypoint pixel position index vectors based on the 4-time down-sampling feature map, generates candidate keypoint 3D regression frame coding vectors based on the 4-time, 8-time and 16-time down-sampling feature maps, and outputs the candidate keypoint category vectors and the 3D regression frame coding vectors to the post-processing module. The post-processing module decodes the 3D regression frame coding vectors and outputs the target detection result in combination with the candidate keypoint category vectors.

Description

Monocular 3D target detection method based on lightweight characteristic pyramid structure
Technical Field
The invention relates to the technical field of automatic driving, in particular to a monocular 3D target detection method based on a lightweight characteristic pyramid structure.
Background
In an automatic driving system, 3D target detection is a very important task of the perception module: the downstream prediction, planning and motion-control modules all depend on reliable detection of targets of specific categories around the ego vehicle. Thanks to the ability of high-line-count lidar to model the surrounding environment accurately at centimeter level, lidar-based 3D target detection algorithms have advanced greatly in recent years. However, the inherent drawbacks of lidar sensors, namely high cost and poor adaptability to severe weather such as rain, snow and fog, severely limit the large-scale deployment of lidar and related algorithms in the field of automatic driving. Compared with lidar, vision sensors are low in cost, adapt better to severe weather such as rain and snow, and more easily meet the requirements of marketization and large-scale mass production, so purely vision-based 3D target detection algorithms have gradually attracted attention from both academia and industry. On the one hand, purely vision-based 3D target detection avoids expensive lidar and enables a low-cost automatic driving solution; on the other hand, it can also be paired with a lidar-based 3D target detection module to provide module redundancy, avoiding the serious consequences of lidar failure and enabling a safer and more reliable automatic driving solution.
Due to the dual considerations of cost and power consumption, the computing power of the vehicle-side computing platform carried by an automatic driving vehicle is relatively limited and cannot support large, complex models. A purely visual 3D target detection algorithm for automatic driving applications must therefore balance the dual indexes of detection accuracy and efficiency: on the premise that the accuracy index meets the requirements of actual scenarios, a faster model inference speed is pursued to ensure a more timely response of the perception system and to provide earlier warning to the downstream prediction, planning and motion-planning modules, thereby realizing a safer and more reliable automatic driving system.
The anchor-free idea is a recent research hotspot in the field of target detection. Newly proposed keypoint-based purely visual 3D target detection algorithms (such as CenterNet, SMOKE and RTM3D) meet the real-time deployment and engineering requirements of automatic driving edge computing platforms with high algorithm efficiency (CenterNet: 30 ms, SMOKE: 30 ms, RTM3D: 50 ms), but their low accuracy indexes mean they cannot fully satisfy the perception requirements of automatic driving scenarios.
Subsequent improvement methods mainly include adding a traditional feature pyramid structure, a serial multi-stage cascade regression structure, or a structure introduced via deep reinforcement learning to optimize the detection frame. These improvement schemes effectively raise the accuracy index of the algorithm, but because they introduce additional structural branches into the model, they greatly increase model inference latency and hinder real-time deployment of the algorithm on automatic driving edge computing platforms. Therefore, a method that substantially improves the accuracy index while keeping the efficiency of existing methods undiminished has huge practical engineering value.
Disclosure of Invention
The invention aims to overcome the above technical defects by providing a lightweight feature pyramid structure and an attention loss function that can be applied to most existing keypoint-based monocular 3D target detection methods, improving both the detection accuracy and the efficiency of existing methods.
In order to achieve the above object, embodiment 1 of the present invention provides a monocular 3D object detection method based on a lightweight feature pyramid structure, where the method includes:
collecting RGB images of a vehicle-mounted camera;
inputting the RGB image into a pre-established and trained monocular 3D target detection network, and outputting a target detection result; the monocular 3D object detection network comprises: a feature extraction network, a detection head and a post-processing module;
the feature extraction network is used for performing down-sampling on the RGB image to extract high-level semantic features, generating 4-time, 8-time and 16-time down-sampling feature maps and inputting them to the detection head;
the detection head is used for generating candidate key point category vectors and candidate key point pixel position index vectors based on the 4-time down-sampling feature map, generating 3D regression frame coding vectors corresponding to the candidate key points based on the 4-time down-sampling feature map, the 8-time down-sampling feature map and the 16-time down-sampling feature map, and outputting the candidate key point category vectors and the 3D regression frame coding vectors to the post-processing module;
and the post-processing module is used for decoding the 3D regression frame coding vector and outputting a target detection result by combining the candidate key point category vector.
As an improvement of the above method, the feature extraction network comprises an encoder and a decoder;
the encoder is used for performing down-sampling on the input RGB image to extract high-level semantic features and outputting a 32-time down-sampling feature map;
the decoder is used for up-sampling the high-level semantic feature map output by the encoder to obtain the 4-time, 8-time and 16-time down-sampling feature maps required by the detection head; the decoder includes three deconvolution layers: a first deconvolution layer, a second deconvolution layer and a third deconvolution layer; the first deconvolution layer is used for processing the 32-time down-sampling feature map output by the encoder and outputting a 16-time down-sampling feature map; the second deconvolution layer is used for processing the 16-time down-sampling feature map and outputting an 8-time down-sampling feature map; and the third deconvolution layer is used for processing the 8-time down-sampling feature map and outputting the 4-time down-sampling feature map to the detection head.
As an improvement of the above method, the detection head comprises a thermodynamic diagram branch and a parameter regression branch;
the thermodynamic diagram branch is used for generating a thermodynamic diagram of target keypoints based on the 4-time down-sampling feature map, arranging all confidence values in descending order, screening the positions of the top K confidence values as candidate keypoints, and finally outputting the candidate keypoint category vector and the pixel position index vectors of the candidate keypoints on the 16-time, 8-time and 4-time down-sampling feature maps;
the parameter regression branch introduces a lightweight feature pyramid structure for respectively taking values from the corresponding feature maps according to the three position indexes, then merging the values and extracting a target 3D regression frame coding vector of shape K×R, which is output to the post-processing module, where K represents the number of detection targets and R represents the number of regression parameters.
As an improvement of the above method, the thermodynamic diagram branch comprises: a first convolution layer, a second convolution layer and a TopK operation unit;
the first convolution layer is used for further extracting the features of the 4-time down-sampling feature map and outputting the feature map to the second convolution layer;
the second convolution layer is used for performing convolution processing on the feature map and outputting a thermodynamic diagram, where any element $y_{ijc}$ of the thermodynamic diagram represents the probability that a target keypoint of category c exists at pixel position (i, j) of the thermodynamic diagram;
the TopK operation unit is used for arranging all probability values on the thermodynamic diagram in descending order, taking the K candidate points with the largest probability values as candidate keypoints, converting the pixel coordinates (i, j) into a position index Index, converting the channel index c representing the category into a class, splicing the K class values into the candidate keypoint category vector Classes, and outputting it to the post-processing module; the K position indexes Index are spliced to generate the candidate keypoint pixel position index vector Indexes, which gives the pixel position indexes of the candidate keypoints on the 4-time down-sampling feature map; the pixel position indexes of the candidate keypoints on the 8-time down-sampling feature map (1/2Index) and on the 16-time down-sampling feature map (1/4Index) can be further obtained by division; the three position indexes are one-dimensional vectors with the same number of elements, and the three values at the same position of the three one-dimensional vectors correspond respectively to the pixel position indexes of the same candidate keypoint on the 4-time, 8-time and 16-time down-sampling feature maps.
As an improvement of the above method, the parametric regression branch comprises: a third convolutional layer, three parallel sampling units, a splicing unit and a 1x1 convolutional layer; the three parallel sampling units comprise: the device comprises a first sampling unit, a second sampling unit and a third sampling unit;
the third convolution layer is used for further extracting the characteristics of the 4 times down-sampling characteristic diagram and outputting the characteristic diagram to the first sampling unit;
the first sampling unit is used for sampling values from the feature map output by the third convolution layer according to the pixel position indexes (Index) of the candidate keypoints on the 4-time down-sampling feature map;
the second sampling unit is used for sampling values from the 8-time down-sampling feature map output by the second deconvolution layer according to the pixel position indexes (1/2Index) of the candidate keypoints on the 8-time down-sampling feature map;
the third sampling unit is used for sampling values from the 16-time down-sampling feature map output by the first deconvolution layer according to the pixel position indexes (1/4Index) of the candidate keypoints on the 16-time down-sampling feature map;
the splicing unit is used for merging the values output by the three sampling units to realize the feature reading, alignment and fusion of the candidate target key points on the feature maps with different resolutions; outputting the fused features to a 1x1 convolutional layer;
the 1x1 convolution layer is used for obtaining a 3D regression frame coding vector of shape K×R from the fused features and outputting it to the post-processing module.
As an improvement of the above method, the method further comprises training the monocular 3D target detection network, which specifically comprises the following steps:

in the parameter regression branch, establishing an attention loss function $L_{reg}$:

$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N} w_i\, L_{reg}^{i}$$

where $L_{reg}^{i}$ is the regression loss of the ith target, $N$ is the number of targets in a training batch, and $w_i$ is the weighting coefficient of the ith target's regression loss in the total regression loss;

each target loss weight $w_i$ in the attention loss function is defined as a function of $P_i$ and $\mathrm{IoU}_{3D}^{i}$, where $P_i$ is the category confidence of the ith target, $\mathrm{IoU}_{3D}^{i}$ is the three-dimensional intersection-over-union between the ith target's predicted box and its ground-truth box, and $\beta$ is an equilibrium parameter balancing the two.
Embodiment 2 of the present invention provides a terminal device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above-mentioned method when executing the computer program.
Embodiment 3 of the present invention provides a storage medium storing a computer program which, when executed by a processor, implements the above-mentioned method.
The invention has the advantages that:
1. the method provided by the invention proposes a lightweight feature pyramid structure that is applicable to most keypoint-based monocular 3D target detection methods; it effectively overcomes the defects of the traditional feature pyramid structure, such as reduced algorithm efficiency and added non-maximum-suppression post-processing, and can further shorten model inference latency while effectively improving the accuracy index of the algorithm;
2. the method proposes an attention loss function that is applicable to a large proportion of target detection methods; by solving the mismatch between the category confidence output by the model and the positional accuracy of the detection frame (one of the causes of reduced accuracy), it effectively improves the accuracy index of the algorithm from the perspective of optimizing the training process, without affecting model inference latency.
Drawings
FIG. 1 shows the lightweight feature pyramid structure applied to the monocular 3D target detection method of the present invention;
FIG. 2 is a graph of the attention loss function weight coefficients.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings.
The embodiment 1 of the invention provides a monocular 3D target detection method based on a lightweight characteristic pyramid structure, which comprises the following steps:
step 1) collecting vehicle-mounted camera image RGB data;
step 2) establishing a monocular 3D target detection network based on key points;
as shown in fig. 1, the network includes: a feature extraction network (backhaul), a detection head (detective head) and a post-processing module (PostProcess);
the Chinese and English symbols in FIG. 1 are shown in Table 1:
TABLE 1
English symbol    Meaning                            English symbol   Meaning
Backbone          feature extraction network         Conv             convolution
Detection Head    detection head                     Conv1x1          1x1 convolution
Post Process      post-processing                    Sampling         sampling by index
Encoder           encoder                            TopK             take the K largest values
Decoder           decoder                            Keypoint         keypoint
H                 input image height                 Index            index
W                 input image width                  1/2Index         1/2-scaled index
D                 number of feature channels         1/4Index         1/4-scaled index
Heatmap           thermodynamic diagram (heatmap)    Class            class vector
Regression        regression                         K                numerical variable (number of candidates)
Light-FPN         lightweight feature pyramid        Concat           concatenation (merging)
C                 number of detection categories     Decode           decoding
Deconv            deconvolution                      3D Boxes         three-dimensional bounding boxes
Resnet-34         deep residual network              Results          detection results
The feature extraction network includes an Encoder and a Decoder. The encoder structure can adopt basic networks such as ResNet, DLA-34 or Hourglass-101 to down-sample the input image and extract high-level semantic features. The decoder up-samples the high-level semantic feature map output by the encoder to obtain the 4-time, 8-time and 16-time down-sampling feature maps required by the detection head. It includes three deconvolution layers: a first deconvolution layer, a second deconvolution layer and a third deconvolution layer; the first deconvolution layer processes the 32-time down-sampling feature map output by the encoder and outputs a 16-time down-sampling feature map; the second deconvolution layer processes the 16-time down-sampling feature map and outputs an 8-time down-sampling feature map; and the third deconvolution layer processes the 8-time down-sampling feature map and outputs a 4-time down-sampling feature map.
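For concreteness, a minimal PyTorch sketch of such a decoder is given below; the channel widths and layer names are illustrative assumptions rather than values taken from the patent.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Up-samples the encoder's 32-time feature map into the 16-, 8- and
    4-time down-sampling feature maps used by the detection head.
    Channel widths (512 -> 256 -> 128 -> 64) are illustrative assumptions."""

    def __init__(self, in_channels=512):
        super().__init__()
        self.deconv1 = self._block(in_channels, 256)  # 32x -> 16x
        self.deconv2 = self._block(256, 128)          # 16x -> 8x
        self.deconv3 = self._block(128, 64)           # 8x  -> 4x

    @staticmethod
    def _block(cin, cout):
        return nn.Sequential(
            nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    def forward(self, f32):
        f16 = self.deconv1(f32)   # 16-time down-sampling feature map
        f8 = self.deconv2(f16)    # 8-time down-sampling feature map
        f4 = self.deconv3(f8)     # 4-time down-sampling feature map
        return f4, f8, f16        # all three feed the detection head
```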
The fully convolutional detection head includes a thermodynamic diagram branch (Heatmap) and a parameter regression branch (Regression).
The thermodynamic diagram branch is used for predicting target keypoints and outputs a keypoint thermodynamic diagram $Y' \in [0,1]^{W/4 \times H/4 \times C}$, representing the probability that a target keypoint is detected at each pixel position (the channel dimension C is responsible for class prediction).
The thermodynamic diagram branch comprises: a first convolution layer, a second convolution layer and a TopK operation unit;
the first convolution layer is used for further extracting the features of the 4-time down-sampling feature map and outputting the feature map to the second convolution layer;
a second convolution layer for performing convolution processing on the characteristic diagram to output a thermodynamic diagram Y' for any element Y on the thermodynamic diagram Yijc: representing the probability that at the (j) pixel location of the thermodynamic diagram, there is a candidate keypoint of category c;
the TopK operation unit executes the sorting operation, and arranges all the probability values on the thermodynamic diagram in a descending order, and takes the top K values with the maximum probability values, wherein each value has the meaning same as that of the y described aboveijcLikewise, pixel coordinates (i, j) are converted to a location Index, c, which represents a category, is converted to class, and the concatenation of K such values into vectors are Classes and Indices in FIG. 1. Screening pixels with high confidence probability as candidate key points, and obtaining the pixel position of the candidate key point corresponding to the 4-time down-sampling feature map (relative to the original image)Index. The pixel position Index 1/2Index of the candidate key point corresponding to the 8-time down-sampling feature map and the pixel position Index 1/4Index of the 16-time down-sampling feature map can be further obtained through division operation, the above three pixel position indexes are one-dimensional vectors with the same number of elements, and three values of the three one-dimensional vectors at the same position respectively correspond to pixel position indexes of the same target key point at the 4-time down-sampling feature map, the 8-time down-sampling feature map and the 16-time down-sampling feature map.
The parameter regression branch introduces the lightweight feature pyramid structure provided by the invention, which comprises: a third convolution layer, three parallel sampling units, a splicing unit and a 1x1 convolution layer; the three parallel sampling units comprise: a first sampling unit, a second sampling unit and a third sampling unit;
a third convolution layer for further extracting the feature of the 4-fold down-sampling feature map and outputting the feature map to the first sampling unit;
the first sampling unit is used for sampling values from the feature map output by the third convolutional layer according to the position indexes of the key points on the 4-time downsampling feature map;
the second sampling unit is used for sampling values from the 8-time down-sampling feature map output by the second deconvolution layer according to the position indexes 1/2 indexes of the key points on the 8-time down-sampling feature map;
the third sampling unit is used for sampling values from the 16-time down-sampling feature map output by the first deconvolution layer according to the position indexes 1/4 indexes of the key points on the 16-time down-sampling feature map;
the splicing unit is used for merging the outputs of the three sampling units (weighting, attention and other modes can also be adopted) to realize the feature reading, alignment and fusion of the candidate key points on feature maps with different resolutions; outputting the fused features to a 1x1 convolutional layer;
convolution with 1x1 for obtaining 3D regression frame code vector from the fused features
Figure BDA0002994939040000071
And outputting the data to a post-processing module. (compare in the Standard framework3D frame regression map coding vector
Figure BDA0002994939040000072
The regression branch convolution operation is effectively reduced).
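To make the index-based sampling concrete, below is a minimal PyTorch sketch of the lightweight feature pyramid regression branch; the channel widths and the gather-based implementation are assumptions for illustration, not the patent's exact layers.

```python
import torch
import torch.nn as nn

def gather_features(fmap, index):
    """Read each candidate keypoint's feature vector from a feature map.
    fmap: (B, D, H, W); index: (B, K) flat position indexes -> (B, K, D)."""
    b, d, h, w = fmap.shape
    flat = fmap.reshape(b, d, h * w)              # (B, D, H*W)
    idx = index.unsqueeze(1).expand(-1, d, -1)    # (B, D, K)
    return flat.gather(2, idx).permute(0, 2, 1)   # (B, K, D)

class LightFPNRegression(nn.Module):
    """Sketch of the parameter regression branch: per-keypoint sampling
    from the 4-, 8- and 16-time maps, concatenation, then a 1x1
    convolution down to the R regression parameters."""

    def __init__(self, d4=64, d8=128, d16=256, r=8):
        super().__init__()
        self.conv = nn.Conv2d(d4, d4, 3, padding=1)              # third convolution layer
        self.head = nn.Conv1d(d4 + d8 + d16, r, kernel_size=1)   # 1x1 convolution

    def forward(self, f4, f8, f16, index, index_8, index_16):
        v4 = gather_features(self.conv(f4), index)     # first sampling unit
        v8 = gather_features(f8, index_8)              # second sampling unit
        v16 = gather_features(f16, index_16)           # third sampling unit
        fused = torch.cat([v4, v8, v16], dim=2)        # splicing unit: (B, K, D4+D8+D16)
        return self.head(fused.permute(0, 2, 1))       # (B, R, K) encoding vectors
```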
The post-processing module decodes the 3D regression frame coding vector of shape K×R output by the detection head and, in combination with the candidate keypoint class vector Classes output by the thermodynamic diagram branch, outputs the final target detection result. K represents the number of detection targets and R represents the number of regression parameters.
The regression parameters of each detected target can be represented as an 8-dimensional vector:

$$\tau = [\delta_z,\ \delta_{x_c},\ \delta_{y_c},\ \delta_h,\ \delta_w,\ \delta_l,\ \sin\alpha,\ \cos\alpha]^T$$

where:
$\delta_z$: residual of the depth value z;
$\delta_{x_c}, \delta_{y_c}$: deviation introduced by the keypoint down-sampling quantization;
$\delta_h, \delta_w, \delta_l$: residuals of the target size dimensions;
$\sin\alpha, \cos\alpha$: sine and cosine of the azimuth angle.
and (3) decoding process:
size/size:
Figure BDA0002994939040000076
position:
Figure BDA0002994939040000077
azimuth angle:
Figure BDA0002994939040000078
Figure BDA0002994939040000079
the average value of the length, the width and the height of the target is obtained by data set labeling and statistics;
μzσzthe mean value and the variance of the mean value of the target depth values are obtained by data set labeling and statistics;
k is camera internal reference, xcycIs the pixel coordinates of the keypoint.
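Under the reconstruction above, the decoding of a single target can be sketched in a few lines of NumPy; the component ordering of the 8-dimensional vector follows the listing above and should be treated as an assumption.

```python
import numpy as np

def decode_box(tau, xc, yc, K, dims_mean, mu_z, sigma_z):
    """Decode one 8-dim regression vector into a 3D box, following the
    reconstructed equations above (a sketch; the ordering of tau is an
    assumption).

    tau: [dz, dxc, dyc, dh, dw, dl, sin_a, cos_a]
    (xc, yc): keypoint pixel coordinates on the image plane
    K: 3x3 camera intrinsic matrix
    dims_mean: dataset-mean (h, w, l); mu_z, sigma_z: depth statistics
    """
    dz, dxc, dyc, dh, dw, dl, sin_a, cos_a = tau
    z = mu_z + dz * sigma_z                                   # depth residual decode
    h, w, l = np.asarray(dims_mean) * np.exp([dh, dw, dl])    # size residual decode
    # back-project the quantization-corrected keypoint to a 3D position
    uvz = np.array([(xc + dxc) * z, (yc + dyc) * z, z])
    x, y, z = np.linalg.inv(K) @ uvz
    alpha = np.arctan2(sin_a, cos_a)                          # observation (azimuth) angle
    yaw = alpha + np.arctan2(x, z)                            # global orientation
    return (x, y, z), (h, w, l), yaw
```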
Step 3) training a monocular 3D target detection network;
the monocular 3D target detection method based on the key points has the problem that the confidence coefficient of the detection category is not matched with the geometric precision of the detection frame. The detection network outputs 3D detection frame type and confidence degree information from thermodynamic diagram branches, the probability that the corresponding pixel detects the target center point of the specified type is reflected, and the geometric information from parameter regression branches reflects the geometric information such as the size, the position, the attitude angle and the like of the target frame at the pixel. Because the thermodynamic diagram branch and the parameter regression branch in the detection head are independent from each other in the training process, the confidence of the detection frame cannot truly reflect the geometric accuracy of the detection frame.
Aiming at the mismatching problem, the invention provides an attention loss function acting on a parametric regression branch, which improves the training process by giving more attention to the real target needing further optimization, and constructs the weight of the target loss in the total regression loss according to the target class confidence and the 3D IOU (cross-over ratio) between the prediction box and the truth box, thereby realizing the quantitative definition of the attention, wherein the weight distribution follows the principle that the high class confidence is low, the 3D IOU weight is highest, the high class confidence is high, the 3D IOU is low, and the low class confidence is low, and the 3D IOU weight is low. Because the weight of each target loss in the total regression loss depends on the class confidence of the thermodynamic diagram branch output, the parameter regression branch with the attention loss function needs feedback from the thermodynamic diagram branch in the training process, so that the problem of mismatching of two prediction branches in a standard model detection head due to mutual independence of the training stages is solved, and finally the class confidence of the converged model output can simultaneously reflect the position accuracy information of the corresponding 3D frame and has strong positive correlation.
The loss function of the original model's parameter regression branch is:

$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N} L_{reg}^{i}$$

where $L_{reg}$ is the total regression loss, $L_{reg}^{i}$ is the regression loss of the ith target, and $N$ is the number of targets in a training batch.
after introducing the attention loss function, the loss function of the parametric regression branch is:
Figure BDA0002994939040000083
in the formula, LregThe total regression loss;
Figure BDA0002994939040000084
regression loss for the ith target; n is the target number in a training batch; w is aiWeighting coefficients of the ith target regression loss in the total regression loss;
each target loss weight w in the attention loss functioniThe definition is as follows:
Figure BDA0002994939040000091
in the formula, PiA category confidence for the ith target;
Figure BDA0002994939040000092
regression loss for the ith target;
Figure BDA0002994939040000093
three-dimensional intersection ratio between the ith target prediction frame and the true value frame is obtained; beta is an equilibrium parameter.
A visualization of the above target loss weights is shown in FIG. 2.
Step 4) inputting the RGB image obtained in step 1) into the trained monocular 3D target detection network and outputting the target detection result.
Embodiment 2 of the present invention may also provide a computer device including: at least one processor, memory, at least one network interface, and a user interface. The various components in the device are coupled together by a bus system. It will be appreciated that a bus system is used to enable communications among the components. The bus system includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The user interface may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen, among others).
It will be appreciated that the memory in the embodiments disclosed herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, the memory stores elements, executable modules or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, including various application programs such as a Media Player (Media Player), a Browser (Browser), etc., are used to implement various application services. The program for implementing the method of the embodiment of the present disclosure may be included in an application program.
In the above embodiments, the processor may further be configured to call a program or an instruction stored in the memory, specifically, a program or an instruction stored in the application program, and the processor is configured to:
the steps of the method of example 1 were performed.
The method of embodiment 1 may be applied in or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The Processor may be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in embodiment 1 may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with embodiment 1 may be directly implemented by a hardware decoding processor, or may be implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques of the present invention may be implemented by executing the functional blocks (e.g., procedures, functions, and so on) of the present invention. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Embodiment 3 of the present invention provides a nonvolatile storage medium for storing a computer program. The computer program may implement the steps of the method in embodiment 1 when executed by a processor.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A monocular 3D target detection method based on a lightweight feature pyramid structure, the method comprising:
collecting RGB images of a vehicle-mounted camera;
inputting the RGB image into a pre-established and trained monocular 3D target detection network, and outputting a target detection result; the monocular 3D object detection network comprises: a feature extraction network, a detection head and a post-processing module;
the feature extraction network is used for performing down-sampling on the RGB image to extract high-level semantic features, generating 4-time, 8-time and 16-time down-sampling feature maps and inputting the down-sampling feature maps to the detection head;
the detection head is used for generating candidate key point category vectors and candidate key point pixel position index vectors based on the 4-time down-sampling feature map, generating candidate key point 3D regression frame coding vectors based on the 4-time down-sampling feature map, the 8-time down-sampling feature map and the 16-time down-sampling feature map, and outputting the candidate key point category vectors and the 3D regression frame coding vectors to the post-processing module;
and the post-processing module is used for decoding the 3D regression frame coding vector and outputting a target detection result by combining the candidate key point category vector.
2. The monocular 3D object detection method based on a lightweight feature pyramid structure of claim 1, wherein the feature extraction network comprises an encoder and a decoder;
the encoder is used for performing down-sampling on the input RGB image to extract high-level semantic features and outputting a 32-time down-sampling feature map;
the decoder is used for up-sampling the high-level semantic feature map output by the encoder to obtain 4-time, 8-time and 16-time down-sampling feature maps required by the detection head; the decoder includes three deconvolution layers: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer; a first deconvolution layer for processing the 32-fold down-sampling feature map output by the encoder and outputting a 16-fold down-sampling feature map; a second deconvolution layer for processing the 16-fold down-sampling feature map and outputting an 8-fold down-sampling feature map; and a third deconvolution layer for processing the 8-fold down-sampling feature map and outputting the 4-fold down-sampling feature map to the detection head.
3. The monocular 3D object detecting method based on the lightweight feature pyramid structure as claimed in claim 2, wherein the detection head comprises a thermodynamic diagram branch and a parameter regression branch;
the thermodynamic diagram branch is used for generating a thermodynamic diagram of target keypoints based on the 4-time down-sampling feature map, arranging all confidence values in descending order, screening the positions of the top K confidence values as candidate keypoints, and finally outputting the candidate keypoint category vector and the pixel position index vectors of the candidate keypoints on the 16-time, 8-time and 4-time down-sampling feature maps;
the parameter regression branch introduces a lightweight feature pyramid structure for respectively taking values from the corresponding feature maps according to the three position indexes, then merging the values to extract a target 3D regression frame coding vector, which is output to the post-processing module.
4. The monocular 3D object detection method based on a lightweight feature pyramid structure of claim 3, wherein the thermodynamic diagram branch comprises: a first convolution layer, a second convolution layer and a TopK operation unit;
the first convolution layer is used for further extracting the features of the 4-time down-sampling feature map and outputting the feature map to the second convolution layer;
the second convolution layer is used for performing convolution processing on the feature map and outputting a thermodynamic diagram, where any element $y_{ijc}$ of the thermodynamic diagram represents the probability that a target keypoint of category c exists at pixel position (i, j) of the thermodynamic diagram;
the TopK operation unit is used for arranging all probability values on the thermodynamic diagram in descending order, taking the K candidate points with the largest probability values as candidate keypoints, converting the pixel coordinates (i, j) into a position index Index, converting the channel index c representing the category into a class, splicing the K class values into the candidate keypoint category vector Classes, and outputting it to the post-processing module; the K position indexes Index are spliced to generate the candidate keypoint pixel position index vector Indexes, which gives the pixel position indexes of the candidate keypoints on the 4-time down-sampling feature map; the pixel position indexes of the candidate keypoints on the 8-time down-sampling feature map (1/2Index) and on the 16-time down-sampling feature map (1/4Index) can be further obtained by division; the three position indexes are one-dimensional vectors with the same number of elements, and the three values at the same position of the three one-dimensional vectors correspond respectively to the pixel position indexes of the same candidate keypoint on the 4-time, 8-time and 16-time down-sampling feature maps.
5. The monocular 3D object detection method based on a lightweight feature pyramid structure of claim 4, wherein the parametric regression branch comprises: a third convolutional layer, three parallel sampling units, a splicing unit and a 1x1 convolutional layer; the three parallel sampling units comprise: the device comprises a first sampling unit, a second sampling unit and a third sampling unit;
the third convolution layer is used for further extracting the characteristics of the 4 times down-sampling characteristic diagram and outputting the characteristic diagram to the first sampling unit;
the first sampling unit is used for sampling values from the feature map output by the third convolution layer according to the pixel position indexes (Index) of the candidate keypoints on the 4-time down-sampling feature map;
the second sampling unit is used for sampling values from the 8-time down-sampling feature map output by the second deconvolution layer according to the pixel position indexes (1/2Index) of the candidate keypoints on the 8-time down-sampling feature map;
the third sampling unit is used for sampling values from the 16-time down-sampling feature map output by the first deconvolution layer according to the pixel position indexes (1/4Index) of the candidate keypoints on the 16-time down-sampling feature map;
the splicing unit is used for merging the values output by the three sampling units to realize the feature reading, alignment and fusion of the candidate target key points on the feature maps with different resolutions; outputting the fused features to a 1x1 convolutional layer;
the 1x1 convolution layer is used for obtaining a 3D regression frame coding vector of shape K×R from the fused features and outputting it to the post-processing module, where K represents the number of detection targets and R represents the number of regression parameters.
6. The monocular 3D object detection method based on a lightweight feature pyramid structure as claimed in claim 5, further comprising training the monocular 3D target detection network, which specifically comprises:

in the parameter regression branch, establishing an attention loss function $L_{reg}$:

$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N} w_i\, L_{reg}^{i}$$

where $L_{reg}^{i}$ is the regression loss of the ith target, $N$ is the number of targets in a training batch, and $w_i$ is the weighting coefficient of the ith target's regression loss in the total regression loss;

each target loss weight $w_i$ in the attention loss function is defined as a function of $P_i$, the category confidence of the ith target, and $\mathrm{IoU}_{3D}^{i}$, the three-dimensional intersection-over-union between the ith target's predicted box and its ground-truth box, balanced by an equilibrium parameter $\beta$.
7. A terminal device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any of claims 1 to 6 when executing the computer program.
8. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any of claims 1 to 6.
CN202110326713.7A 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure Active CN112990050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110326713.7A CN112990050B (en) 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110326713.7A CN112990050B (en) 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure

Publications (2)

Publication Number Publication Date
CN112990050A true CN112990050A (en) 2021-06-18
CN112990050B CN112990050B (en) 2021-10-08

Family

ID=76333846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110326713.7A Active CN112990050B (en) 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure

Country Status (1)

Country Link
CN (1) CN112990050B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332792A (en) * 2021-12-09 2022-04-12 苏州驾驶宝智能科技有限公司 Method and system for detecting three-dimensional scene target based on multi-scale fusion of key points
CN114821717A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
CN115063789A (en) * 2022-05-24 2022-09-16 中国科学院自动化研究所 3D target detection method and device based on key point matching
CN115661577A (en) * 2022-11-01 2023-01-31 吉咖智能机器人有限公司 Method, apparatus, and computer-readable storage medium for object detection
CN116403180A (en) * 2023-06-02 2023-07-07 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190662A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of three-dimensional vehicle detection method, system, terminal and storage medium returned based on key point
US20200151512A1 (en) * 2018-11-08 2020-05-14 Eduardo R. Corral-Soto Method and system for converting point cloud data for use with 2d convolutional neural networks
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111369617A (en) * 2019-12-31 2020-07-03 浙江大学 3D target detection method of monocular view based on convolutional neural network
CN111832655A (en) * 2020-07-16 2020-10-27 四川大学 Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN112069896A (en) * 2020-08-04 2020-12-11 河南科技大学 Video target tracking method based on twin network fusion multi-template features

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190662A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of three-dimensional vehicle detection method, system, terminal and storage medium returned based on key point
US20200151512A1 (en) * 2018-11-08 2020-05-14 Eduardo R. Corral-Soto Method and system for converting point cloud data for use with 2d convolutional neural networks
CN111369617A (en) * 2019-12-31 2020-07-03 浙江大学 3D target detection method of monocular view based on convolutional neural network
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111832655A (en) * 2020-07-16 2020-10-27 四川大学 Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN112069896A (en) * 2020-08-04 2020-12-11 河南科技大学 Video target tracking method based on twin network fusion multi-template features

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAOXIN FAN, "PointFPN: A Frustum-based Feature Pyramid Network for 3D Object Detection", 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) *
REN Zhijun et al., "Mask R-CNN Object Detection Method Based on Improved Feature Pyramid" (基于改进特征金字塔的Mask R-CNN目标检测方法), Laser & Optoelectronics Progress (激光与光电子学进展) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332792A (en) * 2021-12-09 2022-04-12 苏州驾驶宝智能科技有限公司 Method and system for detecting three-dimensional scene target based on multi-scale fusion of key points
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
CN114821717A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114821717B (en) * 2022-04-20 2024-03-12 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN115063789A (en) * 2022-05-24 2022-09-16 中国科学院自动化研究所 3D target detection method and device based on key point matching
CN115063789B (en) * 2022-05-24 2023-08-04 中国科学院自动化研究所 3D target detection method and device based on key point matching
CN115661577A (en) * 2022-11-01 2023-01-31 吉咖智能机器人有限公司 Method, apparatus, and computer-readable storage medium for object detection
CN115661577B (en) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 Method, apparatus and computer readable storage medium for object detection
CN116403180A (en) * 2023-06-02 2023-07-07 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning
CN116403180B (en) * 2023-06-02 2023-08-15 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning

Also Published As

Publication number Publication date
CN112990050B (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN112990050B (en) Monocular 3D target detection method based on lightweight characteristic pyramid structure
US11120276B1 (en) Deep multimodal cross-layer intersecting fusion method, terminal device, and storage medium
CN110032969B (en) Method, apparatus, device, and medium for detecting text region in image
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
Huang et al. GraNet: Global relation-aware attentional network for semantic segmentation of ALS point clouds
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN113221740B (en) Farmland boundary identification method and system
CN113095152B (en) Regression-based lane line detection method and system
CN115393680B (en) 3D target detection method and system for multi-mode information space-time fusion in foggy weather scene
WO2023030182A1 (en) Image generation method and apparatus
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN112733942A (en) Variable-scale target detection method based on multi-stage feature adaptive fusion
CN113298032A (en) Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning
CN114782865B (en) Intersection vehicle positioning method and system based on multi-view and re-recognition
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN117173657A (en) Pre-training method for automatic driving perception model
US20230162474A1 (en) Method of processing image, method of training model, and electronic device
CN114550163B (en) Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
CN112101310B (en) Road extraction method and device based on context information and computer equipment
CN115170662A (en) Multi-target positioning method based on yolov3 and convolutional neural network
Wu et al. Application and Research of the Image Segmentation Algorithm in Remote Sensing Image Buildings
CN112396596A (en) Closed loop detection method based on semantic segmentation and image feature description
CN115082869B (en) Vehicle-road cooperative multi-target detection method and system for serving special vehicle
CN112396593B (en) Closed loop detection method based on key frame selection and local features
CN114387521B (en) Remote sensing image building extraction method based on attention mechanism and boundary loss

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant