CN112990050A - Monocular 3D target detection method based on lightweight characteristic pyramid structure - Google Patents

Monocular 3D target detection method based on lightweight characteristic pyramid structure

Info

Publication number
CN112990050A
Authority
CN
China
Prior art keywords
sampling
feature map
candidate key
outputting
down-sampling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110326713.7A
Other languages
Chinese (zh)
Other versions
CN112990050B (en)
Inventor
李骏 (Li Jun)
张新钰 (Zhang Xinyu)
杨磊 (Yang Lei)
王力 (Wang Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110326713.7A priority Critical patent/CN112990050B/en
Publication of CN112990050A publication Critical patent/CN112990050A/en
Application granted granted Critical
Publication of CN112990050B publication Critical patent/CN112990050B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a monocular 3D target detection method based on a lightweight feature pyramid structure, comprising: collecting RGB images from a vehicle-mounted camera; and inputting the RGB images into a pre-established and trained monocular 3D target detection network, which outputs the target detection result. The monocular 3D target detection network comprises a feature extraction network, a detection head and a post-processing module. The feature extraction network down-samples the RGB image to extract high-level semantic features, generates 4-time, 8-time and 16-time down-sampling feature maps, and inputs them to the detection head. The detection head generates candidate keypoint category vectors and candidate keypoint pixel position index vectors based on the 4-time down-sampling feature map, generates candidate keypoint 3D regression frame coding vectors based on the 4-time, 8-time and 16-time down-sampling feature maps, and outputs the candidate keypoint category vectors and the 3D regression frame coding vectors to the post-processing module. The post-processing module decodes the 3D regression frame coding vectors and outputs the target detection result in combination with the candidate keypoint category vectors.

Description

Monocular 3D target detection method based on lightweight characteristic pyramid structure
Technical Field
The invention relates to the technical field of automatic driving, in particular to a monocular 3D target detection method based on a lightweight characteristic pyramid structure.
Background
In an automatic driving system, 3D target detection is a very important task of the perception module: the downstream prediction, planning and motion-control modules all depend on reliable detection of targets of specific categories around the ego vehicle. Thanks to the ability of high-line-count lidar to model the surrounding environment accurately at centimeter level, lidar-based 3D target detection algorithms have advanced greatly in recent years. However, the inherent drawbacks of lidar sensors, namely high cost and poor adaptability to severe weather such as rain, snow and fog, severely limit the large-scale deployment of lidar and related algorithms in the field of automatic driving. Compared with lidar, vision sensors are low in cost, adapt better to severe weather such as rain and snow, and more easily meet the requirements of marketization and large-scale mass production, so purely vision-based 3D target detection algorithms have gradually attracted attention from both academia and industry. On the one hand, purely vision-based 3D target detection avoids expensive lidar and enables a low-cost automatic driving solution; on the other hand, it can also be paired with a lidar-based 3D target detection module to provide module redundancy, avoiding the serious consequences of lidar failure and enabling a safer and more reliable automatic driving solution.
Due to the dual considerations of cost and power consumption, the computing power of the vehicle-side computing platform carried by an automatic driving vehicle is relatively limited and cannot support large, complex models. A purely visual 3D target detection algorithm for automatic driving applications must therefore balance the dual indexes of detection accuracy and efficiency: on the premise that the accuracy index meets the requirements of actual scenarios, a faster model inference speed is pursued to ensure a more timely response of the perception system and to provide earlier warning to the downstream prediction, planning and motion-planning modules, thereby realizing a safer and more reliable automatic driving system.
The anchor-free idea is a recent research hotspot in the field of target detection. Newly proposed keypoint-based purely visual 3D target detection algorithms (such as CenterNet, SMOKE and RTM3D) meet the real-time deployment and engineering requirements of automatic driving edge computing platforms with high algorithm efficiency (CenterNet: 30 ms, SMOKE: 30 ms, RTM3D: 50 ms), but their low accuracy indexes mean they cannot fully satisfy the perception requirements of automatic driving scenarios.
Subsequent improvement methods mainly include adding a traditional feature pyramid structure, a serial multi-stage cascade regression structure, or a structure introduced via deep reinforcement learning to optimize the detection frame. These improvement schemes effectively raise the accuracy index of the algorithm, but because they introduce additional structural branches into the model, they greatly increase model inference latency and hinder real-time deployment of the algorithm on automatic driving edge computing platforms. Therefore, a method that substantially improves the accuracy index while keeping the efficiency of existing methods undiminished has huge practical engineering value.
Disclosure of Invention
The invention aims to overcome the above technical defects by providing a lightweight feature pyramid structure and an attention loss function that can be applied to most existing keypoint-based monocular 3D target detection methods, improving both the detection accuracy and the efficiency of existing methods.
In order to achieve the above object, embodiment 1 of the present invention provides a monocular 3D object detection method based on a lightweight feature pyramid structure, where the method includes:
collecting RGB images of a vehicle-mounted camera;
inputting the RGB image into a pre-established and trained monocular 3D target detection network, and outputting a target detection result; the monocular 3D object detection network comprises: a feature extraction network, a detection head and a post-processing module;
the feature extraction network is used for performing down-sampling on the RGB image to extract high-level semantic features, generating 4-time, 8-time and 16-time down-sampling feature maps and inputting them to the detection head;
the detection head is used for generating candidate key point category vectors and candidate key point pixel position index vectors based on the 4-time down-sampling feature map, generating 3D regression frame coding vectors corresponding to the candidate key points based on the 4-time down-sampling feature map, the 8-time down-sampling feature map and the 16-time down-sampling feature map, and outputting the candidate key point category vectors and the 3D regression frame coding vectors to the post-processing module;
and the post-processing module is used for decoding the 3D regression frame coding vector and outputting a target detection result by combining the candidate key point category vector.
As an improvement of the above method, the feature extraction network comprises an encoder and a decoder;
the encoder is used for performing down-sampling on the input RGB image to extract high-level semantic features and outputting a 32-time down-sampling feature map;
the decoder is used for up-sampling the high-level semantic feature map output by the encoder to obtain the 4-time, 8-time and 16-time down-sampling feature maps required by the detection head; the decoder includes three deconvolution layers: a first deconvolution layer, a second deconvolution layer and a third deconvolution layer; the first deconvolution layer is used for processing the 32-time down-sampling feature map output by the encoder and outputting a 16-time down-sampling feature map; the second deconvolution layer is used for processing the 16-time down-sampling feature map and outputting an 8-time down-sampling feature map; and the third deconvolution layer is used for processing the 8-time down-sampling feature map and outputting the 4-time down-sampling feature map to the detection head.
As an improvement of the above method, the detection head comprises a thermodynamic diagram branch and a parameter regression branch;
the thermodynamic diagram branch is used for generating a thermodynamic diagram of target keypoints based on the 4-time down-sampling feature map, arranging all confidence values in descending order, screening the positions of the top K confidence values as candidate keypoints, and finally outputting the candidate keypoint category vector and the pixel position index vectors of the candidate keypoints on the 16-time, 8-time and 4-time down-sampling feature maps;
the parameter regression branch introduces a lightweight feature pyramid structure for respectively taking values from the corresponding feature maps according to the three position indexes, then merging the values and extracting a target 3D regression frame coding vector of shape K×R, which is output to the post-processing module, where K represents the number of detection targets and R represents the number of regression parameters.
As an improvement of the above method, the thermodynamic diagram branch comprises: a first convolution layer, a second convolution layer and a TopK operation unit;
the first convolution layer is used for further extracting the features of the 4-time down-sampling feature map and outputting the feature map to the second convolution layer;
the second convolution layer is used for performing convolution processing on the feature map and outputting a thermodynamic diagram, where any element $y_{ijc}$ of the thermodynamic diagram represents the probability that a target keypoint of category c exists at pixel position (i, j) of the thermodynamic diagram;
the TopK operation unit is used for arranging all probability values on the thermodynamic diagram in descending order, taking the K candidate points with the largest probability values as candidate keypoints, converting the pixel coordinates (i, j) into a position index Index, converting the channel index c representing the category into a class, splicing the K class values into the candidate keypoint category vector Classes, and outputting it to the post-processing module; the K position indexes Index are spliced to generate the candidate keypoint pixel position index vector Indexes, which gives the pixel position indexes of the candidate keypoints on the 4-time down-sampling feature map; the pixel position indexes of the candidate keypoints on the 8-time down-sampling feature map (1/2Index) and on the 16-time down-sampling feature map (1/4Index) can be further obtained by division; the three position indexes are one-dimensional vectors with the same number of elements, and the three values at the same position of the three one-dimensional vectors correspond respectively to the pixel position indexes of the same candidate keypoint on the 4-time, 8-time and 16-time down-sampling feature maps.
As an improvement of the above method, the parametric regression branch comprises: a third convolutional layer, three parallel sampling units, a splicing unit and a 1x1 convolutional layer; the three parallel sampling units comprise: the device comprises a first sampling unit, a second sampling unit and a third sampling unit;
the third convolution layer is used for further extracting the characteristics of the 4 times down-sampling characteristic diagram and outputting the characteristic diagram to the first sampling unit;
the first sampling unit is used for sampling values from the feature map output by the third convolution layer according to the pixel position indexes (Index) of the candidate keypoints on the 4-time down-sampling feature map;
the second sampling unit is used for sampling values from the 8-time down-sampling feature map output by the second deconvolution layer according to the pixel position indexes (1/2Index) of the candidate keypoints on the 8-time down-sampling feature map;
the third sampling unit is used for sampling values from the 16-time down-sampling feature map output by the first deconvolution layer according to the pixel position indexes (1/4Index) of the candidate keypoints on the 16-time down-sampling feature map;
the splicing unit is used for merging the values output by the three sampling units to realize the feature reading, alignment and fusion of the candidate target key points on the feature maps with different resolutions; outputting the fused features to a 1x1 convolutional layer;
the 1x1 convolution layer is used for obtaining a 3D regression frame coding vector of shape K×R from the fused features and outputting it to the post-processing module.
As an improvement of the above method, the method further comprises training the monocular 3D target detection network, which specifically comprises the following steps:

in the parameter regression branch, establishing an attention loss function $L_{reg}$:

$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N} w_i\, L_{reg}^{i}$$

where $L_{reg}^{i}$ is the regression loss of the ith target, $N$ is the number of targets in a training batch, and $w_i$ is the weighting coefficient of the ith target's regression loss in the total regression loss;

each target loss weight $w_i$ in the attention loss function is defined as a function of $P_i$ and $\mathrm{IoU}_{3D}^{i}$, where $P_i$ is the category confidence of the ith target, $\mathrm{IoU}_{3D}^{i}$ is the three-dimensional intersection-over-union between the ith target's predicted box and its ground-truth box, and $\beta$ is an equilibrium parameter balancing the two.
Embodiment 2 of the present invention provides a terminal device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the above-mentioned method when executing the computer program.
Embodiment 3 of the present invention provides a storage medium storing a computer program which, when executed by a processor, implements the above-mentioned method.
The invention has the advantages that:
1. the method provided by the invention proposes a lightweight feature pyramid structure that is applicable to most keypoint-based monocular 3D target detection methods; it effectively overcomes the defects of the traditional feature pyramid structure, such as reduced algorithm efficiency and added non-maximum-suppression post-processing, and can further shorten model inference latency while effectively improving the accuracy index of the algorithm;
2. the method proposes an attention loss function that is applicable to a large proportion of target detection methods; by solving the mismatch between the category confidence output by the model and the positional accuracy of the detection frame (one of the causes of reduced accuracy), it effectively improves the accuracy index of the algorithm from the perspective of optimizing the training process, without affecting model inference latency.
Drawings
FIG. 1 shows the lightweight feature pyramid structure applied to the monocular 3D target detection method of the present invention;
FIG. 2 is a graph of the attention loss function weight coefficients.
Detailed Description
The technical solution of the present invention will be described in detail below with reference to the accompanying drawings.
The embodiment 1 of the invention provides a monocular 3D target detection method based on a lightweight characteristic pyramid structure, which comprises the following steps:
step 1) collecting vehicle-mounted camera image RGB data;
step 2) establishing a monocular 3D target detection network based on key points;
as shown in fig. 1, the network includes: a feature extraction network (backhaul), a detection head (detective head) and a post-processing module (PostProcess);
the Chinese and English symbols in FIG. 1 are shown in Table 1:
TABLE 1
English symbol    Meaning                            English symbol   Meaning
Backbone          feature extraction network         Conv             convolution
Detection Head    detection head                     Conv1x1          1x1 convolution
Post Process      post-processing                    Sampling         sampling by index
Encoder           encoder                            TopK             take the K largest values
Decoder           decoder                            Keypoint         keypoint
H                 input image height                 Index            index
W                 input image width                  1/2Index         1/2-scaled index
D                 number of feature channels         1/4Index         1/4-scaled index
Heatmap           thermodynamic diagram (heatmap)    Class            class vector
Regression        regression                         K                numerical variable (number of candidates)
Light-FPN         lightweight feature pyramid        Concat           concatenation (merging)
C                 number of detection categories     Decode           decoding
Deconv            deconvolution                      3D Boxes         three-dimensional bounding boxes
Resnet-34         deep residual network              Results          detection results
The feature extraction network includes an Encoder and a Decoder. The encoder structure can adopt basic networks such as ResNet, DLA-34 or Hourglass-101 to down-sample the input image and extract high-level semantic features. The decoder up-samples the high-level semantic feature map output by the encoder to obtain the 4-time, 8-time and 16-time down-sampling feature maps required by the detection head. It includes three deconvolution layers: a first deconvolution layer, a second deconvolution layer and a third deconvolution layer; the first deconvolution layer processes the 32-time down-sampling feature map output by the encoder and outputs a 16-time down-sampling feature map; the second deconvolution layer processes the 16-time down-sampling feature map and outputs an 8-time down-sampling feature map; and the third deconvolution layer processes the 8-time down-sampling feature map and outputs a 4-time down-sampling feature map.
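For concreteness, a minimal PyTorch sketch of such a decoder is given below; the channel widths and layer names are illustrative assumptions rather than values taken from the patent.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Up-samples the encoder's 32-time feature map into the 16-, 8- and
    4-time down-sampling feature maps used by the detection head.
    Channel widths (512 -> 256 -> 128 -> 64) are illustrative assumptions."""

    def __init__(self, in_channels=512):
        super().__init__()
        self.deconv1 = self._block(in_channels, 256)  # 32x -> 16x
        self.deconv2 = self._block(256, 128)          # 16x -> 8x
        self.deconv3 = self._block(128, 64)           # 8x  -> 4x

    @staticmethod
    def _block(cin, cout):
        return nn.Sequential(
            nn.ConvTranspose2d(cin, cout, kernel_size=4, stride=2, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    def forward(self, f32):
        f16 = self.deconv1(f32)   # 16-time down-sampling feature map
        f8 = self.deconv2(f16)    # 8-time down-sampling feature map
        f4 = self.deconv3(f8)     # 4-time down-sampling feature map
        return f4, f8, f16        # all three feed the detection head
```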
The fully convolutional detection head includes a thermodynamic diagram branch (Heatmap) and a parameter regression branch (Regression).
The thermodynamic diagram branch is used for predicting target keypoints and outputs a keypoint thermodynamic diagram $Y' \in [0,1]^{W/4 \times H/4 \times C}$, representing the probability that a target keypoint is detected at each pixel position (the channel dimension C is responsible for class prediction).
The thermodynamic diagram branch comprises: a first convolution layer, a second convolution layer and a TopK operation unit;
the first convolution layer is used for further extracting the features of the 4-time down-sampling feature map and outputting the feature map to the second convolution layer;
a second convolution layer for performing convolution processing on the characteristic diagram to output a thermodynamic diagram Y' for any element Y on the thermodynamic diagram Yijc: representing the probability that at the (j) pixel location of the thermodynamic diagram, there is a candidate keypoint of category c;
the TopK operation unit executes the sorting operation, and arranges all the probability values on the thermodynamic diagram in a descending order, and takes the top K values with the maximum probability values, wherein each value has the meaning same as that of the y described aboveijcLikewise, pixel coordinates (i, j) are converted to a location Index, c, which represents a category, is converted to class, and the concatenation of K such values into vectors are Classes and Indices in FIG. 1. Screening pixels with high confidence probability as candidate key points, and obtaining the pixel position of the candidate key point corresponding to the 4-time down-sampling feature map (relative to the original image)Index. The pixel position Index 1/2Index of the candidate key point corresponding to the 8-time down-sampling feature map and the pixel position Index 1/4Index of the 16-time down-sampling feature map can be further obtained through division operation, the above three pixel position indexes are one-dimensional vectors with the same number of elements, and three values of the three one-dimensional vectors at the same position respectively correspond to pixel position indexes of the same target key point at the 4-time down-sampling feature map, the 8-time down-sampling feature map and the 16-time down-sampling feature map.
The parameter regression branch introduces the lightweight feature pyramid structure provided by the invention, which comprises: a third convolution layer, three parallel sampling units, a splicing unit and a 1x1 convolution layer; the three parallel sampling units comprise: a first sampling unit, a second sampling unit and a third sampling unit;
a third convolution layer for further extracting the feature of the 4-fold down-sampling feature map and outputting the feature map to the first sampling unit;
the first sampling unit is used for sampling values from the feature map output by the third convolutional layer according to the position indexes of the key points on the 4-time downsampling feature map;
the second sampling unit is used for sampling values from the 8-time down-sampling feature map output by the second deconvolution layer according to the position indexes 1/2 indexes of the key points on the 8-time down-sampling feature map;
the third sampling unit is used for sampling values from the 16-time down-sampling feature map output by the first deconvolution layer according to the position indexes 1/4 indexes of the key points on the 16-time down-sampling feature map;
the splicing unit is used for merging the outputs of the three sampling units (weighting, attention and other modes can also be adopted) to realize the feature reading, alignment and fusion of the candidate key points on feature maps with different resolutions; outputting the fused features to a 1x1 convolutional layer;
convolution with 1x1 for obtaining 3D regression frame code vector from the fused features
Figure BDA0002994939040000071
And outputting the data to a post-processing module. (compare in the Standard framework3D frame regression map coding vector
Figure BDA0002994939040000072
The regression branch convolution operation is effectively reduced).
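To make the index-based sampling concrete, below is a minimal PyTorch sketch of the lightweight feature pyramid regression branch; the channel widths and the gather-based implementation are assumptions for illustration, not the patent's exact layers.

```python
import torch
import torch.nn as nn

def gather_features(fmap, index):
    """Read each candidate keypoint's feature vector from a feature map.
    fmap: (B, D, H, W); index: (B, K) flat position indexes -> (B, K, D)."""
    b, d, h, w = fmap.shape
    flat = fmap.reshape(b, d, h * w)              # (B, D, H*W)
    idx = index.unsqueeze(1).expand(-1, d, -1)    # (B, D, K)
    return flat.gather(2, idx).permute(0, 2, 1)   # (B, K, D)

class LightFPNRegression(nn.Module):
    """Sketch of the parameter regression branch: per-keypoint sampling
    from the 4-, 8- and 16-time maps, concatenation, then a 1x1
    convolution down to the R regression parameters."""

    def __init__(self, d4=64, d8=128, d16=256, r=8):
        super().__init__()
        self.conv = nn.Conv2d(d4, d4, 3, padding=1)              # third convolution layer
        self.head = nn.Conv1d(d4 + d8 + d16, r, kernel_size=1)   # 1x1 convolution

    def forward(self, f4, f8, f16, index, index_8, index_16):
        v4 = gather_features(self.conv(f4), index)     # first sampling unit
        v8 = gather_features(f8, index_8)              # second sampling unit
        v16 = gather_features(f16, index_16)           # third sampling unit
        fused = torch.cat([v4, v8, v16], dim=2)        # splicing unit: (B, K, D4+D8+D16)
        return self.head(fused.permute(0, 2, 1))       # (B, R, K) encoding vectors
```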
The post-processing module decodes the 3D regression frame coding vector of shape K×R output by the detection head and, in combination with the candidate keypoint class vector Classes output by the thermodynamic diagram branch, outputs the final target detection result. K represents the number of detection targets and R represents the number of regression parameters.
The regression parameters of each detected target can be represented as an 8-dimensional vector:

$$\tau = [\delta_z,\ \delta_{x_c},\ \delta_{y_c},\ \delta_h,\ \delta_w,\ \delta_l,\ \sin\alpha,\ \cos\alpha]^T$$

where:
$\delta_z$: residual of the depth value z;
$\delta_{x_c}, \delta_{y_c}$: deviation introduced by the keypoint down-sampling quantization;
$\delta_h, \delta_w, \delta_l$: residuals of the target size dimensions;
$\sin\alpha, \cos\alpha$: sine and cosine of the azimuth angle.
and (3) decoding process:
size/size:
Figure BDA0002994939040000076
position:
Figure BDA0002994939040000077
azimuth angle:
Figure BDA0002994939040000078
Figure BDA0002994939040000079
the average value of the length, the width and the height of the target is obtained by data set labeling and statistics;
μzσzthe mean value and the variance of the mean value of the target depth values are obtained by data set labeling and statistics;
k is camera internal reference, xcycIs the pixel coordinates of the keypoint.
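Under the reconstruction above, the decoding of a single target can be sketched in a few lines of NumPy; the component ordering of the 8-dimensional vector follows the listing above and should be treated as an assumption.

```python
import numpy as np

def decode_box(tau, xc, yc, K, dims_mean, mu_z, sigma_z):
    """Decode one 8-dim regression vector into a 3D box, following the
    reconstructed equations above (a sketch; the ordering of tau is an
    assumption).

    tau: [dz, dxc, dyc, dh, dw, dl, sin_a, cos_a]
    (xc, yc): keypoint pixel coordinates on the image plane
    K: 3x3 camera intrinsic matrix
    dims_mean: dataset-mean (h, w, l); mu_z, sigma_z: depth statistics
    """
    dz, dxc, dyc, dh, dw, dl, sin_a, cos_a = tau
    z = mu_z + dz * sigma_z                                   # depth residual decode
    h, w, l = np.asarray(dims_mean) * np.exp([dh, dw, dl])    # size residual decode
    # back-project the quantization-corrected keypoint to a 3D position
    uvz = np.array([(xc + dxc) * z, (yc + dyc) * z, z])
    x, y, z = np.linalg.inv(K) @ uvz
    alpha = np.arctan2(sin_a, cos_a)                          # observation (azimuth) angle
    yaw = alpha + np.arctan2(x, z)                            # global orientation
    return (x, y, z), (h, w, l), yaw
```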
Step 3) training a monocular 3D target detection network;
the monocular 3D target detection method based on the key points has the problem that the confidence coefficient of the detection category is not matched with the geometric precision of the detection frame. The detection network outputs 3D detection frame type and confidence degree information from thermodynamic diagram branches, the probability that the corresponding pixel detects the target center point of the specified type is reflected, and the geometric information from parameter regression branches reflects the geometric information such as the size, the position, the attitude angle and the like of the target frame at the pixel. Because the thermodynamic diagram branch and the parameter regression branch in the detection head are independent from each other in the training process, the confidence of the detection frame cannot truly reflect the geometric accuracy of the detection frame.
Aiming at the mismatching problem, the invention provides an attention loss function acting on a parametric regression branch, which improves the training process by giving more attention to the real target needing further optimization, and constructs the weight of the target loss in the total regression loss according to the target class confidence and the 3D IOU (cross-over ratio) between the prediction box and the truth box, thereby realizing the quantitative definition of the attention, wherein the weight distribution follows the principle that the high class confidence is low, the 3D IOU weight is highest, the high class confidence is high, the 3D IOU is low, and the low class confidence is low, and the 3D IOU weight is low. Because the weight of each target loss in the total regression loss depends on the class confidence of the thermodynamic diagram branch output, the parameter regression branch with the attention loss function needs feedback from the thermodynamic diagram branch in the training process, so that the problem of mismatching of two prediction branches in a standard model detection head due to mutual independence of the training stages is solved, and finally the class confidence of the converged model output can simultaneously reflect the position accuracy information of the corresponding 3D frame and has strong positive correlation.
The loss function of the original model's parameter regression branch is:

$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N} L_{reg}^{i}$$

where $L_{reg}$ is the total regression loss, $L_{reg}^{i}$ is the regression loss of the ith target, and $N$ is the number of targets in a training batch.
after introducing the attention loss function, the loss function of the parametric regression branch is:
Figure BDA0002994939040000083
in the formula, LregThe total regression loss;
Figure BDA0002994939040000084
regression loss for the ith target; n is the target number in a training batch; w is aiWeighting coefficients of the ith target regression loss in the total regression loss;
each target loss weight w in the attention loss functioniThe definition is as follows:
Figure BDA0002994939040000091
in the formula, PiA category confidence for the ith target;
Figure BDA0002994939040000092
regression loss for the ith target;
Figure BDA0002994939040000093
three-dimensional intersection ratio between the ith target prediction frame and the true value frame is obtained; beta is an equilibrium parameter.
A visualization of the above target loss weights is shown in FIG. 2.
Step 4) inputting the RGB image obtained in step 1) into the trained monocular 3D target detection network and outputting the target detection result.
Embodiment 2 of the present invention may also provide a computer device including: at least one processor, memory, at least one network interface, and a user interface. The various components in the device are coupled together by a bus system. It will be appreciated that a bus system is used to enable communications among the components. The bus system includes a power bus, a control bus, and a status signal bus in addition to a data bus.
The user interface may include, among other things, a display, a keyboard, or a pointing device (e.g., a mouse, trackball, touch pad, or touch screen, among others).
It will be appreciated that the memory in the embodiments disclosed herein can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
In some embodiments, the memory stores elements, executable modules or data structures, or a subset thereof, or an expanded set thereof as follows: an operating system and an application program.
The operating system includes various system programs, such as a framework layer, a core library layer, a driver layer, and the like, and is used for implementing various basic services and processing hardware-based tasks. The application programs, including various application programs such as a Media Player (Media Player), a Browser (Browser), etc., are used to implement various application services. The program for implementing the method of the embodiment of the present disclosure may be included in an application program.
In the above embodiments, the processor may further be configured to call a program or an instruction stored in the memory, specifically, a program or an instruction stored in the application program, and the processor is configured to:
the steps of the method of example 1 were performed.
The method of embodiment 1 may be applied in or implemented by a processor. The processor may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The Processor may be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, or discrete hardware components. The methods, steps, and logic blocks disclosed in embodiment 1 may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with embodiment 1 may be directly implemented by a hardware decoding processor, or may be implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and completes the steps of the method in combination with hardware of the processor.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques of the present invention may be implemented by executing the functional blocks (e.g., procedures, functions, and so on) of the present invention. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Embodiment 3 of the present invention provides a nonvolatile storage medium for storing a computer program. The computer program may implement the steps of the method in embodiment 1 when executed by a processor.
Finally, it should be noted that the above embodiments are only used for illustrating the technical solutions of the present invention and are not limited. Although the present invention has been described in detail with reference to the embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (8)

1. A monocular 3D target detection method based on a lightweight feature pyramid structure, the method comprising:
collecting RGB images of a vehicle-mounted camera;
inputting the RGB image into a pre-established and trained monocular 3D target detection network, and outputting a target detection result; the monocular 3D object detection network comprises: a feature extraction network, a detection head and a post-processing module;
the feature extraction network is used for performing down-sampling on the RGB image to extract high-level semantic features, generating 4-time, 8-time and 16-time down-sampling feature maps and inputting the down-sampling feature maps to the detection head;
the detection head is used for generating candidate key point category vectors and candidate key point pixel position index vectors based on the 4-time down-sampling feature map, generating candidate key point 3D regression frame coding vectors based on the 4-time down-sampling feature map, the 8-time down-sampling feature map and the 16-time down-sampling feature map, and outputting the candidate key point category vectors and the 3D regression frame coding vectors to the post-processing module;
and the post-processing module is used for decoding the 3D regression frame coding vector and outputting a target detection result by combining the candidate key point category vector.
2. The monocular 3D object detection method based on a lightweight feature pyramid structure of claim 1, wherein the feature extraction network comprises an encoder and a decoder;
the encoder is used for performing down-sampling on the input RGB image to extract high-level semantic features and outputting a 32-time down-sampling feature map;
the decoder is used for up-sampling the high-level semantic feature map output by the encoder to obtain 4-time, 8-time and 16-time down-sampling feature maps required by the detection head; the decoder includes three deconvolution layers: a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer; a first deconvolution layer for processing the 32-fold down-sampling feature map output by the encoder and outputting a 16-fold down-sampling feature map; a second deconvolution layer for processing the 16-fold down-sampling feature map and outputting an 8-fold down-sampling feature map; and a third deconvolution layer for processing the 8-fold down-sampling feature map and outputting the 4-fold down-sampling feature map to the detection head.
3. The monocular 3D object detecting method based on the lightweight feature pyramid structure as claimed in claim 2, wherein the detection head comprises a thermodynamic diagram branch and a parameter regression branch;
the thermodynamic diagram branch is used for generating a thermodynamic diagram of target keypoints based on the 4-time down-sampling feature map, arranging all confidence values in descending order, screening the positions of the top K confidence values as candidate keypoints, and finally outputting the candidate keypoint category vector and the pixel position index vectors of the candidate keypoints on the 16-time, 8-time and 4-time down-sampling feature maps;
the parameter regression branch introduces a lightweight feature pyramid structure for respectively taking values from the corresponding feature maps according to the three position indexes, then merging the values to extract a target 3D regression frame coding vector, which is output to the post-processing module.
4. The monocular 3D object detection method based on a lightweight feature pyramid structure of claim 3, wherein the thermodynamic diagram branch comprises: a first convolution layer, a second convolution layer and a TopK operation unit;
the first convolution layer is used for further extracting the features of the 4-time down-sampling feature map and outputting the feature map to the second convolution layer;
the second convolution layer is used for performing convolution processing on the feature map and outputting a thermodynamic diagram, where any element $y_{ijc}$ of the thermodynamic diagram represents the probability that a target keypoint of category c exists at pixel position (i, j) of the thermodynamic diagram;
the TopK operation unit is used for arranging all probability values on the thermodynamic diagram in descending order, taking the K candidate points with the largest probability values as candidate keypoints, converting the pixel coordinates (i, j) into a position index Index, converting the channel index c representing the category into a class, splicing the K class values into the candidate keypoint category vector Classes, and outputting it to the post-processing module; the K position indexes Index are spliced to generate the candidate keypoint pixel position index vector Indexes, which gives the pixel position indexes of the candidate keypoints on the 4-time down-sampling feature map; the pixel position indexes of the candidate keypoints on the 8-time down-sampling feature map (1/2Index) and on the 16-time down-sampling feature map (1/4Index) can be further obtained by division; the three position indexes are one-dimensional vectors with the same number of elements, and the three values at the same position of the three one-dimensional vectors correspond respectively to the pixel position indexes of the same candidate keypoint on the 4-time, 8-time and 16-time down-sampling feature maps.
5. The monocular 3D object detection method based on a lightweight feature pyramid structure of claim 4, wherein the parametric regression branch comprises: a third convolutional layer, three parallel sampling units, a splicing unit and a 1x1 convolutional layer; the three parallel sampling units comprise: the device comprises a first sampling unit, a second sampling unit and a third sampling unit;
the third convolution layer is used for further extracting the characteristics of the 4 times down-sampling characteristic diagram and outputting the characteristic diagram to the first sampling unit;
the first sampling unit is used for sampling values from the feature map output by the third convolution layer according to the pixel position indexes (Index) of the candidate keypoints on the 4-time down-sampling feature map;
the second sampling unit is used for sampling values from the 8-time down-sampling feature map output by the second deconvolution layer according to the pixel position indexes (1/2Index) of the candidate keypoints on the 8-time down-sampling feature map;
the third sampling unit is used for sampling values from the 16-time down-sampling feature map output by the first deconvolution layer according to the pixel position indexes (1/4Index) of the candidate keypoints on the 16-time down-sampling feature map;
the splicing unit is used for merging the values output by the three sampling units to realize the feature reading, alignment and fusion of the candidate target key points on the feature maps with different resolutions; outputting the fused features to a 1x1 convolutional layer;
the 1x1 convolution layer is used for obtaining a 3D regression frame coding vector of shape K×R from the fused features and outputting it to the post-processing module, where K represents the number of detection targets and R represents the number of regression parameters.
6. The monocular 3D object detection method based on a lightweight feature pyramid structure as claimed in claim 5, further comprising training the monocular 3D target detection network, which specifically comprises:

in the parameter regression branch, establishing an attention loss function $L_{reg}$:

$$L_{reg} = \frac{1}{N}\sum_{i=1}^{N} w_i\, L_{reg}^{i}$$

where $L_{reg}^{i}$ is the regression loss of the ith target, $N$ is the number of targets in a training batch, and $w_i$ is the weighting coefficient of the ith target's regression loss in the total regression loss;

each target loss weight $w_i$ in the attention loss function is defined as a function of $P_i$, the category confidence of the ith target, and $\mathrm{IoU}_{3D}^{i}$, the three-dimensional intersection-over-union between the ith target's predicted box and its ground-truth box, balanced by an equilibrium parameter $\beta$.
7. A terminal device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method according to any of claims 1 to 6 when executing the computer program.
8. A storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method according to any of claims 1 to 6.
CN202110326713.7A 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure Active CN112990050B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110326713.7A CN112990050B (en) 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110326713.7A CN112990050B (en) 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure

Publications (2)

Publication Number Publication Date
CN112990050A true CN112990050A (en) 2021-06-18
CN112990050B CN112990050B (en) 2021-10-08

Family

ID=76333846

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110326713.7A Active CN112990050B (en) 2021-03-26 2021-03-26 Monocular 3D target detection method based on lightweight characteristic pyramid structure

Country Status (1)

Country Link
CN (1) CN112990050B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332792A (en) * 2021-12-09 2022-04-12 苏州驾驶宝智能科技有限公司 Method and system for detecting three-dimensional scene target based on multi-scale fusion of key points
CN114821717A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
CN115063789A (en) * 2022-05-24 2022-09-16 中国科学院自动化研究所 3D target detection method and device based on key point matching
CN115661577A (en) * 2022-11-01 2023-01-31 吉咖智能机器人有限公司 Method, apparatus, and computer-readable storage medium for object detection
CN116403180A (en) * 2023-06-02 2023-07-07 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190662A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of three-dimensional vehicle detection method, system, terminal and storage medium returned based on key point
US20200151512A1 (en) * 2018-11-08 2020-05-14 Eduardo R. Corral-Soto Method and system for converting point cloud data for use with 2d convolutional neural networks
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111369617A (en) * 2019-12-31 2020-07-03 浙江大学 3D target detection method of monocular view based on convolutional neural network
CN111832655A (en) * 2020-07-16 2020-10-27 四川大学 Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN112069896A (en) * 2020-08-04 2020-12-11 河南科技大学 Video target tracking method based on twin network fusion multi-template features

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190662A (en) * 2018-07-26 2019-01-11 北京纵目安驰智能科技有限公司 A kind of three-dimensional vehicle detection method, system, terminal and storage medium returned based on key point
US20200151512A1 (en) * 2018-11-08 2020-05-14 Eduardo R. Corral-Soto Method and system for converting point cloud data for use with 2d convolutional neural networks
CN111369617A (en) * 2019-12-31 2020-07-03 浙江大学 3D target detection method of monocular view based on convolutional neural network
CN111291714A (en) * 2020-02-27 2020-06-16 同济大学 Vehicle detection method based on monocular vision and laser radar fusion
CN111832655A (en) * 2020-07-16 2020-10-27 四川大学 Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN112069896A (en) * 2020-08-04 2020-12-11 河南科技大学 Video target tracking method based on twin network fusion multi-template features

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZHAOXIN FAN, "PointFPN: A Frustum-based Feature Pyramid Network for 3D Object Detection", 2020 IEEE 32nd International Conference on Tools with Artificial Intelligence (ICTAI) *
REN Zhijun et al., "Mask R-CNN Object Detection Method Based on Improved Feature Pyramid" (基于改进特征金字塔的Mask R-CNN目标检测方法), Laser & Optoelectronics Progress (激光与光电子学进展) *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114332792A (en) * 2021-12-09 2022-04-12 苏州驾驶宝智能科技有限公司 Method and system for detecting three-dimensional scene target based on multi-scale fusion of key points
CN114842287A (en) * 2022-03-25 2022-08-02 中国科学院自动化研究所 Monocular three-dimensional target detection model training method and device of depth-guided deformer
CN114821717A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114821717B (en) * 2022-04-20 2024-03-12 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN115063789A (en) * 2022-05-24 2022-09-16 中国科学院自动化研究所 3D target detection method and device based on key point matching
CN115063789B (en) * 2022-05-24 2023-08-04 中国科学院自动化研究所 3D target detection method and device based on key point matching
CN115661577A (en) * 2022-11-01 2023-01-31 吉咖智能机器人有限公司 Method, apparatus, and computer-readable storage medium for object detection
CN115661577B (en) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 Method, apparatus and computer readable storage medium for object detection
CN116403180A (en) * 2023-06-02 2023-07-07 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning
CN116403180B (en) * 2023-06-02 2023-08-15 上海几何伙伴智能驾驶有限公司 4D millimeter wave radar target detection, tracking and speed measurement method based on deep learning

Also Published As

Publication number Publication date
CN112990050B (en) 2021-10-08

Similar Documents

Publication Publication Date Title
CN112990050B (en) Monocular 3D target detection method based on lightweight characteristic pyramid structure
US11120276B1 (en) Deep multimodal cross-layer intersecting fusion method, terminal device, and storage medium
CN110032969B (en) Method, apparatus, device, and medium for detecting text region in image
CN110516541B (en) Text positioning method and device, computer readable storage medium and computer equipment
Huang et al. GraNet: Global relation-aware attentional network for semantic segmentation of ALS point clouds
CN111950453A (en) Optional-shape text recognition method based on selective attention mechanism
CN113221740B (en) Farmland boundary identification method and system
CN113095152B (en) Regression-based lane line detection method and system
CN115393680B (en) 3D target detection method and system for multi-mode information space-time fusion in foggy weather scene
WO2023030182A1 (en) Image generation method and apparatus
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN112733942A (en) Variable-scale target detection method based on multi-stage feature adaptive fusion
CN113298032A (en) Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning
CN114782865B (en) Intersection vehicle positioning method and system based on multi-view and re-recognition
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S
CN117173657A (en) Pre-training method for automatic driving perception model
US20230162474A1 (en) Method of processing image, method of training model, and electronic device
CN114550163B (en) Imaging millimeter wave three-dimensional target detection method based on deformable attention mechanism
CN112101310B (en) Road extraction method and device based on context information and computer equipment
CN115170662A (en) Multi-target positioning method based on yolov3 and convolutional neural network
Wu et al. Application and Research of the Image Segmentation Algorithm in Remote Sensing Image Buildings
CN112396596A (en) Closed loop detection method based on semantic segmentation and image feature description
CN115082869B (en) Vehicle-road cooperative multi-target detection method and system for serving special vehicle
CN112396593B (en) Closed loop detection method based on key frame selection and local features
CN114387521B (en) Remote sensing image building extraction method based on attention mechanism and boundary loss

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant