CN114550160A - Automobile identification method based on three-dimensional point cloud data and traffic scene - Google Patents
- Publication number
- CN114550160A (application number CN202111358810.0A)
- Authority
- CN
- China
- Prior art keywords
- point cloud
- network
- column
- cloud data
- pseudo
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Links
- 238000000034 method Methods 0.000 title claims abstract description 39
- 238000001514 detection method Methods 0.000 claims abstract description 60
- 238000005070 sampling Methods 0.000 claims description 36
- 230000006870 function Effects 0.000 claims description 20
- 238000000605 extraction Methods 0.000 claims description 19
- 239000011159 matrix material Substances 0.000 claims description 12
- 238000010606 normalization Methods 0.000 claims description 10
- 230000007246 mechanism Effects 0.000 claims description 9
- 238000012545 processing Methods 0.000 claims description 8
- 230000006835 compression Effects 0.000 claims description 6
- 238000007906 compression Methods 0.000 claims description 6
- 230000002708 enhancing effect Effects 0.000 claims description 5
- 238000006243 chemical reaction Methods 0.000 claims description 3
- 230000007423 decrease Effects 0.000 claims description 3
- 230000005764 inhibitory process Effects 0.000 claims description 3
- 230000004807 localization Effects 0.000 claims description 3
- 238000011176 pooling Methods 0.000 claims description 3
- 238000002310 reflectometry Methods 0.000 claims description 3
- 238000012549 training Methods 0.000 claims description 3
- 239000013598 vector Substances 0.000 claims description 3
- 238000010586 diagram Methods 0.000 description 7
- 230000006399 behavior Effects 0.000 description 4
- 230000033001 locomotion Effects 0.000 description 4
- 230000004913 activation Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013075 data extraction Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 239000000203 mixture Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Image Analysis (AREA)
- Traffic Control Systems (AREA)
Abstract
The invention provides an automobile identification method based on three-dimensional point cloud data and a traffic scene. Through three feature modules (a multi-resolution column-by-column feature extraction network, a spatial attention-based convolution detection framework, and a compression-activation attention-based detection head), the invention realizes identification in traffic scenes from a point cloud data set and effectively improves the detection efficiency and detection precision for vehicles.
Description
Technical Field
The invention relates to the field of traffic detection, in particular to vehicle and pedestrian detection, and specifically to an automobile identification method based on three-dimensional point cloud data and a traffic scene.
Background
With the continuous development of artificial intelligence, sensors and control theory, automatic driving has drawn wide attention in academia and industry and has a bright application prospect. During automatic driving, a vehicle must detect and predict the behavior of surrounding objects such as vehicles and pedestrians. Current target detection methods based on two-dimensional RGB images cannot accurately recover information such as the space, position, depth and angle of an oncoming vehicle, so the driving motion of a vehicle cannot be planned and controlled from simple target azimuth information alone. The invention instead adopts three-dimensional point cloud data: every point in the point cloud carries feature information such as the position, distance and angle of a target object, and this data composition matches the real world better than a two-dimensional RGB image. Three-dimensional point cloud data is mainly generated by LiDAR (Light Detection and Ranging) sensors, also known as optical radar, whose working principle is to receive the reflections of laser beams emitted by the radar sensor. LiDAR offers long ranging distance, high precision and high reliability, and is widely applied in vehicle-mounted automatic driving. Current LiDAR manufacturers include Velodyne, IBEO, Quanergy, Silan Technology and other companies, of which Velodyne is the best known in the industry.
Using computer vision technology, researchers can extract the outline and shape information of vehicles and pedestrians to detect targets. For example, CN111507340A discloses a method for extracting target point cloud data from three-dimensional point cloud data, which includes: acquiring original three-dimensional point cloud data and denoising it to obtain denoised three-dimensional point cloud data; extracting intensity image data from the denoised point cloud; calling a preset target extraction algorithm to extract targets from the intensity image data, obtaining target intensity image data; extracting target three-dimensional point cloud data from the original point cloud according to the pixel coordinate values of the target intensity image; and calling a preset point cloud denoising algorithm to denoise the target point cloud and obtain the target point cloud data. Although this method utilizes three-dimensional point cloud data, the extracted behavior features and distance features are not further fused and no feature dimension reduction is performed, so the detection precision in traffic scenes cannot meet practical requirements.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to overcome the defects of the prior art, the invention provides an automobile identification method based on three-dimensional point cloud data and a traffic scene, which comprehensively judges whether vehicles exist in a target area by combining the position, outline and shape of the point cloud in space with the surrounding scene information, effectively improves the detection precision and detection effect, and provides accurate surrounding driving environment information for an automatic driving system.
The technical scheme adopted by the invention to solve this problem is as follows: an automobile identification method based on three-dimensional point cloud data and a traffic scene, comprising: 1) a multi-resolution column-by-column feature extraction network; 2) a spatial attention-based convolution detection framework; 3) a compression-activation attention-based detection head.
Further, the multi-resolution-based column-by-column feature extraction network comprises point cloud data processing, column feature extraction and pseudo-image feature extraction which are sequentially carried out.
Further, the point cloud data processing specifically comprises: a point I in the point cloud data is uniquely represented by the four dimensions x, y, z and r; the points in the point cloud are uniformly divided into grids on the x-y plane, and the grids form a set of columns, denoted p, with no height limit along the z axis; the four-dimensional feature (x, y, z, r) of each original input point in a column is enhanced into the nine-dimensional feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p), where r is the reflectivity of point I, the subscript c denotes the offset from the arithmetic mean of all points I in the column, and the subscript p denotes the offset from the column center in x and y.
Further, the column feature extraction specifically comprises: for the points I in each column, extracting the features within the column P with a point cloud network, acquiring the column features at three different resolutions (high, medium and low); each resolution controls the sparsity D by limiting the number P of non-empty columns per sample and the total number N of points I per column, generating a dense tensor of size T ∈ R^(D×P×N); the features of each point I in the column P are extracted with a point cloud network, each point in the column P passing through a linear layer, a batch normalization layer and a ReLU layer, outputting a tensor of size Z ∈ R^(C×P×N); the features are merged and stacked according to the original column positions to form a pseudo-image of size S ∈ R^(C×H×W), where the three resolutions generate the corresponding pseudo-images S_H, S_M, S_L.
Preferably, the frame tensor T is kept at a fixed size of 10000, with 10000 as the threshold: if a collected sample or column contains more than 10000 data points, the data are reduced to 10000 by random sampling; if it contains fewer than 10000, the tensor T is padded to 10000 with zeros.
Further, the pseudo-image feature extraction comprises sequentially applying a convolution operation for down-sampling and a deconvolution operation for up-sampling to extract the vehicle feature information in the pseudo-images S_H, S_M, S_L; each up-sampling and down-sampling step is followed by a batch normalization layer and a ReLU layer, and the up-sampled feature information of S_H, S_M, S_L is merged to generate a new point cloud pseudo-image S.
Further, the spatial attention-based convolution detection framework comprises: 1) extracting pseudo-image features using the 1C, 2C and 4C channels respectively; 2) enhancing the spatial information features using a spatial attention mechanism.
Further, the characteristics of the pseudo-map extracted by using the 1C, 2C and 4C channels respectively are as follows:
inputting the pseudo-image S into the detection framework using the region proposal network, wherein the detection framework consists of a down-sampling network Net_1 and an up-sampling network Net_2;
Down-sampling network Net1By convolution operations to become smaller and smallerThe spatial resolution 1C, 2C and 4C carry out down-sampling on the feature map, the down-sampling network is represented by a series of (S, L and F) blocks, wherein S represents a step length, F represents the number of output channels, and L represents the number of 3 multiplied by 3 two-dimensional convolutional layer layers, a batch normalization layer and a ReLU layer are connected behind each channel, the first convolution step in each layer is S/S _ in, so that the size of the detection network is kept to be S after the detection network receives the input of the step length S _ in; the subsequent convolution steps in each layer are all 1, and the number of channels in each layer is [64,128,256 ]]Down-sampling networks produce successively smaller spatial resolutions;
upsampling network Net2Performing up-sampling operation on feature maps with different resolutions by deconvolution, and performing up-sampling on network Net2Represented by (S _ in, S _ out, F), where S _ in is the initial step, S _ out is the termination step, and F is the final characteristic; the pseudo graph S respectively generates a pseudo graph characteristic graph F through the up-sampling network and the down-sampling network1、F2、F3。
Further, the method for enhancing the spatial information features using the spatial attention mechanism comprises the following steps: feeding the pseudo-image feature maps F generated by the network into a spatial attention module, which generates two new feature maps G_1 and G_2 from the feature map using two 1×1 convolutional layers;
where {G_1, G_2} ∈ R^(C×H×W); G_1 is reshaped to R^(C×(H×W)), and the transpose of G_1 is matrix-multiplied with G_1;
the spatial attention matrix W_sa ∈ R^((H×W)×(H×W)) is then computed with the Softmax function; this matrix explicitly encodes the spatially salient portions;
a feature map is generated by matrix-multiplying G_2 with W_sa;
finally, the scene target feature maps re-weighted by spatial attention at the three scales are merged and output.
Further, the compression-activation attention-based detection head re-weights the merged multi-scale feature map using a compression-activation attention mechanism;
during compression, global average pooling produces a channel-wise vector s ∈ R^C;
during activation, the module captures channel-by-channel dependencies:
se = ReLU(W_2 · δ(W_1 · s))
where δ(·) is the sigmoid function, ReLU(·) is the ReLU function, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)).
The above-mentioned compression-activation attention-based detection head has the following detection algorithm:
the output of the compression-activation attention detection head network is used for target detection by a single-shot multi-box detector; the single-shot multi-box detector network is divided into six modules: the first consists of the first five convolutional stages Conv1, 2, 3, 4 and 5 of VGG16, and the second converts the FC6 and FC7 fully connected layers of VGG16 into the Conv6 and Conv7 convolutional layers; the remaining four modules add the Conv8, Conv9, Conv10 and Conv11 convolutional layers to extract target information at different scales, and the method finally performs target classification detection and non-maximum suppression with position regression.
The above detection algorithm has the following loss function formula:
the bounding box of the real target is represented by (x, y, z, w, l, h, θ): its three-dimensional center, width, length, height and yaw angle. X_gt and X_a respectively represent the real target and the anchor, with regression residuals
Δx = (x_gt − x_a)/d_a, Δy = (y_gt − y_a)/d_a, Δz = (z_gt − z_a)/h_a,
Δw = log(w_gt/w_a), Δl = log(l_gt/l_a), Δh = log(h_gt/h_a), Δθ = sin(θ_gt − θ_a),
where d_a = √(w_a² + l_a²); the localization loss function is:
L_loc = Σ_{b∈(x,y,z,w,l,h,θ)} SmoothL1(Δb)
because the localization loss cannot distinguish whether the bounding box is flipped, a softmax classification loss L_dir over discretized directions is used to learn the bounding box direction;
the classification loss uses the focal loss function:
L_cls = −α_a(1 − p_a)^γ log p_a
where p_a is the class probability of an anchor; the overall loss function is:
L = (1/N_pos)(β_loc·L_loc + β_cls·L_cls + β_dir·L_dir)
where N_pos is the number of positive anchors. The loss function uses an Adam optimizer, and the learning rate decreases as the training period increases.
The invention, an automobile identification method based on three-dimensional point cloud data and traffic scenes, has the following advantages:
(1) it provides accurate surrounding driving environment information for an automatic driving system based on the point cloud data set;
(2) it improves detection precision and detection effect in the actual driving environment by using a spatial attention mechanism and multi-resolution joint detection;
(3) it improves the detection result of the single-shot multi-box detector algorithm through the compression-activation attention-based detection head, which re-weights the different channels in space.
Drawings
The invention is further illustrated with reference to the following figures and examples.
FIG. 1 is a system flow chart of the automobile identification method based on three-dimensional point cloud data and traffic scenes.
Fig. 2 is a schematic diagram of the multi-resolution-based column-by-column feature extraction network proposed in the present invention.
Fig. 3 is a schematic diagram of the spatial attention-based convolution detection framework proposed in the present invention.
Fig. 4 is a schematic diagram of a compression-activated attention based detection head as proposed in the present invention.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings. The drawings are simplified schematic diagrams illustrating the basic structure of the present invention only in a schematic manner, and thus show only the constitution related to the present invention, and directions and references (e.g., upper, lower, left, right, etc.) may be used only to help the description of the features in the drawings. The following detailed description is, therefore, not to be taken in a limiting sense, and the scope of the claimed subject matter is defined only by the appended claims and equivalents thereof.
As shown in fig. 1, the automobile identification method based on three-dimensional point cloud data and a traffic scene comprehensively considers the accurate ranging, high precision and rich data features of point cloud data sets, and proposes a method for detecting vehicles and pedestrians based on a point cloud data set, comprising the following steps:
1) a multi-resolution column-by-column feature extraction network;
2) a spatial attention-based convolution detection framework;
3) a compression-activated attention based detection head.
The method first detects with a single-shot multi-box detector, then uses pedestrian motion information to search for targets of interest, and extracts each target's motion sequence, surrounding traffic scene sequence and track position; the invention designs a three-dimensional convolutional neural network to process the motion sequence of a target of interest and obtain behavior features related to the pedestrian's intention to cross the road.
According to the invention, two weights derived from the elements of the local traffic scene around the pedestrian and from the vehicle speed are used to correct the pedestrian-vehicle distance, and the corrected distance is encoded by a multilayer perceptron to obtain a distance feature related to the pedestrian's intention to cross the road.
Finally, the behavior features and distance features are fused, the dimension of the fused features is reduced with a fully connected layer, and a softmax operation yields the result of whether the pedestrian crosses the road.
Fig. 2 shows a schematic diagram of the multi-resolution-based column-by-column feature extraction network. The different shades of gray in the figure represent features extracted at different scales; the scale in fig. 2 is the multi-resolution shown at the far left of the figure.
1) Point cloud data processing
Processing the point cloud data extracts the pedestrians of interest, reducing the time the algorithm spends on pedestrians that are not of interest. The invention augments each point I in the point cloud from (x, y, z, r) to (x, y, z, r, x_c, y_c, z_c, x_p, y_p) by calculation: the spatial coordinates of the point, the reflectivity r, the offsets (subscript c) from the arithmetic mean of all points I in the column, and the offsets (subscript p) from the column center in x and y.
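As a minimal sketch of this nine-dimensional augmentation (an illustrative Python/NumPy assumption, not code from the patent; the function name and array layout are hypothetical):

```python
import numpy as np

def decorate_points(points: np.ndarray, column_center_xy: np.ndarray) -> np.ndarray:
    """Augment the raw (x, y, z, r) points of one column into the
    nine-dimensional feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p).

    points: (N, 4) array holding x, y, z, r for the points I of one column.
    column_center_xy: (2,) x-y center of that column on the grid.
    """
    centroid = points[:, :3].mean(axis=0)             # arithmetic mean of all points in the column
    offsets_c = points[:, :3] - centroid              # x_c, y_c, z_c: offsets from the centroid
    offsets_p = points[:, :2] - column_center_xy      # x_p, y_p: offsets from the column center
    return np.hstack([points, offsets_c, offsets_p])  # (N, 9)
```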
2) Column feature extraction
For the points I in each column, the point cloud features within the column are extracted with a point cloud network at high, medium and low resolutions. The sparsity D is controlled by limiting the number of non-empty columns per sample and the total number N of points I per column, yielding a dense tensor of size T ∈ R^(D×P×N). When the data in a collected sample or column are redundant, the frame tensor T is kept at a fixed size by randomly sampling the retained data; when the data are too few, the tensor T is expanded by zero padding to maintain its size.
The point cloud network outputs a tensor of size Z ∈ R^(C×P×N); these tensors are merged and stacked according to the original column positions to form a pseudo-image of size S ∈ R^(C×H×W), where the high, medium and low resolutions generate the corresponding pseudo-images S_H, S_M, S_L.
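The per-column feature extraction and the scatter back to a pseudo-image might look as follows (a hedged PyTorch sketch; the 10000-point threshold follows the text above, while the tensor layouts, channel width C = 64, and integer `coords` grid indices are assumptions):

```python
import numpy as np
import torch
import torch.nn as nn

def fix_point_count(points: np.ndarray, n: int = 10000) -> np.ndarray:
    """Keep the frame tensor at a fixed size: random sampling above the
    10000-point threshold, zero padding below it."""
    if len(points) > n:
        keep = np.random.choice(len(points), n, replace=False)
        return points[keep]
    pad = np.zeros((n - len(points), points.shape[1]), dtype=points.dtype)
    return np.vstack([points, pad])

class ColumnFeatureNet(nn.Module):
    """Per-point Linear -> BatchNorm -> ReLU, then a max over the N points of
    each column, yielding one C-dimensional feature per column."""
    def __init__(self, in_dim: int = 9, out_dim: int = 64):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        self.bn = nn.BatchNorm1d(out_dim)

    def forward(self, columns: torch.Tensor) -> torch.Tensor:  # (P, N, D)
        x = self.linear(columns)                                # (P, N, C)
        x = torch.relu(self.bn(x.transpose(1, 2)).transpose(1, 2))
        return x.amax(dim=1)                                    # (P, C)

def scatter_to_pseudo_image(feats: torch.Tensor, coords: torch.Tensor,
                            h: int, w: int) -> torch.Tensor:
    """Stack the column features back at their original grid positions to
    form the pseudo-image S of size (C, H, W)."""
    c = feats.shape[1]
    canvas = feats.new_zeros(c, h * w)
    canvas[:, coords[:, 0] * w + coords[:, 1]] = feats.t()
    return canvas.view(c, h, w)
```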
3) Pseudo-graph feature extraction
For the high-, medium- and low-resolution pseudo-images S_H, S_M, S_L, the invention extracts the vehicle feature information by down-sampling with a convolution operation followed by up-sampling with a deconvolution operation; each up-sampling and down-sampling step includes a batch normalization layer and a ReLU layer. Finally, the three up-sampled feature maps are merged to generate a new point cloud pseudo-image S.
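A hedged sketch of this down-then-up-sampling fusion (assuming S_H, S_M, S_L arrive at full, 1/2 and 1/4 resolution and share a channel width c; both are illustrative choices, not values fixed by the patent):

```python
import torch
import torch.nn as nn

class PseudoImageFusion(nn.Module):
    """Each branch down-samples its pseudo-image with a stride-2 convolution,
    then up-samples with a deconvolution back to the resolution of S_H; every
    step is followed by BatchNorm + ReLU, and the up-sampled outputs are
    concatenated into the new pseudo-image S."""
    def __init__(self, c: int = 64):
        super().__init__()
        def branch(scale: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c, c, 3, stride=2, padding=1),                # down-sample by convolution
                nn.BatchNorm2d(c), nn.ReLU(),
                nn.ConvTranspose2d(c, c, 2 * scale, stride=2 * scale),  # up-sample by deconvolution
                nn.BatchNorm2d(c), nn.ReLU())
        # assumed input scales: S_H full, S_M 1/2, S_L 1/4 resolution
        self.b_h, self.b_m, self.b_l = branch(1), branch(2), branch(4)

    def forward(self, s_h, s_m, s_l):
        # concatenate the three up-sampled feature maps along the channel axis
        return torch.cat([self.b_h(s_h), self.b_m(s_m), self.b_l(s_l)], dim=1)
```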
Fig. 3 presents a schematic diagram of a spatial attention-based convolution detection framework. In fig. 3, the different shades of gray in the leftmost stitched bitmap represent the three features resulting from the upsampling.
4) Extracting pseudo-map features using 1C, 2C, 4C channels respectively
The invention achieves the vehicle detection task in real traffic situations through a detection method that extracts features through multiple channels. The pseudo-image S is fed into the region proposal network detection framework, which is divided into two parts: a down-sampling network and an up-sampling network. The down-sampling network down-samples the feature map by convolution operations at progressively smaller spatial resolutions (1C, 2C, 4C) and is represented by a series of (S, L, F) blocks, where S denotes the stride, F the number of output channels, and L the number of 3×3 two-dimensional convolutional layers. Each convolution is followed by a batch normalization layer and a ReLU layer; the first convolution in each block has stride S/S_in, ensuring the block still operates at stride S after receiving an input of stride S_in, and the subsequent convolutions all have stride 1. The channel numbers of the blocks are [64, 128, 256], so the down-sampling network produces successively smaller spatial resolutions. The up-sampling network Net_2 up-samples the feature maps of different resolutions by deconvolution and is represented by (S_in, S_out, F), where S_in is the initial stride, S_out the final stride, and F the final feature; like the down-sampling network, each up-sampling step is followed by a batch normalization layer and a ReLU layer. Passing the pseudo-image S through the down-sampling and up-sampling networks generates the pseudo-image feature maps F_1, F_2, F_3.
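Under the (S, L, F) convention just described, one block of each network might be sketched as follows (a non-authoritative PyTorch sketch; the example instantiation in the trailing comment is an assumption):

```python
import torch.nn as nn

def down_block(s_in: int, s: int, l: int, f_in: int, f: int) -> nn.Sequential:
    """One (S, L, F) block of the down-sampling network: the first 3x3
    convolution has stride S / S_in, the remaining L - 1 convolutions have
    stride 1, and each is followed by BatchNorm + ReLU."""
    layers = [nn.Conv2d(f_in, f, 3, stride=s // s_in, padding=1),
              nn.BatchNorm2d(f), nn.ReLU()]
    for _ in range(l - 1):
        layers += [nn.Conv2d(f, f, 3, stride=1, padding=1),
                   nn.BatchNorm2d(f), nn.ReLU()]
    return nn.Sequential(*layers)

def up_block(s_in: int, s_out: int, f_in: int, f: int) -> nn.Sequential:
    """One (S_in, S_out, F) block of the up-sampling network Net_2: a
    deconvolution from stride S_in back to stride S_out, then BatchNorm + ReLU."""
    k = s_in // s_out
    return nn.Sequential(nn.ConvTranspose2d(f_in, f, k, stride=k),
                         nn.BatchNorm2d(f), nn.ReLU())

# e.g. a second down-sampling block under the [64, 128, 256] channel scheme:
# down_block(s_in=1, s=2, l=6, f_in=64, f=128)
```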
5) Enhancing spatial information features using spatial attention mechanism
The pseudo-image feature maps F_1, F_2, F_3 are fed into a spatial attention module, which first uses two 1×1 convolutional layers to generate two new feature maps G_1 and G_2, where {G_1, G_2} ∈ R^(C×H×W). G_1 is reshaped to R^(C×(H×W)), and the transpose of G_1 is matrix-multiplied with G_1. The spatial attention matrix W_sa ∈ R^((H×W)×(H×W)) is then computed with the Softmax function; this matrix explicitly encodes the spatially salient portions. A feature map is then generated by matrix-multiplying G_2 with W_sa. Finally, the scene target feature maps re-weighted by spatial attention at the three scales are merged and output.
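A sketch of this spatial attention computation (hedged PyTorch; the residual addition of the input is an assumption, not stated in the text):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Two 1x1 convolutions produce G_1 and G_2; W_sa = Softmax(G_1^T G_1)
    encodes the spatially salient parts, and G_2 · W_sa yields the
    re-weighted feature map."""
    def __init__(self, c: int):
        super().__init__()
        self.to_g1 = nn.Conv2d(c, c, 1)
        self.to_g2 = nn.Conv2d(c, c, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:        # f: (B, C, H, W)
        b, c, h, w = f.shape
        g1 = self.to_g1(f).view(b, c, h * w)                   # G_1 reshaped to (C, HxW)
        g2 = self.to_g2(f).view(b, c, h * w)
        w_sa = torch.softmax(g1.transpose(1, 2) @ g1, dim=-1)  # (HxW, HxW) attention matrix
        out = (g2 @ w_sa).view(b, c, h, w)                     # spatially re-weighted features
        return out + f                                         # assumed residual connection
```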
Fig. 4 shows a schematic diagram of the detection head based on SE (compression-activation) attention. The different shades of gray in fig. 4 represent different features.
6) Detection head based on compression-activation network
The merged multi-scale feature map is re-weighted using a compression-activation attention mechanism, implemented through compression and activation operations. In the compression operation, global average pooling produces a channel-wise vector s ∈ R^C. In the activation phase, the module captures channel-by-channel dependencies:
se = ReLU(W_2 · δ(W_1 · s))
where δ(·) is the sigmoid function, ReLU(·) is the ReLU function, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r)).
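A sketch of this compression-activation re-weighting, following the formula above literally (note the text places ReLU outermost and the sigmoid δ innermost, the reverse of the more common SE formulation; the reduction ratio r = 16 is an assumption):

```python
import torch
import torch.nn as nn

class SqueezeExcitation(nn.Module):
    """Global average pooling compresses the map to s ∈ R^C, then
    se = ReLU(W_2 · δ(W_1 · s)) re-weights the channels."""
    def __init__(self, c: int, r: int = 16):
        super().__init__()
        self.w1 = nn.Linear(c, c // r, bias=False)  # W_1 ∈ R^((C/r) × C)
        self.w2 = nn.Linear(c // r, c, bias=False)  # W_2 ∈ R^(C × (C/r))

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, H, W)
        s = x.mean(dim=(2, 3))                               # compression: s ∈ R^C per sample
        se = torch.relu(self.w2(torch.sigmoid(self.w1(s))))  # activation, per the formula above
        return x * se.unsqueeze(-1).unsqueeze(-1)            # channel-wise re-weighting
```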
7) Detection algorithm
The invention uses a single-shot multi-box detector on the output of the compression-activation network for target detection; the single-shot multi-box detector offers high detection speed and high detection precision. The method introduces the idea of anchors, can adapt to multi-scale target detection tasks, and suits the large scale variation characteristic of point cloud data. The single-shot multi-box detector network is divided into six modules: the first consists of the first five convolutional stages Conv1, 2, 3, 4 and 5 of VGG16, after which the FC6 and FC7 fully connected layers of VGG16 are converted into the Conv6 and Conv7 convolutional layers. On this basis, four further modules with Conv8, Conv9, Conv10 and Conv11 convolutional layers are added to extract target information at different scales. Finally, the method performs target classification detection and non-maximum suppression with position regression.
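The six-module trunk might be sketched as follows (a hedged sketch; the dilated Conv6, the extra-block channel widths and strides follow the usual SSD recipe and are assumptions, and the first VGG convolution would need adapting to the channel count of the re-weighted pseudo-image):

```python
import torch
import torch.nn as nn
import torchvision

class SSDTrunk(nn.Module):
    """VGG16 Conv1-Conv5, FC6/FC7 recast as Conv6/Conv7, plus the added
    Conv8-Conv11 blocks that supply multi-scale feature maps."""
    def __init__(self):
        super().__init__()
        self.conv1_5 = torchvision.models.vgg16(weights=None).features  # Conv1-Conv5
        self.conv6 = nn.Conv2d(512, 1024, 3, padding=6, dilation=6)     # FC6 as a convolution
        self.conv7 = nn.Conv2d(1024, 1024, 1)                           # FC7 as a convolution

        def extra(c_in, c_mid, c_out, stride):
            return nn.Sequential(nn.Conv2d(c_in, c_mid, 1), nn.ReLU(),
                                 nn.Conv2d(c_mid, c_out, 3, stride=stride, padding=1),
                                 nn.ReLU())
        self.extras = nn.ModuleList([extra(1024, 256, 512, 2),  # Conv8
                                     extra(512, 128, 256, 2),   # Conv9
                                     extra(256, 128, 256, 2),   # Conv10
                                     extra(256, 128, 256, 2)])  # Conv11

    def forward(self, x):
        x = torch.relu(self.conv7(torch.relu(self.conv6(self.conv1_5(x)))))
        feats = [x]                  # feature maps fed to the multi-box heads
        for block in self.extras:
            x = block(x)
            feats.append(x)          # one extra scale per added module
        return feats
```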
The bounding box of the real target is represented by (x, y, z, w, l, h, θ): its three-dimensional center, width, length, height and yaw angle. X_gt and X_a respectively represent the real target and the anchor, with regression residuals
Δx = (x_gt − x_a)/d_a, Δy = (y_gt − y_a)/d_a, Δz = (z_gt − z_a)/h_a,
Δw = log(w_gt/w_a), Δl = log(l_gt/l_a), Δh = log(h_gt/h_a), Δθ = sin(θ_gt − θ_a),
where d_a = √(w_a² + l_a²). The localization loss function is:
L_loc = Σ_{b∈(x,y,z,w,l,h,θ)} SmoothL1(Δb)
Because the localization loss cannot distinguish whether the bounding box is flipped, a softmax classification loss L_dir over discretized directions is used to learn the bounding box direction.
The classification loss uses the focal loss function:
L_cls = −α_a(1 − p_a)^γ log p_a
where p_a is the class probability of an anchor. The overall loss function is:
L = (1/N_pos)(β_loc·L_loc + β_cls·L_cls + β_dir·L_dir)
where N_pos is the number of positive anchors. The loss function uses an Adam optimizer, and the learning rate decreases as the training period increases.
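A hedged sketch of computing this overall loss (the β weights and the focal parameters α, γ follow common PointPillars-style settings and are assumptions, as are the tensor layouts):

```python
import torch
import torch.nn.functional as F

def detection_loss(loc_pred, loc_target, cls_pred, cls_target,
                   dir_pred, dir_target, n_pos: int,
                   beta_loc: float = 2.0, beta_cls: float = 1.0, beta_dir: float = 0.2,
                   alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """SmoothL1 localization loss over the (x, y, z, w, l, h, θ) residuals,
    focal classification loss, and a softmax (cross-entropy) loss over the
    discretized box directions, normalized by the number of positive anchors."""
    loc = F.smooth_l1_loss(loc_pred, loc_target, reduction='sum')
    p = torch.sigmoid(cls_pred)
    pt = torch.where(cls_target == 1, p, 1 - p)        # probability of the true class
    alpha_t = torch.where(cls_target == 1,
                          torch.full_like(p, alpha), torch.full_like(p, 1 - alpha))
    focal = (-alpha_t * (1 - pt) ** gamma * pt.clamp(min=1e-6).log()).sum()
    direction = F.cross_entropy(dir_pred, dir_target, reduction='sum')
    return (beta_loc * loc + beta_cls * focal + beta_dir * direction) / max(n_pos, 1)
```

Such a loss could be minimized with torch.optim.Adam together with a decaying learning-rate schedule, matching the training description above.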
In light of the foregoing description of the preferred embodiment of the present invention, many modifications and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The technical scope of the present invention is not limited to the content of the specification, and must be determined according to the scope of the claims.
Claims (10)
1. An automobile identification method based on three-dimensional point cloud data and a traffic scene, characterized in that the method comprises the following steps:
1) a multi-resolution column-by-column feature extraction network;
2) a spatial attention-based convolution detection framework;
3) a compression-activated attention based detection head.
2. The automobile identification method based on the three-dimensional point cloud data and the traffic scene as claimed in claim 1, characterized in that: the multi-resolution-based column-by-column feature extraction network comprises point cloud data processing, column feature extraction and pseudo-image feature extraction which are sequentially carried out.
3. The automobile identification method based on the three-dimensional point cloud data and the traffic scene as claimed in claim 2, characterized in that: specifically, a point I in the point cloud data is uniquely represented by the four dimensions x, y, z and r; the points in the point cloud are uniformly divided into grids on the x-y plane, and the grids form a set of columns, denoted p, with no height limit along the z axis; the four-dimensional feature (x, y, z, r) of each original input point in a column is enhanced into the nine-dimensional feature (x, y, z, r, x_c, y_c, z_c, x_p, y_p), where r is the reflectivity of point I, the subscript c denotes the offset from the arithmetic mean of all points I in the column, and the subscript p denotes the offset from the column center in x and y.
4. The method for recognizing the automobile based on the three-dimensional point cloud data and the traffic scene as claimed in claim 3, wherein: the column feature extraction specifically comprises: for the points I in each column, extracting the features within the column P with a point cloud network, acquiring the column features at three different resolutions (high, medium and low); each resolution controls the sparsity D by limiting the number P of non-empty columns per sample and the total number N of points I per column, generating a dense tensor of size T ∈ R^(D×P×N);
extracting the features of each point I in the column P with a point cloud network, passing each point in the column P through a linear layer, a batch normalization layer and a ReLU layer, and outputting a tensor of size Z ∈ R^(C×P×N);
the features are merged and stacked according to the original column positions to form a pseudo-image of size S ∈ R^(C×H×W), where the three resolutions generate the corresponding pseudo-images S_H, S_M, S_L.
5. The method for identifying the automobile based on the three-dimensional point cloud data and the traffic scene as claimed in claim 4, wherein: the frame tensor T is kept at a fixed size of 10000; if the data within a collected sample or column are fewer than 10000, the tensor T is filled to 10000 by zero padding.
6. The method for identifying the automobile based on the three-dimensional point cloud data and the traffic scene as claimed in claim 4, wherein: the pseudo-image feature extraction comprises sequentially applying a convolution operation for down-sampling and a deconvolution operation for up-sampling to extract the vehicle feature information in the pseudo-images S_H, S_M, S_L; each up-sampling and down-sampling step is followed by a batch normalization layer and a ReLU layer, and the up-sampled feature information of S_H, S_M, S_L is merged to generate a new point cloud pseudo-image S.
7. The method for identifying the automobile based on the three-dimensional point cloud data and the traffic scene as claimed in claim 6, wherein: the spatial attention-based convolution detection framework comprises:
1) extracting pseudo-image features using the 1C, 2C and 4C channels respectively;
2) enhancing the spatial information features using a spatial attention mechanism.
8. The method for identifying the automobile based on the three-dimensional point cloud data and the traffic scene as claimed in claim 7, wherein: the characteristics of the pseudo-map extracted by using the 1C, 2C and 4C channels are as follows:
inputting the pseudo-image S into the detection framework using the region proposal network, wherein the detection framework consists of a down-sampling network Net_1 and an up-sampling network Net_2;
the down-sampling network Net_1 down-samples the feature map by convolution operations at progressively smaller spatial resolutions 1C, 2C and 4C; the down-sampling network is represented by a series of (S, L, F) blocks, where S denotes the stride, F the number of output channels, and L the number of 3×3 two-dimensional convolutional layers, each followed by a batch normalization layer and a ReLU layer; the first convolution in each block has stride S/S_in, so that the block still operates at stride S after receiving an input of stride S_in; the subsequent convolutions in each block all have stride 1, the channel numbers of the blocks are [64, 128, 256], and the down-sampling network produces successively smaller spatial resolutions;
the up-sampling network Net_2 up-samples the feature maps of different resolutions by deconvolution and is represented by (S_in, S_out, F), where S_in is the initial stride, S_out is the final stride, and F is the final feature; the up-sampling network is likewise followed by a batch normalization layer and a ReLU layer, and passing the pseudo-image S through the up-sampling and down-sampling networks generates the pseudo-image feature maps F_1, F_2, F_3.
9. The method of claim 8, wherein: the method for enhancing the spatial information features using the spatial attention mechanism comprises the following steps: feeding the pseudo-image feature maps F generated by the network into a spatial attention module, which generates two new feature maps G_1 and G_2 from the feature map using two 1×1 convolutional layers;
where {G_1, G_2} ∈ R^(C×H×W); G_1 is reshaped to R^(C×(H×W)), and the transpose of G_1 is matrix-multiplied with G_1;
the spatial attention matrix W_sa ∈ R^((H×W)×(H×W)) is then computed with the Softmax function; this matrix explicitly encodes the spatially salient portions;
a feature map is generated by matrix-multiplying G_2 with W_sa;
finally, the scene target feature maps re-weighted by spatial attention at the three scales are merged and output.
10. The method for identifying automobiles based on three-dimensional point cloud data and traffic scenes as claimed in claim 9, wherein: the compression-activation attention-based detection head re-weights the merged multi-scale feature map using a compression-activation attention mechanism;
during compression, global average pooling produces a channel-wise vector s ∈ R^C;
during activation, the module captures channel-by-channel dependencies:
se = ReLU(W_2 · δ(W_1 · s))
where δ(·) is the sigmoid function, ReLU(·) is the ReLU function, W_1 ∈ R^((C/r)×C) and W_2 ∈ R^(C×(C/r));
The above-mentioned compression-activation attention-based detection head has the following detection algorithm:
the output of the compression-activation attention detection head network is used for target detection by a single-shot multi-box detector; the single-shot multi-box detector network is divided into six modules: the first consists of the first five convolutional stages Conv1, 2, 3, 4 and 5 of VGG16, and the second converts the FC6 and FC7 fully connected layers of VGG16 into the Conv6 and Conv7 convolutional layers; the remaining four modules add the Conv8, Conv9, Conv10 and Conv11 convolutional layers to extract target information at different scales, and the method finally performs target classification detection and non-maximum suppression with position regression;
the above detection algorithm has the following loss function formula:
the bounding box of the real object is represented by (x, y, z, w, l, h, θ): its three-dimensional center, width, length, height and yaw angle;
X_gt and X_a respectively represent the real target and the anchor, with regression residuals Δx = (x_gt − x_a)/d_a, Δy = (y_gt − y_a)/d_a, Δz = (z_gt − z_a)/h_a, Δw = log(w_gt/w_a), Δl = log(l_gt/l_a), Δh = log(h_gt/h_a) and Δθ = sin(θ_gt − θ_a), where d_a = √(w_a² + l_a²); the localization loss function is L_loc = Σ_{b∈(x,y,z,w,l,h,θ)} SmoothL1(Δb);
because the localization loss cannot distinguish whether the bounding box is flipped, a softmax classification loss L_dir over discretized directions is used to learn the bounding box direction;
the classification loss uses the focal loss function L_cls = −α_a(1 − p_a)^γ log p_a, where p_a is the class probability of an anchor;
the overall loss function is L = (1/N_pos)(β_loc·L_loc + β_cls·L_cls + β_dir·L_dir), where N_pos is the number of positive anchors;
the loss function uses an Adam optimizer, and the learning rate decreases as the training period increases.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111358810.0A CN114550160A (en) | 2021-11-17 | 2021-11-17 | Automobile identification method based on three-dimensional point cloud data and traffic scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111358810.0A CN114550160A (en) | 2021-11-17 | 2021-11-17 | Automobile identification method based on three-dimensional point cloud data and traffic scene |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114550160A true CN114550160A (en) | 2022-05-27 |
Family
ID=81668805
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111358810.0A Withdrawn CN114550160A (en) | 2021-11-17 | 2021-11-17 | Automobile identification method based on three-dimensional point cloud data and traffic scene |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114550160A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117974990A (en) * | 2024-03-29 | 2024-05-03 | 之江实验室 | Point cloud target detection method based on attention mechanism and feature enhancement structure |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Fernandes et al. | Point-cloud based 3D object detection and classification methods for self-driving applications: A survey and taxonomy | |
CN111429514B (en) | Laser radar 3D real-time target detection method integrating multi-frame time sequence point cloud | |
CN111242041B (en) | Laser radar three-dimensional target rapid detection method based on pseudo-image technology | |
Harley et al. | Simple-bev: What really matters for multi-sensor bev perception? | |
CN113269040B (en) | Driving environment sensing method combining image recognition and laser radar point cloud segmentation | |
CN109934163A (en) | A kind of aerial image vehicle checking method merged again based on scene priori and feature | |
Ohgushi et al. | Road obstacle detection method based on an autoencoder with semantic segmentation | |
CN113095152B (en) | Regression-based lane line detection method and system | |
CN112990065B (en) | Vehicle classification detection method based on optimized YOLOv5 model | |
CN114445430B (en) | Real-time image semantic segmentation method and system for lightweight multi-scale feature fusion | |
CN112288667B (en) | Three-dimensional target detection method based on fusion of laser radar and camera | |
CN115187964A (en) | Automatic driving decision-making method based on multi-sensor data fusion and SoC chip | |
CN117274749B (en) | Fused 3D target detection method based on 4D millimeter wave radar and image | |
CN115238758A (en) | Multi-task three-dimensional target detection method based on point cloud feature enhancement | |
US12079970B2 (en) | Methods and systems for semantic scene completion for sparse 3D data | |
CN116704304A (en) | Multi-mode fusion target detection method of mixed attention mechanism | |
CN113378647B (en) | Real-time track obstacle detection method based on three-dimensional point cloud | |
CN114550160A (en) | Automobile identification method based on three-dimensional point cloud data and traffic scene | |
CN114218999A (en) | Millimeter wave radar target detection method and system based on fusion image characteristics | |
CN117935088A (en) | Unmanned aerial vehicle image target detection method, system and storage medium based on full-scale feature perception and feature reconstruction | |
CN114048536A (en) | Road structure prediction and target detection method based on multitask neural network | |
Zhang et al. | Full-scale Feature Aggregation and Grouping Feature Reconstruction Based UAV Image Target Detection | |
CN114648698A (en) | Improved 3D target detection system based on PointPillars | |
CN115984568A (en) | Target detection method in haze environment based on YOLOv3 network | |
Chen et al. | Real-time road object segmentation using improved light-weight convolutional neural network based on 3D LiDAR point cloud |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20220527 |