CN116229452B - Point cloud three-dimensional target detection method based on improved multi-scale feature fusion - Google Patents

Point cloud three-dimensional target detection method based on improved multi-scale feature fusion

Info

Publication number
CN116229452B
CN116229452B CN202310238232.XA CN202310238232A CN116229452B CN 116229452 B CN116229452 B CN 116229452B CN 202310238232 A CN202310238232 A CN 202310238232A CN 116229452 B CN116229452 B CN 116229452B
Authority
CN
China
Prior art keywords
feature
point cloud
convolution
dimensional
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310238232.XA
Other languages
Chinese (zh)
Other versions
CN116229452A (en)
Inventor
郑琛
马淑康
常琳
蒋华涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Internet Of Things Innovation Center Co ltd
Original Assignee
Wuxi Internet Of Things Innovation Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Internet Of Things Innovation Center Co ltd filed Critical Wuxi Internet Of Things Innovation Center Co ltd
Priority to CN202310238232.XA priority Critical patent/CN116229452B/en
Publication of CN116229452A publication Critical patent/CN116229452A/en
Application granted granted Critical
Publication of CN116229452B publication Critical patent/CN116229452B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The application discloses a point cloud three-dimensional target detection method based on improved multi-scale feature fusion, and relates to the technical field of three-dimensional target detection. The method comprises the steps of acquiring a 3D point cloud image to be detected, inputting it into a point cloud three-dimensional target detection model obtained through pre-training, and outputting a target detection result. The detection model comprises a dimension reduction module, a coding module, an improved multi-scale feature fusion module and a detection head module. The dimension reduction module performs voxel dimension reduction and feature reinforcement processing on the input image; an attention mechanism introduced into the coding module strengthens the network's ability to extract target position information; the multi-scale feature fusion module makes structural adjustments against the information loss caused by up-sampling during feature fusion and the aliasing effect caused by repeated fusion, and additionally performs context feature enhancement to enrich the feature information of large, medium and small targets; and the detection head module outputs the final target detection result. While maintaining the detection precision on large targets, the proposed three-dimensional target detection model achieves good accuracy, real-time performance and generalization on small targets.

Description

Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
Technical Field
The application relates to the technical field of three-dimensional target detection, in particular to a point cloud three-dimensional target detection method based on improved multi-scale feature fusion.
Background
Three-dimensional target detection is an important basis for visual perception, motion prediction and planning in automatic driving. In the field of autonomous vehicles in particular, obtaining the three-dimensional information of target obstacles improves the accuracy of target analysis and plays a vital role in path planning and control in subsequent automatic driving scenes. Ensuring reliable and stable three-dimensional target detection is therefore a critical task in intelligent driving systems. Small targets, however, often lack sufficient visual information relative to conventionally sized targets, which makes them difficult to distinguish from the background or from similar targets; as a result there is a significant gap between the detection performance on small and large targets, with small-target performance often only about half that on large targets. Small target detection therefore has wide application value and important research significance.
Current research on small target detection mainly follows two lines: context learning and multi-scale learning. Context learning models the co-occurrence relationships between objects and the scene, and between objects themselves, which improves detection performance, especially for objects with weak appearance features such as small targets. However, small targets carry less usable information than conventional targets, so good features are harder to extract, and as the number of network layers grows the feature information and position information of small targets are gradually lost, making them hard for the network to detect. These characteristics mean that small targets need deep semantic information and shallow representation information at the same time; multi-scale learning combines the two and is an effective strategy for improving small target detection performance. However, the 1×1 convolution and linear upsampling applied to feature maps in multi-scale learning cause information loss, and repeated feature fusion causes feature aliasing, both of which make it difficult to further improve small target detection based on multi-scale learning.
Disclosure of Invention
Aiming at the problems and the technical requirements, the inventor provides a point cloud three-dimensional target detection method based on improved multi-scale feature fusion, which mainly solves the problem of low target detection accuracy in three-dimensional target detection. The technical scheme of the application is as follows:
a point cloud three-dimensional target detection method based on improved multi-scale feature fusion comprises the following steps:
acquiring a 3D point cloud image to be detected, which is shot for a vehicle driving road;
inputting a 3D point cloud image to be detected into a point cloud three-dimensional target detection model obtained through pre-training, and outputting a target detection result, wherein the target detection result comprises three-dimensional information of a corresponding target and the class of the target;
the point cloud three-dimensional target detection model sequentially comprises, from input to output, a dimension reduction module, a coding module, an improved multi-scale feature fusion module and a detection head module; the dimension reduction module is used for performing voxel dimension reduction and feature reinforcement processing on the input 3D point cloud image to be detected to obtain a feature-reinforced 2D pseudo image; the coding module is used for performing feature extraction on the feature-reinforced 2D pseudo image to obtain a plurality of first feature maps; the improved multi-scale feature fusion module comprises a sub-pixel jump fusion unit and a channel attention guide unit, wherein the sub-pixel jump fusion unit is used for performing channel enhancement and upsampling on the first feature maps so as to construct a feature pyramid, and the channel attention guide unit is used for optimizing the final integrated features of different scales in the integrated map output by the feature pyramid; the detection head module is used for adjusting the number of channels of the feature maps output by the improved multi-scale feature fusion module and outputting the final target detection result.
The beneficial technical effects of the application are as follows:
in the method, the point cloud three-dimensional target detection model provided by the application comprises a dimension reduction module, a coding module, an improved multi-scale feature fusion module and a detection head module. The dimension reduction module comprises a voxelization branch and a feature reinforcement branch, and the small-target feature information lost in the voxelization dimension reduction process is supplemented through the feature reinforcement branch to obtain a feature-reinforced 2D pseudo image; the coding module uses a plurality of cascaded convolutions and introduces an attention mechanism unit, enhancing the network's ability to extract target position information; the improved multi-scale feature fusion module introduces a feature pyramid, makes structural adjustments against the information loss caused by up-sampling during feature fusion and the aliasing effect caused by repeated fusion, and additionally performs context feature enhancement to further enrich the feature information of large, medium and small targets; and the detection head module outputs the final target detection result. While maintaining the detection precision on large targets, the proposed three-dimensional target detection model achieves good accuracy, real-time performance and generalization on small targets.
Drawings
Fig. 1 is a schematic diagram of a three-dimensional object detection model of a point cloud provided by the application.
Fig. 2 is a block diagram of a CA attention mechanism unit in an encoding module provided by the present application.
FIG. 3 is a block diagram of an improved multi-scale feature fusion module provided by the present application.
FIG. 4 is a block diagram of sub-pixel convolution in a multi-scale feature fusion module provided by the present application.
FIG. 5 is a block diagram of a subpixel context enhancement unit in a multi-scale feature fusion module provided by the present application.
FIG. 6 is a block diagram of a channel attention director unit in a multi-scale feature fusion module provided by the present application.
Detailed Description
The following describes the embodiments of the present application further with reference to the drawings.
The embodiment discloses a point cloud three-dimensional target detection method based on improved multi-scale feature fusion, which is shown in combination with fig. 1 and comprises the following steps:
and acquiring a 3D point cloud image to be detected, which is shot for a vehicle driving road, inputting the 3D point cloud image to be detected into a point cloud three-dimensional target detection model obtained by training in advance, and outputting a target detection result, wherein the target detection result comprises three-dimensional information of a corresponding target, the category of the target and the confidence coefficient. Optionally, the target categories are set as three categories of automobiles, riding persons and pedestrians.
The point cloud three-dimensional target detection model is built and trained in advance, and as shown in fig. 1, the point cloud three-dimensional target detection model sequentially comprises a dimension reduction module, a coding module, an improved multi-scale feature fusion module and a detection head module from input to output.
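As a rough illustration of how these four modules chain together, the following PyTorch-style sketch shows a possible forward pass; the submodule classes and their interfaces are placeholders assumed for illustration and not the patent's actual implementation.

```python
import torch.nn as nn

class PointCloud3DDetector(nn.Module):
    """Sketch of the four-stage pipeline described above (hypothetical interfaces)."""
    def __init__(self, dim_reduction, encoder, msff, head):
        super().__init__()
        self.dim_reduction = dim_reduction  # 3D point cloud -> feature-reinforced 2D pseudo image
        self.encoder = encoder              # cascaded conv blocks + CA attention -> first feature maps
        self.msff = msff                    # sub-pixel jump fusion + channel attention guide
        self.head = head                    # adjusts channels, outputs boxes / classes / directions

    def forward(self, points):
        pseudo_image = self.dim_reduction(points)   # (B, C, H, W)
        feats = self.encoder(pseudo_image)          # list of first feature maps [C2, C3, C4]
        fused = self.msff(pseudo_image, feats)      # multi-scale fused feature maps
        return self.head(fused)                     # final detection results
```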
<1> The dimension reduction module is used for performing voxel dimension reduction and feature reinforcement processing on the input 3D point cloud image to be detected to obtain a feature-reinforced 2D pseudo image. The dimension reduction module comprises two parts, a voxelization branch and a feature reinforcement branch. The voxelization branch divides the 3D point cloud image to be detected into a number of columns (pillars) and, after extracting features from the points contained in each pillar, reduces the dimension to a first 2D pseudo image. The feature reinforcement branch downsamples the 3D point cloud image to be detected using PointNet++ and, after extracting features from the downsampled points, reduces the dimension to a second 2D pseudo image. Finally, the first 2D pseudo image and the second 2D pseudo image are feature-fused to obtain the feature-reinforced 2D pseudo image. Optionally, the acquired 3D point cloud image to be detected may be denoised before being input to the dimension reduction module.
In the voxelization branch, the 3D point cloud image to be detected is first divided into a number of columns (pillars) of equal volume based on a set voxel size and the number N of point clouds per voxel, and the three-dimensional tensor (D, N, P) is taken for each pillar, where D is the four-dimensional information (x, y, z, r_1) of the points in each pillar, x, y and z being the position information of the point cloud and r_1 the reflectivity; P = H × W is the size of the top view of the 3D point cloud image to be detected. In this embodiment the voxel length, width and height are set to 0.16 m, 0.16 m and 4 m respectively, and the number N of points per voxel is set to 32: if a single voxel contains more than 32 points it is sampled down, and if it contains fewer than 32 points it is padded with zeros, so that the division of the 3D point cloud image to be detected is realized. Next, the three-dimensional tensor (D, N, P) is input to a simple PointNet for feature learning, which learns C channels from the D dimension and converts the tensor to (C, N, P). The N dimension is then max-pooled to obtain (C, P), finally yielding a 2D pseudo image with height and width H × W and C channels, i.e. the voxel features. In this embodiment C = 64.
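A minimal PyTorch sketch of this pillar encoding step is given below, assuming the (D, N, P) tensor has already been assembled by the voxelization; the simple PointNet is approximated here by a single 1×1 convolution with BN and ReLU, which is an assumption rather than the exact network of the embodiment.

```python
import torch.nn as nn

class PillarFeatureNet(nn.Module):
    """Learn C channels from the D point features and max-pool over the N points per pillar."""
    def __init__(self, d_in=4, c_out=64):
        super().__init__()
        # a 1x1 conv over (D, N, P) acts as a shared per-point linear layer (simple PointNet)
        self.pointnet = nn.Sequential(
            nn.Conv2d(d_in, c_out, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, pillars, H, W):
        # pillars: (B, D, N, P) with P = H * W
        x = self.pointnet(pillars)          # (B, C, N, P)
        x = x.max(dim=2).values             # max-pool over the N points -> (B, C, P)
        return x.view(x.size(0), -1, H, W)  # reshape to the 2D pseudo image (B, C, H, W)
```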
In the feature reinforcement branch, PointNet++ is used to downsample the 3D point cloud image to be detected twice; sampling points are selected by farthest point sampling (FPS, Farthest Point Sampling), and the first and second downsampling passes retain successively smaller numbers of points relative to the original point cloud. A dimension reduction operation is then applied to the points obtained from each of the two downsampling passes; the specific steps are the same as the dimension reduction in the voxelization branch, except that the number of points per voxel is set to 16 and 8 respectively. The feature of each downsampled point is an aggregation of the features of the nearby points before sampling, so the downsampling process is both a feature aggregation process and a feature extraction process for the points belonging to targets of each size: compared with the points before sampling their number is reduced, but the feature information of the sampled points is richer and more salient. Finally, the second 2D pseudo images obtained from the two sampling-and-dimension-reduction passes are fused with the first 2D pseudo image output by the voxelization branch, which plays the feature reinforcement role; the 2D pseudo image fused from these three pseudo images is the output of the dimension reduction module.
In the conventional approach, the PointPillars model is used for voxelization: it converts the point cloud into a pseudo image through pillar conversion followed by dimension reduction, so the generated pseudo image loses part of the feature information about small targets, and the later encoding network can only extract features from a pseudo image whose information is already lost, which both raises the difficulty of feature extraction and makes a good detection effect on small targets hard to obtain. The feature reinforcement branch of the dimension reduction module provided by the application first reduces the number of point clouds through downsampling while refining the feature information of each point; each downsampled point is more salient, which reduces the influence of other useless points during pseudo-image formation, so that the generated pseudo image carries specific detail feature information compared with a pseudo image generated directly from the original point cloud. This feature information is then supplemented into the 2D pseudo image generated from the original point cloud, supplementing the feature information of small targets, further enriching the pseudo-image features and facilitating feature extraction by the later coding module.
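For reference, the farthest point sampling step used by the feature reinforcement branch can be sketched as follows; this is a plain PyTorch illustration of FPS itself (a hypothetical helper, not the embodiment's PointNet++ set-abstraction code).

```python
import torch

def farthest_point_sampling(points, n_samples):
    """points: (N, 3) xyz coordinates; returns indices of n_samples points chosen by FPS."""
    n = points.size(0)
    selected = torch.zeros(n_samples, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()   # start from a random point
    for i in range(n_samples):
        selected[i] = farthest
        d = ((points - points[farthest]) ** 2).sum(dim=1)
        dist = torch.minimum(dist, d)             # distance to the nearest already-selected point
        farthest = torch.argmax(dist).item()      # pick the point farthest from the selected set
    return selected
```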
<2> The encoding module is used for performing feature extraction on the feature-reinforced 2D pseudo image to obtain a plurality of first feature maps. The encoding module comprises a plurality of cascaded convolution units (blocks) and a CA (Coordinate Attention) attention mechanism unit at the output of each convolution unit. In this embodiment, the encoding module includes three blocks and a CA attention mechanism unit following each block, each block consisting of four 3×3 convolution kernels, a BN layer and a nonlinear operation layer. The CA attention mechanism encodes channel relationships and long-range dependencies with precise position information, which helps the network locate the object of interest more accurately in the detection task, thereby enhancing the representational ability of the feature map after each block of the encoding network. As shown in FIG. 2, the CA attention mechanism unit operates in two steps: Coordinate information embedding and Coordinate Attention generation.
<2.1> Coordinate information embedding
Global pooling is generally used to globally encode the spatial information for channel attention, but it compresses the global spatial information into a channel descriptor and therefore makes it difficult to preserve position information. To enable the attention unit to capture long-range spatial interactions with precise position information, the CA attention mechanism unit decomposes global pooling according to equations (1) and (2), converting it into a pair of one-dimensional feature encoding operations. Specifically, in each CA attention mechanism unit, for the feature map x_c(i, j) output by the corresponding convolution unit (block), each channel is encoded in the horizontal direction using a pooling window of size (H, 1); the output of the c-th channel at height h is then expressed as:

z_c^h(h) = (1/W) · Σ_{0 ≤ i < W} x_c(h, i)    (1)

Each channel is encoded in the vertical direction using a pooling window of size (1, W); the output of the c-th channel at width w is then expressed as:

z_c^w(w) = (1/H) · Σ_{0 ≤ j < H} x_c(j, w)    (2)

where x_c(h, i) is the i-th column slice at height h of the feature map output by the convolution unit, and x_c(j, w) is the j-th row slice at width w of the feature map output by the convolution unit.
These two transformations enable the CA attention mechanism unit to capture long-range dependencies along one spatial direction and preserve precise position information along the other, which helps the network locate objects of interest more accurately.
<2.2> Coordinate Attention generation
Through the transformation in Coordinate information embedding, a global receptive field is obtained and precise position information is encoded. Coordinate Attention generation makes full use of the captured position information so that the region of interest can be captured accurately, and it also captures the relationships among channels effectively. After the transformation in the information embedding step, the outputs z^h and z^w are concatenated and then transformed by a first convolution transform function F_1, expressed as:

f = δ(F_1([z^h, z^w]))    (3)

where [·, ·] is the concatenation (concatenate) operation along the spatial dimension, δ(·) is a nonlinear activation function, and f is the intermediate feature map that encodes spatial information in the horizontal and vertical directions.
The intermediate feature map f is decomposed along the spatial dimension into two separate tensors f^h and f^w, and a second convolution transform function F_h and a third convolution transform function F_w transform f^h and f^w, respectively, into two tensors g^h and g^w with the same number of channels as the input, expressed as:

g^h = σ(F_h(f^h)),  g^w = σ(F_w(f^w))    (4)

where σ(·) is the sigmoid activation function; optionally, F_1, F_h and F_w are all 1×1 convolution transform functions. To reduce the complexity and computational overhead of the model, an appropriate reduction ratio r (e.g., 32) is typically used to reduce the number of channels of f. The outputs g^h and g^w are then expanded and used as attention weights.
Finally, the feature map x_c(i, j) output by the convolution unit (block) is fused with g^h and g^w and output as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)    (5)

Previous attention mechanisms usually compress the global spatial channel information by global pooling, but this makes position information hard to preserve, and position information is critical in a target detection task: if the encoding module can mark the positions of important regions in the feature map through an attention mechanism, the difficulty of subsequent processing is greatly reduced, and even direct decoding can give a good detection result. The CA attention mechanism unit converts global pooling into pooling along the X and Y directions, concatenates the pooling results of the two directions, and applies a convolution to encode the long-range spatial interactions with precise position information; the result is then split into X-direction and Y-direction attention weights that act on the feature map to mark the positions of important regions. With a CA attention mechanism unit added after each block of the encoding module, the feature map output by the preceding block has its global position information re-weighted by the CA attention mechanism unit, which to some extent also helps the feature extraction of the subsequent blocks.
<3> The improved multi-scale feature fusion module, shown in fig. 3, comprises a sub-pixel jump fusion unit, a sub-pixel context enhancement unit and a channel attention guide unit.
<3.1> The sub-pixel jump fusion unit replaces the conventional 1×1 convolution and linear upsampling; it is used to perform channel enhancement and upsampling on the first feature maps to construct a feature pyramid (Feature Pyramid Network, FPN), thereby reducing the information loss caused by channel reduction. In the sub-pixel jump fusion unit, sub-pixel convolution is used to upsample the first feature map of the current layer to the size of the first feature map of the previous layer; at the same time, a 1×1 convolution is applied to the first feature map of the previous layer, and the result is feature-fused with the result of the sub-pixel convolution to obtain the output of the sub-pixel jump fusion unit. As shown in fig. 3, C1 is the 2D pseudo image output by the dimension reduction module, and C2, C3 and C4 are the outputs of the three blocks (including CA) of the encoding module, so that C1 to C4 form a bottom-up feature pyramid. C4 and C3 are upsampled to the sizes of C3 and C2 by sub-pixel convolution while a 1×1 convolution is applied to C3 and C2, and the results are feature-fused to obtain F3 and F2. In the figure F3 = P3, and P2 is obtained by a splicing operation on P3 and F2, so that P3 to P2 form a top-down feature pyramid. Finally, features are extracted from P3 and P2 and fused into the integrated map I.
Sub-pixel convolution is shown in fig. 4. With a preset sampling magnification, the input feature map is first channel-expanded through a 1×1 convolution, and the expanded channel information of each pixel is then rearranged into the spatial neighborhood of that pixel, achieving the size expansion and realizing scale-doubling upsampling. Optionally, after the rearrangement the number of channels becomes the original number divided by the square of the sampling magnification, while each spatial dimension becomes the original size multiplied by the sampling magnification.
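A sketch of the sub-pixel upsampling and the jump (skip) fusion built on it, using nn.PixelShuffle; the element-wise addition used for the fusion and the ×2 scale are assumptions consistent with, but not dictated by, the text, and the class names are hypothetical.

```python
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """Expand channels with a 1x1 conv, then rearrange channels into space (sub-pixel convolution)."""
    def __init__(self, in_channels, out_channels, scale=2):
        super().__init__()
        self.expand = nn.Conv2d(in_channels, out_channels * scale * scale, kernel_size=1)
        self.shuffle = nn.PixelShuffle(scale)   # channels /= scale^2, each spatial dim *= scale

    def forward(self, x):
        return self.shuffle(self.expand(x))

class SubPixelJumpFusion(nn.Module):
    """Fuse the sub-pixel-upsampled lower-resolution map with the 1x1-projected higher-resolution map."""
    def __init__(self, low_channels, high_channels, out_channels):
        super().__init__()
        self.up = SubPixelUpsample(low_channels, out_channels, scale=2)
        self.lateral = nn.Conv2d(high_channels, out_channels, kernel_size=1)

    def forward(self, low, high):
        return self.up(low) + self.lateral(high)   # element-wise feature fusion (assumed addition)
```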
<3.2> The sub-pixel context enhancement unit is used to perform context-information feature enhancement on the first feature map of the last layer and fuse the extracted context features into the integrated map I. As shown in fig. 5, the sub-pixel context enhancement unit has three branches. In the first branch, the first feature map C4 of the last layer is upsampled through a first convolution layer followed by sub-pixel convolution, locally extracting the context information of C4; optionally, the first convolution layer is a 3×3 convolution that extracts local information and transforms the channel size to enable the sub-pixel upsampling. In the second branch, C4 is first downsampled through a max-pooling layer and then upsampled by sub-pixel convolution, extracting the context information of C4 over a large receptive field; optionally, the max-pooling layer is a 3×3 max-pooling layer that downsamples C4 to w×h, and the sub-pixel convolution of this branch has a sampling magnification of 4. In the third branch, C4 passes through a global average pooling layer, feature compression and broadcasting in sequence, realizing global extraction of the context information of C4; that is, C4 is globally average-pooled, the resulting 1×1×8C feature is compressed to 1×1×C, and it is broadcast to a feature map of size 4w×4h. Finally, the feature maps of the three branches are aggregated into an integrated mapping by element-wise summation, giving the integrated map I of the preset size. The structure and effect of the sub-pixel convolution mentioned in this paragraph are as described in <3.1> and are not repeated here.
As described above, the sub-pixel context enhancement unit obtains richer context information by extracting local and global context and larger receptive fields, and fuses the extracted context features into the integrated map I, so that the semantic information in the highest-level features is fully exploited in the FPN and the lack of context information is alleviated; the feature representations at three feature scales effectively enlarge the receptive field of C4 and improve the representational ability of I.
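The three-branch structure could look roughly like the following sketch; the kernel sizes follow the text (3×3 convolution, 3×3 max pooling, global average pooling), while the channel counts, the interpolation used to align branch sizes and the exact shuffle factors are placeholders assumed for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class SubPixelContextEnhancement(nn.Module):
    """Extract local, large-receptive-field and global context from C4 and sum them into the integrated map."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # branch 1: 3x3 conv then sub-pixel upsampling (local context)
        self.local = nn.Sequential(
            nn.Conv2d(in_channels, out_channels * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),
        )
        # branch 2: 3x3 max pooling then sub-pixel upsampling with a larger factor (large receptive field)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.up = nn.Sequential(
            nn.Conv2d(in_channels, out_channels * 16, kernel_size=1),
            nn.PixelShuffle(4),
        )
        # branch 3: global average pooling then channel compression (global context)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.compress = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, c4, target_size):
        b1 = F.interpolate(self.local(c4), size=target_size)        # align to the preset size
        b2 = F.interpolate(self.up(self.pool(c4)), size=target_size)
        b3 = self.compress(self.gap(c4)).expand(-1, -1, *target_size)  # broadcast the global feature
        return b1 + b2 + b3                                          # element-wise sum into the integrated map
```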
<3.3> The channel attention guide unit is used to optimize the final integrated features of different scales in the integrated map I output by the feature pyramid, so that the aliasing effect can be alleviated with only a small computational burden. This matters especially for small targets, which are particularly susceptible to aliasing noise, so reducing the influence of the aliasing effect is critical. As shown in fig. 6, in the channel attention guide unit the integrated map I output by the feature pyramid undergoes average pooling and max pooling respectively, the outputs are passed through two fully connected (FC) layers that do not share weights and are then combined, and finally a sigmoid activation function gives the corresponding attention weight CA(x); the specific transformation is given by equation (6). As shown in fig. 3, the attention weight CA(x) is multiplied by the integrated maps R2, R3 and R4 of different scales, and the products are output as the feature maps of the improved multi-scale feature fusion module.
CA(x) = σ[fc_1(AvgPool(x′)) + fc_2(MaxPool(x′))]    (6)
where x′ is the feature information of the integrated map I, fc_1 and fc_2 are two fully connected layers that do not share weights, AvgPool is average pooling and MaxPool is max pooling.
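Equation (6) translates into a small module; the sketch below assumes the two fully connected layers are implemented as 1×1 convolutions applied to the pooled vectors and that x′ is the integrated map itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttentionGuide(nn.Module):
    """CA(x) = sigmoid(fc1(AvgPool(x')) + fc2(MaxPool(x'))), applied to each scale R2, R3, R4."""
    def __init__(self, channels):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels, kernel_size=1)   # first FC layer (as a 1x1 conv)
        self.fc2 = nn.Conv2d(channels, channels, kernel_size=1)   # second FC layer, weights not shared

    def forward(self, integrated, scales):
        avg = F.adaptive_avg_pool2d(integrated, 1)                # global average pooling of x'
        mx = F.adaptive_max_pool2d(integrated, 1)                 # global max pooling of x'
        weight = torch.sigmoid(self.fc1(avg) + self.fc2(mx))      # attention weight CA(x), eq. (6)
        return [weight * r for r in scales]                       # re-weight the multi-scale integrated maps
```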
<4> The detection head module adjusts, through convolution, the number of channels of the feature maps output by the improved multi-scale feature fusion module so as to correspond one-to-one with the finally required target detection information, and outputs the final target detection result. The target detection results output by the detection head module correspond respectively to the three-dimensional information of the target, the class of the target and the confidence.
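A minimal sketch of such a detection head is given below, assuming a PointPillars-style anchor-based head with one 1×1 convolution per output (class scores, seven-dimensional box residuals, two direction bins); this layout is an assumption consistent with the losses described next, not a confirmed detail of the patent.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """1x1 convs that map the fused feature map to class scores, 7-dim boxes and direction bins per anchor."""
    def __init__(self, in_channels, num_anchors, num_classes):
        super().__init__()
        self.cls_head = nn.Conv2d(in_channels, num_anchors * num_classes, kernel_size=1)
        self.box_head = nn.Conv2d(in_channels, num_anchors * 7, kernel_size=1)   # (x, y, z, l, w, h, theta)
        self.dir_head = nn.Conv2d(in_channels, num_anchors * 2, kernel_size=1)   # 2-way heading classification

    def forward(self, x):
        return self.cls_head(x), self.box_head(x), self.dir_head(x)
```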
The point cloud three-dimensional target detection model is trained in advance on a data set, so the training process of the model precedes its use. The KITTI data set is used to train the network, and the hyperparameters of model training are set before training as follows: 80 epochs are trained with a batch_size of 4, the initial learning rate is set to 0.001, and an AdamW optimizer is used for parameter optimization with β_1 = 0.95, β_2 = 0.99 and a weight decay of 0.01. During training of the built point cloud three-dimensional target detection model, the loss function L comprises a localization loss L_loc, a classification loss L_cls and a direction loss L_dir, expressed as:
L = β_loc · L_loc + β_cls · L_cls + β_dir · L_dir    (7)
where β_loc, β_cls and β_dir are the weights of the respective losses, set in this embodiment as β_loc = 2, β_cls = 1, β_dir = 0.2.
The localization loss L_loc constrains the residual Δb between the predicted value and the ground-truth value using the SmoothL1 loss function, expressed as:

L_loc = Σ_{b ∈ (x, y, z, l, w, h, θ)} SmoothL1(Δb)    (8)

where the seven-dimensional vector (x, y, z, l, w, h, θ) is the target three-dimensional information output by the detection head module: (x, y, z) are the coordinates of the center point of the three-dimensional bounding box, (l, w, h) are its length, width and height, and θ is the yaw angle.
The classification loss L_cls uses the Focal Loss function, expressed as:

L_cls = -α(1 - p)^γ · log p    (9)
where p is the predicted probability of a given target class, and α and γ are two hyperparameters, set to α = 0.25 and γ = 2 to adapt the loss to the imbalance between positive and negative samples and to hard-to-classify samples.
Although the localization loss L_loc already accounts for the angle, it cannot distinguish detection boxes that overlap but point in opposite directions, defined as flipped boxes: for example, vehicles heading north and heading south occupy identical annotated 3D boxes. The direction loss L_dir therefore uses a cross-entropy loss function to distinguish the heading direction of the target, expressed as:

L_dir = -Σ_i p(x_i) · log q(x_i)    (10)

where p(x_i) and q(x_i) are the probabilities of the prediction box in the two directions.
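Assuming the residuals Δb, class probabilities and direction logits of the matched anchors have already been gathered, the combined loss of equation (7) with the weights of this embodiment can be sketched as follows.

```python
import torch
import torch.nn.functional as F

def detection_loss(box_residuals, cls_prob, cls_target, dir_logits, dir_target,
                   beta_loc=2.0, beta_cls=1.0, beta_dir=0.2, alpha=0.25, gamma=2.0):
    """L = beta_loc*L_loc + beta_cls*L_cls + beta_dir*L_dir (eq. 7), with the weights of this embodiment."""
    # localization: SmoothL1 on the residual between predicted and ground-truth (x, y, z, l, w, h, theta)
    l_loc = F.smooth_l1_loss(box_residuals, torch.zeros_like(box_residuals))
    # classification: focal loss, -alpha * (1 - p)^gamma * log(p) for the true class (eq. 9)
    p = cls_prob.gather(-1, cls_target.unsqueeze(-1)).clamp(min=1e-6).squeeze(-1)
    l_cls = (-alpha * (1.0 - p) ** gamma * torch.log(p)).mean()
    # direction: cross entropy over the two heading bins (eq. 10)
    l_dir = F.cross_entropy(dir_logits, dir_target)
    return beta_loc * l_loc + beta_cls * l_cls + beta_dir * l_dir
```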
Parameter optimization of the model structure is performed with the AdamW optimizer on the basis of the loss function L to obtain an initial model; the processing of the training samples by the point cloud three-dimensional target detection model is the same as the processing of the 3D point cloud image to be detected described above and is not repeated here. After training reaches the total number of epochs, the initial model is tested on a test set to obtain a model score indicating the generalization of the model, and the point cloud three-dimensional target detection model whose score reaches the score threshold is obtained. In this embodiment, the model is evaluated by average precision (AP) on the test set for cars, cyclists and pedestrians under the three difficulty levels easy, moderate and hard. In conclusion, the point cloud three-dimensional target detection model provided by the application achieves a modest improvement in the detection accuracy of other targets and, at the same time, better detection of small targets.
The above is only a preferred embodiment of the present application, and the present application is not limited to the above examples. It should be understood that other modifications and variations directly derived or conceived by those skilled in the art without departing from the spirit and concept of the present application are deemed to be included within the scope of protection of the present application.

Claims (9)

1. A point cloud three-dimensional object detection method based on improved multi-scale feature fusion, the method comprising:
acquiring a 3D point cloud image to be detected, which is shot for a vehicle driving road;
inputting the 3D point cloud image to be detected into a point cloud three-dimensional target detection model obtained through pre-training, and outputting a target detection result, wherein the target detection result comprises three-dimensional information of a corresponding target and the class of the target;
the point cloud three-dimensional target detection model sequentially comprises, from input to output, a dimension reduction module, a coding module, an improved multi-scale feature fusion module and a detection head module; the dimension reduction module is used for performing voxel dimension reduction and feature reinforcement processing on the input 3D point cloud image to be detected to obtain a feature-reinforced 2D pseudo image; the dimension reduction module comprises a voxelization branch and a feature reinforcement branch, wherein the voxelization branch is used for dividing the 3D point cloud image to be detected into a plurality of columns and, after extracting features from the points contained in each column, reducing the dimension to a first 2D pseudo image; the feature reinforcement branch is used for downsampling the 3D point cloud image to be detected and, after extracting features from the downsampled points, reducing the dimension to a second 2D pseudo image; the first 2D pseudo image and the second 2D pseudo image are feature-fused to obtain the feature-reinforced 2D pseudo image;
the coding module is used for performing feature extraction on the feature-reinforced 2D pseudo image to obtain a plurality of first feature maps; the improved multi-scale feature fusion module comprises a sub-pixel jump fusion unit and a channel attention guide unit, wherein the sub-pixel jump fusion unit is used for performing channel enhancement and upsampling on the first feature maps so as to construct a feature pyramid, and the channel attention guide unit is used for optimizing the final integrated features of different scales in the integrated map output by the feature pyramid; the detection head module is used for adjusting the number of channels of the feature maps output by the improved multi-scale feature fusion module and outputting the final target detection result.
2. The method for detecting a point cloud three-dimensional object based on improved multi-scale feature fusion according to claim 1, wherein the method for performing feature extraction and dimension reduction on points contained in each pillar to form a first 2D pseudo image is the same as the method for performing feature extraction and dimension reduction on points obtained by downsampling to form a second 2D pseudo image, and the method comprises the following steps:
taking for each column its three-dimensional tensor (D, N, P), where D is the four-dimensional information (x, y, z, r_1) of the points within each column, x, y and z being the position information of the point cloud and r_1 the reflectivity; P = H × W is the size of the top view of the 3D point cloud image to be detected;
inputting a three-dimensional tensor (D, N, P) into the PointNet for feature learning, and learning C channels from the D dimension and converting the C channels into (C, N, P); and then carrying out maximum pooling operation on the N dimensions to obtain (C, P), and finally obtaining a 2D pseudo image with the height and width of H multiplied by W and the channel number of C.
3. The method for detecting a point cloud three-dimensional object based on improved multi-scale feature fusion according to claim 1, wherein the encoding module comprises a plurality of cascaded convolution units and CA attention mechanism units respectively positioned at the output end of each convolution unit; in each CA attention mechanism unit, for the feature map x_c(i, j) output by the corresponding convolution unit, each channel is encoded in the horizontal direction using a pooling window of size (H, 1), and the output of the c-th channel at height h is expressed as:

z_c^h(h) = (1/W) · Σ_{0 ≤ i < W} x_c(h, i)

each channel is encoded in the vertical direction using a pooling window of size (1, W), and the output of the c-th channel at width w is expressed as:

z_c^w(w) = (1/H) · Σ_{0 ≤ j < H} x_c(j, w)

where x_c(h, i) is the i-th column slice at height h of the feature map output by the convolution unit, and x_c(j, w) is the j-th row slice at width w of the feature map output by the convolution unit;
the outputs z^h and z^w are concatenated and then transformed by a first convolution transform function F_1, expressed as: f = δ(F_1([z^h, z^w]));

where [·, ·] is the concatenation operation along the spatial dimension, δ(·) is a nonlinear activation function, and f is an intermediate feature map that encodes spatial information in the horizontal and vertical directions;
the intermediate feature map f is decomposed along the spatial dimension into two separate tensors f^h and f^w, and a second convolution transform function F_h and a third convolution transform function F_w transform f^h and f^w, respectively, into two tensors g^h and g^w with the same number of channels, expressed as: g^h = σ(F_h(f^h)), g^w = σ(F_w(f^w)); where σ(·) is the sigmoid activation function;

the feature map x_c(i, j) output by the convolution unit is fused with g^h and g^w and output as:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j).
4. the method for detecting a point cloud three-dimensional object based on improved multi-scale feature fusion of claim 1, wherein the method for channel enhancement and upsampling of the first feature map comprises:
and up-sampling the first feature map of the current layer to the size of the first feature map of the previous layer by adopting sub-pixel convolution, simultaneously carrying out 1X 1 convolution operation on the first feature map of the previous layer, and carrying out feature fusion on the result after operation and the result after sub-pixel convolution to obtain the output of the sub-pixel jump fusion unit.
5. The method for detecting a point cloud three-dimensional object based on improved multi-scale feature fusion according to claim 1, wherein the improved multi-scale feature fusion module further comprises a sub-pixel context enhancement unit, and the sub-pixel context enhancement unit is used for performing feature enhancement of context information on a first feature map of a last layer and fusing the extracted context features into an integrated map.
6. The method for detecting a three-dimensional object of a point cloud based on improved multi-scale feature fusion according to claim 5, wherein the method for performing feature enhancement of context information on the first feature map of the last layer and fusing the extracted context features into an integrated map comprises:
the sub-pixel context enhancement unit comprises three branches: for the first branch, the first feature map of the last layer is upsampled through a first convolution layer and sub-pixel convolution in sequence, so that the context information of the first feature map of the last layer is extracted locally; for the second branch, the first feature map of the last layer is first downsampled through a max-pooling layer and then upsampled by sub-pixel convolution, so that the context information of the first feature map of the last layer is extracted over a large receptive field; for the third branch, the first feature map of the last layer passes through a global average pooling layer, feature compression and broadcasting in sequence, realizing global extraction of the context information of the first feature map of the last layer; finally, the feature maps of the three branches are aggregated into an integrated mapping by element-wise summation, giving an integrated map of a preset size.
7. The method for point cloud three-dimensional object detection based on improved multi-scale feature fusion of claim 4 or 6, wherein the method for upsampling using sub-pixel convolution comprises:
the input characteristic diagram is firstly subjected to channel expansion through 1X 1 convolution, and then channel information after expansion of each pixel point is filled near the corresponding pixel point, so that the purpose of size expansion is achieved, and double-scale up-sampling is realized.
8. The method for detecting the point cloud three-dimensional target based on the improved multi-scale feature fusion according to claim 1, wherein in the channel attention guide unit, the integrated map output by the feature pyramid undergoes average pooling and max pooling operations respectively, the outputs are each passed through one of two fully connected layers that do not share weights and are then combined, and finally a sigmoid activation function is applied to obtain the corresponding attention weight; the attention weight is multiplied by the integrated maps of different scales respectively, and the products are output as the feature maps of the improved multi-scale feature fusion module.
9. The method for point cloud three-dimensional object detection based on improved multiscale feature fusion of any of claims 1-8, further comprising:
during training of the built point cloud three-dimensional target detection model, the loss function L comprises a localization loss L_loc, a classification loss L_cls and a direction loss L_dir, expressed as:

L = β_loc · L_loc + β_cls · L_cls + β_dir · L_dir

where β_loc, β_cls and β_dir are the weights of the corresponding losses;

the localization loss L_loc constrains the residual Δb between the predicted value and the ground-truth value using the SmoothL1 loss function, expressed as:

L_loc = Σ_{b ∈ (x, y, z, l, w, h, θ)} SmoothL1(Δb)

where the seven-dimensional vector (x, y, z, l, w, h, θ) is the target three-dimensional information output by the detection head module, (x, y, z) being the coordinates of the center point of the three-dimensional bounding box, (l, w, h) its length, width and height, and θ the yaw angle;
the classification loss L_cls uses the Focal Loss function, expressed as: L_cls = -α(1 - p)^γ · log p;

where p is the predicted probability of a given target class, and α and γ are two hyperparameters;
the direction loss L_dir uses a cross-entropy loss function to distinguish the heading direction of the target, expressed as:

L_dir = -Σ_i p(x_i) · log q(x_i)

where p(x_i) and q(x_i) are the probabilities of the prediction box in the two directions;
and carrying out parameter optimization on a model structure by adopting an AdamW optimizer on the basis of the loss function L to obtain an initial model.
CN202310238232.XA 2023-03-13 2023-03-13 Point cloud three-dimensional target detection method based on improved multi-scale feature fusion Active CN116229452B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310238232.XA CN116229452B (en) 2023-03-13 2023-03-13 Point cloud three-dimensional target detection method based on improved multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310238232.XA CN116229452B (en) 2023-03-13 2023-03-13 Point cloud three-dimensional target detection method based on improved multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN116229452A CN116229452A (en) 2023-06-06
CN116229452B true CN116229452B (en) 2023-11-17

Family

ID=86587192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310238232.XA Active CN116229452B (en) 2023-03-13 2023-03-13 Point cloud three-dimensional target detection method based on improved multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN116229452B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117392181A (en) * 2023-12-13 2024-01-12 安徽蔚来智驾科技有限公司 Motion information prediction method, computer equipment, storage medium and intelligent equipment
CN117392393A (en) * 2023-12-13 2024-01-12 安徽蔚来智驾科技有限公司 Point cloud semantic segmentation method, computer equipment, storage medium and intelligent equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210474A (en) * 2019-04-30 2019-09-06 北京市商汤科技开发有限公司 Object detection method and device, equipment and storage medium
CN113379709A (en) * 2021-06-16 2021-09-10 浙江工业大学 Three-dimensional target detection method based on sparse multi-scale voxel characteristic fusion
CN115187964A (en) * 2022-09-06 2022-10-14 中诚华隆计算机技术有限公司 Automatic driving decision-making method based on multi-sensor data fusion and SoC chip

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110210474A (en) * 2019-04-30 2019-09-06 北京市商汤科技开发有限公司 Object detection method and device, equipment and storage medium
CN113379709A (en) * 2021-06-16 2021-09-10 浙江工业大学 Three-dimensional target detection method based on sparse multi-scale voxel characteristic fusion
CN115187964A (en) * 2022-09-06 2022-10-14 中诚华隆计算机技术有限公司 Automatic driving decision-making method based on multi-sensor data fusion and SoC chip

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CE-FPN: Enhancing Channel Information for Object Detection; Yihao Luo et al.; 《arXiv:2103.10643v1》; pp. 1-9 *
Coordinate Attention for Efficient Mobile Network Design; Qibin Hou et al.; 《arXiv:2103.02907v1》; pp. 1-10 *
PointPillars: Fast Encoders for Object Detection from Point Clouds; Alex H. Lang et al.; 《arXiv:1812.05784v2》; pp. 1-9 *
SPAN: siampillars attention network for 3D object tracking in point clouds; Yi Zhuang and Haitao Zhao; 《International Journal of Machine Learning and Cybernetics》; pp. 2105-2117 *
Multi-target detection method for 3D point clouds of outdoor scenes based on PointNet++; 吴登禄 et al.; 《自动化与信息工程》; pp. 5-10 *

Also Published As

Publication number Publication date
CN116229452A (en) 2023-06-06

Similar Documents

Publication Publication Date Title
Wang et al. SFNet-N: An improved SFNet algorithm for semantic segmentation of low-light autonomous driving road scenes
CN110298262B (en) Object identification method and device
CN109522966B (en) Target detection method based on dense connection convolutional neural network
Yang et al. Pixor: Real-time 3d object detection from point clouds
CN116229452B (en) Point cloud three-dimensional target detection method based on improved multi-scale feature fusion
CN114202672A (en) Small target detection method based on attention mechanism
CN111814623A (en) Vehicle lane departure visual detection method based on deep neural network
CN111832655A (en) Multi-scale three-dimensional target detection method based on characteristic pyramid network
CN112347859A (en) Optical remote sensing image saliency target detection method
CN111563909A (en) Semantic segmentation method for complex street view image
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN110781850A (en) Semantic segmentation system and method for road recognition, and computer storage medium
CN110910327B (en) Unsupervised deep completion method based on mask enhanced network model
CN114359130A (en) Road crack detection method based on unmanned aerial vehicle image
Wang et al. MCF3D: Multi-stage complementary fusion for multi-sensor 3D object detection
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116205962B (en) Monocular depth estimation method and system based on complete context information
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN115346071A (en) Image classification method and system for high-confidence local feature and global feature learning
Wang et al. TF-SOD: a novel transformer framework for salient object detection
CN116824543A (en) Automatic driving target detection method based on OD-YOLO
CN114821341A (en) Remote sensing small target detection method based on double attention of FPN and PAN network
CN116563553B (en) Unmanned aerial vehicle image segmentation method and system based on deep learning
CN117079163A (en) Aerial image small target detection method based on improved YOLOX-S

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant