CN114842313A - Target detection method and device based on pseudo-point cloud, electronic equipment and storage medium


Info

Publication number: CN114842313A (application CN202210508913.9A; granted as CN114842313B)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: point cloud, candidate frame, feature, pseudo, target
Inventors: 陈禹行, 彭微, 李雪, 范圣印
Assignee: Beijing Yihang Yuanzhi Technology Co Ltd
Legal status: Active (granted)

Classifications

    • G06V10/80 — Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06N3/045 — Neural networks; combinations of networks
    • G06N3/048 — Neural networks; activation functions
    • G06N3/084 — Learning methods; backpropagation, e.g. using gradient descent
    • G06V10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/26 — Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region; detection of occlusion
    • G06V10/40 — Extraction of image or video features
    • G06V10/764 — Image or video recognition using classification, e.g. of video objects
    • G06V10/82 — Image or video recognition using neural networks
    • G06V20/58 — Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V2201/07 — Target detection


Abstract

The disclosure provides a target detection method and apparatus based on a pseudo point cloud, an electronic device and a storage medium. The target detection method based on the pseudo point cloud includes the following steps: acquiring first pseudo point cloud data of a first image; acquiring 3D candidate frame information of the first image; acquiring first pseudo point cloud data of a 3D candidate frame; obtaining second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame; obtaining, through feature coding and according to the second pseudo point cloud data of the 3D candidate frame, a first feature vector of the target corresponding to the 3D candidate frame, the feature coding making the target features represented by the second pseudo point cloud data consistent in distribution with the target features represented by laser point cloud data; and obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame. The method and the device can improve the detection precision of pseudo point cloud targets and make the detection results more accurate.

Description

Target detection method and device based on pseudo-point cloud, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer vision, and in particular, to a target detection method and apparatus based on a pseudo point cloud, an electronic device, and a storage medium.
Background
A pseudo point cloud based three-dimensional (3D) target detection algorithm takes a pseudo point cloud as input and predicts the position, size and category of objects in three-dimensional space by extracting and analyzing features of the pseudo point cloud; it therefore plays an important role in fields such as automatic driving and robotics. The pseudo point cloud is generally converted from the depth map of an RGB image, and its data representation is consistent with that of a laser point cloud. Because the point cloud data form can better represent the shape of an object in three-dimensional space, existing 3D target detection algorithms based on laser point clouds achieve good results; existing vision based 3D target detection algorithms therefore usually estimate depth from the image to obtain a depth map, convert the pixels into three-dimensional space according to the depth map to obtain a pseudo point cloud, and finally perform 3D target detection on the pseudo point cloud. At present, 3D target detection based on a pseudo point cloud usually directly adopts a 3D target detection model designed for the laser point cloud; however, the depth information of the pseudo point cloud is not accurate enough, and the pseudo point cloud differs from the laser point cloud in distribution, so pseudo point cloud based target detection has low accuracy and poor precision and cannot meet the requirements of obstacle detection in scenes such as automatic driving.
Disclosure of Invention
In order to solve at least one of the above technical problems, the present disclosure provides a target detection method, apparatus, electronic device, and storage medium based on a pseudo point cloud.
A first aspect of the present disclosure provides a target detection method based on a pseudo point cloud, including:
acquiring first pseudo-point cloud data of a first image, wherein target features represented by the first pseudo-point cloud data comprise three-dimensional (3D) features and category features;
acquiring 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image;
acquiring first pseudo point cloud data of the 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
according to the first pseudo point cloud data of the 3D candidate frame, obtaining second pseudo point cloud data of the 3D candidate frame, wherein target features represented by the second pseudo point cloud data comprise 3D features, category features and internal features of a corresponding target of the 3D candidate frame;
according to the second pseudo point cloud data of the 3D candidate frame, obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding, wherein the feature coding can enable the target features represented by the second pseudo point cloud data to be consistent with the target features represented by the laser point cloud data in distribution;
and obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
A second aspect of the present disclosure provides a target detection apparatus based on a pseudo point cloud, including: a pseudo point cloud obtaining unit configured to obtain first pseudo point cloud data of a first image, the target features represented by the first pseudo point cloud data including three-dimensional (3D) features and category features; a 3D target candidate frame extraction unit configured to obtain 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image; a candidate frame pseudo point cloud unit configured to obtain first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image; a feature association unit configured to obtain second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame, the target features represented by the second pseudo point cloud data including 3D features, category features and internal features of the target corresponding to the 3D candidate frame; a feature coding unit configured to obtain, through feature coding and according to the second pseudo point cloud data of the 3D candidate frame, a first feature vector of the target corresponding to the 3D candidate frame, the feature coding making the target features represented by the second pseudo point cloud data consistent in distribution with the target features represented by laser point cloud data; and a detection frame acquisition unit configured to obtain 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
A third aspect of the present disclosure provides an electronic device, comprising: a memory storing execution instructions; and the processor executes the execution instructions stored by the memory, so that the processor executes the target detection method based on the pseudo point cloud.
A fourth aspect of the present disclosure provides a readable storage medium, in which execution instructions are stored, and the execution instructions are executed by a processor to implement the above-mentioned target detection method based on pseudo point cloud.
By embedding category information into the pseudo point cloud, the feature alignment problem that usually has to be considered in multimodal fusion can be avoided, which effectively reduces operational complexity and improves processing efficiency. In addition, the association relationship between pseudo points is constructed in the second pseudo point cloud data, so that the point features within a 3D candidate frame are no longer isolated. Moreover, through feature coding, the pseudo point cloud target features and the laser point cloud target features tend to be consistent in distribution: the difference between data sources is optimized at the target level, which is more efficient than optimizing the whole point cloud scene and improves detection precision. In other words, the method and the device reduce the complexity of pseudo point cloud target detection, improve processing efficiency and precision, and make the detection results more accurate.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the disclosure and together with the description serve to explain the principles of the disclosure.
Fig. 1 is a schematic flow chart of a target detection method based on a pseudo point cloud according to an embodiment of the present disclosure.
Fig. 2 is a schematic flow chart of acquiring first pseudo point cloud data of a first image according to an embodiment of the disclosure.
Fig. 3 is an exemplary structural diagram of a feature association network of one embodiment of the present disclosure.
Fig. 4 is a schematic flow chart of obtaining second pseudo point cloud data of a 3D candidate box according to an embodiment of the disclosure.
Fig. 5 is a flow diagram of feature encoding according to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of a model structure and its implementation of one embodiment of the present disclosure.
Fig. 7 is a schematic flow chart of a specific implementation of the target detection method based on the pseudo point cloud according to an embodiment of the present disclosure.
Fig. 8 is a schematic block diagram of the structure of a pseudo point cloud based target detection apparatus employing a hardware implementation of a processing system according to an embodiment of the present disclosure.
Detailed Description
The present disclosure will be described in further detail with reference to the drawings and embodiments. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not to be construed as limitations of the present disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict. Technical solutions of the present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Unless otherwise indicated, the illustrated exemplary embodiments/examples are to be understood as providing exemplary features of various details of some ways in which the technical concepts of the present disclosure may be practiced. Accordingly, unless otherwise indicated, features of the various embodiments may be additionally combined, separated, interchanged, and/or rearranged without departing from the technical concept of the present disclosure.
A brief analysis of the related art is first made below.
Related art 1 realizes multi-level deep fusion of lidar point clouds and optical camera images, and can thereby exploit the accurate spatial information of point cloud data and the good target recognition capability of image data to improve the accuracy with which an automatic driving vehicle perceives its surroundings. However, this related art must handle the alignment between input features of the two different modalities of point cloud and image, which is complex and cumbersome and leads to low processing efficiency.
Related art 2: chinese patent publication No. CN112419494 discloses a method for enhancing a target point cloud by fusing a pseudo point cloud and laser, which converts image data into a pseudo point cloud form in a 3D space according to depth information, and then fuses the pseudo point cloud and the laser point cloud, although the obstacle detection accuracy is effectively improved, the direct fusion of the point cloud layer is greatly affected by the accuracy of the pseudo point cloud position information, and mainly performs target detection based on the laser point cloud, and the pseudo point cloud mainly plays an auxiliary role and is not suitable for a scene lacking the laser point cloud.
Related art 3: chinese patent publication No. CN113486887 discloses a feature fusion method for laser point cloud and pseudo point cloud roi, which utilizes a pseudo point cloud auxiliary network to generate a stronger feature representation of the laser point cloud roi, but the laser point cloud and the pseudo point cloud have a larger difference in distribution, so that it is difficult to achieve effective fusion between the two features, and the method relates to complicated model design, and has high computation complexity and low processing efficiency. In addition, the related technology is also based on laser point cloud for target detection, and the pseudo point cloud mainly plays an auxiliary role and is not suitable for scenes lacking the laser point cloud.
In summary, the factors affecting the accuracy of pseudo point cloud based 3D target detection mainly include two aspects: the depth information is not accurate enough, and the difference in distribution between the laser point cloud and the pseudo point cloud results in inconsistent learned feature distributions. Specifically:
1) at present, the 3D target detection based on the pseudo-point cloud directly adopts a model of the laser point cloud to directly perform the 3D target detection on the pseudo-point cloud, and the difference of the two is not considered. The laser point cloud and the pseudo point cloud are different in source, the laser point cloud is a sparse point cloud acquired by a laser radar and has accurate depth information, the pseudo point cloud is converted from an image depth map, the depth information is not necessarily accurate, and the pseudo point cloud is denser than the laser point cloud. In addition, the laser point cloud and the pseudo point cloud of the same target have larger difference in distribution, the pseudo point cloud is more divergent, and the variance is larger. The existing laser point cloud model is suitable for target feature coding of laser point cloud, but the pseudo point cloud coding effect is not good when the model is directly used for coding the pseudo point cloud. Therefore, if the difference between the laser point cloud and the pseudo point cloud can be fully considered, the precision of the 3D target detection algorithm based on the pseudo point cloud can be improved;
2) In the related art, the pseudo point cloud is obtained by converting coordinates into three-dimensional space based on the depth map, and the fourth dimension (reflectivity) is usually filled with 1; this filling only keeps the format consistent with the laser point cloud data and brings no additional benefit. That is, in the related art the pseudo point cloud contains only position information and lacks semantic information. Most multimodal fusion methods fuse the semantic features obtained from the RGB image with the point cloud features, but such fusion usually has to consider feature alignment, is relatively cumbersome, has high operational complexity, and its processing efficiency cannot meet the real-time requirements of scenes such as automatic driving.
In view of this, the present disclosure provides a target detection method, an apparatus, an electronic device and a storage medium based on a pseudo-point cloud, which effectively improve the accuracy of detecting a 3D target based on a pseudo-point cloud mainly through the following two improvements: 1) embedding semantic information during generation of pseudo-point clouds, and adding a feature association module to construct an association relation between the pseudo-point clouds; 2) the method is characterized in that a first target feature coding network guided by the laser point cloud is designed, and target feature coding of the laser point cloud is introduced to guide generation of target features of the pseudo point cloud in a training stage of the first target feature coding network, so that the target features of the pseudo point cloud tend to be consistent in distribution with the target features of the laser point cloud, and the difference of data sources is optimized on a target layer, and the optimization is more efficient compared with the optimization of the whole point cloud scene.
The present disclosure may be applicable to scenarios where 3D object detection needs to be performed. For example, the method and the device can be applied to detection of vehicles in the surrounding environment in the field of automatic driving, and can sense position information of other vehicles in time, so that effective obstacle avoidance is carried out, and the vehicles can run more safely.
Hereinafter, a detailed description will be given of a specific embodiment of the present disclosure with reference to fig. 1 to 8.
Fig. 1 shows a schematic flow diagram of a pseudo point cloud-based target detection method S10 of the present disclosure. Referring to fig. 1, the pseudo point cloud based target detection method S10 may include:
step S12, acquiring first pseudo point cloud data of the first image, wherein the target features represented by the first pseudo point cloud data comprise 3D features and category features;
for example, the first image may be acquired by a camera such as an RGB camera, a depth camera, or the like. Taking a vehicle scene as an example, the first image may be, but is not limited to, a front image captured by a front camera of the vehicle, the front image includes a vehicle front environment, and 3D features such as a position, a size, and a category of an obstacle in front of the vehicle may be obtained by performing target detection on the front image.
The 3D features may include 3D contour features, 3D shape features, and/or other similar features, which may be indicated by 3D space coordinates such as keypoints (e.g., center points, corner points, etc.), sizes in 3D space, and the like.
The first pseudo point cloud data and the following second pseudo point cloud have four dimensions, one of which represents a category of points. For example, the four dimensions may include: three dimensions and categories representing the 3D position can be represented as X-axis coordinate values, Y-axis coordinate values and Z-axis coordinate values in a 3D space coordinate system, and the category dimensions can be represented by semantic information. For example, the information of any one point K in the first pseudo point cloud data or the second pseudo point cloud may be represented as (x, y, z, cls), where x represents an abscissa of the point K in the 3D space coordinate system, y represents an ordinate of the point K in the 3D space coordinate system, z represents a z-axis coordinate of the point K in the 3D space coordinate system, and cls represents semantic information of the point K. Therefore, the pseudo-point cloud data obtained by the method not only contain 3D position information, but also carry rich semantic information, provide more information for the 3D target detection based on the pseudo-point cloud, and can effectively improve the precision and accuracy of the 3D target detection based on the pseudo-point cloud.
In some embodiments, referring to fig. 2, the process S20 of acquiring the first pseudo point cloud data of the first image in step S12 may include:
step S22, obtaining depth information of the first image;
In some embodiments, the depth information of the first image may be obtained by using a depth estimation network to generate a depth map, or directly by a depth camera. A first image containing depth information can thus be obtained, and the pixel data of the first image containing depth information may be represented as (u, v, depth), where u and v represent the position of the pixel in the pixel coordinate system of the camera and depth represents the depth of the pixel in the lidar coordinate system.
Step S24, semantic information of the first image is obtained;
in some embodiments, the semantic information of the first image may be obtained by using a semantic segmentation network, and thereby the first image containing the semantic information may be obtained, the pixel data of the first image containing the semantic information may be represented as (u, v, cls), u and v represent the position of the pixel point in the pixel coordinate system of the camera, and cls represents the semantic information of the pixel point, and the semantic information may indicate the category of the pixel point.
Step S26, generating first pseudo point cloud data of the first image through coordinate transformation according to the depth information and semantic information of the first image.
In some embodiments, step S26 may include: first embedding semantic information into the pixels of the first image with depth information to obtain a first image containing both depth information and semantic information, so that the data of any pixel in the first image can be expressed as (u, v, depth, cls); and then generating, through coordinate transformation and according to the camera parameters (such as camera extrinsic parameters and camera intrinsic parameters) of the sensor that collects the first image, first pseudo point cloud data containing semantic information.
In some embodiments, the coordinate transformation formula is as follows (1) to (3):
z=D(u,v) (1)
Figure BDA0003637167730000051
Figure BDA0003637167730000052
wherein D (u, v) represents a depth value of the pixel (u, v), (c) u ,c v ) Representing the pixel coordinates corresponding to the center of the camera, f u Representing the horizontal focal length of the camera, f v Representing the vertical focal length of the camera. Here, the camera refers to a camera that captures a first image.
In step S12, a fusion mode of depth map + semantic segmentation result is adopted, so that semantic classification information can be effectively embedded into pseudo-point cloud, where the pseudo-point cloud data includes both 3D position information (x, y, z) and semantic information cls capable of indicating categories, thereby avoiding the problem of feature alignment that is usually considered in multimodal fusion.
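A minimal Python sketch of this back-projection, assuming a NumPy depth map and semantic map as inputs; the function name and camera-parameter variable names are illustrative assumptions, and any further extrinsic transform into the lidar coordinate system is omitted:

import numpy as np

def build_pseudo_point_cloud(depth, semantics, fu, fv, cu, cv):
    # depth:     (H, W) array of per-pixel depth values D(u, v)
    # semantics: (H, W) array of per-pixel class ids cls from semantic segmentation
    # fu, fv:    horizontal / vertical focal lengths; cu, cv: principal point pixel coordinates
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    z = depth                      # equation (1): z = D(u, v)
    x = (u - cu) * z / fu          # equation (2)
    y = (v - cv) * z / fv          # equation (3)
    valid = z > 0                  # keep only pixels with a valid depth estimate
    return np.stack([x[valid], y[valid], z[valid], semantics[valid]], axis=-1)  # (M, 4): (x, y, z, cls)

The returned array directly matches the (x, y, z, cls) point representation described above.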
Step S14, acquiring 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image;
Illustratively, the 3D candidate frame information may be represented in the form (x_c, y_c, z_c, w_c, l_c, h_c, θ_c), where x_c, y_c, z_c represent the coordinates of the center point of the 3D candidate frame in the 3D spatial coordinate system, w_c represents the width of the 3D candidate frame, l_c represents the length of the 3D candidate frame, h_c represents the height of the 3D candidate frame, and θ_c represents the anchor orientation angle of the 3D candidate frame.
In some embodiments, a 3D candidate frame of the first image may be obtained from the first pseudo point cloud data of the first image based on a pre-trained 3D candidate frame detection model, where the 3D candidate frame detection model is used to extract 3D candidate frames. For example, the 3D candidate frame detection model may be, but is not limited to, a Region Proposal Network (RPN), any other high-quality RPN, or a lightweight 3D detection network.
In some embodiments, the 3D candidate box detection model may be an RPN based on the original point cloud, a voxel based RPN, or a point + voxel based RPN.
For example, when the 3D candidate frame detection model is based on the RPN of the original point cloud, a PointNet or PointNet + + network may be used to extract the features of the first pseudo point cloud data of the first image, and then the 3D candidate frame may be generated according to the features and the preset anchor.
For another example, when the 3D candidate frame detection model is based on the RPN of voxels, the first pseudo point cloud data of the first image is voxel-divided, and is distributed in a three-dimensional space with a size of D, H, W along the ZYX coordinate system, and the three-dimensional space is divided into small cubes, that is, voxels, and the length, width, and height dimensions of each voxel are set as v D ,v H And v w Then the total number of voxels after division is
Figure BDA0003637167730000061
And sets a point cloud number upper threshold for each voxel. And carrying out voxel coding on the divided voxels by adopting a voxel coder, carrying out feature extraction on the coded voxels through 3D sparse convolution, and generating a 3D candidate frame according to the features and a preset anchor.
Step S16, acquiring first pseudo point cloud data of the 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
in some embodiments, first pseudo point cloud data of the 3D candidate box may be extracted from the first pseudo point cloud data of the first image according to the 3D candidate box information.
In some embodiments, after the first pseudo point cloud data of the 3D candidate frames are extracted, the first pseudo point cloud data of the 3D candidate frames may be further encoded to equalize the number of points of each 3D candidate frame, so as to reduce the error of the candidate frames.
For example, encoding the first pseudo point cloud data of the 3D candidate box may include: for each 3D candidate frame, in order to reduce the error of the candidate frame and wrap more target point cloud data, each 3D candidate frame may be enlarged to be a cylinder with unlimited height, a radius r of a bottom surface of the cylinder satisfies equation (4), w and l respectively represent the width and length of the 3D candidate frame, α is a hyper-parameter, α is 1.2 in the experiment, and the number N of points in each 3D candidate frame is 256.
Figure BDA0003637167730000062
In some embodiments, encoding the first pseudo point cloud data of the 3D candidate box may further include: and if the actual point number M of the 3D candidate frame is larger than N, randomly selecting N points from the M points, and deleting other points. And if the actual point number M of the 3D candidate frame is less than N, randomly selecting a point coordinate as the coordinate of the rest N-M points, and filling the point number corresponding to the first pseudo point cloud data of the 3D candidate frame to N.
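A minimal sketch of this point-count equalization, assuming the cylinder radius of the reconstructed equation (4) and uniform random sampling and padding (the function name and random-number handling are illustrative assumptions):

import numpy as np

def gather_candidate_points(points, box, alpha=1.2, n_points=256, rng=None):
    # points: (M, 4) first pseudo point cloud data (x, y, z, cls) of the scene
    # box:    (xc, yc, zc, w, l, h, theta) parameters of one 3D candidate frame
    rng = rng or np.random.default_rng()
    xc, yc, zc, w, l, h, theta = box
    r = alpha * np.sqrt(w ** 2 + l ** 2) / 2.0          # enlarged cylinder radius, cf. equation (4)
    inside = (points[:, 0] - xc) ** 2 + (points[:, 1] - yc) ** 2 <= r ** 2  # height is not limited
    pts = points[inside]
    m = len(pts)
    if m == 0:
        return np.zeros((n_points, points.shape[1]), dtype=points.dtype)
    if m > n_points:                                    # randomly keep N of the M points
        return pts[rng.choice(m, n_points, replace=False)]
    pad = pts[rng.choice(m, n_points - m)]              # pad the remaining N - M slots with randomly chosen points
    return np.concatenate([pts, pad], axis=0)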
Step S18, obtaining second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
the target features represented by the second pseudo point cloud data comprise 3D features, class features and internal features of the target corresponding to the 3D candidate box. For example, internal features may include, but are not limited to, internal geometric features, internal structural features, inter-component association features, and/or other similar internal features.
In some embodiments, the second pseudo point cloud data of the 3D candidate box may be obtained through a pre-trained feature association network. An exemplary structure of the feature association network 300 is shown in fig. 3, where the feature association network 300 includes an inter-point attention module 320, an inter-channel attention module 340, a position coding module 360, and a fusion module 380, the inter-point attention module 320 may be configured to obtain spatial association features of an object corresponding to a 3D candidate box, the inter-channel attention module 340 may be configured to obtain channel association features of an object corresponding to a 3D candidate box, the position coding module 360 may be configured to obtain 3D relative position features of an object corresponding to a 3D candidate box, and the fusion module 380 may be configured to obtain second pseudo point cloud data of the 3D candidate box.
The data of different points in the first pseudo point cloud data of each 3D candidate frame are independent. In some embodiments, the feature association network 300 may be configured to construct association relationships between internal point data of the same 3D candidate box, so as to characterize the internal features of the target corresponding to the 3D candidate box through the association relationships.
Referring to fig. 3, the input feature X of the feature association network 300 may be represented as B × N × D, where B represents the number of 3D candidate frames, N represents the number of points in a 3D candidate frame, and D represents the dimension of the first pseudo point cloud data. As described above, the first pseudo point cloud data may be represented as (x, y, z, cls), i.e., the dimension D of the first pseudo point cloud data is 4.
In some embodiments, referring to fig. 4, the process S40 of obtaining the second pseudo point cloud data of the 3D candidate box may include:
step S42, according to the first pseudo point cloud data of the 3D candidate frame, obtaining the space association characteristics of the target corresponding to the 3D candidate frame;
the first pseudo point cloud data of the 3D candidate box can represent the geometric shape of the target corresponding to the 3D candidate box, that is, there is a spatial interdependence relationship between points in the 3D candidate box, and therefore, the spatial association relationship of the 3D candidate box is constructed by using an inter-point attention mechanism in the present disclosure.
In some embodiments, referring to fig. 3, a specific implementation process of the inter-point attention module 320, that is, a specific implementation flow of step S42, may include:
The input feature X is passed through three convolutional layers to obtain three features A, C and G. The output feature F1 of the inter-point attention module 320 (that is, the spatial association feature of the target corresponding to the 3D candidate frame) can be obtained by the following equations (5) to (6):

S_ij = exp(A_i · C_j^T) / Σ_{k=1}^{N} exp(A_i · C_k^T)          (5)
F1 = S · G + X          (6)

In equations (5) and (6), N represents the number of points in the 3D candidate frame, A_i and C_j denote the features of the i-th and j-th points in A and C, and S_ij is the value of the element in the i-th row and j-th column of S.
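A minimal PyTorch sketch of this inter-point attention; the use of 1×1 convolutions for A, C and G and the softmax normalization are assumptions consistent with the reconstructed equations (5) and (6):

import torch
import torch.nn as nn

class InterPointAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # three convolution layers produce the features A, C and G from the input X
        self.conv_a = nn.Conv1d(dim, dim, kernel_size=1)
        self.conv_c = nn.Conv1d(dim, dim, kernel_size=1)
        self.conv_g = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (B, N, D) points of B candidate frames
        xt = x.transpose(1, 2)                 # (B, D, N) layout expected by Conv1d
        a = self.conv_a(xt).transpose(1, 2)
        c = self.conv_c(xt).transpose(1, 2)
        g = self.conv_g(xt).transpose(1, 2)
        s = torch.softmax(a @ c.transpose(1, 2), dim=-1)   # equation (5): (B, N, N) point-to-point weights
        return s @ g + x                       # equation (6): F1 = S·G + X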
Step S44, acquiring channel association characteristics of the target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
In addition to the spatial interdependence between points, attention also needs to be paid to the correlation between the internal channels of the point cloud, so the present disclosure adopts a channel attention mechanism to weigh the importance of different channel features.
In some embodiments, referring to fig. 3, the inter-channel attention module 340 includes an average pooling layer, a linear layer, a Relu layer, a linear layer, and a sigmoid activation layer. In the inter-channel attention module 340 of fig. 3, from input to output, the first light grey square represents the average pooling layer, the second and third dark grey squares represent the two linear layers, respectively, and the Relu layer and sigmoid active layer are directly indicated by arrows.
For example, the specific implementation process of step S44 may include: the input feature X is spatially compressed by an average pooling layer, then processed sequentially by a linear layer, a Relu layer and a linear layer, and fed into a sigmoid activation layer to obtain a weight matrix; finally, the weight matrix is multiplied by the input feature X to obtain the output feature F2 of the inter-channel attention module 340, where F2 is the channel association feature of the target corresponding to the 3D candidate frame.
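A corresponding sketch of the inter-channel attention module; the hidden width of the two linear layers is an assumption:

import torch
import torch.nn as nn

class InterChannelAttention(nn.Module):
    def __init__(self, dim, hidden=16):
        super().__init__()
        # linear -> Relu -> linear -> sigmoid applied to the spatially compressed feature
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim), nn.Sigmoid())

    def forward(self, x):                      # x: (B, N, D)
        weights = self.mlp(x.mean(dim=1))      # average pooling over the N points, then the weight matrix
        return x * weights.unsqueeze(1)        # F2: channel weights multiplied with the input feature X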
Step S46, acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to position information in the first pseudo point cloud data of the 3D candidate frame;
In some embodiments, the point cloud data itself contains 3D position information that can be used for position-embedding encoding. For the position embedding of the point cloud, the present disclosure introduces 3D spatial relative coordinates, computed by the following formula (7):

E = ψ(p_i − p_j)          (7)

where p_i and p_j respectively denote the 3D position information (x, y, z) of point i and point j in the same 3D candidate frame, and ψ denotes the operation of a linear layer, a Relu layer, a linear layer and a sigmoid activation layer.

Referring to fig. 3, the position coding module 360 may include a linear layer, a Relu layer, two linear layers and a sigmoid activation layer connected in sequence; the input of the position coding module 360 is the 3D position information of any two points in the same 3D candidate frame, and its output feature F3 is the 3D relative position feature of the first pseudo point cloud data of the 3D candidate frame. Similarly, in the position coding module 360, the three dark gray squares represent the three linear layers, and the Relu layer and the sigmoid activation layer are indicated directly by arrows.
And step S48, obtaining second pseudo point cloud data of the 3D candidate frame according to the space association feature, the channel association feature and the 3D relative position feature of the target corresponding to the 3D candidate frame.
In some embodiments, step S48 may include: the 3D relative position feature F3 of the target corresponding to the 3D candidate frame is concatenated with the spatial association feature F1 and with the channel association feature F2, respectively; the two concatenation results are passed through two multilayer perceptrons (MLPs) and multiplied pixel by pixel to obtain the output feature Y, which is the second pseudo point cloud data of the 3D candidate frame.

In some embodiments, referring to fig. 3, the fusion module 380 includes two concatenation layers (concat), two MLP layers and a network layer for performing the multiplication operation; the input features of the fusion module 380 are the 3D relative position feature F3, the spatial association feature F1 and the channel association feature F2, and its output feature is Y.
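A combined sketch of the position coding module and the fusion module; how the pairwise features E of formula (7) are aggregated into a per-point feature F3 is not spelled out above, so the summation over point pairs and all layer widths below are assumptions:

import torch
import torch.nn as nn

class PositionCodingAndFusion(nn.Module):
    def __init__(self, dim, hidden=16):
        super().__init__()
        # psi of formula (7), applied to the relative coordinates p_i - p_j
        self.psi = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim), nn.Sigmoid())
        self.mlp1 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.mlp2 = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, xyz, f1, f2):            # xyz: (B, N, 3); f1, f2: (B, N, D)
        rel = xyz.unsqueeze(2) - xyz.unsqueeze(1)        # (B, N, N, 3) pairwise p_i - p_j
        f3 = self.psi(rel).sum(dim=2)                    # (B, N, D) 3D relative position feature F3
        y1 = self.mlp1(torch.cat([f1, f3], dim=-1))      # concatenate F3 with F1, then an MLP
        y2 = self.mlp2(torch.cat([f2, f3], dim=-1))      # concatenate F3 with F2, then an MLP
        return y1 * y2                                   # pixel-wise multiplication gives Y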
As can be seen from the above, the size of the second pseudo point cloud data is consistent with that of the first pseudo point cloud data, the feature representation of the first pseudo point cloud data is independent, but the second pseudo point cloud data constructs an association relationship between the pseudo point clouds through an attention mechanism, and the data of the points in each 3D candidate frame fuses the spatial association and the position information of other points in the 3D candidate frame, so that the method has a stronger representation capability.
Exemplarily, the second pseudo point cloud data of the 3D candidate frame may be represented as {p_1, …, p_N}, where p_i denotes the data of the i-th point in the 3D candidate frame, i = 1, …, N, and N is the number of points in the 3D candidate frame. As previously described, the number of points N in the 3D candidate frame may be 256.
In step S18, a feature association network is used to perform feature association on the data of the points in the same 3D candidate frame, fully constructing the association relationship between the pseudo points, so that the feature of each point in the 3D candidate frame is no longer isolated but fuses the information of the surrounding points.
Step S11, according to the second pseudo point cloud data of the 3D candidate frame, obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding, wherein the feature coding can enable the target features represented by the second pseudo point cloud data to be consistent with the target features represented by the laser point cloud data in distribution;
in some embodiments, referring to fig. 5, step S11 may include:
step S52, encoding second pseudo-point cloud data of the 3D candidate frame into pseudo-point cloud target feature data of the 3D candidate frame in a key point relative coordinate encoding mode, wherein the pseudo-point cloud target feature data are consistent with the laser point cloud target feature data in distribution, and the laser point cloud target feature data are obtained by encoding the laser point cloud data of the 3D candidate frame;
step S54, the pseudo point cloud target feature data is decoded into a first feature vector with a fixed size.
In some embodiments, the second pseudo point cloud data of the 3D candidate box may be feature-coded by the pre-trained first target feature coding network to obtain a first feature vector of a target corresponding to the 3D candidate box.
In some embodiments, the first target feature encoding network may include an encoder (encoder) and a decoder (decoder). The encoder can be used for encoding the second pseudo point cloud data of the 3D candidate frame into the pseudo point cloud target feature data of the 3D candidate frame in a key point relative coordinate encoding mode, and the decoder can be used for decoding the pseudo point cloud target feature data output by the encoder into a first feature vector with a fixed size.
Illustratively, the encoder may encode the relationship between each point in the 3D candidate frame and each corner point of the 3D candidate frame in a keypoint relative coordinate encoding manner.
For example, the relative coordinates between N points in the 3D candidate frame and 8 corner points of the 3D candidate frame may be expressed as the following formula (8):
Figure BDA0003637167730000091
wherein p is j The coordinates of the jth corner point are indicated,
Figure BDA0003637167730000092
and representing the relative coordinates of the ith point and the jth corner point in the 3D candidate frame.
For example, the relative coordinates of the ith point in the 3D candidate frame and the center point of the 3D candidate frame may be expressed as
Figure BDA0003637167730000093
For example, after encoding, the feature of the point in the 3D candidate frame is expressed by the following expression (9).
Figure BDA0003637167730000094
Wherein, cls i The category information of the ith point is represented,
Figure BDA0003637167730000095
representing a linear layer maps the features of the ith point into a high-dimensional space. In the experiment, D in formula (9) was 28.
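A sketch of this keypoint relative-coordinate encoding; the corner ordering, the assignment of l and w to the x and y axes, and the output size of the linear layer are assumptions:

import math
import torch
import torch.nn as nn

def box_corners(box):
    # box: (xc, yc, zc, w, l, h, theta) as Python floats; l is assumed to lie along x before the yaw rotation
    xc, yc, zc, w, l, h, theta = box
    offsets = torch.tensor([[sx * l / 2, sy * w / 2, sz * h / 2]
                            for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    rot = torch.tensor([[math.cos(theta), -math.sin(theta), 0.0],
                        [math.sin(theta),  math.cos(theta), 0.0],
                        [0.0, 0.0, 1.0]])
    return offsets @ rot.T + torch.tensor([xc, yc, zc])            # (8, 3) corner coordinates

def encode_points(points, box, linear):
    # points: (N, 4) second pseudo point cloud data (x, y, z, cls) of one candidate frame
    # linear: a linear layer mapping the 28-dim raw feature into a high-dimensional space, e.g. nn.Linear(28, 128)
    xyz, cls = points[:, :3], points[:, 3:4]
    corners = box_corners(box)
    rel_corners = (xyz.unsqueeze(1) - corners.unsqueeze(0)).reshape(len(xyz), -1)  # formula (8): (N, 24)
    rel_center = xyz - torch.tensor(box[:3])                       # relative coordinates to the box center
    raw = torch.cat([rel_corners, rel_center, cls], dim=-1)        # (N, 28), cf. formula (9)
    return linear(raw)                                             # encoded per-point features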
The encoded point features of each candidate frame are processed by a multi-head self-attention mechanism (Multi-Head Self-Attention); the encoder in the first target feature coding network may be stacked from 3 multi-head self-attention structures.

The decoder decodes the pseudo point cloud target feature data output by the encoder into a global feature vector y of fixed size (e.g., 1 × D), which is the first feature vector of the target corresponding to the 3D candidate frame. Thus, the second pseudo point cloud data of each 3D candidate frame can be represented as a vector y of dimension 1 × D.
For example, the output feature of the decoder, i.e., the first feature vector of the target corresponding to the 3D candidate box, may be expressed as the following expression (10):
f pse ={y 1 ,y 2 ,…,y N } (10)
where N is the number of 3D candidate frames, f pse And y represents the first feature vector of the target corresponding to a single 3D candidate frame.
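A sketch of the encoder/decoder structure of the first target feature coding network: three stacked multi-head self-attention blocks followed by a decoder that reduces the N point features of a candidate frame to one 1 × D vector y; the head count, the max-pooling aggregation and the decoder layers are assumptions, and D must be divisible by the number of heads:

import torch
import torch.nn as nn

class TargetFeatureEncoder(nn.Module):
    def __init__(self, dim, num_heads=4, num_layers=3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)   # 3 stacked multi-head self-attention blocks
        self.decoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, point_feats):            # (B, N, D) encoded point features per candidate frame
        enc = self.encoder(point_feats)        # pseudo point cloud target feature data
        pooled = enc.max(dim=1).values         # aggregate the N points into one global feature
        return self.decoder(pooled)            # (B, D): one first feature vector y per candidate frame

The same structure, fed with laser point cloud features, would play the role of the second target feature coding network described below for the training stage.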
Considering the difference between the pseudo point cloud and the laser point cloud, that is, the following difference mainly exists between the pseudo point cloud and the laser point cloud in the same 3D candidate frame: 1) the depth information of the laser point cloud is accurate, and the depth information of the pseudo point cloud has errors; 2) the laser point cloud is collected by a laser beam emitted by a laser radar at certain angle intervals, and is regular and sparse, and the pseudo point cloud is obtained by converting image pixels according to a depth map and coordinates, and is irregular and dense. The distribution of the two in three-dimensional space has large difference, and the difference is particularly expressed on the mean value and the variance in X, Y and Z directions. In view of this, in order to reduce the network performance loss caused by such a data source and enable the first target feature coding network to learn more effective feature representation, the present disclosure introduces a laser point cloud coding branch to code a laser point cloud target feature in the training process of the network, and guides the generation of a pseudo point cloud target feature by using the feature, so that the pseudo point cloud target feature data tends to be consistent with the laser point cloud target feature in distribution. Therefore, the target detection precision of the pseudo point cloud can be improved in a laser point cloud guidance mode.
In some embodiments, the first target feature coding network is obtained by training a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame, and the second feature vector is obtained by the second target feature coding network according to laser point cloud data corresponding to the 3D candidate frame.
The second target feature coding network is obtained by training according to the laser point cloud data and is used for carrying out feature coding on the laser point cloud data of the 3D candidate frame so as to obtain a second feature vector of a target corresponding to the 3D candidate frame.
The second target feature coding network has the same structure as the first target feature coding network. That is, the second target feature coding network includes an encoder and a decoder that function the same as the corresponding components of the first target feature coding network, except that its input data is the laser point cloud data of the 3D candidate frame. The laser point cloud data has four dimensions: three dimensions representing the 3D position, plus the reflectivity; it may therefore be represented as (x, y, z, r), where x, y, z are the three-dimensional coordinates in the 3D spatial rectangular coordinate system and r is the reflectivity. The output data is the second feature vector of the target corresponding to the 3D candidate frame, and the size of the second feature vector is the same as that of the first feature vector (e.g., 1 × D).
For example, the output feature of the second target feature coding network, i.e., the second feature vector of the target corresponding to the 3D candidate box, may be expressed as the following formula (11):
Figure BDA0003637167730000101
wherein f is lidar Second feature representing objects corresponding to all 3D candidate frames of the first imageThe set of vectors is then used to generate a set of vectors,
Figure BDA0003637167730000102
Figure BDA0003637167730000103
and a second feature vector representing an object corresponding to the single 3D candidate box.
In a specific application, the first image may be acquired while acquiring the corresponding environmental laser point cloud data, and after obtaining the 3D candidate frame information, the 3D candidate frame information (i.e., (x) above) may be obtained according to the 3D candidate frame information c ,y c ,z c ,w c ,l c ,h cc ) Laser point cloud data corresponding to the 3D candidate box is extracted from the ambient laser point cloud data.
In some embodiments, the loss function of the first target feature coding network may include a feature similarity loss, where the feature similarity loss is obtained according to the first feature vector and the second feature vector of the target corresponding to the 3D candidate box. Therefore, the generation of the pseudo-point cloud target features is restrained by adding feature similarity loss into a loss function of the first target feature coding network, so that the distribution of the pseudo-point cloud target features can be consistent with the laser point cloud target features.
Preferably, the feature similarity loss may be a KL divergence loss between a first feature vector and a second feature vector of a target to which the 3D candidate box corresponds. The KL divergence, also known as relative entropy, can be used to measure the distance between two distributions. The feature similarity loss may be obtained by a KL divergence loss function (KLDivLoss).
In some embodiments, the output data of the second target signature coding network is assumed to be
Figure BDA0003637167730000111
The output data of the first target feature coding network is f pse ={y 1 ,y 2 ,…,y N The loss of KL divergence between the two can be obtained by:
Figure BDA0003637167730000112
in the formula (12), L KLD Representing a loss of feature similarity, N represents the number of 3D candidate boxes,
Figure BDA0003637167730000113
and j is 1,2, … …, N, which represents a second feature vector of the target corresponding to the jth 3D candidate frame.
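A sketch of this feature similarity loss using PyTorch's KL-divergence loss; normalizing both feature vectors with a softmax so that they can be treated as distributions, and detaching the laser branch so that it acts as the teacher, are assumptions:

import torch.nn.functional as F

def feature_similarity_loss(y_pse, y_lidar):
    # y_pse:   (N, D) first feature vectors from the pseudo point cloud branch
    # y_lidar: (N, D) second feature vectors from the laser point cloud branch
    log_p = F.log_softmax(y_pse, dim=-1)
    q = F.softmax(y_lidar.detach(), dim=-1)            # no gradient flows into the laser branch
    return F.kl_div(log_p, q, reduction="batchmean")   # L_KLD averaged over the N candidate frames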
Step S13, obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, after step S13 or in step S13, the method further includes: and obtaining the confidence of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, the 3D detection frame information may include information on a center point position, a size, and the like of the 3D detection frame. Illustratively, the 3D detection frame information may be expressed in the form of the foregoing 3D candidate frame information, i.e., coordinates of a center point of the 3D detection frame in a 3D space coordinate system, a width, a length, and a height of the 3D detection frame, and an orientation of the 3D detection frame.
In some embodiments, the 3D detection frame information and the confidence level of the 3D detection frame may be obtained by a Feed-Forward neural Network (FFN). For example, 3D detection box information may be obtained based on a pre-trained first feed-forward neural network, and/or confidence of a 3D detection box is obtained based on a pre-trained second feed-forward neural network. That is, the first feedforward neural network is used to implement 3D detection box regression, and the second feedforward neural network is used to calculate the confidence of the 3D detection box.
Fig. 6 shows a model structure and an implementation process thereof for implementing a target detection method based on a pseudo point cloud according to an embodiment of the present disclosure.
As shown in fig. 6, the first pseudo point cloud data of the first image is sequentially processed by the 3D candidate frame detection network, the feature association network, and the first target feature coding network to obtain a first feature vector of a target corresponding to the 3D candidate frame, and the first feature vector of the target corresponding to the 3D candidate frame is processed by the first feedforward neural network and the second feedforward neural network to obtain 3D detection frame information and a confidence of the 3D detection frame.
In some embodiments, the training process of the network includes: the method comprises an individual stage and a combined stage, wherein the individual stage can utilize pseudo point cloud data containing an original labeling frame to train a 3D candidate frame detection network, a feature association network, a first target feature coding network, a first feed-forward neural network and a second feed-forward neural network, and the combined stage trains the 3D candidate frame detection network, the feature association network, the first target feature coding network, the first feed-forward neural network and the second feed-forward neural network based on the pseudo point cloud data and the laser point cloud data on the basis of the training result of the individual stage.
During the training process of the individual stage, the loss function of the 3D candidate box detection network, the feature association network, the first target feature coding network, the first feedforward neural network and/or the second feedforward neural network can be expressed as equation (13):
Figure BDA0003637167730000121
wherein,
Figure BDA0003637167730000122
respectively represent the weight coefficient, L total Function value, L, representing a loss function conf Represents the loss of confidence calculation for the detection box, L reg Represents the regression loss of the 3D detection frame, L RPN Indicating that the 3D candidate box detects loss of the network.
During the training process of the joint stage, the loss function of the 3D candidate box detection network, the feature association network, the first target feature coding network, the first feedforward neural network and/or the second feedforward neural network can be expressed as equation (14):
L_total = α_1 L_RPN + α_2 L_conf + α_3 L_reg + α_4 L_KLD (14)

wherein L_total represents the function value of the loss function, α_1, α_2, α_3 and α_4 represent weight coefficients, L_KLD represents the feature similarity loss, L_conf represents the confidence calculation loss of the detection box, L_reg represents the regression loss of the 3D detection frame, and L_RPN represents the loss of the 3D candidate box detection network (e.g., an RPN).
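As a small illustration (not part of the disclosure), the weighted sums of formulas (13) and (14) can be written as one helper in Python; the sub-loss values and the alpha weights are assumed to be supplied by the caller:

def total_loss(l_rpn, l_conf, l_reg, l_kld=None, alphas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the sub-losses.  Formula (13) (individual stage) omits the
    feature similarity term; formula (14) (joint stage) adds it.  The alpha
    values here are placeholders, not values from the disclosure."""
    a1, a2, a3, a4 = alphas
    loss = a1 * l_rpn + a2 * l_conf + a3 * l_reg
    if l_kld is not None:          # joint stage only
        loss = loss + a4 * l_kld
    return loss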
In some embodiments, L_RPN mainly comprises a classification loss L_cls and a box regression loss L'_reg computed inside the candidate frame detection network (denoted L'_reg to distinguish it from the detection-head regression loss L_reg in formula (14)), i.e., L_RPN can be obtained by the following formula (15):

L_RPN = β_1 L_cls + β_2 L'_reg (15)

wherein β_1 is the weight coefficient of L_cls and β_2 is the weight coefficient of L'_reg.
In some embodiments, the classification loss L_cls may adopt the Focal loss, which effectively suppresses the imbalance between positive and negative samples. Specifically, the classification loss function can be expressed as the following formulas (16) to (17):

L_cls = -(1/n) Σ_{i=1}^{n} α (1 - p_t,i)^γ log(p_t,i) (16)

p_t,i = p_i, if class i is the true class; p_t,i = 1 - p_i, otherwise (17)

wherein n represents the number of classes, p_i represents the prediction score of the i-th class, and α and γ are two hyper-parameters.
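A hedged Python sketch of such a Focal loss follows; the one-hot formulation and the default alpha and gamma values are assumptions, since the disclosure only names the Focal loss and its two hyper-parameters:

import torch

def focal_loss(pred_scores, targets, alpha=0.25, gamma=2.0, eps=1e-6):
    """Focal loss sketch.  pred_scores: (N, n_classes) class probabilities p_i;
    targets: (N, n_classes) one-hot labels.  alpha and gamma are the two
    hyper-parameters mentioned in the text."""
    p_t = torch.where(targets == 1, pred_scores, 1.0 - pred_scores)
    alpha_t = torch.where(targets == 1,
                          torch.full_like(p_t, alpha),
                          torch.full_like(p_t, 1.0 - alpha))
    loss = -alpha_t * (1.0 - p_t) ** gamma * torch.log(p_t.clamp(min=eps))
    return loss.mean()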
In some embodiments, the confidence prediction loss L_conf adopts a cross-entropy calculation, as shown in the following formulas (18) to (19):

L_conf = -c_t log(c) - (1 - c_t) log(1 - c) (18)

c_t = min(1, max(0, (IoU - α_B)/(α_F - α_B))) (19)

wherein c is the confidence score of the prediction box, c_t represents the confidence truth value of the prediction box, IoU represents the intersection-over-union ratio of the 3D candidate box and the annotation box, and α_F and α_B represent the IoU thresholds distinguishing foreground and background, respectively. In the experiment, α_F = 0.75 and α_B = 0.25.
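The confidence loss of formulas (18) to (19) could be computed as below; the linear ramp between α_B and α_F follows the reconstruction adopted above and should be read as an assumption:

import torch
import torch.nn.functional as F

def confidence_loss(pred_conf, iou, alpha_f=0.75, alpha_b=0.25):
    """Cross-entropy confidence loss.  The target c_t is a soft label derived
    from the IoU between the 3D candidate box and the annotation box: below
    alpha_b it is background (0), above alpha_f it is foreground (1), and the
    interval in between is interpolated linearly."""
    c_t = ((iou - alpha_b) / (alpha_f - alpha_b)).clamp(0.0, 1.0)
    return F.binary_cross_entropy(pred_conf.clamp(1e-6, 1.0 - 1e-6), c_t)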
In some embodiments, the 3D detection box regression loss L_reg in effect regresses the offset of the anchor relative to the original annotation box (ground truth bounding box).

Considering the anchor mechanism, what is actually regressed is the offset with respect to the anchor (x_a, y_a, z_a, w_a, l_a, h_a, θ_a). Therefore, in the training process, the original annotation box (x_g, y_g, z_g, w_g, l_g, h_g, θ_g) is first encoded as an offset from the anchor, and the encoding method is shown in the following formulas (20) to (22):

x_t = (x_g - x_a)/d_a, y_t = (y_g - y_a)/d_a, z_t = (z_g - z_a)/h_a (20)

w_t = log(w_g/w_a), l_t = log(l_g/l_a), h_t = log(h_g/h_a) (21)

θ_t = θ_g - θ_a (22)

wherein box_t = (x_t, y_t, z_t, w_t, l_t, h_t, θ_t) is the target offset that needs to be regressed, and the anchor diagonal d_a used in formula (20) satisfies the following formula (23):

d_a = sqrt(w_a^2 + l_a^2) (23)
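A NumPy sketch of this anchor-relative encoding; since formulas (20) to (23) are reconstructed from the standard anchor encoding, treat the helper as illustrative rather than as the exact encoding of the disclosure:

import numpy as np

def encode_box_to_anchor(gt, anchor):
    """Encode a ground-truth box (x, y, z, w, l, h, theta) as an offset from an
    anchor, following formulas (20) to (23) above."""
    xg, yg, zg, wg, lg, hg, tg = gt
    xa, ya, za, wa, la, ha, ta = anchor
    da = np.sqrt(wa ** 2 + la ** 2)                         # formula (23)
    return np.array([
        (xg - xa) / da, (yg - ya) / da, (zg - za) / ha,     # centre offsets (20)
        np.log(wg / wa), np.log(lg / la), np.log(hg / ha),  # size ratios (21)
        tg - ta,                                            # orientation offset (22)
    ])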
the calculation formula for the prediction of the position of the center point of the detection frame and the length, width and height dimensions is shown in the following formula (24):
L reg-loc+dim =(box prediction -box t ) 2 (24)
wherein, box prediction Representing the predicted offset, box, of the target t Representing the amount of offset to regress for the target.
The prediction of the orientation angle mainly includes direction prediction and angle prediction. The direction prediction can be simplified into a binary classification problem and adopts a cross-entropy loss, which is calculated as shown in the following formula (25):

L_dir = -y log(p(x)) - (1 - y) log(1 - p(x)) (25)

wherein p(x) represents the prediction result for x, and y represents the true classification label.
The angle loss component may take the Smooth-L1 form applied to the sine of the angle residual, calculated as shown in the following formula (26):

L_reg-θ = SmoothL1(sin(θ_p - θ_t)) (26)

wherein θ_p represents the predicted orientation value and θ_t the target orientation offset defined in formula (22).
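The two orientation terms can be sketched as follows; tensor shapes and the pairing of direction logits with binary labels are assumptions:

import torch
import torch.nn.functional as F

def orientation_loss(theta_pred, theta_target, dir_logit, dir_label):
    """Direction term: binary cross-entropy as in formula (25) (dir_label is a
    float 0/1 tensor).  Angle term: Smooth-L1 on sin(theta_p - theta_t) as in
    formula (26), i.e. the sine residual is driven towards zero."""
    l_dir = F.binary_cross_entropy_with_logits(dir_logit, dir_label)
    residual = torch.sin(theta_pred - theta_target)
    l_angle = F.smooth_l1_loss(residual, torch.zeros_like(residual))
    return l_dir, l_angle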
Thus, the 3D detection frame regression loss L_reg can be obtained by the following formula (27):

L_reg = γ_1 L_reg-loc+dim + γ_2 L_dir + γ_3 L_reg-θ (27)

wherein γ_1, γ_2 and γ_3 are the weight coefficients of the respective losses.
The 3D detection frame regression loss L_reg regresses the offset of the 3D candidate box with respect to the original annotation box (ground truth bounding box).
Illustratively, when calculating the 3D detection box regression loss L_reg, the detection frames may be screened first, and the loss may be calculated only for detection frames whose Intersection-over-Union (IoU) with the annotation box is greater than a set threshold, where the threshold may be 0.55.
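A sketch of formula (27) combined with this IoU screening; the per-box loss tensors and the gamma weights are placeholders:

def regression_loss(l_loc_dim, l_dir, l_angle, iou, gammas=(1.0, 0.2, 1.0), iou_threshold=0.55):
    """Total 3D detection box regression loss.  All inputs are per-box torch
    tensors of equal length; only boxes whose IoU with the annotation box
    exceeds the threshold (0.55 in the text) contribute to the loss."""
    g1, g2, g3 = gammas
    per_box = g1 * l_loc_dim + g2 * l_dir + g3 * l_angle   # formula (27), per box
    keep = iou > iou_threshold                             # screening mask
    return per_box[keep].mean() if keep.any() else per_box.sum() * 0.0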
Fig. 6 also shows the laser point cloud branch and the second target feature encoding network. In the present disclosure, the laser point cloud branch is used only to assist training and is not needed for model inference. Illustratively, the network is trained as follows: the pseudo point cloud branch and the laser point cloud branch are first trained separately for 100 epochs with a learning rate of 0.001; the trained parameters are then loaded for joint training, during which the parameters of the laser point cloud branch are frozen and no gradient is propagated back to them, so that they only guide the generation of the pseudo point cloud target features and are neither learned nor updated. The joint training runs for 50 epochs with a learning rate of 0.0005, and finally the model with the best detection performance on the validation set is selected for detecting the images in the test set.
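A rough sketch of one joint-training step under this schedule is given below; the module and batch-key names (pseudo_branch, lidar_branch, compute_loss, ...) are illustrative, not names from the disclosure:

import torch

def joint_training_step(pseudo_branch, lidar_branch, optimizer, batch):
    """One joint-stage step: the pre-trained laser point cloud branch is frozen
    (no gradient return) and only guides the pseudo point cloud branch."""
    lidar_branch.eval()
    for p in lidar_branch.parameters():
        p.requires_grad_(False)                  # frozen, never updated

    with torch.no_grad():                        # guidance features only
        f_lidar = lidar_branch(batch["lidar_points"])

    f_pse, box_pred, conf_pred = pseudo_branch(batch["pseudo_points"])
    loss = pseudo_branch.compute_loss(box_pred, conf_pred, f_pse, f_lidar, batch)

    optimizer.zero_grad()
    loss.backward()                              # gradients flow in the pseudo branch only
    optimizer.step()
    return loss.item()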
In this way, the pre-trained laser point cloud branch guides the generation of the pseudo point cloud target features, so that the encoded pseudo point cloud target features tend to match the laser point cloud target features in distribution. The difference between the data sources is thereby compensated at the target level, which is more efficient than optimizing over the whole point cloud scene and improves detection accuracy.
Fig. 7 shows a flowchart of a specific implementation of the pseudo-point cloud based 3D target detection method according to an embodiment of the present disclosure.
As shown in fig. 7, an exemplary implementation process S70 of the pseudo point cloud based 3D object detection method may include:
step S72, generating first pseudo point cloud data containing category information by using the depth information of the first image and a semantic segmentation result obtained through a semantic segmentation network;
step S74, the first pseudo point cloud data is processed by a 3D candidate frame detection network (for example, RPN) to obtain 3D candidate frame information and corresponding category information;
step S76, extracting first pseudo point cloud data of the target according to the 3D candidate frame information, namely extracting the first pseudo point cloud data of the 3D candidate frame;
step S78, the first pseudo point cloud data of the 3D candidate frame is processed by a feature association network to construct an association relation between pseudo point clouds, so that second pseudo point cloud data of the 3D candidate frame is obtained, the second pseudo point cloud data can simultaneously represent 3D features, category features and internal features of the target, and the internal features can be internal geometric features, internal structural features and/or other features related to the interior of the target;
step S71, the second pseudo point cloud data of the 3D candidate frame is processed by a first target feature coding network to generate a first feature vector of a target corresponding to the 3D candidate frame;
step S73, the first feature vector of the target corresponding to the 3D candidate frame is processed by the first feedforward neural network to perform the 3D detection frame regression, obtaining information such as the length, width and height of the 3D detection frame and the coordinates of its center point in the 3D spatial coordinate system; meanwhile, the same first feature vector is processed by the second feedforward neural network to perform the confidence calculation, obtaining the confidence of the 3D detection frame.
Step S75, determining whether the training phase is currently performed, if so, continuing to step S77, otherwise, ending the current process.
Step S77, if currently in the training stage, determining the feature similarity loss L_KLD according to the first feature vector of the target corresponding to the 3D candidate box obtained in step S71 and the second feature vector of the target corresponding to the 3D candidate box obtained in step S714.
Step S79, performing backward propagation in the pseudo-point cloud branch shown in fig. 6 according to the 3D detection frame information obtained in step S73 and the confidence of the 3D detection frame, and performing gradient backward propagation by using the loss function of the above equation (14) in the backward propagation process, so as to update the parameters of the 3D candidate frame detection network, the feature association network, the first target feature coding network, the first feedforward neural network and the second feedforward neural network, thereby completing the round of training.
Step S711, judging whether the convergence condition is met; if so, the training ends, otherwise, the process returns to step S72 and the next round of training is performed.
If currently in the training stage, the following steps may also be performed in parallel, after step S74 and before step S77:
step S712, extracting laser point cloud data of the target from the environmental laser point cloud data corresponding to the first image according to the 3D candidate frame information, that is, extracting laser point cloud data of the 3D candidate frame;
For example, the first image may be acquired by a sensor such as a vehicle-mounted camera, and the laser point cloud data of the surrounding environment may be acquired by a sensor such as a vehicle-mounted laser radar; since the field of view of the vehicle-mounted laser radar and the field of view of the vehicle-mounted camera at least partially coincide, the environmental laser point cloud data corresponding to the first image can be obtained.
Step S714, the laser point cloud data of the 3D candidate frame is processed by the second target feature coding network to obtain a second feature vector of the target corresponding to the 3D candidate frame, and then the process proceeds to step S77.
The present disclosure provides a more concise way of fusing semantic information. First, a pseudo point cloud containing semantic category information is generated from the depth map and the semantic segmentation result of an image, so that the semantic information is embedded directly in the pseudo point cloud. Second, the association relations of the pseudo point cloud are fully constructed through a feature association network, which builds the spatial association relation among the pseudo points and the channel association relation inside the pseudo points through an attention mechanism combined with 3D position embedding.
The present disclosure also provides a laser point cloud guided first target feature encoding network for pseudo point cloud data, enabling the network to learn a more powerful target feature representation. Laser point clouds are introduced during network training, and two target feature encoding networks encode their respective target candidate point clouds. Pre-trained model parameters are loaded into the second target feature encoding network used for the laser point cloud data and kept unchanged during training, with no gradient back-propagation through it. The laser point cloud feature output by the second target feature encoding network is denoted f_lidar, and the pseudo point cloud target feature output by the first target feature encoding network is denoted f_pse. A feature similarity loss is introduced between the pseudo point cloud target feature and the laser point cloud target feature, so that the representation of the pseudo point cloud target feature tends toward the distribution of the laser point cloud target feature. The difference between the data sources is thereby compensated at the target level, which is more efficient than optimizing over the whole point cloud scene and improves detection accuracy.
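A minimal sketch of such a feature similarity loss L_KLD is shown below; treating each target feature vector as a distribution via a softmax is an assumption, since the disclosure only states that a KL divergence between f_pse and f_lidar is used:

import torch.nn.functional as F

def feature_similarity_loss(f_pse, f_lidar, dim=-1):
    """KL-divergence loss between the pseudo point cloud target feature f_pse
    and the laser point cloud target feature f_lidar; the lidar feature is
    detached so that no gradient flows back into the laser branch."""
    log_p = F.log_softmax(f_pse, dim=dim)
    q = F.softmax(f_lidar.detach(), dim=dim)
    return F.kl_div(log_p, q, reduction="batchmean")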
Fig. 8 is a schematic block diagram of the structure of a pseudo point cloud based target detection apparatus implemented in hardware with a processing system, according to an embodiment of the present disclosure.
The apparatus may include corresponding means for performing each or several of the steps of the flowcharts described above. Thus, each step or several steps in the above-described flow charts may be performed by a respective module, and the apparatus may comprise one or more of these modules. The modules may be one or more hardware modules specifically configured to perform the respective steps, or implemented by a processor configured to perform the respective steps, or stored within a computer-readable medium for implementation by a processor, or by some combination.
As shown in fig. 8, the target detecting apparatus 800 based on the pseudo point cloud may include:
a pseudo-point cloud obtaining unit 802, configured to obtain first pseudo-point cloud data of a first image, where target features represented by the first pseudo-point cloud data include a three-dimensional (3D) feature and a category feature;
a 3D target candidate frame extracting unit 804, configured to obtain 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image;
a candidate frame pseudo point cloud unit 806, configured to obtain first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
a feature association unit 808, configured to obtain second pseudo point cloud data of the 3D candidate box according to the first pseudo point cloud data of the 3D candidate box, where a target feature represented by the second pseudo point cloud data includes a 3D feature, a category feature, and an internal feature of a corresponding target of the 3D candidate box;
a feature encoding unit 810, configured to obtain, according to second pseudo point cloud data of the 3D candidate box, a first feature vector of a target corresponding to the 3D candidate box through feature encoding, where the feature encoding enables a target feature represented by the second pseudo point cloud data to be consistent with a target feature represented by laser point cloud data in distribution;
a detection frame obtaining unit 812, configured to obtain 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, the pseudo point cloud obtaining unit 802 is specifically configured to: acquiring depth information of the first image; obtaining semantic information of the first image; and generating first pseudo point cloud data of the first image through coordinate conversion according to the depth information and the semantic information of the first image.
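For illustration, the coordinate conversion from a depth map and a per-pixel semantic label map to four-dimensional pseudo point cloud data might look as follows; the pinhole back-projection and the camera intrinsics fx, fy, cx, cy are assumptions about a conversion whose exact form the disclosure does not spell out here:

import numpy as np

def depth_to_pseudo_point_cloud(depth, seg_labels, fx, fy, cx, cy):
    """Back-project a depth map (H, W) into camera-frame 3D points and attach
    the per-pixel semantic class as a fourth dimension."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z, seg_labels.astype(np.float32)], axis=-1)
    return points.reshape(-1, 4)                     # (H*W, 4): x, y, z, class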
In some embodiments, the first and/or second pseudo point cloud data has four dimensions, one of which represents a classification of points.
In some embodiments, the feature associating unit 808 is specifically configured to: acquiring spatial association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame; acquiring channel association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame; acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to position information in the first pseudo point cloud data of the 3D candidate frame; and obtaining second pseudo point cloud data of the 3D candidate frame according to the space association feature, the channel association feature and the 3D relative position feature of the target corresponding to the 3D candidate frame.
In some embodiments, the feature associating unit 808 is specifically configured to obtain the second pseudo point cloud data of the 3D candidate box through a pre-trained feature associating network, where the feature associating network includes an inter-point attention module, an inter-channel attention module, a position coding module, and a fusion module, the inter-point attention module is configured to obtain a spatial association feature of an object corresponding to the 3D candidate box, the inter-channel attention module is configured to obtain a channel association feature of an object corresponding to the 3D candidate box, the position coding module is configured to obtain a 3D relative position feature of the object corresponding to the 3D candidate box, and the fusion module is configured to obtain the second pseudo point cloud data of the 3D candidate box.
In some embodiments, the feature encoding unit 810 is specifically configured to: encoding second pseudo-point cloud data of the 3D candidate frame into pseudo-point cloud target feature data of the 3D candidate frame in a key point relative coordinate encoding mode, wherein the pseudo-point cloud target feature data are consistent with laser point cloud target feature data in distribution, and the laser point cloud target feature data are obtained according to laser point cloud data encoding of the 3D candidate frame; and decoding the pseudo point cloud target feature data into a first feature vector with a fixed size.
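A very rough, parameter-free stand-in for the key-point relative coordinate idea is sketched below; the real first target feature coding network is a learned encoder and decoder, so this only illustrates the data flow and is not the disclosed network:

import torch

def encode_relative_to_keypoints(points, keypoints):
    """points: (N, 4) pseudo points of one 3D candidate box (xyz plus class);
    keypoints: (K, 3) key points of the box.  Each point is expressed relative
    to every key point, and a symmetric max-pool yields a fixed-size vector
    regardless of how many points fall inside the box."""
    rel = points[:, None, :3] - keypoints[None, :, :]              # (N, K, 3)
    per_point = torch.cat([rel.flatten(1), points[:, 3:]], dim=1)  # (N, 3K + 1)
    return per_point.max(dim=0).values                             # (3K + 1,)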
In some embodiments, the feature encoding unit 810 is specifically configured to perform feature encoding on the second pseudo point cloud data of the 3D candidate box through a pre-trained first target feature encoding network to obtain a first feature vector of a target corresponding to the 3D candidate box, where the first target feature encoding network includes an encoder and a decoder.
In some embodiments, the first target feature coding network is obtained by training according to a first feature vector and a second feature vector of the target corresponding to the 3D candidate frame, where the second feature vector is obtained by a second target feature coding network according to laser point cloud data corresponding to the 3D candidate frame, and the second target feature coding network has the same structure as the first target feature coding network.
In some embodiments, the loss function of the first target feature coding network includes a feature similarity loss, where the feature similarity loss is obtained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame; preferably, the feature similarity loss is a KL divergence loss between a first feature vector and a second feature vector of a target corresponding to the 3D candidate box.
In some embodiments, the target detection apparatus 800 based on the pseudo point cloud may further include: a confidence calculating unit 814, configured to obtain a confidence of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame.
In some embodiments, the detection box obtaining unit 812 is specifically configured to obtain 3D detection box information based on a pre-trained first feedforward neural network, and/or the confidence calculating unit 814 is specifically configured to obtain the confidence of the 3D detection box based on a pre-trained second feedforward neural network.
In some embodiments, the loss function of the 3D candidate box detection network, the feature correlation network, the first target feature encoding network, the first feed-forward neural network, and/or the second feed-forward neural network is represented by equation (14).
The hardware architecture may be implemented using a bus architecture. The bus architecture may include any number of interconnecting buses and bridges depending on the specific application of the hardware and the overall design constraints. The bus 900 connects together various circuits including one or more processors 1000, memories 1100, and/or hardware modules. The bus 900 may also connect various other circuits 1200 such as peripherals, voltage regulators, power management circuits, external antennas, and the like.
The bus 900 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one connection line is shown, but this does not mean that there is only one bus or only one type of bus.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present disclosure includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the implementations of the present disclosure. The processor performs the various methods and processes described above. For example, method embodiments in the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium, such as a memory. In some embodiments, some or all of the software program may be loaded and/or installed via memory and/or a communication interface. When the software program is loaded into memory and executed by a processor, one or more steps of the method described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the methods described above by any other suitable means (e.g., by means of firmware).
The logic and/or steps represented in the flowcharts or otherwise described herein may be embodied in any readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
For the purposes of this description, a "readable storage medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber device, and a portable Compact Disc Read-Only Memory (CD-ROM). Further, the readable storage medium may even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, for instance via optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a memory.
It should be understood that portions of the present disclosure may be implemented in hardware, software, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps of the method implementing the above embodiments may be implemented by hardware that is related to instructions of a program, and the program may be stored in a readable storage medium, and when executed, the program may include one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present disclosure may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a separate product, may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The present disclosure also provides an electronic device, including: a memory storing execution instructions; and the processor or other hardware module executes the execution instructions stored in the memory, so that the processor or other hardware module executes the target detection method based on the pseudo point cloud.
The disclosure also provides a readable storage medium, in which an execution instruction is stored, and the execution instruction is executed by a processor to implement the above target detection method based on the pseudo point cloud.
In the description herein, reference to the description of the terms "one embodiment/implementation," "some embodiments/implementations," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment/implementation or example is included in at least one embodiment/implementation or example of the present application. In this specification, schematic references to the above terms do not necessarily refer to the same embodiment/implementation or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/implementations or examples. Furthermore, the various embodiments/implementations or examples and the features of the various embodiments/implementations or examples described in this specification can be combined by those skilled in the art without conflict.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
It will be understood by those skilled in the art that the foregoing embodiments are merely for clarity of illustration of the disclosure and are not intended to limit the scope of the disclosure. Other variations or modifications may occur to those skilled in the art, based on the foregoing disclosure, and are still within the scope of the present disclosure.

Claims (10)

1. A target detection method based on a pseudo point cloud is characterized by comprising the following steps:
acquiring first pseudo-point cloud data of a first image, wherein target features represented by the first pseudo-point cloud data comprise three-dimensional (3D) features and category features;
acquiring 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image;
acquiring first pseudo point cloud data of the 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
according to the first pseudo point cloud data of the 3D candidate frame, obtaining second pseudo point cloud data of the 3D candidate frame, wherein target features represented by the second pseudo point cloud data comprise 3D features, category features and internal features of a corresponding target of the 3D candidate frame;
according to the second pseudo point cloud data of the 3D candidate frame, obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding, wherein the feature coding enables the target features represented by the second pseudo point cloud data to be consistent with the target features represented by the laser point cloud data in distribution;
and obtaining 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
2. The method of claim 1, wherein the obtaining the first pseudo point cloud data of the first image comprises:
acquiring depth information of the first image;
obtaining semantic information of the first image;
and generating first pseudo point cloud data of the first image through coordinate conversion according to the depth information and the semantic information of the first image.
3. The method of claim 1 or 2, wherein the first pseudo point cloud data and/or the second pseudo point cloud data have four dimensions, one of the four dimensions representing a category of points.
4. The method for detecting the target based on the pseudo point cloud of claim 1, wherein the obtaining the second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame comprises:
acquiring spatial association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
acquiring channel association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame;
acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to position information in the first pseudo point cloud data of the 3D candidate frame;
obtaining second pseudo point cloud data of the 3D candidate frame according to the space association feature, the channel association feature and the 3D relative position feature of the target corresponding to the 3D candidate frame;
preferably, the second pseudo point cloud data of the 3D candidate frame is obtained through a pre-trained feature association network, where the feature association network includes an inter-point attention module, an inter-channel attention module, a position coding module and a fusion module, the inter-point attention module is configured to obtain the spatial association feature, the inter-channel attention module is configured to obtain a channel association feature of a target corresponding to the 3D candidate frame, the position coding module is configured to obtain a 3D relative position feature of the target corresponding to the 3D candidate frame, and the fusion module is configured to obtain the second pseudo point cloud data of the 3D candidate frame;
preferably, the obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding according to the second pseudo point cloud data of the 3D candidate frame includes: encoding second pseudo-point cloud data of the 3D candidate frame into pseudo-point cloud target feature data of the 3D candidate frame in a key point relative coordinate encoding mode, wherein the pseudo-point cloud target feature data are consistent with laser point cloud target feature data in distribution, and the laser point cloud target feature data are obtained according to laser point cloud data encoding of the 3D candidate frame; decoding the pseudo point cloud target feature data into a first feature vector with a fixed size;
preferably, the obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding according to the second pseudo point cloud data of the 3D candidate frame includes: feature coding the second pseudo point cloud data of the 3D candidate frame through a pre-trained first target feature coding network to obtain a first feature vector of a target corresponding to the 3D candidate frame, wherein the first target feature coding network comprises an encoder and a decoder;
preferably, the first target feature coding network is obtained by training according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame, the second feature vector is obtained by a second target feature coding network according to laser point cloud data corresponding to the 3D candidate frame, and the second target feature coding network has the same structure as the first target feature coding network;
preferably, the loss function of the first target feature coding network includes a feature similarity loss, and the feature similarity loss is obtained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate box;
preferably, the feature similarity loss is a KL divergence loss between a first feature vector and a second feature vector of a target corresponding to the 3D candidate box;
preferably, the method further comprises the following steps: obtaining the confidence of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame;
preferably, the 3D detection box information is obtained based on a pre-trained first feed-forward neural network, and/or the confidence of the 3D detection box is obtained based on a pre-trained second feed-forward neural network;
preferably, the loss function of the 3D candidate box detection network, the feature correlation network, the first target feature coding network, the first feed-forward neural network and/or the second feed-forward neural network is expressed as:
L_total = α_1 L_RPN + α_2 L_conf + α_3 L_reg + α_4 L_KLD

wherein L_total represents the function value of the loss function, α_1, α_2, α_3 and α_4 represent weight coefficients, L_KLD represents the feature similarity loss, L_conf represents the confidence calculation loss of the detection box, L_reg represents the regression loss of the 3D detection frame, and L_RPN represents the loss of the 3D candidate box detection network;
the 3D candidate frame detection network is used for obtaining the 3D candidate frame information, the feature association network is used for obtaining the second pseudo point cloud data of the 3D candidate frame, the first target feature coding network is used for obtaining the first feature vector of the target corresponding to the 3D candidate frame, the first feed-forward neural network is used for obtaining the 3D detection frame information, and the second feed-forward neural network is used for obtaining the confidence degree of the 3D detection frame.
5. An object detection device based on a pseudo-point cloud, comprising:
the pseudo-point cloud obtaining unit is used for obtaining first pseudo-point cloud data of a first image, and target features represented by the first pseudo-point cloud data comprise three-dimensional (3D) features and category features;
the 3D target candidate frame extraction unit is used for acquiring 3D candidate frame information of the first image according to the first pseudo point cloud data of the first image;
a candidate frame pseudo point cloud unit, configured to obtain first pseudo point cloud data of a 3D candidate frame according to the 3D candidate frame information of the first image and the first pseudo point cloud data of the first image;
the feature association unit is used for obtaining second pseudo point cloud data of the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame, and target features represented by the second pseudo point cloud data comprise 3D features, category features and internal features of a corresponding target of the 3D candidate frame;
the feature coding unit is used for obtaining a first feature vector of a target corresponding to the 3D candidate frame through feature coding according to second pseudo point cloud data of the 3D candidate frame, wherein the feature coding can enable target features represented by the second pseudo point cloud data to be consistent with target features represented by laser point cloud data in distribution;
and the detection frame acquisition unit is used for acquiring 3D detection frame information according to the first feature vector of the target corresponding to the 3D candidate frame.
6. The pseudo-point cloud-based target detection device according to claim 5, wherein the pseudo-point cloud obtaining unit is specifically configured to:
acquiring depth information of the first image;
obtaining semantic information of the first image;
and generating first pseudo point cloud data of the first image through coordinate conversion according to the depth information and the semantic information of the first image.
7. The pseudo-point cloud based object detection apparatus according to claim 5 or 6, wherein the first pseudo-point cloud data and/or the second pseudo-point cloud data has four dimensions, one of the four dimensions representing a category of points.
8. The pseudo-point cloud-based target detection device according to claim 5, wherein the feature association unit is specifically configured to: acquiring spatial association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame; acquiring channel association characteristics of a target corresponding to the 3D candidate frame according to the first pseudo point cloud data of the 3D candidate frame; acquiring 3D relative position characteristics of a target corresponding to the 3D candidate frame according to position information in the first pseudo point cloud data of the 3D candidate frame; obtaining second pseudo point cloud data of the 3D candidate frame according to the space association feature, the channel association feature and the 3D relative position feature of the target corresponding to the 3D candidate frame;
preferably, the feature association unit is specifically configured to obtain second pseudo point cloud data of the 3D candidate frame through a pre-trained feature association network, where the feature association network includes an inter-point attention module, an inter-channel attention module, a position coding module, and a fusion module, the inter-point attention module is configured to obtain spatial association features of a target corresponding to the 3D candidate frame, the inter-channel attention module is configured to obtain channel association features of the target corresponding to the 3D candidate frame, the position coding module is configured to obtain 3D relative position features of the target corresponding to the 3D candidate frame, and the fusion module is configured to obtain second pseudo point cloud data of the 3D candidate frame;
preferably, the feature encoding unit is specifically configured to: encoding the second pseudo point cloud data of the 3D candidate frame into pseudo point cloud target characteristic data of the 3D candidate frame in a key point relative coordinate encoding mode, wherein the pseudo point cloud target characteristic data are consistent with laser point cloud target characteristic data in distribution, and the laser point cloud target characteristic data are obtained according to the laser point cloud data of the 3D candidate frame in an encoding mode; decoding the pseudo point cloud target feature data into a first feature vector with a fixed size;
preferably, the feature encoding unit is specifically configured to perform feature encoding on the second pseudo point cloud data of the 3D candidate box through a pre-trained first target feature encoding network to obtain a first feature vector of a target corresponding to the 3D candidate box, where the first target feature encoding network includes an encoder and a decoder;
preferably, the first target feature coding network is obtained by training according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame, the second feature vector is obtained by a second target feature coding network according to laser point cloud data corresponding to the 3D candidate frame, and the second target feature coding network has the same structure as the first target feature coding network;
preferably, the loss function of the first target feature coding network includes a feature similarity loss, and the feature similarity loss is obtained according to a first feature vector and a second feature vector of a target corresponding to the 3D candidate frame;
preferably, the feature similarity loss is a KL divergence loss between a first feature vector and a second feature vector of a target corresponding to the 3D candidate box;
preferably, the method further comprises the following steps: the confidence coefficient calculation unit is used for obtaining the confidence coefficient of the 3D detection frame according to the first feature vector of the target corresponding to the 3D candidate frame;
preferably, the detection frame acquiring unit is specifically configured to acquire 3D detection frame information based on a pre-trained first feedforward neural network, and/or the confidence coefficient calculating unit is specifically configured to acquire the confidence coefficient of the 3D detection frame based on a pre-trained second feedforward neural network;
preferably, the loss function of the 3D candidate box detection network, the feature correlation network, the first target feature coding network, the first feed-forward neural network and/or the second feed-forward neural network is expressed as:
L_total = α_1 L_RPN + α_2 L_conf + α_3 L_reg + α_4 L_KLD

wherein L_total represents the function value of the loss function, α_1, α_2, α_3 and α_4 represent weight coefficients, L_KLD represents the feature similarity loss, L_conf represents the confidence calculation loss of the detection box, L_reg represents the regression loss of the 3D detection frame, and L_RPN represents the loss of the 3D candidate box detection network;
the 3D candidate frame detection network is used for obtaining the 3D candidate frame information, the feature association network is used for obtaining the second pseudo point cloud data of the 3D candidate frame, the first target feature coding network is used for obtaining the first feature vector of the target corresponding to the 3D candidate frame, the first feed-forward neural network is used for obtaining the 3D detection frame information, and the second feed-forward neural network is used for obtaining the confidence degree of the 3D detection frame.
9. An electronic device, comprising:
a memory storing execution instructions; and
a processor executing execution instructions stored by the memory to cause the processor to perform the pseudo-point cloud based object detection method of any one of claims 1 to 4.
10. A readable storage medium, wherein an execution instruction is stored in the readable storage medium, and when executed by a processor, the execution instruction is used for implementing the pseudo point cloud based target detection method according to any one of claims 1 to 4.
CN202210508913.9A 2022-05-10 2022-05-10 Target detection method and device based on pseudo point cloud, electronic equipment and storage medium Active CN114842313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508913.9A CN114842313B (en) 2022-05-10 2022-05-10 Target detection method and device based on pseudo point cloud, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210508913.9A CN114842313B (en) 2022-05-10 2022-05-10 Target detection method and device based on pseudo point cloud, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114842313A true CN114842313A (en) 2022-08-02
CN114842313B CN114842313B (en) 2024-05-31

Family

ID=82570785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508913.9A Active CN114842313B (en) 2022-05-10 2022-05-10 Target detection method and device based on pseudo point cloud, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114842313B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10929694B1 (en) * 2020-01-22 2021-02-23 Tsinghua University Lane detection method and system based on vision and lidar multi-level fusion
JP2021189917A (en) * 2020-06-02 2021-12-13 株式会社Zmp Object detection system, object detection method, and object detection program
CN112613378A (en) * 2020-12-17 2021-04-06 上海交通大学 3D target detection method, system, medium and terminal
CN113034592A (en) * 2021-03-08 2021-06-25 西安电子科技大学 Three-dimensional scene target detection modeling and detection method based on natural language description
CN113569803A (en) * 2021-08-12 2021-10-29 中国矿业大学(北京) Multi-mode data fusion lane target detection method and system based on multi-scale convolution
CN113780257A (en) * 2021-11-12 2021-12-10 紫东信息科技(苏州)有限公司 Multi-mode fusion weak supervision vehicle target detection method and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641462A (en) * 2022-12-26 2023-01-24 电子科技大学 Radar image target identification method
CN115641462B (en) * 2022-12-26 2023-03-17 电子科技大学 Radar image target identification method

Also Published As

Publication number Publication date
CN114842313B (en) 2024-05-31

Similar Documents

Publication Publication Date Title
Batsos et al. Recresnet: A recurrent residual cnn architecture for disparity map enhancement
CN115223117B (en) Training and using method, device, medium and equipment of three-dimensional target detection model
CN113191372B (en) Construction method and application of ship target directional detection model
CN112699806A (en) Three-dimensional point cloud target detection method and device based on three-dimensional heat map
EP4075382A1 (en) A method for training a neural network to deliver the viewpoints of objects using pairs of images under different viewpoints
Zhang et al. A regional distance regression network for monocular object distance estimation
CN114842313B (en) Target detection method and device based on pseudo point cloud, electronic equipment and storage medium
CN116681885A (en) Infrared image target identification method and system for power transmission and transformation equipment
CN116152334A (en) Image processing method and related equipment
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
Tan et al. 3D detection transformer: Set prediction of objects using point clouds
CN114140497A (en) Target vehicle 3D real-time tracking method and system
Xu et al. FP-RCNN: A Real-Time 3D Target Detection Model based on Multiple Foreground Point Sampling for Autonomous Driving
CN112712062A (en) Monocular three-dimensional object detection method and device based on decoupling truncated object
TWI826201B (en) Object detection method, object detection apparatus, and non-transitory storage medium
Weber et al. Learning implicit depth information for monocular 3d object detection
CN117011685B (en) Scene recognition method and device and electronic device
CN116824308B (en) Image segmentation model training method and related method, device, medium and equipment
CN115880360A (en) Box body identification method, system, equipment and storage medium based on multi-view vision
CN115330850A (en) Sparse image depth completion method, system and equipment
Danapal Autonomous driving environment sensing through a feature-level fusion architecture using radar and camera
CN117197772A (en) Industrial AGV trolley obstacle detection method through point cloud detection
Taghavi et al. SwinMTL: A Shared Architecture for Simultaneous Depth Estimation and Semantic Segmentation from Monocular Camera Images
Chen et al. Efficient Lightweight Railway Track Segmentation Network for Resource-Constrained Platforms with TensorRT

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant