CN115018910A - Method and device for detecting target in point cloud data and computer readable storage medium - Google Patents


Publication number
CN115018910A
Authority
CN
China
Prior art keywords
point, relative, key, points, vector
Prior art date
Legal status
Pending
Application number
CN202210409033.6A
Other languages
Chinese (zh)
Inventor
潘滢炜
李栋
邱钊凡
姚霆
梅涛
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd
Priority to CN202210409033.6A
Publication of CN115018910A
Priority to PCT/CN2023/087273 (published as WO2023202401A1)
Legal status: Pending

Classifications

    • G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods (G Physics › G06 Computing; Calculating or Counting › G06T Image data processing or generation, in general › G06T7/00 Image analysis › G06T7/70 Determining position or orientation of objects or cameras)
    • G06T2207/10028 — Range image; depth image; 3D point clouds (G06T2207/00 Indexing scheme for image analysis or image enhancement › G06T2207/10 Image acquisition modality)


Abstract

The disclosure relates to a method and device for detecting targets in point cloud data and a computer-readable storage medium, and relates to the field of computer technologies. The method of the present disclosure comprises: inputting point cloud data into a point cloud feature extraction network to obtain a plurality of key points in the point cloud data and the feature information of each key point; for each key point, encoding the feature information of the key point according to the relevance between the key point and the other key points within a preset range of the key point to obtain a first feature code of the key point; classifying each key point, and taking the points classified as target centers as reference center points; for each reference center point, encoding the first feature code of the reference center point according to the relevance between the reference center point and the other reference center points to obtain a second feature code of the reference center point; and predicting the position and category of each target in the point cloud data according to the second feature codes of the reference center points.

Description

Method and device for detecting target in point cloud data and computer readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for detecting a target in point cloud data, and a computer-readable storage medium.
Background
3D (three-dimensional) target detection aims to identify and locate objects appearing in a 3D point cloud and has been widely applied in the fields of autonomous driving and augmented reality. In contrast to 2D images, 3D point clouds can provide the geometry of objects and capture the 3D structure of a scene.
Disclosure of Invention
The inventors have found that, due to the three-dimensional nature and irregularity of point clouds, a point cloud cannot be processed directly by a powerful deep learning model such as a convolutional neural network, so a dedicated 3D feature learning technique is required to identify the targets in point cloud data.
One technical problem to be solved by the present disclosure is to provide a method for detecting targets in point cloud data that improves the accuracy of target detection in point cloud data.
According to some embodiments of the present disclosure, there is provided a method for detecting targets in point cloud data, including: inputting point cloud data into a point cloud feature extraction network to obtain a plurality of key points in the point cloud data and the feature information of each key point; for each key point, encoding the feature information of the key point according to the relevance between the key point and the other key points within a preset range of the key point to obtain a first feature code of the key point; classifying each key point, and taking the points classified as target centers as reference center points; for each reference center point, encoding the first feature code of the reference center point according to the relevance between the reference center point and the other reference center points to obtain a second feature code of the reference center point; and predicting the position and category of each target in the point cloud data according to the second feature codes of the reference center points.
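For intuition, the overall flow of these five steps can be sketched in PyTorch as follows. This is a minimal illustration only; the names `backbone`, `local_encoder`, `center_classifier`, `global_encoder` and `detection_head` are placeholders for the networks described below, not names used in the present disclosure.

```python
# Illustrative end-to-end sketch; the individual modules are described
# (and sketched) in the detailed embodiments below.
import torch

def detect_targets(points, backbone, local_encoder, center_classifier,
                   global_encoder, detection_head):
    """points: (N, 3) tensor of XYZ coordinates of the input point cloud."""
    # Step 1: downsample and extract per-key-point features.
    xyz, feats = backbone(points)                      # (P, 3), (P, C)
    # Step 2: encode each key point against its neighbors (first feature code).
    first_codes = local_encoder(xyz, feats)            # (P, C)
    # Step 3: keep only key points classified as target centers.
    is_center = center_classifier(first_codes).argmax(dim=-1) == 1
    ref_xyz, ref_codes = xyz[is_center], first_codes[is_center]
    # Step 4: encode each reference center against all others (second code).
    second_codes = global_encoder(ref_xyz, ref_codes)  # (M, C)
    # Step 5: predict one box and one class per reference center point.
    return detection_head(second_codes)
```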
In some embodiments, for each keypoint, encoding feature information of the keypoint according to a relevance between the keypoint and another keypoint within a preset range of the keypoint, and obtaining a first feature code of the keypoint includes: and for each key point, determining a first feature code of the key point based on a self-attention mechanism according to the feature information of the key point, the feature information of other key points in the preset range of the key point and the relative position relationship between the key point and other key points in the preset range of the key point.
In some embodiments, for each keypoint, determining, according to the feature information of the keypoint, the feature information of other keypoints within the preset range of keypoints, and the relative positional relationship between the keypoint and other keypoints within the preset range of keypoints, a first feature code of the keypoint based on a self-attention mechanism includes: for each key point, inputting the feature information and the position information of the key point, and the feature information and the position information of other key points in a preset range of the key point into a first self-attention module of an encoder in a first conversion model; in a first self-attention module, taking other key points in a preset range of the key points as relative points, respectively inputting the position information of the key points and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key points and the relative points; determining a key vector and a value vector of the relative point according to the product of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively; determining a query vector of the key point according to the product of the feature information of the key point and a query matrix in a first self-attention module; and determining a first feature code of the key point according to the first relative position code, the second relative position code and the third relative position code of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
In some embodiments, determining the first feature code of the keypoint based on the first relative position code, the second relative position code and the third relative position code of the keypoint and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the keypoint comprises: for each relative point, taking the sum of the first position codes of the key point and the relative point and the query vector of the key point as a modified query vector of the key point; taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point; the sum of the key point and the third position code of the relative point and the value vector of the relative point is used as the correction value vector of the relative point; inputting the product of the correction query vector of the key point and the correction key vector of the relative point and the dimensionality of the feature information of the key point into a first normalization layer to obtain the weight of the relative point; and carrying out weighted summation on the correction value vector of each relative point according to the weight of each relative point to obtain the first feature code of the key point.
In some embodiments, the first position-coding layer, the second position-coding layer, and the third position-coding layer are a first feedforward network, a second feedforward network, and a third feedforward network, respectively, and inputting the position information of the key point and the position information of the opposite point into the first position-coding layer, the second position-coding layer, and the third position-coding layer, respectively, includes: and respectively inputting the difference between the coordinates of the key point and the coordinates of the relative point into a first feedforward network, a second feedforward network and a third feedforward network.
In some embodiments, for each keypoint, encoding feature information of the keypoint according to a correlation between the keypoint and another keypoint within a preset range of the keypoint, and obtaining a first feature code of the keypoint includes: and for each key point, determining a first feature code of the key point based on a self-attention mechanism according to the feature information of the key point, the feature information of other key points in the preset range of the key point, the relative position relationship between the key point and other key points in the preset range of the key point and the relative geometric structure relationship between the key point and other key points in the preset range of the key point.
In some embodiments, for each keypoint, determining, according to the feature information of the keypoint, the feature information of other keypoints within the preset range of keypoints, the relative positional relationship between the keypoint and other keypoints within the preset range of keypoints, and the relative geometric relationship between the keypoint and other keypoints within the preset range of keypoints, a first feature code of the keypoint based on a self-attention mechanism includes: for each key point, inputting the feature information, the position information and the geometric structure information of the key point, and the feature information, the position information and the geometric structure information of other key points in a preset range of the key point into a first self-attention module of an encoder in a first conversion model; in a first self-attention module, taking other key points in a preset range of the key points as relative points, respectively inputting the position information of the key points and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key points and the relative points; inputting the geometric structure information of the key point and the geometric structure information of the relative point into a geometric structure coding layer, and determining the relative geometric structure weight of the key point and the relative point; determining a key vector and a value vector of the relative point according to the product of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively; determining a query vector of the key point according to the product of the feature information of the key point and a query matrix in the first self-attention module; and determining a first feature code of the key point according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
In some embodiments, determining the first feature code of the keypoint based on the first relative position code, the second relative position code, the third relative position code, and the relative geometry weight of the keypoint and each of the relative points, the key vector of each of the relative points, the value vector of each of the relative points, and the query vector of the keypoint comprises: for each relative point, taking the sum of the first position codes of the key point and the relative point and the query vector of the key point as a modified query vector of the key point; taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point; the sum of the key point and the third position code of the relative point and the value vector of the relative point is used as the correction value vector of the relative point; inputting the product of the correction query vector of the key point and the correction key vector of the relative point, the relative geometric structure weight of the key point and the relative point and the dimensionality of the feature information of the key point into a first normalization layer to obtain the weight of the relative point; and carrying out weighted summation on the correction value vector of each relative point according to the weight of each relative point to obtain the first feature code of the key point.
In some embodiments, inputting the product of the correction query vector of the key point and the correction key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into a first normalization layer to obtain the weight of the relative point includes: dividing the product of the correction query vector of the key point and the correction key vector of the relative point by the square root of the dimension of the feature information of the key point, superposing the result with the relative geometric structure weight of the key point and the relative point, and inputting the obtained result into the first normalization layer to obtain the weight of the relative point.
In some embodiments, the geometric structure information includes at least one of a normal vector of the local plane and a curvature radius of the local plane, and determining the relative geometric structure weight of the key point and the relative point includes: determining the relative geometric structure weight of the key point and the relative point according to at least one of: the distance between the key point and the relative point; the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located; the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located; and the included angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located.
In some embodiments, the relative geometric structure weight of the key point and the relative point decreases as the distance between the key point and the relative point increases; it increases as the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located increases; it increases as the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located increases; and it increases as the result of passing the included angle between the two normal vectors through the feature propagation layer increases.
In some embodiments, for each reference center point, encoding the first feature code of the reference center point according to the association between the reference center point and other reference center points, and obtaining the second feature code of the reference center point includes: and for each reference central point, determining a second feature code of the reference central point based on a self-attention mechanism according to the first feature code of the reference central point, the first feature codes of other reference central points and the relative position relationship between the reference central point and other reference central points.
In some embodiments, for each reference center point, determining the second feature code of the reference center point based on the self-attention mechanism according to the first feature code of the reference center point, the first feature codes of the other reference center points, and the relative positional relationship between the reference center point and the other reference center points comprises: for each reference center point, inputting the first feature codes and the position information of the reference center point and the first feature codes and the position information of other reference center points into a second self-attention module of an encoder in a second conversion model; in the second self-attention module, taking other reference center points as relative center points, respectively inputting the position information of the reference center point and the position information of the relative center point into a fourth position coding layer, a fifth position coding layer and a sixth position coding layer aiming at each relative center point, and determining a fourth relative position code, a fifth relative position code and a sixth relative position code of the reference center point and the relative center point; determining a key vector and a value vector of the relative central point according to the product of the characteristic information of the relative central point and a key matrix and a value matrix in a second self-attention module respectively; determining a query vector of the reference center point according to the product of the feature information of the reference center point and a query matrix in a second self-attention module; and determining a second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point and the query vector of the reference center point.
In some embodiments, determining the second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point includes: for each relative center point, taking the sum of the fourth position code of the reference center point and the relative center point, and the query vector of the reference center point, as the correction query vector of the reference center point; taking the sum of the fifth position code of the reference center point and the relative center point, and the key vector of the relative center point, as the correction key vector of the relative center point; taking the sum of the sixth position code of the reference center point and the relative center point, and the value vector of the relative center point, as the correction value vector of the relative center point; inputting the product of the correction query vector of the reference center point and the correction key vector of the relative center point, together with the dimension of the first feature code of the reference center point, into a second normalization layer to obtain the weight of the relative center point; and performing weighted summation on the correction value vectors of the relative center points according to the weights of the relative center points to obtain the second feature code of the reference center point.
In some embodiments, the fourth, fifth and sixth position-coding layers are a fourth, fifth and sixth feedforward network, respectively, and inputting the position information of the reference center point and the position information of the relative center point into the fourth, fifth and sixth position-coding layers respectively includes: inputting the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth, fifth and sixth feedforward networks, respectively.
In some embodiments, classifying each key point and determining the points classified as target centers as reference center points includes: for each key point, inputting the first feature code of the key point into a classification network to obtain a classification result of the key point; and determining whether the key point is a target-center point according to the classification result.
In some embodiments, the classification network is trained using the position information of each key point, with label information, as training data, wherein for each key point the label information marks it as a target-center point if the key point is located within the bounding box of a target and is the point closest to the center of that target.
In some embodiments, predicting the position and category of each target in the point cloud data from the second feature codes of the reference center points includes: inputting the second feature code of each reference center point into a decoder in the second conversion model to obtain the feature vector of each reference center point; and inputting the feature vector of each reference center point into a target detection network to obtain the positions and categories of the targets in the point cloud data.
In some embodiments, for each key point, the other key points within the preset range of the key point are determined as follows: for each key point, the other key points are sorted in ascending order of their distance to the key point, and a preset number of them are selected from the front of the ordering as the other key points within the preset range of the key point.
According to other embodiments of the present disclosure, there is provided an apparatus for detecting an object in point cloud data, including: the characteristic extraction module is used for inputting the point cloud data into a point cloud characteristic extraction network to obtain a plurality of key points in the output point cloud data and characteristic information of each key point; the first coding module is used for coding the feature information of each key point according to the relevance between the key point and other key points in the preset range of the key point to obtain a first feature code of the key point; the classification module is used for classifying all the key points, determining the points classified as the target center and taking the points as reference center points; the second coding module is used for coding the first feature code of each reference central point according to the relevance between the reference central point and other reference central points to obtain a second feature code of the reference central point; and the target detection module is used for predicting the position and the category of each target in the point cloud data according to the second feature codes of each reference central point.
According to still other embodiments of the present disclosure, an apparatus for detecting an object in point cloud data is provided, including: a processor; and a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform a method of detecting an object in point cloud data as in any of the preceding embodiments.
According to still further embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program, wherein the program, when executed by a processor, implements the steps of the method of detecting an object in point cloud data of any of the foregoing embodiments.
According to still further embodiments of the present disclosure, there is provided an article sorting apparatus, including: the device for detecting targets in point cloud data of any of the foregoing embodiments, and a sorting component, wherein the sorting component is used for sorting targets according to the positions and categories of the targets in the point cloud data output by the device for detecting targets in point cloud data.
In some embodiments, the apparatus further comprises: and the point cloud acquisition component is used for acquiring point cloud data of a preset area and sending the point cloud data to the detection device of the target in the point cloud data.
In the method, the key points of the point cloud data and the feature information of each key point are extracted. For each key point, a first feature code is determined according to the relevance between the key point and the other key points within its preset range; the first feature code reflects the relevance between points within a local area inside a target. The key points are then divided into target-center points and non-target-center points, and the target-center points are taken as reference center points. For each reference center point, a second feature code is determined according to the relevance between the reference center point and the other reference center points, so that the second feature code adds the relevance between targets on top of the relevance between points within the local area inside a target. Finally, the position and category of each target in the point cloud data are predicted from the second feature codes of the reference center points. In this scheme, the relevance between all pairs of points in the point cloud is no longer modeled; instead, the relevance between points is divided into intra-target relevance and inter-target relevance, so that the local and the global dependencies in the point cloud are captured simultaneously. This adapts to the three-dimensional nature and irregularity of point cloud data, improves the accuracy of target detection in point cloud data, and can also improve detection efficiency and save computation cost.
Other features of the present disclosure and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 illustrates a flow diagram of a method of detection of a target in point cloud data of some embodiments of the present disclosure.
Fig. 2 shows a schematic diagram of a model for detecting targets in point cloud data according to further embodiments of the present disclosure.
Fig. 3 illustrates a schematic structural diagram of an apparatus for detecting a target in point cloud data according to some embodiments of the present disclosure.
Fig. 4 shows a schematic structural diagram of an apparatus for detecting an object in point cloud data according to further embodiments of the present disclosure.
Fig. 5 shows a schematic structural diagram of an apparatus for detecting an object in point cloud data according to still other embodiments of the disclosure.
Fig. 6 illustrates a schematic structural view of an article sorting apparatus according to some embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
The present disclosure provides a method for detecting a target in point cloud data, which is described below with reference to fig. 1.
Fig. 1 is a flow chart of some embodiments of a method for detecting targets in point cloud data according to the present disclosure. As shown in Fig. 1, the method of this embodiment includes steps S102 to S110.
In step S102, the point cloud data is input to a point cloud feature extraction network, and a plurality of key points in the output point cloud data and feature information of each key point are obtained.
Given a point cloud of N points with XYZ coordinates as input, the point cloud feature extraction network downsamples the point cloud and learns a deep feature for each point, outputting a subset of points, each represented by a C-dimensional feature (C is a positive integer); these points are regarded as the key points.
The point cloud feature extraction network is, for example, VoxelNet, PointNet++ or 3DSSD; it is not limited to these examples and is used to extract the key points in the point cloud data and the feature information of the key points. For example, a PointNet++ network is used as the point cloud feature extraction network. Taking point cloud data containing N points as input and following an encoder-decoder structure, the input point cloud is first downsampled to 1/8 of the input resolution (i.e., N/8 points) by 4 set abstraction layers, then upsampled to 1/2 of the input resolution (i.e., N/2 points) by feature propagation layers, and each point is represented by a C-dimensional feature. The set of key points is denoted, for example, as $\{(x_i, f_i)\}_{i=1}^{N/2}$, where $f_i$, the feature information of the i-th key point, is a feature vector. The number of key points and the sampling manner are not limited to the above examples and are determined according to the model and test results of the actual application.
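As a concrete illustration of this downsampling step, the following is a minimal farthest point sampling sketch in PyTorch. It is an assumption for exposition only: the set abstraction layers of PointNet++ also aggregate local features around each sampled point rather than merely selecting coordinates.

```python
import torch

def farthest_point_sample(xyz: torch.Tensor, m: int) -> torch.Tensor:
    """Pick m well-spread points from xyz (N, 3); returns their indices.
    A simple stand-in for the set-abstraction downsampling described above."""
    n = xyz.shape[0]
    idx = torch.zeros(m, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(n, (1,)).item()
    for i in range(m):
        idx[i] = farthest
        d = ((xyz - xyz[farthest]) ** 2).sum(dim=-1)
        dist = torch.minimum(dist, d)      # distance to nearest chosen point
        farthest = int(dist.argmax())      # next pick: the most isolated point
    return idx
```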
In step S104, for each key point, the feature information of the key point is encoded according to the relevance between the key point and other key points in the preset range of the key point, so as to obtain a first feature code of the key point.
For example, for each key point, the other key points are sorted in ascending order of their distance to the key point, and a preset number of them are selected from the front of the ordering as the other key points within the preset range of the key point. That is, for each key point, the K key points closest to it serve as the other key points within the preset range (local area) corresponding to the key point.
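A minimal sketch of this K-nearest-neighbor selection, assuming the key point coordinates are held in a (P, 3) tensor:

```python
import torch

def knn_neighbors(xyz: torch.Tensor, k: int) -> torch.Tensor:
    """For each key point, indices of its K nearest key points (excluding itself).
    xyz: (P, 3) key point coordinates; returns a (P, K) index tensor."""
    d = torch.cdist(xyz, xyz)                  # (P, P) pairwise distances
    d.fill_diagonal_(float("inf"))             # a point is not its own neighbor
    return d.topk(k, largest=False).indices    # K smallest distances per row
```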
In some embodiments, for each key point, a first feature code of the key point is determined based on a self-attention mechanism according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, and the relative positional relationship between the key point and the other key points within the preset range of the key point. Based on the self-attention mechanism, the importance or contribution of the other key points within the preset range relative to the encoding of the key point can be determined, and the features of the key point are described in combination with the features of the other key points within the preset range during encoding, which improves the accuracy of the feature expression of the key point. In addition, introducing the relative positional relationship between the key point and the other key points within its preset range into the attention mechanism further improves the accuracy of the feature expression and thus the accuracy of target detection.
Further, in some embodiments, for each keypoint, inputting feature information and position information of the keypoint, feature information and position information of other keypoints within the preset range of the keypoint, into a first self-attention module of an encoder in the first conversion model; in a first self-attention module, taking other key points in a preset range of the key points as relative points, respectively inputting the position information of the key points and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key points and the relative points; determining a key vector and a value vector of the relative point according to the product of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively; determining a query vector of the key point according to the product of the feature information of the key point and a query matrix in the first self-attention module; and determining a first feature code of the key point according to the first relative position code, the second relative position code and the third relative position code of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
Further, in some embodiments, for each relative point, the sum of the first location code of the keypoint and the relative point and the query vector of the keypoint is taken as the revised query vector of the keypoint; taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point; the sum of the key point and the third position code of the relative point and the value vector of the relative point is used as the correction value vector of the relative point; inputting the product of the correction query vector of the key point and the correction key vector of the relative point and the dimension of the feature information of the key point into a first normalization layer to obtain the weight of the relative point; and carrying out weighted summation on the correction value vector of each relative point according to the weight of each relative point to obtain the first feature code of the key point.
For example, the first transformation model is a Transformer model, which may be referred to as a Local Transformer since it is used to determine the correlation between target interior points. The first conversion model may include an encoder portion in a Transformer. Further, for example, the first position encoding layer, the second position encoding layer, and the third position encoding layer are a First Feedforward Network (FFN), a second feedforward network, and a third feedforward network, respectively, and differences between the coordinates of the key point and the coordinates of the opposite point are input to the first feedforward network, the second feedforward network, and the third feedforward network, respectively, to determine a first relative position encoding, a second relative position encoding, and a third relative position encoding of the key point and the opposite point. The first normalization layer is for example a softmax layer.
For example, $\{f_i\}$ and $\{x_i\}$ respectively denote the feature vectors and the position coordinates (each $x_i$ is a vector) of the N/2 key points input to the encoder of the Local Transformer. In the Local Transformer, for any key point $x_i$, the K key points nearest to it are selected as its corresponding local area, and these points are input into the Local Transformer module so that all key points belonging to the same local area are modeled jointly:

$$\hat{f}_i = \sum_{j \in \mathcal{N}(i)} \mathrm{softmax}\left(\frac{\left(f_i W_Q^L + \mathrm{PE}_Q(x_i, x_j)\right)\left(f_j W_K^L + \mathrm{PE}_K(x_i, x_j)\right)^\top}{\sqrt{C}}\right)\left(f_j W_V^L + \mathrm{PE}_V(x_i, x_j)\right) \quad (1)$$

In formula (1), $\mathcal{N}(i)$ is the set of the K key points nearest to key point i, and $\hat{f}_i$ is the output of the first self-attention module of the encoder of the Local Transformer; the encoder may also include an FFN (feedforward neural network) after the first self-attention module. If no FFN follows the first self-attention module, $\hat{f}_i$ may represent the first feature code; otherwise, the output of $\hat{f}_i$ after the FFN represents the first feature code. $W_Q^L$ is the query matrix in the first self-attention module, $W_K^L$ is the key matrix in the first self-attention module, $W_V^L$ is the value matrix in the first self-attention module, and C is the dimension of the feature information of the key points. $\mathrm{PE}_Q(\cdot)$, $\mathrm{PE}_K(\cdot)$ and $\mathrm{PE}_V(\cdot)$ respectively denote the functions corresponding to the first, second and third position-coding layers. $\mathrm{PE}(x_i, x_j)$ denotes the relative position code of $x_i$ and $x_j$ and can be expressed by the following formula:

$$\mathrm{PE}(x_i, x_j) = \mathrm{FFN}(x_i - x_j) = (x_i - x_j) W_{PE} + b_{PE} \quad (2)$$

$W_{PE}$ and $b_{PE}$ are parameters of the FFN (feedforward network or feature propagation layer); $\mathrm{PE}_Q$, $\mathrm{PE}_K$ and $\mathrm{PE}_V$ each have their own $W_{PE}$ and $b_{PE}$.
A multi-head attention mechanism may be applied to the encoder: each attention head determines the codes of the key points according to formulas (1) and (2), and the codes from all attention heads are concatenated and multiplied by a preset matrix (the product may further pass through an FFN) to obtain the first feature code of the key point. The query matrix, key matrix, value matrix and the parameters of the first, second and third position-coding layers differ across attention heads.
The first feature code output by the Local Transformer contains the context information of the local area where the key point is located, namely the relevance between points inside the target.
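The following single-head PyTorch sketch illustrates the self-attention of formulas (1) and (2) with the three relative position encodings; the multi-head extension and the encoder's subsequent FFN are omitted for brevity, and the layer shapes are assumptions.

```python
import torch
import torch.nn as nn

class RelPosSelfAttention(nn.Module):
    """Single-head sketch of formula (1): query, key and value are each
    corrected by their own FFN encoding of x_i - x_j (formula (2)), and
    attention is restricted to the K nearest key points."""
    def __init__(self, c: int):
        super().__init__()
        self.wq = nn.Linear(c, c, bias=False)   # query matrix W_Q
        self.wk = nn.Linear(c, c, bias=False)   # key matrix W_K
        self.wv = nn.Linear(c, c, bias=False)   # value matrix W_V
        # three independent position-coding layers (first/second/third FFN)
        self.pe_q = nn.Linear(3, c)
        self.pe_k = nn.Linear(3, c)
        self.pe_v = nn.Linear(3, c)
        self.scale = c ** 0.5

    def forward(self, xyz, feats, nbr_idx):
        # xyz: (P, 3), feats: (P, C), nbr_idx: (P, K) from knn_neighbors
        rel = xyz[:, None, :] - xyz[nbr_idx]             # (P, K, 3) = x_i - x_j
        q = self.wq(feats)[:, None, :] + self.pe_q(rel)  # correction query (P, K, C)
        k = self.wk(feats)[nbr_idx] + self.pe_k(rel)     # correction keys  (P, K, C)
        v = self.wv(feats)[nbr_idx] + self.pe_v(rel)     # correction values(P, K, C)
        attn = (q * k).sum(dim=-1) / self.scale          # (P, K) attention logits
        attn = attn.softmax(dim=-1)                      # first normalization layer
        return (attn[..., None] * v).sum(dim=1)          # (P, C) first feature code
```

In the full model this block would sit inside a Transformer encoder layer, followed by the FFN and extended to multiple heads as described above.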
In step S106, each of the key points is classified, and a point classified as a target center is determined as a reference center point.
In some embodiments, for each key point, inputting the first feature code of the key point into a classification network to obtain a classification result of the key point; and determining whether the key point is the point of the target center according to the classification result.
In some embodiments, the classification network is trained using the position information of each key point, with label information, as training data, wherein for each key point the label information marks it as a target-center point if the key point is located within the bounding box of a target and is the point closest to the center of that target.
The key points output by the first conversion model are dense, and each point does not necessarily represent a single target (object). To reduce the redundancy of the final detection result, all key points are screened and only the key points at target centers are retained. It is therefore necessary to determine whether each key point is a real target center. During training, a label is assigned to each key point: if a key point is located within the bounding box of a target and is the point closest to the center of that target, it is assigned a positive label; otherwise, it is assigned a negative label. A binary classification network is trained according to the labels of the key points. During testing, all key points are input into the binary classification network, and only the key points with positive classification results are retained as reference center points.
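A sketch of this label assignment together with a minimal binary classification head; the box-membership test is assumed to be precomputed as a boolean mask, and the layer widths (and C = 256) are assumptions.

```python
import torch
import torch.nn as nn

# A minimal binary center-classification head (the disclosure only states
# that a two-class network is trained; width 64 and C = 256 are assumptions).
center_head = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

def assign_center_labels(xyz: torch.Tensor, centers: torch.Tensor,
                         inside: torch.Tensor) -> torch.Tensor:
    """Training labels as described above. xyz: (P, 3) key points,
    centers: (B, 3) ground-truth box centers, inside: (P, B) bool mask with
    inside[p, b] = True iff key point p lies within bounding box b.
    A key point gets a positive label iff it is inside a box and is that
    box's in-box key point closest to the box center."""
    labels = torch.zeros(xyz.shape[0], dtype=torch.long)
    d = torch.cdist(xyz, centers)                  # (P, B) point-center distances
    d = d.masked_fill(~inside, float("inf"))       # only in-box points compete
    nearest = d.argmin(dim=0)                      # best key point for each box
    has_point = torch.isfinite(d[nearest, torch.arange(centers.shape[0])])
    labels[nearest[has_point]] = 1                 # positive labels
    return labels
```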
In step S108, for each reference center point, the first feature code of the reference center point is encoded according to the relevance between the reference center point and other reference center points, so as to obtain the second feature code of the reference center point.
In some embodiments, for each reference center point, the second feature code of the reference center point is determined based on a self-attention mechanism according to the first feature code of the reference center point, the first feature codes of the other reference center points, and the relative positional relationship between the reference center point and the other reference center points.
Further, in some embodiments, for each reference center point, the first feature codes and position information of the reference center point, and the first feature codes and position information of other reference center points are input into the second self-attention module of the encoder in the second conversion model; in the second self-attention module, taking other reference center points as relative center points, respectively inputting the position information of the reference center point and the position information of the relative center point into a fourth position coding layer, a fifth position coding layer and a sixth position coding layer aiming at each relative center point, and determining a fourth relative position code, a fifth relative position code and a sixth relative position code of the reference center point and the relative center point; determining a key vector and a value vector of the relative central point according to the product of the characteristic information of the relative central point and a key matrix and a value matrix in a second self-attention module respectively; determining a query vector of the reference center point according to the product of the feature information of the reference center point and a query matrix in a second self-attention module; and determining a second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point and the query vector of the reference center point.
Further, in some embodiments, for each relative center point, the sum of the fourth position code of the reference center point and the relative center point, and the query vector of the reference center point, is taken as the correction query vector of the reference center point; the sum of the fifth position code of the reference center point and the relative center point, and the key vector of the relative center point, is taken as the correction key vector of the relative center point; the sum of the sixth position code of the reference center point and the relative center point, and the value vector of the relative center point, is taken as the correction value vector of the relative center point; the product of the correction query vector of the reference center point and the correction key vector of the relative center point, together with the dimension of the first feature code of the reference center point, is input into a second normalization layer to obtain the weight of the relative center point; and the correction value vectors of the relative center points are weighted and summed according to the weights of the relative center points to obtain the second feature code of the reference center point.
For example, the second conversion model is a Transformer model, which may be referred to as the Global Transformer since it is used to determine the relevance between targets. The second conversion model may include the encoder and decoder portions of a Transformer. Further, the fourth, fifth and sixth position-coding layers are a fourth, fifth and sixth feedforward network, respectively; the difference between the coordinates of the reference center point and the coordinates of the relative center point is input into the fourth, fifth and sixth feedforward networks respectively to determine the fourth, fifth and sixth relative position codes of the reference center point and the relative center point. The second normalization layer is, for example, a softmax layer.
For example, M reference center points are obtained by screening the key points, and the Global Transformer aims to learn the relevance among the M different targets. Specifically, the M reference center points (e.g., the feature set $\{\hat{f}_i\}_{i=1}^{M}$ of the M reference center points) are input into the Global Transformer module to model the relevance between different targets:

$$h_i = \sum_{j=1}^{M} \mathrm{softmax}\left(\frac{\left(\hat{f}_i W_Q^G + \mathrm{PE}_Q(x_i, x_j)\right)\left(\hat{f}_j W_K^G + \mathrm{PE}_K(x_i, x_j)\right)^\top}{\sqrt{C}}\right)\left(\hat{f}_j W_V^G + \mathrm{PE}_V(x_i, x_j)\right) \quad (3)$$

In formula (3), $h_i$ is the output of the second self-attention module of the encoder of the Global Transformer; the encoder may also include an FFN after the second self-attention module. If no FFN follows the second self-attention module, $h_i$ may represent the second feature code; otherwise, the output of $h_i$ after the FFN represents the second feature code. $W_Q^G$ is the query matrix in the second self-attention module, $W_K^G$ is the key matrix in the second self-attention module, $W_V^G$ is the value matrix in the second self-attention module, and C is the dimension of the feature information of the key points. $\mathrm{PE}_Q(\cdot)$, $\mathrm{PE}_K(\cdot)$ and $\mathrm{PE}_V(\cdot)$ respectively denote the functions corresponding to the fourth, fifth and sixth position-coding layers, and $\mathrm{PE}(x_i, x_j)$ can refer to formula (2).
The multi-head attention mechanism can likewise be applied to the encoder of the Global Transformer and is not described here again. The output $h_i$ is the high-level feature of the i-th reference center point and contains both the relevance between points inside a target and the relevance between different targets.
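Because the global stage has the same attention structure, the `RelPosSelfAttention` sketch above can also illustrate it: each reference center point simply attends to all other reference center points. The fragment below assumes `ref_xyz` of shape (M, 3) and `ref_codes` of shape (M, 256) obtained from the preceding steps, with C = 256 again an assumption.

```python
# Global modeling as "kNN with K = M - 1": every other center is a neighbor.
import torch

M = ref_xyz.shape[0]
others = torch.arange(M).expand(M, M)                     # row i holds 0..M-1
others = others[~torch.eye(M, dtype=torch.bool)].view(M, M - 1)
global_attn = RelPosSelfAttention(c=256)                  # assumed C = 256
second_codes = global_attn(ref_xyz, ref_codes, others)    # (M, 256)
```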
In step S110, the position and category of each target in the point cloud data are predicted according to the second feature codes of each reference center point.
In some embodiments, the second feature code of each reference center point is input into a decoder in the second conversion model to obtain the feature vector of each reference center point; and the feature vector of each reference center point is input into a target detection network to obtain the positions and categories of the targets in the point cloud data.
The target detection network is, for example, an FFN. The target detection network determines the position and category of each target in the point cloud data from feature vectors that contain both the relevance between points inside a target and the relevance between different targets.
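A sketch of such a detection head; the 7-parameter box layout (3 center coordinates, 3 sizes, 1 yaw angle) and the hidden width are assumptions for illustration, since the disclosure only states that the target detection network is, for example, an FFN.

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    """Illustrative FFN head: per reference center point, box parameters
    and class scores."""
    def __init__(self, c: int, num_classes: int):
        super().__init__()
        self.box = nn.Sequential(nn.Linear(c, c), nn.ReLU(),
                                 nn.Linear(c, 7))            # center(3)+size(3)+yaw(1)
        self.cls = nn.Sequential(nn.Linear(c, c), nn.ReLU(),
                                 nn.Linear(c, num_classes))  # class logits

    def forward(self, feats):                # feats: (M, C) decoder outputs
        return self.box(feats), self.cls(feats)   # (M, 7), (M, num_classes)
```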
In the above embodiment, the key points of the point cloud data and the feature information of each key point are extracted. For each key point, the first feature code of the key point is determined according to the relevance between the key point and the other key points within its preset range; the first feature code reflects the relevance between points within a local area inside a target. The key points are then divided into target-center points and non-target-center points, and the target-center points are taken as reference center points. For each reference center point, the second feature code of the reference center point is determined according to the relevance between the reference center point and the other reference center points, so that the second feature code adds the relevance between targets on top of the relevance between points within the local area inside a target. Finally, the position and category of each target in the point cloud data are predicted from the second feature codes of the reference center points. According to the scheme of this embodiment, the relevance between all pairs of points in the point cloud is no longer modeled; instead, the relevance between points is divided into intra-target relevance and inter-target relevance, so that the local and the global dependencies in the point cloud are captured simultaneously. This adapts to the three-dimensional nature and irregularity of point cloud data, improves the accuracy of target detection in point cloud data, and can also improve detection efficiency and save computation cost.
In order to further improve the accuracy of target detection in point cloud data, the above scheme is further improved. The inventors further exploit the three-dimensional nature of point cloud data and introduce the geometric structure relations between points into the encoding process, so that the features of the point cloud data are learned more accurately and the accuracy of target detection is improved. Specific embodiments are described below.
With respect to step S104, in some embodiments, for each keypoint, according to the feature information of the keypoint, the feature information of other keypoints within the preset range of the keypoint, the relative position relationship between the keypoint and other keypoints within the preset range of the keypoint, and the relative geometric structure relationship between the keypoint and other keypoints within the preset range of the keypoint, a first feature code of the keypoint is determined based on a self-attention mechanism.
Further, in some embodiments, for each keypoint, inputting feature information, position information, and geometry information of the keypoint, feature information, position information, and geometry information of other keypoints within the preset range of the keypoint, into a first self-attention module of an encoder in the first conversion model; in a first self-attention module, taking other key points in a preset range of the key points as relative points, respectively inputting the position information of the key points and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key points and the relative points; inputting the geometric structure information of the key point and the geometric structure information of the relative point into a geometric structure coding layer, and determining the relative geometric structure weight of the key point and the relative point; determining a key vector and a value vector of the relative point according to the product of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively; determining a query vector of the key point according to the product of the feature information of the key point and a query matrix in a first self-attention module; and determining a first feature code of the key point according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
Further, in some embodiments, for each relative point, the sum of the first location code of the keypoint and the relative point and the query vector of the keypoint is taken as the revised query vector of the keypoint; taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point; the sum of the key point and the third position code of the relative point and the value vector of the relative point is used as the correction value vector of the relative point; inputting the product of the correction query vector of the key point and the correction key vector of the relative point, the relative geometric structure weight of the key point and the relative point and the dimension of the feature information of the key point into a first normalization layer to obtain the weight of the relative point; and carrying out weighted summation on the correction value vector of each relative point according to the weight of each relative point to obtain the first feature code of the key point.
Further, in some embodiments, the product of the modified query vector of the keypoint and the modified key vector of the opposite point is divided by the square root of the dimension of the feature information of the keypoint, and then superimposed with the relative geometric weight of the keypoint and the opposite point, and the obtained result is input to the first normalization layer to obtain the weight of the opposite point.
For example, the geometry information includes: at least one of a normal vector of the local plane and a curvature radius of the local plane. In some embodiments, the relative geometric structure weight of the key point and the relative point is determined according to at least one of a distance between the key point and the relative point, a dot product of a normal vector of a local plane where the key point is located and a normal vector of a local plane where the relative point is located, a difference between a curvature radius of the local plane where the key point is located and a curvature radius of the local plane where the relative point is located, and an included angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located.
Further, in some embodiments, the relative geometric structure weight of the key point and the relative point decreases as the distance between the key point and the relative point increases; increases as the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located increases; increases as the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located increases; and increases as the result of passing the included angle between the two normal vectors through the feature propagation layer increases.
For example, the first conversion model is a Transformer model, referred to as the Local Transformer. The first, second and third position coding layers may be a first feedforward network (FFN), a second feedforward network and a third feedforward network, respectively, and the first normalization layer may be a softmax layer.
The relative geometric structure relationship between the key point and the other key points within the preset range of the key point can be added into the attention mechanism; specifically, formula (1) can be improved as follows:

$$F_i = \sum_{j} \operatorname{softmax}_j\!\left(\frac{(q_i + \delta^{q}_{i,j})\,(k_j + \delta^{k}_{i,j})^{\top}}{\sqrt{d}} + G_{i,j}\right)\left(v_j + \delta^{v}_{i,j}\right) \tag{4}$$

where $q_i$ is the query vector of key point $i$; $k_j$ and $v_j$ are the key vector and value vector of relative point $j$; $\delta^{q}_{i,j}$, $\delta^{k}_{i,j}$ and $\delta^{v}_{i,j}$ are the first, second and third relative position codes; $d$ is the dimension of the feature information; and $G_{i,j}$, the relative geometric structure weight of key points $i$ and $j$, can be determined by using the following formula:

$$G_{i,j} = \exp\!\left(-\beta_1\,\lVert p_i - p_j \rVert^2 + \beta_2\,(n_i \cdot n_j) + \beta_3\,\lvert c_i - c_j \rvert + \mathrm{FFN}(\theta_{i,j})\right) \tag{5}$$

In formula (5), $p_i$ and $p_j$ are the coordinates of key points $i$ and $j$; $n_i$ and $n_j$ respectively represent the normal vectors of the local planes where key points $i$ and $j$ lie; $c_i$ and $c_j$ respectively represent the curvature radii of those local planes; $\theta_{i,j}$ represents the included angle between the normal vector of the local plane where key point $i$ is located and the normal vector of the local plane where key point $j$ is located; $\beta_1$, $\beta_2$ and $\beta_3$ are parameters of the geometry encoding layer; and FFN is a feedforward neural network or feature propagation layer. $G_{i,j}$ is a Gaussian function model: it calculates the strength of the correlation between two points through geometric parameters such as the local plane normal vectors, the local curvature radii and the normal vector included angle, and the stronger the correlation between the two points, the larger the corresponding Gaussian weight $G_{i,j}$.
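Formula (5) can be sketched in code as follows; here np.tanh stands in for the learned FFN/feature propagation layer, and the β values are illustrative assumptions rather than values from the disclosure.

```python
import numpy as np

def relative_geometry_weight(p_i, p_j, n_i, n_j, c_i, c_j,
                             beta1=0.5, beta2=1.0, beta3=1.0, ffn=np.tanh):
    """Gaussian relative geometric structure weight G_ij of formula (5).

    p_i, p_j : (3,) coordinates of the two key points
    n_i, n_j : (3,) unit normal vectors of their local planes
    c_i, c_j : curvature radii of their local planes
    """
    dist2 = float(np.sum((p_i - p_j) ** 2))             # weight falls as distance grows
    ndot = float(n_i @ n_j)                             # weight rises with normal alignment
    cdiff = abs(c_i - c_j)                              # weight rises with curvature gap
    theta = float(np.arccos(np.clip(ndot, -1.0, 1.0)))  # included angle of the normals
    return float(np.exp(-beta1 * dist2 + beta2 * ndot + beta3 * cdiff + ffn(theta)))
```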
In some embodiments, for a key point, the N points nearest to the key point are found in its neighborhood, and a plane is then fitted using the least square method so that the sum of the squared distances from the N points to the plane is minimum; this plane is the local plane of the key point.
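The least-squares plane has a standard closed-form solution through the neighborhood covariance; the sketch below uses that PCA formulation. The curvature value it returns is one common surface-variation proxy, not necessarily the curvature radius computation used in the disclosure.

```python
import numpy as np

def fit_local_plane(neighborhood):
    """Fit the local plane of a key point from its (N, 3) nearest neighbors.

    The least-squares plane passes through the centroid, and its normal is
    the eigenvector of the covariance matrix with the smallest eigenvalue
    (the direction of least variance). The surface variation
    lam0 / (lam0 + lam1 + lam2) is a common curvature proxy.
    """
    centered = neighborhood - neighborhood.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(centered.T @ centered)  # ascending eigenvalues
    normal = eigvecs[:, 0]                                    # least-variance direction
    curvature = eigvals[0] / eigvals.sum()
    return normal, curvature
```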
In this method, an original relative geometric structure weight is added to express the geometric structure relationship between points, and object geometric features such as the local plane normal vector, the local curvature radius and the normal vector included angle are integrated into the self-attention mechanism, so that an efficient feature extraction model and target detection model dedicated to processing point cloud data are designed.
Some application examples of the present disclosure are described below in conjunction with fig. 2.
As shown in fig. 2, point cloud data is input into a Point Cloud Backbone network to obtain the feature information (Point Feature) of the key points, and the feature information of the key points is then input into the Local-Global Transformer model. In the Local-Global Transformer model, the feature information, position information and geometric structure information of each key point, and of the key points in the local area of that key point, are input into the Local Transformer module to obtain the first feature codes. Reference center points are selected from the key points through a classification network (comprising, for example, a Sampling/Pooling module); the first feature codes and position information of the reference center points are input into the Global Transformer module to obtain the second feature codes, and the second feature codes are input into an FFN to obtain the bounding box and category of each target.
The scheme of the above embodiments provides an end-to-end 3D point cloud target detection network based on the Transformer model, which may be referred to as 3DTrans; it takes a 3D point cloud as input and outputs a set of labeled 3D bounding boxes representing the positions of targets (objects). The overall structure of the 3DTrans detection network is shown in fig. 2 and comprises two main components: a feature extraction network and a Local-Global Transformer. Given a point cloud of N points with XYZ coordinates as input, the feature extraction network downsamples the point cloud and learns a deep feature for each point, outputting a subset of points, each represented by a C-dimensional feature; these points are regarded as the key points. The Local-Global Transformer takes the features of the key points as input and outputs the final target detection result. The traditional Transformer model is improved in two respects to make it better suited to processing 3D point cloud data. On the one hand, the associations among all key points are not modeled directly; instead, the associations between points are divided into associations within objects and associations between objects. Specifically, the Local Transformer module learns the relevance between points within a local area of the same object, and the Global Transformer module learns the relevance between different objects. By connecting the two modules in series, the Local-Global Transformer can capture local and global dependency relationships in the point cloud simultaneously while reducing computation cost, improving the representational capability of the model. On the other hand, original object geometric structure information is added on the basis of the traditional Transformer model, and object geometric features such as local plane normal vectors, local curvature radii and normal vector included angles are integrated into the self-attention mechanism, thus designing an efficient Transformer model dedicated to processing point cloud data.
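To make the data flow of fig. 2 concrete, the following skeleton mirrors the serial wiring described above. Every component is a stub (random projections stand in for the learned networks), so only the tensor shapes and the Local Transformer → sampling → Global Transformer → FFN ordering reflect the design; none of the function names come from the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(points, m=256, c=64):
    """Stub: downsample the cloud to m key points with C-dim features."""
    idx = rng.choice(len(points), size=m, replace=False)
    return points[idx], rng.standard_normal((m, c))

def local_transformer(xyz, feats):
    return feats + 0.01 * rng.standard_normal(feats.shape)   # stub refinement

def pick_reference_centers(codes, k=32):
    scores = codes @ rng.standard_normal(codes.shape[1])     # stub "centerness"
    return np.argsort(scores)[-k:]

def global_transformer(xyz, codes):
    return codes + 0.01 * rng.standard_normal(codes.shape)   # stub refinement

points = rng.standard_normal((20000, 3))                     # raw point cloud
kp_xyz, kp_feat = backbone(points)
first_codes = local_transformer(kp_xyz, kp_feat)             # intra-object relations
ref = pick_reference_centers(first_codes)                    # points classified as centers
second_codes = global_transformer(kp_xyz[ref], first_codes[ref])  # inter-object relations
boxes = second_codes @ rng.standard_normal((second_codes.shape[1], 7))  # FFN box head
print(boxes.shape)  # (32, 7): e.g. center, size and heading for each target
```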
The method of the present disclosure requires no large number of manually designed components, no extensive prior knowledge, and no heavy post-processing to screen out redundant candidate boxes; the model is simple and can be trained end to end, with low computation cost, high processing efficiency and high accuracy.
The model of the present disclosure can be trained end to end. Point cloud data are annotated with the bounding boxes and categories of the targets to serve as training samples. A training sample is input into the point cloud feature extraction network to obtain a plurality of key points in the output point cloud data and the feature information of each key point. The feature information and position information of each key point are input into the first conversion model, and for each key point, the feature information of the key point is encoded according to the relevance between the key point and the other key points within the preset range of the key point to obtain a first feature code of the key point. The first feature codes of the key points are input into the classification network, the key points are classified, and the points classified as target centers are determined as reference center points. The first feature codes and position information of the reference center points are input into the second conversion model, and for each reference center point, the first feature code of the reference center point is encoded according to the relevance between the reference center point and the other reference center points to obtain a second feature code of the reference center point. The second feature codes of the reference center points are input into a decoder in the second conversion model to obtain feature vectors of the reference center points, and the feature vectors are input into the target detection network to obtain the positions and categories of the targets in the point cloud data. The point cloud feature extraction network, the first conversion model, the classification network, the second conversion model and the target detection network are then trained according to the difference between the predicted positions and categories of the targets and the annotated bounding boxes and categories. The point cloud feature extraction network and the classification network may be pre-trained.
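The disclosure specifies only that training minimizes the difference between the predicted and annotated boxes and categories; as one hedged illustration, the sketch below combines a smooth-L1 box loss with a cross-entropy category loss, assuming predictions have already been matched one-to-one to annotated targets.

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)

def detection_loss(pred_boxes, pred_logits, gt_boxes, gt_labels):
    """Toy end-to-end loss over k matched (prediction, annotation) pairs.

    pred_boxes  : (k, 7) predicted boxes; gt_boxes: (k, 7) annotated boxes
    pred_logits : (k, n_classes) raw scores; gt_labels: (k,) integer categories
    """
    box_loss = smooth_l1(pred_boxes - gt_boxes).sum(axis=1).mean()
    z = pred_logits - pred_logits.max(axis=1, keepdims=True)   # stable log-softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    cls_loss = -log_probs[np.arange(len(gt_labels)), gt_labels].mean()
    # The gradient of this sum would flow back through all five networks jointly.
    return box_loss + cls_loss
```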
The present disclosure further provides an apparatus for detecting a target in point cloud data, which is described below with reference to fig. 3.
FIG. 3 is a block diagram of some embodiments of an apparatus for detecting objects in point cloud data according to the present disclosure. As shown in fig. 3, the apparatus 30 of this embodiment includes: a feature extraction module 310, a first encoding module 320, a classification module 330, a second encoding module 340, and an object detection module 350.
The feature extraction module 310 is configured to input the point cloud data into a point cloud feature extraction network, and obtain a plurality of key points in the output point cloud data and feature information of each key point.
The first encoding module 320 is configured to, for each key point, encode the feature information of the key point according to the relevance between the key point and another key point within the preset range of the key point, so as to obtain a first feature code of the key point.
The classification module 330 is configured to classify each of the key points, and determine a point classified as a target center as a reference center point.
The second encoding module 340 is configured to encode, for each reference center point, the first feature code of the reference center point according to the relevance between the reference center point and other reference center points, so as to obtain a second feature code of the reference center point.
And the target detection module 350 is configured to predict the position and the category of each target in the point cloud data according to the second feature code of each reference center point.
In some embodiments, the first encoding module 320 is configured to, for each keypoint, determine a first feature encoding of the keypoint based on a self-attention mechanism according to the feature information of the keypoint, the feature information of other keypoints within the preset range of the keypoint, and the relative position relationship between the keypoint and other keypoints within the preset range of the keypoint.
In some embodiments, the first encoding module 320 is configured to, for each keypoint, input feature information and location information of the keypoint, feature information and location information of other keypoints within the preset range of the keypoint, to the first self-attention module of the encoder in the first conversion model; in a first self-attention module, taking other key points in a preset range of the key points as relative points, respectively inputting the position information of the key points and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key points and the relative points; determining a key vector and a value vector of the relative point according to the product of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively; determining a query vector of the key point according to the product of the feature information of the key point and a query matrix in the first self-attention module; and determining a first feature code of the key point according to the first relative position code, the second relative position code and the third relative position code of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
In some embodiments, the first encoding module 320 is configured to encode, for each relative point, a sum of the first location code of the keypoint and the relative point and the query vector of the keypoint as a revised query vector of the keypoint; taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point; the sum of the key point and the third position code of the relative point and the value vector of the relative point is used as the correction value vector of the relative point; inputting the product of the correction query vector of the key point and the correction key vector of the relative point and the dimension of the feature information of the key point into a first normalization layer to obtain the weight of the relative point; and carrying out weighted summation on the correction value vector of each relative point according to the weight of each relative point to obtain the first feature code of the key point.
In some embodiments, the first position encoding layer, the second position encoding layer and the third position encoding layer are a first feedforward network, a second feedforward network and a third feedforward network, respectively, and the first encoding module 320 is configured to input a difference between the coordinate of the keypoint and the coordinate of the relative point into the first feedforward network, the second feedforward network and the third feedforward network, respectively.
In some embodiments, the first encoding module 320 is configured to determine, for each keypoint, a first feature encoding of the keypoint based on a self-attention mechanism according to feature information of the keypoint, feature information of other keypoints within a preset range of the keypoint, a relative position relationship between the keypoint and other keypoints within the preset range of the keypoint, and a relative geometry relationship between the keypoint and other keypoints within the preset range of the keypoint.
In some embodiments, the first encoding module 320 is configured to input, for each keypoint, feature information, position information, and geometry information of the keypoint, feature information, position information, and geometry information of other keypoints within the preset range of the keypoint, into the first self-attention module of the encoder in the first conversion model; in a first self-attention module, taking other key points in a preset range of the key points as relative points, respectively inputting the position information of the key points and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key points and the relative points; inputting the geometric structure information of the key point and the geometric structure information of the relative point into a geometric structure coding layer, and determining the relative geometric structure weight of the key point and the relative point; determining a key vector and a value vector of the relative point according to the product of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively; determining a query vector of the key point according to the product of the feature information of the key point and a query matrix in the first self-attention module; and determining a first feature code of the key point according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
In some embodiments, the first encoding module 320 is configured to encode, for each relative point, a sum of the key point and the first location code of the relative point and the query vector of the key point as a revised query vector of the key point; taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point; encoding the key point and the third position of the relative point and summing the value vector of the relative point to obtain a corrected value vector of the relative point; inputting the product of the correction query vector of the key point and the correction key vector of the relative point, the relative geometric structure weight of the key point and the relative point and the dimension of the feature information of the key point into a first normalization layer to obtain the weight of the relative point; and carrying out weighted summation on the correction value vector of each relative point according to the weight of each relative point to obtain the first feature code of the key point.
In some embodiments, the first encoding module 320 is configured to divide a product of the modified query vector of the key point and the modified key vector of the relative point by a square root of a dimension of the feature information of the key point, and superimpose the product with the relative geometric weight of the key point and the relative point, and input the resultant into the first normalization layer to obtain the weight of the relative point.
In some embodiments, the geometry information comprises at least one of a normal vector of the local plane and a curvature radius of the local plane, and the first encoding module 320 is configured to determine the relative geometric structure weight of the key point and the relative point according to at least one of a distance between the key point and the relative point, a dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located, a difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located, and an included angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located.
In some embodiments, the relative geometric weight of the keypoint and the opposing point decreases as the distance of the keypoint from the opposing point increases; the relative geometric structure weight of the key point and the relative point is increased along with the increase of the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located, and the relative geometric structure weight of the key point and the relative point is increased along with the increase of the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located; the relative geometric structure weight of the key point and the relative point is increased along with the increase of the included angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located.
In some embodiments, the second encoding module 340 is configured to determine, for each reference center point, a second feature encoding of the reference center point based on the self-attention mechanism according to the first feature encoding of the reference center point, the first feature encoding of other reference center points, and the relative position relationship between the reference center point and other reference center points.
In some embodiments, the second encoding module 340 is configured to, for each reference center point, input the first feature codes and the position information of the reference center point, and the first feature codes and the position information of other reference center points into the second self-attention module of the encoder in the second transformation model; in the second self-attention module, taking other reference center points as relative center points, respectively inputting the position information of the reference center point and the position information of the relative center point into a fourth position coding layer, a fifth position coding layer and a sixth position coding layer aiming at each relative center point, and determining a fourth relative position code, a fifth relative position code and a sixth relative position code of the reference center point and the relative center point; determining a key vector and a value vector of the relative central point according to the product of the characteristic information of the relative central point and a key matrix and a value matrix in a second self-attention module respectively; determining a query vector of the reference center point according to the product of the feature information of the reference center point and a query matrix in a second self-attention module; and determining a second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point and the query vector of the reference center point.
In some embodiments, the second encoding module 340 is configured to, for each relative center point, take the sum of the fourth position code of the reference center point and the relative center point and the query vector of the reference center point as a corrected query vector of the reference center point; take the sum of the fifth position code of the reference center point and the relative center point and the key vector of the relative center point as a corrected key vector of the relative center point; take the sum of the sixth position code of the reference center point and the relative center point and the value vector of the relative center point as a corrected value vector of the relative center point; input the product of the corrected query vector of the reference center point and the corrected key vector of the relative center point, together with the dimension of the first feature code of the reference center point, into a second normalization layer to obtain the weight of the relative center point; and perform weighted summation on the corrected value vectors of the relative center points according to the weights of the relative center points to obtain a second feature code of the reference center point.
In some embodiments, the fourth, fifth and sixth position coding layers are a fourth, a fifth and a sixth feedforward network, respectively, and the second encoding module 340 is configured to input the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth, fifth and sixth feedforward networks, respectively.
In some embodiments, the classification module 330 is configured to, for each keypoint, input the first feature code of the keypoint into a classification network to obtain a classification result of the keypoint; and determining whether the key point is a point of the target center or not according to the classification result.
In some embodiments, the classification network is obtained by training using the position information of each key point with label information as training data, where, for each key point, in the case that the key point is located within a bounding box of a target and is the point closest to the center of that target, the label information of the key point indicates a target-center point.
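This labeling rule can be sketched as follows; the axis-aligned box test is a simplifying assumption made here for brevity, since the disclosure does not restrict box orientation.

```python
import numpy as np

def center_point_labels(keypoints, box_min, box_max):
    """Mark the key point nearest to one target's center as a center point.

    keypoints        : (m, 3) key point coordinates
    box_min, box_max : (3,) opposite corners of the target's bounding box
    Returns a boolean (m,) mask with at most one True: the key point that
    is inside the box and closest to the box center.
    """
    center = (box_min + box_max) / 2.0
    inside = np.all((keypoints >= box_min) & (keypoints <= box_max), axis=1)
    labels = np.zeros(len(keypoints), dtype=bool)
    if inside.any():
        d = np.linalg.norm(keypoints - center, axis=1)
        d[~inside] = np.inf               # only points inside the box compete
        labels[np.argmin(d)] = True       # the nearest to the center is labeled
    return labels
```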
In some embodiments, the object detection module 350 is configured to predict the location and the category of each object in the point cloud data according to the second feature code of each reference center point, including: inputting the second feature codes of the reference center points into a decoder in a second conversion model to obtain feature vectors of the reference center points; and inputting the characteristic vectors of the reference central points into a target detection network to obtain the positions and the types of the targets in the point cloud data.
In some embodiments, for each key point, the other key points within the preset range of the key point are determined as follows: for each key point, the other key points are sorted by their distance to the key point in ascending order, and a preset number of them are selected, from nearest onward, as the other key points within the preset range of the key point.
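This nearest-neighbor selection can be sketched as follows; the preset number k and the brute-force distance matrix are illustrative choices (a k-d tree would serve for large clouds).

```python
import numpy as np

def neighbors_within_preset_range(xyz, k=16):
    """For each key point, select its k nearest other key points.

    xyz : (m, 3) key point coordinates
    Returns an (m, k) index array; row i lists the key points treated as
    the relative points of key point i.
    """
    d = np.linalg.norm(xyz[:, None, :] - xyz[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)           # a key point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]   # indices of the k smallest distances
```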
The apparatus for detecting a target in point cloud data in the embodiments of the present disclosure may be implemented by various computing devices or computer systems, which are described below in conjunction with fig. 4 and fig. 5.
FIG. 4 is a block diagram of some embodiments of an apparatus for detecting objects in point cloud data according to the present disclosure. As shown in fig. 4, the apparatus 40 of this embodiment includes: a memory 410 and a processor 420 coupled to the memory 410, the processor 420 configured to perform a method of detecting an object in point cloud data in any of the embodiments of the present disclosure based on instructions stored in the memory 410.
Memory 410 may include, for example, system memory, fixed non-volatile storage media, and the like. The system memory stores, for example, an operating system, an application program, a Boot Loader (Boot Loader), a database, and other programs.
FIG. 5 is a block diagram of an alternate embodiment of an apparatus for detecting objects in point cloud data according to the present disclosure. As shown in fig. 5, the apparatus 50 of this embodiment includes a memory 510 and a processor 520, which are similar to the memory 410 and the processor 420, respectively. An input/output interface 530, a network interface 540, a storage interface 550, and the like may also be included. These interfaces 530, 540 and 550, the memory 510 and the processor 520 may be connected, for example, through a bus 560. The input/output interface 530 provides a connection interface for input/output devices such as a display, a mouse, a keyboard and a touch screen. The network interface 540 provides a connection interface for various networking devices, such as a database server or a cloud storage server. The storage interface 550 provides a connection interface for external storage devices such as an SD card or a USB flash drive.
The present disclosure also provides an article sorting apparatus, described below in conjunction with fig. 6.
As shown in fig. 6, the article sorting device 6 includes: the apparatus 30/40/50 for detecting a target in point cloud data of any of the foregoing embodiments, and a sorting component 62 configured to sort the articles corresponding to the targets according to the positions and categories of the targets in the point cloud data output by the detection apparatus 30/40/50.
In some embodiments, the apparatus 6 further comprises: a point cloud collecting component 64 configured to collect point cloud data of a preset area and send the point cloud data to the apparatus 30/40/50 for detecting a target in point cloud data.
The sorting component is, for example, a robotic arm, and the point cloud collecting component is, for example, a three-dimensional camera. The three-dimensional point cloud target detection technology of the present disclosure can be applied to products such as vision-based sorting robotic arms in logistics scenarios: each article can be accurately positioned and identified using the point cloud data collected by a three-dimensional camera mounted on the sorting arm, thereby helping the arm sort the articles one by one.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only exemplary of the present disclosure and is not intended to limit the present disclosure, so that any modification, equivalent replacement, or improvement made within the spirit and principle of the present disclosure should be included in the scope of the present disclosure.

Claims (24)

1. A method for detecting a target in point cloud data comprises the following steps:
inputting point cloud data into a point cloud feature extraction network to obtain a plurality of key points in the output point cloud data and feature information of each key point;
for each key point, coding the feature information of the key point according to the relevance between the key point and other key points in the preset range of the key point to obtain a first feature code of the key point;
classifying each key point, and determining a point classified as a target center as a reference center point;
aiming at each reference central point, coding the first feature code of the reference central point according to the relevance between the reference central point and other reference central points to obtain a second feature code of the reference central point;
and predicting the position and the category of each target in the point cloud data according to the second feature codes of each reference central point.
2. The detection method according to claim 1, wherein the encoding, for each keypoint, the feature information of the keypoint according to the relevance between the keypoint and other keypoints within the preset range of the keypoint, to obtain the first feature code of the keypoint, includes:
and for each key point, determining a first feature code of the key point based on a self-attention mechanism according to the feature information of the key point, the feature information of other key points in the preset range of the key point and the relative position relationship between the key point and other key points in the preset range of the key point.
3. The detection method according to claim 2, wherein for each key point, determining the first feature code of the key point based on a self-attention mechanism according to the feature information of the key point, the feature information of other key points within the preset range of the key point, and the relative position relationship between the key point and other key points within the preset range of the key point comprises:
for each key point, inputting the feature information and the position information of the key point, and the feature information and the position information of other key points in a preset range of the key point into a first self-attention module of an encoder in a first conversion model;
in the first self-attention module, taking other key points in a preset range of the key point as relative points, respectively inputting the position information of the key point and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key point and the relative points;
determining a key vector and a value vector of the relative point according to the multiplication of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively;
determining a query vector of the key point according to the product of the feature information of the key point and the query matrix in the first self-attention module;
and determining a first feature code of the key point according to the first relative position code, the second relative position code and the third relative position code of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
4. The detection method according to claim 3, wherein the determining the first feature code of the key point according to the first relative position code, the second relative position code and the third relative position code of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point comprises:
for each relative point, taking the sum of the first position codes of the key point and the relative point and the query vector of the key point as a modified query vector of the key point;
taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point;
the sum of the key point and the third position code of the relative point and the value vector of the relative point is used as the correction value vector of the relative point;
inputting the product of the correction query vector of the key point and the correction key vector of the relative point and the dimensionality of the feature information of the key point into a first normalization layer to obtain the weight of the relative point;
and carrying out weighted summation on the correction value vector of each relative point according to the weight of each relative point to obtain the first feature code of the key point.
5. The detection method according to claim 3, wherein the first, second and third position-coding layers are first, second and third feedforward networks, respectively, and the inputting the position information of the key point and the position information of the opposite point into the first, second and third position-coding layers, respectively, comprises:
and inputting the difference between the coordinates of the key point and the coordinates of the relative point into a first feedforward network, a second feedforward network and a third feedforward network respectively.
6. The detection method according to claim 1, wherein the encoding, for each keypoint, the feature information of the keypoint according to the relevance between the keypoint and other keypoints within the preset range of the keypoint, to obtain the first feature code of the keypoint, includes:
and for each key point, determining a first feature code of the key point based on a self-attention mechanism according to the feature information of the key point, the feature information of other key points in the preset range of the key point, the relative position relationship between the key point and other key points in the preset range of the key point and the relative geometric structure relationship between the key point and other key points in the preset range of the key point.
7. The detection method according to claim 6, wherein the determining, for each keypoint, according to the feature information of the keypoint, the feature information of other keypoints within the preset range of keypoints, the relative positional relationship between the keypoint and other keypoints within the preset range of keypoints, and the relative geometric relationship between the keypoint and other keypoints within the preset range of keypoints, the first feature code of the keypoint based on a self-attention mechanism comprises:
for each key point, inputting the feature information, the position information and the geometric structure information of the key point, and the feature information, the position information and the geometric structure information of other key points in a preset range of the key point into a first self-attention module of an encoder in a first conversion model;
in the first self-attention module, taking other key points in a preset range of the key point as relative points, respectively inputting the position information of the key point and the position information of the relative points into a first position coding layer, a second position coding layer and a third position coding layer aiming at each relative point, and determining a first relative position code, a second relative position code and a third relative position code of the key point and the relative points;
inputting the geometric structure information of the key point and the geometric structure information of the relative point into a geometric structure coding layer, and determining the relative geometric structure weight of the key point and the relative point;
determining a key vector and a value vector of the relative point according to the multiplication of the feature information of the relative point and the key matrix and the value matrix in the first self-attention module respectively;
determining a query vector of the key point according to the product of the feature information of the key point and the query matrix in the first self-attention module;
and determining a first feature code of the key point according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point and the query vector of the key point.
8. The detection method of claim 7, wherein determining the first feature code of the keypoint based on the first relative position code, the second relative position code, the third relative position code and the relative geometry weight of the keypoint and each of the relative points, the key vector of each of the relative points, the value vector of each of the relative points, and the query vector of the keypoint comprises:
for each relative point, taking the sum of the first position codes of the key point and the relative point and the query vector of the key point as a modified query vector of the key point;
taking the sum of the key point and the second position code of the relative point and the key vector of the relative point as a corrected key vector of the relative point;
the sum of the key point and the third position code of the relative point and the value vector of the relative point is used as the correction value vector of the relative point;
inputting the product of the correction query vector of the key point and the correction key vector of the relative point, the relative geometric structure weight of the key point and the relative point and the dimension of the feature information of the key point into a first normalization layer to obtain the weight of the relative point;
and carrying out weighted summation on the correction value vectors of the relative points according to the weights of the relative points to obtain the first feature code of the key point.
9. The detection method according to claim 7, wherein the inputting of the product of the correction query vector of the key point and the correction key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into a first normalization layer to obtain the weight of the relative point comprises:
dividing the product of the correction query vector of the key point and the correction key vector of the relative point by the square root of the dimension of the feature information of the key point, and then superposing the product with the relative geometric structure weight of the key point and the relative point, and inputting the obtained result into a first normalization layer to obtain the weight of the relative point.
10. The detection method of claim 7, wherein the geometry information comprises: at least one of a normal vector of the local plane and a curvature radius of the local plane, wherein the determining of the relative geometric structure weight of the key point and the relative point comprises:
and determining the relative geometric structure weight of the key point and the relative point according to at least one of the distance between the key point and the relative point, the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located, the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located, and the included angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located.
11. The detection method according to claim 10,
the relative geometric structure weight of the key point and the relative point decreases with the distance between the key point and the relative point;
the relative geometric structure weight of the key point and the relative point is increased along with the increase of the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located;
the relative geometric structure weight of the key point and the relative point is increased along with the increase of the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located;
the relative geometric structure weight of the key point and the relative point is increased along with the increase of the result after the included angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located passes through the feature propagation layer.
12. The detection method according to claim 1, wherein the encoding, for each reference central point, the first feature code of the reference central point according to the association between the reference central point and other reference central points to obtain the second feature code of the reference central point comprises:
and for each reference central point, determining a second feature code of the reference central point based on a self-attention mechanism according to the first feature code of the reference central point, the first feature codes of other reference central points and the relative position relationship between the reference central point and other reference central points.
13. The detection method according to claim 12, wherein for each reference center point, determining the second feature code of the reference center point based on the self-attention mechanism according to the first feature code of the reference center point, the first feature codes of the other reference center points, and the relative positional relationship between the reference center point and the other reference center points comprises:
for each reference center point, inputting the first feature codes and the position information of the reference center point and the first feature codes and the position information of other reference center points into a second self-attention module of an encoder in a second conversion model;
in the second self-attention module, regarding other reference center points as relative center points, respectively inputting the position information of the reference center point and the position information of the relative center point into a fourth position coding layer, a fifth position coding layer and a sixth position coding layer for each relative center point, and determining a fourth relative position code, a fifth relative position code and a sixth relative position code of the reference center point and the relative center point;
determining a key vector and a value vector of the relative central point according to the product of the characteristic information of the relative central point and the key matrix and the value matrix in the second self-attention module respectively;
determining a query vector of the reference center point according to the product of the feature information of the reference center point and the query matrix in the second self-attention module;
and determining a second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point and the query vector of the reference center point.
14. The detecting method according to claim 13, wherein the determining the second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point, the key vector of each relative center point, the value vector of each relative center point and the query vector of the reference center point comprises:
for each relative center point, the sum of the reference center point and the fourth position code of the relative center point and the query vector of the reference center point is used as the modified query vector of the reference center point;
taking the sum of the fifth position code of the reference central point and the relative central point and the key vector of the relative central point as a correction key vector of the relative central point;
the sum of the reference central point and the sixth position code of the relative central point and the value vector of the relative central point is used as the correction value vector of the relative central point;
inputting the product of the correction query vector of the reference center point and the correction key vector of the relative center point and the dimension of the first feature code of the reference center point into a second normalization layer to obtain the weight of the relative center point;
and carrying out weighted summation on the correction value vectors of the relative central points according to the weight of the relative central points to obtain a second feature code of the reference central point.
15. The detection method of claim 13, wherein the fourth, fifth and sixth position coding layers are a fourth, a fifth and a sixth feedforward network, respectively, and inputting the position information of the reference center point and the position information of the relative center point into the fourth, fifth and sixth position coding layers respectively comprises:
and respectively inputting the difference between the coordinate of the reference central point and the coordinate of the relative central point into a fourth feedforward network, a fifth feedforward network and a sixth feedforward network.
16. The detection method according to claim 1, wherein the classifying the respective key points and determining the point classified as the target center as the reference center point comprises:
inputting the first feature codes of the key points into a classification network aiming at each key point to obtain a classification result of the key points;
and determining whether the key point is the point of the target center according to the classification result.
17. The detection method according to claim 16, wherein the classification network is trained by using position information of each key point with label information as training data, wherein for each key point, the label information of the key point is a point of a target center if the key point is located within a bounding box of the target and belongs to a point nearest to the target center.
18. The detection method of claim 1, wherein the predicting the location and class of each object in the point cloud data according to the second feature codes of each reference center point comprises:
inputting the second feature codes of the reference center points into a decoder in a second conversion model to obtain feature vectors of the reference center points;
and inputting the characteristic vectors of the reference central points into a target detection network to obtain the positions and the categories of the targets in the point cloud data.
19. The detection method according to claim 1, wherein, for each keypoint, the other keypoints within the preset range of keypoints are determined by:
and aiming at each key point, sequencing other key points according to the distance from the key point to the key point from small to large, and selecting a preset number of other key points as other key points in the preset range of the key point according to the sequence from front to back.
20. An apparatus for detecting objects in point cloud data, comprising:
the system comprises a characteristic extraction module, a characteristic extraction module and a characteristic analysis module, wherein the characteristic extraction module is used for inputting point cloud data into a point cloud characteristic extraction network to obtain a plurality of key points in the output point cloud data and characteristic information of each key point;
the first coding module is used for coding the feature information of each key point according to the relevance between the key point and other key points in the preset range of the key point to obtain a first feature code of the key point;
the classification module is used for classifying all the key points, determining the points classified as the target center and taking the points as reference center points;
the second coding module is used for coding the first feature code of each reference central point according to the relevance between the reference central point and other reference central points to obtain a second feature code of the reference central point;
and the target detection module is used for predicting the position and the category of each target in the point cloud data according to the second feature codes of each reference central point.
21. An apparatus for detecting objects in point cloud data, comprising:
a processor; and
a memory coupled to the processor for storing instructions that, when executed by the processor, cause the processor to perform a method of detecting objects in point cloud data as recited in any one of claims 1-19.
22. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of the method of any one of claims 1-19.
23. An article sorting apparatus, comprising: the apparatus for detecting a target in point cloud data of claim 20 or 21, and a sorting component;
the sorting component is used for sorting the targets according to the positions and the categories of the targets in the point cloud data output by the apparatus for detecting a target in point cloud data.
24. The article sorting apparatus of claim 23, further comprising:
the point cloud acquisition component is used for acquiring point cloud data of a preset area and sending the point cloud data to the apparatus for detecting a target in point cloud data.
CN202210409033.6A 2022-04-19 2022-04-19 Method and device for detecting target in point cloud data and computer readable storage medium Pending CN115018910A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210409033.6A CN115018910A (en) 2022-04-19 2022-04-19 Method and device for detecting target in point cloud data and computer readable storage medium
PCT/CN2023/087273 WO2023202401A1 (en) 2022-04-19 2023-04-10 Method and apparatus for detecting target in point cloud data, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210409033.6A CN115018910A (en) 2022-04-19 2022-04-19 Method and device for detecting target in point cloud data and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN115018910A true CN115018910A (en) 2022-09-06

Family

ID=83067520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210409033.6A Pending CN115018910A (en) 2022-04-19 2022-04-19 Method and device for detecting target in point cloud data and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN115018910A (en)
WO (1) WO2023202401A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023202401A1 (en) * 2022-04-19 2023-10-26 京东科技信息技术有限公司 Method and apparatus for detecting target in point cloud data, and computer-readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112154454A (en) * 2019-09-10 2020-12-29 深圳市大疆创新科技有限公司 Target object detection method, system, device and storage medium
CN111340766B (en) * 2020-02-21 2024-06-11 北京市商汤科技开发有限公司 Target object detection method, device, equipment and storage medium
CN113988164B (en) * 2021-10-21 2023-08-08 电子科技大学 Lightweight point cloud target detection method for representative point self-attention mechanism
CN114120270A (en) * 2021-11-08 2022-03-01 同济大学 Point cloud target detection method based on attention and sampling learning
CN115018910A (en) * 2022-04-19 2022-09-06 京东科技信息技术有限公司 Method and device for detecting target in point cloud data and computer readable storage medium

Also Published As

Publication number Publication date
WO2023202401A1 (en) 2023-10-26

Similar Documents

Publication Publication Date Title
CN107885764B (en) Rapid Hash vehicle retrieval method based on multitask deep learning
EP3179407B1 (en) Recognition of a 3d modeled object from a 2d image
US8830229B2 (en) Recognition and pose determination of 3D objects in 3D scenes
Chen et al. Principal axes descriptor for automated construction-equipment classification from point clouds
CN113205466B (en) Incomplete point cloud completion method based on hidden space topological structure constraint
Zhang et al. Change detection between multimodal remote sensing data using Siamese CNN
US8994723B2 (en) Recognition and pose determination of 3D objects in multimodal scenes
CN112016638B (en) Method, device and equipment for identifying steel bar cluster and storage medium
Song et al. 6-DOF image localization from massive geo-tagged reference images
Li et al. Hierarchical semantic parsing for object pose estimation in densely cluttered scenes
CN116468392A (en) Method, device, equipment and storage medium for monitoring progress of power grid engineering project
JP2024513596A (en) Image processing method and apparatus and computer readable storage medium
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
WO2023202401A1 (en) Method and apparatus for detecting target in point cloud data, and computer-readable storage medium
Sriram et al. Analytical review and study on object detection techniques in the image
Wang et al. Real-time damaged building region detection based on improved YOLOv5s and embedded system from UAV images
CN116152334A (en) Image processing method and related equipment
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
Li et al. Bridging the robot perception gap with mid-level vision
Jafrasteh et al. Generative adversarial networks as a novel approach for tectonic fault and fracture extraction in high resolution satellite and airborne optical images
Ponz et al. Laser scanner and camera fusion for automatic obstacle classification in ADAS application
CN115049833A (en) Point cloud component segmentation method based on local feature enhancement and similarity measurement
Zrira et al. Evaluation of PCL's Descriptors for 3D Object Recognition in Cluttered Scene
Jiang et al. Efficient Match Pair Retrieval for Large-scale UAV Images via Graph Indexed Global Descriptor
CN108154107B (en) Method for determining scene category to which remote sensing image belongs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination