WO2023202401A1 - Method and apparatus for detecting target in point cloud data, and computer-readable storage medium - Google Patents


Info

Publication number
WO2023202401A1
Authority
WO
WIPO (PCT)
Prior art keywords
point
relative
key
key point
vector
Application number
PCT/CN2023/087273
Other languages
French (fr)
Chinese (zh)
Inventor
潘滢炜
李栋
邱钊凡
姚霆
梅涛
Original Assignee
京东科技信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 京东科技信息技术有限公司
Publication of WO2023202401A1

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 — Image analysis
    • G06T7/70 — Determining position or orientation of objects or cameras
    • G06T7/73 — Determining position or orientation of objects or cameras using feature-based methods
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 — Indexing scheme for image analysis or image enhancement
    • G06T2207/10 — Image acquisition modality
    • G06T2207/10028 — Range image; Depth image; 3D point clouds

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a method, device and computer-readable storage medium for detecting targets in point cloud data.
  • 3D (three-dimensional) target detection identifies and locates objects appearing in 3D point clouds, and has been widely used in fields such as autonomous driving and augmented reality.
  • 3D point clouds can provide the geometry of objects and capture the 3D structure of a scene.
  • a method for detecting targets in point cloud data, including: inputting point cloud data into a point cloud feature extraction network to obtain multiple key points in the point cloud data and the feature information of each key point; for each key point, encoding the feature information of the key point according to the correlation between the key point and other key points within the preset range of the key point, to obtain the first feature encoding of the key point; classifying each key point and determining the points classified as target centers as reference center points; for each reference center point, encoding the first feature encoding of the reference center point according to the correlation between the reference center point and other reference center points, to obtain the second feature encoding of the reference center point; and predicting the location and category of each target in the point cloud data according to the second feature encoding of each reference center point.
  • encoding the feature information of the key point according to the correlation between the key point and other key points within the preset range of the key point, to obtain the first feature encoding of the key point, includes: for each key point, determining the first feature encoding of the key point based on the self-attention mechanism, according to the feature information of the key point, the feature information of other key points within the preset range of the key point, and the relative positional relationship between the key point and those other key points.
  • determining the first feature encoding of the key point based on the self-attention mechanism includes: for each key point, inputting the feature information and position information of the key point, and the feature information and position information of other key points within the preset range of the key point, into the first self-attention module of the encoder in the first conversion model; in the first self-attention module, treating the other key points within the preset range of the key point as relative points, and for each relative point, inputting the position information of the key point and the position information of the relative point into the first position coding layer, the second position coding layer and the third position coding layer respectively, to determine the first, second and third relative position codes of the key point and the relative point; determining the key vector and value vector of the relative point according to the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module; determining the query vector of the key point according to the product of the feature information of the key point and the query matrix in the first self-attention module; and determining the first feature encoding of the key point according to the first, second and third relative position codes of the key point and each relative point, the key vector and value vector of each relative point, and the query vector of the key point.
  • determining the first feature encoding of the key point includes: for each relative point, taking the sum of the first relative position code of the key point and the relative point and the query vector of the key point as the modified query vector of the key point; taking the sum of the second relative position code of the key point and the relative point and the key vector of the relative point as the modified key vector of the relative point; taking the sum of the third relative position code of the key point and the relative point and the value vector of the relative point as the modified value vector of the relative point; inputting the product of the modified query vector of the key point and the modified key vector of the relative point, together with the dimension of the feature information of the key point, into the first normalization layer to obtain the weight of the relative point; and performing a weighted sum of the modified value vectors of the relative points according to their weights to obtain the first feature encoding of the key point.
  • the first position coding layer, the second position coding layer and the third position coding layer are respectively a first feedforward network, a second feedforward network and a third feedforward network; inputting the position information of the key point and the position information of the relative point into the first, second and third position coding layers respectively includes: inputting the difference between the coordinates of the key point and the coordinates of the relative point into the first feedforward network, the second feedforward network and the third feedforward network respectively.
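The patent gives no code for the position coding layers; the following is a minimal sketch, assuming each layer is a small two-layer feedforward network applied to the coordinate difference between a key point and a relative point. All names (`W1`, `b1`, `W2`, `b2`) are illustrative, not from the patent.

```python
import numpy as np

def relative_position_encoding(coord_q, coord_k, W1, b1, W2, b2):
    """Hypothetical sketch of one position coding layer: a two-layer
    feedforward network over the coordinate difference, producing a
    C-dimensional relative position code."""
    delta = coord_q - coord_k                   # (3,) coordinate difference
    hidden = np.maximum(0.0, W1 @ delta + b1)   # ReLU hidden layer
    return W2 @ hidden + b2                     # relative position code

# Example: map a 3D coordinate difference to a 2-dimensional code.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
code = relative_position_encoding(np.array([1.0, 2.0, 3.0]),
                                  np.array([0.0, 2.0, 3.0]),
                                  W1, b1, W2, b2)
```

The first, second and third position coding layers would each use their own parameters, matching the claim that the layers are distinct feedforward networks.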
  • encoding the feature information of the key point according to the correlation between the key point and other key points within the preset range of the key point, to obtain the first feature encoding of the key point, includes: for each key point, determining the first feature encoding of the key point based on the self-attention mechanism, according to the feature information of the key point, the feature information of other key points within the preset range of the key point, the relative positional relationship between the key point and those other key points, and the relative geometric structure relationship between the key point and those other key points.
  • determining the first feature encoding of the key point based on the self-attention mechanism, according to the feature information, the relative positional relationship, and the relative geometric structure relationship between the key point and other key points within the preset range of the key point, includes: for each key point, inputting the feature information, position information and geometric structure information of the key point, and the feature information, position information and geometric structure information of other key points within the preset range of the key point, into the first self-attention module of the encoder in the first conversion model; in the first self-attention module, treating the other key points within the preset range of the key point as relative points, and for each relative point, inputting the position information of the key point and the position information of the relative point into the first, second and third position coding layers respectively, to determine the first, second and third relative position codes of the key point and the relative point; inputting the geometric structure information of the key point and the geometric structure information of the relative point into the geometric structure encoding layer to determine the relative geometric structure weight of the key point and the relative point; and determining the key vector and value vector of the relative point according to the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module.
  • determining the first feature encoding of the key point includes: for each relative point, taking the sum of the first relative position code of the key point and the relative point and the query vector of the key point as the modified query vector of the key point; taking the sum of the second relative position code of the key point and the relative point and the key vector of the relative point as the modified key vector of the relative point; taking the sum of the third relative position code of the key point and the relative point and the value vector of the relative point as the modified value vector of the relative point; inputting the product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into the first normalization layer to obtain the weight of the relative point; and performing a weighted sum of the modified value vectors of the relative points according to their weights to obtain the first feature encoding of the key point.
  • inputting the product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into the first normalization layer to obtain the weight of the relative point includes: dividing the product of the modified query vector of the key point and the modified key vector of the relative point by the square root of the dimension of the feature information of the key point, adding the relative geometric structure weight of the key point and the relative point, and inputting the result into the first normalization layer to obtain the weight of the relative point.
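The weighting step just described can be sketched numerically, assuming the first normalization layer is a softmax (the document later suggests this): the query-key product is scaled by the square root of the feature dimension C, the relative geometric structure weight is added, and the result is normalized. This is an illustrative sketch, not the patent's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

def attention_weight(mod_q, mod_keys, geo_weights, C):
    """Scaled dot product of the modified query with each modified key,
    plus the relative geometric structure weight, then normalized."""
    logits = np.array([mod_q @ k for k in mod_keys]) / np.sqrt(C)
    return softmax(logits + geo_weights)
```

A relative point with a larger geometric structure weight thus receives a larger share of the attention, all else equal.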
  • the relative geometric structure weight of the key point and the relative point decreases as the distance between the key point and the relative point increases; increases as the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located increases; decreases as the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located increases; and decreases as the angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located increases.
  • encoding the first feature encoding of the reference center point according to the correlation between the reference center point and other reference center points, to obtain the second feature encoding of the reference center point, includes: for each reference center point, determining the second feature encoding of the reference center point based on the self-attention mechanism, according to the first feature encoding of the reference center point, the first feature encodings of other reference center points, and the relative positional relationship between the reference center point and other reference center points.
  • determining the second feature encoding of the reference center point based on the self-attention mechanism includes: for each reference center point, inputting the first feature encoding and position information of the reference center point, and the first feature encodings and position information of other reference center points, into the second self-attention module of the encoder in the second conversion model; in the second self-attention module, treating the other reference center points as relative center points, and for each relative center point, inputting the position information of the reference center point and the position information of the relative center point into the fourth, fifth and sixth position coding layers respectively, to determine the fourth, fifth and sixth relative position codes of the reference center point and the relative center point; determining the key vector and value vector of the relative center point according to the products of the feature information of the relative center point with the key matrix and the value matrix in the second self-attention module; determining the query vector of the reference center point according to the product of the feature information of the reference center point and the query matrix in the second self-attention module; and determining the second feature encoding of the reference center point according to the fourth, fifth and sixth relative position codes of the reference center point and each relative center point, the key vector and value vector of each relative center point, and the query vector of the reference center point.
  • determining the second feature encoding of the reference center point according to the key vectors, the value vectors and the query vector includes: for each relative center point, taking the sum of the fourth relative position code of the reference center point and the relative center point and the query vector of the reference center point as the modified query vector of the reference center point; taking the sum of the fifth relative position code of the reference center point and the relative center point and the key vector of the relative center point as the modified key vector of the relative center point; taking the sum of the sixth relative position code of the reference center point and the relative center point and the value vector of the relative center point as the modified value vector of the relative center point; and inputting the product of the modified query vector of the reference center point and the modified key vector of the relative center point, together with the dimension of the first feature encoding of the reference center point, into the second normalization layer.
  • the fourth position coding layer, the fifth position coding layer and the sixth position coding layer are respectively a fourth feedforward network, a fifth feedforward network and a sixth feedforward network; inputting the position information of the reference center point and the position information of the relative center point into the fourth, fifth and sixth position coding layers respectively includes: inputting the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth feedforward network, the fifth feedforward network and the sixth feedforward network respectively.
  • classifying each key point and determining the points classified as target centers as reference center points includes: for each key point, inputting the first feature encoding of the key point into the classification network to obtain the classification result of the key point, and determining whether the key point is a point at the target center according to the classification result.
  • the classification network is trained by using the position information of each key point with annotation information as training data, wherein, for each key point, in the case where the key point is located within the bounding box of a target and is the point closest to the target center, the annotation information of the key point marks it as a point at the target center.
  • predicting the location and category of each target in the point cloud data according to the second feature encoding of each reference center point includes: inputting the second feature encoding of each reference center point into the decoder in the second conversion model to obtain the feature vector of each reference center point; and inputting the feature vector of each reference center point into the target detection network to obtain the location and category of each target in the point cloud data.
  • other key points within the preset range of a key point are determined as follows: for each key point, the other key points are sorted in ascending order of distance from the key point, and a preset number of them are selected from front to back as the other key points within the preset range of the key point.
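The neighbor selection just described is a k-nearest-neighbor query; a minimal sketch, assuming Euclidean distance over point coordinates:

```python
import numpy as np

def preset_range_neighbors(points, i, k):
    """Sort the other key points by ascending distance to key point i
    and keep the first k as the preset-range neighbors."""
    d = np.linalg.norm(points - points[i], axis=1)  # distances to point i
    order = np.argsort(d)                            # ascending order
    return [int(j) for j in order if j != i][:k]     # drop i itself, take k

# Example: the 2 nearest neighbors of the first point.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [0.5, 0.0]])
neighbors = preset_range_neighbors(pts, 0, 2)  # indices of 2 nearest points
```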
  • a device for detecting targets in point cloud data, including: a feature extraction module for inputting point cloud data into a point cloud feature extraction network to obtain multiple key points in the point cloud data and the feature information of each key point; a first encoding module for encoding, for each key point, the feature information of the key point according to the correlation between the key point and other key points within the preset range of the key point, to obtain the first feature encoding of the key point; a classification module for classifying each key point and determining the points classified as target centers as reference center points; a second encoding module for encoding, for each reference center point, the first feature encoding of the reference center point according to the correlation between the reference center point and other reference center points, to obtain the second feature encoding of the reference center point; and a target detection module for predicting the location and category of each target in the point cloud data according to the second feature encoding of each reference center point.
  • a device for detecting targets in point cloud data, including: a processor; and a memory coupled to the processor for storing instructions which, when executed by the processor, cause the processor to perform the method for detecting targets in point cloud data of any of the foregoing embodiments.
  • a non-transitory computer-readable storage medium on which a computer program is stored, wherein, when the program is executed by a processor, the steps of the method for detecting targets in point cloud data of any of the foregoing embodiments are implemented.
  • a target sorting device, including: the device for detecting targets in point cloud data of any of the foregoing embodiments, and a sorting component; the sorting component is used to sort the targets according to the location and category of each target in the point cloud data output by the detection device.
  • the device further includes: a point cloud collection component configured to collect point cloud data in a preset area and send the point cloud data to the device for detecting targets in the point cloud data.
  • a computer program including instructions which, when executed by a processor, cause the processor to perform the method for detecting targets in point cloud data of any of the foregoing embodiments.
  • Figure 1 shows a schematic flowchart of a method for detecting targets in point cloud data according to some embodiments of the present disclosure.
  • FIG. 2 shows a schematic diagram of a model for detecting targets in point cloud data according to other embodiments of the present disclosure.
  • Figure 3 shows a schematic structural diagram of a device for detecting targets in point cloud data according to some embodiments of the present disclosure.
  • Figure 4 shows a schematic structural diagram of a device for detecting targets in point cloud data according to other embodiments of the present disclosure.
  • Figure 5 shows a schematic structural diagram of a device for detecting targets in point cloud data according to further embodiments of the present disclosure.
  • Figure 6 shows a schematic structural diagram of an object sorting device according to some embodiments of the present disclosure.
  • a technical problem to be solved by the present disclosure is to propose a method for detecting targets in point cloud data, so as to improve the accuracy of target detection in point cloud data.
  • the present disclosure provides a method for detecting targets in point cloud data, which is described below with reference to Figure 1 .
  • Figure 1 is a flow chart of some embodiments of a method for detecting targets in point cloud data of the present disclosure. As shown in Figure 1, the method of this embodiment includes steps S102 to S110.
  • In step S102, the point cloud data is input into the point cloud feature extraction network to obtain multiple key points in the point cloud data and the feature information of each key point.
  • Given a point cloud of N points with XYZ coordinates as input, the point cloud feature extraction network can downsample the point cloud data and learn deep features of each point, thereby outputting a subset of points, each represented by a C-dimensional feature (C is a positive integer); these points are regarded as key points.
  • Point cloud feature extraction networks such as VoxelNet, PointNet, PointNet++ and 3DSSD can be used to extract the key points and their feature information from the point cloud data; the network is not limited to these examples.
  • the PointNet++ network is used as the point cloud feature extraction network.
  • the input point cloud is first downsampled to 1/8 of the original resolution (i.e., N/8 points) through 4 set abstraction layers, and then upsampled to 1/2 of the original resolution (i.e., N/2 points) through the feature propagation layer; each point is represented by a C-dimensional feature.
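The patent does not prescribe a particular sampling algorithm inside the set abstraction layers; as one illustrative sketch, PointNet++-style set abstraction commonly begins with farthest point sampling to pick the downsampled subset. The function below is an assumption for illustration, not the patent's method.

```python
import numpy as np

def farthest_point_sample(points, m):
    """Hypothetical sketch of farthest point sampling: greedily pick the
    point farthest from all points chosen so far, until m points remain."""
    chosen = [0]                                           # start from point 0
    dist = np.linalg.norm(points - points[0], axis=1)      # distance to chosen set
    while len(chosen) < m:
        nxt = int(np.argmax(dist))                         # farthest remaining point
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return points[chosen]

# Example: reduce 3 points to the 2 most spread-out ones.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
subset = farthest_point_sample(pts, 2)
```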
  • the set of key points is, for example, expressed as a set in which f i represents the feature information of the i-th key point and is a feature vector.
  • the number and sampling methods of key points are not limited to the above examples and are determined based on the actual application model and test results.
  • In step S104, for each key point, the feature information of the key point is encoded according to the correlation between the key point and other key points within the preset range of the key point, and the first feature encoding of the key point is obtained.
  • For each key point, the other key points are sorted in ascending order of distance from the key point, and a preset number of them are selected from front to back as the other key points within the preset range of the key point. For example, for each key point, the K key points closest to it are used as the other key points within its corresponding preset range (local area).
  • For each key point, the first feature encoding of the key point is determined based on the self-attention mechanism, according to the feature information of the key point, the feature information of other key points within the preset range of the key point, and the relative positional relationship between the key point and those other key points.
  • In this way, the importance, or contribution, of the other key points within the preset range relative to the encoding of the key point can be determined, and the features of those other key points are combined when encoding the key point, which improves the accuracy of the feature expression of the key point. Introducing the relative positional relationship between the key point and the other key points within its preset range into the attention mechanism further improves the accuracy of the feature expression, and thereby the accuracy of target detection.
  • The feature information and position information of the key point, and the feature information and position information of other key points within the preset range of the key point, are input into the first self-attention module of the encoder in the first conversion model. In the first self-attention module, the other key points within the preset range of the key point are used as relative points. For each relative point, the position information of the key point and the position information of the relative point are input into the first, second and third position coding layers respectively, to determine the first, second and third relative position codes of the key point and the relative point. The key vector and value vector of the relative point are determined from the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module; the query vector of the key point is determined from the product of the feature information of the key point and the query matrix in the first self-attention module. Then, for each relative point: the sum of the first relative position code of the key point and the relative point and the query vector of the key point is used as the modified query vector of the key point; the sum of the second relative position code of the key point and the relative point and the key vector of the relative point is used as the modified key vector of the relative point; the sum of the third relative position code of the key point and the relative point and the value vector of the relative point is used as the modified value vector of the relative point. The product of the modified query vector of the key point and the modified key vector of the relative point, together with the dimension of the feature information of the key point, is input into the first normalization layer to obtain the weight of the relative point. Finally, a weighted sum of the modified value vectors of the relative points, according to their weights, gives the first feature encoding of the key point.
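The whole first-self-attention step can be sketched end to end. This is a single-head, numpy-only illustration under stated assumptions: the normalization layer is a softmax, `pe1`/`pe2`/`pe3` stand in for the three position coding layers applied to the coordinate difference, and `Wq`/`Wk`/`Wv` are the query, key and value matrices (all names illustrative).

```python
import numpy as np

def first_feature_encoding(f_q, pos_q, feats_k, pos_k, Wq, Wk, Wv, pe1, pe2, pe3):
    """Sketch of the first self-attention step: modified query/key/value
    vectors with relative position codes, softmax weights, weighted sum."""
    C = f_q.shape[0]                       # dimension of the feature information
    q = Wq @ f_q                           # query vector of the key point
    logits, vals = [], []
    for f_k, p_k in zip(feats_k, pos_k):
        delta = pos_q - p_k                # coordinate difference to relative point
        q_mod = q + pe1(delta)             # modified query vector
        k_mod = Wk @ f_k + pe2(delta)      # modified key vector
        v_mod = Wv @ f_k + pe3(delta)      # modified value vector
        logits.append(q_mod @ k_mod / np.sqrt(C))
        vals.append(v_mod)
    w = np.exp(np.array(logits) - max(logits))
    w = w / w.sum()                        # first normalization layer (softmax)
    return sum(wi * vi for wi, vi in zip(w, vals))  # first feature encoding
```

With identity projections and zero position codes this reduces to plain scaled dot-product attention over the relative points.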
  • the first transformation model is a Transformer model. Since the first transformation model is used to determine the correlation between internal points of the target, it can be called a Local Transformer.
  • the first transformation model may include the encoder part in the Transformer.
  • the first position encoding layer, the second position encoding layer and the third position encoding layer are the first feedforward network (FFN), the second feedforward network and the third feedforward network respectively.
  • the first normalization layer is, for example, a softmax layer.
  • the encoder can also include an FFN (feedforward neural network) after the first self-attention module. If no FFN follows the first self-attention module, the output of the module can represent the first feature encoding; otherwise, the output after the FFN represents the first feature encoding. The query matrix, the key matrix and the value matrix belong to the first self-attention module, C is the dimension of the feature information of the key point, and the first, second and third position coding layers each correspond to a respective function.
  • W_PE and b_PE represent the parameters of the FFN (feedforward network); the W_PE and b_PE corresponding to the different position coding layers are different.
  • a multi-head attention mechanism can be applied in the encoder.
  • Each attention head can refer to the above formulas (1) and (2) to determine the encoding of key points.
  • the encodings of the attention heads are concatenated and multiplied by a preset matrix (or the product is further passed through an FFN) to obtain the first feature encoding of the key point.
  • the parameters of the query matrix, the key matrix, the value matrix, and the first, second and third position coding layers in each attention head are different.
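The multi-head combination step above can be sketched as follows, assuming each head has already produced its own encoding; `W_o` is an illustrative name for the preset matrix that mixes the concatenated heads.

```python
import numpy as np

def multi_head_combine(head_outputs, W_o):
    """Concatenate the per-head encodings and multiply by the preset
    matrix to obtain the combined first feature encoding."""
    return W_o @ np.concatenate(head_outputs)

# Example: two heads, each producing a 2-dimensional encoding,
# combined back into a 4-dimensional encoding by an identity mix.
combined = multi_head_combine([np.array([1.0, 2.0]), np.array([3.0, 4.0])],
                              np.eye(4))
```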
  • the first feature encoding output by the Local Transformer contains the contextual information of the local area where the key point is located, that is, the correlation between points inside the target.
  • In step S106, each key point is classified, and the points classified as target centers are determined as reference center points.
  • The first feature encoding of the key point is input into the classification network to obtain the classification result of the key point; whether the key point is a point at the target center is determined according to the classification result.
  • the classification network is trained by using the position information of each key point with annotation information as training data, wherein, for each key point, in the case where the key point is located within the bounding box of a target and is the point closest to the target center, the annotation information of the key point marks it as a point at the target center.
  • the key points output by the first conversion model are dense points. Not every point represents a separate target (object). In order to reduce the redundancy of the final detection result, all key points are filtered and only those located at the center of the target are retained. key points. Therefore, it is necessary to judge whether each key point is the real target center.
  • each keypoint is assigned a label. If a keypoint is located within the bounding box of an object and is the point closest to the center of the object, it is assigned a positive label, otherwise it is assigned a negative label. Train a binary classification network based on the labels of key points. During the testing process, all key points are input into the binary classification network, and only key points with positive classification results are retained as reference center points.
  • step S108 for each reference center point, the first feature code of the reference center point is encoded according to the correlation between the reference center point and other reference center points to obtain the second feature of the reference center point. coding.
  • the He refers to the first feature code of the center point and the relative positional relationship between the reference center point and other reference center points, and determines the second feature code of the reference center point based on the self-attention mechanism.
  • the first feature code and position information of the reference center point, and the first feature codes and position information of other reference center points are input into the encoder in the second conversion model.
  • the second self-attention module in the second self-attention module, other reference center points are used as relative center points, and for each relative center point, the position information of the reference center point and the position information of the relative center point are
  • the fourth position coding layer, the fifth position coding layer and the sixth position coding layer are respectively input to determine the fourth relative position coding, the fifth relative position coding and the sixth relative position coding of the reference center point and the relative center point; according to The key vector and value vector of the relative center point are determined by multiplying the feature information of the relative center point with the key matrix and value matrix in the second self-attention module respectively; according to the feature information of the reference center point and the second self-attention module The product of the query matrix in the module determines the query vector of the reference center point; according to the fourth relative position code, the fifth relative position code, and the sixth relative
  • the sum of the reference center point, the fourth position code of the relative center point, and the query vector of the reference center point is used as the modified query vector of the reference center point.
  • the sum of the fifth position code of the reference center point and the relative center point and the key vector of the relative center point is used as the modified key vector of the relative center point;
  • the sixth position code of the reference center point and the relative center point is The sum of the position code and the value vector of the relative center point is used as the correction value vector of the relative center point; the product of the correction query vector of the reference center point and the correction key vector of the relative center point is multiplied by the third value vector of the reference center point.
  • the dimension of a feature code is input to the second normalization layer to obtain the weight of the relative center point; the correction value vector of each relative center point is weighted and summed according to the weight of each relative center point to obtain the second reference center point.
  • Feature encoding is input to the second normalization layer to obtain the weight of the relative center point; the correction value vector of each relative center point is weighted and summed according to the weight of each relative center point to obtain the second reference center point.
  • the second transformation model is a Transformer model. Since the second transformation model is used to determine the correlation between targets, it can be called a Global Transformer.
  • the second transformation model may include the encoder and decoder parts in the Transformer.
  • the fourth relative position coding, the fifth relative position coding and the sixth relative position coding are respectively the fourth feedforward network, the fifth feedforward network and the sixth feedforward network, for each reference center point and each corresponding Relative to the center point, input the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth feedforward network, the fifth feedforward network and the sixth feedforward network respectively to determine the reference center point and the relative center point
  • the fourth relative position code, the fifth relative position code and the sixth relative position code is, for example, a softmax layer.
  • M reference center points are obtained from key points, and Global Transformer aims to learn the correlation between these M different targets.
  • the feature set of M reference center points (for example, M reference center points) is expressed as ) is input into the Global Transformer module to model the correlation between different targets:
  • h i is the output of the second self-attention module of the encoder of Global Transformer.
  • the encoder can also include FFN after the second self-attention module. If FFN is not included after the second self-attention module, h i can represent the second feature encoding. Otherwise, the output of h i after FFN represents the second feature encoding; is the query matrix in the second self-attention module, is the key matrix in the second self-attention module, is the second self-attention module median matrix, C is the dimension of the feature information of the key point, Respectively represent the functions corresponding to the fourth position coding layer, the fourth position coding layer and the fourth position coding layer.
  • the specific form of can refer to formula (2).
  • h i is the high-level feature of the i-th reference center of the output, which includes both the correlation of internal points of the target and the correlation between different targets.
  • step S110 the location and category of each target in the point cloud data are predicted based on the second feature encoding of each reference center point.
  • the second feature encoding of each reference center point is input to the decoder in the second conversion model to obtain the feature vector of each reference center point; the feature vector of each reference center point is input to the target detection network to obtain the point cloud. The location and category of each target in the data.
  • the target detection network is, for example, FFN.
  • the target detection network determines the location and category of each target in the point cloud data based on the feature vector that contains the correlation between internal points of the target and the correlation between different targets.
  • the key points of the point cloud data and the characteristic information of each key point are extracted.
  • the key point is determined based on the correlation between the key point and other key points within the preset range of the key point.
  • the first feature encoding of the point reflects the correlation between points in the local area inside the target.
  • the key points are divided into points at the target center and points at the non-target center.
  • the key points at the target center are point as a reference center point.
  • the second feature code of the reference center point is determined based on the correlation between the reference center point and other reference center points.
  • the local area inside the target is On top of the correlation between points in the method, the correlation between targets is added, and then the location and category of each target in the point cloud data are predicted based on the second feature encoding of each reference center point.
  • the solution of the above embodiment no longer analyzes the distance between all points in the point cloud. correlation modeling, but divides the correlation between points into correlation within the target and correlation between targets, which can capture local and global dependencies in the point cloud at the same time and adapt to the three-dimensional structure of point cloud data.
  • Features and irregularities improve the accuracy of target detection in point cloud data. In addition, it can also improve detection efficiency and save computing costs.
  • the solutions of the above embodiments are also improved.
  • the inventor further mined the three-dimensional features of point cloud data and introduced the geometric structure features between points into the encoding process, making the learning of the features of point cloud data more accurate, thus improving the accuracy of target detection. Specific embodiments are described below.
  • step S104 for each key point, according to the characteristic information of the key point, the characteristic information of other key points within the preset range of the key point, the key point is within the preset range of the key point.
  • the relative positional relationship between other key points, as well as the relative geometric structure relationship between the key point and other key points within the preset range of the key point determine the first feature encoding of the key point based on the self-attention mechanism.
  • the characteristic information, position information and geometric structure information of the key point are combined with the characteristic information, position information and geometric structure information of other key points within the preset range of the key point.
  • the position information of the point and the position information of the relative point are input into the first position coding layer, the second position coding layer and the third position coding layer respectively, and the first relative position coding and the second relative position of the key point and the relative point are determined.
  • Encoding and third relative position encoding input the geometric structure information of the key point and the geometric structure information of the relative point into the geometric structure encoding layer to determine the relative geometric structure weight of the key point and the relative point; according to the characteristics of the relative point
  • the key vector and value vector of the relative point are determined by multiplying the information with the key matrix and value matrix in the first self-attention module respectively; according to the product of the feature information of the key point and the query matrix in the first self-attention module, determine The query vector of the key point; according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, each The value vector of the relative point and the query vector of the key point are used to determine the first feature encoding of the key point.
  • the sum of the first relative position code of the key point and the relative point and the query vector of the key point is used as the modified query vector of the key point;
  • the sum of the second relative position code of the point and the relative point and the key vector of the relative point is used as the modified key vector of the relative point;
  • the third relative position code of the key point and the relative point is combined with the value of the relative point.
  • the sum of vectors is used as the correction value vector of the relative point;
  • the weight of the relative point is obtained;
  • the correction value vector of each relative point is weighted and summed according to the weight of each relative point, and the first feature code of the key point is obtained.
  • the product of the modified query vector of the key point and the modified key vector of the relative point is divided by the square root of the dimension of the feature information of the key point, and then divided by the relative value of the key point and the relative point.
  • the geometric structure weights are added, and the result is input into the first normalization layer to obtain the weight of the relative point.
  • the geometric structure information includes: at least one of the normal vector of the local plane and the curvature radius of the local plane.
  • the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located the curvature radius of the local plane where the key point is located.
  • the difference between the radius of curvature of the local plane where the relative point is located, and at least one of the angles between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located determines the key point and the relative point. relative geometric structure weight.
  • the relative geometric structure weight of the key point and the relative point decreases as the distance between the key point and the relative point increases; the relative geometric structure weight of the key point and the relative point , increases with the increase of the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located; the relative geometric structure weight of the key point and the relative point increases with the increase of the normal vector of the local plane where the key point is located.
  • the difference between the curvature radius of the local plane and the curvature radius of the local plane where the relative point is located increases; the relative geometric structure weight of the key point and the relative point increases with the normal vector of the local plane where the key point is located and the relative The angle between the normal vectors of the local plane where the point is located increases as the result after the feature propagation layer increases.
  • the first transformation model is the Transformer model, which is called Local Transformer.
  • the first position coding layer, the second position coding layer and the third position coding layer may be a first feedforward network (FFN), a second feedforward network and a third feedforward network respectively, and the first normalization layer may be softmax. layer.
  • FNN first feedforward network
  • second feedforward network a second feedforward network
  • third feedforward network a third feedforward network respectively
  • the first normalization layer may be softmax. layer.
  • G i, j represents the relative geometric structure weight of key points i and j, which can be determined using the following formula:
  • n i and n j respectively represent the normal vectors of the local plane where the key points i and j are located
  • c i and c j respectively represent the curvature radius of the local plane where the key points i and j are located
  • the method representing the local plane where the key point i is located
  • ⁇ 1 , ⁇ 2 and ⁇ 3 are the parameters of the geometric structure encoding layer
  • FFN is the feedforward neural network or feature propagation layer.
  • G i,j is a Gaussian function model.
  • the correlation between two points is calculated through geometric parameters such as local plane normal vector, local curvature radius, and normal vector angle. If the correlation between two points is stronger, Stronger, the corresponding Gaussian weight G i,j will be larger.
  • the N points in the neighborhood that are closest to it find the N points in the neighborhood that are closest to it, and then use the least squares method to find a plane so that the sum of the distances projected by these N points onto this plane is the smallest, so The plane is the local plane.
  • the method of the above embodiment adds relative geometric structure weight to express the geometric structure relationship between points, and integrates object geometric features such as local plane normal vector, local curvature radius, and normal vector angle into the self-attention mechanism. , design an efficient feature extraction model and target detection model specifically for processing point cloud data.
  • the point cloud feature extraction network Point Cloud Backbone, point cloud backbone network
  • the feature information of the key points Point Feature
  • the feature information of the key points into the Local-Global Transformer Model.
  • the feature information and location information of each key point and other key points in the local area of the key point are input into the Local Transformer module, and the geometric structure information of each key point is input into the Local Transformer module to obtain the third A feature encoding.
  • the classification network for example, including the Sampling/Pooling module
  • the solution of the above embodiment proposes an end-to-end 3D point cloud target detection network based on the Transformer model, which can be called 3DTrans. It takes a 3D point cloud as input and outputs a set of labeled 3D bounding boxes to represent the target. The location of the (object).
  • the overall structure of the 3DTrans detection network is shown in Figure 2, which consists of two main components: feature extraction network and Local-Global Transformer. Given a point cloud of N points with XYZ coordinates as input, the feature extraction network downsamples the point cloud and learns the deep features of each point, thereby outputting a subset of points, and each point in the subset is represented by a C-dimensional feature representation, consider these points as key points. Local-Global Transformer takes the features of these key points as input and outputs the final target detection result.
  • the traditional Transformer model has been improved in two aspects, making it more suitable for processing 3D point cloud data.
  • the correlation between points is divided into for the correlation within objects and the correlation between objects.
  • the Local Transformer module is used to learn the correlation between points and points in the local area inside the same object
  • the Global Transformer module is used to learn the correlation between different objects.
  • Local -The Global Transformer model not only reduces the computational cost, but also captures local and global dependencies in the point cloud, thereby improving the model's learning expression ability.
  • object geometric structure information is added to the traditional Transformer model, and object geometric features such as local plane normal vector, local curvature radius, and normal vector angle are integrated into the self-attention mechanism, thereby designing a dedicated An efficient Transformer model for processing point cloud data.
  • the disclosed method does not require a large number of manual design components, does not require a large amount of prior knowledge, and does not require screening out redundant candidate frames for a large number of post-processing operations.
  • the model is simple and can be trained end-to-end, and the calculation cost is low. The processing efficiency is high and the accuracy is high.
  • the disclosed model can be trained end-to-end, labeling point cloud data images and labeling the bounding boxes and categories of each target as training samples.
  • the first feature encoding is input into the classification network to classify each key point and determine the point classified as the target center as the reference center point; input the feature information and position information of each reference center point into the second conversion model, and for each reference center point, according to the correlation between the reference center point and other reference center points, encode the first feature code of the reference center point to obtain the second feature code of the reference center point; convert the second feature code of each reference center point
  • the difference between the position and category of each target in the data and the bounding box and category of each annotated target is used to train the point cloud feature extraction network, the first conversion model, the classification network, the second conversion model, and the target detection network.
  • the point cloud feature extraction network and classification network can be pre-trained. For specific details, reference may be made to the foregoing embodiments and will not be described again here.
  • the present disclosure also proposes a device for detecting targets in point cloud data, which will be described below with reference to Figure 3 .
  • Figure 3 is a structural diagram of some embodiments of a device for detecting objects in point cloud data of the present disclosure.
  • the device 30 of this embodiment includes: a feature extraction module 310 , a first encoding module 320 , a classification module 330 , a second encoding module 340 , and a target detection module 350 .
  • Feature extraction module 310 is used to input point cloud data into the point cloud feature extraction network to obtain the output point cloud number. Multiple key points in the data, as well as the characteristic information of each key point.
  • the first encoding module 320 is configured to encode, for each key point, the characteristic information of the key point according to the correlation between the key point and other key points within the preset range of the key point, to obtain the key point.
  • the first characteristic encoding is configured to encode, for each key point, the characteristic information of the key point according to the correlation between the key point and other key points within the preset range of the key point, to obtain the key point.
  • the classification module 330 is used to classify each key point and determine the point classified as the target center as the reference center point.
  • the second encoding module 340 is configured to encode, for each reference center point, the first feature code of the reference center point according to the correlation between the reference center point and other reference center points, to obtain the first feature code of the reference center point. Second feature encoding.
  • the target detection module 350 is used to predict the location and category of each target in the point cloud data based on the second feature code of each reference center point.
  • the first encoding module 320 is used for each key point, according to the characteristic information of the key point, the characteristic information of other key points within the preset range of the key point, and the relationship between the key point and the key point.
  • the relative positional relationship between other key points within the preset range is determined based on the self-attention mechanism to determine the first feature encoding of the key point.
  • the first encoding module 320 is configured to, for each key point, input the characteristic information and position information of the key point, and the characteristic information and position information of other key points within the preset range of the key point into the first The first self-attention module of the encoder in the conversion model; in the first self-attention module, other key points within the preset range of the key point are used as relative points, and for each relative point, the position of the key point is The information and the position information of the relative point are respectively input into the first position coding layer, the second position coding layer and the third position coding layer to determine the first relative position coding, the second relative position coding and the third relative position coding of the key point and the relative point.
  • Three relative position encodings determine the key vector and value vector of the relative point based on the product of the characteristic information of the relative point and the key matrix and value matrix in the first self-attention module respectively; determine the key vector and value vector of the relative point based on the characteristic information of the key point and the first self-attention module
  • the product of the query matrix in the self-attention module determines the query vector of the key point; according to the first relative position code, the second relative position code, and the third relative position code of the key point and each relative point, each relative point
  • the key vector, the value vector of each relative point and the query vector of the key point determine the first feature encoding of the key point.
  • the first encoding module 320 is configured to, for each relative point, encode the sum of the key point and the first relative position encoding of the relative point and the query vector of the key point as a modified query of the key point Vector; the sum of the second relative position code of the key point and the relative point and the key vector of the relative point is used as the modified key vector of the relative point; the sum of the third relative position code of the key point and the relative point and The sum of the value vectors of this relative point, As the correction value vector of the relative point; input the product of the correction query vector of the key point and the correction key vector of the relative point and the dimension of the feature information of the key point into the first normalization layer to obtain the relative point Weight; perform a weighted summation of the correction value vectors of each relative point based on the weight of each relative point to obtain the first feature code of the key point.
  • the first position encoding layer, the second position encoding layer and the third position encoding layer are respectively a first feedforward network, a second feedforward network and a third feedforward network, and the first encoding module 320 is used to The difference between the coordinates of the key point and the coordinates of the relative point is input into the first feedforward network, the second feedforward network, and the third feedforward network respectively.
  • the first encoding module 320 is configured to, for each key point, according to the characteristic information of the key point, the characteristic information of other key points within the preset range of the key point, the key point is preset with the key point. Assume the relative positional relationship between other key points within the range, and the relative geometric structure relationship between the key point and other key points within the preset range of the key point, and determine the first feature encoding of the key point based on the self-attention mechanism. .
  • the first encoding module 320 is configured to, for each key point, combine the feature information, location information and geometric structure information of the key point, and the feature information, location information of other key points within the preset range of the key point.
  • Information and geometric structure information are input into the first self-attention module of the encoder in the first conversion model; in the first self-attention module, other key points within the preset range of the key point are used as relative points, and for each relative point, input the position information of the key point and the position information of the relative point into the first position encoding layer, the second position encoding layer and the third position encoding layer respectively, and determine the first relative position encoding of the key point and the relative point.
  • the second relative position coding and the third relative position coding input the geometric structure information of the key point and the geometric structure information of the relative point into the geometric structure coding layer to determine the relative geometric structure weight of the key point and the relative point; according to The key vector and value vector of the relative point are determined by multiplying the characteristic information of the relative point with the key matrix and the value matrix in the first self-attention module respectively; according to the characteristic information of the key point and the query in the first self-attention module The product of the matrix determines the query vector of the key point; according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the weight of each relative point.
  • the key vector, the value vector of each relative point and the query vector of the key point determine the first feature encoding of the key point.
  • the first encoding module 320 is configured to, for each relative point, encode the sum of the key point and the first relative position encoding of the relative point and the query vector of the key point as a modified query of the key point Vector; the sum of the second relative position code of the key point and the relative point and the key vector of the relative point is used as the modified key vector of the relative point; the sum of the third relative position code of the key point and the relative point and The sum of the value vectors of the relative point is used as the correction value vector of the relative point; the product of the correction query vector of the key point and the correction key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and Dimension input of the feature information of the key point In the first normalization layer, the weight of the relative point is obtained; according to the weight of each relative point, the correction value vector of each relative point is weighted and summed to obtain the first feature code of the key point.
  • the first encoding module 320 is used to divide the product of the modified query vector of the key point and the modified key vector of the relative point by the square root of the dimension of the feature information of the key point, and then divide the product of the modified query vector of the key point and the key point with The relative geometric structure weights of the relative points are added up, and the result is input into the first normalization layer to obtain the weight of the relative point.
  • the geometric structure information includes: at least one of the normal vector of the local plane and the curvature radius of the local plane.
  • the first encoding module 320 is used to determine the key point based on the distance between the key point and the relative point.
  • the dot product of the normal vector of the local plane and the normal vector of the local plane where the relative point is located, the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located, and the local plane where the key point is located At least one of the angles between the normal vector of the key point and the normal vector of the local plane where the relative point is located determines the relative geometric structure weight of the key point and the relative point.
  • the relative geometric structure weight of the key point and the relative point decreases as the distance between the key point and the relative point increases; increases as the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located increases; increases as the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located increases; and increases as the angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located increases.
  • the second encoding module 340 is configured to, for each reference center point, determine the second feature encoding of the reference center point based on the self-attention mechanism, according to the first feature encoding of the reference center point, the first feature encodings of the other reference center points, and the relative positional relationships between the reference center point and the other reference center points.
  • the second encoding module 340 is configured to, for each reference center point, input the first feature encoding and position information of the reference center point and the first feature encodings and position information of the other reference center points into the second self-attention module of the encoder in the second transformation model; in the second self-attention module, the other reference center points are used as relative center points, and for each relative center point, the position information of the reference center point and the position information of the relative center point are input into the fourth position encoding layer, the fifth position encoding layer and the sixth position encoding layer respectively to determine the fourth, fifth and sixth relative position encodings of the reference center point and the relative center point; the key vector and value vector of the relative center point are determined from the products of the feature information of the relative center point with the key matrix and the value matrix in the second self-attention module respectively; the query vector of the reference center point is determined from the product of the feature information of the reference center point and the query matrix in the second self-attention module; and the second feature encoding of the reference center point is determined according to the fourth, fifth and sixth relative position encodings of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point.
  • the second encoding module 340 is configured to, for each relative center point: take the sum of the fourth position encoding of the reference center point and the relative center point and the query vector of the reference center point as the modified query vector of the reference center point; take the sum of the fifth position encoding of the reference center point and the relative center point and the key vector of the relative center point as the modified key vector of the relative center point; take the sum of the sixth position encoding of the reference center point and the relative center point and the value vector of the relative center point as the modified value vector of the relative center point; input the product of the modified query vector of the reference center point and the modified key vector of the relative center point, together with the dimension of the first feature encoding of the reference center point, into the second normalization layer to obtain the weight of the relative center point; and perform a weighted sum of the modified value vectors of the relative center points according to their weights to obtain the second feature encoding of the reference center point.
  • the fourth position encoding layer, the fifth position encoding layer and the sixth position encoding layer are respectively a fourth feedforward network, a fifth feedforward network and a sixth feedforward network, and the second encoding module 340 is configured to input the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth, fifth and sixth feedforward networks respectively.
  • the classification module 330 is configured to, for each key point, input the first feature encoding of the key point into the classification network to obtain the classification result of the key point, and determine whether the key point is the center point of a target according to the classification result.
  • the classification network is trained using the position information of each key point, together with its annotation information, as training data, wherein, for each key point, in the case that the key point is located within the bounding box of a target and is the point closest to the target center, the annotation information of the key point indicates that the point is at the target center.
  • the target detection module 350 is configured to predict the position and category of each target in the point cloud data according to the second feature encoding of each reference center point, including: inputting the second feature encodings of the reference center points into the decoder in the second transformation model to obtain the feature vector of each reference center point; and inputting the feature vectors of the reference center points into the target detection network to obtain the position and category of each target in the point cloud data.
  • the other key points within the preset range of a key point are determined as follows: for each key point, the other key points are sorted in ascending order of their distance from the key point, and a preset number of them are selected in order from front to back as the other key points within the preset range of the key point.
  • the device for detecting objects in point cloud data in embodiments of the present disclosure can be implemented by various computing devices or computer systems, which will be described below with reference to FIG. 4 and FIG. 5 .
  • Figure 4 is a structural diagram of some embodiments of a device for detecting objects in point cloud data of the present disclosure.
  • the device 40 of this embodiment includes: a memory 410 and a processor 420 coupled to the memory 410.
  • the processor 420 is configured to execute the method for detecting targets in point cloud data in any of the embodiments of the present disclosure based on instructions stored in the memory 410.
  • the memory 410 may include, for example, system memory, fixed non-volatile storage media, etc.
  • System memory stores, for example, operating systems, applications, boot loaders, databases, and other programs.
  • FIG. 5 is a structural diagram of another embodiment of a device for detecting objects in point cloud data of the present disclosure.
  • the device 50 of this embodiment includes: a memory 510 and a processor 520, which are similar to the memory 410 and the processor 420 respectively. It may also include an input/output interface 530, a network interface 540, a storage interface 550, etc. These interfaces 530, 540, 550, the memory 510 and the processor 520 may be connected through a bus 560, for example.
  • the input and output interface 530 provides a connection interface for input and output devices such as a monitor, mouse, keyboard, and touch screen.
  • the network interface 540 provides a connection interface for various networked devices, such as a database server or a cloud storage server.
  • the storage interface 550 provides a connection interface for external storage devices such as SD cards and USB disks.
  • the present disclosure also provides an item sorting device, which will be described below in conjunction with FIG. 6.
  • the item sorting device 6 includes: the device 30/40/50 for detecting targets in point cloud data in any of the foregoing embodiments, and a sorting component 62 configured to sort the items corresponding to each target according to the position and category of each target in the point cloud data output by the device 30/40/50.
  • the device 6 further includes: a point cloud collection component 64, configured to collect point cloud data in a preset area and send the point cloud data to the device 30/40/50 for detecting targets in point cloud data.
  • the sorting component is, for example, a robotic arm, and the point cloud collection component is, for example, a three-dimensional camera.
  • the three-dimensional point cloud target detection technology proposed in this disclosure can be applied to products such as vision-based sorting robotic arms in logistics scenarios; that is, the point cloud data collected by a three-dimensional camera mounted on the sorting robotic arm can be used to accurately locate and identify each item, helping the robotic arm sort the items one by one.
  • the present disclosure also provides a computer program, including: instructions, which when executed by the processor, cause the processor to execute the method for detecting objects in point cloud data as in any of the foregoing embodiments.
  • embodiments of the present disclosure may be provided as methods, systems, or computer program products. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk memory, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • these computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
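The preset-range neighbor selection described above (sorting the other key points by distance and keeping a preset number of the nearest ones) can be sketched as follows; the array shapes and the neighbor count `k` are illustrative assumptions, not part of the disclosure:

```python
import numpy as np

def select_neighbors(points: np.ndarray, k: int) -> np.ndarray:
    """For each key point, return the indices of its k nearest other key points.

    points: (N, 3) array of key-point coordinates.
    Returns an (N, k) index array (the "preset range" of each key point).
    """
    # Pairwise Euclidean distances between all key points.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # Exclude each point itself by pushing its self-distance to infinity.
    np.fill_diagonal(dist, np.inf)
    # Sort in ascending order of distance and keep the first k indices.
    return np.argsort(dist, axis=1)[:, :k]
```

For example, with four points, `select_neighbors(pts, 2)` returns, for each point, the indices of its two closest companions.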


Abstract

The present disclosure relates to the technical field of computers, and provides a method and apparatus for detecting a target in point cloud data, and a computer-readable storage medium. The method comprises: inputting point cloud data into a point cloud feature extraction network, so as to obtain a plurality of key points in the point cloud data and feature information of each key point; for each key point, encoding the feature information of the key point according to the association between the key point and other key points within a preset range of the key point, so as to obtain a first feature code of the key point; classifying each key point, and determining the points classified as target centers as reference center points; for each reference center point, encoding the first feature code of the reference center point according to the association between the reference center point and other reference center points, so as to obtain a second feature code of the reference center point; and predicting the position and category of each target in the point cloud data according to the second feature code of each reference center point.

Description

Method and apparatus for detecting a target in point cloud data, and computer-readable storage medium
Cross-reference to related applications
This application is based on, and claims priority to, the application with CN application number 202210409033.6 filed on April 19, 2022, the disclosure of which is hereby incorporated into this application in its entirety.
Technical field
The present disclosure relates to the field of computer technology, and in particular to a method and apparatus for detecting targets in point cloud data, and a computer-readable storage medium.
Background
The purpose of 3D (three-dimensional) target detection is to identify and locate objects appearing in 3D point clouds; it has been widely applied in fields such as autonomous driving and augmented reality. Compared with 2D images, 3D point clouds can provide the geometric shape of objects and capture the 3D structure of a scene.
Summary
According to some embodiments of the present disclosure, a method for detecting targets in point cloud data is provided, including: inputting point cloud data into a point cloud feature extraction network to obtain a plurality of key points in the point cloud data and feature information of each key point; for each key point, encoding the feature information of the key point according to the correlation between the key point and other key points within a preset range of the key point, to obtain a first feature encoding of the key point; classifying each key point and determining the points classified as target centers as reference center points; for each reference center point, encoding the first feature encoding of the reference center point according to the correlation between the reference center point and the other reference center points, to obtain a second feature encoding of the reference center point; and predicting the position and category of each target in the point cloud data according to the second feature encodings of the reference center points.
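The five steps of the method can be summarized in the following sketch; all network objects (`backbone`, `encoder1`, `classifier`, `encoder2`, `detector`) are hypothetical stand-ins for the networks described in the disclosure, passed in as callables:

```python
def detect_targets(point_cloud, backbone, encoder1, classifier, encoder2, detector):
    # 1. Extract key points and per-point feature information.
    key_points, features = backbone(point_cloud)
    # 2. First feature encoding: attend over neighbors within each key point's preset range.
    first_codes = encoder1(key_points, features)
    # 3. Classify key points; those classified as target centers become reference center points.
    is_center = classifier(first_codes)
    centers = [p for p, c in zip(key_points, is_center) if c]
    center_codes = [f for f, c in zip(first_codes, is_center) if c]
    # 4. Second feature encoding: attend over all reference center points.
    second_codes = encoder2(centers, center_codes)
    # 5. Predict the position and category of each target.
    return detector(second_codes)
```

Each callable can be swapped for the concrete modules of the disclosure (point cloud feature extraction network, first/second transformation model, classification network, target detection network).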
In some embodiments, for each key point, encoding the feature information of the key point according to the correlation between the key point and other key points within the preset range of the key point to obtain the first feature encoding of the key point includes: for each key point, determining the first feature encoding of the key point based on the self-attention mechanism, according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, and the relative positional relationships between the key point and the other key points within the preset range of the key point.
In some embodiments, determining the first feature encoding of each key point based on the self-attention mechanism according to the feature information of the key point, the feature information of the other key points within its preset range, and the relative positional relationships between them includes: for each key point, inputting the feature information and position information of the key point and the feature information and position information of the other key points within the preset range of the key point into the first self-attention module of the encoder in the first transformation model; in the first self-attention module, using the other key points within the preset range of the key point as relative points, and for each relative point, inputting the position information of the key point and the position information of the relative point into the first position encoding layer, the second position encoding layer and the third position encoding layer respectively to determine the first, second and third relative position encodings of the key point and the relative point; determining the key vector and value vector of the relative point from the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module respectively; determining the query vector of the key point from the product of the feature information of the key point and the query matrix in the first self-attention module; and determining the first feature encoding of the key point according to the first, second and third relative position encodings of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
In some embodiments, determining the first feature encoding of the key point according to the first, second and third relative position encodings of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point includes: for each relative point, taking the sum of the first relative position encoding of the key point and the relative point and the query vector of the key point as the modified query vector of the key point; taking the sum of the second relative position encoding of the key point and the relative point and the key vector of the relative point as the modified key vector of the relative point; taking the sum of the third relative position encoding of the key point and the relative point and the value vector of the relative point as the modified value vector of the relative point; inputting the product of the modified query vector of the key point and the modified key vector of the relative point, together with the dimension of the feature information of the key point, into the first normalization layer to obtain the weight of the relative point; and performing a weighted sum of the modified value vectors of the relative points according to their weights to obtain the first feature encoding of the key point.
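A minimal numeric sketch of this modified attention step for a single key point, assuming the relative position encodings have already been computed; a softmax stands in for the first normalization layer:

```python
import numpy as np

def first_feature_encoding(q, keys, values, pe_q, pe_k, pe_v):
    """q: (d,) query vector of the key point.
    keys, values: (m, d) key/value vectors of the m relative points.
    pe_q, pe_k, pe_v: (m, d) first/second/third relative position encodings.
    """
    d = q.shape[0]
    logits = np.empty(len(keys))
    for i in range(len(keys)):
        q_mod = q + pe_q[i]        # modified query vector of the key point
        k_mod = keys[i] + pe_k[i]  # modified key vector of the relative point
        logits[i] = q_mod @ k_mod / np.sqrt(d)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()       # first normalization layer (softmax)
    v_mod = values + pe_v          # modified value vectors of the relative points
    return weights @ v_mod         # weighted sum -> first feature encoding
```

With zero position encodings and identical keys, the weights are uniform and the output is the mean of the value vectors, which is a quick sanity check on the normalization.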
In some embodiments, the first position encoding layer, the second position encoding layer and the third position encoding layer are respectively a first feedforward network, a second feedforward network and a third feedforward network, and inputting the position information of the key point and the position information of the relative point into the first, second and third position encoding layers respectively includes: inputting the difference between the coordinates of the key point and the coordinates of the relative point into the first, second and third feedforward networks respectively.
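The position encoding layers are feedforward networks applied to the coordinate difference; a toy single-hidden-layer version is sketched below, where the random weights and the output dimension of 8 are placeholder assumptions (in the disclosure these networks are learned):

```python
import numpy as np

rng = np.random.default_rng(0)

class PositionEncodingFFN:
    """Toy feedforward network mapping a 3-D coordinate difference to a d-dim encoding."""

    def __init__(self, d: int):
        # Random placeholder weights; learned in practice.
        self.w1 = rng.standard_normal((3, d))
        self.w2 = rng.standard_normal((d, d))

    def __call__(self, key_xyz: np.ndarray, rel_xyz: np.ndarray) -> np.ndarray:
        delta = key_xyz - rel_xyz                  # difference of coordinates, as described
        hidden = np.maximum(delta @ self.w1, 0.0)  # ReLU hidden layer
        return hidden @ self.w2

# Three independent networks yield the first, second and third relative position encodings.
pe1, pe2, pe3 = (PositionEncodingFFN(8) for _ in range(3))
```

Each of the three networks sees the same coordinate difference but produces a distinct encoding for the query, key and value paths.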
In some embodiments, for each key point, encoding the feature information of the key point according to the correlation between the key point and other key points within the preset range of the key point to obtain the first feature encoding of the key point includes: for each key point, determining the first feature encoding of the key point based on the self-attention mechanism, according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, the relative positional relationships between the key point and those other key points, and the relative geometric structure relationships between the key point and those other key points.
In some embodiments, determining the first feature encoding of each key point based on the self-attention mechanism according to the feature information, the relative positional relationships and the relative geometric structure relationships includes: for each key point, inputting the feature information, position information and geometric structure information of the key point and of the other key points within the preset range of the key point into the first self-attention module of the encoder in the first transformation model; in the first self-attention module, using the other key points within the preset range of the key point as relative points, and for each relative point, inputting the position information of the key point and the position information of the relative point into the first, second and third position encoding layers respectively to determine the first, second and third relative position encodings of the key point and the relative point; inputting the geometric structure information of the key point and the geometric structure information of the relative point into the geometric structure encoding layer to determine the relative geometric structure weight of the key point and the relative point; determining the key vector and value vector of the relative point from the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module respectively; determining the query vector of the key point from the product of the feature information of the key point and the query matrix in the first self-attention module; and determining the first feature encoding of the key point according to the first, second and third relative position encodings and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
In some embodiments, determining the first feature encoding of the key point according to the first, second and third relative position encodings and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point includes: for each relative point, taking the sum of the first relative position encoding of the key point and the relative point and the query vector of the key point as the modified query vector of the key point; taking the sum of the second relative position encoding of the key point and the relative point and the key vector of the relative point as the modified key vector of the relative point; taking the sum of the third relative position encoding of the key point and the relative point and the value vector of the relative point as the modified value vector of the relative point; inputting the product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into the first normalization layer to obtain the weight of the relative point; and performing a weighted sum of the modified value vectors of the relative points according to their weights to obtain the first feature encoding of the key point.
In some embodiments, inputting the product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into the first normalization layer to obtain the weight of the relative point includes: dividing the product of the modified query vector of the key point and the modified key vector of the relative point by the square root of the dimension of the feature information of the key point, adding the relative geometric structure weight of the key point and the relative point, and inputting the result into the first normalization layer to obtain the weight of the relative point.
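The computation of this paragraph, dividing the query-key product by the square root of the feature dimension, adding the relative geometric structure weight, and normalizing, can be sketched as follows; a softmax stands in for the first normalization layer and the inputs are illustrative:

```python
import numpy as np

def geometry_aware_weights(q_mod, k_mod, geo, d):
    """q_mod: (m, d) modified query vectors (one per relative point);
    k_mod: (m, d) modified key vectors;
    geo: (m,) relative geometric structure weights;
    d: dimension of the feature information of the key point."""
    # Scaled dot product plus the geometric structure term.
    logits = np.einsum("md,md->m", q_mod, k_mod) / np.sqrt(d) + geo
    w = np.exp(logits - logits.max())
    return w / w.sum()  # first normalization layer (softmax)
```

With equal query-key products, the relative point with the larger geometric structure weight receives the larger attention weight, which is the intended effect of the geometric term.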
In some embodiments, the geometric structure information includes at least one of: the normal vector of the local plane where the point is located, and the curvature radius of that local plane. Determining the relative geometric structure weight of the key point and the relative point includes: determining the relative geometric structure weight according to at least one of: the distance between the key point and the relative point; the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located; the difference between the curvature radius of the local plane where the key point is located and the curvature radius of the local plane where the relative point is located; and the angle between the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located.
In some embodiments, the relative geometric structure weight of the key point and the relative point decreases as the distance between the key point and the relative point increases; increases as the dot product of the normal vector of the local plane where the key point is located and the normal vector of the local plane where the relative point is located increases; increases as the difference between the curvature radius of the local plane where the key point is located and that of the local plane where the relative point is located increases; and increases as the result of passing the angle between the two normal vectors through the feature propagation layer increases.
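One illustrative form of the relative geometric structure weight consistent with the monotonicity constraints above (decreasing in distance; increasing in the normal-vector dot product, the curvature-radius difference, and the propagated normal angle). The linear combination and unit coefficients are assumptions for the sketch, not the disclosure's exact formula, and the four quantities are treated as precomputed inputs:

```python
def relative_geometry_weight(dist, dot, dr, propagated_angle):
    """dist: distance between key point and relative point (weight decreases in this);
    dot: dot product of the two local-plane normal vectors (weight increases);
    dr: absolute curvature-radius difference of the two local planes (increases);
    propagated_angle: normal-vector angle after the feature propagation layer (increases)."""
    return -dist + dot + dr + propagated_angle
```

The signs alone guarantee the stated monotonic behavior; in the disclosure the combination is produced by the geometric structure encoding layer.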
In some embodiments, for each reference center point, encoding the first feature encoding of the reference center point according to the correlation between the reference center point and the other reference center points to obtain the second feature encoding of the reference center point includes: for each reference center point, determining the second feature encoding of the reference center point based on the self-attention mechanism, according to the first feature encoding of the reference center point, the first feature encodings of the other reference center points, and the relative positional relationships between the reference center point and the other reference center points.
In some embodiments, determining the second feature encoding of each reference center point based on the self-attention mechanism includes: for each reference center point, inputting the first feature encoding and position information of the reference center point and the first feature encodings and position information of the other reference center points into the second self-attention module of the encoder in the second transformation model; in the second self-attention module, using the other reference center points as relative center points, and for each relative center point, inputting the position information of the reference center point and the position information of the relative center point into the fourth, fifth and sixth position encoding layers respectively to determine the fourth, fifth and sixth relative position encodings of the reference center point and the relative center point; determining the key vector and value vector of the relative center point from the products of the feature information of the relative center point with the key matrix and the value matrix in the second self-attention module respectively; determining the query vector of the reference center point from the product of the feature information of the reference center point and the query matrix in the second self-attention module; and determining the second feature encoding of the reference center point according to the fourth, fifth and sixth relative position encodings of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point.
In some embodiments, determining the second feature code of the reference center point according to the fourth, fifth and sixth relative position codes of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point includes: for each relative center point, taking the sum of the fourth relative position code of the reference center point and the relative center point and the query vector of the reference center point as a corrected query vector of the reference center point; taking the sum of the fifth relative position code of the reference center point and the relative center point and the key vector of the relative center point as a corrected key vector of the relative center point; taking the sum of the sixth relative position code of the reference center point and the relative center point and the value vector of the relative center point as a corrected value vector of the relative center point; inputting the product of the corrected query vector of the reference center point and the corrected key vector of the relative center point, together with the dimension of the first feature code of the reference center point, into a second normalization layer to obtain a weight of the relative center point; and performing a weighted summation of the corrected value vectors of the relative center points according to the weights of the relative center points to obtain the second feature code of the reference center point.
In some embodiments, the fourth, fifth and sixth position encoding layers are respectively a fourth feedforward network, a fifth feedforward network and a sixth feedforward network, and inputting the position information of the reference center point and the position information of the relative center point into the fourth, fifth and sixth position encoding layers respectively includes: inputting the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth feedforward network, the fifth feedforward network and the sixth feedforward network respectively.
In some embodiments, classifying the key points and determining the points classified as target centers as reference center points includes: for each key point, inputting the first feature code of the key point into a classification network to obtain a classification result of the key point, and determining whether the key point is a target-center point according to the classification result.
In some embodiments, the classification network is trained using the position information of the key points together with label information as training data, where, for each key point, if the key point lies within the bounding box of a target and is the point closest to the center of that target, the label information of the key point marks it as a target-center point.
In some embodiments, predicting the positions and categories of the targets in the point cloud data according to the second feature codes of the reference center points includes: inputting the second feature codes of the reference center points into a decoder in the second transformation model to obtain a feature vector of each reference center point; and inputting the feature vectors of the reference center points into a target detection network to obtain the position and category of each target in the point cloud data.
In some embodiments, for each key point, the other key points within the preset range of the key point are determined as follows: for each key point, the other key points are sorted in ascending order of their distance to the key point, and a preset number of other key points are selected from the front of the sorted order as the other key points within the preset range of the key point.
According to other embodiments of the present disclosure, an apparatus for detecting targets in point cloud data is provided, including: a feature extraction module configured to input point cloud data into a point cloud feature extraction network to obtain a plurality of key points in the point cloud data and feature information of each key point; a first encoding module configured to, for each key point, encode the feature information of the key point according to the correlation between the key point and other key points within a preset range of the key point to obtain a first feature code of the key point; a classification module configured to classify the key points and determine the points classified as target centers as reference center points; a second encoding module configured to, for each reference center point, encode the first feature code of the reference center point according to the correlation between the reference center point and the other reference center points to obtain a second feature code of the reference center point; and a target detection module configured to predict the position and category of each target in the point cloud data according to the second feature codes of the reference center points.
According to still other embodiments of the present disclosure, an apparatus for detecting targets in point cloud data is provided, including: a processor; and a memory coupled to the processor and storing instructions that, when executed by the processor, cause the processor to perform the method for detecting targets in point cloud data of any of the foregoing embodiments.
According to further embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, where the program, when executed by a processor, implements the steps of the method for detecting targets in point cloud data of any of the foregoing embodiments.
According to still further embodiments of the present disclosure, an article sorting apparatus is provided, including: the apparatus for detecting targets in point cloud data of any of the foregoing embodiments, and a sorting component; the sorting component is configured to sort targets according to the positions and categories of the targets in the point cloud data output by the detection apparatus.
In some embodiments, the apparatus further includes: a point cloud collection component configured to collect point cloud data of a preset area and send the point cloud data to the apparatus for detecting targets in point cloud data.
According to yet further embodiments of the present disclosure, a computer program is provided, including instructions that, when executed by a processor, cause the processor to perform the method for detecting targets in point cloud data of any of the foregoing embodiments.
Other features and advantages of the present disclosure will become apparent from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
Description of the Drawings
To explain the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present disclosure; for those of ordinary skill in the art, other drawings can be obtained from these drawings without inventive effort.
Figure 1 shows a schematic flowchart of a method for detecting targets in point cloud data according to some embodiments of the present disclosure.
Figure 2 shows a schematic diagram of a model for targets in point cloud data according to other embodiments of the present disclosure.
Figure 3 shows a schematic structural diagram of an apparatus for detecting targets in point cloud data according to some embodiments of the present disclosure.
Figure 4 shows a schematic structural diagram of an apparatus for detecting targets in point cloud data according to other embodiments of the present disclosure.
Figure 5 shows a schematic structural diagram of an apparatus for detecting targets in point cloud data according to further embodiments of the present disclosure.
Figure 6 shows a schematic structural diagram of an article sorting apparatus according to some embodiments of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present disclosure. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit the present disclosure or its application or uses. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present disclosure without inventive effort fall within the protection scope of the present disclosure.
The inventors found that, owing to the three-dimensional nature and irregularity of point clouds, they cannot be processed directly by powerful deep learning models such as convolutional neural networks; dedicated 3D feature learning techniques are therefore needed to recognize targets in point cloud data.
A technical problem addressed by the present disclosure is to propose a method for detecting targets in point cloud data that improves the accuracy of target detection in point cloud data.
The present disclosure provides a method for detecting targets in point cloud data, described below with reference to Figure 1.
Figure 1 is a flowchart of some embodiments of the method for detecting targets in point cloud data of the present disclosure. As shown in Figure 1, the method of these embodiments includes steps S102 to S110.
In step S102, point cloud data are input into a point cloud feature extraction network to obtain a plurality of key points in the point cloud data and feature information of each key point.
Given a point cloud of N points with XYZ coordinates as input, the point cloud feature extraction network can downsample the point cloud data and learn a deep feature for each point, outputting a subset of the points in which each point is represented by a C-dimensional feature (C being a positive integer); these points are treated as key points.
The point cloud feature extraction network may be, for example, VoxelNet, PointNet, PointNet++ or 3DSSD (not limited to these examples) and is used to extract the key points of the point cloud data and their feature information. For example, a PointNet++ network is used as the point cloud feature extraction network. Taking point cloud data containing N points as input and following an encoder-decoder structure, the input point cloud is first downsampled by a factor of 8 (i.e., to N/8 points) through four set-abstraction layers, and then upsampled to N/2 points through feature propagation layers, with each point represented by a C-dimensional feature. The set of key points may, for example, be expressed as {f_i}, where f_i is the feature information (a feature vector) of the i-th key point. The number of key points and the sampling scheme are not limited to the above example and are determined according to the model actually used and test results.
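As an illustration of this downsampling step, the sketch below selects a well-spread subset of key points with farthest point sampling, the sampling strategy used by PointNet++-style set-abstraction layers. The feature learning itself is replaced by a placeholder, and the sizes N and C are arbitrary example values, not values fixed by the present disclosure.

```python
import numpy as np

def farthest_point_sample(points, m):
    """Greedily pick m well-spread indices from an (n, 3) point array."""
    n = points.shape[0]
    chosen = np.zeros(m, dtype=int)            # start from index 0 (arbitrary)
    min_dist = np.full(n, np.inf)              # distance to nearest chosen point
    for i in range(1, m):
        d = np.linalg.norm(points - points[chosen[i - 1]], axis=1)
        min_dist = np.minimum(min_dist, d)
        chosen[i] = int(np.argmax(min_dist))   # point farthest from all chosen so far
    return chosen

N, C = 1024, 64                                # example sizes only
cloud = np.random.rand(N, 3)                   # N raw points with XYZ coordinates
idx = farthest_point_sample(cloud, N // 2)     # keep N/2 key points
key_xyz = cloud[idx]                           # (N/2, 3) key-point coordinates
key_feat = np.random.rand(N // 2, C)           # placeholder for learned C-dim features
```

In a real pipeline the placeholder features would come from the trained feature extraction network; only the sampling logic is shown here.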
In step S104, for each key point, the feature information of the key point is encoded according to the correlation between the key point and the other key points within a preset range of the key point, to obtain a first feature code of the key point.
For example, for each key point, the other key points are sorted in ascending order of their distance to the key point, and a preset number of them are selected from the front of the sorted order as the other key points within the preset range of the key point. For example, for each key point, the K key points closest to it are taken as the other key points within its corresponding preset range (local region).
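The neighbor selection described above can be sketched as a brute-force k-nearest-neighbor search over the key-point coordinates (the function name and array sizes are illustrative, not from the disclosure):

```python
import numpy as np

def knn_neighbors(xyz, k):
    """For each key point, return the indices of its k nearest other key points,
    sorted from closest to farthest."""
    diff = xyz[:, None, :] - xyz[None, :, :]      # (M, M, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)          # (M, M) pairwise distances
    np.fill_diagonal(dist, np.inf)                # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :k]        # (M, k) neighbor indices

xyz = np.random.rand(8, 3)                        # 8 key points, XYZ coordinates
neighbors = knn_neighbors(xyz, k=3)               # local region of each key point
```

For large clouds a spatial index (e.g., a KD-tree) would replace the O(M²) distance matrix, but the selection rule is the same.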
In some embodiments, for each key point, the first feature code of the key point is determined based on a self-attention mechanism according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, and the relative positional relationships between the key point and those other key points.
For each key point, the self-attention mechanism can determine how important each of the other key points within the preset range is, or how much it contributes, to the encoding of that key point. Describing a key point with the features of the other key points in its preset range when encoding it improves the accuracy of the key point's feature representation. In addition, introducing the relative positional relationships between the key point and the other key points within its preset range into the attention mechanism further improves the accuracy of the feature representation, and hence the accuracy of target detection.
Further, in some embodiments, for each key point, the feature information and position information of the key point and the feature information and position information of the other key points within the preset range of the key point are input into a first self-attention module of an encoder in a first transformation model. In the first self-attention module, the other key points within the preset range of the key point are taken as relative points. For each relative point, the position information of the key point and the position information of the relative point are input into a first position encoding layer, a second position encoding layer and a third position encoding layer respectively, to determine a first relative position code, a second relative position code and a third relative position code of the key point and the relative point. The key vector and value vector of the relative point are determined from the products of the feature information of the relative point with the key matrix and value matrix in the first self-attention module respectively; the query vector of the key point is determined from the product of the feature information of the key point with the query matrix in the first self-attention module. The first feature code of the key point is then determined from the first, second and third relative position codes of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
Further, in some embodiments, for each relative point: the sum of the first relative position code of the key point and the relative point and the query vector of the key point is taken as a corrected query vector of the key point; the sum of the second relative position code of the key point and the relative point and the key vector of the relative point is taken as a corrected key vector of the relative point; the sum of the third relative position code of the key point and the relative point and the value vector of the relative point is taken as a corrected value vector of the relative point; the product of the corrected query vector of the key point and the corrected key vector of the relative point, together with the dimension of the feature information of the key point, is input into a first normalization layer to obtain the weight of the relative point; and the corrected value vectors of the relative points are weighted and summed according to the weights of the relative points to obtain the first feature code of the key point.
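The corrected query/key/value computation described above can be sketched as a minimal single-head NumPy illustration. One linear position-encoding layer per path is assumed (as in formula (2)), weights are randomly initialized rather than trained, and all sizes and names are examples rather than values from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, C = 16, 4, 8                       # key points, neighbors, feature dim (examples)
feats = rng.standard_normal((M, C))      # key-point feature vectors f_i
xyz = rng.standard_normal((M, 3))        # key-point coordinates x_i
WQ, WK, WV = (rng.standard_normal((C, C)) for _ in range(3))
# one (W, b) position-encoding layer each for the query, key and value paths
pe = {p: (rng.standard_normal((3, C)), rng.standard_normal(C)) for p in 'qkv'}

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def local_self_attention(feats, xyz, nbrs):
    """For each key point, attend over its K nearest key points using
    position-corrected query, key and value vectors."""
    out = np.empty_like(feats)
    for i in range(len(feats)):
        j = nbrs[i]                                          # local-region indices
        rel = xyz[i] - xyz[j]                                # (K, 3) coordinate diffs
        q = feats[i] @ WQ + rel @ pe['q'][0] + pe['q'][1]    # corrected queries (K, C)
        k = feats[j] @ WK + rel @ pe['k'][0] + pe['k'][1]    # corrected keys    (K, C)
        v = feats[j] @ WV + rel @ pe['v'][0] + pe['v'][1]    # corrected values  (K, C)
        w = softmax((q * k).sum(axis=1) / np.sqrt(C))        # (K,) attention weights
        out[i] = w @ v                                       # weighted sum of values
    return out

# each point's local region: its K nearest other key points
d = np.linalg.norm(xyz[:, None] - xyz[None, :], axis=-1)
np.fill_diagonal(d, np.inf)
nbrs = np.argsort(d, axis=1)[:, :K]
first_codes = local_self_attention(feats, xyz, nbrs)         # (M, C) first feature codes
```

Note that the corrected query differs per neighbor, because the query-side position code depends on the pair (x_i, x_j).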
For example, the first transformation model is a Transformer model; since it is used to determine the correlations among points inside a target, it may be called the Local Transformer. The first transformation model may include the encoder part of a Transformer. In addition, the first, second and third position encoding layers are, for example, a first feedforward network (FFN), a second feedforward network and a third feedforward network respectively; for each key point and each of its relative points, the difference between the coordinates of the key point and the coordinates of the relative point is input into the first, second and third feedforward networks respectively, to determine the first, second and third relative position codes of the key point and the relative point. The first normalization layer is, for example, a softmax layer.
For example, let {f_i} and {x_i} respectively denote the feature vectors and position coordinates of the N/2 key points input to the encoder of the Local Transformer (each x_i is a coordinate vector). In the Local Transformer, for any key point x_i, the K key points closest to it are selected as its corresponding local region, and these points are input into a Local Transformer module to model all key points belonging to the same local region:

f'_i = Σ_{x_j∈N(x_i)} softmax_j( ((f_i·W_Q + PE_Q(x_i, x_j)) · (f_j·W_K + PE_K(x_i, x_j))) / √C ) (f_j·W_V + PE_V(x_i, x_j))    (1)
In formula (1), f'_i is the output of the first self-attention module of the encoder of the Local Transformer for key point x_i, and N(x_i) denotes its local region; the encoder may further include an FFN (feedforward neural network) after the first self-attention module. If no FFN follows the first self-attention module, f'_i represents the first feature code; otherwise, the output of f'_i after the FFN represents the first feature code. W_Q, W_K and W_V are the query matrix, key matrix and value matrix in the first self-attention module, C is the dimension of the feature information of the key points, and PE_Q, PE_K and PE_V denote the functions corresponding to the first, second and third position encoding layers respectively. PE(x_i, x_j) denotes the relative position encoding of x_i and x_j, which can be expressed by the following formula:

PE(x_i, x_j) = FFN(x_i − x_j) = (x_i − x_j)·W_PE + b_PE    (2)
W_PE and b_PE denote the parameters of the FFN (feedforward network); each of the position encoding functions PE_Q, PE_K and PE_V has its own W_PE and b_PE.
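Formula (2) amounts to a single linear layer applied to the coordinate difference. The sketch below shows one such position-encoding layer; the weights here are random stand-ins, whereas each of the six position encoding layers in the disclosure would carry its own learned W_PE and b_PE:

```python
import numpy as np

rng = np.random.default_rng(1)
C = 8                                    # feature dimension (example value)
W_PE = rng.standard_normal((3, C))       # learnable weight, maps R^3 -> R^C
b_PE = rng.standard_normal(C)            # learnable bias

def relative_position_encoding(xi, xj):
    """PE(x_i, x_j) = FFN(x_i - x_j) = (x_i - x_j) W_PE + b_PE, as in formula (2)."""
    return (xi - xj) @ W_PE + b_PE

code = relative_position_encoding(np.array([1.0, 2.0, 0.5]),
                                  np.array([0.0, 1.0, 0.5]))   # a C-dim encoding
```

Note that coincident points map to b_PE, so the encoding distinguishes directions and distances of displacement rather than absolute positions.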
A multi-head attention mechanism may be applied in the encoder. Each attention head determines an encoding of the key point with reference to formulas (1) and (2) above; the encodings from the attention heads are concatenated and multiplied by a preset matrix (the product optionally passing through a further FFN) to obtain the first feature code of the key point. The query matrix, key matrix, value matrix and the parameters of the first, second and third position encoding layers differ between attention heads.
The first feature code output by the Local Transformer thus contains the context information of the local region in which the key point lies, i.e., the correlations among the points inside a target.
In step S106, the key points are classified, and the points classified as target centers are determined as reference center points.
In some embodiments, for each key point, the first feature code of the key point is input into a classification network to obtain a classification result of the key point, and whether the key point is a target-center point is determined according to the classification result.
In some embodiments, the classification network is trained using the position information of the key points together with label information as training data, where, for each key point, if the key point lies within the bounding box of a target and is the point closest to the center of that target, the label information of the key point marks it as a target-center point.
The key points output by the first transformation model are dense, and not every point represents a separate target (object). To reduce the redundancy of the final detection result, all key points are filtered and only those located at target centers are retained; it is therefore necessary to judge whether each key point is a true target center. During training, each key point is assigned a label: if a key point lies within the bounding box of a target and is the point closest to the target's center, it is assigned a positive label; otherwise it is assigned a negative label. A binary classification network is trained on these labels. During testing, all key points are input into the binary classification network, and only the key points with a positive classification result are retained as reference center points.
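The training-label assignment described above can be sketched as follows. Axis-aligned boxes are assumed here for simplicity (the disclosure does not restrict the box parameterization), and all names are illustrative:

```python
import numpy as np

def label_key_points(key_xyz, boxes):
    """Assign each key point label 1 if it lies inside a target's bounding box and
    is the key point closest to that box's center, else 0.
    boxes: iterable of (min_xyz, max_xyz) pairs for axis-aligned boxes."""
    labels = np.zeros(len(key_xyz), dtype=int)
    for lo, hi in boxes:
        lo, hi = np.asarray(lo), np.asarray(hi)
        center = (lo + hi) / 2
        inside = np.all((key_xyz >= lo) & (key_xyz <= hi), axis=1)
        if not inside.any():
            continue                      # no key point falls inside this target
        dist = np.linalg.norm(key_xyz - center, axis=1)
        dist[~inside] = np.inf            # only in-box points are candidates
        labels[int(np.argmin(dist))] = 1  # closest in-box point gets the positive label
    return labels

key_xyz = np.array([[0.2, 0.2, 0.2], [0.45, 0.5, 0.5], [0.9, 0.9, 0.9]])
boxes = [((0.3, 0.3, 0.3), (0.7, 0.7, 0.7))]      # one target's bounding box
labels = label_key_points(key_xyz, boxes)          # only the in-box point is positive
```

At test time these labels are not needed; the trained binary classifier alone decides which key points become reference center points.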
In step S108, for each reference center point, the first feature code of the reference center point is encoded according to the correlation between the reference center point and the other reference center points, to obtain a second feature code of the reference center point.
In some embodiments, for each reference center point, the second feature code of the reference center point is determined based on a self-attention mechanism according to the first feature code of the reference center point, the first feature codes of the other reference center points, and the relative positional relationships between the reference center point and the other reference center points.
Further, in some embodiments, for each reference center point, the first feature code and position information of the reference center point and the first feature codes and position information of the other reference center points are input into a second self-attention module of an encoder in a second transformation model. In the second self-attention module, the other reference center points are taken as relative center points. For each relative center point, the position information of the reference center point and the position information of the relative center point are input into a fourth position encoding layer, a fifth position encoding layer and a sixth position encoding layer respectively, to determine a fourth, fifth and sixth relative position code of the reference center point and the relative center point. The key vector and value vector of the relative center point are determined from the products of the feature information of the relative center point with the key matrix and value matrix in the second self-attention module respectively; the query vector of the reference center point is determined from the product of the feature information of the reference center point with the query matrix in the second self-attention module. The second feature code of the reference center point is then determined from the fourth, fifth and sixth relative position codes of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point.
Further, in some embodiments, for each relative center point: the sum of the fourth relative position code of the reference center point and the relative center point and the query vector of the reference center point is taken as a corrected query vector of the reference center point; the sum of the fifth relative position code of the reference center point and the relative center point and the key vector of the relative center point is taken as a corrected key vector of the relative center point; the sum of the sixth relative position code of the reference center point and the relative center point and the value vector of the relative center point is taken as a corrected value vector of the relative center point; the product of the corrected query vector of the reference center point and the corrected key vector of the relative center point, together with the dimension of the first feature code of the reference center point, is input into a second normalization layer to obtain the weight of the relative center point; and the corrected value vectors of the relative center points are weighted and summed according to the weights of the relative center points to obtain the second feature code of the reference center point.
For example, the second conversion model is a Transformer model; since it is used to determine the correlations between targets, it may be referred to as the Global Transformer. The second conversion model may include the encoder and decoder parts of a Transformer. In addition, the fourth position encoding layer, the fifth position encoding layer and the sixth position encoding layer are a fourth feedforward network, a fifth feedforward network and a sixth feedforward network, respectively. For each reference center point and each corresponding relative center point, the difference between the coordinates of the reference center point and the coordinates of the relative center point is input into the fourth, fifth and sixth feedforward networks respectively, to determine the fourth relative position encoding, the fifth relative position encoding and the sixth relative position encoding between the reference center point and the relative center point. The second normalization layer is, for example, a softmax layer.
For example, M reference center points are selected from the key points, and the Global Transformer is intended to learn the correlations between these M different targets. Specifically, the M reference center points (for example, the feature set of the M reference center points) are input into the Global Transformer module to model the correlations between the different targets:
In formula (3), h_i is the output of the second self-attention module of the encoder of the Global Transformer. The encoder may further include an FFN after the second self-attention module; if no FFN follows the second self-attention module, h_i may represent the second feature encoding, otherwise the output of h_i after the FFN represents the second feature encoding. The remaining symbols in formula (3) are the query matrix, the key matrix and the value matrix in the second self-attention module; C is the dimension of the feature information of the key points; and the three position encoding functions respectively correspond to the fourth position encoding layer, the fifth position encoding layer and the sixth position encoding layer. Their specific form may refer to formula (2).
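Formula (3) itself is rendered as an image in the original publication and is not reproduced in this text. Based on the symbol descriptions above and the step-by-step computation in the preceding paragraphs, it plausibly takes the standard relative-position self-attention form (a hedged reconstruction, with $W_q, W_k, W_v$ the query, key and value matrices, $f_i$ the feature of the $i$-th reference center point, and $\delta^{q}_{i,j}, \delta^{k}_{i,j}, \delta^{v}_{i,j}$ the outputs of the fourth, fifth and sixth position encoding layers):

$$h_i=\sum_{j=1}^{M}\operatorname{softmax}_j\!\left(\frac{\left(W_q f_i+\delta^{q}_{i,j}\right)^{\top}\left(W_k f_j+\delta^{k}_{i,j}\right)}{\sqrt{C}}\right)\left(W_v f_j+\delta^{v}_{i,j}\right)$$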
A multi-head attention mechanism may likewise be applied in the encoder of the Global Transformer, and details are not repeated here. h_i is the output high-level feature of the i-th reference center, which contains both the correlations among points inside a target and the correlations between different targets.
In step S110, the position and category of each target in the point cloud data are predicted according to the second feature encoding of each reference center point.
In some embodiments, the second feature encoding of each reference center point is input into the decoder of the second conversion model to obtain a feature vector of each reference center point; the feature vectors of the reference center points are then input into a target detection network to obtain the position and category of each target in the point cloud data.
The target detection network is, for example, an FFN. The target detection network determines the position and category of each target in the point cloud data from the feature vectors, which contain both the correlations among points inside a target and the correlations between different targets.
In the above embodiments, key points of the point cloud data and the feature information of each key point are extracted. For each key point, a first feature encoding is determined according to the correlations between that key point and the other key points within a preset range of that key point; the first feature encoding reflects the point-to-point correlations within a local region inside a target. Further, the key points are divided into points at target centers and points not at target centers, and the points at target centers are taken as reference center points. For each reference center point, a second feature encoding is determined according to the correlations between that reference center point and the other reference center points; the second feature encoding adds the correlations between targets on top of the point-to-point correlations within a local region inside a target. The position and category of each target in the point cloud data are then predicted according to the second feature encodings of the reference center points. The solution of the above embodiments no longer models the correlations among all points in the point cloud; instead, it divides point-to-point correlations into intra-target correlations and inter-target correlations. It can thus capture local and global dependencies in the point cloud simultaneously, accommodates the three-dimensional characteristics and irregularity of point cloud data, and improves the accuracy of target detection in point cloud data, while also improving detection efficiency and saving computation cost.
To further improve the accuracy of target detection in point cloud data, the solutions of the above embodiments are further improved. The inventors further exploit the three-dimensional characteristics of point cloud data and introduce geometric structure features between points into the encoding process, so that the features of the point cloud data are learned more accurately, thereby improving the accuracy of target detection. Specific embodiments are described below.
With respect to step S104, in some embodiments, for each key point, the first feature encoding of that key point is determined based on a self-attention mechanism according to the feature information of that key point, the feature information of the other key points within the preset range of that key point, the relative positional relationships between that key point and the other key points within the preset range, and the relative geometric structure relationships between that key point and the other key points within the preset range.
Further, in some embodiments, for each key point, the feature information, position information and geometric structure information of that key point, and the feature information, position information and geometric structure information of the other key points within the preset range of that key point, are input into a first self-attention module of the encoder in the first conversion model. In the first self-attention module, the other key points within the preset range of that key point are taken as relative points. For each relative point, the position information of the key point and the position information of the relative point are input into a first position encoding layer, a second position encoding layer and a third position encoding layer respectively, to determine a first relative position encoding, a second relative position encoding and a third relative position encoding between the key point and the relative point. The geometric structure information of the key point and the geometric structure information of the relative point are input into a geometric structure encoding layer to determine a relative geometric structure weight between the key point and the relative point. The key vector and the value vector of the relative point are determined from the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module, respectively. The query vector of the key point is determined from the product of the feature information of the key point with the query matrix in the first self-attention module. The first feature encoding of the key point is then determined from the first, second and third relative position encodings and the relative geometric structure weight between the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
Further, in some embodiments, for each relative point: the sum of the first relative position encoding between the key point and the relative point and the query vector of the key point is taken as a modified query vector of the key point; the sum of the second relative position encoding and the key vector of the relative point is taken as a modified key vector of the relative point; and the sum of the third relative position encoding and the value vector of the relative point is taken as a modified value vector of the relative point. The product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight between the key point and the relative point, and the dimension of the feature information of the key point are input into a first normalization layer to obtain the weight of the relative point. The modified value vectors of the relative points are then weighted and summed according to their weights to obtain the first feature encoding of the key point.
Further, in some embodiments, the product of the modified query vector of the key point and the modified key vector of the relative point is divided by the square root of the dimension of the feature information of the key point, the relative geometric structure weight between the key point and the relative point is added to the result, and the sum is input into the first normalization layer to obtain the weight of the relative point.
For example, the geometric structure information includes at least one of: the normal vector of the local plane at the point, and the curvature radius of the local plane at the point. In some embodiments, the relative geometric structure weight between the key point and the relative point is determined according to at least one of: the distance between the key point and the relative point; the dot product of the normal vector of the local plane of the key point and the normal vector of the local plane of the relative point; the difference between the curvature radius of the local plane of the key point and the curvature radius of the local plane of the relative point; and the angle between the normal vector of the local plane of the key point and the normal vector of the local plane of the relative point.
Further, in some embodiments, the relative geometric structure weight between the key point and the relative point decreases as the distance between the key point and the relative point increases; increases as the dot product of the normal vector of the local plane of the key point and the normal vector of the local plane of the relative point increases; increases as the difference between the curvature radius of the local plane of the key point and the curvature radius of the local plane of the relative point increases; and increases as the result of passing the angle between the two normal vectors through a feature propagation layer increases.
For example, the first conversion model is a Transformer model, referred to as the Local Transformer. The first position encoding layer, the second position encoding layer and the third position encoding layer may be a first feedforward network (FFN), a second feedforward network and a third feedforward network, respectively, and the first normalization layer may be a softmax layer.
The relative geometric structure relationships between a key point and the other key points within its preset range may be incorporated into the attention mechanism; specifically, formula (1) may be improved as follows:
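Formula (4) is also rendered as an image in the original publication. Based on the description of the modified query, key and value vectors and of the normalization step above, it plausibly adds the geometric structure weight $G_{i,j}$ to the scaled attention score before the softmax (a hedged reconstruction, with $\mathcal{N}(i)$ the relative points within the preset range of key point $i$ and $\delta^{q}_{i,j}, \delta^{k}_{i,j}, \delta^{v}_{i,j}$ the first, second and third relative position encodings):

$$h_i=\sum_{j\in\mathcal{N}(i)}\operatorname{softmax}_j\!\left(\frac{\left(q_i+\delta^{q}_{i,j}\right)^{\top}\left(k_j+\delta^{k}_{i,j}\right)}{\sqrt{C}}+G_{i,j}\right)\left(v_j+\delta^{v}_{i,j}\right)$$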
In formula (4), G_{i,j} denotes the relative geometric structure weight between key points i and j, which may be determined by the following formula:
In formula (5), n_i and n_j denote the normal vectors of the local planes at key points i and j respectively, c_i and c_j denote the curvature radii of the local planes at key points i and j respectively, and the angle term denotes the angle between the normal vector of the local plane at key point i and the normal vector of the local plane at key point j. β1, β2 and β3 are parameters of the geometric structure encoding layer, and FFN is a feedforward neural network or feature propagation layer. G_{i,j} is a Gaussian function model that measures the strength of the correlation between two points from geometric parameters such as the local plane normal vectors, the local curvature radii, and the angle between the normal vectors: the stronger the correlation between two points, the larger the corresponding Gaussian weight G_{i,j}.
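Since formula (5) is an image in the original publication, its exact functional form is not reproduced here. The sketch below assumes one Gaussian-style form that matches only the stated monotonicity (decreasing in distance; increasing in the normal-vector dot product, the curvature-radius difference, and the FFN-transformed angle). The default β values and the identity stand-in for the feature propagation layer are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

def geometry_weight(p_i, p_j, n_i, n_j, c_i, c_j,
                    beta1=1.0, beta2=1.0, beta3=1.0,
                    ffn=lambda theta: theta):
    """One plausible Gaussian-style form of G_{i,j}.

    p_i, p_j: point coordinates; n_i, n_j: unit normals of the local planes;
    c_i, c_j: curvature radii of the local planes. `ffn` stands in for the
    learned feature propagation layer applied to the normal-vector angle.
    """
    dist = np.linalg.norm(p_i - p_j)             # larger distance -> smaller weight
    dot = float(n_i @ n_j)                       # aligned normals -> larger weight
    dcurv = c_i - c_j                            # curvature-radius difference
    theta = np.arccos(np.clip(dot, -1.0, 1.0))   # angle between the normals
    score = -beta1 * dist + beta2 * dot + beta3 * dcurv + ffn(theta)
    return float(np.exp(score))                  # Gaussian-style positive weight
```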
In some embodiments, for a key point, the N points closest to it in its neighborhood are found, and the least squares method is then used to find a plane that minimizes the sum of the distances from these N points when projected onto the plane; this plane is the local plane.
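The least-squares local plane described above has a standard closed-form solution via an eigendecomposition of the neighborhood covariance matrix: the eigenvector of the smallest eigenvalue is the plane normal. The sketch below illustrates that technique; the "surface variation" ratio is one common curvature proxy, not necessarily the curvature-radius definition used in the patent.

```python
import numpy as np

def local_plane(neighbors):
    """Fit the least-squares plane of an (N, 3) neighborhood.

    Returns (normal, surface_variation): the unit normal of the plane
    minimizing the squared point-to-plane distances, and a common
    curvature proxy (smallest eigenvalue over the eigenvalue sum).
    """
    centered = neighbors - neighbors.mean(axis=0)
    cov = centered.T @ centered / len(neighbors)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    normal = eigvecs[:, 0]                    # direction of least variance
    variation = eigvals[0] / eigvals.sum()    # ~0 for a perfectly planar patch
    return normal, variation
```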
In the method of the above embodiments, relative geometric structure weights are added to express the geometric structure relationships between points, and object geometric features such as local plane normal vectors, local curvature radii, and angles between normal vectors are integrated into the self-attention mechanism, so as to design an efficient feature extraction model and target detection model dedicated to processing point cloud data.
Some application examples of the present disclosure are described below with reference to Figure 2.
As shown in Figure 2, the point cloud data is input into a point cloud feature extraction network (Point Cloud Backbone) to obtain the feature information of the key points (Point Feature), and the feature information of the key points is then input into the Local-Global Transformer model. In the Local-Global Transformer model, the feature information and position information of each key point and of the other key points in its local region are input into the Local Transformer module, together with the geometric structure information of each key point, to obtain the first feature encodings. The key points are passed through a classification network (for example, including a Sampling/Pooling module) to select the reference center points; the first feature encoding and position information of each reference center point are input into the Global Transformer module to obtain the second feature encodings; and the second feature encodings are input into an FFN to obtain the bounding box and category of each target.
The solution of the above embodiments proposes an end-to-end 3D point cloud target detection network based on the Transformer model, which may be called 3DTrans. It takes a 3D point cloud as input and outputs a set of labeled 3D bounding boxes that represent the locations of the targets (objects). The overall structure of the 3DTrans detection network is shown in Figure 2 and contains two main components: the feature extraction network and the Local-Global Transformer. Given a point cloud of N points with XYZ coordinates as input, the feature extraction network downsamples the point cloud and learns a deep feature for each point, outputting a subset of points in which each point is represented by a C-dimensional feature; these points are treated as the key points. The Local-Global Transformer takes the features of these key points as input and outputs the final target detection result.
The traditional Transformer model is improved in two respects to make it better suited to processing 3D point cloud data. On the one hand, instead of directly modeling the correlations among all key points, point-to-point correlations are divided into intra-object correlations and inter-object correlations. Specifically, the Local Transformer module learns the point-to-point correlations within a local region inside the same object, and the Global Transformer module learns the correlations between different objects. By connecting the two modules in series, the Local-Global Transformer model reduces the computation cost while simultaneously capturing the local and global dependencies in the point cloud, thereby improving the learning and representation capability of the model. On the other hand, object geometric structure information is added on top of the traditional Transformer model: object geometric features such as local plane normal vectors, local curvature radii, and angles between normal vectors are integrated into the self-attention mechanism, so as to design an efficient Transformer model dedicated to processing point cloud data.
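The serial composition of the two modules can be summarized as a skeleton. Every callable below is a stand-in for a learned sub-network (backbone, Local Transformer, center classifier, Global Transformer, detection head); the names, shapes and the neighborhood size `k` are illustrative assumptions, not the patent's interfaces.

```python
import numpy as np

def detect_3dtrans(points, backbone, local_tf, center_classifier,
                   global_tf, head, k=16):
    """Skeleton of the 3DTrans pipeline (illustrative only)."""
    keypoints, feats = backbone(points)              # (K, 3) key points, (K, C) features
    local_feats = local_tf(keypoints, feats, k=k)    # first feature encodings (intra-object)
    is_center = center_classifier(local_feats)       # (K,) mask of target-center points
    centers = keypoints[is_center]
    center_feats = local_feats[is_center]
    global_feats = global_tf(centers, center_feats)  # second feature encodings (inter-object)
    return head(global_feats)                        # boxes and class scores per center
```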
The method of the present disclosure does not require a large number of hand-designed components, a large amount of prior knowledge, or extensive post-processing to filter out redundant candidate boxes. The model is simple and can be trained end-to-end, with low computation cost, high processing efficiency and high accuracy.
The model of the present disclosure can be trained end-to-end. Point cloud data images are annotated with the bounding box and category of each target and used as training samples. The training samples are input into the point cloud feature extraction network to obtain the multiple key points in the output point cloud data and the feature information of each key point. The feature information and position information of each key point are input into the first conversion model, and for each key point, the feature information of that key point is encoded according to the correlations between that key point and the other key points within its preset range to obtain the first feature encoding of that key point. The first feature encodings of the key points are input into the classification network to classify the key points and determine the points classified as target centers, which are taken as the reference center points. The feature information and position information of each reference center point are input into the second conversion model, and for each reference center point, the first feature encoding of that reference center point is encoded according to the correlations between that reference center point and the other reference center points to obtain the second feature encoding of that reference center point. The second feature encodings of the reference center points are input into the decoder of the second conversion model to obtain the feature vector of each reference center point, and the feature vectors are input into the target detection network to obtain the position and category of each target in the point cloud data. According to the differences between the obtained positions and categories of the targets and the annotated bounding boxes and categories, the point cloud feature extraction network, the first conversion model, the classification network, the second conversion model and the target detection network are trained. The point cloud feature extraction network and the classification network may be pre-trained. For specific details, reference may be made to the foregoing embodiments, which are not repeated here.
The present disclosure further proposes an apparatus for detecting targets in point cloud data, which is described below with reference to Figure 3.
Figure 3 is a structural diagram of some embodiments of the apparatus for detecting targets in point cloud data of the present disclosure. As shown in Figure 3, the apparatus 30 of this embodiment includes: a feature extraction module 310, a first encoding module 320, a classification module 330, a second encoding module 340, and a target detection module 350.
The feature extraction module 310 is configured to input point cloud data into a point cloud feature extraction network to obtain multiple key points in the output point cloud data and the feature information of each key point.
The first encoding module 320 is configured to, for each key point, encode the feature information of that key point according to the correlations between that key point and the other key points within the preset range of that key point, to obtain the first feature encoding of that key point.
The classification module 330 is configured to classify the key points and determine the points classified as target centers as the reference center points.
The second encoding module 340 is configured to, for each reference center point, encode the first feature encoding of that reference center point according to the correlations between that reference center point and the other reference center points, to obtain the second feature encoding of that reference center point.
The target detection module 350 is configured to predict the position and category of each target in the point cloud data according to the second feature encodings of the reference center points.
In some embodiments, the first encoding module 320 is configured to, for each key point, determine the first feature encoding of that key point based on a self-attention mechanism according to the feature information of that key point, the feature information of the other key points within the preset range of that key point, and the relative positional relationships between that key point and the other key points within the preset range.
In some embodiments, the first encoding module 320 is configured to, for each key point, input the feature information and position information of that key point and the feature information and position information of the other key points within the preset range of that key point into the first self-attention module of the encoder in the first conversion model; in the first self-attention module, take the other key points within the preset range of that key point as relative points, and for each relative point, input the position information of the key point and the position information of the relative point into the first position encoding layer, the second position encoding layer and the third position encoding layer respectively, to determine the first relative position encoding, the second relative position encoding and the third relative position encoding between the key point and the relative point; determine the key vector and the value vector of the relative point from the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module, respectively; determine the query vector of the key point from the product of the feature information of the key point with the query matrix in the first self-attention module; and determine the first feature encoding of the key point from the first, second and third relative position encodings between the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
In some embodiments, the first encoding module 320 is configured to, for each relative point, take the sum of the first relative position encoding between the key point and the relative point and the query vector of the key point as the modified query vector of the key point; take the sum of the second relative position encoding and the key vector of the relative point as the modified key vector of the relative point; take the sum of the third relative position encoding and the value vector of the relative point as the modified value vector of the relative point; input the product of the modified query vector of the key point and the modified key vector of the relative point, together with the dimension of the feature information of the key point, into the first normalization layer to obtain the weight of the relative point; and perform a weighted summation of the modified value vectors of the relative points according to their weights to obtain the first feature encoding of the key point.
In some embodiments, the first position encoding layer, the second position encoding layer and the third position encoding layer are a first feedforward network, a second feedforward network and a third feedforward network respectively, and the first encoding module 320 is configured to input the difference between the coordinates of the key point and the coordinates of the relative point into the first feedforward network, the second feedforward network and the third feedforward network respectively.
In some embodiments, the first encoding module 320 is configured to, for each key point, determine the first feature encoding of the key point based on a self-attention mechanism, according to the feature information of the key point, the feature information of the other key points within a preset range of the key point, the relative positional relationships between the key point and those other key points, and the relative geometric structure relationships between the key point and those other key points.
In some embodiments, the first encoding module 320 is configured to, for each key point, input the feature information, position information, and geometric structure information of the key point, and the feature information, position information, and geometric structure information of the other key points within the preset range of the key point, into the first self-attention module of the encoder in a first transformer model. In the first self-attention module, the other key points within the preset range of the key point are taken as relative points, and for each relative point: the position information of the key point and the position information of the relative point are input into a first position encoding layer, a second position encoding layer, and a third position encoding layer, respectively, to determine the first, second, and third relative position encodings of the key point and the relative point; the geometric structure information of the key point and the geometric structure information of the relative point are input into a geometric structure encoding layer to determine the relative geometric structure weight of the key point and the relative point; the key vector and value vector of the relative point are determined from the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module, respectively; and the query vector of the key point is determined from the product of the feature information of the key point with the query matrix in the first self-attention module. The first feature encoding of the key point is then determined from the first, second, and third relative position encodings and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
In some embodiments, the first encoding module 320 is configured to, for each relative point: take the sum of the first relative position encoding of the key point and the relative point and the query vector of the key point as a modified query vector of the key point; take the sum of the second relative position encoding of the key point and the relative point and the key vector of the relative point as a modified key vector of the relative point; take the sum of the third relative position encoding of the key point and the relative point and the value vector of the relative point as a modified value vector of the relative point; input the product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into the first normalization layer to obtain a weight of the relative point; and perform a weighted summation of the modified value vectors of the relative points according to their weights to obtain the first feature encoding of the key point.
In some embodiments, the first encoding module 320 is configured to divide the product of the modified query vector of the key point and the modified key vector of the relative point by the square root of the dimension of the feature information of the key point, add the relative geometric structure weight of the key point and the relative point, and input the result into the first normalization layer to obtain the weight of the relative point.
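The geometry-weighted attention score described in this embodiment can be written compactly: the scaled query-key product is offset by the relative geometric structure weight before normalization. A minimal sketch, with all names illustrative:

```python
import numpy as np

def geometry_weighted_attention(q_mod, k_mods, geom_weights, d):
    """Weights of the relative points: softmax over (q·k / sqrt(d) + geometry)."""
    scores = np.array([q_mod @ k for k in k_mods]) / np.sqrt(d) + geom_weights
    e = np.exp(scores - scores.max())        # first normalization layer (softmax)
    return e / e.sum()
```

With all geometry weights equal, this reduces to the plain scaled dot-product weighting of the earlier embodiment; a larger geometry weight biases attention toward geometrically similar relative points.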
In some embodiments, the geometric structure information includes at least one of the normal vector of the local plane in which a point lies and the curvature radius of that local plane, and the first encoding module 320 is configured to determine the relative geometric structure weight of the key point and the relative point according to at least one of: the distance between the key point and the relative point; the dot product of the normal vector of the local plane of the key point and the normal vector of the local plane of the relative point; the difference between the curvature radius of the local plane of the key point and the curvature radius of the local plane of the relative point; and the angle between the normal vector of the local plane of the key point and the normal vector of the local plane of the relative point.
In some embodiments, the relative geometric structure weight of the key point and the relative point: decreases as the distance between the key point and the relative point increases; increases as the dot product of the normal vector of the local plane of the key point and the normal vector of the local plane of the relative point increases; increases as the difference between the curvature radius of the local plane of the key point and the curvature radius of the local plane of the relative point increases; and increases as the angle between the normal vector of the local plane of the key point and the normal vector of the local plane of the relative point increases.
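One plausible scalar form combining the four quantities with the stated monotonicities is sketched below. The additive combination is an assumption for illustration only; the patent specifies the monotonic behavior but not the formula:

```python
import numpy as np

def relative_geometry_weight(p_i, p_j, n_i, n_j, r_i, r_j):
    """Toy relative geometric structure weight between two points.

    p_*: 3-D coordinates; n_*: unit normals of the local planes;
    r_*: curvature radii of the local planes.
    """
    dist = np.linalg.norm(p_i - p_j)           # weight falls as distance grows
    dot = float(n_i @ n_j)                     # weight grows with normal dot product
    dr = abs(r_i - r_j)                        # weight grows with curvature difference
    angle = np.arccos(np.clip(dot, -1.0, 1.0)) # weight grows with normal angle
    return -dist + dot + dr + angle
```

Each term is signed so that the weight moves in the direction the embodiment describes; any learned or hand-tuned coefficients on the terms would preserve the same monotonicities.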
In some embodiments, the second encoding module 340 is configured to, for each reference center point, determine the second feature encoding of the reference center point based on a self-attention mechanism, according to the first feature encoding of the reference center point, the first feature encodings of the other reference center points, and the relative positional relationships between the reference center point and the other reference center points.
In some embodiments, the second encoding module 340 is configured to, for each reference center point, input the first feature encoding and position information of the reference center point, and the first feature encodings and position information of the other reference center points, into the second self-attention module of the encoder in a second transformer model. In the second self-attention module, the other reference center points are taken as relative center points, and for each relative center point: the position information of the reference center point and the position information of the relative center point are input into a fourth position encoding layer, a fifth position encoding layer, and a sixth position encoding layer, respectively, to determine the fourth, fifth, and sixth relative position encodings of the reference center point and the relative center point; the key vector and value vector of the relative center point are determined from the products of the feature information of the relative center point with the key matrix and the value matrix in the second self-attention module, respectively; and the query vector of the reference center point is determined from the product of the feature information of the reference center point with the query matrix in the second self-attention module. The second feature encoding of the reference center point is then determined from the fourth, fifth, and sixth relative position encodings of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point.
In some embodiments, the second encoding module 340 is configured to, for each relative center point: take the sum of the fourth relative position encoding of the reference center point and the relative center point and the query vector of the reference center point as a modified query vector of the reference center point; take the sum of the fifth relative position encoding of the reference center point and the relative center point and the key vector of the relative center point as a modified key vector of the relative center point; take the sum of the sixth relative position encoding of the reference center point and the relative center point and the value vector of the relative center point as a modified value vector of the relative center point; input the product of the modified query vector of the reference center point and the modified key vector of the relative center point, together with the dimension of the first feature encoding of the reference center point, into a second normalization layer to obtain a weight of the relative center point; and perform a weighted summation of the modified value vectors of the relative center points according to their weights to obtain the second feature encoding of the reference center point.
In some embodiments, the fourth, fifth, and sixth position encoding layers are a fourth feedforward network, a fifth feedforward network, and a sixth feedforward network, respectively, and the second encoding module 340 is configured to input the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth, fifth, and sixth feedforward networks, respectively.
In some embodiments, the classification module 330 is configured to, for each key point, input the first feature encoding of the key point into a classification network to obtain a classification result for the key point, and to determine from the classification result whether the key point is a target-center point.
In some embodiments, the classification network is trained using the position information of the key points together with annotation information as training data, where, for each key point, if the key point lies within the bounding box of a target and is the point closest to the center of that target, the annotation information of the key point marks it as a target-center point.
In some embodiments, the target detection module 350 is configured to predict the position and category of each target in the point cloud data according to the second feature encodings of the reference center points by: inputting the second feature encodings of the reference center points into the decoder of the second transformer model to obtain a feature vector for each reference center point; and inputting the feature vectors of the reference center points into a target detection network to obtain the position and category of each target in the point cloud data.
In some embodiments, for each key point, the other key points within the preset range of the key point are determined as follows: the other key points are sorted in ascending order of their distance to the key point, and a preset number of them are selected from the front of the ordering as the other key points within the preset range of the key point.
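The sort-and-take-k selection above is an ordinary k-nearest-neighbor query. A minimal sketch, with illustrative names:

```python
import numpy as np

def k_nearest_keypoints(points, idx, k):
    """Return indices of the k key points nearest to points[idx] (excluding itself).

    points: (N, 3) array of key-point coordinates.
    """
    dists = np.linalg.norm(points - points[idx], axis=1)  # distance to every point
    order = np.argsort(dists)                             # ascending by distance
    return [j for j in order if j != idx][:k]             # drop self, keep first k
```

For large point sets, `np.argpartition` would avoid the full sort, but the behavior is the same.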
The apparatuses for detecting targets in point cloud data in the embodiments of the present disclosure may each be implemented by various computing devices or computer systems, as described below with reference to FIG. 4 and FIG. 5.
FIG. 4 is a structural diagram of some embodiments of the apparatus for detecting targets in point cloud data of the present disclosure. As shown in FIG. 4, the apparatus 40 of this embodiment includes a memory 410 and a processor 420 coupled to the memory 410, the processor 420 being configured to execute, based on instructions stored in the memory 410, the method for detecting targets in point cloud data of any of the embodiments of the present disclosure.
The memory 410 may include, for example, system memory, a fixed non-volatile storage medium, and the like. The system memory stores, for example, an operating system, application programs, a boot loader, a database, and other programs.
FIG. 5 is a structural diagram of other embodiments of the apparatus for detecting targets in point cloud data of the present disclosure. As shown in FIG. 5, the apparatus 50 of this embodiment includes a memory 510 and a processor 520, which are similar to the memory 410 and the processor 420, respectively. It may further include an input/output interface 530, a network interface 540, a storage interface 550, and the like. These interfaces 530, 540, 550, as well as the memory 510 and the processor 520, may be connected, for example, through a bus 560. The input/output interface 530 provides a connection interface for input/output devices such as a display, a mouse, a keyboard, and a touch screen. The network interface 540 provides a connection interface for various networked devices; for example, it may connect to a database server or a cloud storage server. The storage interface 550 provides a connection interface for external storage devices such as SD cards and USB flash drives.
The present disclosure further provides an article sorting apparatus, described below with reference to FIG. 6.
As shown in FIG. 6, the article sorting apparatus 6 includes the apparatus 30/40/50 for detecting targets in point cloud data of any of the foregoing embodiments, and a sorting component 62 configured to sort the articles corresponding to the targets according to the positions and categories of the targets in the point cloud data output by the detection apparatus 30/40/50.
In some embodiments, the apparatus 6 further includes a point cloud acquisition component 64 configured to acquire point cloud data of a preset area and send the point cloud data to the apparatus 30/40/50 for detecting targets in point cloud data.
The sorting component is, for example, a robotic arm, and the point cloud acquisition component is, for example, a three-dimensional camera.
The three-dimensional point cloud target detection technique proposed in the present disclosure can be applied to products such as vision-based sorting robotic arms in logistics scenarios: the point cloud data acquired by a three-dimensional camera mounted on the sorting robotic arm can be used to accurately locate and identify each article, helping the robotic arm sort the articles one by one.
The present disclosure further provides a computer program including instructions which, when executed by a processor, cause the processor to perform the method for detecting targets in point cloud data of any of the foregoing embodiments.
Those skilled in the art will appreciate that the embodiments of the present disclosure may be provided as a method, a system, or a computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present disclosure is described with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above are merely preferred embodiments of the present disclosure and are not intended to limit the present disclosure. Any modifications, equivalent replacements, improvements, and the like made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (25)

  1. A method for detecting targets in point cloud data, comprising:
    inputting point cloud data into a point cloud feature extraction network to obtain a plurality of key points in the point cloud data and feature information of each key point;
    for each key point, encoding the feature information of the key point according to the correlation between the key point and other key points within a preset range of the key point, to obtain a first feature encoding of the key point;
    classifying the key points and determining the points classified as target centers as reference center points;
    for each reference center point, encoding the first feature encoding of the reference center point according to the correlation between the reference center point and other reference center points, to obtain a second feature encoding of the reference center point; and
    predicting a position and a category of each target in the point cloud data according to the second feature encodings of the reference center points.
  2. The detection method according to claim 1, wherein, for each key point, encoding the feature information of the key point according to the correlation between the key point and the other key points within the preset range of the key point to obtain the first feature encoding of the key point comprises:
    for each key point, determining the first feature encoding of the key point based on a self-attention mechanism, according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, and relative positional relationships between the key point and the other key points within the preset range of the key point.
  3. The detection method according to claim 2, wherein determining the first feature encoding of the key point based on the self-attention mechanism, according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, and the relative positional relationships between the key point and the other key points within the preset range of the key point, comprises:
    for each key point, inputting the feature information and position information of the key point and the feature information and position information of the other key points within the preset range of the key point into a first self-attention module of an encoder in a first transformer model;
    in the first self-attention module, taking the other key points within the preset range of the key point as relative points, and, for each relative point, inputting the position information of the key point and the position information of the relative point into a first position encoding layer, a second position encoding layer, and a third position encoding layer, respectively, to determine a first relative position encoding, a second relative position encoding, and a third relative position encoding of the key point and the relative point;
    determining a key vector and a value vector of the relative point according to products of the feature information of the relative point with a key matrix and a value matrix in the first self-attention module, respectively;
    determining a query vector of the key point according to a product of the feature information of the key point with a query matrix in the first self-attention module; and
    determining the first feature encoding of the key point according to the first, second, and third relative position encodings of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
  4. The detection method according to claim 3, wherein determining the first feature encoding of the key point according to the first, second, and third relative position encodings of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point comprises:
    for each relative point, taking a sum of the first relative position encoding of the key point and the relative point and the query vector of the key point as a modified query vector of the key point;
    taking a sum of the second relative position encoding of the key point and the relative point and the key vector of the relative point as a modified key vector of the relative point;
    taking a sum of the third relative position encoding of the key point and the relative point and the value vector of the relative point as a modified value vector of the relative point;
    inputting a product of the modified query vector of the key point and the modified key vector of the relative point, together with a dimension of the feature information of the key point, into a first normalization layer to obtain a weight of the relative point; and
    performing a weighted summation of the modified value vectors of the relative points according to the weights of the relative points to obtain the first feature encoding of the key point.
  5. The detection method according to claim 3, wherein the first position encoding layer, the second position encoding layer, and the third position encoding layer are a first feedforward network, a second feedforward network, and a third feedforward network, respectively, and inputting the position information of the key point and the position information of the relative point into the first, second, and third position encoding layers respectively comprises:
    inputting a difference between coordinates of the key point and coordinates of the relative point into the first feedforward network, the second feedforward network, and the third feedforward network, respectively.
  6. The detection method according to claim 1, wherein, for each key point, encoding the feature information of the key point according to the correlation between the key point and the other key points within the preset range of the key point to obtain the first feature encoding of the key point comprises:
    for each key point, determining the first feature encoding of the key point based on a self-attention mechanism, according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, relative positional relationships between the key point and the other key points within the preset range of the key point, and relative geometric structure relationships between the key point and the other key points within the preset range of the key point.
  7. The detection method according to claim 6, wherein, for each key point, determining the first feature code of the key point based on a self-attention mechanism according to the feature information of the key point, the feature information of the other key points within the preset range of the key point, the relative positional relationships between the key point and the other key points within the preset range, and the relative geometric structure relationships between the key point and the other key points within the preset range comprises:
    for each key point, inputting the feature information, position information and geometric structure information of the key point, together with the feature information, position information and geometric structure information of the other key points within the preset range of the key point, into a first self-attention module of an encoder in a first transformer model;
    in the first self-attention module, taking the other key points within the preset range of the key point as relative points, and, for each relative point, inputting the position information of the key point and the position information of the relative point into a first position encoding layer, a second position encoding layer and a third position encoding layer respectively, to determine a first relative position code, a second relative position code and a third relative position code of the key point and the relative point;
    inputting the geometric structure information of the key point and the geometric structure information of the relative point into a geometric structure encoding layer to determine a relative geometric structure weight of the key point and the relative point;
    determining a key vector and a value vector of the relative point according to the products of the feature information of the relative point with the key matrix and the value matrix in the first self-attention module, respectively;
    determining a query vector of the key point according to the product of the feature information of the key point and the query matrix in the first self-attention module; and
    determining the first feature code of the key point according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point.
  8. The detection method according to claim 7, wherein determining the first feature code of the key point according to the first relative position code, the second relative position code, the third relative position code and the relative geometric structure weight of the key point and each relative point, the key vector of each relative point, the value vector of each relative point, and the query vector of the key point comprises:
    for each relative point, taking the sum of the first relative position code of the key point and the relative point and the query vector of the key point as a modified query vector of the key point;
    taking the sum of the second relative position code of the key point and the relative point and the key vector of the relative point as a modified key vector of the relative point;
    taking the sum of the third relative position code of the key point and the relative point and the value vector of the relative point as a modified value vector of the relative point;
    inputting the product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into a first normalization layer to obtain a weight of the relative point; and
    performing a weighted summation of the modified value vectors of the relative points according to the weights of the relative points to obtain the first feature code of the key point.
  9. The detection method according to claim 7, wherein inputting the product of the modified query vector of the key point and the modified key vector of the relative point, the relative geometric structure weight of the key point and the relative point, and the dimension of the feature information of the key point into the first normalization layer to obtain the weight of the relative point comprises:
    dividing the product of the modified query vector of the key point and the modified key vector of the relative point by the square root of the dimension of the feature information of the key point, adding the result to the relative geometric structure weight of the key point and the relative point, and inputting the sum into the first normalization layer to obtain the weight of the relative point.
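Claims 7-9 together describe one geometry-aware self-attention step. The following is a minimal illustrative sketch with toy dimensions; the field names `pe1`/`pe2`/`pe3`/`geo` are introduced here for clarity and are not terminology from the patent.

```python
import math

def softmax(xs):
    # Numerically stable softmax used as the normalization layer.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def first_feature_code(query, rel_points):
    """query: the key point's query vector.
    rel_points: per relative point, its key vector 'k', value vector 'v',
    the three relative position codes 'pe1'/'pe2'/'pe3', and the relative
    geometric structure weight 'geo'."""
    d = len(query)
    scores, v_mods = [], []
    for p in rel_points:
        q_mod = [a + b for a, b in zip(query, p["pe1"])]    # modified query vector
        k_mod = [a + b for a, b in zip(p["k"], p["pe2"])]   # modified key vector
        v_mod = [a + b for a, b in zip(p["v"], p["pe3"])]   # modified value vector
        dot = sum(a * b for a, b in zip(q_mod, k_mod))
        # scaled dot product plus geometric structure weight, then normalized
        scores.append(dot / math.sqrt(d) + p["geo"])
        v_mods.append(v_mod)
    weights = softmax(scores)
    # weighted sum of the modified value vectors gives the first feature code
    return [sum(w * v[i] for w, v in zip(weights, v_mods)) for i in range(d)]
```

With two identical relative points and zero position codes, each receives weight 0.5 and the output equals the shared value vector, which is a quick sanity check of the weighting.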
  10. The detection method according to claim 7, wherein the geometric structure information comprises at least one of a normal vector of the local plane where a point lies and a radius of curvature of the local plane where a point lies, and determining the relative geometric structure weight of the key point and the relative point comprises:
    determining the relative geometric structure weight of the key point and the relative point according to at least one of: the distance between the key point and the relative point; the dot product of the normal vector of the local plane where the key point lies and the normal vector of the local plane where the relative point lies; the difference between the radius of curvature of the local plane where the key point lies and the radius of curvature of the local plane where the relative point lies; and the angle between the normal vector of the local plane where the key point lies and the normal vector of the local plane where the relative point lies.
  11. The detection method according to claim 10, wherein:
    the relative geometric structure weight of the key point and the relative point decreases as the distance between the key point and the relative point increases;
    the relative geometric structure weight of the key point and the relative point increases as the dot product of the normal vector of the local plane where the key point lies and the normal vector of the local plane where the relative point lies increases;
    the relative geometric structure weight of the key point and the relative point increases as the difference between the radius of curvature of the local plane where the key point lies and the radius of curvature of the local plane where the relative point lies increases; and
    the relative geometric structure weight of the key point and the relative point increases as the result obtained by passing the angle between the normal vector of the local plane where the key point lies and the normal vector of the local plane where the relative point lies through a feature propagation layer increases.
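Claims 10-11 fix only the directions of influence, not a formula. One hypothetical combination obeying all four monotonicity constraints, with the learned feature propagation layer stubbed as an identity function, could look as follows:

```python
import math

def relative_geometry_weight(key_xyz, rel_xyz, n_key, n_rel, r_key, r_rel,
                             alpha=1.0, propagate=lambda a: a):
    """Hypothetical geometric structure weight: decreasing in distance,
    increasing in the normals' dot product, in the curvature-radius
    difference, and in the propagated angle. `propagate` stands in for
    the learned feature propagation layer of claim 11."""
    dist = math.dist(key_xyz, rel_xyz)
    ndot = sum(a * b for a, b in zip(n_key, n_rel))
    dr = abs(r_key - r_rel)
    # angle between the two local-plane normals (assumes unit normals)
    angle = math.acos(max(-1.0, min(1.0, ndot)))
    return -alpha * dist + ndot + dr + propagate(angle)
```

Note the dot-product term and the angle term pull in opposite directions for unit normals; the patent reconciles this by passing the angle through a learned layer, which the identity stub here merely placeholds.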
  12. The detection method according to any one of claims 1-11, wherein, for each reference center point, encoding the first feature code of the reference center point according to the correlation between the reference center point and other reference center points to obtain the second feature code of the reference center point comprises:
    for each reference center point, determining the second feature code of the reference center point based on a self-attention mechanism, according to the first feature code of the reference center point, the first feature codes of the other reference center points, and the relative positional relationships between the reference center point and the other reference center points.
  13. The detection method according to claim 12, wherein, for each reference center point, determining the second feature code of the reference center point based on a self-attention mechanism according to the first feature code of the reference center point, the first feature codes of the other reference center points, and the relative positional relationships between the reference center point and the other reference center points comprises:
    for each reference center point, inputting the first feature code and position information of the reference center point, together with the first feature codes and position information of the other reference center points, into a second self-attention module of an encoder in a second transformer model;
    in the second self-attention module, taking the other reference center points as relative center points, and, for each relative center point, inputting the position information of the reference center point and the position information of the relative center point into a fourth position encoding layer, a fifth position encoding layer and a sixth position encoding layer respectively, to determine a fourth relative position code, a fifth relative position code and a sixth relative position code of the reference center point and the relative center point;
    determining a key vector and a value vector of the relative center point according to the products of the feature information of the relative center point with the key matrix and the value matrix in the second self-attention module, respectively;
    determining a query vector of the reference center point according to the product of the feature information of the reference center point and the query matrix in the second self-attention module; and
    determining the second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point.
  14. The detection method according to claim 13, wherein determining the second feature code of the reference center point according to the fourth relative position code, the fifth relative position code and the sixth relative position code of the reference center point and each relative center point, the key vector of each relative center point, the value vector of each relative center point, and the query vector of the reference center point comprises:
    for each relative center point, taking the sum of the fourth relative position code of the reference center point and the relative center point and the query vector of the reference center point as a modified query vector of the reference center point;
    taking the sum of the fifth relative position code of the reference center point and the relative center point and the key vector of the relative center point as a modified key vector of the relative center point;
    taking the sum of the sixth relative position code of the reference center point and the relative center point and the value vector of the relative center point as a modified value vector of the relative center point;
    inputting the product of the modified query vector of the reference center point and the modified key vector of the relative center point, together with the dimension of the first feature code of the reference center point, into a second normalization layer to obtain a weight of the relative center point; and
    performing a weighted summation of the modified value vectors of the relative center points according to the weights of the relative center points to obtain the second feature code of the reference center point.
  15. The detection method according to claim 13, wherein the fourth relative position code, the fifth relative position code and the sixth relative position code are produced by a fourth feedforward network, a fifth feedforward network and a sixth feedforward network respectively, and inputting the position information of the reference center point and the position information of the relative center point into the fourth position encoding layer, the fifth position encoding layer and the sixth position encoding layer respectively comprises:
    inputting the difference between the coordinates of the reference center point and the coordinates of the relative center point into the fourth feedforward network, the fifth feedforward network and the sixth feedforward network, respectively.
  16. The detection method according to any one of claims 1-15, wherein classifying the key points and determining the points classified as target centers as reference center points comprises:
    for each key point, inputting the first feature code of the key point into a classification network to obtain a classification result for the key point; and
    determining, according to the classification result, whether the key point is a target-center point.
  17. The detection method according to claim 16, wherein the classification network is trained using the position information of key points carrying annotation information as training data, wherein, for each key point, in the case that the key point is located within the bounding box of a target and is the key point closest to the center of the target, the annotation information of the key point is a target-center point.
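The labelling rule of claim 17 can be sketched as follows. Axis-aligned 2D boxes are assumed purely for illustration; the patent operates on 3D bounding boxes, and the box format `(xmin, ymin, xmax, ymax)` is an assumption of this sketch.

```python
import math

def center_point_labels(key_points, boxes):
    """For each target box, label exactly the key point that lies inside the
    box and is closest to the box center as a target-center point (label 1);
    all other key points keep label 0."""
    labels = [0] * len(key_points)
    for (xmin, ymin, xmax, ymax) in boxes:
        center = ((xmin + xmax) / 2.0, (ymin + ymax) / 2.0)
        # key points falling inside this target's bounding box
        inside = [i for i, (x, y) in enumerate(key_points)
                  if xmin <= x <= xmax and ymin <= y <= ymax]
        if inside:
            # of those, only the one closest to the target center is positive
            best = min(inside, key=lambda i: math.dist(key_points[i], center))
            labels[best] = 1
    return labels
```

This yields at most one positive key point per target, which matches the claim's "closest to the target center" condition.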
  18. The detection method according to any one of claims 1-17, wherein predicting the positions and categories of the targets in the point cloud data according to the second feature codes of the reference center points comprises:
    inputting the second feature codes of the reference center points into a decoder of the second transformer model to obtain a feature vector of each reference center point; and
    inputting the feature vectors of the reference center points into a target detection network to obtain the positions and categories of the targets in the point cloud data.
  19. The detection method according to any one of claims 1-18, wherein, for each key point, the other key points within the preset range of the key point are determined as follows:
    for each key point, sorting the other key points in ascending order of their distance to the key point, and selecting a preset number of the other key points, in order from first to last, as the other key points within the preset range of the key point.
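The neighbour selection of claim 19 is a k-nearest-neighbour query over the key points; an illustrative sketch (brute-force, for clarity rather than efficiency):

```python
import math

def preset_range_neighbors(points, idx, k):
    """Rank the other key points by their distance to key point `idx`
    (ascending) and take the first k as its preset-range neighbours."""
    target = points[idx]
    others = [i for i in range(len(points)) if i != idx]
    others.sort(key=lambda i: math.dist(points[i], target))
    return others[:k]
```

In practice a spatial index (e.g. a KD-tree) would replace the full sort, but the claim only specifies the distance-ordered selection itself.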
  20. An apparatus for detecting targets in point cloud data, comprising:
    a feature extraction module configured to input point cloud data into a point cloud feature extraction network to obtain a plurality of key points in the point cloud data and feature information of each key point;
    a first encoding module configured to, for each key point, encode the feature information of the key point according to the correlation between the key point and other key points within a preset range of the key point to obtain a first feature code of the key point;
    a classification module configured to classify the key points and determine the points classified as target centers as reference center points;
    a second encoding module configured to, for each reference center point, encode the first feature code of the reference center point according to the correlation between the reference center point and other reference center points to obtain a second feature code of the reference center point; and
    a target detection module configured to predict the positions and categories of targets in the point cloud data according to the second feature codes of the reference center points.
  21. An apparatus for detecting targets in point cloud data, comprising:
    a processor; and
    a memory coupled to the processor and storing instructions which, when executed by the processor, cause the processor to perform the method for detecting targets in point cloud data according to any one of claims 1-19.
  22. A non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-19.
  23. An article sorting apparatus, comprising the apparatus for detecting targets in point cloud data according to claim 20 or 21, and a sorting component,
    wherein the sorting component is configured to sort targets according to the positions and categories of the targets in the point cloud data output by the apparatus for detecting targets in point cloud data.
  24. The article sorting apparatus according to claim 23, further comprising:
    a point cloud collection component configured to collect point cloud data of a preset area and send the point cloud data to the apparatus for detecting targets in point cloud data.
  25. A computer program, comprising instructions which, when executed by a processor, cause the processor to perform the method for detecting targets in point cloud data according to any one of claims 1-19.
PCT/CN2023/087273 2022-04-19 2023-04-10 Method and apparatus for detecting target in point cloud data, and computer-readable storage medium WO2023202401A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210409033.6A CN115018910A (en) 2022-04-19 2022-04-19 Method and device for detecting target in point cloud data and computer readable storage medium
CN202210409033.6 2022-04-19

Publications (1)

Publication Number Publication Date
WO2023202401A1 true WO2023202401A1 (en) 2023-10-26

Family

ID=83067520

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/087273 WO2023202401A1 (en) 2022-04-19 2023-04-10 Method and apparatus for detecting target in point cloud data, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN115018910A (en)
WO (1) WO2023202401A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115018910A (en) * 2022-04-19 2022-09-06 京东科技信息技术有限公司 Method and device for detecting target in point cloud data and computer readable storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2021046716A1 (en) * 2019-09-10 2021-03-18 深圳市大疆创新科技有限公司 Method, system and device for detecting target object and storage medium
WO2021164469A1 (en) * 2020-02-21 2021-08-26 北京市商汤科技开发有限公司 Target object detection method and apparatus, device, and storage medium
CN113988164A (en) * 2021-10-21 2022-01-28 电子科技大学 Representative point self-attention mechanism-oriented lightweight point cloud target detection method
CN114120270A (en) * 2021-11-08 2022-03-01 同济大学 Point cloud target detection method based on attention and sampling learning
CN115018910A (en) * 2022-04-19 2022-09-06 京东科技信息技术有限公司 Method and device for detecting target in point cloud data and computer readable storage medium

Also Published As

Publication number Publication date
CN115018910A (en) 2022-09-06
