CN113870160B - Point cloud data processing method based on transformer neural network - Google Patents


Info

Publication number
CN113870160B
CN113870160B CN202111060998.0A
Authority
CN
China
Prior art keywords
point cloud
data
task
model
attention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111060998.0A
Other languages
Chinese (zh)
Other versions
CN113870160A (en)
Inventor
王旭
曾宇乔
金�一
岑翼刚
孙宇霄
李浥东
郎丛妍
王涛
冯松鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jiaotong University
Original Assignee
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jiaotong University filed Critical Beijing Jiaotong University
Priority to CN202111060998.0A priority Critical patent/CN113870160B/en
Publication of CN113870160A publication Critical patent/CN113870160A/en
Application granted granted Critical
Publication of CN113870160B publication Critical patent/CN113870160B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds

Abstract

The invention provides a point cloud data processing method based on a transformer neural network. The method comprises the following steps: constructing a three-dimensional object symmetry detection model that acquires symmetry points of the input point cloud data by detecting object symmetry planes/axes, and converting the projection plane of the point cloud data through rotation-translation operations on the symmetric structure to obtain multiple groups of data-augmented point cloud image data; extracting global and local feature information of the augmented point cloud image data through a transformer network model to obtain downsampled point cloud data; and constructing a task-driven task network model for different target task requirements and inputting the downsampled point cloud data into the task network model to obtain the target task result. The invention effectively combines the three-dimensional object symmetry detection model with the transformer network model, improving the robustness of the downsampling model while minimizing the precision loss of the target task, and thereby improves both the downsampling scale and the precision of the target task.

Description

Point cloud data processing method based on transformer neural network
Technical Field
The invention relates to the technical field of point cloud data downsampling, in particular to a point cloud data processing method based on a transformer neural network.
Background
The transformer is a deep learning framework proposed in 2017 by the Google machine translation team in the paper "Attention Is All You Need". In deep learning, the transformer has an encoder-decoder structure comprising three main modules: an input data embedding module (input embedding), a position encoding module (positional encoding), and a self-attention module (self-attention).
The point cloud data in a rail transit system is a set of vectors in a three-dimensional coordinate system acquired by three-dimensional acquisition equipment such as lidar or stereo cameras; each point contains three-dimensional coordinates, and some points also contain information such as color, depth, and reflection intensity.
The point cloud data acquired in a rail transit system is often large in scale; for example, a single point cloud image can contain hundreds of thousands to millions of points, yet constraints such as processing time and energy consumption make it difficult for existing embedded devices to operate directly on such large-scale data. Meanwhile, under the influence of weather, road bumps, illumination changes and the like, the point cloud data often contains a large number of noise points that can seriously degrade data accuracy, reducing the precision of autonomous driving analysis systems that depend on large-scale data. Therefore, a practical point cloud data processing system often includes a point cloud downsampling operation, that is, the removal of noise points and redundant points from the point cloud data.
Data augmentation covers a series of techniques for expanding existing training samples, largely divided into two categories: traditional methods such as random scaling, rotation, jittering, and translation, and deep learning-based methods such as learned training-sample migration transformation and component recombination. The aim of applying data augmentation is to enlarge the number of training samples and increase the generalization ability of the neural network model.
With the development of three-dimensional sensing and sampling technologies such as lidar, three-dimensional sensors play an increasingly important role in computer vision, especially in autonomous driving and environmental perception. Classifying or segmenting objects or scenes described by three-dimensional point clouds with deep neural networks has become a hot research topic in the field. In autonomous driving, for example, vehicles are typically equipped with multiple three-dimensional sensors in a 360° capture configuration to ensure that enough redundant information is collected for the deep neural network to be accurate and robust. However, vision tasks represented by autonomous driving place high demands on response time, and large amounts of raw point cloud data are difficult to use directly; three-dimensional point cloud data must often be downsampled to reduce data size and remove redundancy and noise points, thereby speeding up computation and reducing power consumption.
Currently, downsampling methods in the prior art are mainly divided into traditional methods and deep learning methods. Traditional downsampling is represented by farthest point sampling and random sampling. Farthest point sampling starts from a chosen sampling point and repeatedly selects the point with the greatest Euclidean distance from the already selected points as the next sampling point, until a total of K sampling points have been selected. Random sampling simply draws sample points at random from the original data, applying no prior knowledge in the sampling strategy. Although traditional methods can effectively reduce the scale of the point cloud data and preserve model output precision to a certain extent, such non-task-driven downsampling is difficult to connect with a subsequent task network and ignores task requirements during downsampling, so it often yields suboptimal sampling results and struggles to maintain target task precision while reducing the input data scale.
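The two conventional strategies just described can be sketched in a few lines of NumPy; the function and parameter names below are our own illustration, not from the patent.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Repeatedly pick the point farthest (in Euclidean distance) from the
    already-selected set, until k sampling points have been chosen."""
    n = points.shape[0]
    chosen = [start]
    nearest = np.full(n, np.inf)          # distance of each point to the selected set
    for _ in range(k - 1):
        d = np.linalg.norm(points - points[chosen[-1]], axis=1)
        nearest = np.minimum(nearest, d)
        chosen.append(int(nearest.argmax()))
    return np.asarray(chosen)

def random_sampling(points, k, rng=None):
    """Draw k sample indices uniformly, with no task-driven prior."""
    rng = np.random.default_rng(rng)
    return rng.choice(points.shape[0], size=k, replace=False)

pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [5.0, 0, 0], [10.0, 0, 0]])
print(farthest_point_sampling(pts, 3))    # start point, then the two far extremes
```

Note how farthest point sampling spreads samples across the cloud regardless of which points the downstream task actually needs, which is exactly the limitation criticized above.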
Downsampling methods based on deep learning have also been proposed in recent years. One currently popular approach utilizes pooling operations. Pooling is a common operation in convolutional neural networks that imitates the human visual system's dimensionality reduction of data; it is usually applied after a convolutional layer to reduce the feature dimension of the layer's output and thereby effectively reduce network parameters. Prior work applies the max-pooling idea: during network learning, the point with the largest feature attribute value in the point cloud is treated as a key point and retained, completing the downsampling operation. Another main downsampling technique uses an adaptive weighting mechanism: first sample with the farthest point sampling algorithm, then obtain the neighborhood points of each sampling point with the K-nearest-neighbor algorithm, learn over the neighborhood points with a fully connected layer, adaptively weight the K neighborhood points with the learned weights, and finally take the weighted average as the new key point. Although these methods consider the requirements of the target task to some extent, they inevitably degrade model performance. Meanwhile, different deep learning framework structures also affect the point cloud data.
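The adaptive-weighting pipeline above (farthest-point seeds, K nearest neighbours, learned weights, weighted average) can be sketched as follows; the weights would come from a trained fully connected layer, so `weight_fn` here is a hypothetical stand-in.

```python
import numpy as np

def knn_indices(points, query, k):
    """Indices of the k points nearest to `query` (brute force)."""
    return np.argsort(np.linalg.norm(points - query, axis=1))[:k]

def adaptive_weighted_sampling(points, seed_idx, k, weight_fn):
    """For each seed point, weight its k nearest neighbours and emit the
    weighted mean as the new key point."""
    new_points = []
    for i in seed_idx:
        nbr = knn_indices(points, points[i], k)
        w = np.asarray(weight_fn(points[nbr]), dtype=float)
        w = w / w.sum()                       # normalise the (learned) weights
        new_points.append(w @ points[nbr])    # weighted-average key point
    return np.stack(new_points)

pts = np.array([[0.0, 0, 0], [2.0, 0, 0], [1.0, 0, 0], [9.0, 9, 9]])
uniform = lambda nbrs: np.ones(len(nbrs))     # stand-in for the learned layer
print(adaptive_weighted_sampling(pts, [0], 3, uniform))  # centroid of the 3 NN
```

With uniform weights this degenerates to a neighbourhood centroid; the learned weights are what let the sampler emphasise task-relevant neighbours.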
Point cloud learning networks based on convolution operations generally need to voxelize the point cloud into a three-dimensional grid so that a three-dimensional convolutional neural network can be used for object learning; the drawback is that computational cost and storage requirements grow cubically as precision increases, and the sparse spatial structure of the point cloud is destroyed during voxelization. Point-based deep learning frameworks such as shared fully connected networks creatively combine multi-layer perceptrons with the max-pooling operation, effectively reducing the computation and storage cost of the neural network; however, the input layer reorders the point cloud, destroying its original spatial distribution characteristics, and the matrix multiplication performed in the hidden layers merely maps the original features to other dimensions without effectively considering the spatial structure information of the point cloud.
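The voxelization step whose cubic cost the paragraph above criticizes amounts to quantizing each point to a grid cell; a minimal centroid-per-voxel sketch (names are ours) makes both the mechanism and the information loss concrete:

```python
import numpy as np

def voxel_downsample(points, voxel_size):
    """Quantise points into a 3-D grid and keep one centroid per occupied
    voxel. Grid memory grows cubically with resolution, and the sparse
    structure of the original cloud is flattened away."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    counts = np.bincount(inverse).astype(float)
    centroids = np.stack(
        [np.bincount(inverse, weights=points[:, d]) / counts
         for d in range(points.shape[1])],
        axis=1,
    )
    return centroids

pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.2, 0.2], [5.1, 5.1, 5.1]])
print(voxel_downsample(pts, 1.0))   # two occupied voxels -> two centroids
```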
In summary, prior art point cloud downsampling methods have the following disadvantages: they do not incorporate a transformer network framework into the design of the depth model, and they do not reduce the problem of trading off downsampling scale against target task accuracy to a task-driven point cloud self-attention metric learning problem.
Deep learning is data driven, and a large number of diverse training samples are needed to improve the precision of a deep network model. For traditional two-dimensional vision tasks, public datasets are huge in scale and high in quality; for example, ImageNet contains up to 22,000 categories and more than 15 million manually annotated images, at least 1 million of which carry bounding boxes for target objects, which greatly facilitates vision tasks such as object classification and target detection on two-dimensional images and lets researchers explore more with massive high-quality data. However, existing public three-dimensional point cloud datasets are much smaller, which hampers the training of depth models: the commonly used Sydney Urban Objects Dataset contains 631 labeled objects, the RGB-D Object Dataset contains 300 objects in 51 classes, and the New York University depth dataset (NYU-Depth) contains 2,347 labeled frames and 108,617 unlabeled frames. Existing public datasets are thus extremely limited in size.
Disclosure of Invention
The embodiment of the invention provides a point cloud data processing method based on a transformer neural network, which aims to realize the trade-off between the down-sampling scale of point cloud data and the accuracy of a point cloud target task.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A point cloud data processing method based on a transformer neural network comprises the following steps:
Step S1, constructing a three-dimensional object symmetry detection model, which acquires symmetry points of the input point cloud data by detecting object symmetry planes/axes and uses the symmetry points to convert the projection plane of the point cloud data through rotation-translation operations on the symmetric structure, obtaining multiple groups of data-augmented point cloud image data;
Step S2, constructing a transformer network model, extracting global and local feature information of the multiple groups of data-augmented point cloud image data with the transformer network model, acquiring importance degree information for each point in the point cloud data, and learning the downsampled point cloud data;
Step S3, constructing a task-driven task network model for different target task requirements, inputting the downsampled point cloud data into the task network model, and having the task network model perform target task learning to output the target task result.
Preferably, the step S1 specifically includes:
constructing a neural-network-based self-attention mechanism module, collecting and labeling training samples carrying symmetry information, and training the self-attention mechanism module with these samples;
connecting a plurality of three-dimensional object symmetry detection models in parallel by introducing a diversity loss function, thereby constructing a shared self-attention module in which the different self-attention submodules attend to different targets;
constructing the three-dimensional object symmetry detection model on the basis of the shared self-attention module, and inputting the original point cloud data P ∈ R^{3+f} into it; under the constraint of the diversity loss function L_var, each self-attention model learns the feature information of one symmetry plane of the original point cloud data; the original point cloud data P ∈ R^{3+f} is then concatenated with all of the feature information and the concatenation result is fed into a shared fully connected network, so that multiple groups of rotation-translation matrices of the point cloud image are learned simultaneously, where f denotes the feature information other than the three-dimensional coordinates in the point cloud data and the concatenation operation is expressed as:
F_output = concat(f_i^1, f_i^2, …, f_i^9, f_i^10, P)
multiplying the original point cloud data by the multiple groups of learned rotation-translation matrices to obtain the data-augmented point cloud image data under new coordinates, in which the multiple groups of projection planes form symmetric structures.
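The final augmentation step above is an ordinary rigid-transform multiplication; a minimal sketch, with hand-built 4×4 homogeneous matrices standing in for the learned rotation-translation matrices:

```python
import numpy as np

def apply_rigid_transforms(points, transforms):
    """points: (n, 3); transforms: list of (4, 4) homogeneous matrices.
    Returns one transformed copy of the cloud per matrix."""
    homo = np.hstack([points, np.ones((points.shape[0], 1))])
    return [(homo @ T.T)[:, :3] for T in transforms]

def rot_z(theta, t=(0.0, 0.0, 0.0)):
    """Rotation about z by theta plus translation t (illustrative only)."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:3, 3] = t
    return T

pts = np.array([[1.0, 0.0, 0.0]])
out = apply_rigid_transforms(pts, [rot_z(np.pi / 2), rot_z(0.0, t=(0, 0, 1))])
print(np.round(out[0], 6), out[1])
```

Each learned matrix yields one augmented copy of the cloud, which is how a single input sample becomes multiple training samples.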
Preferably, for each point in the three-dimensional coordinates of the point cloud data, the self-attention mechanism creates three vectors: a query vector Q, a key vector K, and a value vector V, and scores the importance of each input point's semantic association with the point cloud by computing the product of Q and K;
the self-attention mechanism is formalized as:
y_i = θ(ρ(γ(Q, K)), V)
where y_i is the new output feature generated by the self-attention module; β and α denote point-wise feature transformation operations; the three vectors Q, K, and V are obtained by multiplying the point embedding vectors by three feature transformation matrices trained within the point-wise neural network; γ and θ are matrix functions, where γ denotes the multiplication operation computed between Q and K and θ denotes applying the importance score matrix to the value vectors of the originally input point cloud set; and ρ denotes the normalization function.
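One possible concrete reading of this formalization — assuming γ is the Q–K product, ρ is a softmax normalization, and θ applies the resulting importance-score matrix to the value vectors — is the standard scaled dot-product form; the random projection matrices below are stand-ins for trained ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def point_self_attention(P, Wq, Wk, Wv):
    """P: (n, d) per-point features; returns (n, d) attended features."""
    Q, K, V = P @ Wq, P @ Wk, P @ Wv
    scores = softmax(Q @ K.T / np.sqrt(K.shape[1]))  # rho(gamma(Q, K))
    return scores @ V                                # theta(scores, V)

rng = np.random.default_rng(0)
P = rng.normal(size=(6, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
Y = point_self_attention(P, Wq, Wk, Wv)
print(Y.shape)   # one new feature vector per input point
```

Because the score matrix is computed over all point pairs, the output is invariant to the ordering of the input points up to the same permutation, which is the property the detailed description later relies on.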
Preferably, the diversity loss function L_var is expressed as follows:
where i indexes the different point cloud samples, w denotes the learned attention weights, and p and q denote two different self-attention models within the same shared attention module.
Preferably, the shared fully connected network consists of a cascade of three parts — a multi-layer perceptron, a batch normalization function, and a linear rectification (ReLU) function — and is expressed as:
F_output = ReLU(BN(MLP(F_in)))
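The three-part cascade above can be sketched per point as follows; the weight, bias, and batch-norm parameters are illustrative stand-ins for trained values:

```python
import numpy as np

def shared_fc(F_in, W, b, gamma, beta, eps=1e-5):
    """F_output = ReLU(BN(MLP(F_in))) for F_in of shape (n_points, d_in)."""
    h = F_in @ W + b                                     # MLP shared across points
    mu, var = h.mean(axis=0), h.var(axis=0)
    h_bn = gamma * (h - mu) / np.sqrt(var + eps) + beta  # batch normalisation
    return np.maximum(h_bn, 0.0)                         # linear rectification

rng = np.random.default_rng(1)
F_in = rng.normal(size=(16, 3))
W, b = rng.normal(size=(3, 8)), np.zeros(8)
out = shared_fc(F_in, W, b, gamma=np.ones(8), beta=np.zeros(8))
print(out.shape)   # (16, 8), all entries non-negative
```

"Shared" here means the same W and b are applied to every point, which is what keeps the layer independent of point ordering.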
Preferably, the step S2 specifically includes:
constructing a transformer network model comprising an input embedding module, a position encoding module, and a self-attention module; training the transformer network model with a loss function; combining the input embedding module with the position encoding module; and using the combined modules to model the spatial distribution of the multiple groups of data-augmented point cloud image data according to the natural position coordinate information of the three-dimensional point cloud;
analyzing the multiple groups of data-augmented point cloud image data with the self-attention model, based on their spatial distribution model, to extract the global feature information of the data-augmented point cloud image data;
constructing a local feature extraction unit comprising a sampling-and-grouping layer and a convolutional layer; building a plurality of point cloud subsets of the multiple groups of data-augmented point cloud image data through the sampling-and-grouping layer; and extracting features from the point cloud subsets with the convolutional neural network layer to obtain fine-grained local features of the data-augmented point cloud image data;
fusing, in the self-attention module, the global and local feature information of the multiple groups of data-augmented point cloud image data, and selecting the set of three-dimensional points that contributes most to the task discrimination accuracy of the task network, obtaining the downsampled point cloud data.
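The selection step at the end of S2 reduces, once per-point importance scores exist, to a straightforward top-k; in the patent the scores come from the fused global and local attention features, which this sketch only assumes as a given array:

```python
import numpy as np

def importance_downsample(points, scores, s):
    """Keep the s points with the largest importance scores."""
    keep = np.argsort(scores)[::-1][:s]   # indices of the s most important points
    return points[keep], keep

pts = np.arange(12, dtype=float).reshape(4, 3)
scores = np.array([0.1, 0.9, 0.4, 0.7])   # hypothetical learned importances
sampled, kept = importance_downsample(pts, scores, 2)
print(kept)   # indices of the two highest-scoring points
```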
Preferably, training the transformer network model with a loss function comprises:
for an input point cloud P = {p_i ∈ R^{3+f}, i = 1, 2, …, n} containing n points, the training goal of the transformer network is to learn a subset P_s of s points, with s < n, that minimizes the task sampling loss L; the objective function L is expressed as:
where t_i denotes the ground-truth value; to meet the requirements of the objective function L, a sampling regularization loss function L_sampling is introduced, of the following specific form:
where L_f and L_m denote the average and maximum nearest-neighbor losses, respectively, and L_b denotes the neighbor point matching loss.
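The average and maximum neighbor losses named above can be sketched as the mean and maximum nearest-neighbor distance from the sampled subset back to the original cloud; the matching term L_b and the exact weighting are not specified in the excerpt, so this illustrates only the L_f / L_m part under that assumption:

```python
import numpy as np

def neighbor_losses(P, Ps):
    """P: (n, 3) original cloud; Ps: (s, 3) sampled subset.
    Returns (L_f, L_m): mean and max distance from each sample
    to its nearest point in the original cloud."""
    d = np.linalg.norm(Ps[:, None, :] - P[None, :, :], axis=-1)  # (s, n) pairwise
    nn = d.min(axis=1)            # nearest-neighbour distance per sample
    return nn.mean(), nn.max()    # L_f, L_m

P = np.array([[0.0, 0, 0], [1.0, 0, 0], [2.0, 0, 0]])
Ps = np.array([[0.1, 0, 0], [1.9, 0, 0]])
print(neighbor_losses(P, Ps))
```

Both terms are zero exactly when every sampled point coincides with an original point, which is why they regularize the learned subset toward the input cloud.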
Preferably, the step S3 specifically includes:
constructing a task-driven task network model based on the transformer neural network, inputting the downsampled point cloud data into the task network model, designing the three-dimensional object symmetry detection model and the transformer network model around the task network model, and designing an end-to-end loss function expressed as:
L_total(P, P_s) = α·L_var(P) + β·L_sampling(P, P_s) + L_task(P_s)
where α and β denote weights.
The end-to-end loss function serves as the training loss function of the three-dimensional object symmetry detection model and the transformer network model, and the weight parameters in both models are updated through the neural network's inherent back-propagation algorithm, continuously optimizing the output precision of the three-dimensional object symmetry detection model, the transformer network model, and the task network model;
downsampling the input point cloud data through the finally optimized symmetry detection model, transformer network model, and task network model, mapping the downsampled point cloud data to a feature space, and learning the point cloud input features in that feature space through a shared fully connected layer to obtain the output result of the target task.
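The end-to-end objective from step S3 is a plain weighted sum of the three terms; the component losses below are plain numbers standing in for the network-computed values:

```python
def total_loss(l_var, l_sampling, l_task, alpha=0.5, beta=0.5):
    """L_total = alpha * L_var + beta * L_sampling + L_task.
    alpha and beta weight the diversity and sampling-regularisation
    terms against the task loss."""
    return alpha * l_var + beta * l_sampling + l_task

print(total_loss(0.2, 0.4, 1.0))   # weighted sum of the three loss terms
```

Because all three terms feed one scalar, gradients from the task network flow back through both the sampler and the symmetry detector, which is what makes the framework end-to-end trainable.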
According to the technical scheme provided by the embodiment of the invention, a task-driven robust point cloud data downsampling framework based on a transformer neural network is provided. The framework effectively combines the three-dimensional object symmetry detection model with the transformer network model and cascades the target task network to form an end-to-end deep learning model, finally reaching a balance point between the downsampling scale of the point cloud data and the accuracy of the point cloud target task; the method can thus improve the robustness of the downsampling model while minimizing the precision loss of the target task, achieving a bidirectional improvement of downsampling scale and target task accuracy.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a process flow diagram of a method for robust downsampling of point cloud data under task driving based on a transformer neural network according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating an exemplary embodiment of a self-attention model according to an embodiment of the present invention;
FIG. 3 is a specific exemplary structural diagram of a three-dimensional object symmetry detection model according to an embodiment of the present invention;
FIG. 4 is a specific exemplary block diagram of a local feature extraction module according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating an exemplary architecture of a task-driven transformer network model according to an embodiment of the present invention;
fig. 6 is a point cloud diagram after training samples and corresponding downsampling according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present invention and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the purpose of facilitating an understanding of the embodiments of the invention, several specific embodiments are further explained below with reference to the accompanying drawings, and the drawings in no way limit the embodiments of the invention.
Unmanned-driving application scenarios place higher demands on the reliability of three-dimensional point cloud analysis and processing; for example, classification or identification networks need sufficiently high accuracy to make reliable judgments. However, distant or small-volume objects in a three-dimensional point cloud are often represented by sparse points, and their features are usually lost as layers deepen during neural network feature extraction, increasing the model's missed-detection and false-detection rates. To address this, the embodiment of the invention introduces a local feature extraction module based on a convolutional neural network framework, extracts the detailed semantic information of the point cloud at fine granularity, and supplements the local features neglected in the global features, improving the robustness of the overall downsampling model.
The embodiment of the invention adopts the transformer model, which has risen to prominence in recent years, cascading the transformer model and the target task network together for the first time and converting the problem into a task-driven point cloud self-attention metric learning problem. By designing a dedicated three-dimensional object symmetry detection model, the input point cloud data is rotated in three-dimensional space so that it is transformed into a new coordinate system projected onto a symmetry plane, increasing the scale of the training samples and improving the generalization ability of the subsequently trained model. A transformer network model is built to acquire as much rich point cloud semantic information as possible, so that the whole model framework can effectively learn the key points, redundant points, and noise points in the point cloud and acquire importance degree information for each point in the data. Metric-based downsampling is then performed according to this point-wise importance information, achieving the goal of minimizing the precision loss of the target task.
The embodiment of the invention provides a task-driven robust point cloud data downsampling framework based on a transformer neural network. The framework effectively combines the three-dimensional object symmetry detection model with the transformer network model and cascades the target task network to form an end-to-end deep learning model, finally reaching a balance point between the downsampling scale of the point cloud data and the accuracy of the point cloud target task, so that the method improves the robustness of the downsampling model while giving the model the capability to minimize the precision loss of the target task, improving both the downsampling scale and the target task accuracy.
According to the invention, the problem of trading off downsampling scale against target task accuracy in a point cloud downsampling algorithm is reduced to a task-driven point cloud self-attention metric learning problem. By designing a dedicated three-dimensional object symmetry detection model, the input point cloud data is rotated in three-dimensional space so that it is transformed into a new coordinate system projected onto a symmetry plane, increasing the scale of the training samples and improving the generalization ability of the subsequently trained model. A transformer network model is built to acquire as much rich point cloud semantic information as possible, so that the whole model framework can effectively learn the key points, redundant points, and noise points in the point cloud and acquire importance degree information for each point in the data. Metric-based downsampling is then performed according to this point-wise importance information, achieving the goal of minimizing the precision loss of the target task.
Point cloud data in three-dimensional space has translation, rotation, and scale invariance: translating, rotating, or rescaling the point cloud as a whole changes neither its true representation nor the expression of its semantic information. The symmetric structure is a basic geometric attribute of most objects in nature and is ubiquitous in actual rail transit scenes — pedestrians, automobiles, bicycles and the like all have broadly symmetric structures — so understanding object symmetry is important for deep learning models that must understand the real world and support the intelligent interaction of unmanned vehicles. To address the insufficient scale of three-dimensional point cloud training datasets in the existing deep learning field, the three-dimensional object symmetry detection model learns to detect object symmetry planes/axes and the symmetric corresponding points of the input point cloud, and seeks accurate three-dimensional transformations of the input point cloud to obtain point cloud images under new coordinates in which multiple groups of projection planes form symmetric structures.
To alleviate the problems of traditional downsampling methods, such as sampling strategies that are unrelated to the target task and sensitivity to noise, as well as the damage that downsampling strategies based on existing deep learning frameworks inflict on the spatial distribution characteristics of the point cloud, the invention designs a task-driven robust downsampling framework for point cloud data based on a transformer neural network. It makes full use of the self-attention module's invariance to the ordering of the processed point cloud sequence, so that the spatial distribution characteristics of the point cloud are preserved while its global features are effectively extracted. A local feature extraction module based on a convolutional neural network framework is introduced to extract fine-grained detail semantic information from the point cloud and supplement the local features ignored by the global features, thereby improving both the robustness of the overall downsampling model and the accuracy of the target task.
The point cloud data processing method based on the transformer neural network mainly comprises the following processing procedures:
(1) Constructing a three-dimensional object symmetry detection model. Based on a shared attention mechanism, the model introduces a diversity loss function that encourages the different attention heads to focus on learning information about specific symmetry planes of the point cloud image, so that multiple rotation-translation matrices of the same point cloud image are output simultaneously;
(2) Constructing a transformer network model. The model consists of three modules: a local feature extraction module, a global feature extraction module and a point cloud reconstruction module. The local feature extraction module consists of a group of cascaded two-dimensional convolutional neural networks and extracts fine-grained semantic feature information from the input point cloud data. The global feature extraction module consists of a group of cascaded self-attention modules and extracts global semantic features of the input point cloud data. The point cloud reconstruction module cascades three groups of shared fully connected neural networks, fuses the important information learned in the global and local features, and reconstructs a three-dimensional point cloud image containing importance information;
(3) Constructing a task-driven task network model. According to different task requirements, the three-dimensional object symmetry detection model, the transformer network model and the task model are cascaded to form an end-to-end integrated self-learning model framework.
The processing flow chart of the point cloud data robustness downsampling method based on the task driving of the transformer neural network provided by the embodiment of the invention is shown in fig. 1, and specifically comprises the following steps:
Step S1: constructing a three-dimensional object symmetry detection model based on a neural network, adopting a shared attention mechanism, introducing a diversity loss function, and collecting and labeling training samples with symmetry information.
The three-dimensional object symmetry detection model obtains the symmetric points of the input point cloud data by detecting the object's symmetry plane/axis, and uses these symmetric points to derive the rotation-translation operations that transform the point cloud data into coordinates whose projection planes are symmetric structures, obtaining multiple groups of data-enhanced point cloud image data.
Step S1-1: a self-attention mechanism module is constructed based on the neural network.
The self-attention mechanism is a mechanism by which a computer imitates the neural activation process inside the brain during human observation, fusing internal experience with external perceptual information to sharpen observation of a region of interest. The structure of the attention mechanism allows it to quickly extract important features from sparse data, which makes it well suited to three-dimensional point cloud processing tasks with sparse spatial distribution, in particular the point cloud downsampling task. A specific example structure of the self-attention model is shown in fig. 2. The self-attention mechanism creates three vectors for each point (three-dimensional coordinate) in the point cloud data: a query vector (Q), a key vector (K) and a value vector (V). These three vectors are high-level abstractions used by the computer to calculate attention. The product of Q and K then scores the semantic relevance between the input point and every point in the cloud; this score determines the influence of the other points on the current point, i.e. it judges how important the other points are to the current point.
Then, the scores are normalized with a logistic regression function so that all scores are positive and sum to 1. Multiplying the normalized scores by V yields an importance score matrix for the point cloud value vectors. Finally, a set operation, realized by a matrix function, combines the value vectors' importance score matrix with the originally input point cloud data, producing point cloud data that contains both the point coordinate information and the importance information. The self-attention mechanism is formalized as:

y_i = θ(ρ(γ(Q, K))·V; x)

where y_i is the new output feature generated by the self-attention module; β and α denote point-wise feature transformations that obtain the Q, K, V vectors by multiplying the point embedding vector with feature transformation matrices trained by the neural network; γ and θ are matrix functions, where γ represents the multiplication operation of fig. 2 that combines Q and K, and θ is the set operation between the value vectors' importance score matrix and the originally input point cloud data, with addition, subtraction, multiplication and concatenation as common choices; δ represents the position-encoding function of the point cloud; and ρ is a normalization function, for which the invention specifically uses the softmax normalization function.
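As a concrete illustration of the Q/K/V computation described above, the following NumPy sketch runs one self-attention step. The weight matrices `Wq`, `Wk`, `Wv` are random stand-ins for the trained feature transformation matrices β and α, the position-encoding term δ is omitted, and the addition variant of the set operation θ is assumed.

```python
import numpy as np

def softmax(x, axis=-1):
    # rho: normalize scores so every row is positive and sums to 1
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(points, Wq, Wk, Wv):
    """One self-attention step over an (n, d) point-feature matrix.

    Wq/Wk/Wv are hypothetical stand-ins for the trained feature
    transformation matrices that produce Q, K and V.
    """
    Q, K, V = points @ Wq, points @ Wk, points @ Wv
    scores = Q @ K.T               # gamma: Q-K product, point-wise relevance
    weights = softmax(scores, -1)  # rho: normalized importance scores
    attended = weights @ V         # importance score matrix applied to V
    return attended + points       # theta: set operation (addition variant)

rng = np.random.default_rng(0)
n, d = 8, 3
pts = rng.standard_normal((n, d))
out = self_attention(pts,
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)),
                     rng.standard_normal((d, d)))
```

The output keeps the input's shape, so the importance-weighted features can be combined point-by-point with the original coordinates, as the set operation requires.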
Step S1-2: a shared self-attention mechanism module is constructed and a diversity loss function is introduced.
As the size of point cloud data in rail transit continues to grow, the semantic information contained in the point cloud becomes increasingly complex, and a single attention mechanism has difficulty attending to all important targets. The invention therefore constructs a shared self-attention module: multiple groups of the self-attention models of step S1-1 are connected in parallel so that, during deep learning training, each self-attention module focuses on a specific target, improving the extraction of detailed semantic features. At the same time, so that the different self-attention modules within the shared module effectively focus on different targets and distinguish them from the targets attended to by the other modules, the invention introduces a diversity loss function that deliberately drives the model to learn diverse targets during training. The diversity loss function L_var is expressed as follows:

where i indexes the point cloud samples, w represents the learned attention weights, and p and q index two different self-attention models within the same shared attention module.
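The text names the symbols of L_var but its closed form is not reproduced here, so the following sketch is an assumption: it penalizes similarity between the normalized attention-weight vectors of every pair of heads p, q, summed over samples i, using cosine similarity as an illustrative stand-in for the actual diversity measure.

```python
import numpy as np

def diversity_loss(attn_weights):
    """Assumed sketch of L_var: sum over samples i and head pairs (p, q)
    of the cosine similarity between per-head attention weight vectors.
    attn_weights: (num_samples, num_heads, n) weights over n points.
    """
    norms = np.linalg.norm(attn_weights, axis=-1, keepdims=True)
    w = attn_weights / (norms + 1e-8)   # unit-normalize each head's weights
    loss = 0.0
    num_heads = w.shape[1]
    for p in range(num_heads):
        for q in range(p + 1, num_heads):
            # per-sample inner product between heads p and q, summed
            loss += np.sum(np.einsum('in,in->i', w[:, p], w[:, q]))
    return loss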
Step S1-3: and constructing a three-dimensional object symmetry detection model structure based on a shared self-attention mechanism.
A three-dimensional object symmetry detection model is designed on top of the shared self-attention module of S1-2; a specific example structure is shown in fig. 3. The originally input point cloud data is fed into a shared self-attention model containing ten self-attention modules and, under the constraint of the diversity loss function L_var of S1-2, each self-attention model learns the feature information of one symmetry plane of the point cloud image. The original point cloud data P ∈ R^{3+f} is then concatenated (concatenate) with all the feature information, and the result is input into a shared fully connected network, so that multiple rotation-translation matrices of the point cloud image are learned simultaneously. Here f denotes the feature information other than the three-dimensional coordinates in the point cloud data, typically image RGB, reflectivity, features learned by a deep network, and the like. The concatenation operation can be expressed as:
F_output = concat(f_i^1, f_i^2, …, f_i^9, f_i^10, P)
Specifically, the concatenation operation merges the channel dimensions in a neural network, i.e. it increases the number of features describing the point cloud itself without increasing the information under each feature. The shared fully connected network is a cascade of three parts: a multilayer perceptron (MLP), batch normalization (BN), and a rectified linear unit (ReLU). The shared fully connected network is expressed mathematically as:
F_output = ReLU(BN(MLP(F_in)))
Finally, the input point cloud data is multiplied by the learned groups of rotation-translation matrices to obtain multiple groups of data-enhanced point cloud image data in new coordinates whose projection planes are symmetric structures.
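The concatenation and the shared fully connected network F_output = ReLU(BN(MLP(F_in))) can be sketched as follows; the per-head feature width, output width and random weights are illustrative assumptions, with MLP reduced to a single linear layer.

```python
import numpy as np

def mlp(x, W, b):
    # single linear layer as a minimal stand-in for the MLP
    return x @ W + b

def batch_norm(x, eps=1e-5):
    # inference-style normalization over the batch (point) dimension
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

def shared_fc(F_in, W, b):
    # F_output = ReLU(BN(MLP(F_in))) as described in the text
    return relu(batch_norm(mlp(F_in, W, b)))

rng = np.random.default_rng(1)
P = rng.standard_normal((16, 3))                   # n points, xyz only (f = 0 assumed)
feats = [rng.standard_normal((16, 4)) for _ in range(10)]  # ten heads, width 4 assumed
F_in = np.concatenate(feats + [P], axis=1)         # concat(f_i^1..f_i^10, P): (16, 43)
W = rng.standard_normal((F_in.shape[1], 12))       # hypothetical output width 12
out = shared_fc(F_in, W, np.zeros(12))
```

Concatenation widens the channel dimension (here 10 × 4 + 3 = 43) without touching the per-feature values, matching the description above.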
Step S1-4: training samples with symmetry information are collected and annotated.
Current deep-learning-based task networks are data-driven; neural network learning on big data can improve on manual efficiency and, in specific scenarios, even replace manual work, realizing machine intelligence. The greatest advantage of the three-dimensional object symmetry detection model designed by the invention is that it is entirely data-driven and requires no manual intervention, maximally exploiting the information contained in the sample data to detect the symmetry planes of the point cloud image. Therefore, to maximize the detection accuracy of the model, the invention builds on public datasets containing symmetry annotations, the ShapeNet dataset and the YCB dataset (public dataset access address: https://github.com/GodZarathustra/SymmetryNet), and can continually add sample data containing symmetric structures that appear in real rail transit scenes, such as pedestrians, bicycles and cars, so that the accuracy of the symmetry detection model gradually improves.
The goal of this module is to design a neural network that can learn the symmetry information of three-dimensional point cloud data and apply a rotation transformation to the data. A neural network is ultimately machine code: after training on a large number of data samples it learns specific weights (which can be thought of as matrix parameters, i.e. numerical values), and thereafter processes its input with those fixed weights. The first step in having the network learn three-dimensional point cloud data is therefore to collect and annotate training samples, and then train the neural network on them to fulfill the module's goal.
Step S2: and constructing a transformer network model, extracting global characteristic information and local characteristic information of the point cloud image data with the plurality of groups of data enhanced according to the transformer network model, acquiring importance degree information of each point in the point cloud data, and learning the point cloud data after downsampling.
Processing the input point cloud images yields the richer training set required to train the point cloud transformer network model. The point cloud transformer network model mainly comprises two modules: a coordinate-based position encoding module and a self-attention module, where the self-attention module is the core of the transformer and obtains refined global feature information of the input point cloud image.
The invention introduces a local feature extraction module based on a convolutional neural network framework, which extracts fine-grained detail semantic information from the point cloud and supplements the local features ignored by the global features, improving the robustness of the overall downsampling model. The following sections describe each of these modules in detail.
Step S2-1: input embedding module
The point cloud transformer network model mainly comprises three modules: an input embedding module, a position encoding module and a self-attention module. Notably, a point cloud is an unordered set: different orderings of the points do not change what the point cloud data represents or the semantic information it expresses. The input embedding module is therefore combined with the position encoding module, and the spatial distribution of the three-dimensional point cloud is modeled using its natural position coordinate information.
The point cloud transformer network model is in fact a robust point cloud downsampling model: it can downsample a variety of point cloud inputs, i.e. reduce the scale of the point cloud, so that the S3 task network needs only a smaller point cloud, effectively reducing the computation and memory costs of the network.
Step S2-2: self-attention model
The self-attention model has proven its effectiveness across many existing point cloud processing tasks. The self-attention model is therefore used to analyze the data-enhanced point cloud data and extract its global feature information. The instantiated self-attention model has the same structure as fig. 2 in step S1-1. In particular, there are various choices for the set-operation function θ; common ones such as addition, subtraction, concatenation and products are expressed mathematically as follows:
Addition: θ(SA(x_i); x) = SA(x_i) + x
Subtraction: θ(SA(x_i); x) = SA(x_i) − x
Concatenation: θ(SA(x_i); x) = [SA(x_i), x]
Hadamard product: θ(SA(x_i); x) = SA(x_i) ⊙ x
Dot product: θ(SA(x_i); x) = SA(x_i) · x
Testing these five common functions in an ablation experiment, the invention finds that for the point cloud downsampling task the concatenation operation contributes the highest accuracy to the deep learning network. Verification shows that cascading three groups of attention mechanisms achieves the best feature extraction effect.
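The five variants of the set operation θ can be written directly as functions of the attention output SA(x_i) and the original input x; equal shapes are assumed except for concatenation, which widens the feature dimension, and the dot product, which yields a point-by-point similarity matrix.

```python
import numpy as np

# The five set-operation variants for theta, taking the attention output
# `sa` (= SA(x_i)) and the original input `x` as (n, d) arrays.
theta_ops = {
    'add':      lambda sa, x: sa + x,                            # SA(x_i) + x
    'sub':      lambda sa, x: sa - x,                            # SA(x_i) - x
    'concat':   lambda sa, x: np.concatenate([sa, x], axis=-1),  # [SA(x_i), x]
    'hadamard': lambda sa, x: sa * x,                            # SA(x_i) ⊙ x
    'dot':      lambda sa, x: sa @ x.T,                          # SA(x_i) · x
}
```

The concatenation variant, which the ablation above found most accurate for downsampling, preserves both the attended features and the raw input side by side rather than merging them numerically.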
Step S2-3: construction of local feature extraction unit
The convolutional neural network has a strong local feature extraction capability and can identify detail information and effectively fuse features of complex scene information. For a conventional convolutional network, the output at a position in a two-dimensional image is related not only to the input at that position but also to the inputs at surrounding positions, with different positions carrying different weights. However, a three-dimensional point cloud is a structurally sparse form of data, and there is no guarantee that every position contains a point, so convolution cannot be applied directly to point cloud tasks. To solve this, the invention proposes a new feature extraction module with two main components: a sampling and grouping layer and a convolution layer; a specific example structure is shown in fig. 4.
The goal of the sampling and grouping layer is to build a hierarchical set from the input point cloud. The specific steps are: (1) use a farthest point sampling function (Farthest Point Sample, FPS) to obtain M initial sampling point indices, extract these points from the original point cloud image P, denote them new_points, and preserve their spatial distribution in three-dimensional space; (2) set a sphere radius parameter r, and build a spherical coordinate system with each point in new_points as the center and r as the radius; (3) extract all points of the original cloud lying within the sphere of radius r centered at each point of new_points, forming a new point cloud data map denoted new_ball_points; (4) set the number K of sampling points within the spherical coordinates, sample K points in each new_ball_points with a K-nearest-neighbor algorithm, and remove the non-sampled points, forming a new point cloud data map with a fixed number of sampling points, denoted new_ball_sampled_points; (5) subtract the value of the corresponding new_points center from the points in each new_ball_sampled_points region, and splice the new features with the old features at each point, finally obtaining a uniform point cloud image with a fixed number of points constrained by the spherical volume. The flow can be expressed mathematically: for an input point cloud image P containing N points, the sampling and grouping layer yields M point cloud subsets {p_1, p_2, …, p_M}, where each point cloud subset p_m is composed of the K points that are nearest neighbors of the corresponding center coordinate point, the nearest neighbors satisfying the Euclidean metric ρ:
Having obtained the M new dense point cloud subsets, a convolutional neural network layer then performs feature extraction on the point cloud to obtain fine-grained local features of the point cloud image, supplementing the local features ignored by the global features and improving the robustness of the overall downsampling model. The invention uses two groups of cascaded local feature extraction modules to complete the extraction of local fine-grained semantic features.
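Steps (1)–(5) of the sampling and grouping layer can be sketched as follows. This is a minimal NumPy illustration of farthest point sampling plus ball-query grouping, not the trained pipeline; the function names and the choices M = 4, K = 8, r = 10 are illustrative assumptions.

```python
import numpy as np

def farthest_point_sample(P, M):
    """FPS sketch: iteratively pick the point farthest from those chosen."""
    n = P.shape[0]
    idx = np.zeros(M, dtype=int)          # start from point 0
    dist = np.full(n, np.inf)
    for m in range(1, M):
        dist = np.minimum(dist, np.linalg.norm(P - P[idx[m - 1]], axis=1))
        idx[m] = dist.argmax()            # farthest from all chosen so far
    return idx

def ball_group(P, centers, r, K):
    """Ball query + kNN sketch: for each center, take up to K nearest
    in-ball points and express them relative to the center (step (5))."""
    groups = []
    for c in centers:
        d = np.linalg.norm(P - c, axis=1)
        in_ball = np.where(d <= r)[0]                 # points inside the sphere
        nearest = in_ball[np.argsort(d[in_ball])[:K]] # K nearest of those
        pad = np.resize(nearest, K)                   # repeat if fewer than K
        groups.append(P[pad] - c)                     # center-relative coords
    return np.stack(groups)

P = np.random.default_rng(0).standard_normal((32, 3))
idx = farthest_point_sample(P, 4)
new_points = P[idx]
groups = ball_group(P, new_points, r=10.0, K=8)       # (M, K, 3)
```

Each group's first relative coordinate is the center itself and so is zero, confirming the subtraction of new_points in step (5).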
Step S2-4: constructing a loss function for training a network model of a transformer
For an input point cloud P = {p_i ∈ R^{3+f} | i = 1, 2, …, n} containing n points, the training goal of the transformer network is to learn a subset P_s with s < n that minimizes the task sampling loss L. The objective function L can be expressed as:

where t_i represents the ground-truth value. To satisfy the objective function L, the invention introduces a sampling regularization loss function L_sampling, expressed specifically as follows:

where L_f and L_m represent the average and maximum neighbor losses respectively, and L_b represents the neighbor point matching loss.
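The closed forms of L_f, L_m and L_b are not reproduced in the text, so the following sketch is an assumption: it follows the common simplification-loss pattern for learned sampling, in which L_f and L_m are the average and worst-case distances from the sampled points to the input cloud and L_b is the reverse coverage term; the weighting parameters are illustrative.

```python
import numpy as np

def nn_dists(A, B):
    # squared distance from each point in A to its nearest neighbor in B
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return d.min(axis=1)

def sampling_loss(P_s, P, beta_=1.0, gamma_=1.0):
    """Assumed sketch of L_sampling = L_f + beta*L_m + gamma*L_b."""
    d_sp = nn_dists(P_s, P)
    L_f = d_sp.mean()              # average neighbor loss
    L_m = d_sp.max()               # maximum neighbor loss
    L_b = nn_dists(P, P_s).mean()  # neighbor matching / coverage loss
    return L_f + beta_ * L_m + gamma_ * L_b
```

When P_s is a subset of P the first two terms vanish and only the coverage term penalizes points of P left far from any sample, which is the behavior a sampling regularizer needs.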
The point cloud transformer network model selects the set of three-dimensional points that contributes most to the task discrimination accuracy of the task network, and takes this set as the downsampled point cloud data.
Step S3: and constructing a task network model driven by the task according to different target task requirements, inputting the down-sampled point cloud data into the task network model, performing target task learning by the task network model, and outputting a target task result.
For specific point cloud processing tasks, such as point cloud classification and point cloud reconstruction, the invention provides a task-driven task network model based on the transformer neural network. The task network model can be regarded as a three-dimensional point selection mechanism for fulfilling the original task requirements, such as object classification.
The downsampled point cloud data obtained in step S2 is used as the input of the task network model to complete the target task. To balance the downsampling scale against task accuracy, the three-dimensional object symmetry detection model, the transformer network model and the task network model are cascaded into an end-to-end integrated self-learning model framework, through which the point cloud data under the specified task is downsampled.
Target tasks are varied: different tasks, such as target detection, target classification and semantic segmentation, can be performed on the same dataset. The point cloud target task in the invention refers to one specific task, used to train the neural network model. Because the same dataset may serve different applications in a neural network, such as target classification, target detection and target reconstruction, and the S2 module downsamples differently for different tasks, i.e. the network learns different features, the qualifier "given task" is added in S3.
Fig. 5 is a specific instantiated structure diagram of the transformer network model under task driving according to an embodiment of the present invention. First, the three-dimensional object symmetry detection model constructed in step S1 performs data enhancement on the training samples. Second, the enhanced data is input into the transformer network of step S2 to learn the simplified point cloud image. Finally, the simplified point cloud image is input into the task network model, which outputs the target task result. The overall end-to-end loss function is expressed as:
L_total(P, P_s) = αL_var(P) + βL_sampling(P, P_s) + L_task(P_s)
where α and β represent weights.

This end-to-end loss serves as the training loss function of the neural network in step S2, and the weight parameters of the network are updated by the back-propagation algorithm inherent to neural networks, continuously optimizing the output accuracy of the network. Here α and β are scaling coefficients by which L_var(P) and L_sampling(P, P_s) are multiplied in L_total(P, P_s); their values lie in the range (0, 1] (i.e. a number between 0 and 1, with 0 excluded and 1 included). P represents the original point cloud input and P_s the downsampled point cloud data.
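The combination of the three loss terms is straightforward; the sketch below simply mirrors the formula for L_total, with the range check on the user-chosen coefficients α, β in (0, 1] added as an assumption from the description above.

```python
def total_loss(l_var, l_sampling, l_task, alpha, beta):
    """L_total = alpha*L_var + beta*L_sampling + L_task, with
    alpha, beta in (0, 1] as stated in the text."""
    assert 0 < alpha <= 1 and 0 < beta <= 1
    return alpha * l_var + beta * l_sampling + l_task
```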
The task network may be replaced according to the user's target task. The invention designs the point cloud downsampling model around an already trained task neural network: S1 and S2 are the learnable parts, trained on the point cloud input and the network parameters obtained from the task network. The task network is thus a neural network whose parameters are fixed in advance. The advantage is that, for any task network S3, the S1 and S2 modules can learn the smaller point cloud set it needs, effectively reducing the computation cost of the whole network while ensuring that the overall accuracy of the task network meets the user's requirements.
The invention takes the classification target task as an example for the S3 model design. The original point cloud data is input into the target network; feature mapping is performed first, i.e. the three-dimensional point cloud data is mapped into a feature space; the point cloud input features in that feature space are then learned through a shared fully connected layer, and the maximum output accuracy of the task network is obtained by continuously updating its weight parameters. Finally, the trained task network model is fixed and serves as the S3 model.
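The classification-style S3 network described above can be sketched as a shared per-point feature mapping followed by pooling over the point set and a class scoring layer; the random weights, layer widths and class count are hypothetical stand-ins for the trained, fixed parameters.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def classify(P_s, W1, W2):
    """Classification head sketch: map each point to feature space with a
    shared layer, max-pool over points into a global descriptor, then
    score the classes. W1, W2 are hypothetical fixed weights."""
    feat = relu(P_s @ W1)     # shared per-point feature mapping
    glob = feat.max(axis=0)   # pool over the (downsampled) point set
    return glob @ W2          # class logits

rng = np.random.default_rng(2)
P_s = rng.standard_normal((32, 3))          # downsampled cloud from S2
W1 = rng.standard_normal((3, 16))           # feature-space mapping, width 16 assumed
W2 = rng.standard_normal((16, 10))          # 10 classes assumed
logits = classify(P_s, W1, W2)
```

Because the pooling is over the point set, the output is unchanged under any reordering of the downsampled points, consistent with point clouds being unordered.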
Fig. 6 is an exemplary schematic diagram of an overall network structure of a training sample and a corresponding downsampled point cloud image according to an embodiment of the present invention.
In summary, the invention reduces the trade-off between downsampling scale and target-task accuracy in point cloud downsampling to a task-driven, self-attention-based metric learning problem on the point cloud. A dedicated three-dimensional object symmetry detection model rotates the input point cloud data in three-dimensional space, translating and transforming it into multiple new coordinate systems whose projection planes are symmetry planes, thereby enlarging the set of training samples and improving the generalization ability of the subsequently trained model; a transformer network model captures rich point cloud semantic information so that the overall model framework can effectively distinguish key points, redundant points and noise points in the point cloud and obtain importance information for each point in the data. Metric-based downsampling is then performed according to this point-wise importance information, minimizing the accuracy loss on the target task.
The method effectively alleviates the problem of insufficient training samples and contributes to the robustness and generalization of the subsequently trained model. The invention exploits the rich geometric and semantic information contained in an object's symmetry plane to realize dynamic multi-angle rotation of the point cloud input data, enlarging the training data so as to enhance model generalization; it introduces self-attention and local feature extraction models to extract features from the input data in both global and local dimensions, obtaining as much semantic information as possible so that the overall model can effectively distinguish key points, redundant points and noise points in the point cloud data; and, combining these modules, it designs a complete point cloud downsampling model that self-learns according to the specified point cloud task, finally realizing task-driven point cloud downsampling that minimizes the accuracy loss of the target task.
Those of ordinary skill in the art will appreciate that: the drawing is a schematic diagram of one embodiment and the modules or flows in the drawing are not necessarily required to practice the invention.
From the above description of embodiments, it will be apparent to those skilled in the art that the present invention may be implemented in software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present invention.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus or system embodiments, since they are substantially similar to method embodiments, the description is relatively simple, with reference to the description of method embodiments in part. The apparatus and system embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
The present invention is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the technical scope of the present invention are intended to be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (5)

1. The point cloud data processing method based on the transformer neural network is characterized by comprising the following steps of:
s1, constructing a three-dimensional object symmetry detection model, wherein the three-dimensional object symmetry detection model obtains symmetric points of input point cloud data by detecting an object symmetry plane/axis, and uses the symmetric points to derive rotation-translation operations that transform the projection plane of the point cloud data into a symmetric structure, obtaining multiple groups of data-enhanced point cloud image data;
s2, constructing a transformer network model, extracting global feature information and local feature information of the multiple groups of data-enhanced point cloud image data through the transformer network model, obtaining importance information of each point in the point cloud data, and learning the downsampled point cloud data;
step S3, constructing a task network model driven by a task by combining different target task demands, inputting the down-sampled point cloud data into the task network model, and carrying out target task learning by the task network model to output a target task result, wherein the target task is point cloud classification;
the step S1 specifically comprises the following steps:
constructing a self-attention mechanism module based on a neural network, collecting and labeling training samples with symmetrical information, and training the self-attention mechanism module by using the training samples;
connecting a plurality of self-attention mechanism modules in parallel and introducing a diversity loss function, so that a shared self-attention module is obtained and the different self-attention modules within the shared self-attention module focus on different targets;
constructing a three-dimensional object symmetry detection model based on the shared self-attention module, inputting the original point cloud data P ∈ R^{3+f} into the three-dimensional object symmetry detection model so that, under the constraint of the diversity loss function L_var, each self-attention model learns the feature information of one symmetry plane of the original point cloud data; concatenating the original point cloud data P ∈ R^{3+f} with all the feature information and inputting the concatenation result into a shared fully connected network to learn multiple rotation-translation matrices of the point cloud image simultaneously, wherein f represents the feature information other than the three-dimensional coordinates in the point cloud data, and the concatenation operation is expressed as:
F_output = concat(f_i^1, f_i^2, …, f_i^9, f_i^10, P)
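A minimal NumPy sketch of this concatenation step (the function name and array shapes are illustrative assumptions, not taken from the patent):

```python
import numpy as np

def concat_heads(head_feats, P):
    """F_output = concat(f^1, ..., f^10, P): concatenate the symmetry
    features produced by each self-attention head with the raw input
    points along the channel (last) axis.

    head_feats: list of (n, c) arrays, one per self-attention head
    P:          (n, 3 + f) raw input point cloud
    """
    return np.concatenate(head_feats + [P], axis=1)

# toy example: 10 heads with 2 channels each, a cloud of 4 xyz points
heads = [np.full((4, 2), float(i)) for i in range(10)]
P = np.zeros((4, 3))
F_output = concat_heads(heads, P)   # shape (4, 10*2 + 3) = (4, 23)
```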
multiplying the original point cloud data by the learned groups of rotation-translation matrices to obtain data-enhanced point cloud image data under multiple groups of new coordinates in which the projection planes exhibit symmetric structure;
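The multiplication above can be sketched with homogeneous coordinates; the [R | t] layout and function name below are assumptions for illustration, not the patent's definition:

```python
import numpy as np

def apply_transforms(points, transforms):
    """Apply each learned rigid transform to the point cloud.

    points:     (n, 3) array of xyz coordinates
    transforms: (m, 3, 4) array; each entry is [R | t] with a 3x3
                rotation R and a 3-vector translation t
    returns:    (m, n, 3) array: m augmented copies of the cloud
    """
    n = points.shape[0]
    homog = np.hstack([points, np.ones((n, 1))])        # (n, 4) homogeneous points
    # for each transform k: out[k, n, :] = [R|t] @ homog[n]
    return np.einsum('kij,nj->kni', transforms, homog)

# toy example: identity and a 90-degree rotation about the z axis
identity = np.array([[1., 0., 0., 0.],
                     [0., 1., 0., 0.],
                     [0., 0., 1., 0.]])
rot_z = np.array([[0., -1., 0., 0.],
                  [1.,  0., 0., 0.],
                  [0.,  0., 1., 0.]])
cloud = np.array([[1., 0., 0.], [0., 2., 0.]])
out = apply_transforms(cloud, np.stack([identity, rot_z]))
```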
the step S2 specifically includes:
constructing a transformer network model comprising an input embedding module, a position encoding module and a self-attention module, and training the transformer network model with a loss function; the input embedding module is combined with the position encoding module, and the combined modules model the spatial distribution of the multiple groups of data-enhanced point cloud image data using the natural position coordinate information of the three-dimensional point cloud;
analyzing the multiple groups of data-enhanced point cloud image data with the self-attention model based on their spatial distribution model, and extracting the global feature information of the data-enhanced point cloud image data;
constructing a local feature extraction unit comprising a sampling-and-grouping layer and a convolution layer; the sampling-and-grouping layer establishes a plurality of layered point cloud subsets of the data-enhanced point cloud image data, and the convolutional neural network layer performs feature extraction on the plurality of point cloud subsets to obtain fine-grained local features of the data-enhanced point cloud image data;
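Sampling-and-grouping layers of this kind are commonly built from farthest-point sampling plus k-nearest-neighbour grouping; the sketch below assumes that common construction (the patent does not spell out the exact algorithm):

```python
import numpy as np

def farthest_point_sample(points, m):
    """Greedily pick m spread-out seed indices (farthest-point sampling)."""
    chosen = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())                # point farthest from all seeds so far
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def group_knn(points, centers_idx, k):
    """Group the k nearest neighbours of each sampled centre."""
    centers = points[centers_idx]
    d = np.linalg.norm(centers[:, None] - points[None, :], axis=-1)
    return np.argsort(d, axis=1)[:, :k]         # (m, k) neighbour indices

pts = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [5., 5., 5.]])
seeds = farthest_point_sample(pts, 2)           # picks the two most separated points
groups = group_knn(pts, seeds, 2)               # each centre plus its nearest neighbour
```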
the self-attention module synthesizes the global and local feature information of the multiple groups of data-enhanced point cloud image data and selects the set of three-dimensional points contributing most to the task discrimination accuracy of the task network, obtaining the downsampled point cloud data;
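Selecting the highest-contribution points reduces, at inference time, to a top-k pick over per-point importance scores; a minimal sketch (how the scores are produced is left abstract here):

```python
import numpy as np

def select_topk_points(points, scores, k):
    """Keep the k points with the highest attention-derived importance.

    points: (n, d) point coordinates/features
    scores: (n,) per-point importance (e.g. aggregated attention weights)
    k:      number of points to retain after downsampling
    """
    idx = np.argsort(scores)[::-1][:k]          # indices of the k largest scores
    return points[idx], idx

pts = np.array([[0., 0., 0.], [1., 1., 1.], [2., 2., 2.], [3., 3., 3.]])
scr = np.array([0.1, 0.9, 0.5, 0.2])
sampled, idx = select_topk_points(pts, scr, 2)  # keeps the two top-scoring points
```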
the step S3 specifically includes:
constructing a task-driven task network model based on the transformer neural network, inputting the downsampled point cloud data into the task network model, designing the three-dimensional object symmetry detection model and the transformer network model based on the task network model, and designing an end-to-end loss function expressed as:
L_total(P, P_s) = α·L_var(P) + β·L_sampling(P, P_s) + L_task(P_s)
wherein α and β denote the weights;
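The weighted combination can be sketched numerically; the default values of α and β below are placeholders, not the patent's tuned settings:

```python
def total_loss(l_var, l_sampling, l_task, alpha=0.5, beta=0.5):
    """L_total(P, P_s) = alpha*L_var(P) + beta*L_sampling(P, P_s) + L_task(P_s).

    alpha and beta weight the diversity and sampling-regularization terms
    against the (unweighted) task loss.
    """
    return alpha * l_var + beta * l_sampling + l_task

# toy values: 0.5*0.2 + 0.5*0.4 + 1.0 = 1.3
loss = total_loss(l_var=0.2, l_sampling=0.4, l_task=1.0)
```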
the end-to-end loss function is used as the training loss function of the three-dimensional object symmetry detection model and the transformer network model, and the weight parameters of both models are updated through the neural network's inherent back-propagation algorithm, so that the output precision of the three-dimensional object symmetry detection model, the transformer network model and the task network model is continuously optimized;
and downsampling the input point cloud data through the finally optimized symmetry detection model, transformer network model and task network model, mapping the downsampled point cloud data to a feature space, and learning the point cloud input features in the feature space through a shared fully connected layer to obtain the output result of the target task.
2. The method of claim 1, wherein the self-attention mechanism creates three vectors for each point in the three-dimensional coordinates of the point cloud data: a query vector Q, a key vector K and a value vector V, and scores the semantic association of each input point with the points of the point cloud by computing the product of Q and K;
the self-attention mechanism is functionally represented as:
y_i = θ(ρ(γ(Q, K)), V)
wherein y_i is the new output feature generated by the self-attention module; β and α represent point-wise feature transformation operations; the three vectors Q, K and V are obtained by passing the point embedding vectors through three feature transformation matrices trained point-wise during training of the neural network; γ and θ are matrix functions, wherein γ represents the multiplication operation computing the product of Q and K, θ represents the operation applying the importance score matrix to the value vector together with the originally input point cloud data set, and ρ represents a normalization function.
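A minimal NumPy sketch of the Q·K^T importance scoring described in this claim; the random projection matrices stand in for the learned feature transformations, and the scaled-softmax normalization is one common choice for ρ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(points, Wq, Wk):
    """Score each point's semantic association with every other point.

    points: (n, d) point embeddings
    Wq, Wk: (d, d_k) projection matrices (placeholders for the
            learned feature transformations)
    returns: (n, n) row-normalized importance matrix
    """
    Q = points @ Wq
    K = points @ Wk
    return softmax(Q @ K.T / np.sqrt(K.shape[1]))   # scaled dot-product scores

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 8))
A = attention_scores(pts, rng.normal(size=(8, 4)), rng.normal(size=(8, 4)))
```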
3. The method according to claim 1, wherein the diversity loss function L_var is expressed as follows:
wherein i indexes the different point cloud samples, w represents the learned attention weights, and p and q denote two different self-attention models within the same shared attention module.
4. The method of claim 1, wherein the shared fully connected network consists of a cascade of three parts: a multi-layer perceptron, a batch normalization function and a linear rectification function, the shared fully connected network being expressed as:
F_output = ReLU(BN(MLP(F_in))).
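A minimal NumPy sketch of this cascade applied per point (weight shapes and the use of batch statistics over the point axis are illustrative assumptions):

```python
import numpy as np

def shared_fc(F_in, W, b, gamma, beta, eps=1e-5):
    """F_output = ReLU(BN(MLP(F_in))), applied identically to every point.

    F_in:        (n, d) per-point input features
    W, b:        weight (d, h) and bias (h,) of the shared MLP layer
    gamma, beta: batch-norm scale and shift, each of shape (h,)
    """
    x = F_in @ W + b                           # shared MLP: same linear map per point
    mu, var = x.mean(axis=0), x.var(axis=0)    # batch statistics over the points
    x = gamma * (x - mu) / np.sqrt(var + eps) + beta
    return np.maximum(x, 0.0)                  # linear rectification (ReLU)

rng = np.random.default_rng(0)
F_in = rng.normal(size=(6, 4))
out = shared_fc(F_in, rng.normal(size=(4, 8)), np.zeros(8), np.ones(8), np.zeros(8))
```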
5. The method of claim 1, wherein training the transformer network model with a loss function comprises:
for an input point cloud P = {p_i ∈ R^(3+f), i = 1, 2, …, n} containing n points, the training goal of the transformer network is to learn a subset P_s of size s such that s < n while minimizing the task sampling loss L, the objective function L being expressed as:
wherein t_i represents the ground-truth value; to meet the requirements of the objective function L, a sampling regularization loss function L_sampling is introduced, whose specific form is as follows:
wherein L_f and L_m represent the average and maximum neighbor losses, respectively, and L_b represents the neighbor point matching loss.
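One plausible reading of these three terms, as nearest-neighbour distances between the original and sampled clouds (the patent's exact definitions are not reproduced here, so this is a hedged sketch):

```python
import numpy as np

def sampling_losses(P, Ps):
    """Neighbour losses between the original cloud P (n, 3) and the
    sampled cloud Ps (s, 3).

    L_f: average nearest-neighbour distance from each sampled point to P
    L_m: maximum of those nearest-neighbour distances
    L_b: average nearest-neighbour distance from each original point to Ps
    """
    d = np.linalg.norm(Ps[:, None, :] - P[None, :, :], axis=-1)  # (s, n) pairwise distances
    to_orig = d.min(axis=1)        # sampled -> original
    to_samp = d.min(axis=0)        # original -> sampled
    return to_orig.mean(), to_orig.max(), to_samp.mean()

P = np.array([[0., 0., 0.], [1., 0., 0.], [0., 1., 0.], [1., 1., 0.]])
Ps = P[:2]                         # sampled cloud is a subset of the original
L_f, L_m, L_b = sampling_losses(P, Ps)
```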
CN202111060998.0A 2021-09-10 2021-09-10 Point cloud data processing method based on transformer neural network Active CN113870160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111060998.0A CN113870160B (en) 2021-09-10 2021-09-10 Point cloud data processing method based on transformer neural network

Publications (2)

Publication Number Publication Date
CN113870160A CN113870160A (en) 2021-12-31
CN113870160B true CN113870160B (en) 2024-02-27

Family

ID=78995275

Country Status (1)

Country Link
CN (1) CN113870160B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114091628B (en) * 2022-01-20 2022-04-22 山东大学 Three-dimensional point cloud up-sampling method and system based on double branch network
CN115049786B (en) * 2022-06-17 2023-07-18 北京交通大学 Task-oriented point cloud data downsampling method and system
CN117274454A (en) * 2023-08-29 2023-12-22 西交利物浦大学 Three-dimensional point cloud completion method, device and storage medium based on component information

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018159690A1 (en) * 2017-02-28 2018-09-07 国立研究開発法人理化学研究所 Point cloud data extraction method and point cloud data extraction device
CN111340145A (en) * 2020-05-19 2020-06-26 北京数字绿土科技有限公司 Point cloud data classification method and device and classification equipment
CN112614071A (en) * 2020-12-29 2021-04-06 清华大学 Self-attention-based diverse point cloud completion method and device
CN112633330A (en) * 2020-12-06 2021-04-09 西安电子科技大学 Point cloud segmentation method, system, medium, computer device, terminal and application
CN112837356A (en) * 2021-02-06 2021-05-25 湖南大学 WGAN-based unsupervised multi-view three-dimensional point cloud joint registration method
CN113128591A (en) * 2021-04-14 2021-07-16 中山大学 Rotation robust point cloud classification method based on self-supervision learning
CN113345106A (en) * 2021-06-24 2021-09-03 西南大学 Three-dimensional point cloud analysis method and system based on multi-scale multi-level converter

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RotPredictor: Unsupervised Canonical Viewpoint Learning for Point Cloud Classification; Jin Fang et al.; 2020 International Conference on 3D Vision (3DV); full text *
Airborne LiDAR point cloud classification based on multi-feature fusion and neural networks; Cui Tuantuan et al.; Journal of Computer Applications; full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant