CN111832358A - Point cloud semantic analysis method and device - Google Patents

Point cloud semantic analysis method and device

Info

Publication number
CN111832358A
CN111832358A (application number CN201910318406.7A)
Authority
CN
China
Prior art keywords
point cloud
features
frame
neural network
dimensional
Prior art date
Legal status
Pending
Application number
CN201910318406.7A
Other languages
Chinese (zh)
Inventor
李艳丽
贾魁
崔丽华
赫桂望
蔡金华
Current Assignee
Beijing Jingdong Three Hundred And Sixty Degree E Commerce Co ltd
Original Assignee
Beijing Jingdong Three Hundred And Sixty Degree E Commerce Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Three Hundred And Sixty Degree E Commerce Co Ltd
Priority to CN201910318406.7A
Publication of CN111832358A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/25 - Fusion techniques
    • G06F 18/253 - Fusion techniques of extracted features

Abstract

The invention discloses a point cloud semantic analysis method and device, and relates to the technical field of computers. One embodiment of the method comprises: inputting laser point cloud data of a target scene; and carrying out semantic analysis on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises the following steps: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; performing full connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud; and outputting the point cloud identification information of the multi-frame point clouds in the time sequence. The embodiment is not affected by moving objects or by the misalignment of the same scene point across sequence frames, and improves the consistency between the parsing results of successive frames, thereby obtaining parsing results with space-time consistency, avoiding repeated computation, and yielding more accurate and robust results.

Description

Point cloud semantic analysis method and device
Technical Field
The invention relates to the technical field of computers, in particular to a point cloud semantic analysis method and device.
Background
Point cloud semantic analysis performs semantic parsing on a laser point cloud, that is, it labels the category to which each point belongs, for example parsing the regions of vehicles, pedestrians and the like in a street-view point cloud, distinguishing dynamic from static objects, or separating foreground objects from background scenes. Point cloud semantic analysis is indispensable in street-view intelligent perception and environment visualization, and it is also a research difficulty in computer vision: on the one hand, real scenes are rich and varied and object poses differ widely; on the other hand, the full point cloud is stitched from single-frame point clouds, and adjacent frames covering the same location differ in height or reflectance owing to device precision and moving objects, so the stitched cloud contains thickened and noisy regions (as shown in Fig. 1), and directly parsing the stitched point cloud suffers from insufficient precision.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the same scene point differs across single-frame point clouds, so the stitched result suffers from space-time inconsistency; computation is repeated; and the accuracy and robustness of the parsing results are insufficient.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for point cloud semantic analysis that are not affected by moving objects or by the misalignment of the same scene point across sequence frames, and that improve the consistency between the parsing results of successive frames, thereby obtaining parsing results with space-time consistency without repeated computation, with high accuracy and robustness.
To achieve the above object, according to an aspect of the embodiments of the present invention, a method for semantic parsing of a point cloud is provided.
A point cloud semantic parsing method comprises the following steps: inputting laser point cloud data of a target scene, wherein the laser point cloud data comprises a plurality of frames of point clouds in a time sequence; and carrying out semantic analysis on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises the following steps: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; performing full connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud, wherein the point cloud identification information represents the probability that each point in the frame point cloud belongs to each class of targets in the target scene; and outputting the point cloud identification information of the multi-frame point clouds in the time sequence.
Optionally, the step of extracting a high-dimensional feature and a global feature with space-time consistency from the frame point cloud, and obtaining a fusion feature of the frame point cloud by using the global feature with space-time consistency and the high-dimensional feature includes: extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network; processing the high-dimensional features of the frame point cloud through a full connection layer and a pooling layer of the convolutional neural network, and inputting the processed high-dimensional features into a recurrent neural network for conversion to obtain global features of the frame point cloud with space-time consistency; and concatenating, through a channel cascade layer of the convolutional neural network, the high-dimensional features and the features obtained by processing the high-dimensional features through the full connection layer with the global features with space-time consistency, to obtain the fusion features of the frame point cloud.
Optionally, the step of extracting a high-dimensional feature and a global feature with space-time consistency from the frame point cloud, and obtaining a fusion feature of the frame point cloud by using the global feature with space-time consistency and the high-dimensional feature includes: extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network; inputting the high-dimensional features of the frame point cloud into a data slicing layer of the convolutional neural network for slicing by laser line number, to obtain the high-dimensional features of each line point cloud corresponding to each laser line number; processing the high-dimensional features of each line point cloud through a full connection layer and a pooling layer of the convolutional neural network, and inputting the processed high-dimensional features into a recurrent neural network for conversion to obtain the global features of each line point cloud with space-time consistency; concatenating, through a channel cascade layer of the convolutional neural network, the high-dimensional features of each line point cloud and the features obtained by processing those high-dimensional features through the full connection layer with the corresponding line point cloud's global features with space-time consistency, to obtain the fusion features of each line point cloud; and merging the fusion features of the line point clouds through a data cascade layer of the convolutional neural network to obtain the fusion features of the frame point cloud.
Optionally, the recurrent neural network is one of the following networks embedded in the convolutional neural network: a single-layer unidirectional long short-term memory (LSTM) network, a bidirectional LSTM network, a multi-layer LSTM network, or a gated recurrent unit (GRU) network.
According to another aspect of the embodiment of the invention, a point cloud semantic analysis device is provided.
A point cloud semantic parsing device comprises: a point cloud data input module, configured to input laser point cloud data of a target scene, wherein the laser point cloud data comprises multi-frame point clouds in a time sequence; a point cloud semantic analysis module, configured to perform semantic analysis on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises the following steps: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; performing full connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud, wherein the point cloud identification information represents the probability that each point in the frame point cloud belongs to each class of targets in the target scene; and a point cloud identification information output module, configured to output the point cloud identification information of the multi-frame point clouds in the time sequence.
Optionally, the point cloud semantic parsing module includes a first parsing processing unit, configured to: extract high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network; process the high-dimensional features of the frame point cloud through a full connection layer and a pooling layer of the convolutional neural network, and input the processed high-dimensional features into a recurrent neural network for conversion to obtain global features of the frame point cloud with space-time consistency; and concatenate, through a channel cascade layer of the convolutional neural network, the high-dimensional features and the features obtained by processing the high-dimensional features through the full connection layer with the global features with space-time consistency, to obtain the fusion features of the frame point cloud.
Optionally, the point cloud semantic parsing module includes a second parsing processing unit, configured to: extract high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network; input the high-dimensional features of the frame point cloud into a data slicing layer of the convolutional neural network for slicing by laser line number, to obtain the high-dimensional features of each line point cloud corresponding to each laser line number; process the high-dimensional features of each line point cloud through a full connection layer and a pooling layer of the convolutional neural network, and input the processed high-dimensional features into a recurrent neural network for conversion to obtain the global features of each line point cloud with space-time consistency; concatenate, through a channel cascade layer of the convolutional neural network, the high-dimensional features of each line point cloud and the features obtained by processing those high-dimensional features through the full connection layer with the corresponding line point cloud's global features with space-time consistency, to obtain the fusion features of each line point cloud; and merge the fusion features of the line point clouds through a data cascade layer of the convolutional neural network to obtain the fusion features of the frame point cloud.
Optionally, the recurrent neural network is one of the following networks embedded in the convolutional neural network: a single-layer unidirectional long short-term memory (LSTM) network, a bidirectional LSTM network, a multi-layer LSTM network, or a gated recurrent unit (GRU) network.
According to yet another aspect of an embodiment of the present invention, an electronic device is provided.
An electronic device, comprising: one or more processors; a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the point cloud semantic parsing method provided by the invention.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer-readable medium, on which a computer program is stored, which, when executed by a processor, implements the point cloud semantic parsing method provided by the invention.
One embodiment of the above invention has the following advantages or benefits: semantic analysis is performed on each frame of point cloud in sequence, where the semantic analysis of each frame includes: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; performing full connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud; and outputting the point cloud identification information of the multi-frame point clouds in the time sequence. The method is not affected by moving objects or by the misalignment of the same scene point across sequence frames, and improves the consistency between the parsing results of successive frames, thereby obtaining parsing results with space-time consistency, avoiding repeated computation, and yielding more accurate and robust results.
Further effects of the above non-conventional alternatives are described below in connection with specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic illustration of a street view laser point cloud spliced from a sequence of single frame point clouds according to the prior art;
FIG. 2 is a schematic diagram illustrating the main steps of a point cloud semantic parsing method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a point cloud semantic parsing network model according to one embodiment of the invention;
FIG. 4 is a schematic diagram of a recurrent neural network at a time step t, in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of the basic cell structure of an LSTM according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a SubNet sub-network structure according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the main blocks of a point cloud semantic analysis device according to an embodiment of the present invention;
FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 9 is a schematic structural diagram of a computer system suitable for implementing a terminal device or a server according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
As will be appreciated by one skilled in the art, embodiments of the present invention may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
FIG. 2 is a schematic diagram illustrating the main steps of a point cloud semantic parsing method according to an embodiment of the present invention.
As shown in fig. 2, the point cloud semantic analysis method according to the embodiment of the present invention mainly includes the following steps S201 to S203.
Step S201: laser point cloud data of a target scene are input, and the laser point cloud data comprises a plurality of frames of point clouds in a time sequence.
Multiple frames of point clouds in the time series may be input frame by frame and sequentially parsed through step S202.
Step S202: performing semantic analysis on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises the following steps: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; and performing full connection layer conversion and probability normalization on the fusion features to obtain the point cloud identification information of the frame point cloud.
The point cloud identification information represents the probability that each point in the frame of point cloud belongs to each category of target in the target scene.
In an embodiment, the step of extracting a high-dimensional feature and a global feature with space-time consistency from the frame point cloud, and obtaining a fusion feature of the frame point cloud by using the global feature with space-time consistency and the high-dimensional feature may specifically include:
extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network;
processing the high-dimensional features of the frame point cloud through a full connection layer and a pooling layer of the convolutional neural network, and inputting the processed high-dimensional features into a recurrent neural network for conversion to obtain global features of the frame point cloud with space-time consistency;
and concatenating, through a channel cascade layer of the convolutional neural network, the high-dimensional features and the features obtained by processing the high-dimensional features through the full connection layer with the global features with space-time consistency, to obtain the fusion features of the frame point cloud, as the sketch below illustrates.
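A hedged PyTorch sketch of this frame-level variant follows; the layer sizes (7 -> 128 -> 512) and class count are assumptions rather than the patent's exact dimensions, and the SubNet stand-in is deliberately simplified:

    import torch
    from torch import nn

    class FrameParser(nn.Module):
        # Sketch of the first embodiment; all dimensions are assumed.
        def __init__(self, in_dim=7, feat_dim=128, glob_dim=512, num_classes=8):
            super().__init__()
            self.subnet = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                        nn.Linear(64, feat_dim), nn.ReLU())
            self.fc = nn.Linear(feat_dim, glob_dim)      # full connection layer
            self.lstm = nn.LSTM(glob_dim, glob_dim, batch_first=True)
            self.head = nn.Linear(feat_dim + 2 * glob_dim, num_classes)
            self.state = None                            # carried across frames

        def forward(self, pts):                          # pts: (Nt, 7)
            h = self.subnet(pts)                         # (Nt, 128) high-dim features
            hf = self.fc(h)                              # (Nt, 512) FC-processed
            g = hf.max(dim=0, keepdim=True).values       # (1, 512) pooled global
            g, self.state = self.lstm(g.unsqueeze(0), self.state)  # conversion
            g = g.squeeze(0).expand(h.size(0), -1)       # Expand to every point
            fused = torch.cat([h, hf, g], dim=1)         # channel cascade (Concat)
            return torch.softmax(self.head(fused), dim=1)  # (Nt, M)

For training with BPTT the carried state would be detached between sequences or the whole sequence unrolled; the sketch only shows the inference-time data flow.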
As another embodiment, the step of extracting the high-dimensional features and the global features with space-time consistency from the frame point cloud, and obtaining the fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features may specifically include:
extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network;
inputting the high-dimensional features of the frame point cloud into a data slicing layer of the convolutional neural network for slicing by laser line number, to obtain the high-dimensional features of each line point cloud corresponding to each laser line number;
processing the high-dimensional features of each line point cloud through a full connection layer and a pooling layer of the convolutional neural network, and inputting the processed high-dimensional features into a recurrent neural network for conversion to obtain the global features of each line point cloud with space-time consistency;
concatenating, through a channel cascade layer of the convolutional neural network, the high-dimensional features of each line point cloud and the features obtained by processing those high-dimensional features through the full connection layer with the corresponding line point cloud's global features with space-time consistency, to obtain the fusion features of each line point cloud;
and merging the fusion features of the line point clouds through a data cascade layer of the convolutional neural network to obtain the fusion features of the frame point cloud (see the sketch following this list).
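A hedged sketch of this per-line variant (the 16 laser lines and channel widths are assumptions, and only one scale is shown where Fig. 3 uses three):

    import torch
    from torch import nn

    K, FEAT, GLOB = 16, 128, 256           # laser lines, channel widths (assumed)
    fc = nn.Linear(FEAT, GLOB)
    lstms = nn.ModuleList(nn.LSTM(GLOB, GLOB, batch_first=True) for _ in range(K))
    states = [None] * K                    # one recurrent state per laser line

    def fuse_frame(h, line_ids):
        # h: (Nt, FEAT) high-dim features; line_ids: (Nt,) laser line per point
        fused = []
        for k in range(K):                 # independent update per laser line
            hk = h[line_ids == k]          # Segment: this line's features
            if hk.numel() == 0:            # this sketch simply skips empty lines
                continue
            hf = fc(hk)                    # full connection layer
            g = hf.max(dim=0, keepdim=True).values             # Pooling -> (1, GLOB)
            g, states[k] = lstms[k](g.unsqueeze(0), states[k]) # space-time conversion
            g = g.squeeze(0).expand(hk.size(0), -1)            # Expand to line points
            fused.append(torch.cat([hk, hf, g], dim=1))        # Concat per line
        return torch.cat(fused, dim=0)     # Merge: fusion features of the frame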
The recurrent neural network may be one of the following networks embedded in the convolutional neural network: a single-layer unidirectional long short-term memory (LSTM) network, a bidirectional LSTM network, a multi-layer LSTM network, or a gated recurrent unit (GRU) network.
Step S203: and outputting the point cloud identification information of the multi-frame point cloud on the time sequence.
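Putting steps S201 to S203 together, the following is a minimal Python sketch of the frame-by-frame parsing loop (function and variable names are illustrative, not taken from the patent; any per-frame parser that carries its recurrent state between calls fits this loop):

    import numpy as np

    def parse_sequence(frames, parse_frame):
        # Steps S201-S203 in miniature: feed the frames in time order and
        # collect each frame's point cloud identification information.
        results = []
        for pts in frames:               # pts: (Nt, 7) one frame of laser points
            probs = parse_frame(pts)     # (Nt, M) per-class probabilities
            results.append(probs)
        return results

    # toy usage with a dummy parser that spreads probability over M = 3 classes
    dummy = lambda pts: np.full((pts.shape[0], 3), 1.0 / 3.0)
    out = parse_sequence([np.random.rand(10, 7), np.random.rand(12, 7)], dummy)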
The embodiment of the invention combines a recurrent neural network with the convolutional neural network used to parse a single-frame point cloud, and the recurrent neural network preserves the space-time consistency of the parsing results. The method is not affected by moving objects or by the misalignment of the same scene point across sequence frames: although the same scene point is spatially misaligned between single sequence frames, the recurrent network can learn to update and propagate state across this misalignment. Each frame of point cloud is parsed exactly once, so there is no repeated computation. The recurrent neural network passes the previous frame's result on to the next frame, improving the consistency between successive frames' parsing results and thereby obtaining parsing results with space-time consistency; this addresses well the space-time consistency problem of parsing large-scale point clouds stitched from single-frame point clouds.
FIG. 3 is a schematic diagram of a point cloud semantic parsing network model according to a first embodiment of the invention.
The embodiment of the invention provides a new deep neural network model (called point cloud semantic analysis network model) aiming at the scene analysis and tracking problems of large-scale time sequence point clouds, for example, the point cloud semantic analysis network model (called network model for short) shown in fig. 3.
The input to the network model of the embodiment of the invention is one frame of point cloud at a time in a time sequence; the point cloud may be collected by 16-line, 32-line, or 64-line lidar devices. The network model shown in Fig. 3 takes a 16-line, 7-channel laser point cloud as an example.
The point cloud attributes comprise the spatial coordinates (X, Y, Z), the reflection intensity I, and optionally color (R, G, B); the output is the per-frame parsing result in the time sequence. Because single-line point clouds show strong space-time correlation between adjacent frames, a recurrent neural network is established for each line of point cloud and updated independently. At each time t, laser point cloud data Ntx7 is input, where Nt is the number of points, and point cloud identification information NtxM is output, representing the probability that each point belongs to each category, where M is the number of identification categories.
In Fig. 3, the solid-line boxes (non-rounded boxes) are data layers and the dashed-line boxes are network layers. In the data layers, Ntx128 and 1x512 denote Nt pieces of 128-dimensional data and one piece of 512-dimensional data, respectively; the other data layers are read analogously and are not explained one by one below. In the network layers, the basic neurons include a channel slicing layer (Split), a data slicing layer (Segment), a feature space conversion layer (T-Net), fully connected layers (e.g., network layer 128x256), a downsampling layer (Pooling), a recurrent neural network (e.g., an LSTM long short-term memory network), a channel cascade layer (Concat), a data cascade layer (Merge), an expansion layer (Expand), and a normalization layer (Softmax).
The specific network layer is defined as:
The data slicing layer (Segment) segments the input data by laser line number, for example splitting the input data Ntx128 into 16 parts, namely Nt_1x128, Nt_2x128, …, Nt_16x128. The channel slicing layer (Split) splits the input data by channel, for example slicing each Nt_Kx7 into Nt_Kx3 and Nt_Kx4 according to the point cloud geometric attributes (X, Y, Z) and the intensity/color attributes (I, R, G, B). The feature space conversion layer (T-Net) applies geometric and feature transformations to the data, aiming to convert linearly correlated attributes into orthogonal ones in a manner similar to principal component analysis, so as to extract more effective features. A fully connected layer is a matrix product, for example (Nx4) x (4x64) = (Nx64), which converts each point's 4-dimensional data into 64-dimensional data. The downsampling layer (Pooling) takes the maximum or mean of the point cloud data in each channel, e.g., Nt_Kx512 is converted to 1x512 data. The recurrent neural network (e.g., an LSTM long short-term memory network) is described in detail later. The channel cascade layer (Concat) merges data along channels, for example cascading Nt_Kx128 and Nt_Kx256 into Nt_Kx384. The data cascade layer (Merge) merges data by laser number, for example merging the 16 line point clouds Nt_1x3712, Nt_2x3712, …, Nt_16x3712 into Ntx3712. The expansion layer (Expand) replicates one data item into several, for example expanding 1x1792 data into Nt_Kx1792 data. The normalization layer (Softmax) performs a normalization of the data, namely:
softmax(x_i) = exp(x_i) / Σ_j exp(x_j)
where x_i and x_j denote the components of one data item under different channels. In Fig. 3, the input of softmax is NtxM, meaning there are Nt data items with M channels each; the softmax operation is invoked independently on each item's 1xM vector, with j ranging over [1, M]. The formula shows that the denominator of softmax is the sum of exp over all channels and the numerator is the exp of a single channel, so each output is a probability and the probabilities over all channels sum to 1.
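To make the layer definitions above concrete, here is a hedged PyTorch sketch of the basic operations as tensor manipulations (all sizes are assumed and the T-Net transformation is omitted):

    import torch

    Nt, K = 2048, 16                          # points per frame, laser lines (assumed)
    pts = torch.randn(Nt, 7)                  # (X, Y, Z, I, R, G, B)
    line_ids = torch.randint(0, K, (Nt,))     # laser line number of each point
    feat = torch.randn(Nt, 512)

    geom, attr = pts.split([3, 4], dim=1)               # Split: channels 3 + 4
    per_line = [feat[line_ids == k] for k in range(K)]  # Segment: slice by line
    pooled = feat.max(dim=0, keepdim=True).values       # Pooling: Ntx512 -> 1x512
    concat = torch.cat([feat, feat], dim=1)             # Concat: merge channels
    merged = torch.cat(per_line, dim=0)                 # Merge: merge by laser number
    expanded = pooled.expand(Nt, -1)                    # Expand: 1x512 -> Ntx512
    probs = torch.softmax(torch.randn(Nt, 5), dim=1)    # Softmax: rows sum to 1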
The following focuses on the recurrent neural network. Compared with a single-frame convolutional neural network, the recurrent neural network has a memory function: it internally stores a memory state s_t that is automatically updated as time advances. Fig. 4 is a schematic diagram of a recurrent neural network at a time step t, according to an embodiment of the present invention. Its output is:
y_t = softmax(W_y s_t)
the state conversion relationship is as follows:
s_t = sigmoid(W_x x_t + W_s s_{t-1})
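These two equations transcribe directly into a toy NumPy step (bias terms omitted, matching the formulas; all sizes are illustrative):

    import numpy as np

    def rnn_step(x_t, s_prev, W_x, W_s, W_y):
        # state update: s_t = sigmoid(W_x x_t + W_s s_{t-1})
        s_t = 1.0 / (1.0 + np.exp(-(W_x @ x_t + W_s @ s_prev)))
        # output: y_t = softmax(W_y s_t)
        z = W_y @ s_t
        y_t = np.exp(z - z.max())
        return s_t, y_t / y_t.sum()

    rng = np.random.default_rng(0)
    W_x, W_s, W_y = (rng.normal(size=(8, 4)) * 0.1,
                     rng.normal(size=(8, 8)) * 0.1,
                     rng.normal(size=(3, 8)))
    s = np.zeros(8)
    for x in rng.normal(size=(5, 4)):          # five time steps
        s, y = rnn_step(x, s, W_x, W_s, W_y)   # state carried across steps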
The recurrent neural network can be trained with the backpropagation-through-time (BPTT) gradient algorithm. Since each weight gradient in this algorithm is the sum of that weight's gradients over all time steps, a sequence of time steps must be treated as one sample so that the gradients are updated jointly, as the sketch below illustrates. To avoid gradient explosion and vanishing, most applications adopt a special recurrent neural network, the long short-term memory (LSTM) network: by means of an input gate, a forget gate, and an output gate, the LSTM discards useless long-term memory and filters for valuable inputs and the desired outputs. The basic cell structure of the LSTM is shown in Fig. 5.
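The joint update over a whole sequence, referenced above, can be sketched in PyTorch as follows (all sizes assumed; the point is that one backward pass sums each weight's gradient over all time steps):

    import torch
    from torch import nn

    rnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
    head = nn.Linear(16, 3)
    x = torch.randn(1, 20, 8)                # one sequence of 20 time steps
    target = torch.randint(0, 3, (1, 20))
    out, _ = rnn(x)                          # unrolled over the full sequence
    loss = nn.functional.cross_entropy(head(out).transpose(1, 2), target)
    loss.backward()                          # BPTT: gradients summed across time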
In the LSTM cell of Fig. 5, the input gate is defined as:
i = σ(x_t U_i + s_{t-1} W_i)
Here σ(·) is the Sigmoid function, used as the activation function.
The forget gate is defined as:
f = σ(x_t U_f + s_{t-1} W_f)
The output gate is defined as:
o = σ(x_t U_o + s_{t-1} W_o)
the candidate hidden states are defined as:
g = tanh(x_t U_g + s_{t-1} W_g)
the internal memory state is:
c_t = c_{t-1} · f + i · g
the hidden state is:
s_t = tanh(c_t) · o
In the above equations and in Figs. 4 and 5, U and W denote weights; for example, U_f and W_f denote the weights of the forget gate, and the other gates are analogous.
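The cell equations above transcribe into a toy NumPy update (no bias terms, matching the formulas; U maps the input and W the previous hidden state for each gate):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, s_prev, c_prev, U, W):
        i = sigmoid(x_t @ U['i'] + s_prev @ W['i'])   # input gate
        f = sigmoid(x_t @ U['f'] + s_prev @ W['f'])   # forget gate
        o = sigmoid(x_t @ U['o'] + s_prev @ W['o'])   # output gate
        g = np.tanh(x_t @ U['g'] + s_prev @ W['g'])   # candidate hidden state
        c_t = c_prev * f + i * g                      # internal memory state
        s_t = np.tanh(c_t) * o                        # hidden state
        return s_t, c_t

    rng = np.random.default_rng(1)
    U = {k: rng.normal(size=(4, 8)) * 0.1 for k in 'ifog'}
    W = {k: rng.normal(size=(8, 8)) * 0.1 for k in 'ifog'}
    s, c = np.zeros(8), np.zeros(8)
    for x in rng.normal(size=(5, 4)):                 # five 4-dim time steps
        s, c = lstm_step(x, s, c, U, W)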
The network model of the embodiment of the invention splits the single-frame point cloud parsing convolutional network into two halves and embeds an LSTM network structure between them: the multi-dimensional features computed by the first-half point cloud convolutional neural network serve as the input of the LSTM network, and the output of the LSTM network is fed into the second-half convolutional neural network to complete the final point cloud parsing, so that the time-sequence state is updated by means of the LSTM network.
It should be noted that the present invention is not limited to completing point cloud parsing and tracking with a single-layer unidirectional LSTM network; a bidirectional LSTM network, a multi-layer LSTM network, or a gated recurrent unit (GRU) network may also be used to update the state frame by frame.
The SubNet sub-network in the network model of the embodiment of the invention is the feature extraction sub-network used to extract the high-dimensional features of each point: from the input laser point cloud data Ntx7 of a target scene, it extracts the high-dimensional features of the frame point cloud, yielding Ntx128, i.e., Nt pieces of 128-dimensional data. Fig. 6 shows a schematic diagram of a SubNet sub-network according to an embodiment of the present invention. The SubNet sub-network comprises a channel slicing layer (Split), feature space conversion layers (T-Net), matrix multiplications, a channel cascade layer (Concat), and several fully connected layers (e.g., 3x64, 128x128); each of these has been described above and is not repeated here. The advantage of the SubNet sub-network is that it applies to 7-channel laser point clouds: the laser point cloud is first split into two parts by channel, high-dimensional features are extracted from each part independently and then merged, thereby avoiding interference between channels of different attributes.
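A hedged PyTorch sketch of the SubNet idea (layer widths follow Fig. 6 only loosely, and the T-Net alignment is omitted, so this is just the channel-split-then-merge skeleton):

    import torch
    from torch import nn

    class SubNet(nn.Module):
        # Split the 7 channels into geometry (X, Y, Z) and attributes
        # (I, R, G, B), lift each independently, then concatenate and mix.
        def __init__(self):
            super().__init__()
            self.geo = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
            self.att = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 64))
            self.mix = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

        def forward(self, pts):                 # pts: (Nt, 7)
            g, a = pts.split([3, 4], dim=1)     # Split by channel
            return self.mix(torch.cat([self.geo(g), self.att(a)], dim=1))  # (Nt, 128)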
The process of point cloud semantic parsing using the network model of fig. 3 includes:
inputting laser point cloud data of a target scene, e.g. one frame of laser point cloud data Ntx7, into the SubNet sub-network (the feature extraction sub-network) to extract the high-dimensional features of the frame point cloud, obtaining feature data Ntx128, that is, Nt pieces of 128-dimensional data;
inputting the high-dimensional features (feature data Ntx128) of the frame point cloud into a data slicing layer (Segment) to slice them by laser line number; because point clouds of the same laser line exhibit stronger space-time consistency across frames, this operation lets space-time-consistent features be extracted separately for each laser line. Slicing the input Ntx128 yields Nt_1x128, …, Nt_Kx128, …, Nt_16x128 (1 ≤ K ≤ 16), called the high-dimensional features of each line point cloud corresponding to each laser line number;
processing the high-dimensional features of each line point cloud through a full connection layer and a pooling layer of the convolutional neural network, and inputting the processed high-dimensional features into a recurrent neural network for conversion to obtain the global features of each line point cloud with space-time consistency. For each line point cloud, high-order feature extraction in the single-line space-time feature extraction process may be performed in a multi-scale space, i.e., for each line point cloud corresponding to Nt_1x128, …, Nt_Kx128, …, Nt_16x128 in Fig. 3. Fig. 3 illustrates only the K-th line point cloud (whose sliced high-dimensional features are Nt_Kx128): its high-order feature extraction is divided into three scales, corresponding to the fully connected layers 128x256, 128x512, and 128x1024. The flow for the other line point clouds is the same as for Nt_Kx128 and is not drawn in detail in Fig. 3 (it is represented by the ellipsis "……");
Taking the Nt_Kx128 data output by the data slicing layer (Segment) as an example, and among the three scale spaces in Fig. 3 taking the scale corresponding to the fully connected layer 128x256: the high-dimensional features Nt_Kx128 of the line point cloud are processed by the fully connected layer (128x256) to give the data-layer feature Nt_Kx256; pooling (Pooling) then extracts a global feature of the whole laser line (Nt_Kx256 -> 1x256), giving the 1x256 feature output from the pooling layer; finally, conversion by a recurrent neural network (LSTM_K_P3) yields the space-time-consistent global feature (i.e., the 1x256 feature output by the LSTM_K_P3 layer).
The processing flow in the other two scale spaces of the line point cloud (fully connected layers 128x512 and 128x1024) is the same as above and is not repeated. Compared with single-scale features, multi-scale features accommodate both contextual relationships and local details.
The space-time-consistent global features of the three scale spaces are processed through a first channel cascade layer (Concat) and then through an expansion layer (Expand), giving the line point cloud's space-time-consistent global feature data Nt_Kx1792. A second channel cascade layer (Concat) cascades the feature Nt_Kx1792 onto the original laser point cloud features of the line point cloud to obtain the fusion features of the line point cloud; the original features comprise the line point cloud's high-dimensional features (i.e., the Nt_Kx128 data obtained by slicing the Ntx128 high-dimensional features through the Segment layer) and the features obtained by processing those high-dimensional features through the fully connected layers (the feature data Nt_Kx256, Nt_Kx512, and Nt_Kx1024 from the fully connected layers 128x256, 128x512, and 128x1024). The feature Nt_Kx1792 output by the expansion layer (Expand) therefore accounts for space-time consistency, global context, local details, and other factors, and the fusion of multi-scale features strengthens scale invariance.
The fusion features of the line point clouds are merged through a data cascade layer (Merge) to obtain the fusion features of the frame point cloud, e.g. fusion feature data Ntx3712; several fully connected layer conversions (the network layers 3712x1024, 1024x512, 512x128, and 128xM) and softmax probability normalization are then applied to the fusion features to obtain the probability that each point in the frame point cloud belongs to each class of targets in the target scene (called the point cloud identification information).
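The channel widths in Fig. 3 can be checked with a small sanity computation (not part of the patent):

    scales = (256, 512, 1024)              # three per-line LSTM global features
    expanded = sum(scales)                 # Concat + Expand -> Nt_K x 1792
    fused = 128 + sum(scales) + expanded   # high-dim + FC features + global
    assert expanded == 1792 and fused == 3712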
The above semantic parsing process is applied to each frame of point cloud in the time sequence, and finally the point cloud identification information of the multi-frame point clouds in the time sequence is output.
The point cloud semantic parsing network model according to the second embodiment of the present invention numbers the input laser point clouds uniformly, i.e., the Segment and Merge layers in the network model of Fig. 3 are removed. Accordingly, relative to Fig. 3, there is no need to merge the fusion features of the 16 line point clouds Nt_1x128, …, Nt_Kx128, …, Nt_16x128 (1 ≤ K ≤ 16), and the per-line processing results Nt_1x3712, …, Nt_Kx3712, …, Nt_16x3712 of Fig. 3 do not arise. The specific network layers and data layers are as described in detail for the first embodiment and are not repeated here.
The point cloud semantic analysis by using the network models of the first embodiment and the second embodiment of the invention comprises a model training stage and a model testing stage.
In the model training stage, a large number of labeled sequences of single-frame point clouds must be provided as training samples. The data acquisition modes include but are not limited to: 1. collecting point clouds on a vehicle, extracting key frames at a fixed distance along the vehicle's travel track, and manually labeling the point clouds of the key frames; 2. setting a driving track in a simulation environment, sampling track points at a fixed distance along the track, and synchronously generating one frame of point cloud with its labeling result at each track point.
In the model testing stage, suppose a vehicle-mounted system carries a lidar to collect road point clouds whose attributes comprise the three-dimensional coordinates (X, Y, Z) and the reflection intensity I; optionally, a color camera is carried synchronously to capture video images that colorize the point cloud, so that each point additionally acquires three (R, G, B) channels (red, green, and blue color information). The channels of the test samples must be consistent with those of the training samples. Key frames are then extracted at a fixed distance: at each travel track point, the set of points collected by the lidar forms one frame of point cloud, and the trained model predicts the parsing results of the key-frame point clouds in sequence. Each time, one frame of point cloud is input into the network model of the embodiment of the invention, which performs semantic parsing on that frame to obtain the parsing result.
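A hedged sketch of the fixed-distance key-frame extraction mentioned above (the 5 m spacing is an assumption; the patent only says "fixed distance"):

    import numpy as np

    def extract_keyframes(track_xyz, frames, spacing=5.0):
        # track_xyz: (T, 3) travel track points; frames: list of T point clouds.
        steps = np.linalg.norm(np.diff(track_xyz, axis=0), axis=1)
        dist = np.concatenate([[0.0], np.cumsum(steps)])  # distance along track
        keep, next_at = [], 0.0
        for i, d in enumerate(dist):
            if d >= next_at:                # one key frame every `spacing` meters
                keep.append(frames[i])
                next_at += spacing
        return keep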
Fig. 7 is a schematic diagram of main blocks of a point cloud semantic analysis device according to an embodiment of the present invention.
The point cloud semantic analysis device 700 of the embodiment of the invention mainly comprises: a point cloud data input module 701, a point cloud semantic analysis module 702, and a point cloud identification information output module 703.
The point cloud data input module 701 is configured to input laser point cloud data of a target scene, where the laser point cloud data includes multiple frames of point clouds in a time sequence.
A point cloud semantic parsing module 702, configured to perform semantic parsing on each frame of point cloud in sequence, where the semantic parsing on each frame of point cloud includes: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; and performing full-connection layer conversion and probability normalization on the fusion characteristics to obtain point cloud identification information of the frame point cloud, wherein the point cloud identification information represents the probability that each point in the frame point cloud belongs to each category of targets in the target scene.
In one embodiment, the point cloud semantic parsing module 702 may include a first parsing processing unit to: extract high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network; process the high-dimensional features of the frame point cloud through a full connection layer and a pooling layer of the convolutional neural network, and input the processed high-dimensional features into a recurrent neural network for conversion to obtain global features of the frame point cloud with space-time consistency; and concatenate, through a channel cascade layer of the convolutional neural network, the high-dimensional features and the features obtained by processing the high-dimensional features through the full connection layer with the global features with space-time consistency, to obtain the fusion features of the frame point cloud.
In another embodiment, the point cloud semantic parsing module 702 may include a second parsing processing unit to: extract high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network; input the high-dimensional features of the frame point cloud into a data slicing layer of the convolutional neural network for slicing by laser line number, to obtain the high-dimensional features of each line point cloud corresponding to each laser line number; process the high-dimensional features of each line point cloud through a full connection layer and a pooling layer of the convolutional neural network, and input the processed high-dimensional features into a recurrent neural network for conversion to obtain the global features of each line point cloud with space-time consistency; concatenate, through a channel cascade layer of the convolutional neural network, the high-dimensional features of each line point cloud and the features obtained by processing those high-dimensional features through the full connection layer with the corresponding line point cloud's global features with space-time consistency, to obtain the fusion features of each line point cloud; and merge the fusion features of the line point clouds through a data cascade layer of the convolutional neural network to obtain the fusion features of the frame point cloud.
The recurrent neural network of the embodiment of the invention is one of the following networks embedded in the convolutional neural network: a single-layer unidirectional long short-term memory (LSTM) network, a bidirectional LSTM network, a multi-layer LSTM network, or a gated recurrent unit (GRU) network.
A point cloud identification information output module 703, configured to output point cloud identification information of multiple frames of point clouds in a time sequence.
In addition, the specific implementation contents of the point cloud semantic analysis device in the embodiment of the invention have been described in detail in the above point cloud semantic analysis method, so repeated contents are not described again here.
Fig. 8 shows an exemplary system architecture 800 to which the point cloud semantic parsing method or the point cloud semantic parsing apparatus of the embodiments of the invention may be applied.
As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805. The network 804 serves to provide a medium for communication links between the terminal devices 801, 802, 803 and the server 805. Network 804 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 801, 802, 803 to interact with a server 805 over a network 804 to receive or send messages or the like. The terminal devices 801, 802, 803 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 805 may be a server that provides various services, such as a back-office management server (for example only) that supports shopping-like websites browsed by users using the terminal devices 801, 802, 803. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the point cloud semantic analysis method provided by the embodiment of the present invention is generally executed by the server 805, and accordingly, the point cloud semantic analysis device is generally disposed in the server 805.
It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use in implementing a terminal device or server of an embodiment of the present application. The terminal device or the server shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.
The following components are connected to the I/O interface 905: an input portion 906 including a keyboard, a mouse, and the like; an output section 907 including components such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is mounted into the storage section 908 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the main step schematic may be implemented as computer software programs. For example, the disclosed embodiments of the invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 909, and/or installed from the removable medium 911. The above-described functions defined in the system of the present application are executed when the computer program is executed by a Central Processing Unit (CPU) 901.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The principal step diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the main step diagrams or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or block diagrams, and combinations of blocks in the block diagrams or block diagrams, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprises a point cloud data input module, a point cloud semantic analysis module and a point cloud identification information output module. The names of these modules do not in some cases constitute a limitation on the module itself, and for example, the point cloud data input module may also be described as a "module for inputting laser point cloud data of a target scene".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments or may exist separately without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by a device, cause the device to: input laser point cloud data of a target scene, wherein the laser point cloud data comprises a plurality of frames of point clouds in a time sequence; perform semantic analysis on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; performing full connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud, wherein the point cloud identification information represents the probability that each point in the frame point cloud belongs to each class of targets in the target scene; and output the point cloud identification information of the multi-frame point clouds in the time sequence.
According to the technical scheme of the embodiment of the invention, semantic analysis is performed on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises the following steps: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; performing full connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud; and outputting the point cloud identification information of the multi-frame point clouds in the time sequence. The method is not affected by moving objects or by the misalignment of the same scene point across sequence frames, and improves the consistency between the parsing results of successive frames, thereby obtaining parsing results with space-time consistency, avoiding repeated computation, and yielding more accurate and robust results.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A point cloud semantic analysis method, characterized by comprising the following steps:
inputting laser point cloud data of a target scene, wherein the laser point cloud data comprises a plurality of frames of point clouds in a time sequence;
performing semantic analysis on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises the following steps: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; and performing full-connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud, wherein the point cloud identification information represents the probability that each point in the frame point cloud belongs to each class of targets in the target scene;
and outputting the point cloud identification information of the multiple frames of point clouds in the time sequence.
2. The method of claim 1, wherein the step of extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining the fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features comprises:
extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network;
processing the high-dimensional features of the frame point cloud through a full-connection layer and a pooling layer of the convolutional neural network, and inputting the processed features into a recurrent neural network for conversion to obtain global features of the frame point cloud with space-time consistency;
and concatenating, through a channel cascade layer of the convolutional neural network, the high-dimensional features and the features obtained after the high-dimensional features are processed by the full-connection layer with the global features with space-time consistency, so as to obtain the fusion features of the frame point cloud.
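A hedged PyTorch sketch of this frame-level variant follows. It assumes an LSTM as the embedded recurrent network (claim 4 lists the admissible alternatives), assumes max pooling as the pooling layer, and carries the recurrent state across frames inside the module; the class name `FrameGlobalFusion` and all sizes are illustrative, not drawn from the claims.

```python
# Illustrative sketch; the LSTM choice and all sizes are assumptions.
import torch
import torch.nn as nn

class FrameGlobalFusion(nn.Module):
    def __init__(self, feat_dim=64, global_dim=128):
        super().__init__()
        self.fc = nn.Linear(feat_dim, global_dim)   # full-connection layer
        self.rnn = nn.LSTM(global_dim, global_dim)  # embedded recurrent network
        self.state = None                           # hidden state carried across frames

    def forward(self, point_feats):
        # point_feats: (N, feat_dim) high-dimensional features of one frame
        per_point = self.fc(point_feats)                      # (N, global_dim)
        pooled = per_point.max(dim=0).values                  # pooling layer -> frame vector
        out, self.state = self.rnn(pooled.view(1, 1, -1), self.state)
        self.state = tuple(s.detach() for s in self.state)    # truncate BPTT across frames
        global_feat = out.view(-1)                            # space-time consistent feature
        # channel cascade: raw features, FC-processed features, tiled global feature
        fused = torch.cat([point_feats, per_point,
                           global_feat.expand(point_feats.shape[0], -1)], dim=1)
        return fused                                          # fusion features of the frame
```

Because the recurrent state persists from one call to the next, each frame's global feature depends on the preceding frames, which is one plausible reading of the "space-time consistency" the claim describes.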
3. The method of claim 1, wherein the step of extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining the fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features comprises:
extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network;
inputting the high-dimensional features of the frame point cloud into a data slicing layer of the convolutional neural network for slicing by laser line number, so as to obtain the high-dimensional features of the line point cloud corresponding to each laser line number;
processing the high-dimensional features of each line point cloud through a full-connection layer and a pooling layer of the convolutional neural network, and inputting the processed features into a recurrent neural network for conversion to obtain global features of each line point cloud with space-time consistency;
concatenating, through a channel cascade layer of the convolutional neural network, the high-dimensional features of each line point cloud and the features obtained after those features are processed by the full-connection layer with the corresponding global features with space-time consistency of that line point cloud, so as to obtain the fusion features of each line point cloud;
and merging the fusion features of the line point clouds through a data cascade layer of the convolutional neural network to obtain the fusion features of the frame point cloud.
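Under similar assumptions, the line-wise variant of claim 3 might be sketched as below: per-point features are sliced by laser line number, each line is pooled into its own descriptor, the recurrent network converts the line sequence into per-line global features, and the per-line fusion features are merged back over the whole frame. The helper name `fuse_by_line`, the 64-line count, and the choice of the line sequence as the recurrent axis are guesses for illustration only.

```python
# Illustrative sketch; helper name, 64-line count and sizes are assumptions.
import torch
import torch.nn as nn

def fuse_by_line(point_feats, line_ids, fc, rnn, num_lines=64):
    # point_feats: (N, F) high-dimensional features; line_ids: (N,) laser line numbers
    per_point = fc(point_feats)                       # full-connection layer, (N, G)
    line_descs = []
    for line in range(num_lines):                     # data slicing by laser line number
        mask = line_ids == line
        if mask.any():
            line_descs.append(per_point[mask].max(dim=0).values)  # pooling per line
        else:
            line_descs.append(per_point.new_zeros(per_point.shape[1]))
    seq = torch.stack(line_descs).unsqueeze(1)        # (num_lines, 1, G) line sequence
    global_per_line, _ = rnn(seq)                     # recurrent conversion over lines
    global_per_line = global_per_line.squeeze(1)      # (num_lines, G)
    # channel cascade per line, then data cascade back over the whole frame
    fused = torch.cat([point_feats, per_point, global_per_line[line_ids]], dim=1)
    return fused                                      # (N, F + 2G) fusion features

# Example wiring with assumed sizes:
fc, rnn = nn.Linear(64, 128), nn.LSTM(128, 128)
pts = torch.randn(1000, 64)
lines = torch.randint(0, 64, (1000,))
print(fuse_by_line(pts, lines, fc, rnn).shape)  # torch.Size([1000, 320])
```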
4. The method of claim 2 or 3, wherein the recurrent neural network embedded into the convolutional neural network is one of the following:
a single-layer unidirectional long short-term memory network, a bidirectional long short-term memory network, a multi-layer long short-term memory network, or a gated recurrent neural network.
5. A point cloud semantic analysis device, characterized by comprising:
the system comprises a point cloud data input module, a point cloud data output module and a data processing module, wherein the point cloud data input module is used for inputting laser point cloud data of a target scene, and the laser point cloud data comprises multi-frame point clouds in a time sequence;
a point cloud semantic analysis module, configured to perform semantic analysis on each frame of point cloud in sequence, wherein the semantic analysis on each frame of point cloud comprises: extracting high-dimensional features and global features with space-time consistency from the frame point cloud, and obtaining fusion features of the frame point cloud by using the global features with space-time consistency and the high-dimensional features; and performing full-connection layer conversion and probability normalization on the fusion features to obtain point cloud identification information of the frame point cloud, wherein the point cloud identification information represents the probability that each point in the frame point cloud belongs to each class of targets in the target scene;
and a point cloud identification information output module, configured to output the point cloud identification information of the multiple frames of point clouds in the time sequence.
6. The apparatus of claim 5, wherein the point cloud semantic analysis module comprises a first analysis processing unit configured to:
extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network;
process the high-dimensional features of the frame point cloud through a full-connection layer and a pooling layer of the convolutional neural network, and input the processed features into a recurrent neural network for conversion to obtain global features of the frame point cloud with space-time consistency;
and concatenate, through a channel cascade layer of the convolutional neural network, the high-dimensional features and the features obtained after the high-dimensional features are processed by the full-connection layer with the global features with space-time consistency, so as to obtain the fusion features of the frame point cloud.
7. The apparatus of claim 5, wherein the point cloud semantic analysis module comprises a second analysis processing unit configured to:
extracting high-dimensional features of the frame point cloud through a feature extraction sub-network of a convolutional neural network;
input the high-dimensional features of the frame point cloud into a data slicing layer of the convolutional neural network for slicing by laser line number, so as to obtain the high-dimensional features of the line point cloud corresponding to each laser line number;
process the high-dimensional features of each line point cloud through a full-connection layer and a pooling layer of the convolutional neural network, and input the processed features into a recurrent neural network for conversion to obtain global features of each line point cloud with space-time consistency;
concatenate, through a channel cascade layer of the convolutional neural network, the high-dimensional features of each line point cloud and the features obtained after those features are processed by the full-connection layer with the corresponding global features with space-time consistency of that line point cloud, so as to obtain the fusion features of each line point cloud;
and merging the fusion features of the line point clouds through a data cascade layer of the convolutional neural network to obtain the fusion features of the frame point cloud.
8. The apparatus of claim 6 or 7, wherein the recurrent neural network embedded into the convolutional neural network is one of the following:
a single-layer unidirectional long short-term memory network, a bidirectional long short-term memory network, a multi-layer long short-term memory network, or a gated recurrent neural network.
9. An electronic device, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method recited in any of claims 1-4.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-4.
CN201910318406.7A 2019-04-19 2019-04-19 Point cloud semantic analysis method and device Pending CN111832358A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910318406.7A CN111832358A (en) 2019-04-19 2019-04-19 Point cloud semantic analysis method and device

Publications (1)

Publication Number Publication Date
CN111832358A (en) 2020-10-27

Family

ID=72911361

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108639A1 (en) * 2017-10-09 2019-04-11 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Semantic Segmentation of 3D Point Clouds
US20190107845A1 (en) * 2017-10-09 2019-04-11 Intel Corporation Drone clouds for video capture and creation
CN109086683A (en) * 2018-07-11 2018-12-25 Tsinghua University Hand pose regression method and system based on point cloud semantic enhancement

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Junhe Zhao; Chunlei Liu; Baochang Zhang: "PLSTMNet: A New Neural Network for Segmentation of Point Cloud", 2018 11th International Workshop on Human Friendly Robotics (HFR), pages 1-6 *
Qiangui Huang; Weiyue Wang; Ulrich Neumann: "Recurrent Slice Networks for 3D Segmentation of Point Clouds", 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1-10 *
R. Qi Charles; Hao Su; Mo Kaichun; Leonidas J. Guibas: "PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1-9 *
Xiaoqing Ye; Jiamao Li; Hexiao Huang; Liang Du; Xiaolin Zhang: "3D Recurrent Neural Networks with Context Fusion for Point Cloud Semantic Segmentation", ECCV 2018: Computer Vision – ECCV 2018, pages 1-16 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112801059A (en) * 2021-04-07 2021-05-14 广东众聚人工智能科技有限公司 Graph convolution network system and 3D object detection method based on graph convolution network system
CN113379767A (en) * 2021-06-18 2021-09-10 中国科学院深圳先进技术研究院 Method for constructing semantic disturbance reconstruction network for self-supervision point cloud learning
CN113379767B (en) * 2021-06-18 2022-07-08 中国科学院深圳先进技术研究院 Method for constructing semantic disturbance reconstruction network for self-supervision point cloud learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination