CN111709343B - Point cloud detection method and device, computer equipment and storage medium


Info

Publication number
CN111709343B
Authority
CN
China
Prior art keywords
data
image data
coding
point cloud
target
Prior art date
Legal status
Active
Application number
CN202010519325.6A
Other languages
Chinese (zh)
Other versions
CN111709343A (en)
Inventor
黄章帅
杨欣豫
陈世熹
韩旭
Current Assignee
Wenyuan Jingxing Beijing Technology Co ltd
Original Assignee
Guangzhou Weride Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangzhou Weride Technology Co Ltd
Priority to CN202010519325.6A
Publication of CN111709343A
Application granted
Publication of CN111709343B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The embodiment of the invention discloses a point cloud detection method, a point cloud detection device, computer equipment and a storage medium, wherein the method comprises the following steps: acquiring multiple frames of point cloud and multiple frames of original image data which are simultaneously acquired for the same visual range; respectively projecting the multiple frames of point cloud onto the multiple frames of original image data to obtain multiple frames of target image data; according to the temporal relation among the multiple frames of target image data, performing semantic segmentation on the point cloud and the pixel points in the target image data so as to identify semantic information of the pixel points on the obstacle; and assigning the semantic information of the pixel points to the point cloud corresponding to the pixel points. The visual features of the image data are combined with the spatial features of the point cloud, so that the target image data contains not only rich color and texture features but also features such as the coordinates of the point cloud and the laser intensity, greatly enriching the feature dimensions; semantic segmentation is performed with the time sequence between frames taken into account, so that the accuracy of the semantic information of the obstacle is improved.

Description

Point cloud detection method and device, computer equipment and storage medium
Technical Field
The embodiment of the invention relates to an environment sensing technology, in particular to a point cloud detection method, a point cloud detection device, computer equipment and a storage medium.
Background
In scenarios such as autonomous driving and automatic robot inspection, the laser radar (lidar) is a commonly used sensor: it detects point clouds of the surrounding environment, from which obstacles can be identified.
The point clouds detected by the lidar are sparse and noisy, and each point contains only a small amount of information, such as laser intensity and coordinates. As a result, problems such as missed obstacles and misclassified obstacles easily occur; for example, a pedestrian may be confused with a pillar, and noise points produced by dust, water spray, flying insects, and the like may be mistakenly identified as obstacles.
Disclosure of Invention
The embodiment of the invention provides a point cloud detection method, a point cloud detection device, computer equipment and a storage medium, which are used to address the limited information content and high noise of point clouds.
In a first aspect, an embodiment of the present invention provides a method for detecting a point cloud, including:
acquiring multi-frame point cloud and multi-frame original image data which are simultaneously acquired for the same visual range;
respectively projecting a plurality of frames of point clouds onto a plurality of frames of original image data to obtain a plurality of frames of target image data;
according to the time sequence relation among the multiple frames of target image data, carrying out semantic segmentation on point clouds and pixel points in the target image data so as to identify semantic information of the pixel points on the obstacle;
And giving semantic information of the pixel points to point clouds corresponding to the pixel points.
In a second aspect, an embodiment of the present invention further provides a point cloud detection apparatus, including:
the original data acquisition module is used for acquiring multi-frame point clouds and multi-frame original image data which are simultaneously acquired for the same visual range;
the point cloud projection module is used for respectively projecting a plurality of frames of point clouds onto a plurality of frames of original image data to obtain a plurality of frames of target image data;
the semantic segmentation module is used for carrying out semantic segmentation on point clouds and pixel points in the target image data according to the time sequence relation among the multiple frames of the target image data so as to identify semantic information of the pixel points on the obstacle;
and the semantic information assignment module is used for assigning the semantic information of the pixel point to the point cloud corresponding to the pixel point.
In a third aspect, an embodiment of the present invention further provides a computer apparatus, including:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the point cloud detection method as described in the first aspect.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the point cloud detection method according to the first aspect.
In this embodiment, multiple frames of point cloud and multiple frames of original image data collected simultaneously for the same visual range can be obtained, and the multiple frames of point cloud are respectively projected onto the multiple frames of original image data to obtain multiple frames of target image data. Semantic segmentation is then performed on the point cloud and the pixel points in the target image data according to the temporal relation among the multiple frames of target image data, so as to identify the semantic information of the pixel points on the obstacle, and the semantic information of the pixel points is assigned to the point cloud corresponding to the pixel points. The visual features of the image data are combined with the spatial features of the point cloud, so that the target image data contains not only rich color and texture features but also features such as the coordinates of the point cloud and the laser intensity, greatly enriching the feature dimensions. Moreover, semantic segmentation is performed with the time sequence between frames taken into account, fusing the operating-scene characteristics of the unmanned device into the segmentation, thereby improving the accuracy of the semantic information of the obstacle.
Drawings
Fig. 1 is a schematic structural diagram of an unmanned vehicle according to an embodiment of the present invention;
fig. 2 is a flowchart of a method for detecting a point cloud according to a first embodiment of the present invention;
fig. 3 is a flowchart of a point cloud detection method according to a second embodiment of the present invention;
fig. 4 is a schematic structural diagram of a semantic segmentation model according to a second embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a dimension pooling module according to a second embodiment of the present invention;
fig. 6 is a schematic structural diagram of a point cloud detecting device according to a third embodiment of the present invention;
fig. 7 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings.
Referring to fig. 1, there is shown an unmanned vehicle 100 to which the point cloud detection method and the point cloud detection apparatus of the embodiments of the present invention may be applied.
As shown in fig. 1, the unmanned vehicle 100 may include a drive control apparatus 101, a body bus 102, an ECU (Electronic Control Unit) 103, an ECU 104, an ECU 105, a sensor 106, a sensor 107, a sensor 108, and an actuator 109, an actuator 110, and an actuator 111.
The driving control apparatus (also referred to as an on-board brain) 101 is responsible for the overall intelligent control of the entire unmanned vehicle 100. The driving control apparatus 101 may be a separately provided controller, such as a programmable logic controller (PLC), a single-chip microcomputer, or an industrial controller; it may also be a device composed of other electronic components that have input/output ports and arithmetic control functions, or a computer device installed with a vehicle driving control application. The driving control device may analyze and process the data sent from each ECU and/or each sensor received on the body bus 102, make a corresponding decision, and send an instruction corresponding to the decision to the body bus.
The body bus 102 may be a bus for connecting the driving control device 101, the ECU 103, the ECU 104, the ECU 105, the sensor 106, the sensor 107, the sensor 108, and other devices of the unmanned vehicle 100 that are not shown. Because the high performance and reliability of the CAN (Controller Area Network) bus are widely accepted, the body bus commonly used in motor vehicles is currently the CAN bus. Of course, it is understood that the body bus may be another type of bus.
The body bus 102 may forward the instruction sent by the driving control device 101 to the ECU 103, the ECU 104, and the ECU 105, which parse the instruction and send it to the corresponding actuator for execution.
The sensors 106, 107, 108 include, but are not limited to, lidar, cameras, and the like.
It should be noted that, the point cloud detection method provided by the embodiment of the present invention may be executed by the driving control apparatus 101, and accordingly, the point cloud detection device is generally disposed in the driving control apparatus 101.
It should be understood that the number of unmanned vehicles, drive control devices, body buses, ECUs, actuators, and sensors in fig. 1 are merely illustrative. There may be any number of unmanned vehicles, drive control devices, body buses, ECU's, and sensors, as desired for implementation.
Example 1
Fig. 2 is a flowchart of a point cloud detection method according to an embodiment of the present invention. The method may be applied to the case of performing semantic analysis on image data according to a time sequence and assigning semantic information representing an obstacle to a point cloud. The method may be performed by a point cloud detection apparatus, which may be implemented by software and/or hardware and may be configured in a computer device, for example, unmanned devices such as unmanned vehicles and robots, as well as servers and personal computers. The method specifically includes the following steps:
S201, acquiring multi-frame point cloud and multi-frame original image data which are simultaneously acquired for the same visual range.
In this embodiment, the unmanned device is configured with a laser radar for detecting point clouds and a camera for collecting raw image data.
In the running process of unmanned equipment, the laser radar is continuously driven to scan the point cloud of the surrounding environment of the vehicle, and when the laser radar scans the visible range of the camera, a specific synchronizer triggers the camera to expose and collect image data to serve as original image data.
It should be noted that, besides the unmanned device collecting the point cloud and the original image data in real time, identifying the semantic information of the pixel points on the obstacle, and assigning it to the point cloud, the unmanned device may also send the point cloud and the original image data to a computing device after collection, and the computing device then identifies the semantic information of the pixel points on the obstacle and assigns it to the point cloud.
S202, respectively projecting the multi-frame point clouds onto multi-frame original image data to obtain multi-frame target image data.
In this embodiment, for three-dimensional point clouds detected by the lidar at the same time and original image data acquired by the camera, the point clouds of the lidar may be transformed to coordinates of the camera according to a positional relationship between the lidar and the camera and projected onto the original image data.
In a specific implementation, the internal parameters and external parameters of the camera can be queried.
The internal parameters of the camera include its focal length, the imaging pixel size, etc., and can be represented by a matrix K.
The external parameters are the parameters of the camera in the world coordinate system, including its position, rotation direction, offset direction, etc., and can be represented by a rotation matrix R and a translation vector t.
For the same frame of point cloud and original image data, the point cloud may be represented as A = P_wj = (x_j, y_j, z_j, i_j), j = 1, 2, ..., N, where N is the number of points; the information of each point includes its coordinates (x_j, y_j, z_j) and laser intensity i_j. A pixel point in the original image data is denoted as P_uv = (u, v).
And inquiring a first coordinate of the point cloud in a first coordinate system, wherein the first coordinate system is a coordinate system of a laser radar for collecting the point cloud.
The first coordinate is mapped to a second coordinate in a second coordinate system based on the external parameters, where the second coordinate system is the coordinate system of the camera; the second coordinate P_C can be expressed as:
P_C = R·P_w = (x_C, y_C, z_C)
The second coordinate is mapped to a third coordinate in the original image data based on the internal parameters; the third coordinate P_uv can be expressed as:
P_uv = K(P_C + t) = (u, v)
And projecting the point cloud to a pixel point at a third coordinate in the original image data to obtain target image data.
It should be noted that, because the view angle of the camera is limited, only a part of the point cloud will project onto the original image data, and the point cloud falling on the original image data satisfies the following rule:
z_C > 0
0 < u < W
0 < v < H
where W and H are the width and height of the original image data, respectively.
For each pixel point in the target image data, there is visual information, such as color components (e.g., R (red), G (green), B (blue)), and there may also be point cloud information, such as laser intensity and coordinates, denoted B = P_Cj = (x_cj, y_cj, z_cj, i_cj), where j = 1, 2, ..., M and M is the number of points projected onto the original image data.
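As a minimal sketch of the projection described above (not the patent's implementation), the transformation and the filtering rules can be written in Python with NumPy; the function and parameter names are assumptions, and the two mapping steps are folded into one camera-frame transform:

```python
import numpy as np

def project_points(points, K, R, t, W, H):
    """Sketch: points is an (N, 4) array of (x, y, z, intensity) in the lidar frame."""
    xyz = points[:, :3]
    p_c = xyz @ R.T + t.reshape(1, 3)                # camera-frame coordinates (R * P_w + t)
    front = p_c[:, 2] > 0                            # rule from the text: z_C > 0
    p_c, points = p_c[front], points[front]
    uvw = p_c @ K.T                                  # pinhole projection with intrinsics K
    u = uvw[:, 0] / p_c[:, 2]                        # perspective division by the depth z_C,
    v = uvw[:, 1] / p_c[:, 2]                        # implicit in P_uv = K(P_C + t) = (u, v)
    inside = (u > 0) & (u < W) & (v > 0) & (v < H)   # rules: 0 < u < W, 0 < v < H
    return u[inside], v[inside], p_c[inside], points[inside, 3]
```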
S203, according to the time sequence relation among the multi-frame target image data, carrying out semantic segmentation on the point cloud and the pixel points in the target image data so as to identify semantic information of the pixel points on the obstacle.
During operation of the computer device, the original image data is captured continuously, and since the time interval between every two frames of original image data is small, the content of two adjacent frames is substantially the same. Accordingly, the target image data is also continuous, and the content of adjacent frames of target image data is likewise substantially the same.
In this embodiment, semantic segmentation is performed on the pixel points in the target image data by exploiting the temporal relation among the multiple frames of target image data, with the point cloud as auxiliary input, using networks such as U-Net, the fully convolutional network (FCN), SegNet, dilated (atrous) convolution, and DeepLab. This identifies the semantic information of the pixel points on the obstacle, namely the category of the object to which each pixel belongs, and determines whether it is an obstacle.
It should be noted that, the category of interest varies according to different scenes, and those skilled in the art may set the category of interest according to actual scene requirements, which is not limited in this embodiment.
For example, for an autopilot scenario, categories of interest include, but are not limited to: cars, buses, trucks, tricycles, bicycles, pedestrians, traffic cones, traffic lights, traffic signs, etc.
S204, the semantic information of the pixel points is endowed to the point cloud corresponding to the pixel points.
After the semantic segmentation of the target image data is completed, each pixel point can be traversed, and the semantic information of the pixel point is assigned to the point cloud projected onto that pixel.
In this embodiment, multiple frames of point cloud and multiple frames of original image data collected simultaneously for the same visual range can be obtained, and the multiple frames of point cloud are respectively projected onto the multiple frames of original image data to obtain multiple frames of target image data. Semantic segmentation is performed on the point cloud and the pixel points in the target image data according to the temporal relation among the multiple frames of target image data, so as to identify the semantic information of the pixel points on the obstacle, and that semantic information is then assigned to the point cloud corresponding to the pixel points. The visual features of the image data are combined with the spatial features of the point cloud, so that the target image data contains not only rich color and texture features but also features such as the coordinates of the point cloud and the laser intensity, greatly enriching the feature dimensions. In addition, semantic segmentation is performed with the time sequence between frames taken into account, fusing the operating-scene characteristics of the unmanned device into the segmentation, thereby improving the accuracy of the semantic information of the obstacle.
Example two
Fig. 3 is a flowchart of a point cloud detection method according to a second embodiment of the present invention, where the processing operations of semantic segmentation and obstacle detection are further refined based on the foregoing embodiment, and the method specifically includes the following steps:
s301, acquiring multi-frame point cloud and multi-frame original image data which are simultaneously acquired for the same visual range.
In this embodiment, the unmanned device may collect the point cloud and the original image data in real time, identify the semantic information of the pixel points on the obstacle, and assign it to the point cloud, so as to detect obstacles using the point cloud. Alternatively, the unmanned device may send the point cloud and the original image data to a computing device after collection; the computing device then identifies the semantic information of the pixel points on the obstacle, assigns it to the point cloud, detects obstacles using the point cloud, and sends the obstacle information back to the unmanned device.
S302, respectively projecting the multi-frame point clouds onto multi-frame original image data to obtain multi-frame target image data.
Raw image data collected by a camera has rich visual features, such as colors and textures, but lacks spatial features, such as depth information, and combining raw image data with a point cloud can combine visual features with spatial features, thereby increasing the dimensionality of the features.
And S303, representing the point cloud in the target image data as distance information, Z coordinates, laser intensity and point identification.
In this embodiment, the point cloud in the target image data is preprocessed: each point is converted from X coordinate, Y coordinate, Z coordinate, and laser intensity into distance information, Z coordinate, laser intensity, and a point identifier, where the distance information is the distance represented by the X and Y coordinates of the point and the point identifier indicates whether a point exists, so that the point cloud is represented more explicitly in the target image data.
Because the projected point cloud is sparse on the original image data, a point identifier can be generated to indicate whether a projected point exists at each pixel position, and when several points fall in the same pixel, the information of the point closest to the camera can be adopted.
In a specific implementation, if the point cloud is B = P_Cj = (x_cj, y_cj, z_cj, i_cj), where j = 1, 2, ..., M, then the planar distance of each point from the camera, d_cj = sqrt(x_cj² + y_cj²), is computed from its X coordinate and Y coordinate (x_cj, y_cj) and used as the distance information.
Then, the target image data is expressed as I and L:
I ∈ R^(H×W×3), where W and H are the width and height of the target image data, respectively, and each pixel has a three-dimensional visual feature, i.e., I_uv = (R_uv, G_uv, B_uv).
L ∈ R^(H×W×4), where W and H are the width and height of the target image data, respectively, and each pixel has a four-dimensional spatial feature, i.e., L_uv = (d_uv, z_uv, i_uv, f_uv).
In addition, in some scenes, if the coordinates and laser intensity of the points are insufficient to represent the point cloud information at a position, other derived quantities can be computed from the point information; for example, when several points fall in the same pixel, the median of their heights can be taken, or the logarithm of the point coordinates or of the laser intensity can be used, and the beam (laser line) information of the point cloud can also be exploited to enrich the representation.
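A minimal sketch of this preprocessing, assuming the projected pixel coordinates and camera-frame points from the projection step above are available; the channel layout (distance, Z coordinate, intensity, point mask) follows the description, while the function itself is illustrative:

```python
import numpy as np

def build_spatial_channels(u, v, p_c, intensity, H, W):
    """Return L of shape (H, W, 4) holding (distance d, z coordinate, intensity, point mask)."""
    L = np.zeros((H, W, 4), dtype=np.float32)
    nearest = np.full((H, W), np.inf, dtype=np.float32)     # closest depth seen per pixel
    d = np.hypot(p_c[:, 0], p_c[:, 1])                      # planar distance to the camera
    for k in range(len(u)):
        r, c = int(v[k]), int(u[k])
        if p_c[k, 2] < nearest[r, c]:                       # keep the point closest to the camera
            nearest[r, c] = p_c[k, 2]
            L[r, c] = (d[k], p_c[k, 2], intensity[k], 1.0)  # mask = 1: a projected point exists here
    return L
```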
S304, merging the point cloud and the pixel points in the target image data of the current frame into first target feature data.
In this embodiment, the point cloud and the pixel point in each frame of target image data are traversed, and the point cloud and the pixel point in the same frame of target image data are combined to obtain the first target feature data meeting the specified specification.
In a specific implementation, as shown in fig. 4, a convolution operation may be performed on a pixel point 411 in the target image data of the current frame to obtain pixel feature data 411'.
And carrying out convolution operation on the point cloud 412 in the target image data of the current frame to obtain point cloud characteristic data 412'.
And superposing the pixel characteristic data and the point cloud characteristic data according to a channel (channel) to obtain candidate characteristic data.
The candidate feature data is convolved to obtain first target feature data 413.
In fig. 4, for convenience of description, the arrow 421 may represent a convolution operation, which may include, but is not limited to, two-dimensional convolution with a specified number (e.g., 16) of convolution kernels of a given size (e.g., 3×3), normalization with a BN (Batch Normalization) operator, activation with a ReLU (Rectified Linear Unit) operator, processing with a residual network, and so on.
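One possible reading of this fusion step in PyTorch is sketched below; the 16 output channels and 3×3 kernels come from the example above, but the exact module layout is an assumption rather than the patent's code:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    """Convolution operation as described for fig. 4: 3x3 convolution, BN, then ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class FusionBlock(nn.Module):
    """Merge pixel features (RGB, 3 channels) and point features (d, z, i, mask, 4 channels)."""
    def __init__(self, out_ch=16):
        super().__init__()
        self.pixel_conv = conv_block(3, out_ch)      # pixel feature data 411'
        self.point_conv = conv_block(4, out_ch)      # point cloud feature data 412'
        self.merge_conv = conv_block(2 * out_ch, out_ch)

    def forward(self, rgb, points):
        candidate = torch.cat([self.pixel_conv(rgb), self.point_conv(points)], dim=1)
        return self.merge_conv(candidate)            # first target feature data 413
```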
S305, referring to the coding feature generated in the coding process of the target image data of the previous frame, coding the first target feature data to obtain second target feature data.
With the advances in deep learning, and in particular the outstanding performance of deep convolutional neural networks, many semantic segmentation models based on deep convolutional neural networks have been implemented.
However, deep convolutional neural networks are complex and require substantial computing resources. To meet the limited computing resources and low latency requirements of unmanned devices, typical semantic segmentation models are single-frame models, i.e., they do not consider the temporal information that exists between frames while the unmanned device is operating. Current temporal models are usually built by adding recurrent neural networks (RNN), such as long short-term memory (LSTM) units, to existing convolutional neural networks, but these are relatively complex.
In order to make full use of the temporal information between frames and improve segmentation performance, this embodiment provides, as shown in fig. 4, a lightweight temporal semantic segmentation model based on U-Net. The model uses a memory unit to carry the coding features of the preceding frame into the current frame for semantic segmentation, so that the limited computing resources, the low computing latency, and the accuracy of the segmentation can all be taken into account.
U-Net is mainly composed of two parts, encoding (Encoder) and decoding (Decoder): the encoder comprises convolution operations and downsampling, and the decoder comprises convolution operations and upsampling. After downsampling in the encoder, a series of features (i.e., the second target feature data) smaller than the original input (i.e., the first target feature data) is obtained, which is equivalent to compression; upsampling in the decoder then restores, ideally, the resolution of the original input.
In fig. 4, for convenience of description, arrows 422 may each represent downsampling and arrows 424 may each represent upsampling. Downsampling, i.e., the process of progressively reducing the size of the image data, specifically includes max pooling, mean pooling, random pooling (stochastic pooling), and the like; upsampling, i.e., the process of progressively enlarging the size of the image data, specifically includes interpolation methods such as nearest neighbor, bilinear interpolation, and cubic interpolation, bilinear interpolation followed by convolution, transposed convolution, and so on.
Illustratively, the downsampling in fig. 4 may be performed using a 3×3 convolution kernel with stride 2, followed by normalization with a BN operator and an activation operation with a ReLU operator, so that the size is reduced to half of the original and the number of channels is increased; the upsampling in fig. 4 may employ transposed convolution.
During encoding (Encoder) and decoding (Decoder), adding the coding features of the previous frame together with the point cloud of the current frame helps the model use the inter-frame temporal information when semantically segmenting the pixel points of the current frame.
In the encoding (Encoder) process, the same operation may be circularly performed, and then the same operation may be regarded as one hierarchy, i.e., the encoding process may be divided into a plurality of hierarchies, each of which may be referred to as an encoding layer.
It should be noted that the coding hierarchy may be set by those skilled in the art according to practical situations, such as 3 layers, 4 layers, and 5 layers, which is not limited in this embodiment.
In a specific implementation, as shown in fig. 4, the coding features generated during the encoding of the previous frame of target image data include fourth coding feature data. If downsampling, copying (copy) and concatenating (concat) the previous-frame features, and convolution are regarded as one layer, the semantic segmentation model in fig. 4 may be divided into 5 coding layers.
In fig. 4, for convenience of description, the arrow 425 may indicate that the memory unit copies and concatenates the previous frame, that is, the data at the start position of the arrow 425 in the previous frame is copied to the current frame at the end position of the arrow 425, so as to concatenate with the previous data.
In encoding (Encoder), each encoding layer is traversed sequentially, starting from the first encoding layer and proceeding to the last encoding layer, as shown in fig. 4, the first encoding layer is chosen as an example illustration for ease of representation, and the operation of the remaining encoding layers can be seen from the operation of the first encoding layer.
For the current coding layer, the first coding feature data input into that layer is determined: for the first coding layer it is the first target feature data 413, and for any subsequent coding layer it is the fourth coding feature data output by the previous coding layer; for example, the first coding feature data input into the second coding layer is the fourth coding feature data 415 output by the first coding layer.
The first encoded characteristic data (e.g., the first target characteristic data 413) is downsampled to obtain the second encoded characteristic data 414.
The fourth coding feature data 415' of the previous frame, copied by the memory unit, is then extracted and concatenated with the second coding feature data along the channel dimension to obtain the third coding feature data; when concatenating, the learning mechanism of the convolutional neural network can tolerate slight misalignment between the frames.
A convolution operation is performed on the third coding feature data to obtain the fourth coding feature data 415 of the current frame. At this point, it is determined whether the current coding layer is the last coding layer: if so, the fourth coding feature data of the last coding layer is the second target feature data to be output to the decoder (Decoder); if not, the fourth coding feature data of the current coding layer is output as the first coding feature data of the next coding layer and encoding continues.
In addition, the memory unit records the fourth coding feature data 415 of the current frame and the current coding layer; when the next frame reaches the same coding layer, this data 415 is treated as the fourth coding feature data 415' of the previous frame, and the process repeats until all the target image data have been traversed.
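One way to sketch such a coding layer with the previous-frame memory, reusing the conv_block helper from the earlier sketch; the stride-2 downsampling follows the description of fig. 4, while keeping the memory as a simple module attribute is an assumption:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Downsample, concatenate the previous frame's features from the memory unit, convolve."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = conv_block(in_ch, out_ch, stride=2)   # halves the size, raises the channels
        self.fuse = conv_block(2 * out_ch, out_ch)
        self.memory = None                                # fourth coding feature data 415' (previous frame)

    def forward(self, first_feat):
        second = self.down(first_feat)                    # second coding feature data 414
        prev = self.memory if self.memory is not None else torch.zeros_like(second)
        third = torch.cat([second, prev], dim=1)          # concatenate with the previous frame
        fourth = self.fuse(third)                         # fourth coding feature data 415
        self.memory = fourth.detach()                     # remembered for the next frame
        return fourth
```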
S306, performing scale pooling operation on the second target feature data.
In this embodiment, as shown in fig. 4, a scale pooling module (Pyramid Pooling Module, PPM) 431 may be used to scale-pool the second target feature data, that is, to increase the receptive field of the second target feature data with a low amount of computation (the size of the receptive field may be regarded as the extent to which context information is utilized), which is beneficial for improving the expressive power of the semantic segmentation model.
As shown in FIG. 4, for convenience of description, the line segment 426 may represent data transmission.
As shown in fig. 5, for the scale pooling operation, the second target feature data may be regarded as the feature map 501 (channel = N), and the pooling operation 511 is performed on the feature map 501 to obtain first intermediate feature data 502 at a plurality of different scales; the example in fig. 5 uses 4 different scales.
A convolution operation 512 with a 1×1 convolution kernel is performed on each feature map 502 to obtain second intermediate feature data 503 of sizes 1×1, 2×2, 4×4, and 6×6, each with a reduced number of channels.
Upsampling 513, such as bilinear interpolation, is applied to the second intermediate feature data 503 to obtain the upsampled second intermediate feature data 504.
Through the copy operation 514, the feature map 501 is concatenated along the channel dimension with each upsampled second intermediate feature data 504, yielding a feature map with an increased number of channels.
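A sketch of such a scale (pyramid) pooling module in PyTorch, using the 1×1, 2×2, 4×4, and 6×6 scales from fig. 5; reducing each branch to N divided by the number of scales channels is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScalePooling(nn.Module):
    """Pool to several scales, reduce channels with 1x1 convs, upsample, and concatenate."""
    def __init__(self, channels, scales=(1, 2, 4, 6)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.AdaptiveAvgPool2d(s),                                      # pooling operation 511
                nn.Conv2d(channels, channels // len(scales), 1, bias=False),  # 1x1 convolution 512
            )
            for s in scales
        ])

    def forward(self, x):                                                     # x: feature map 501
        h, w = x.shape[2:]
        ups = [F.interpolate(b(x), size=(h, w), mode="bilinear", align_corners=False)
               for b in self.branches]                                        # upsampling 513
        return torch.cat([x] + ups, dim=1)                                    # channel concatenation 514
```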
S307, the second target feature data is decoded by referring to the coding feature generated during coding of the target image data of the current frame, and the third target feature data is obtained.
As shown in fig. 4, U-Net contains long skip connections; for convenience of description, arrow 423 may represent such a long skip connection. Through it, the coding features generated when the target image data of the current frame is encoded can be applied while decoding (Decoder) the second target feature data, so that the information from the encoding (Encoder) stage, including high-level semantic features and low- and mid-level detail features such as image edges, is fully utilized during decoding, in order to accurately predict each pixel.
In the decoding (Decoder), the same operation may be circularly performed, and thus the same operation may be regarded as one hierarchy, i.e., the decoding process may be divided into a plurality of hierarchies, each of which may be referred to as a decoding layer.
It should be noted that there is generally one fewer decoding layer than coding layers; the number may be set by those skilled in the art according to the actual situation, for example, 3, 4, or 5 layers, which is not limited in this embodiment.
In a specific implementation, as shown in fig. 4, the coding features generated during the encoding of the current frame of target image data include fourth coding feature data. If upsampling, concatenating (concat) the coding features of the current frame, and convolution are regarded as one layer, the semantic segmentation model in fig. 4 may be divided into 4 decoding layers.
In decoding (Decoder), each decoding layer is traversed sequentially, starting from the first decoding layer to the last decoding layer, as shown in fig. 4, and the last decoding layer is selected as an example for convenience of representation, and the operations of the remaining decoding layers can be referred to as the operation of the last decoding layer.
For the current decoding layer, the first decoding feature data input into that layer is determined: for the first decoding layer it is the second target feature data, and for any subsequent decoding layer it is the fourth decoding feature data output by the previous decoding layer; for example, the first decoding feature data input into the fourth decoding layer is the fourth decoding feature data output by the third decoding layer.
The first decoded feature data is up-sampled to obtain second decoded feature data 416.
The fourth coding feature data adapted to the current decoding layer (e.g., the feature data 415 of the matching coding layer) is concatenated with the second decoding feature data 416 to obtain the third decoding feature data.
It should be noted that U-Net is a symmetrical structure. Suppose the coding layers are E_1, E_2, ..., E_i, ..., E_n and the decoding layers are D_1, D_2, ..., D_j, ..., D_(n-1); then E_i is adapted to D_j, where i + j = n. For example, as shown in fig. 4, the first coding layer is adapted to the fourth decoding layer, the second coding layer to the third decoding layer, and so on.
A convolution operation is performed on the third decoding feature data to obtain the fourth decoding feature data 417. At this point, it can be determined whether the current decoding layer is the last decoding layer: if so, the fourth decoding feature data of the last decoding layer is the third target feature data and is output for determining the semantic information; if not, the fourth decoding feature data of the current decoding layer is output as the first decoding feature data of the next decoding layer and decoding (Decoder) continues.
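A corresponding decoding layer could be sketched as follows, again reusing the conv_block helper from the fusion sketch; the transposed-convolution upsampling follows the text, and the rest is an assumption:

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Upsample, concatenate the adapted coding layer's features (long skip connection), convolve."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)  # upsampling 424
        self.fuse = conv_block(out_ch + skip_ch, out_ch)

    def forward(self, first_feat, skip_feat):
        second = self.up(first_feat)                     # second decoding feature data 416
        third = torch.cat([second, skip_feat], dim=1)    # concatenate with the adapted coding feature
        return self.fuse(third)                          # fourth decoding feature data 417
```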
And S308, performing scale pooling operation on the third target feature data.
In this embodiment, as shown in fig. 4, the scale pooling module PPM 432 may be used to scale-pool the third target feature data, that is, the receptive field of the third target feature data is increased with a lower calculation amount, which is beneficial to improving the expression capability of the semantic segmentation model.
S309, up-sampling the third target feature data to obtain semantic information of the pixel point on the obstacle.
In this embodiment, as shown in fig. 4, the third target feature data may be up-sampled so that (C+1)-dimensional semantic information 418 about the obstacle is output for each pixel point as the prediction result Prediction of the semantic segmentation model:
Prediction ∈ R^(H×W×(C+1))
wherein W, H is the width and height of the target image data, respectively.
Then, for each pixel point in the target image data, its semantic information can be expressed as:
Prediction_uv = (p_0, p_1, p_2, ..., p_i, ..., p_C)
where p_0 is the confidence probability that the pixel point is an obstacle, i = 1, 2, ..., C indexes the obstacle categories, and p_1, p_2, ..., p_i, ..., p_C represent the probabilities that the pixel belongs to an obstacle of each category.
It should be noted that the semantic information may be divided at a finer granularity according to actual requirements, for example, car, minibus, construction vehicle, and so on. Besides predicting the category of the obstacle to which a pixel point belongs, the height and the center of that obstacle can also be predicted, thereby enriching the semantic information of the point cloud.
In addition, by using the depth information of the point cloud as the depth of the corresponding pixel points in the target image data and predicting it as a target, the depth of every pixel point in the target image data can be predicted from the existing sparse point cloud, yielding dense point-cloud-like data for obstacle detection, which facilitates the detection of smaller obstacles.
S310, semantic information of the pixel points is given to point clouds corresponding to the pixel points.
In this embodiment, after semantic segmentation, the semantic information of each pixel point in the target image data is Prediction_uv = (p_0, p_1, p_2, ..., p_i, ..., p_C). After this semantic information is assigned to the point cloud corresponding to the pixel point, each point fuses its original coordinates, laser intensity, and predicted semantic information, so that a point cloud with semantic information is obtained:
A' = P_wj = (x_j, y_j, z_j, i_j, Prediction_j)
where x_j is the X coordinate, y_j the Y coordinate, z_j the Z coordinate, i_j the laser intensity, and Prediction_j the semantic information, with j = 1, 2, ..., N and N the number of points.
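A small sketch of this final assignment, assuming the per-pixel prediction map and the pixel coordinates that each point was projected to are available from the earlier steps; the names are illustrative:

```python
import numpy as np

def attach_semantics(points, u, v, prediction):
    """points: (M, 4) of (x, y, z, i); prediction: (H, W, C + 1) per-pixel semantic output."""
    rows, cols = v.astype(int), u.astype(int)
    semantics = prediction[rows, cols]                  # Prediction_uv for each projected point
    # A' = (x, y, z, i, Prediction): fuse coordinates, intensity, and semantic information.
    return np.concatenate([points, semantics], axis=1)
```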
S311, detecting obstacles in the visual range according to the point cloud carrying the semantic information.
The point cloud carrying semantic information may be fed into a point cloud object detection model, e.g., PointRCNN, PointPillars, VoxelNet, etc., to detect the obstacles present in the visual range and thus support decision making by the unmanned device.
Example III
Fig. 6 is a schematic structural diagram of a point cloud detection device according to a third embodiment of the present invention, where the device may specifically include the following modules:
the original data acquisition module 601 is configured to acquire multiple frames of point clouds and multiple frames of original image data that are simultaneously acquired for the same visual range;
The point cloud projection module 602 is configured to project a plurality of frames of the point cloud onto a plurality of frames of the original image data, so as to obtain a plurality of frames of target image data;
the semantic segmentation module 603 is configured to perform semantic segmentation on a point cloud and a pixel point in the target image data according to a time sequence relation among multiple frames of the target image data, so as to identify semantic information of the pixel point on an obstacle;
the semantic information assignment module 604 is configured to assign semantic information of the pixel point to a point cloud corresponding to the pixel point.
In one embodiment of the present invention, the point cloud projection module 602 includes:
the parameter query sub-module is used for querying internal parameters and external parameters of a camera, and the camera is used for acquiring the original image data;
the first coordinate inquiring sub-module is used for inquiring a first coordinate of the point cloud in a first coordinate system aiming at the point cloud and the original image data of the same frame, wherein the first coordinate system is a coordinate system in which a laser radar for acquiring the point cloud is located;
the second coordinate mapping submodule is used for mapping the first coordinate to a second coordinate in a second coordinate system based on the external parameter, wherein the second coordinate system is the coordinate system where the camera is located;
A third coordinate mapping sub-module for mapping the second coordinate to a third coordinate in the original image data based on the internal parameter;
and the pixel point projection submodule is used for projecting the point cloud to the pixel point which is positioned at the third coordinate in the original image data to obtain target image data.
In one embodiment of the present invention, the semantic segmentation module 603 includes:
the target image data merging sub-module is used for merging point clouds and pixel points in the target image data of the current frame into first target characteristic data;
the encoding submodule is used for referring to the encoding characteristics generated in the encoding process of the target image data of the previous frame, encoding the first target characteristic data and obtaining second target characteristic data;
the decoding submodule is used for decoding the second target feature data by referring to the coding features generated during coding of the target image data of the current frame to obtain third target feature data;
and the semantic information mapping sub-module is used for up-sampling the third target characteristic data to obtain the semantic information of the pixel point on the obstacle.
In one embodiment of the present invention, the target image data merging sub-module includes:
The pixel characteristic data obtaining unit is used for carrying out convolution operation on pixel points in the target image data of the current frame to obtain pixel characteristic data;
the point cloud characteristic data obtaining unit is used for carrying out convolution operation on the point cloud in the target image data of the current frame to obtain point cloud characteristic data;
the characteristic data superposition unit is used for superposing the pixel characteristic data and the point cloud characteristic data according to a channel to obtain candidate characteristic data;
and the first target feature data obtaining unit is used for carrying out convolution operation on the candidate feature data to obtain first target feature data.
In one embodiment of the present invention, the encoding feature generated at the time of encoding the target image data of the previous frame includes fourth encoding feature data;
the encoding submodule includes:
the first coding feature data determining unit is used for determining the first coding feature data input into the current coding layer, wherein the first coding feature data of the first coding layer is the first target feature data, and the first coding feature data of a non-first coding layer is the fourth coding feature data output by the previous coding layer;
the second coding characteristic data obtaining unit is used for downsampling the first coding characteristic data to obtain second coding characteristic data;
A third coding feature data obtaining unit, configured to splice the second coding feature data with fourth coding feature data of a previous frame to obtain third coding feature data;
a fourth coding feature data obtaining unit, configured to perform convolution operation on the third coding feature data to obtain fourth coding feature data of the current frame; the fourth coding characteristic data of the last coding layer is the second target characteristic data.
In one embodiment of the present invention, the coding feature generated when the target image data of the current frame is coded includes fourth coding feature data;
the decoding submodule includes:
a first decoding feature data determining unit, configured to determine the first decoding feature data input into the current decoding layer, wherein the first decoding feature data of the first decoding layer is the second target feature data, and the first decoding feature data of a non-first decoding layer is the fourth decoding feature data output by the previous decoding layer;
a second decoded feature data obtaining unit, configured to upsample the first decoded feature data to obtain second decoded feature data;
a third decoding characteristic data obtaining unit, configured to splice fourth encoding characteristic data adapted to the current decoding layer and the second decoding characteristic data to obtain a third decoding characteristic;
And a fourth decoded characteristic data obtaining unit, configured to perform a convolution operation on the third decoded characteristic data to obtain fourth decoded characteristic data, where fourth decoded data of the last decoding layer is third target characteristic data.
In one embodiment of the present invention, the semantic segmentation module 603 further includes:
the point cloud preprocessing sub-module is used for representing the point cloud in the target image data as distance information, Z coordinates, laser intensity and point identification;
the distance information is the distance represented by the X coordinate and the Y coordinate of the point cloud, and the point mark is used for representing whether the point cloud exists or not.
In one embodiment of the present invention, the semantic segmentation module 603 further includes:
the first scale pooling operation submodule is used for performing scale pooling operation on the second target characteristic data;
and/or the number of the groups of groups,
and the second scale pooling operation sub-module is used for performing scale pooling operation on the third target characteristic data.
In one embodiment of the present invention, further comprising:
and the obstacle detection module is used for detecting obstacles in the visual range according to the point cloud carrying the semantic information.
The point cloud detection device provided by the embodiment of the invention can execute the point cloud detection method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
Example IV
Fig. 7 is a schematic structural diagram of a computer device according to a fourth embodiment of the present invention. As shown in fig. 7, the computer apparatus includes a processor 700, a memory 701, a communication module 702, an input device 703, and an output device 704; the number of processors 700 in the computer device may be one or more, one processor 700 being taken as an example in fig. 7; the processor 700, memory 701, communication module 702, input device 703 and output device 704 in the computer apparatus may be connected by a bus or other means, in fig. 7 by way of example.
The memory 701 is used as a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as modules corresponding to the point cloud detection method in the present embodiment (for example, the raw data acquisition module 601, the point cloud projection module 602, the semantic segmentation module 603, and the semantic information assignment module 604 in the point cloud detection apparatus shown in fig. 6). The processor 700 executes various functional applications of the computer device and data processing, i.e., implements the above-described point cloud detection method, by running software programs, instructions, and modules stored in the memory 701.
The memory 701 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for functions; the storage data area may store data created according to the use of the computer device, etc. In addition, memory 701 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 701 may further include memory remotely located relative to processor 700, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
And the communication module 702 is used for establishing connection with the display screen and realizing data interaction with the display screen.
The input means 703 may be used for receiving input digital or character information and generating key signal inputs related to user settings and function control of the computer device, as well as a camera for capturing images and a sound pickup device for capturing audio data.
The output device 704 may include an audio apparatus such as a speaker.
The specific composition of the input device 703 and the output device 704 may be set according to the actual situation.
The processor 700 executes various functional applications of the device and performs data processing by running the software programs, instructions, and modules stored in the memory 701, i.e., implements the above-described point cloud detection method.
The computer device provided in this embodiment may execute the point cloud detection method provided in any embodiment of the present invention, and has the corresponding functions and beneficial effects.
Example five
The fifth embodiment of the present invention further provides a computer readable storage medium having a computer program stored thereon, the computer program implementing a point cloud detection method when executed by a processor, the method comprising:
acquiring multi-frame point cloud and multi-frame original image data which are simultaneously acquired for the same visual range;
respectively projecting a plurality of frames of point clouds onto a plurality of frames of original image data to obtain a plurality of frames of target image data;
according to the time sequence relation among the multiple frames of target image data, carrying out semantic segmentation on point clouds and pixel points in the target image data so as to identify semantic information of the pixel points on the obstacle;
And giving semantic information of the pixel points to point clouds corresponding to the pixel points.
Of course, the computer readable storage medium provided by the embodiments of the present invention, the computer program thereof is not limited to the method operations described above, and may also perform the related operations in the point cloud detection method provided by any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the point cloud detection apparatus, each unit and module included are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims (11)

1. A point cloud detection method, characterized by comprising:
acquiring multi-frame point cloud and multi-frame original image data which are simultaneously acquired for the same visual range;
Respectively projecting a plurality of frames of point clouds onto a plurality of frames of original image data to obtain a plurality of frames of target image data;
according to the time sequence relation among the multiple frames of the target image data, carrying out semantic segmentation on the point cloud and the pixel points in the target image data to identify semantic information of the pixel points on an obstacle, which comprises:
merging point clouds and pixel points in the target image data of the current frame into first target feature data;
coding the first target feature data by referring to the coding feature generated during coding of the target image data of the previous frame to obtain second target feature data;
decoding the second target feature data by referring to the coding feature generated during coding of the target image data of the current frame to obtain third target feature data;
up-sampling the third target feature data to obtain semantic information of the pixel point on the obstacle;
assigning the semantic information of the pixel points to the point clouds corresponding to the pixel points;
wherein the coding feature generated in the encoding of the target image data of the previous frame comprises fourth coding feature data;
the encoding the first target feature data with reference to the encoding feature generated during encoding of the target image data of the previous frame to obtain second target feature data includes:
determining first coding feature data input into the current coding layer;
downsampling the first coding feature data to obtain second coding feature data;
splicing the second coding feature data with the fourth coding feature data of the previous frame to obtain third coding feature data;
performing convolution operation on the third coding feature data to obtain fourth coding feature data of the current frame; the fourth coding feature data of the last coding layer is the second target feature data.
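For readers who want a concrete picture of the coding layer recited in claim 1 (downsample, splice with the previous frame's fourth coding feature data, convolve), the following PyTorch sketch shows one possible reading; the class name TemporalEncoderLayer, the pooling and convolution choices, and the channel sizes are assumptions, not disclosed values.

    import torch
    import torch.nn as nn

    class TemporalEncoderLayer(nn.Module):
        """One coding layer: downsample, splice with the previous frame's
        fourth coding feature data from the same layer, then convolve."""
        def __init__(self, in_channels, out_channels):
            super().__init__()
            self.out_channels = out_channels
            self.down = nn.MaxPool2d(2)  # downsampling step
            self.conv = nn.Conv2d(in_channels + out_channels, out_channels, 3, padding=1)

        def forward(self, first_feat, prev_fourth_feat=None):
            second = self.down(first_feat)  # second coding feature data
            if prev_fourth_feat is None:    # first frame: no history yet
                prev_fourth_feat = torch.zeros(
                    second.size(0), self.out_channels, second.size(2), second.size(3),
                    device=second.device)
            # third coding feature data: splice along the channel dimension
            third = torch.cat([second, prev_fourth_feat], dim=1)
            # fourth coding feature data of the current frame
            return torch.relu(self.conv(third))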
2. The method according to claim 1, wherein projecting the plurality of frames of the point cloud onto the plurality of frames of the original image data, respectively, to obtain a plurality of frames of target image data, comprises:
inquiring internal parameters and external parameters of a camera, wherein the camera is used for acquiring the original image data;
inquiring a first coordinate of the point cloud in a first coordinate system aiming at the point cloud and the original image data of the same frame, wherein the first coordinate system is a coordinate system of a laser radar for collecting the point cloud;
mapping the first coordinate to a second coordinate in a second coordinate system based on the external parameter, wherein the second coordinate system is the coordinate system where the camera is located;
Mapping the second coordinates to third coordinates in the original image data based on the internal parameters;
and projecting the point cloud to the pixel point at the third coordinate in the original image data to obtain target image data.
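The coordinate chain in claim 2 (lidar frame to camera frame via the external parameters, camera frame to pixel coordinates via the internal parameters) follows the standard pinhole projection. A hedged numpy sketch, with the matrix shapes assumed rather than specified by the patent:

    import numpy as np

    def project_lidar_to_image(points_lidar, extrinsic, intrinsic, image_shape):
        """points_lidar: (N, 3) XYZ in the lidar frame (first coordinate system).
        extrinsic: (4, 4) lidar-to-camera transform (external parameter).
        intrinsic: (3, 3) camera matrix (internal parameter).
        Returns pixel coordinates (third coordinates) and a validity mask."""
        n = points_lidar.shape[0]
        homo = np.hstack([points_lidar, np.ones((n, 1))])   # homogeneous lidar coords
        cam = (extrinsic @ homo.T).T[:, :3]                 # second coordinates (camera frame)
        in_front = cam[:, 2] > 0                            # keep points in front of the camera
        uvw = (intrinsic @ cam.T).T
        depth = np.maximum(uvw[:, 2:3], 1e-6)               # avoid division by zero
        uv = uvw[:, :2] / depth                             # third coordinates (u, v)
        h, w = image_shape[:2]
        valid = in_front & (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        return uv, valid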
3. The method according to claim 1, wherein merging the point cloud and the pixel points in the target image data of the current frame into the first target feature data includes:
performing convolution operation on pixel points in the target image data of the current frame to obtain pixel characteristic data;
performing convolution operation on the point cloud in the target image data of the current frame to obtain point cloud characteristic data;
superposing the pixel characteristic data and the point cloud characteristic data according to a channel to obtain candidate characteristic data;
and carrying out convolution operation on the candidate feature data to obtain first target feature data.
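Claim 3 merges pixel features and point-cloud features by separate convolutions, a channel-wise superposition, and a further convolution. One possible sketch, with all channel counts chosen arbitrarily for illustration:

    import torch
    import torch.nn as nn

    class PixelPointFusion(nn.Module):
        """Merge pixel channels and point-cloud channels into first target feature data."""
        def __init__(self, pixel_ch=3, point_ch=4, mid_ch=32, out_ch=64):
            super().__init__()
            self.pixel_conv = nn.Conv2d(pixel_ch, mid_ch, 3, padding=1)   # pixel feature data
            self.point_conv = nn.Conv2d(point_ch, mid_ch, 3, padding=1)   # point cloud feature data
            self.fuse_conv = nn.Conv2d(mid_ch * 2, out_ch, 3, padding=1)  # first target feature data

        def forward(self, pixels, points):
            pixel_feat = torch.relu(self.pixel_conv(pixels))
            point_feat = torch.relu(self.point_conv(points))
            candidate = torch.cat([pixel_feat, point_feat], dim=1)   # superpose by channel
            return torch.relu(self.fuse_conv(candidate))             # first target feature data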
4. The method according to claim 1, wherein the encoding the first target feature data with reference to the encoding feature generated at the time of encoding the target image data of the previous frame to obtain second target feature data, further comprises:
and outputting the fourth coding feature data of the last coding layer, by taking the first coding feature data of the first coding layer as the first target feature data and the first coding feature data of a non-first coding layer as the fourth coding feature data output by the previous coding layer.
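Claim 4 chains several coding layers: the first layer consumes the first target feature data, each later layer consumes the previous layer's output, and the last layer's output is the second target feature data. A minimal sketch, reusing the hypothetical TemporalEncoderLayer from the earlier example:

    def encode(layers, first_target_feat, prev_frame_feats):
        """layers: list of TemporalEncoderLayer instances (assumed, see earlier sketch).
        prev_frame_feats: per-layer fourth coding feature data from the previous
        frame, or a list of None values for the first frame."""
        feats = []                        # fourth coding feature data, layer by layer
        x = first_target_feat             # first coding feature data of the first layer
        for layer, prev in zip(layers, prev_frame_feats):
            x = layer(x, prev)            # non-first layers take the previous layer's output
            feats.append(x)
        second_target_feat = feats[-1]    # output of the last coding layer
        return second_target_feat, feats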
5. The method of claim 4, wherein the encoding features generated by the current frame of the target image data at the time of encoding include fourth encoding feature data;
the decoding the second target feature data with reference to the coding feature generated during coding of the target image data of the current frame to obtain third target feature data includes:
determining first decoding feature data input into a current decoding layer, wherein the first decoding feature data of the first decoding layer is the second target feature data, and the first decoding feature data of a non-first decoding layer is the fourth decoding feature data output by the previous decoding layer;
upsampling the first decoding feature data to obtain second decoding feature data;
splicing the fourth coding feature data matched with the current decoding layer with the second decoding feature data to obtain third decoding feature data;
and performing convolution operation on the third decoding feature data to obtain fourth decoding feature data, wherein the fourth decoding feature data of the last decoding layer is the third target feature data.
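Claim 5 mirrors the encoder on the decoding side: upsample, splice with the matching coding layer's features from the current frame, convolve. An illustrative PyTorch sketch under those assumptions (module name and sizes are hypothetical):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoderLayer(nn.Module):
        """One decoding layer: upsample, splice with the matching encoder features, convolve."""
        def __init__(self, in_channels, skip_channels, out_channels):
            super().__init__()
            self.conv = nn.Conv2d(in_channels + skip_channels, out_channels, 3, padding=1)

        def forward(self, first_decoding_feat, matching_fourth_coding_feat):
            # second decoding feature data: upsampled input
            second = F.interpolate(first_decoding_feat, scale_factor=2, mode="nearest")
            # third decoding feature data: splice with the current frame's
            # fourth coding feature data from the matching coding layer
            third = torch.cat([second, matching_fourth_coding_feat], dim=1)
            # fourth decoding feature data
            return torch.relu(self.conv(third))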
6. The method according to any one of claims 1-5, wherein the semantic segmentation is performed on the point cloud and the pixel point in the target image data according to the time sequence relation between the target image data so as to identify semantic information of the pixel point on an obstacle, and the method further comprises:
The point cloud in the target image data is expressed as distance information, Z coordinates, laser intensity and point marks;
the distance information is the distance represented by the X coordinate and the Y coordinate of the point cloud, and the point mark is used for representing whether the point cloud exists or not.
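Claim 6 represents each projected point by four per-pixel channels: a distance derived from the X and Y coordinates, the Z coordinate, the laser intensity, and a point mark indicating whether a point falls on that pixel. A small numpy sketch of that reading (using the Euclidean XY range for the distance is an assumption):

    import numpy as np

    def rasterize_points(points, uv, valid, image_shape):
        """points: (N, 4) array of (x, y, z, intensity) in the lidar frame.
        uv: (N, 2) pixel coordinates from the projection step; valid: (N,) mask.
        Returns an (H, W, 4) tensor of [distance, z, intensity, point mark]."""
        h, w = image_shape[:2]
        channels = np.zeros((h, w, 4), dtype=np.float32)
        cols = uv[valid, 0].astype(int)
        rows = uv[valid, 1].astype(int)
        pts = points[valid]
        channels[rows, cols, 0] = np.hypot(pts[:, 0], pts[:, 1])   # distance from X, Y
        channels[rows, cols, 1] = pts[:, 2]                        # Z coordinate
        channels[rows, cols, 2] = pts[:, 3]                        # laser intensity
        channels[rows, cols, 3] = 1.0                              # point mark: a point exists here
        return channels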
7. The method according to any one of claims 1-5, wherein the semantic segmentation is performed on the point cloud and the pixel point in the target image data according to the time sequence relation between the target image data so as to identify semantic information of the pixel point on an obstacle, and the method further comprises:
performing scale pooling operation on the second target characteristic data;
and/or the number of the groups of groups,
and performing scale pooling operation on the third target characteristic data.
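The claim does not define the scale pooling operation; one common interpretation is pyramid-style pooling that aggregates the feature map at several resolutions and fuses the results. The sketch below is offered only under that assumption:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScalePooling(nn.Module):
        """Pool the feature map at several scales and fuse the results (assumed reading)."""
        def __init__(self, channels, scales=(2, 4, 8)):
            super().__init__()
            self.scales = scales
            self.fuse = nn.Conv2d(channels * (len(scales) + 1), channels, 1)

        def forward(self, x):
            h, w = x.shape[2:]
            pooled = [x]
            for s in self.scales:
                p = F.adaptive_avg_pool2d(x, (max(h // s, 1), max(w // s, 1)))
                pooled.append(F.interpolate(p, size=(h, w), mode="nearest"))
            return self.fuse(torch.cat(pooled, dim=1))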
8. The method of any one of claims 1-5, further comprising:
and detecting the obstacle existing in the visual range according to the point cloud carrying the semantic information.
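Claim 8 derives obstacles from the point cloud carrying semantic information. A simple, non-authoritative way to illustrate this is to cluster the points labeled as obstacles (here with scikit-learn's DBSCAN, which is not named by the patent) and report axis-aligned bounding boxes:

    import numpy as np
    from sklearn.cluster import DBSCAN

    def detect_obstacles(points_xyz, labels, obstacle_classes, eps=0.5, min_samples=10):
        """points_xyz: (N, 3) lidar points; labels: (N,) semantic labels copied
        from the pixels; obstacle_classes: label ids treated as obstacles."""
        mask = np.isin(labels, list(obstacle_classes))
        obstacle_points = points_xyz[mask]
        if len(obstacle_points) == 0:
            return []
        clusters = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(obstacle_points)
        boxes = []
        for cid in set(clusters) - {-1}:                      # -1 marks noise points
            pts = obstacle_points[clusters == cid]
            boxes.append((pts.min(axis=0), pts.max(axis=0)))  # axis-aligned bounding box
        return boxes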
9. A point cloud detection apparatus, comprising:
the original data acquisition module is used for acquiring multi-frame point clouds and multi-frame original image data which are simultaneously acquired for the same visual range;
the point cloud projection module is used for respectively projecting a plurality of frames of point clouds onto a plurality of frames of original image data to obtain a plurality of frames of target image data;
The semantic segmentation module is used for carrying out semantic segmentation on point clouds and pixel points in the target image data according to the time sequence relation among the multiple frames of the target image data so as to identify semantic information of the pixel points on the obstacle;
the semantic information assignment module is used for assigning semantic information of the pixel points to point clouds corresponding to the pixel points;
the semantic segmentation module comprises:
the target image data merging sub-module is used for merging point clouds and pixel points in the target image data of the current frame into first target characteristic data;
the encoding submodule is used for referring to the encoding characteristics generated in the encoding process of the target image data of the previous frame, encoding the first target characteristic data and obtaining second target characteristic data;
the decoding submodule is used for decoding the second target feature data by referring to the coding features generated during coding of the target image data of the current frame to obtain third target feature data;
the semantic information mapping sub-module is used for up-sampling the third target feature data to obtain semantic information of the pixel point on the obstacle;
wherein the coding feature generated in the encoding of the target image data of the previous frame comprises fourth coding feature data;
The encoding submodule includes:
a first coding feature data determining unit for determining first coding feature data input to the current coding layer;
a second coding feature data obtaining unit, configured to downsample the first coding feature data to obtain second coding feature data;
a third coding feature data obtaining unit, configured to splice the second coding feature data with fourth coding feature data of a previous frame to obtain third coding feature data;
a fourth coding feature data obtaining unit, configured to perform convolution operation on the third coding feature data to obtain fourth coding feature data of the current frame; the fourth coding feature data of the last coding layer is the second target feature data.
10. A computer device, the computer device comprising:
one or more processors;
a memory for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the point cloud detection method of any of claims 1-8.
11. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the point cloud detection method according to any of claims 1-8.
CN202010519325.6A 2020-06-09 2020-06-09 Point cloud detection method and device, computer equipment and storage medium Active CN111709343B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010519325.6A CN111709343B (en) 2020-06-09 2020-06-09 Point cloud detection method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010519325.6A CN111709343B (en) 2020-06-09 2020-06-09 Point cloud detection method and device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111709343A CN111709343A (en) 2020-09-25
CN111709343B true CN111709343B (en) 2023-11-10

Family

ID=72539337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010519325.6A Active CN111709343B (en) 2020-06-09 2020-06-09 Point cloud detection method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111709343B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418084B (en) * 2020-11-23 2022-12-16 同济大学 Three-dimensional target detection method based on point cloud time sequence information fusion
CN112560774A (en) * 2020-12-25 2021-03-26 广州文远知行科技有限公司 Obstacle position detection method, device, equipment and storage medium
CN113128348B (en) * 2021-03-25 2023-11-24 西安电子科技大学 Laser radar target detection method and system integrating semantic information
CN113658257B (en) * 2021-08-17 2022-05-27 广州文远知行科技有限公司 Unmanned equipment positioning method, device, equipment and storage medium
CN114140765B (en) * 2021-11-12 2022-06-24 北京航空航天大学 Obstacle sensing method and device and storage medium
CN115249349B (en) * 2021-11-18 2023-06-27 上海仙途智能科技有限公司 Point cloud denoising method, electronic equipment and storage medium
CN114475650B (en) * 2021-12-01 2022-11-01 中铁十九局集团矿业投资有限公司 Vehicle driving behavior determination method, device, equipment and medium
US11527085B1 (en) * 2021-12-16 2022-12-13 Motional Ad Llc Multi-modal segmentation network for enhanced semantic labeling in mapping
CN113963335B (en) * 2021-12-21 2022-03-22 山东融瓴科技集团有限公司 Road surface obstacle detection method based on image and point cloud data
CN114445572A (en) * 2021-12-29 2022-05-06 航天时代(青岛)海洋装备科技发展有限公司 Deeplab V3+ based method for instantly positioning obstacles and constructing map in unfamiliar sea area
CN114550116A (en) * 2022-02-17 2022-05-27 京东鲲鹏(江苏)科技有限公司 Object identification method and device
CN114792417B (en) * 2022-02-24 2023-06-16 广州文远知行科技有限公司 Model training method, image recognition method, device, equipment and storage medium
CN114359758B (en) * 2022-03-18 2022-06-14 广东电网有限责任公司东莞供电局 Power transmission line detection method and device, computer equipment and storage medium
CN114419412A (en) * 2022-03-31 2022-04-29 江西财经大学 Multi-modal feature fusion method and system for point cloud registration
CN114764911B (en) * 2022-06-15 2022-09-23 小米汽车科技有限公司 Obstacle information detection method, obstacle information detection device, electronic device, and storage medium
CN116382308B (en) * 2023-06-05 2023-09-05 华侨大学 Intelligent mobile machinery autonomous path finding and obstacle avoiding method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110147706A (en) * 2018-10-24 2019-08-20 腾讯科技(深圳)有限公司 The recognition methods of barrier and device, storage medium, electronic device
CN110378174A (en) * 2018-08-10 2019-10-25 北京京东尚科信息技术有限公司 Road extracting method and device
US10650278B1 (en) * 2017-07-21 2020-05-12 Apple Inc. Semantic labeling of point clouds using images

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10846817B2 (en) * 2018-11-15 2020-11-24 Toyota Research Institute, Inc. Systems and methods for registering 3D data with 2D image data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10650278B1 (en) * 2017-07-21 2020-05-12 Apple Inc. Semantic labeling of point clouds using images
CN110378174A (en) * 2018-08-10 2019-10-25 北京京东尚科信息技术有限公司 Road extracting method and device
CN110147706A (en) * 2018-10-24 2019-08-20 腾讯科技(深圳)有限公司 The recognition methods of barrier and device, storage medium, electronic device

Also Published As

Publication number Publication date
CN111709343A (en) 2020-09-25

Similar Documents

Publication Publication Date Title
CN111709343B (en) Point cloud detection method and device, computer equipment and storage medium
Xu et al. Cobevt: Cooperative bird's eye view semantic segmentation with sparse transformers
CN110570429B (en) Lightweight real-time semantic segmentation method based on three-dimensional point cloud
US20180211119A1 (en) Sign Recognition for Autonomous Vehicles
US11189007B2 (en) Real-time generation of functional road maps
CN113312983B (en) Semantic segmentation method, system, device and medium based on multi-mode data fusion
CN112287860A (en) Training method and device of object recognition model, and object recognition method and system
CN111582189A (en) Traffic signal lamp identification method and device, vehicle-mounted control terminal and motor vehicle
US20220261590A1 (en) Apparatus, system and method for fusing sensor data to do sensor translation
CN115565154A (en) Feasible region prediction method, device, system and storage medium
CN116129233A (en) Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
CN115497076A (en) High-precision and high-efficiency signal identification detection method, device and medium
CN117111055A (en) Vehicle state sensing method based on thunder fusion
CN115965970A (en) Method and system for realizing bird's-eye view semantic segmentation based on implicit set prediction
CN115273032A (en) Traffic sign recognition method, apparatus, device and medium
CN111191607A (en) Method, apparatus, and storage medium for determining steering information of vehicle
CN115131634A (en) Image recognition method, device, equipment, storage medium and computer program product
Al Mamun et al. Efficient lane marking detection using deep learning technique with differential and cross-entropy loss.
Chan et al. Raw camera data object detectors: an optimisation for automotive processing and transmission
CN115346193A (en) Parking space detection method and tracking method thereof, parking space detection device, parking space detection equipment and computer readable storage medium
Xu et al. Multi-sem fusion: multimodal semantic fusion for 3D object detection
CN114120260A (en) Method and system for identifying travelable area, computer device, and storage medium
DE102020202342A1 (en) Cloud platform for automated mobility and computer-implemented method for providing cloud-based data enrichment for automated mobility
US20230386222A1 (en) Method for detecting three-dimensional objects in roadway and electronic device
CN117422629B (en) Instance-aware monocular semantic scene completion method, medium and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231121

Address after: Building 1801, Building 1, No. 2 South Ronghua Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing, 100176

Patentee after: Wenyuan Jingxing (Beijing) Technology Co.,Ltd.

Address before: Room 687, No. 333, jiufo Jianshe Road, Zhongxin Guangzhou Knowledge City, Guangzhou, Guangdong 510555

Patentee before: GUANGZHOU WENYUAN ZHIXING TECHNOLOGY Co.,Ltd.