WO2024001093A1 - Semantic segmentation and environment perception method, apparatus, and unmanned vehicle - Google Patents

Semantic segmentation and environment perception method, apparatus, and unmanned vehicle (Download PDF)

Info

Publication number
WO2024001093A1
WO2024001093A1 (PCT/CN2022/140873; CN2022140873W)
Authority
WO
WIPO (PCT)
Prior art keywords
point cloud
dimensional image
semantic segmentation
cloud data
semantic
Prior art date
Application number
PCT/CN2022/140873
Other languages
English (en)
French (fr)
Inventor
温欣
董林
康瀚隆
许新玉
Original Assignee
北京京东乾石科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京京东乾石科技有限公司 filed Critical 北京京东乾石科技有限公司
Publication of WO2024001093A1 publication Critical patent/WO2024001093A1/zh

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/40 Extraction of image or video features
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/588 Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road

Definitions

  • The present disclosure relates to the field of computer vision technology, in particular to the field of unmanned driving, and more particularly to a semantic segmentation method, an environment perception method, an apparatus, and an unmanned vehicle.
  • unmanned driving equipment is used to automatically transport people or objects from one location to another.
  • Unmanned driving equipment collects environmental information through sensors on the equipment and completes automatic transportation.
  • Using unmanned delivery vehicles controlled by unmanned driving technology for logistics and transportation has greatly improved the convenience of production and daily life and saved labor costs.
  • Three-dimensional environment perception technology is one of the core methods in the autonomous driving technology system. This perception technology is responsible for identifying pedestrians, vehicles and other dynamic and static elements around autonomous vehicles to provide comprehensive environmental information to downstream control systems to plan driving routes and avoid static obstacles and dynamic pedestrians and vehicles.
  • The goal of the lidar-based three-dimensional semantic segmentation method is to identify the semantic category of each element in the three-dimensional scene point cloud scanned by the lidar; it is a foundational task in the overall three-dimensional environment perception technology system.
  • a technical problem to be solved by this disclosure is to provide a method, device and unmanned vehicle for semantic segmentation and environment perception.
  • A semantic segmentation method is provided, including: generating a two-dimensional image according to point cloud data; performing semantic segmentation processing on the two-dimensional image based on a first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image; generating a point cloud feature map according to the semantic segmentation result corresponding to the two-dimensional image and the point cloud data; and performing semantic segmentation processing on the point cloud feature map based on a second neural network model to obtain semantic label information of the point cloud data.
  • In some embodiments, the two-dimensional image is a depth map, and generating the two-dimensional image according to the point cloud data includes: determining two-dimensional transformed coordinates of the point cloud data in a spherical coordinate system, wherein the two-dimensional transformed coordinates include a yaw angle and a pitch angle; distributing the point cloud data into multiple grids according to the two-dimensional transformed coordinates; determining the features of each grid according to the point cloud points in that grid; and constructing the depth map from the features of all grids.
  • the semantic segmentation result corresponding to the two-dimensional image includes semantic label information corresponding to each point in the two-dimensional image and a feature representation corresponding to each point in the two-dimensional image.
  • Generating the point cloud feature map according to the semantic segmentation result corresponding to the two-dimensional image and the point cloud data includes: determining the point cloud point in the point cloud data that matches each point in the two-dimensional image; splicing the semantic label information corresponding to each point in the two-dimensional image, the feature representation corresponding to each point in the two-dimensional image, and the coordinates of the matching point cloud point, to obtain a spliced feature representation corresponding to each point in the two-dimensional image; and constructing the point cloud feature map according to the spliced feature representations corresponding to the points in the two-dimensional image.
  • Performing semantic segmentation processing on the two-dimensional image based on the first neural network model to obtain the semantic segmentation result corresponding to the two-dimensional image includes: performing feature extraction on the two-dimensional image based on an encoder module, and outputting the obtained feature map to a decoder module; and decoding the feature map based on the decoder module to obtain the semantic segmentation result corresponding to the two-dimensional image.
  • Performing feature extraction on the two-dimensional image based on the encoder module and outputting the obtained feature map to the decoder module includes: performing feature extraction on the two-dimensional image based on a first encoding unit, and outputting the feature map extracted by the first encoding unit to a second encoding unit and to the decoder module; and performing feature extraction on the feature map output by the first encoding unit based on the second encoding unit, and outputting the feature map extracted by the second encoding unit to the decoder module, wherein the first encoding unit and the second encoding unit have different structures.
  • At least one of the first coding unit and the second coding unit includes a plurality of convolutional layers, and at least one of the plurality of convolutional layers uses a strip-shaped convolution kernel.
  • The first coding unit includes first to third convolutional layers arranged sequentially from the input side to the output side, wherein the first and third convolutional layers use square convolution kernels and the second convolutional layer uses a strip-shaped convolution kernel.
  • the decoder module includes a plurality of decoding units arranged sequentially from the input side to the output side, and a semantic label classification layer.
  • the decoding unit includes an upsampling layer and a plurality of convolutional layers.
  • The second neural network model includes multiple convolutional layers and a semantic label classification layer, wherein the convolutional layers use 1*1 convolution kernels.
  • An environment sensing method is provided, including: acquiring point cloud data collected by an unmanned vehicle; determining semantic label information of the point cloud data according to the above-mentioned semantic segmentation method; and determining, according to the semantic label information of the point cloud data, environment information of the environment in which the unmanned vehicle is located.
  • A semantic segmentation device is provided, including: a first generation module configured to generate a two-dimensional image based on point cloud data; a first segmentation module configured to perform semantic segmentation processing on the two-dimensional image based on a first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image; a second generation module configured to generate a point cloud feature map according to the semantic segmentation result of the two-dimensional image and the point cloud data; and a second segmentation module configured to perform semantic segmentation processing on the point cloud feature map based on a second neural network model to obtain semantic label information of the point cloud data.
  • the two-dimensional image is a depth map
  • The first generation module is configured to: determine the two-dimensional transformed coordinates of the point cloud data in a spherical coordinate system, wherein the two-dimensional transformed coordinates include a yaw angle and a pitch angle; allocate the point cloud data to multiple grids according to the two-dimensional transformed coordinates; determine the features of each grid according to the point cloud points in that grid; and construct the depth map according to the features of all grids.
  • The second generation module is configured to: determine the point cloud point in the point cloud data that matches each point in the two-dimensional image; splice the semantic label information corresponding to each point in the two-dimensional image, the feature representation corresponding to each point in the two-dimensional image, and the coordinates of the matching point cloud point, to obtain a spliced feature representation corresponding to each point in the two-dimensional image; and construct the point cloud feature map according to the spliced feature representations corresponding to the points in the two-dimensional image.
  • another semantic segmentation device including: a memory; and a processor coupled to the memory, and the processor is configured to execute the above semantic segmentation method based on instructions stored in the memory.
  • An environment sensing device is provided, including: an acquisition module configured to acquire point cloud data collected by an unmanned vehicle; the above-mentioned semantic segmentation device; and a determination module configured to determine, according to the semantic label information of the point cloud data, environment information of the environment in which the unmanned vehicle is located.
  • an environment sensing device including: a memory; and a processor coupled to the memory, the processor being configured to execute the above environment sensing method based on instructions stored in the memory.
  • A computer-readable storage medium is provided, on which computer program instructions are stored; when the instructions are executed by a processor, the above-mentioned semantic segmentation method or the above-mentioned environment sensing method is implemented.
  • an unmanned vehicle including the above-mentioned semantic segmentation device or environment sensing device.
  • a computer program including: instructions that, when executed by a processor, cause the processor to perform the above-mentioned semantic segmentation method, or the above-mentioned environment perception method.
  • Figure 1 is a schematic flowchart of a semantic segmentation method according to some embodiments of the present disclosure
  • Figure 2 is a schematic flowchart of generating a two-dimensional image based on point cloud data according to some embodiments of the present disclosure
  • Figure 3 is a schematic flowchart of semantic segmentation based on the first neural network model according to some embodiments of the present disclosure
  • Figure 4a is a schematic structural diagram of a first neural network model according to some embodiments of the present disclosure.
  • Figure 4b is a schematic structural diagram of a first coding unit according to some embodiments of the present disclosure.
  • Figure 4c is a schematic structural diagram of a second coding unit according to some embodiments of the present disclosure.
  • Figure 4d is a schematic structural diagram of a decoding unit according to some embodiments of the present disclosure.
  • Figure 5 is a schematic flowchart of an environment sensing method according to some embodiments of the present disclosure.
  • Figure 6 is a schematic structural diagram of a semantic segmentation device according to some embodiments of the present disclosure.
  • Figure 7 is a schematic structural diagram of an environment sensing device according to some embodiments of the present disclosure.
  • Figure 8 is a schematic structural diagram of a semantic segmentation device or environment sensing device according to some embodiments of the present disclosure.
  • Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure.
  • Figure 10 is a schematic structural diagram of an autonomous vehicle according to some embodiments of the present disclosure.
  • Figure 11 is a schematic three-dimensional structural diagram of an autonomous vehicle according to some embodiments of the present disclosure.
  • any specific values are to be construed as illustrative only and not as limiting. Accordingly, other examples of the exemplary embodiments may have different values.
  • In three-dimensional point cloud semantic segmentation methods in the related art, in order to meet the real-time requirements of engineering applications, the three-dimensional lidar point cloud is usually converted into a depth map (range view) based on the spherical projection principle; after a two-dimensional convolutional neural network performs semantic segmentation on the depth map, the obtained semantic label information is projected back onto the original three-dimensional lidar point cloud.
  • The main problem with the above method is that the process of projecting three-dimensional point cloud data into a depth map is usually accompanied by information loss. For example, when several three-dimensional point cloud points are projected onto the same pixel of the depth map, the distinction between those points is lost, which reduces the accuracy of the segmentation results.
  • Figure 1 is a schematic flowchart of a semantic segmentation method according to some embodiments of the present disclosure. As shown in Figure 1, the semantic segmentation method of the embodiment of the present disclosure includes:
  • Step S110 Generate a two-dimensional image based on the point cloud data.
  • the two-dimensional image is a depth map (Range View).
  • a depth map is generated based on the point cloud data.
  • the two-dimensional image is a bird's eye view (Bird's Eye View, BEV for short), and in step S110, the bird's eye view is generated based on the point cloud data.
  • the point cloud data is point cloud data collected by lidar.
  • point cloud data is collected through the lidar installed on the autonomous vehicle.
  • Step S120 Perform semantic segmentation processing on the two-dimensional image based on the first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image.
  • the first neural network model is a deep neural network model, including an encoder module and a decoder module.
  • the semantic segmentation result of the two-dimensional image includes semantic label information corresponding to each point in the two-dimensional image and feature representation corresponding to each point in the two-dimensional image.
  • semantic label information includes label information such as pedestrians, vehicles, lanes, and sidewalks.
  • Illustratively, the feature representation corresponding to each point in the two-dimensional image is the feature vector for that point output after processing by the first neural network model.
  • Step S130 Generate a point cloud feature map based on the semantic segmentation result corresponding to the two-dimensional image and the point cloud data.
  • the semantic segmentation result of the two-dimensional image includes semantic label information corresponding to each point in the two-dimensional image and feature representation corresponding to each point in the two-dimensional image.
  • step S130 includes steps S131 to S133.
  • Step S131 Determine the point cloud points in the point cloud data that match each point in the two-dimensional image.
  • In some embodiments, after the two-dimensional image is generated in step S110, the mapping relationship between the points in the two-dimensional image and the point cloud points in the point cloud data is saved.
  • In step S131, based on this mapping relationship, the point cloud point in the point cloud data that matches each point in the two-dimensional image is determined.
  • Step S132 Splice the semantic label information corresponding to each point in the two-dimensional image, the feature representation corresponding to each point in the two-dimensional image, and the coordinates of the matching point cloud point, to obtain a spliced feature representation corresponding to each point in the two-dimensional image.
  • In some embodiments, the splicing is performed in the following order: the semantic label information corresponding to each point in the two-dimensional image, then the feature representation corresponding to that point, then the coordinates of the matching point cloud point, thereby obtaining the spliced feature representation.
  • For example, suppose the semantic label information corresponding to any point in the two-dimensional image is represented by a vector Aij, the feature representation corresponding to that point is represented by a vector Bij, and the coordinates of the matching point cloud point are represented by a vector Cij; then the spliced feature representation is represented by a vector Dij, where Dij = (Aij, Bij, Cij).
  • In other embodiments, the semantic label information corresponding to each point in the two-dimensional image, the feature representation corresponding to each point in the two-dimensional image, and the coordinates of the matching point cloud point can also be spliced in another order to obtain the spliced feature representation.
  • Step S133 Construct a point cloud feature map based on the spliced feature representation corresponding to each point in the two-dimensional image.
  • the entirety of the spliced feature representations corresponding to all points in the two-dimensional image is used as a point cloud feature map.
  • the point cloud feature map can be expressed in matrix form.
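  • As an illustration of steps S131 to S133, the following sketch (Python with NumPy; the variable names, array shapes, and the handling of empty pixels are assumptions, not part of the disclosure) splices the per-pixel label information Aij, the per-pixel feature representation Bij, and the matched point coordinates Cij into Dij and stacks the results into a point cloud feature map.

```python
import numpy as np

def build_point_feature_map(pixel_labels, pixel_features, pixel_to_point, points_xyz):
    """Splice Aij (label info), Bij (features) and Cij (matched point coordinates)
    into Dij = (Aij, Bij, Cij), one row per pixel that has a matching point."""
    # pixel_labels:   (H, W, C) per-pixel semantic label scores from the first model
    # pixel_features: (H, W, F) per-pixel feature vectors from the first model
    # pixel_to_point: (H, W)    index of the matching point cloud point, -1 if empty
    # points_xyz:     (N, 3)    coordinates of the original point cloud points
    rows = []
    H, W = pixel_to_point.shape
    for v in range(H):
        for u in range(W):
            idx = pixel_to_point[v, u]
            if idx < 0:                      # no lidar point fell into this pixel
                continue
            dij = np.concatenate([pixel_labels[v, u],      # Aij
                                  pixel_features[v, u],    # Bij
                                  points_xyz[idx]])        # Cij
            rows.append(dij)
    return np.stack(rows)                    # point cloud feature map in matrix form
```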
  • Step S140 Perform semantic segmentation processing on the point cloud feature map based on the second neural network model to obtain semantic label information of the point cloud data.
  • the second neural network model includes multiple convolutional layers and a semantic label classification layer, where the convolutional layer uses a convolution kernel with a size of 1*1. Compared with the structure of the first neural network model, the structure of the second neural network model is more lightweight.
  • In other embodiments, the second neural network model includes, arranged in sequence from the input end to the output end, a first convolution layer, a batch normalization (BN) layer, an activation function layer, a second convolution layer, and a semantic label classification layer. The first convolution layer and the second convolution layer each use convolution kernels with a size of 1*1; the activation function layer uses the Relu function; and the semantic label classification layer uses the Softmax function.
  • the second neural network model may also adopt other network structures capable of realizing the point cloud semantic segmentation function.
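  • As a hedged illustration only, the following PyTorch-style sketch shows a second neural network model of the kind described above (1*1 convolution, batch normalization, ReLU, 1*1 convolution, Softmax); the channel sizes and the way the point cloud feature map is shaped into a tensor are assumptions, not values taken from the disclosure.

```python
import torch.nn as nn

class PointHead(nn.Module):
    """Lightweight second model: 1*1 conv -> BN -> ReLU -> 1*1 conv -> Softmax."""
    def __init__(self, in_channels, hidden_channels=64, num_classes=20):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1),   # first 1*1 convolution layer
            nn.BatchNorm2d(hidden_channels),                          # batch normalization layer
            nn.ReLU(inplace=True),                                    # activation function layer
            nn.Conv2d(hidden_channels, num_classes, kernel_size=1),   # second 1*1 convolution layer
            nn.Softmax(dim=1),                                        # semantic label classification layer
        )

    def forward(self, x):
        # x: (B, in_channels, N, 1), i.e. the point cloud feature map treated as a 1-pixel-wide image
        return self.layers(x)
```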
  • In the embodiments of the present disclosure, after the two-dimensional image generated from the point cloud data is semantically segmented based on the first neural network model, a point cloud feature map is generated according to the semantic segmentation result and the point cloud data, and semantic segmentation is performed on the point cloud feature map based on the second neural network model. This can reduce processing time while maintaining the accuracy of point cloud semantic segmentation, improve the real-time performance of point cloud semantic segmentation, and meet the safety and real-time requirements of autonomous driving.
  • Figure 2 is a schematic flowchart of generating a two-dimensional image based on point cloud data according to some embodiments of the present disclosure. As shown in Figure 2, the process of generating a two-dimensional image based on point cloud data in an embodiment of the present disclosure includes:
  • Step S111 Determine the two-dimensional transformation coordinates of the point cloud data in the spherical coordinate system.
  • the point cloud data is point cloud data collected by lidar.
  • Illustratively, the point cloud data collected by the lidar includes, for each point cloud point, its three-dimensional coordinates (x, y, z) and attribute data such as the reflection intensity (remission) of the point and the distance (depth) from the point to the origin of the coordinate system.
  • In step S111, the two-dimensional transformed coordinates of each point in the spherical coordinate system are calculated from the three-dimensional coordinates of the point cloud point.
  • the two-dimensional conversion coordinates include yaw angle and pitch angle.
  • The definitions of the yaw angle and pitch angle are as follows:
  • yaw = arctan(y/x)
  • pitch = arcsin(y/x)
  • where yaw is the yaw angle, pitch is the pitch angle, x is the coordinate of the point cloud point on the X axis, and y is the coordinate of the point cloud point on the Y axis.
  • Step S112 Distribute the point cloud data into multiple grids according to the two-dimensional transformation coordinates.
  • multiple grids are set according to the value range of the two-dimensional transformation coordinates corresponding to the point cloud data.
  • For example, the number of horizontal grids is determined based on the value range of the yaw angle corresponding to the point cloud data and the horizontal size of each grid, and the number of vertical grids is determined based on the value range of the pitch angle corresponding to the point cloud data and the vertical size of each grid. The formulas are as follows:
  • n = (max_yaw - min_yaw) / weight
  • m = (max_pitch - min_pitch) / height
  • where n is the number of horizontal grids, m is the number of vertical grids, max_yaw is the maximum yaw angle corresponding to the point cloud data, min_yaw is the minimum yaw angle corresponding to the point cloud data, weight is the horizontal size of each grid, max_pitch is the maximum pitch angle corresponding to the point cloud data, min_pitch is the minimum pitch angle corresponding to the point cloud data, and height is the vertical size of each grid.
  • After the grids are set, the point cloud data are projected into the grids according to their two-dimensional transformed coordinates and the grid size.
  • Step S113 Determine the characteristics of each grid based on the point cloud points in each grid.
  • the characteristics of the point cloud point with the smallest distance (depth) to the origin in each grid are taken as the characteristics of the grid.
  • The features of a grid can be expressed in the form of a feature vector. For example, the three-dimensional coordinates x, y, z of the point cloud point with the smallest distance to the origin, its reflection intensity remission, and its distance depth to the origin are spliced to obtain the feature vector of the grid, (x, y, z, remission, depth).
  • Step S114 Construct a depth map based on the characteristics of all meshes.
  • a feature matrix composed of feature vectors of all meshes is used as a depth map.
  • the point cloud data can be mapped into a depth map through the above steps to facilitate subsequent semantic segmentation based on the depth map.
  • Compared with mapping the point cloud data into a BEV image, this can further improve the processing speed of semantic segmentation of 3D point cloud data and meet the real-time requirements of autonomous driving scenarios. A sketch of this projection is given below.
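  • The following Python sketch illustrates steps S111 to S114 (spherical projection, grid allocation, per-grid feature selection, depth map construction). It is a minimal sketch under stated assumptions: the grid counts are fixed to illustrative values instead of being derived from the formulas above, and the pitch is computed with the conventional arcsin(z/depth) form rather than the arcsin(y/x) expression printed in the text.

```python
import numpy as np

def point_cloud_to_range_view(points, remission, n_rows=16, n_cols=640):
    """Project a lidar point cloud into a (n_rows, n_cols, 5) depth map and keep the
    pixel-to-point mapping needed later in step S131."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)

    # Step S111: two-dimensional transformed coordinates in the spherical coordinate system.
    yaw = np.arctan2(y, x)
    pitch = np.arcsin(z / np.maximum(depth, 1e-6))   # assumption: conventional spherical form

    # Step S112: distribute the points into n_cols * n_rows grids (fixed counts here,
    # standing in for n = (max_yaw - min_yaw)/weight and m = (max_pitch - min_pitch)/height).
    u = ((yaw - yaw.min()) / (np.ptp(yaw) + 1e-6) * (n_cols - 1)).astype(int)
    v = ((pitch - pitch.min()) / (np.ptp(pitch) + 1e-6) * (n_rows - 1)).astype(int)

    # Step S113: keep the features of the closest point per grid by writing points in
    # order of decreasing depth, so the smallest-depth point ends up stored last.
    range_view = np.zeros((n_rows, n_cols, 5), dtype=np.float32)
    pixel_to_point = np.full((n_rows, n_cols), -1, dtype=np.int64)
    for i in np.argsort(-depth):
        range_view[v[i], u[i]] = (x[i], y[i], z[i], remission[i], depth[i])
        pixel_to_point[v[i], u[i]] = i

    # Step S114: the feature matrix of all grids is the depth map.
    return range_view, pixel_to_point
```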
  • Figure 3 is a schematic flowchart of semantic segmentation based on the first neural network model according to some embodiments of the present disclosure. As shown in Figure 3, the process of semantic segmentation based on the first neural network model in this embodiment of the present disclosure includes:
  • Step S131 Extract features from the two-dimensional image based on the encoder module, and output the obtained feature map to the decoder module.
  • the first semantic segmentation network model includes an encoder module and a decoder module.
  • In some embodiments, the network structure of the first semantic segmentation network model is based on the large-model design ideas of ConvNeXt (a network model). To meet the real-time requirements of autonomous driving, the network structure has been miniaturized: the number of internal network layers has been pruned, and strip-shaped convolution kernels have been designed and used according to the characteristics of the depth map. This makes the processing speed faster than ConvNeXt while still meeting the accuracy requirements for use on unmanned vehicles.
  • the encoder module includes a first coding unit and a second coding unit with different structures, wherein the first coding unit is mainly used for extracting low-level features, and the second coding unit is mainly used for extracting high-level features.
  • step S131 includes step S1311 and step S1312.
  • Step S1311 Extract features from the two-dimensional image based on the first coding unit, and output the feature map obtained by the first coding unit to the second coding unit and decoder module.
  • Step S1312 Perform feature extraction on the feature map output by the first coding unit based on the second coding unit, and output the feature map obtained by the second coding unit to the decoder module.
  • the first coding unit includes a plurality of convolutional layers, and at least one of the convolutional layers uses a strip convolution kernel.
  • the second coding unit includes a plurality of convolutional layers, and at least one of the convolutional layers uses a strip convolution kernel.
  • the long strip convolution kernel refers to the convolution kernel whose horizontal size and vertical size are not equal.
  • a convolution kernel with a size of 5*9 is a strip convolution kernel
  • a convolution kernel with a size of 1*1 is a square convolution kernel.
  • Step S132 Decode the feature map based on the decoder module to obtain the semantic segmentation result corresponding to the two-dimensional image.
  • the decoder module includes a plurality of decoding units arranged sequentially from the input side to the output side, and a semantic label classification layer.
  • the input feature map is decoded layer by layer through multiple decoding units, and then semantic label prediction is performed through the semantic label classification layer.
  • In the embodiments of the present disclosure, the depth map can be quickly semantically segmented based on the first neural network model to meet the real-time requirements of the autonomous driving scenario. Furthermore, by projecting the semantic segmentation result of the depth map onto the point cloud data, constructing the point cloud feature map accordingly, and performing semantic segmentation on the point cloud feature map based on the second neural network model, the accuracy of point cloud semantic segmentation can be improved while the real-time requirements are still met.
  • Figure 4a is a schematic structural diagram of a first neural network model according to some embodiments of the present disclosure.
  • the first neural network model in the embodiment of the present disclosure includes: an encoder module 410 and a decoder module 420.
  • the encoder module 410 includes a first encoding unit 411 and a second encoding unit 412.
  • the decoder module 420 includes a plurality of decoding units 421 and a semantic label classification layer (not shown in the figure).
  • the semantic label classification layer may consist of a single layer of convolutions.
  • the encoder module 410 includes one first encoding unit 411 and three second encoding units 412 arranged sequentially from the input side to the output side; the decoder module includes three decoding units 421.
  • The three second coding units from top to bottom in Figure 4a are called coding unit e1, coding unit e2, and coding unit e3, and the three decoding units from top to bottom in Figure 4a are called decoding unit d1, decoding unit d2, and decoding unit d3.
  • the first encoding unit 411 performs feature extraction on the input depth map, and outputs the obtained feature map 1 to the encoding unit e1 and the decoding unit d1.
  • the encoding unit e1 performs feature extraction on the input feature map 1, and outputs the obtained feature map 2 to the encoding unit e2 and the decoding unit d2.
  • the encoding unit e2 performs feature extraction on the input feature map 2, and outputs the obtained feature map 3 to the encoding unit e3 and the decoding unit d3.
  • the encoding unit e3 performs feature extraction on the input feature map 3, and outputs the obtained feature map 4 to the decoding unit d3.
  • The decoding unit d3 decodes the input feature map 3 and feature map 4, and outputs the processed feature map 5 to the decoding unit d2; the decoding unit d2 decodes the input feature map 2 and feature map 5, and outputs the processed feature map 6 to the decoding unit d1; the decoding unit d1 decodes the input feature map 1 and feature map 6, and outputs the processed feature map 7 to the semantic label classification layer to obtain the semantic label information corresponding to the depth map. Feature map 7 is used as the feature map finally output by the first neural network model.
  • the resolution of the input depth map is W*H and the feature dimension is 5.
  • Illustratively, the resolutions and output feature dimensions of the feature maps produced by the units in the encoder module and the decoder module satisfy the following: the feature map output by the first encoding unit has a resolution of W*H and a feature dimension of 32; the feature map output by the first second coding unit (i.e., coding unit e1) has a resolution of W/2*H/2; the feature map output by the second second coding unit (i.e., coding unit e2) has a resolution of W/4*H/4 and a feature dimension of 64; the feature map output by the third second coding unit (i.e., coding unit e3) has a resolution of W/8*H/8 and a feature dimension of 128; the feature map output by the first decoding unit (i.e., decoding unit d3) has a resolution of W/4*H/4 and a feature dimension of 64; the feature map output by the second decoding unit (i.e., decoding unit d2) has a resolution of W/2*H/2 and a feature dimension of 32; and the feature map output by the third decoding unit (i.e., decoding unit d1) has a resolution of W*H and a feature dimension of 32.
  • W and H are integers greater than 1, for example, W is 16 and H is 640.
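  • A hedged PyTorch-style sketch of the Figure 4a wiring is given below. The skip connections and the stated resolutions and feature dimensions follow the description above; the internal structure of each unit is replaced by a generic convolution block (the individual units are sketched separately further below), and the output width of coding unit e1 is an assumption.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout, stride=1):
    # Placeholder for the encoding units detailed in Figures 4b/4c.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=stride, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU(inplace=True))

class UpBlock(nn.Module):
    # Placeholder for the decoding unit of Figure 4d: upsample, concatenate the skip, convolve.
    def __init__(self, cin, cout):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = conv_block(cin, cout)
    def forward(self, deep, skip):
        return self.conv(torch.cat([self.up(deep), skip], dim=1))

class RangeViewNet(nn.Module):
    """Figure 4a wiring: first encoding unit, coding units e1-e3, decoding units d3-d1
    with skip connections, and a single-convolution semantic label classification layer."""
    def __init__(self, in_ch=5, num_classes=20):
        super().__init__()
        self.enc0 = conv_block(in_ch, 32)          # feature map 1: W*H, 32
        self.e1 = conv_block(32, 32, stride=2)     # feature map 2: W/2*H/2 (width assumed)
        self.e2 = conv_block(32, 64, stride=2)     # feature map 3: W/4*H/4, 64
        self.e3 = conv_block(64, 128, stride=2)    # feature map 4: W/8*H/8, 128
        self.d3 = UpBlock(128 + 64, 64)            # feature map 5: W/4*H/4, 64
        self.d2 = UpBlock(64 + 32, 32)             # feature map 6: W/2*H/2, 32
        self.d1 = UpBlock(32 + 32, 32)             # feature map 7: W*H, 32
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)

    def forward(self, depth_map):                  # depth_map: (B, 5, H, W), H and W divisible by 8
        f1 = self.enc0(depth_map)
        f2 = self.e1(f1)
        f3 = self.e2(f2)
        f4 = self.e3(f3)
        f5 = self.d3(f4, f3)
        f6 = self.d2(f5, f2)
        f7 = self.d1(f6, f1)
        return self.classifier(f7), f7             # per-pixel label scores and the final feature map
```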
  • Figure 4b is a schematic structural diagram of a first coding unit according to some embodiments of the present disclosure.
  • the first encoding unit 411 in the embodiment of the present disclosure includes: a first convolution layer 4111, a second convolution layer 4112, and a third convolution layer 4113.
  • The first convolution layer 4111 uses a convolution kernel with a size of 1*1, the second convolution layer 4112 uses a convolution kernel with a size of 5*9, and the third convolution layer 4113 uses a convolution kernel with a size of 1*1.
  • the second convolution layer adopts depth-wise convolution method.
  • One convolution kernel in the depth-wise convolution method is responsible for one channel of the input image, and one channel of the input image is convolved by only one convolution kernel.
  • In some embodiments, a Relu layer (Relu is an activation function) is also provided between the first convolution layer and the second convolution layer, a batch normalization layer is provided between the second convolution layer and the third convolution layer, and a batch normalization layer and a Relu layer are provided after the third convolution layer.
  • the first coding unit adopts a combination of a convolution layer based on a 1 ⁇ 1 convolution kernel, a convolution layer based on a 5 ⁇ 9 depth convolution kernel, and a convolution layer based on a 1 ⁇ 1 convolution kernel.
  • the processing speed can be improved without losing the perceptual range and accuracy.
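  • A minimal PyTorch-style sketch of the first coding unit of Figure 4b follows, assuming the layer ordering reconstructed above (1*1 convolution, ReLU, 5*9 depth-wise convolution, batch normalization, 1*1 convolution, batch normalization, ReLU); the channel counts are assumptions.

```python
import torch.nn as nn

class FirstEncodingUnit(nn.Module):
    """Figure 4b: 1*1 conv, ReLU, 5*9 depth-wise conv, BN, 1*1 conv, BN, ReLU."""
    def __init__(self, in_ch=5, out_ch=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),                  # first convolution layer 4111
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=(5, 9),
                      padding=(2, 4), groups=out_ch),                 # strip-shaped depth-wise kernel (4112)
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),                 # third convolution layer 4113
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```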
  • Figure 4c is a schematic structural diagram of a second coding unit according to some embodiments of the present disclosure.
  • The second encoding unit 412 of the embodiment of the present disclosure includes: a first convolution layer 4121, a second convolution layer 4122, a third convolution layer 4123, a fourth convolution layer 4124, and an average pooling layer 4125.
  • The first convolution layer 4121 uses a convolution kernel with a size of 1*1, the second convolution layer 4122 uses a convolution kernel with a size of 3*11, the third convolution layer 4123 uses a convolution kernel with a size of 1*1, and the fourth convolution layer 4124 uses a convolution kernel with a size of 1*1.
  • In some embodiments, a Relu layer is provided between the first convolution layer 4121 and the second convolution layer 4122, a batch normalization layer is provided between the second convolution layer 4122 and the third convolution layer 4123, a Relu layer is provided between the third convolution layer 4123 and the fourth convolution layer 4124, and a batch normalization layer and a Relu layer are provided between the fourth convolution layer 4124 and the average pooling layer 4125.
  • the second convolution layer 4122 adopts a depth-wise convolution method.
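  • A corresponding sketch of the second coding unit of Figure 4c follows; the channel counts, the exact placement of the final batch normalization and ReLU, and the pooling stride are assumptions.

```python
import torch.nn as nn

class SecondEncodingUnit(nn.Module):
    """Figure 4c: 1*1 conv, ReLU, 3*11 depth-wise conv, BN, 1*1 conv, ReLU,
    1*1 conv, BN, ReLU, average pooling."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1),                  # first convolution layer 4121
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=(3, 11),
                      padding=(1, 5), groups=out_ch),                 # strip-shaped depth-wise kernel (4122)
            nn.BatchNorm2d(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),                 # third convolution layer 4123
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=1),                 # fourth convolution layer 4124
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2, stride=2),                    # average pooling layer 4125, halves W and H
        )

    def forward(self, x):
        return self.block(x)
```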
  • Figure 4d is a schematic structural diagram of a decoding unit according to some embodiments of the present disclosure.
  • the decoding unit 421 of the embodiment of the present disclosure includes: an upsampling layer 4211, a first convolutional layer 4212, a second convolutional layer 4213, a third convolutional layer 4214, and a semantic label classification layer 4215.
  • the upsampling layer 4211 is a pixel shuffle layer.
  • PixelShuffle is an upsampling method whose main function is to obtain high-resolution feature maps through convolution and reorganization between multiple channels from low-resolution feature maps. PixelShuffle can effectively enlarge the reduced feature map and can replace interpolation or deconvolution methods to achieve upsampling.
  • The first convolution layer 4212 uses a convolution kernel with a size of 3*3, the second convolution layer 4213 uses a convolution kernel with a size of 1*1, and the third convolution layer 4214 uses a convolution kernel with a size of 1*1.
  • This combination of convolution kernels helps reduce computational complexity and redundancy within the same perception range, which helps increase the processing speed.
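  • The decoding unit of Figure 4d can be sketched as follows; the PixelShuffle upscale factor, the handling of the skip connection, and the channel sizes are assumptions based on Figure 4a.

```python
import torch
import torch.nn as nn

class DecodingUnit(nn.Module):
    """Figure 4d: PixelShuffle upsampling followed by 3*3, 1*1 and 1*1 convolutions."""
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.PixelShuffle(upscale_factor=2)   # upsampling layer 4211: channels /4, resolution x2
        self.convs = nn.Sequential(
            nn.Conv2d(deep_ch // 4 + skip_ch, out_ch, kernel_size=3, padding=1),   # first convolution layer 4212
            nn.Conv2d(out_ch, out_ch, kernel_size=1),                              # second convolution layer 4213
            nn.Conv2d(out_ch, out_ch, kernel_size=1),                              # third convolution layer 4214
        )

    def forward(self, deep, skip):
        # deep: feature map from the previous (deeper) unit; skip: feature map from the encoder
        return self.convs(torch.cat([self.up(deep), skip], dim=1))
```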
  • FIG. 5 is a schematic flowchart of an environment sensing method according to some embodiments of the present disclosure. As shown in Figure 5, the environment sensing method according to the embodiment of the present disclosure includes:
  • Step S510 Obtain the point cloud data collected by the unmanned vehicle.
  • point cloud data is collected through a lidar installed on the unmanned vehicle, and the point cloud data collected by the lidar is transmitted to the environment sensing device.
  • Step S520 Generate a two-dimensional image based on the point cloud data.
  • the point cloud data is mapped based on the method shown in Figure 2 to obtain a depth map.
  • Step S530 Perform semantic segmentation processing on the two-dimensional image based on the first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image.
  • Step S540 Generate a point cloud feature map based on the semantic segmentation result corresponding to the two-dimensional image and the point cloud data.
  • Step S550 Perform semantic segmentation processing on the point cloud feature map based on the second neural network model to obtain semantic label information of the point cloud data.
  • Step S560 Determine the environment information where the unmanned vehicle is located based on the semantic label information of the point cloud data.
  • For example, when the semantic label information of the point cloud data includes pedestrian category labels and vehicle category labels, it is determined that the environment in which the unmanned vehicle is located includes pedestrians and vehicles.
  • various dynamic and static elements in the scene can be further subdivided and identified, such as further subdividing and identifying the vehicle.
  • the above steps achieve accurate and real-time perception of the environment in which the unmanned vehicle is located, which can meet the safety and real-time requirements of autonomous driving.
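  • The following sketch ties the earlier sketches together into the environment perception flow of steps S510 to S560; all helper names (point_cloud_to_range_view, build_point_feature_map, RangeViewNet, PointHead) refer to the illustrative sketches above and are not names used in the disclosure.

```python
import numpy as np
import torch

def perceive_environment(points, remission, range_view_net, point_head, class_names):
    """Steps S510-S560: project, segment the depth map, build the point cloud feature map,
    classify every point, and summarize what surrounds the unmanned vehicle."""
    # S520: generate the two-dimensional image (depth map) from the point cloud data.
    range_view, pixel_to_point = point_cloud_to_range_view(points, remission)

    # S530: semantic segmentation of the depth map with the first neural network model.
    rv = torch.from_numpy(range_view).permute(2, 0, 1).unsqueeze(0).float()
    with torch.no_grad():
        labels_2d, features_2d = range_view_net(rv)

    # S540: splice label info, features and point coordinates into a point cloud feature map.
    feature_map = build_point_feature_map(
        labels_2d.squeeze(0).permute(1, 2, 0).numpy(),
        features_2d.squeeze(0).permute(1, 2, 0).numpy(),
        pixel_to_point,
        points[:, :3],
    )

    # S550: per-point semantic labels from the second neural network model.
    x = torch.from_numpy(feature_map).float().T.unsqueeze(0).unsqueeze(-1)   # (1, C, N, 1)
    with torch.no_grad():
        point_labels = point_head(x).argmax(dim=1).squeeze().numpy()

    # S560: environment information of the environment in which the unmanned vehicle is located.
    return {class_names[int(c)] for c in np.unique(point_labels)}
```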
  • Figure 6 is a schematic structural diagram of a semantic segmentation device according to some embodiments of the present disclosure.
  • the semantic segmentation device 600 in the embodiment of the present disclosure includes a first generation module 610 , a first segmentation module 620 , a second generation module 630 , and a second segmentation module 640 .
  • the first generation module 610 is configured to generate a two-dimensional image according to point cloud data.
  • the two-dimensional image is a depth map (Range View), and the first generation module 610 generates the depth map according to the point cloud data.
  • the two-dimensional image is a Bird's Eye View (BEV), and the first generation module 610 generates the Bird's Eye View based on the point cloud data.
  • In some embodiments, the first generation module 610 generating a depth map according to the point cloud data includes: the first generation module 610 determines the two-dimensional transformed coordinates of the point cloud data in a spherical coordinate system, where the two-dimensional transformed coordinates include a yaw angle and a pitch angle; the first generation module 610 distributes the point cloud data into multiple grids according to the two-dimensional transformed coordinates; the first generation module 610 determines the features of each grid according to the point cloud points in that grid; and the first generation module 610 constructs the depth map based on the features of all grids.
  • the first segmentation module 620 is configured to perform semantic segmentation processing on the two-dimensional image based on the first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image.
  • the first neural network model includes an encoder module and a decoder module.
  • In some embodiments, the first segmentation module 620 is configured to: perform feature extraction on the two-dimensional image based on the encoder module and output the resulting feature map to the decoder module; and decode the feature map based on the decoder module to obtain the semantic segmentation result corresponding to the two-dimensional image.
  • the second generation module 630 is configured to generate a point cloud feature map according to the semantic segmentation result of the two-dimensional image and the point cloud data.
  • the semantic segmentation results of the two-dimensional image include: semantic label information corresponding to each point in the two-dimensional image, and feature representation corresponding to each point in the two-dimensional image.
  • In some embodiments, the second generation module 630 generating the point cloud feature map according to the semantic segmentation result of the two-dimensional image and the point cloud data includes: the second generation module 630 determines the point cloud point in the point cloud data that matches each point in the two-dimensional image; the second generation module 630 splices the semantic label information corresponding to each point in the two-dimensional image, the feature representation corresponding to each point in the two-dimensional image, and the coordinates of the matching point cloud point to obtain a spliced feature representation corresponding to each point in the two-dimensional image; and the second generation module 630 constructs the point cloud feature map based on the spliced feature representations corresponding to the points in the two-dimensional image.
  • the second segmentation module 640 is configured to perform semantic segmentation processing on the point cloud feature map based on the second neural network model to obtain semantic label information of the point cloud data.
  • the above device can reduce processing time consumption while maintaining the accuracy of point cloud semantic segmentation, improve the real-time performance of point cloud semantic segmentation, and meet the safety and real-time requirements of autonomous driving.
  • Figure 7 is a schematic structural diagram of an environment sensing device according to some embodiments of the present disclosure.
  • the environment sensing device 700 in the embodiment of the present disclosure includes: an acquisition module 710 , a semantic segmentation device 720 , and a determination module 730 .
  • the acquisition module 710 is configured to acquire point cloud data collected by the unmanned vehicle.
  • The semantic segmentation device 720 is configured to: generate a two-dimensional image based on the point cloud data; perform semantic segmentation processing on the two-dimensional image based on the first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image; generate a point cloud feature map based on the semantic segmentation result corresponding to the two-dimensional image and the point cloud data; and perform semantic segmentation processing on the point cloud feature map based on the second neural network model to obtain the semantic label information of the point cloud data.
  • the determination module 730 is configured to determine the environment information where the unmanned vehicle is located based on the semantic label information of the point cloud data.
  • the above equipment realizes accurate and real-time perception of the environment in which the unmanned vehicle is located, which can meet the safety and real-time requirements of autonomous driving.
  • Figure 8 is a schematic structural diagram of a semantic segmentation device or environment sensing device according to some embodiments of the present disclosure.
  • the semantic segmentation device or environment awareness device 800 includes a memory 810; and a processor 820 coupled to the memory 810.
  • the memory 810 is used to store instructions for executing corresponding embodiments of the semantic segmentation method or the environment awareness method.
  • the processor 820 is configured to execute the semantic segmentation method or the environment awareness method in any embodiments of the present disclosure based on instructions stored in the memory 810 .
  • Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure.
  • Computer system 900 may be embodied in the form of a general purpose computing device.
  • Computer system 900 includes memory 910, a processor 920, and a bus 930 that connects various system components.
  • Memory 910 may include, for example, system memory, non-volatile storage media, and the like.
  • System memory stores, for example, operating systems, applications, boot loaders, and other programs.
  • System memory may include volatile storage media such as random access memory (RAM) and/or cache memory.
  • the non-volatile storage medium stores, for example, instructions for executing corresponding embodiments of at least one of the semantic segmentation method or the environment awareness method.
  • Non-volatile storage media includes but is not limited to disk storage, optical storage, flash memory, etc.
  • The processor 920 may be implemented as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or discrete hardware components such as discrete gates or transistors.
  • each module such as the first generation module and the first segmentation module, can be implemented by a central processing unit (CPU) running instructions in a memory to perform corresponding steps, or by a dedicated circuit that performs corresponding steps.
  • Bus 930 may use any of a variety of bus structures.
  • bus structures include, but are not limited to, Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, and Peripheral Component Interconnect (PCI) bus.
  • the interfaces 940, 950, 960, the memory 910 and the processor 920 of the computer system 900 may be connected through a bus 930.
  • the input and output interface 940 can provide a connection interface for input and output devices such as a monitor, mouse, and keyboard.
  • the network interface 950 provides a connection interface for various networked devices.
  • the storage interface 960 provides a connection interface for external storage devices such as floppy disks, USB disks, and SD cards.
  • Figure 10 is a schematic structural diagram of an unmanned vehicle according to some embodiments of the present disclosure
  • Figure 11 is a perspective view of an unmanned vehicle according to some embodiments of the present disclosure.
  • the unmanned vehicle provided by the embodiment of the present disclosure will be described below with reference to FIG. 10 and FIG. 11 .
  • The unmanned vehicle includes four parts: a chassis module 1010, an autonomous driving module 1020, a cargo box module 1030, and a remote monitoring and streaming module 1040.
  • the chassis module 1010 mainly includes a battery, a power management device, a chassis controller, a motor driver, and a power motor.
  • the battery provides power for the entire unmanned vehicle system
  • the power management device converts the battery output into different voltage levels that can be used by each functional module, and controls power on and off.
  • the chassis controller receives motion instructions from the autonomous driving module and controls the steering, forward, backward, braking, etc. of the unmanned vehicle.
  • the autonomous driving module 1020 includes a core processing unit (Orin or Xavier module), traffic light recognition camera, front, rear, left and right surround cameras, multi-line lidar, positioning module (such as Beidou, GPS, etc.), and inertial navigation unit.
  • the camera and the autonomous driving module can communicate.
  • GMSL link communication can be used.
  • In some embodiments, the automatic driving module 1020 includes the semantic segmentation device or environment sensing device in the above embodiments.
  • The remote monitoring streaming module 1030 is composed of a front surveillance camera, a rear surveillance camera, a left surveillance camera, a right surveillance camera, and a streaming module. This module transmits the video data collected by the surveillance cameras to the backend server for checking by backend operators.
  • the wireless communication module communicates with the backend server through the antenna, allowing the backend operator to remotely control the unmanned vehicle.
  • the cargo box module 1040 is the cargo carrying device of the unmanned vehicle.
  • the cargo box module 1040 is also provided with a display interaction module.
  • the display interaction module is used for the unmanned vehicle to interact with the user.
  • the user can perform operations such as picking up, depositing, and purchasing goods through the display interaction module.
  • the type of cargo box can be changed according to actual needs.
  • a cargo box can include multiple sub-boxes of different sizes, and the sub-boxes can be used to load goods for distribution.
  • the cargo box can be set up as a transparent box so that users can intuitively see the products for sale.
  • the unmanned vehicle in the embodiment of the present disclosure can improve the real-time performance of point cloud semantic segmentation processing while maintaining the accuracy of point cloud semantic segmentation results, thereby meeting the safety and real-time requirements of autonomous driving.
  • These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable device to produce a machine, such that execution of the instructions by the processor produces a device that implements the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • These computer-readable program instructions may also be stored in a computer-readable memory; these instructions cause a computer to operate in a specific manner, producing an article of manufacture that includes instructions implementing the functions specified in one or more blocks of the flowcharts and/or block diagrams.
  • the disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a semantic segmentation and environment perception method, apparatus, and unmanned vehicle, relating to the field of computer vision technology. The semantic segmentation method includes: generating a two-dimensional image according to point cloud data; performing semantic segmentation processing on the two-dimensional image based on a first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image; generating a point cloud feature map according to the semantic segmentation result corresponding to the two-dimensional image and the point cloud data; and performing semantic segmentation processing on the point cloud feature map based on a second neural network model to obtain semantic label information of the point cloud data. Through the above steps, the real-time performance of point cloud semantic segmentation processing can be improved while the accuracy of the point cloud semantic segmentation results is maintained.

Description

Semantic segmentation and environment perception method, apparatus, and unmanned vehicle
Cross-Reference to Related Application
This application is based on, and claims priority to, CN application No. 202210767911.1 filed on July 1, 2022, the disclosure of which is hereby incorporated into this application in its entirety.
Technical Field
The present disclosure relates to the field of computer vision technology, in particular to the field of unmanned driving, and more particularly to a semantic segmentation method, an environment perception method, an apparatus, and an unmanned vehicle.
Background
At present, unmanned driving equipment is used to automatically transport people or objects from one location to another; it collects environmental information through the sensors on the equipment and completes the transport automatically. Logistics and transportation carried out by unmanned delivery vehicles controlled by unmanned driving technology has greatly improved the convenience of production and daily life and saved labor costs.
Three-dimensional environment perception technology is one of the core methods in the autonomous driving technology system. This perception technology is responsible for identifying pedestrians, vehicles, and other dynamic and static elements around the autonomous vehicle, so as to provide comprehensive environmental information to the downstream control system, which then plans the driving route and avoids static obstacles as well as dynamic pedestrians and vehicles. Within the three-dimensional environment perception technology system, the goal of lidar-based three-dimensional semantic segmentation is to identify the semantic category of each element in the three-dimensional scene point cloud scanned by the lidar; it is a foundational task of the entire three-dimensional environment perception technology system.
Summary
A technical problem to be solved by the present disclosure is to provide a semantic segmentation method, an environment perception method, an apparatus, and an unmanned vehicle.
According to a first aspect of the present disclosure, a semantic segmentation method is provided, including: generating a two-dimensional image according to point cloud data; performing semantic segmentation processing on the two-dimensional image based on a first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image; generating a point cloud feature map according to the semantic segmentation result corresponding to the two-dimensional image and the point cloud data; and performing semantic segmentation processing on the point cloud feature map based on a second neural network model to obtain semantic label information of the point cloud data.
In some embodiments, the two-dimensional image is a depth map, and generating the two-dimensional image according to the point cloud data includes: determining two-dimensional transformed coordinates of the point cloud data in a spherical coordinate system, where the two-dimensional transformed coordinates include a yaw angle and a pitch angle; distributing the point cloud data into multiple grids according to the two-dimensional transformed coordinates; determining the features of each grid according to the point cloud points in that grid; and constructing the depth map from the features of all grids.
In some embodiments, the semantic segmentation result corresponding to the two-dimensional image includes semantic label information corresponding to each point in the two-dimensional image and a feature representation corresponding to each point in the two-dimensional image, and generating the point cloud feature map according to the semantic segmentation result corresponding to the two-dimensional image and the point cloud data includes: determining the point cloud point in the point cloud data that matches each point in the two-dimensional image; splicing the semantic label information corresponding to each point in the two-dimensional image, the feature representation corresponding to each point in the two-dimensional image, and the coordinates of the matching point cloud point, to obtain a spliced feature representation corresponding to each point in the two-dimensional image; and constructing the point cloud feature map according to the spliced feature representations corresponding to the points in the two-dimensional image.
In some embodiments, performing semantic segmentation processing on the two-dimensional image based on the first neural network model to obtain the semantic segmentation result corresponding to the two-dimensional image includes: performing feature extraction on the two-dimensional image based on an encoder module, and outputting the obtained feature map to a decoder module; and decoding the feature map based on the decoder module to obtain the semantic segmentation result corresponding to the two-dimensional image.
In some embodiments, performing feature extraction on the two-dimensional image based on the encoder module and outputting the obtained feature map to the decoder module includes: performing feature extraction on the two-dimensional image based on a first encoding unit, and outputting the feature map extracted by the first encoding unit to a second encoding unit and to the decoder module; and performing feature extraction on the feature map output by the first encoding unit based on the second encoding unit, and outputting the feature map extracted by the second encoding unit to the decoder module, where the first encoding unit and the second encoding unit have different structures.
In some embodiments, at least one of the first encoding unit and the second encoding unit includes multiple convolution layers, and at least one of the multiple convolution layers uses a strip-shaped convolution kernel.
In some embodiments, the first encoding unit includes first to third convolution layers arranged sequentially from the input side to the output side, where the first and third convolution layers use square convolution kernels and the second convolution layer uses a strip-shaped convolution kernel.
In some embodiments, the decoder module includes multiple decoding units arranged sequentially from the input side to the output side, and a semantic label classification layer, and each decoding unit includes an upsampling layer and multiple convolution layers.
In some embodiments, the second neural network model includes multiple convolution layers and a semantic label classification layer, where the convolution layers use 1*1 convolution kernels.
According to a second aspect of the present disclosure, an environment perception method is provided, including: acquiring point cloud data collected by an unmanned vehicle; determining semantic label information of the point cloud data according to the semantic segmentation method described above; and determining, according to the semantic label information of the point cloud data, environment information of the environment in which the unmanned vehicle is located.
According to a third aspect of the present disclosure, a semantic segmentation apparatus is provided, including: a first generation module configured to generate a two-dimensional image according to point cloud data; a first segmentation module configured to perform semantic segmentation processing on the two-dimensional image based on a first neural network model to obtain a semantic segmentation result corresponding to the two-dimensional image; a second generation module configured to generate a point cloud feature map according to the semantic segmentation result of the two-dimensional image and the point cloud data; and a second segmentation module configured to perform semantic segmentation processing on the point cloud feature map based on a second neural network model to obtain semantic label information of the point cloud data.
In some embodiments, the two-dimensional image is a depth map, and the first generation module is configured to: determine two-dimensional transformed coordinates of the point cloud data in a spherical coordinate system, where the two-dimensional transformed coordinates include a yaw angle and a pitch angle; distribute the point cloud data into multiple grids according to the two-dimensional transformed coordinates; determine the features of each grid according to the point cloud points in that grid; and construct the depth map from the features of all grids.
In some embodiments, the second generation module is configured to: determine the point cloud point in the point cloud data that matches each point in the two-dimensional image; splice the semantic label information corresponding to each point in the two-dimensional image, the feature representation corresponding to each point in the two-dimensional image, and the coordinates of the matching point cloud point, to obtain a spliced feature representation corresponding to each point in the two-dimensional image; and construct the point cloud feature map according to the spliced feature representations corresponding to the points in the two-dimensional image.
According to a fourth aspect of the present disclosure, another semantic segmentation apparatus is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the semantic segmentation method described above based on instructions stored in the memory.
According to a fifth aspect of the present disclosure, an environment perception device is provided, including: an acquisition module configured to acquire point cloud data collected by an unmanned vehicle; the semantic segmentation apparatus described above; and a determination module configured to determine, according to the semantic label information of the point cloud data, environment information of the environment in which the unmanned vehicle is located.
According to a sixth aspect of the present disclosure, an environment perception device is provided, including: a memory; and a processor coupled to the memory, the processor being configured to execute the environment perception method described above based on instructions stored in the memory.
According to a seventh aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored; when the instructions are executed by a processor, the semantic segmentation method or the environment perception method described above is implemented.
According to an eighth aspect of the present disclosure, an unmanned vehicle is further provided, including the semantic segmentation apparatus or the environment perception device described above.
According to a ninth aspect of the present disclosure, a computer program is further provided, including instructions that, when executed by a processor, cause the processor to perform the semantic segmentation method or the environment perception method described above.
Other features and advantages of the present disclosure will become clear from the following detailed description of exemplary embodiments of the present disclosure with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings, which constitute a part of the specification, illustrate embodiments of the present disclosure and, together with the specification, serve to explain the principles of the present disclosure.
The present disclosure can be understood more clearly from the following detailed description with reference to the accompanying drawings, in which:
Figure 1 is a schematic flowchart of a semantic segmentation method according to some embodiments of the present disclosure;
Figure 2 is a schematic flowchart of generating a two-dimensional image from point cloud data according to some embodiments of the present disclosure;
Figure 3 is a schematic flowchart of semantic segmentation based on a first neural network model according to some embodiments of the present disclosure;
Figure 4a is a schematic structural diagram of a first neural network model according to some embodiments of the present disclosure;
Figure 4b is a schematic structural diagram of a first encoding unit according to some embodiments of the present disclosure;
Figure 4c is a schematic structural diagram of a second encoding unit according to some embodiments of the present disclosure;
Figure 4d is a schematic structural diagram of a decoding unit according to some embodiments of the present disclosure;
Figure 5 is a schematic flowchart of an environment perception method according to some embodiments of the present disclosure;
Figure 6 is a schematic structural diagram of a semantic segmentation apparatus according to some embodiments of the present disclosure;
Figure 7 is a schematic structural diagram of an environment perception device according to some embodiments of the present disclosure;
Figure 8 is a schematic structural diagram of a semantic segmentation apparatus or environment perception device according to some embodiments of the present disclosure;
Figure 9 is a schematic structural diagram of a computer system according to some embodiments of the present disclosure;
Figure 10 is a schematic structural diagram of an unmanned vehicle according to some embodiments of the present disclosure;
Figure 11 is a schematic three-dimensional structural diagram of an unmanned vehicle according to some embodiments of the present disclosure.
Detailed Description
Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions, and the numerical values set forth in these embodiments do not limit the scope of the present disclosure.
At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the drawings are not drawn according to actual proportional relationships.
The following description of at least one exemplary embodiment is merely illustrative and in no way serves as any limitation on the present disclosure or its application or use.
Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but where appropriate, such techniques, methods, and devices should be regarded as part of the granted specification.
In all examples shown and discussed here, any specific values should be interpreted as merely illustrative rather than limiting. Therefore, other examples of the exemplary embodiments may have different values.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item has been defined in one drawing, it does not need to be further discussed in subsequent drawings.
To make the objectives, technical solutions, and advantages of the present disclosure clearer, the present disclosure is further described in detail below in conjunction with specific embodiments and with reference to the accompanying drawings.
In three-dimensional point cloud semantic segmentation methods in the related art, in order to meet the real-time requirements of engineering applications, the three-dimensional lidar point cloud is usually converted into a depth map (range view) based on the spherical projection principle; after a two-dimensional convolutional neural network performs semantic segmentation on the depth map, the obtained semantic label information is projected back onto the original three-dimensional lidar point cloud.
The main problem with the above method is that the process of projecting three-dimensional point cloud data into a depth map is usually accompanied by information loss. For example, when several three-dimensional point cloud points are projected onto the same pixel of the depth map, the distinction between those points is lost, which in turn reduces the accuracy of the segmentation results.
To solve this problem, the related art often raises the resolution of the depth map so as to minimize the number of points projected onto the same pixel and improve the distinguishability of the points after projection. However, this increases the computational resource overhead and negatively affects the real-time performance of the three-dimensional point cloud semantic segmentation method.
图1为根据本公开一些实施例的语义分割方法的流程示意图。如图1所示,本公开实施例的语义分割方法包括:
步骤S110:根据点云数据生成二维图像。
在一些实施例中,二维图像为深度图(Range View),在步骤S110中,根据点云数据生成深度图。
在另一些实施例中,二维图像为鸟瞰图(Bird's Eye View,简称BEV),在步骤S110中,根据点云数据生成鸟瞰图。
在一些实施例中,点云数据为激光雷达采集的点云数据。例如,在无人车行驶过程中,通过安装在无人车上的激光雷达采集点云数据。
步骤S120：基于第一神经网络模型对二维图像进行语义分割处理，以得到二维图像对应的语义分割结果。
在一些实施例中,第一神经网络模型为深度神经网络模型,包括编码器模块和解码器模块。
在一些实施例中,二维图像的语义分割结果包括二维图像中每个点对应的语义标签信息和二维图像中每个点对应的特征表示。示例性地,在自动驾驶场景中,语义标签信息包括行人、车辆、车道、以及人行道等标签信息。示例性地,二维图像中每个点对应的特征表示为与二维图像中每个点对应的、经第一神经网络模型处理后输出的特征向量。
步骤S130:根据二维图像对应的语义分割结果、以及点云数据,生成点云特征图。
在一些实施例中,二维图像的语义分割结果包括二维图像中每个点对应的语义标签信息和二维图像中每个点对应的特征表示。在这些实施例中,步骤S130包括步骤S131至步骤S133。
步骤S131:确定点云数据中与二维图像中的每个点匹配的点云点。
在一些实施例中,在通过步骤S110生成二维图像之后,保存了二维图像中的点与点云数据中的点云点之间的映射关系。在步骤S131中,根据该映射关系,确定点云数据中与二维图像中的每个点匹配的点云点。
步骤S132:将二维图像中每个点对应的语义标签信息和二维图像中每个点对应的特征表示、以及匹配的点云点的坐标进行拼接,以得到二维图像中每个点对应的拼接后的特征表示。
在一些实施例中,按照如下先后顺序依次进行拼接:二维图像中每个点对应的语义标签信息、二维图像中每个点对应的特征表示、匹配的点云点的坐标,从而得到拼接后的特征表示。
例如,假设二维图像中任一点对应的语义标签信息用向量Aij表示,该点对应的特征表示用向量Bij表示,与该点匹配的点云点的坐标用向量Cij表示,拼接后的特征表示用向量Dij表示,且Dij=(Aij,Bij,Cij)。
在另一些实施例中，也可按照其他拼接顺序对二维图像中每个点对应的语义标签信息、二维图像中每个点对应的特征表示、以及匹配的点云点的坐标进行拼接，从而得到拼接后的特征表示。
步骤S133:根据二维图像中每个点对应的拼接后的特征表示,构建点云特征图。
在一些实施例中，将二维图像中所有点对应的拼接后的特征表示构成的整体作为点云特征图。示例性地，点云特征图可表示为矩阵形式。
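作为对步骤S131至步骤S133的补充说明，下面给出一个极简的实现草图（基于Python/NumPy；其中的函数名、变量名与数组形状均为本文为便于说明而假设，并非本公开限定的实现方式）：
```python
import numpy as np

def build_point_feature_map(labels_2d, feats_2d, points_xyz, pixel_to_point):
    """示意：按"语义标签信息、特征表示、点云坐标"的顺序拼接，构建点云特征图。

    labels_2d:      (H, W, C1) 二维图像中每个点对应的语义标签信息
    feats_2d:       (H, W, C2) 二维图像中每个点经第一神经网络模型输出的特征表示
    points_xyz:     (N, 3)     原始点云坐标
    pixel_to_point: (H, W)     二维图像中每个点到点云点索引的映射（生成二维图像时保存）
    """
    h, w = pixel_to_point.shape
    idx = pixel_to_point.reshape(-1)                 # 每个像素匹配的点云点索引
    matched_xyz = points_xyz[idx].reshape(h, w, 3)   # 与二维图像逐点对齐的点云坐标
    # 按 Dij = (Aij, Bij, Cij) 的先后顺序沿特征维拼接
    point_feature_map = np.concatenate([labels_2d, feats_2d, matched_xyz], axis=-1)
    return point_feature_map                         # (H, W, C1 + C2 + 3)
```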
步骤S140:基于第二神经网络模型对点云特征图进行语义分割处理,以得到点云数据的语义标签信息。
在一些实施例中,第二神经网络模型包括多个卷积层、以及语义标签分类层,其中,卷积层采用大小为1*1的卷积核。与第一神经网络模型的结构相比,第二神经网络模型的结构更为轻量化。
在另一些实施例中,第二神经网络模型包括由输入端至输出端依次设置的第一卷积层、批量归一化(Batch Normalization,BN)层、激活函数层、第二卷积层、语义标签分类层。其中,第一卷积层、以及第二卷积层分别采用大小为1*1的卷积核;激活函数层采用Relu函数;语义标签分类层采用Softmax函数。
在再一些实施例中,第二神经网络模型还可采用其他能够实现点云语义分割功能的网络结构。
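针对上文所述"1*1卷积+批量归一化+ReLU+1*1卷积+Softmax"的结构，下面给出第二神经网络模型的一个示意性PyTorch草图（输入通道数、隐藏通道数与类别数均为假设值）：
```python
import torch.nn as nn

class SecondNet(nn.Module):
    """示意性的轻量点云特征图分割网络：1*1卷积 -> BN -> ReLU -> 1*1卷积 -> Softmax。"""

    def __init__(self, in_channels=55, hidden_channels=64, num_classes=20):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, kernel_size=1),   # 第一卷积层，1*1卷积核
            nn.BatchNorm2d(hidden_channels),                          # 批量归一化层
            nn.ReLU(inplace=True),                                    # 激活函数层
            nn.Conv2d(hidden_channels, num_classes, kernel_size=1),   # 第二卷积层，1*1卷积核
        )
        self.classifier = nn.Softmax(dim=1)                           # 语义标签分类层

    def forward(self, point_feature_map):  # 输入形状 (B, C, H, W)
        return self.classifier(self.block(point_feature_map))
```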
在本公开实施例中,通过在基于第一神经网络模型对由点云数据生成的二维图像进行语义分割之后,根据语义分割结果和点云数据生成点云特征图,并基于第二神经网络对点云特征图进行语义分割,能够在保持点云语义分割准确度的同时,降低处理时耗,提高点云语义分割的实时性,满足自动驾驶对安全性和实时性的要求。
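为便于理解步骤S110至步骤S140之间的数据流转，下面给出一个端到端的流程示意（各函数名仅为示例性假设，部分函数的草图见本文其他示意代码，model_1、model_2分别对应第一、第二神经网络模型）：
```python
def segment_point_cloud(points, model_1, model_2):
    """示意：两阶段点云语义分割的整体流程。points为(N, 5)的点云数据。"""
    range_image, pixel_to_point = build_range_image(points)              # 步骤S110：生成二维图像
    labels_2d, feats_2d = model_1(range_image)                           # 步骤S120：二维图像语义分割
    pc_feature_map = build_point_feature_map(labels_2d, feats_2d,
                                             points[:, :3], pixel_to_point)  # 步骤S130：点云特征图
    point_labels = model_2(pc_feature_map)                               # 步骤S140：点云语义标签
    return point_labels
```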
图2为根据本公开一些实施例的根据点云数据生成二维图像的流程示意图。如图2所示,本公开实施例的根据点云数据生成二维图像的流程包括:
步骤S111:确定点云数据在球面坐标系下的二维转换坐标。
在一些实施例中,点云数据为激光雷达采集的点云数据。示例性地,激光雷达采集的点云数据包括点云点的三维坐标(x,y,z)、以及点云点的属性数据,比如点云点的反射强度(remission)、以及该点云点至坐标系原点的距离(depth)。
在步骤S111中，根据点云点的三维坐标计算该点在球面坐标系下的二维转换坐标。其中，二维转换坐标包括偏航角和俯仰角，偏航角和俯仰角的定义如下：
yaw=arctan(y/x)
pitch=arcsin(z/depth)
其中，yaw为偏航角，pitch为俯仰角，x、y、z分别为点云点在X轴、Y轴、Z轴的坐标，depth为点云点至坐标系原点的距离。
步骤S112:根据二维转换坐标,将点云数据分配到多个网格中。
在一些实施例中,根据点云数据对应的二维转换坐标的取值区间设置多个网格。 例如,根据点云数据对应的偏航角的取值区间、以及每个网格的横向尺寸确定横向的网格数量,根据点云数据对应的俯仰角的取值区间、以及每个网格的纵向尺寸确定纵向的网格数量,公式如下:
n=(max_yaw-min_yaw)/weight
m=(max_pitch-min_pitch)/height
其中，n为横向的网格数量，m为纵向的网格数量，max_yaw为点云数据对应的最大偏航角，min_yaw为点云数据对应的最小偏航角，weight为每个网格的横向尺寸，max_pitch为点云数据对应的最大俯仰角，min_pitch为点云数据对应的最小俯仰角，height为每个网格的纵向尺寸。
在设置多个网格之后,根据点云数据的二维转换坐标以及网格的尺寸,将这些点云数据投影到各个网格中。
步骤S113:根据每个网格中的点云点,确定每个网格的特征。
在一些实施例中，取每个网格中至原点距离(depth)最小的点云点的特征作为网格的特征。其中，网格的特征可以通过特征向量的形式进行表示。例如，将至原点距离最小的点云点的三维坐标x、y、z、反射强度remission、以及至原点的距离depth进行拼接，得到网格的特征向量(x,y,z,remission,depth)。
步骤S114:根据所有网格的特征,构建深度图。
在一些实施例中,将所有网格的特征向量构成的特征矩阵作为深度图。
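结合步骤S111至步骤S114，下面给出由点云数据构建深度图的一个简化草图（基于Python/NumPy；此处为简化起见直接把角度归一化到固定分辨率的网格，网格数量的计算方式与各参数取值均为假设，并非本公开限定的实现方式）：
```python
import numpy as np

def build_range_image(points, width=640, height=16):
    """示意：球面投影 -> 网格分配 -> 每个网格取depth最小的点云点的特征。

    points: (N, 5) 数组，各列依次为 x, y, z, remission, depth（depth为点到原点的距离）。
    返回深度图 (height, width, 5) 以及像素到点云点索引的映射。
    """
    x, y, z, remission, depth = points.T
    yaw = np.arctan2(y, x)                                   # 偏航角
    pitch = np.arcsin(z / np.clip(depth, 1e-6, None))        # 俯仰角
    # 根据偏航角、俯仰角的取值区间，把点云点分配到 height*width 个网格中
    u = ((yaw - yaw.min()) / (yaw.max() - yaw.min() + 1e-6) * (width - 1)).astype(int)
    v = ((pitch - pitch.min()) / (pitch.max() - pitch.min() + 1e-6) * (height - 1)).astype(int)

    range_image = np.zeros((height, width, 5), dtype=np.float32)   # 未被点覆盖的网格保持为0（示意处理）
    pixel_to_point = np.zeros((height, width), dtype=np.int64)
    best_depth = np.full((height, width), np.inf)
    for i in range(points.shape[0]):                         # 每个网格保留depth最小的点
        if depth[i] < best_depth[v[i], u[i]]:
            best_depth[v[i], u[i]] = depth[i]
            range_image[v[i], u[i]] = points[i]
            pixel_to_point[v[i], u[i]] = i
    return range_image, pixel_to_point
```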
在本公开实施例中,通过以上步骤能够将点云数据映射成深度图,便于后续基于该深度图进行语义分割。与采用将点云数据映射成BEV图的方式相比,能够进一步提高三维点云数据语义分割的处理速度,满足自动驾驶场景对实时性的要求。
图3为根据本公开一些实施例的基于第一神经网络模型进行语义分割的流程示意图。如图3所示,本公开实施例的基于第一神经网络模型进行语义分割的流程包括:
步骤S131:基于编码器模块对二维图像进行特征提取,并将得到的特征图输出至解码器模块。
第一神经网络模型包括编码器模块和解码器模块。在一些实施例中，第一神经网络模型的网络结构基于ConvNeXt（一种网络模型）的大模型设计思想。为了满足自动驾驶对实时性的要求，对网络结构进行了小型化改进，例如，修剪了内部的网络层数，并根据深度图的特点设计使用了长条形卷积核，这样一来，处理速度比ConvNeXt更快，且能够满足无人车使用的准确度要求。
在一些实施例中,编码器模块包括结构不同的第一编码单元和第二编码单元,其中,第一编码单元主要用于底层特征的提取,第二编码单元主要用于高层特征的提取。在这些实施例中,步骤S131包括步骤S1311和步骤S1312。
步骤S1311:基于第一编码单元对二维图像进行特征提取,并将第一编码单元得到的特征图输出至第二编码单元和解码器模块。
步骤S1312:基于第二编码单元对第一编码单元输出的特征图进行特征提取,并将第二编码单元得到的特征图输出至解码器模块。
在一些实施例中,第一编码单元包括多个卷积层,这些卷积层中的至少一个使用长条形卷积核。
在一些实施例中,第二编码单元包括多个卷积层,这些卷积层中的至少一个使用长条形卷积核。
其中,长条形卷积核是指横向尺寸与纵向尺寸不相等的卷积核。例如,大小为5*9的卷积核为长条形卷积核,1*1的卷积核为方形卷积核。
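在PyTorch中，长条形卷积核可以通过非正方形的kernel_size来表示，例如（以下仅为示意，通道数为假设值）：
```python
import torch.nn as nn

# 5*9 的长条形卷积核，padding取(2, 4)可保持特征图尺寸不变
strip_conv = nn.Conv2d(in_channels=32, out_channels=32,
                       kernel_size=(5, 9), padding=(2, 4))

# 1*1 的方形卷积核
square_conv = nn.Conv2d(in_channels=32, out_channels=32, kernel_size=1)
```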
步骤S132:基于解码器模块对特征图进行解码,以得到二维图像对应的语义分割结果。
在一些实施例中,解码器模块包括从输入侧到输出侧依次设置的多个解码单元、以及语义标签分类层。在这些实施例中,通过多个解码单元对输入的特征图进行逐层解码,然后通过语义标签分类层进行语义标签预测。
在本公开实施例中,通过以上步骤能够基于第一神经网络模型对深度图进行快速地语义分割处理,满足自动驾驶场景对实时性的要求。进一步,通过将深度图的语义分割结果投影至点云数据,并据此构建点云特征图,基于第二神经网络模型对点云特征图进行语义分割,能够在满足实时性要求的同时,提高点云语义分割的准确性。
图4a为根据本公开一些实施例的第一神经网络模型的结构示意图。如图4a所示,本公开实施例的第一神经网络模型包括:编码器模块410、解码器模块420。
编码器模块410,包括第一编码单元411和第二编码单元412。解码器模块420包括多个解码单元421、以及语义标签分类层(图中未示出)。示例性地,语义标签分类层可由单层卷积组成。
在一些实施例中，编码器模块410包括从输入侧至输出侧依次设置的一个第一编码单元411、和三个第二编码单元412；解码器模块包括三个解码单元421。为便于说明，将图4a中由上至下的三个第二编码单元称为编码单元e1、编码单元e2、编码单元e3，将图4a中由上至下的三个解码单元称为解码单元d1、解码单元d2和解码单元d3。
第一编码单元411对输入的深度图进行特征提取,并将得到的特征图1输出至编码单元e1和解码单元d1。编码单元e1对输入的特征图1进行特征提取,并将得到的特征图2输出至编码单元e2和解码单元d2。编码单元e2对输入的特征图2进行特征提取,并将得到的特征图3输出至编码单元e3和解码单元d3。编码单元e3对输入的特征图3进行特征提取,并将得到的特征图4输出至解码单元d3。
解码单元d3对输入的特征图3和特征图4进行解码处理,并将处理得到的特征图5输出至解码单元d2;解码单元d2对输入的特征图2和特征图5进行解码处理,并将处理得到的特征图6输出至解码单元d1;解码单元d1对输入的特征图1和特征图6进行解码处理,并将处理得到的特征图7输出至语义标签分类层,以得到深度图对应的语义标签信息,并将特征图7作为第一神经网络模型最终输出的特征图。
在一些实施例中,输入的深度图的分辨率为W*H、特征维度为5,编码器模块和解码器模块中各个单元输出的特征图的分辨率、以及输出特征维度满足:第一编码单元输出的特征图的分辨率为W*H,输出特征维度为32,第一个第二编码单元(即编码单元e1)输出的特征图的分辨率为W/2*H/2,输出的特征维度为32维,第二个第二编码单元(即编码单元e2)输出的特征图的分辨率为W/4*H/4,输出的特征维度为64,第三个第二编码单元(即编码单元e3)输出的特征图的分辨率为W/8*H/8,输出特征维度为128,第一个解码单元(即解码单元d3)输出的特征图的分辨率为W/4*H/4,输出特征维度为64,第二个解码单元(即解码单元d2)输出的特征图的分辨率为W/2*H/2,输出特征维度为32,第三个解码单元(即解码单元d1)输出的特征图的分辨率为W*H,输出特征维度为32。其中,W、H为大于1的整数,例如,W为16,H为640。
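结合图4a所示的连接关系和上述分辨率设置，下面给出第一神经网络模型前向传播的一个示意（各单元对应的可调用对象均为假设，其内部结构参见图4b至图4d的说明）：
```python
def first_net_forward(range_image, enc0, e1, e2, e3, d3, d2, d1, classifier):
    """示意：编码器-解码器之间的特征图流转与跳连接。"""
    f1 = enc0(range_image)      # 特征图1：分辨率W*H，特征维度32（第一编码单元输出）
    f2 = e1(f1)                 # 特征图2：分辨率W/2*H/2，特征维度32
    f3 = e2(f2)                 # 特征图3：分辨率W/4*H/4，特征维度64
    f4 = e3(f3)                 # 特征图4：分辨率W/8*H/8，特征维度128
    f5 = d3(f4, f3)             # 特征图5：分辨率W/4*H/4，特征维度64
    f6 = d2(f5, f2)             # 特征图6：分辨率W/2*H/2，特征维度32
    f7 = d1(f6, f1)             # 特征图7：分辨率W*H，特征维度32
    labels_2d = classifier(f7)  # 语义标签分类层输出深度图对应的语义标签信息
    return labels_2d, f7        # 语义标签信息 + 第一神经网络模型最终输出的特征图
```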
图4b为根据本公开一些实施例的第一编码单元的结构示意图。如图4b所示,本公开实施例的第一编码单元411包括:第一卷积层4111、第二卷积层4112、第三卷积层4113。
在一些实施例中，第一卷积层4111采用大小为1*1的卷积核，第二卷积层4112采用大小为5*9的卷积核，第三卷积层4113采用大小为1*1的卷积核。其中，第二卷积层采用逐深度(depth-wise)卷积方式。Depth-wise卷积方式中的一个卷积核负责输入图片的一个通道，输入图片的一个通道只被一个卷积核卷积。
在一些实施例中，在第一卷积层与第二卷积层之间还设有Relu（Relu是一种激活函数）层，在第二卷积层和第三卷积层之间还设有批量归一化层，在第三卷积层之后还设有批量归一化层和Relu层。
在本公开实施例中,第一编码单元通过采用基于1×1卷积核的卷积层、5×9深度卷积核的卷积层、以及1×1卷积核的卷积层的组合结构,相比采用单层大卷积核的结构,能够提高处理速度,且不损失感知范围和精度。
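按照图4b及上文对各层顺序的描述，第一编码单元的一个示意性PyTorch草图如下（通道数为假设值）：
```python
import torch.nn as nn

class FirstEncodeUnit(nn.Module):
    """示意：1*1卷积 -> ReLU -> 5*9逐深度卷积 -> BN -> 1*1卷积 -> BN -> ReLU。"""

    def __init__(self, in_channels=5, out_channels=32):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),        # 第一卷积层
            nn.ReLU(inplace=True),
            # groups=out_channels 即逐深度(depth-wise)卷积：一个卷积核只负责一个通道
            nn.Conv2d(out_channels, out_channels, kernel_size=(5, 9),
                      padding=(2, 4), groups=out_channels),             # 第二卷积层
            nn.BatchNorm2d(out_channels),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),       # 第三卷积层
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)
```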
图4c为根据本公开一些实施例的第二编码单元的结构示意图。如图4c所示，本公开实施例的第二编码单元412包括：第一卷积层4121、第二卷积层4122、第三卷积层4123、第四卷积层4124、平均池化层4125。
在一些实施例中,第一卷积层4121采用大小为1*1的卷积核,第二卷积层4122采用大小为3*11的卷积核,第三卷积层4123采用大小为1*1的卷积核,第四卷积层4124采用大小为1*1的卷积核。
在一些实施例中，在第一卷积层4121和第二卷积层4122之间还设有Relu层，在第二卷积层4122和第三卷积层4123之间还设有批量归一化层，在第三卷积层4123和第四卷积层4124之间还设有Relu层，在第四卷积层4124和平均池化层4125之间还设有批量归一化层和Relu层。其中，第二卷积层4122采用逐深度(depth-wise)卷积方式。
在本公开实施例中,通过在第二编码单元的第二卷积层中采用长条形的卷积核,能够与深度图的尺寸相匹配,同时,通过在第四卷积层之后设置一个额外的ReLU激活函数等设计,有助于提升模型效果并降低时耗。
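按照图4c及上文对各层顺序的描述，第二编码单元的一个示意性PyTorch草图如下（通道数为假设值；平均池化的步长取2以使分辨率减半，属于本文的假设）：
```python
import torch.nn as nn

class SecondEncodeUnit(nn.Module):
    """示意：1*1卷积 -> ReLU -> 3*11逐深度卷积 -> BN -> 1*1卷积 -> ReLU
    -> 1*1卷积 -> BN -> ReLU -> 平均池化。"""

    def __init__(self, in_channels=32, out_channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),        # 第一卷积层4121
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=(3, 11),
                      padding=(1, 5), groups=out_channels),             # 第二卷积层4122，逐深度卷积
            nn.BatchNorm2d(out_channels),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),       # 第三卷积层4123
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=1),       # 第四卷积层4124
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=2, stride=2),                      # 平均池化层4125
        )

    def forward(self, x):
        return self.block(x)
```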
图4d为根据本公开一些实施例的解码单元的结构示意图。如图4d所示,本公开实施例的解码单元421包括:上采样层4211、第一卷积层4212、第二卷积层4213、第三卷积层4214、语义标签分类层4215。
在一些实施例中，上采样层4211为像素重组（PixelShuffle）层。PixelShuffle是一种上采样方法，主要功能是将低分辨率的特征图，通过卷积和多通道间的重组得到高分辨率的特征图。PixelShuffle可以对缩小后的特征图进行有效的放大，可以替代插值或解卷积的方法实现上采样。
在一些实施例中，第一卷积层4212采用大小为3*3的卷积核，第二卷积层4213采用大小为1*1的卷积核，第三卷积层4214采用大小为1*1的卷积核。
在本公开实施例中，通过在解码单元中引入PixelShuffle层，能够将低分辨率的特征还原成高分辨率的特征，有助于提高语义分割准确性；另外，通过在解码单元中采用小尺寸的卷积核，有助于降低计算复杂度并减少相同感知范围的冗余，有助于提高处理速度。
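按照图4d及上文的描述，解码单元的一个示意性PyTorch草图如下（与编码器侧特征图的拼接融合方式、以及省略单元内语义标签分类层的做法均为本文假设，实际结构以图4a、图4d为准）：
```python
import torch
import torch.nn as nn

class DecodeUnit(nn.Module):
    """示意：PixelShuffle上采样 -> 与编码器侧特征图拼接 -> 3*3卷积 -> 1*1卷积 -> 1*1卷积。"""

    def __init__(self, in_channels, skip_channels, out_channels):
        super().__init__()
        self.upsample = nn.PixelShuffle(upscale_factor=2)  # 上采样层：通道数变为 in_channels/4，分辨率加倍
        fused = in_channels // 4 + skip_channels
        self.convs = nn.Sequential(
            nn.Conv2d(fused, out_channels, kernel_size=3, padding=1),   # 第一卷积层4212
            nn.Conv2d(out_channels, out_channels, kernel_size=1),       # 第二卷积层4213
            nn.Conv2d(out_channels, out_channels, kernel_size=1),       # 第三卷积层4214
        )

    def forward(self, x, skip):
        x = self.upsample(x)
        x = torch.cat([x, skip], dim=1)   # 与编码器侧输出的特征图拼接（跳连接）
        return self.convs(x)
```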
图5为根据本公开一些实施例的环境感知方法的流程示意图。如图5所示,本公开实施例的环境感知方法包括:
步骤S510:获取无人车采集的点云数据。
在一些实施例中,通过安装在无人车上的激光雷达采集点云数据,并将激光雷达采集的点云数据传输至环境感知设备。
步骤S520:根据点云数据生成二维图像。
在一些实施例中,基于图2所示方式对点云数据进行映射处理,以得到深度图。
步骤S530:基于第一神经网络模型对二维图像进行语义分割处理,以得到二维图像对应的语义分割结果。
步骤S540:根据二维图像对应的语义分割结果、以及点云数据,生成点云特征图。
步骤S550:基于第二神经网络模型对点云特征图进行语义分割处理,以得到点云数据的语义标签信息。
步骤S560:根据点云数据的语义标签信息确定无人车所处的环境信息。
例如，若通过步骤S520至步骤S560确定的点云数据的语义标签信息包括行人类别标签和车辆类别标签，则可确定无人车所处的环境中包括行人和车辆。接下来，还可根据无人车所处的环境信息，对场景中的各类动、静态元素进行更细粒度的识别，比如对车辆类型做进一步细分。
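作为步骤S560的一个极简示意（类别编号与名称均为假设），可以由点云语义标签信息汇总出环境中出现的元素类别：
```python
# 假设的类别编号与名称映射，0表示背景
LABEL_NAMES = {0: "background", 1: "pedestrian", 2: "vehicle", 3: "lane", 4: "sidewalk"}

def summarize_environment(point_labels):
    """point_labels：每个点云点的语义标签编号（一维整型序列），返回环境中出现的类别名称。"""
    present = sorted(set(int(label) for label in point_labels))
    return [LABEL_NAMES.get(label, "unknown") for label in present if label != 0]
```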
在本公开实施例中,通过以上步骤实现了对无人车所处环境进行准确、实时地感知,能够满足自动驾驶对安全性和实时性的要求。
图6为根据本公开一些实施例的语义分割装置的结构示意图。如图6所示,本公开实施例的语义分割装置600包括第一生成模块610、第一分割模块620、第二生成模块630、第二分割模块640。
第一生成模块610,被配置为根据点云数据生成二维图像。
在一些实施例中,二维图像为深度图(Range View),第一生成模块610根据点云数据生成深度图。
在另一些实施例中,二维图像为鸟瞰图(Bird's Eye View,简称BEV),第一生成模块610根据点云数据生成鸟瞰图。
在一些实施例中,第一生成模块610根据点云数据生成深度图包括:第一生成模块610确定点云数据在球面坐标系下的二维转换坐标,其中,二维转换坐标包括偏航角和俯仰角;第一生成模块610根据二维转换坐标,将点云数据分配到多个网格中;第一生成模块610根据每个网格中的点云点,确定每个网格的特征;第一生成模块610根据所有网格的特征,构建深度图。
第一分割模块620,被配置为基于第一神经网络模型对二维图像进行语义分割处理,以得到二维图像对应的语义分割结果。
在一些实施例中,第一神经网络模型包括编码器模块和解码器模块。在这些实施例中,第一分割模块620被配置为:基于编码器模块对二维图像进行特征提取,并将得到的特征图输出至解码器模块;基于解码器模块对特征图进行解码,以得到二维图像对应的语义分割结果。
第二生成模块630,被配置为根据二维图像的语义分割结果、以及点云数据,生成点云特征图。
在一些实施例中,二维图像的语义分割结果包括:二维图像中每个点对应的语义标签信息、以及二维图像中每个点对应的特征表示。在这些实施例中,第二生成模块630根据二维图像的语义分割结果、以及点云数据,生成点云特征图包括:第二生成模块630确定点云数据中与二维图像中的每个点匹配的点云点;第二生成模块630将二维图像中每个点对应的语义标签信息和二维图像中每个点对应的特征表示、以及匹配的点云点的坐标进行拼接,以得到二维图像中每个点对应的拼接后的特征表示;第二生成模块630根据二维图像中每个点对应的拼接后的特征表示,构建点云特征图。
第二分割模块640,被配置为基于第二神经网络模型对点云特征图进行语义分割处理,以得到点云数据的语义标签信息。
在本公开实施例中,通过上述装置能够在保持点云语义分割准确度的同时,降低处理时耗,提高点云语义分割的实时性,满足自动驾驶对安全性和实时性的要求。
图7为根据本公开一些实施例的环境感知设备的结构示意图。如图7所示,本公开实施例的环境感知设备700包括:获取模块710、语义分割装置720、确定模块730。
获取模块710,被配置为获取无人车采集的点云数据。
语义分割装置720，被配置为根据点云数据生成二维图像；基于第一神经网络模型对二维图像进行语义分割处理，以得到二维图像对应的语义分割结果；根据二维图像对应的语义分割结果、以及点云数据，生成点云特征图；基于第二神经网络模型对点云特征图进行语义分割处理，以得到点云数据的语义标签信息。
确定模块730,被配置为根据点云数据的语义标签信息确定无人车所处的环境信息。
在本公开实施例中,通过以上设备实现了对无人车所处环境进行准确、实时地感知,能够满足自动驾驶对安全性和实时性的要求。
图8为根据本公开一些实施例的语义分割装置或环境感知设备的结构示意图。
如图8所示,语义分割装置或环境感知设备800包括存储器810;以及耦接至该存储器810的处理器820。存储器810用于存储执行语义分割方法或环境感知方法对应实施例的指令。处理器820被配置为基于存储在存储器810中的指令,执行本公开中任意一些实施例中的语义分割方法或环境感知方法。
图9为根据本公开一些实施例的计算机系统的结构示意图。
如图9所示,计算机系统900可以通用计算设备的形式表现。计算机系统900包括存储器910、处理器920和连接不同系统组件的总线930。
存储器910例如可以包括系统存储器、非易失性存储介质等。系统存储器例如存储有操作系统、应用程序、引导装载程序(Boot Loader)以及其他程序等。系统存储器可以包括易失性存储介质,例如随机存取存储器(RAM)和/或高速缓存存储器。非易失性存储介质例如存储有执行语义分割方法或环境感知方法中的至少一种的对应实施例的指令。非易失性存储介质包括但不限于磁盘存储器、光学存储器、闪存等。
处理器920可以用通用处理器、数字信号处理器(DSP)、应用专用集成电路(ASIC)、现场可编程门阵列(FPGA)或其它可编程逻辑设备、分立门或晶体管等分立硬件组件方式来实现。相应地,诸如第一生成模块、第一分割模块的每个模块,可以通过中央处理器(CPU)运行存储器中执行相应步骤的指令来实现,也可以通过执行相应步骤的专用电路来实现。
总线930可以使用多种总线结构中的任意总线结构。例如,总线结构包括但不限于工业标准体系结构(ISA)总线、微通道体系结构(MCA)总线、外围组件互连(PCI)总线。
计算机系统900还可以包括输入输出接口940、网络接口950和存储接口960。这些接口940、950、960、以及存储器910和处理器920之间可以通过总线930连接。输入输出接口940可以为显示器、鼠标、键盘等输入输出设备提供连接接口。网络接口950为各种联网设备提供连接接口。存储接口960为软盘、U盘、SD卡等外部存储设备提供连接接口。
图10为根据本公开一些实施例的无人车的结构示意图;图11为根据本公开一些实施例的无人车的立体图。以下结合图10和图11对本公开实施例提供的无人车进行说明。
如附图10所示，无人车包括底盘模块1010、自动驾驶模块1020、远程监控推流模块1030和货箱模块1040四部分。
在一些实施例中,底盘模块1010主要包括电池、电源管理装置、底盘控制器、电机驱动器、动力电机。电池为整个无人车系统提供电源,电源管理装置将电池输出转换为可供各功能模块使用的不同电平电压,并控制上下电。底盘控制器接受自动驾驶模块下发的运动指令,控制无人车转向、前进、后退、刹车等。
在一些实施例中,自动驾驶模块1020包括核心处理单元(Orin或Xavier模组)、红绿灯识别相机、前后左右环视相机、多线激光雷达、定位模块(如北斗、GPS等)、惯性导航单元。相机与自动驾驶模块之间可进行通信,为了提高传输速度、减少线束,可采用GMSL链路通信。
在一些实施例中，自动驾驶模块1020包括上述实施例中的语义分割装置或环境感知设备。
在一些实施例中,远程监控推流模块1030由前监控相机、后监控相机、左监控相机、右监控相机和推流模块构成,该模块将监控相机采集的视频数据传输到后台服务器,供后台操作人员查看。无线通讯模块通过天线与后台服务器进行通信,可实现后台操作人员对无人车的远程控制。
货箱模块1040为无人车的货物承载装置。在一些实施例中,货箱模块1040上还设置有显示交互模块,显示交互模块用于无人车与用户交互,用户可通过显示交互模块进行如取件、寄存、购买货物等操作。货箱的类型可根据实际需求进行更换,如在物流场景中,货箱可以包括多个不同大小的子箱体,子箱体可用于装载货物进行配送。在零售场景中,货箱可以设置成透明箱体,以便于用户直观看到待售产品。
本公开实施例的无人车,能够在保持点云语义分割结果准确性的同时,提高点云语义分割处理的实时性,进而能够满足自动驾驶对安全性和实时性的要求。
这里,参照根据本公开实施例的方法、装置和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个框以及各框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可提供到通用计算机、专用计算机或其他可编程装置的处理器，以产生一个机器，使得通过处理器执行指令产生实现在流程图和/或框图中一个或多个框中指定的功能的装置。
这些计算机可读程序指令也可存储在计算机可读存储器中,这些指令使得计算机以特定方式工作,从而产生一个制造品,包括实现在流程图和/或框图中一个或多个框中指定的功能的指令。
本公开可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。
通过上述实施例中的语义分割、环境感知方法、装置和无人车,能够在保持点云语义分割结果准确性的同时,提高点云语义分割处理的实时性,进而满足自动驾驶对安全性和实时性的要求。
至此,已经详细描述了根据本公开的语义分割、环境感知方法、装置和无人车。为了避免遮蔽本公开的构思,没有描述本领域所公知的一些细节。本领域技术人员根据上面的描述,完全可以明白如何实施这里公开的技术方案。

Claims (19)

  1. 一种语义分割方法,包括:
    根据点云数据生成二维图像;
    基于第一神经网络模型对所述二维图像进行语义分割处理,以得到所述二维图像对应的语义分割结果;
    根据所述二维图像对应的语义分割结果、以及所述点云数据,生成点云特征图;
    基于第二神经网络模型对所述点云特征图进行语义分割处理,以得到所述点云数据的语义标签信息。
  2. 根据权利要求1所述的语义分割方法,其中,所述二维图像为深度图,所述根据点云数据生成二维图像包括:
    确定所述点云数据在球面坐标系下的二维转换坐标,其中,所述二维转换坐标包括偏航角和俯仰角;
    根据所述二维转换坐标,将所述点云数据分配到多个网格中;
    根据每个网格中的点云点,确定每个网格的特征;
    根据所有网格的特征,构建所述深度图。
  3. 根据权利要求1所述的语义分割方法,其中,所述二维图像对应的语义分割结果包括所述二维图像中每个点对应的语义标签信息和所述二维图像中每个点对应的特征表示,所述根据所述二维图像对应的语义分割结果、以及所述点云数据,生成点云特征图包括:
    确定所述点云数据中与所述二维图像中的每个点匹配的点云点;
    将所述二维图像中每个点对应的语义标签信息和所述二维图像中每个点对应的特征表示、以及所述匹配的点云点的坐标进行拼接,以得到所述二维图像中每个点对应的拼接后的特征表示;
    根据所述二维图像中每个点对应的拼接后的特征表示,构建所述点云特征图。
  4. 根据权利要求1所述的语义分割方法,其中,基于第一神经网络模型对所述二维图像进行语义分割处理,以得到所述二维图像对应的语义分割结果包括:
    基于编码器模块对所述二维图像进行特征提取,并将得到的特征图输出至解码器模块;
    基于所述解码器模块对所述特征图进行解码,以得到所述二维图像对应的语义分割结果。
  5. 根据权利要求4所述的语义分割方法,其中,所述基于编码器模块对所述二维图像进行特征提取,并将得到的特征图输出至解码器模块包括:
    基于第一编码单元对所述二维图像进行特征提取,并将第一编码单元得到的特征图输出至第二编码单元和所述解码器模块;
    基于第二编码单元对所述第一编码单元输出的特征图进行特征提取,并将第二编码单元得到的特征图输出至所述解码器模块,其中,所述第一编码单元与所述第二编码单元的结构不同。
  6. 根据权利要求5所述的语义分割方法,其中,所述第一编码单元和所述第二编码单元中的至少一个包括多个卷积层,所述多个卷积层中的至少一个使用长条形卷积核。
  7. 根据权利要求6所述的语义分割方法,其中,所述第一编码单元包括从输入侧到输出侧依次设置的第一至第三卷积层,其中,第一和第三卷积层使用方形的卷积核,第二卷积层使用长条形卷积核。
  8. 根据权利要求4所述的语义分割方法,其中,所述解码器模块包括从输入侧到输出侧依次设置的多个解码单元、以及语义标签分类层,所述解码单元包括上采样层和多个卷积层。
  9. 根据权利要求1所述的语义分割方法,其中,所述第二神经网络模型包括多个卷积层、以及语义标签分类层,其中,所述卷积层采用1*1卷积核。
  10. 一种环境感知方法,包括:
    获取无人车采集的点云数据;
    根据权利要求1-9任一所述的语义分割方法确定所述点云数据的语义标签信息;
    根据所述点云数据的语义标签信息确定所述无人车所处的环境信息。
  11. 一种语义分割装置,包括:
    第一生成模块,被配置为根据点云数据生成二维图像;
    第一分割模块,被配置为基于第一神经网络模型对所述二维图像进行语义分割处理,以得到所述二维图像对应的语义分割结果;
    第二生成模块,被配置为根据所述二维图像的语义分割结果、以及所述点云数据,生成点云特征图;
    第二分割模块,被配置为基于第二神经网络模型对所述点云特征图进行语义分割处理,以得到所述点云数据的语义标签信息。
  12. 根据权利要求11所述的语义分割装置,其中,所述二维图像为深度图,所述第一生成模块被配置为:
    确定所述点云数据在球面坐标系下的二维转换坐标,其中,所述二维转换坐标包括偏航角和俯仰角;
    根据所述二维转换坐标,将所述点云数据分配到多个网格中;
    根据每个网格中的点云点,确定每个网格的特征;
    根据所有网格的特征,构建所述深度图。
  13. 根据权利要求11所述的语义分割装置,其中,所述第二生成模块被配置为:
    确定所述点云数据中与所述二维图像中的每个点匹配的点云点;
    将所述二维图像中每个点对应的语义标签信息和所述二维图像中每个点对应的特征表示、以及所述匹配的点云点的坐标进行拼接,以得到所述二维图像中每个点对应的拼接后的特征表示;
    根据所述二维图像中每个点对应的拼接后的特征表示,构建所述点云特征图。
  14. 一种语义分割装置,包括:
    存储器;以及
    耦接至所述存储器的处理器，所述处理器被配置为基于存储在所述存储器的指令执行如权利要求1至9任一项所述的语义分割方法。
  15. 一种环境感知设备,包括:
    获取模块,被配置为获取无人车采集的点云数据;
    如权利要求11-14任一所述的语义分割装置;
    确定模块,被配置为根据所述点云数据的语义标签信息确定所述无人车所处的环境信息。
  16. 一种环境感知设备,包括:
    存储器;以及
    耦接至所述存储器的处理器,所述处理器被配置为基于存储在所述存储器的指令执行如权利要求10所述的环境感知方法。
  17. 一种计算机可读存储介质,其上存储有计算机程序指令,该指令被处理器执行时实现权利要求1至9任一项所述的语义分割方法,或者权利要求10所述的环境感知方法。
  18. 一种无人车,包括:
    如权利要求11至14任一所述的语义分割装置，或者如权利要求15至16任一所述的环境感知设备。
  19. 一种计算机程序,包括:
    指令,所述指令当由处理器执行时使所述处理器执行根据权利要求1-9中任一项所述的语义分割方法,或权利要求10所述的环境感知方法。
PCT/CN2022/140873 2022-07-01 2022-12-22 语义分割、环境感知方法、装置和无人车 WO2024001093A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210767911.1 2022-07-01
CN202210767911.1A CN115082681A (zh) 2022-07-01 2022-07-01 语义分割、环境感知方法、装置和无人车

Publications (1)

Publication Number Publication Date
WO2024001093A1 true WO2024001093A1 (zh) 2024-01-04

Family

ID=83257440

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/140873 WO2024001093A1 (zh) 2022-07-01 2022-12-22 语义分割、环境感知方法、装置和无人车

Country Status (2)

Country Link
CN (1) CN115082681A (zh)
WO (1) WO2024001093A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115082681A (zh) * 2022-07-01 2022-09-20 北京京东乾石科技有限公司 语义分割、环境感知方法、装置和无人车

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200174132A1 (en) * 2018-11-30 2020-06-04 Ehsan Nezhadarya Method and system for semantic label generation using sparse 3d data
CN111222395A (zh) * 2019-10-21 2020-06-02 杭州飞步科技有限公司 目标检测方法、装置与电子设备
CN111476242A (zh) * 2020-03-31 2020-07-31 北京经纬恒润科技有限公司 一种激光点云语义分割方法及装置
CN113762195A (zh) * 2021-09-16 2021-12-07 复旦大学 一种基于路侧rsu的点云语义分割与理解方法
CN114022858A (zh) * 2021-10-18 2022-02-08 西南大学 一种针对自动驾驶的语义分割方法、系统、电子设备及介质
CN114255238A (zh) * 2021-11-26 2022-03-29 电子科技大学长三角研究院(湖州) 一种融合图像特征的三维点云场景分割方法及系统
CN115082681A (zh) * 2022-07-01 2022-09-20 北京京东乾石科技有限公司 语义分割、环境感知方法、装置和无人车

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117706942A (zh) * 2024-02-05 2024-03-15 四川大学 一种环境感知与自适应驾驶辅助电子控制方法及系统
CN117706942B (zh) * 2024-02-05 2024-04-26 四川大学 一种环境感知与自适应驾驶辅助电子控制方法及系统

Also Published As

Publication number Publication date
CN115082681A (zh) 2022-09-20

Similar Documents

Publication Publication Date Title
US11783230B2 (en) Automatic generation of ground truth data for training or retraining machine learning models
JP7295234B2 (ja) 自律運転マシンのための回帰ベースの線分検出
WO2024001093A1 (zh) 语义分割、环境感知方法、装置和无人车
El Madawi et al. Rgb and lidar fusion based 3d semantic segmentation for autonomous driving
JP2023507695A (ja) 自律運転アプリケーションのための3次元交差点構造予測
US11462112B2 (en) Multi-task perception network with applications to scene understanding and advanced driver-assistance system
WO2022104774A1 (zh) 目标检测方法和装置
CN113994390A (zh) 针对自主驾驶应用的使用曲线拟合的地标检测
US11610078B2 (en) Low variance region detection for improved high variance region detection using machine learning
US20200410225A1 (en) Low variance detection training
WO2024001969A1 (zh) 一种图像处理方法、装置、存储介质及计算机程序产品
US20220269900A1 (en) Low level sensor fusion based on lightweight semantic segmentation of 3d point clouds
WO2022206414A1 (zh) 三维目标检测方法及装置
Shi et al. Grid-centric traffic scenario perception for autonomous driving: A comprehensive review
Berrio et al. Octree map based on sparse point cloud and heuristic probability distribution for labeled images
CN114463736A (zh) 一种基于多模态信息融合的多目标检测方法及装置
JP2022132075A (ja) 自律運転アプリケーションにおけるディープ・ニューラル・ネットワーク知覚のためのグラウンド・トゥルース・データ生成
CN115965970A (zh) 基于隐式的集合预测实现鸟瞰图语义分割的方法及系统
WO2024055551A1 (zh) 点云特征提取网络模型训练、点云特征提取方法、装置和无人车
CN112990049A (zh) 用于车辆自动驾驶的aeb紧急制动方法、装置
CN114332845A (zh) 一种3d目标检测的方法及设备
CN116129234A (zh) 一种基于注意力的4d毫米波雷达与视觉的融合方法
CN114550116A (zh) 一种对象识别方法和装置
Nayak et al. BEV Detection and Localisation using Semantic Segmentation in Autonomous Car Driving Systems
CN111753768A (zh) 表示障碍物形状的方法、装置、电子设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22949169

Country of ref document: EP

Kind code of ref document: A1