WO2023098018A1 - Moving target detection system and method based on multi-frame point clouds - Google Patents

Moving target detection system and method based on multi-frame point clouds

Info

Publication number
WO2023098018A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature tensor
module
tensor
target
Prior art date
Application number
PCT/CN2022/098356
Other languages
English (en)
French (fr)
Inventor
马也驰
华炜
冯权
张顺
Original Assignee
之江实验室
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 之江实验室 filed Critical 之江实验室
Publication of WO2023098018A1 publication Critical patent/WO2023098018A1/zh
Priority to US18/338,328 priority Critical patent/US11900618B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T7/251Analysis of motion using feature-based methods, e.g. the tracking of corners or segments involving models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10028Range image; Depth image; 3D point clouds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • The invention relates to the technical field of three-dimensional target detection, and in particular to a moving target detection system and method based on multi-frame point clouds.
  • Perception technology, especially point cloud-based 3D target detection, is one of the most important tasks in autonomous driving.
  • Point cloud-based 3D target detection techniques with good performance include the papers "Sparsely Embedded Convolutional Detection" and "3D Object Proposal Generation and Detection from Point Cloud", and the patents "A 3D target detection system based on laser point cloud and its detection method" and "A 3D target detection method based on point cloud". This prior art has two problems: first, these methods do not consider continuous-frame point cloud data, so they neither predict target trajectories nor achieve full detection accuracy; second, they depend entirely on the categories inherent in the training dataset, so when a category absent from the training set appears in an actual scene, targets are missed.
  • The present invention not only considers multi-frame point cloud data but also detects moving targets without strongly depending on the target categories of the training set, thereby predicting target trajectories, improving detection accuracy and avoiding missed detections. To this end, the present invention adopts the following technical scheme:
  • A moving target detection system based on multi-frame point clouds includes a voxel feature extraction module, a conversion module and a recognition module; the conversion module includes a cross-modal attention module;
  • the cross-modal attention module matches and fuses two feature tensors according to an attention mechanism, and fuses them through a convolutional neural network to obtain a fused feature tensor;
  • the recognition module performs feature extraction on the final fused feature tensor F_Base_fusion_seq[N-1,N], and outputs target detection information.
  • The earth coordinate system C_Base is a Cartesian orthogonal coordinate system with a fixed preset origin relative to the earth: the forward direction of the first frame of point cloud data is the positive X-axis of C_Base, the rightward direction is the positive Y-axis, and the upward direction is the positive Z-axis.
  • Voxelization constructs a voxel size and voxelization range and takes the mean of all points within each voxel as the voxel feature.
  • The size of the voxelized feature is C*D*W*H, where C is the number of feature channels, D the height, W the width and H the length.
  • The conversion module reshapes the feature tensor F_Base[i] of shape C*D*W*H into a feature tensor F_Base_seq[i] of size C*(D*W*H);
  • X_a and X_b represent the two feature tensors to be fused
  • W_Q, W_K and W_V represent trainable weight matrices;
  • d represents the dimension of Q_a and K_b, and of Q_b and K_a, respectively;
  • Trans() is a matrix transposition operation
  • softmax_col() indicates that the matrix is normalized by column;
  • Conv() represents a convolutional neural network.
  • The recognition module reshapes the final fused feature tensor F_Base_fusion_seq[N-1,N] into a feature tensor F_Base_fusion of shape (C*D)*W*H, then performs feature extraction on the reshaped tensor and outputs the detection information of the target.
  • Through a set of two-dimensional convolutional neural networks, the recognition module obtains the three-dimensional coordinates hm of the target center point in the earth coordinate system C_Base, the movement direction diret of the target center point, the offset of the target center point, the predicted trajectory of the target center point, the length-width-height dim of the target, the height z of the target, and the category information of the target. In the training phase, detection of the three-dimensional coordinates of the target center point uses the Focal_loss loss function; detection of the movement direction of the target center point regresses its sine and cosine values using the L1_loss loss function; regression of the target center point offset uses the L1_Loss loss function; regression of the predicted trajectory of the target center point uses the L1_Loss loss function; regression of the target's length, width and height and of the target height (Z-axis coordinate) uses the SmoothL1_loss loss function; the losses of the different detection branches are assigned different weights; a trained system is finally obtained.
  • a method for detecting a moving object based on a multi-frame point cloud comprising the steps of:
  • S1, construct the voxel feature extraction module, conversion module, recognition module and cross-modal attention module;
  • The present invention judges the motion state of a target through the mechanism of multi-frame fusion, and hence the mode of motion the target adopts, such as two-wheel motion, four-wheel motion, biped motion or quadruped motion.
  • When the training dataset contains only the categories person and car and a truck appears during actual prediction, the truck can still be recognized as four-wheel motion through multi-frame information, without depending on the categories inherent in the training dataset, thereby improving detection accuracy while avoiding missed detections.
  • Fig. 1 is a flow chart of the method of the present invention.
  • Fig. 2 is a schematic diagram of the network structure of sparse 3D_Conv in the present invention.
  • Fig. 3 is a schematic diagram of the network structure of the convolutional neural network in the present invention.
  • Fig. 4 is a schematic diagram of the system structure of the present invention.
  • The embodiment of the present invention uses the kitti dataset, which here includes 5000 segments of continuous-frame point cloud data of length 10, the pose of the lidar point cloud acquisition device, and 3D target labels; 4000 segments are the training set and 1000 segments are the validation set.
  • a moving target detection system and method based on a multi-frame point cloud includes the following steps:
  • The first step: construct the voxel feature extraction module.
  • the feature size after voxelization is C*D*W*H, where C represents the number of feature channels, D represents height, W represents width, and H represents length. The size in this embodiment is 3*40*1600*1408.
  • the shape size is 64*2*200*176
  • the network structure of the sparse 3D_Conv is shown in Figure 2, including a set of sub-convolution modules.
  • Each sub-convolution module consists of a submanifold convolution layer, a normalization layer and a ReLU layer; the specific network parameters are shown in the following table:
  • F_Base[i] is the output of the voxel feature extraction module.
  • the second step is to construct the Crossmodal_Attention module.
  • the input is two feature tensors, X_a and X_b (the selection of the tensor is set in the third step, which is a call to the second step).
  • W_Q, W_K and W_V are trainable weight matrices;
  • d is the dimension of Q_a and K_b
  • Trans() is the matrix transposition function;
  • softmax_col() is a column-wise normalization operation for the matrix.
  • d is the dimension of Q_b and K_a; softmax is used to normalize the vector.
  • Conv() is a convolutional neural network function: Y(X_a,X_b) and Y(X_b,X_a) are concatenated (Concat) and then fused through a 1*1 convolutional neural network, giving the feature tensor Crossmodal_Attention(X_a,X_b) of shape 64*(200*176*2).
  • Step 3 Construct the Transformer module.
  • The input is a continuous-frame feature tensor sequence of length 10, {F_Base[i] | i is the frame index, 0<i<=10}.
  • The fourth step is to construct the recognition module.
  • the input is F_Base_fusion_seq[10-1,10], which is reshaped into a feature tensor F_Base_fusion with a shape size of (C*D)*W*H, which is 128*200*176 in this embodiment.
  • A convolutional neural network extracts features from the feature tensor F_Base_fusion and outputs the detection information of the target, including the three-dimensional coordinates hm of the target center point in the C_Base coordinate system, the length-width-height dim of the target, the movement direction diret of the target center point, the offset of the target center point, the height z of the target, and the category information of the target.
  • The target category information includes two-wheel motion, four-wheel motion, biped motion and quadruped motion.
  • For the kitti data, cars are assigned to four-wheel motion, pedestrians to biped motion, and cyclists to two-wheel motion.
  • the network structure of the convolutional neural network is shown in Figure 3, and the specific network parameters are shown in the following table:
  • Network layer | Kernel size | Stride | Padding | Channels | Input size | Output size
    Conv2d(hm) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 4*200*176
    Conv2d(offset) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 2*200*176
    Conv2d(diret) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 2*200*176
    Conv2d(z) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 2*200*176
    Conv2d(dim) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 3*200*176
  • the fifth step is to connect and train each module.
  • The kitti training set data is used to train the neural network: detection of the target center point uses the Focal_loss loss function; detection of the movement direction of the target center point regresses its sine and cosine values using the L1_loss loss function; regression of the target center point offset uses the L1_Loss loss function; regression of the target's length, width and height and of its Z-axis coordinate uses the SmoothL1_loss loss function.
  • the losses of different detection branches are assigned different weights.
  • The sixth step is the inference test.
  • The moving target detection system and method based on multi-frame point clouds in the embodiment of the present invention is compared with the currently popular pure point cloud 3D target detection schemes PointPillars, PointRCNN and Second.
  • The comparison of the 3D mAP of each category on the validation set is shown in the following table:
  • The present invention greatly improves the detection accuracy of 3D targets, and its overall runtime increases by only 15 ms, which ensures the real-time performance of 3D target detection.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present invention discloses a moving target detection system and method based on multi-frame point clouds. The system includes a voxel feature extraction module, which voxelizes a continuous-frame point cloud sequence and extracts a feature tensor sequence; a conversion module, which matches and fuses the feature tensor sequence through a cross-modal attention module: the first feature tensor is fused with the second feature tensor, the result is fused with the third feature tensor, that result is fused with the fourth feature tensor, and so on, yielding the final fused feature tensor; a cross-modal attention module, which matches two feature tensors according to an attention mechanism and fuses them through a convolutional neural network to obtain a fused feature tensor; and a recognition module, which performs feature extraction on the final fused feature tensor and outputs target detection information. The method includes: S1, constructing each system module; S2, training the model with training set data; S3, making predictions with the trained model.

Description

Moving target detection system and method based on multi-frame point clouds

Technical Field
The present invention relates to the technical field of three-dimensional target detection, and in particular to a moving target detection system and method based on multi-frame point clouds.
Background Art
Autonomous driving is applied ever more widely, and within perception technology, point cloud-based three-dimensional target detection is one of its most important tasks. Current point cloud-based 3D target detection techniques with good performance include the papers "Sparsely Embedded Convolutional Detection" and "3D Object Proposal Generation and Detection from Point Cloud", as well as the patents "A 3D target detection system based on laser point cloud and its detection method" and "A 3D target detection method based on point cloud". However, this prior art has the following problems: first, these methods do not consider continuous-frame point cloud data, so they neither predict target trajectories nor achieve full detection accuracy; second, these methods depend entirely on the categories inherent in the training dataset, so when a category absent from the training set appears in an actual scene, targets are missed.
Summary of the Invention
To overcome the shortcomings of the prior art, the present invention not only considers multi-frame point cloud data but also detects moving targets without strongly depending on the target categories of the training set, thereby predicting target trajectories, improving detection accuracy and avoiding missed detections. To this end, the present invention adopts the following technical scheme:
A moving target detection system based on multi-frame point clouds includes a voxel feature extraction module, a conversion module and a recognition module, where the conversion module includes a cross-modal attention module;
the voxel feature extraction module voxelizes a continuous-frame point cloud sequence {Pointcloud[i], 0<i<=N} and extracts a feature tensor sequence {F_Base[i], 0<i<=N}, where i is the frame index and N the number of frames;
the conversion module takes the feature tensor sequence {F_Base[i], 0<i<=N} and, through the cross-modal attention module, fuses the first feature tensor with the second feature tensor, fuses the result with the third feature tensor, fuses that result with the fourth feature tensor, and so on, yielding the final fused feature tensor F_Base_fusion_seq[N-1,N];
the cross-modal attention module matches and fuses two feature tensors according to an attention mechanism and fuses them through a convolutional neural network to obtain a fused feature tensor;
the recognition module performs feature extraction on the final fused feature tensor F_Base_fusion_seq[N-1,N] and outputs target detection information.
Further, according to the lidar pose {Pose[i], 0<i<=N} of each frame, the voxel feature extraction module transforms the continuous-frame point cloud sequence {Pointcloud[i], 0<i<=N} into the earth coordinate system C_Base and voxelizes the transformed sequence {Pointcloud_Base[i], 0<i<=N}. The earth coordinate system C_Base is a Cartesian orthogonal coordinate system with a fixed preset origin relative to the earth: the forward direction of the first frame of point cloud data is the positive X-axis of C_Base, the rightward direction is the positive Y-axis, and the upward direction is the positive Z-axis.
Further, voxelization constructs a voxel size and voxelization range and takes the mean of all points within each voxel as the voxel feature. The voxel feature size is C*D*W*H, where C is the number of feature channels, D the height, W the width and H the length.
Further, feature tensor extraction applies a three-dimensional sparse convolution module to the voxelized feature sequence {Voxel_Base[i], 0<i<=N} to obtain the feature tensor sequence {F_Base[i], 0<i<=N}. The 3D sparse convolution module consists of a set of sub-convolution modules, each comprising a 3D submanifold convolution layer, a normalization layer and a ReLU layer.
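For illustration only, one such sub-convolution module could be sketched as follows, assuming the open-source spconv library (its SubMConv3d and SparseSequential APIs); the channel widths here are placeholders, not the patent's values:

```python
import torch
import spconv.pytorch as spconv

def sub_conv_module(c_in: int, c_out: int):
    # One sub-convolution module: 3D submanifold convolution + normalization + ReLU,
    # mirroring the structure described above (channel sizes are illustrative).
    return spconv.SparseSequential(
        spconv.SubMConv3d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        torch.nn.BatchNorm1d(c_out),  # applied to the sparse tensor's features
        torch.nn.ReLU(),
    )
```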
Further, the conversion module reshapes the feature tensor F_Base[i] of shape C*D*W*H into a feature tensor F_Base_seq[i] of size C*(D*W*H), where C is the number of feature channels, D the height, W the width and H the length, and then matches and fuses the reshaped feature tensor sequence {F_Base_seq[i], 0<i<=N}.
Further, given the feature tensor sequence {F_Base_seq[i], 0<i<=N}, where i is the frame index and N the number of frames, the feature tensors in the sequence are matched and fused to obtain the fused feature tensor F_Base_fusion_seq[j,j+1], where j is a frame index, 0<j<=N: when j=1, feature tensors F_Base_seq[j] and F_Base_seq[j+1] are fused; when 1<j<N, the fused feature tensor F_Base_fusion_seq[j-1,j] is iteratively fused with F_Base_seq[j+1]; the final fused feature tensor F_Base_fusion_seq[N-1,N] is output.
Further, the matched fusion of the cross-modal attention module is as follows:
Y(X_a,X_b)=softmax_col(Q_a*Trans(K_b)/sqrt(d))*V_b
Y(X_b,X_a)=softmax_col(Q_b*Trans(K_a)/sqrt(d))*V_a
where Q_a=X_a*W_Q and Q_b=X_b*W_Q are the Queries of the attention mechanism, K_a=X_a*W_K and K_b=X_b*W_K are the Keys, V_a=X_a*W_V and V_b=X_b*W_V are the Values, X_a and X_b are the two feature tensors to be fused, W_Q, W_K and W_V are trainable weight matrices, d is the dimension of Q_a and K_b and of Q_b and K_a respectively, Trans() is the matrix transposition operation, and softmax_col() normalizes a matrix column-wise;
Y(X_a,X_b) and Y(X_b,X_a) are then fused through a convolutional neural network to obtain the fused feature tensor:
Crossmodal Attention(X_a,X_b)=Conv(Y(X_a,X_b),Y(X_b,X_a))
where Conv() denotes a convolutional neural network.
Further, the recognition module reshapes the final fused feature tensor F_Base_fusion_seq[N-1,N] into a feature tensor F_Base_fusion of shape (C*D)*W*H, then performs feature extraction on the reshaped tensor and outputs the detection information of the target.
Further, through a set of two-dimensional convolutional neural networks, the recognition module obtains the three-dimensional coordinates hm of the target center point in the earth coordinate system C_Base, the movement direction diret of the target center point, the offset of the target center point, the predicted trajectory of the target center point, the length-width-height dim of the target, the height z of the target, and the category information of the target. In the training phase, detection of the three-dimensional coordinates of the target center point uses the Focal_loss loss function; detection of the movement direction of the target center point regresses its sine and cosine values using the L1_loss loss function; regression of the target center point offset uses the L1_Loss loss function; regression of the predicted trajectory of the target center point uses the L1_Loss loss function; regression of the target's length, width and height and of the target height (Z-axis coordinate) uses the SmoothL1_loss loss function; the losses of the different detection branches are assigned different weights; a trained system is finally obtained.
A moving target detection method based on multi-frame point clouds includes the following steps:
S1, construct the voxel feature extraction module, conversion module, recognition module and cross-modal attention module;
S2, train the model with training set data;
S3, make predictions with the trained model.
The advantages and beneficial effects of the present invention are as follows:
Through the mechanism of multi-frame fusion, the present invention judges the motion state of a target and hence the mode of motion the target adopts, such as two-wheel motion, four-wheel motion, biped motion or quadruped motion. When the training dataset contains only the two categories person and car, and a truck appears during actual prediction, the truck can still be recognized as four-wheel motion through multi-frame information, without depending on the categories inherent in the training dataset, thereby improving detection accuracy while avoiding missed detections.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of the network structure of the sparse 3D_Conv in the present invention.
Fig. 3 is a schematic diagram of the network structure of the convolutional neural network in the present invention.
Fig. 4 is a schematic diagram of the system structure of the present invention.
Detailed Description of Embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are only intended to illustrate and explain the present invention, not to limit it.
The embodiment of the present invention uses the kitti dataset, which here includes 5000 segments of continuous-frame point cloud data of length 10, the pose of the lidar point cloud acquisition device, and 3D target labels; 4000 segments are the training set and 1000 segments are the validation set.
As shown in Fig. 1, a moving target detection system and method based on multi-frame point clouds includes the following steps:
Step 1: first construct the voxel feature extraction module.
The input is a continuous-frame point cloud sequence of length 10, {Pointcloud[i] | i is the frame index, 0<i<=10}, and the lidar sensor pose of each frame, {Pose[i] | i is the frame index, 0<i<=10}.
Using the per-frame lidar pose, the continuous-frame point cloud sequence of length 10 is transformed into the C_Base coordinate system, giving 10 new point cloud frames {Pointcloud_Base[i] | i is the frame index, 0<i<=10}, where C_Base is a Cartesian orthogonal coordinate system with a fixed preset origin relative to the earth: the forward direction of the first frame of point cloud data is the positive X-axis of C_Base, the rightward direction is the positive Y-axis, and the upward direction is the positive Z-axis.
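As a minimal sketch of this per-frame transformation, assuming each pose Pose[i] is given as a 4x4 homogeneous lidar-to-C_Base matrix (the pose representation is not specified in the text):

```python
import numpy as np

def to_c_base(points: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Transform an (N, 3) lidar point cloud into the C_Base frame.

    points: (N, 3) x/y/z coordinates in the sensor frame.
    pose:   (4, 4) homogeneous lidar-to-C_Base transform (assumed form).
    """
    homogeneous = np.hstack([points, np.ones((points.shape[0], 1))])
    return (homogeneous @ pose.T)[:, :3]

# pointcloud_base = [to_c_base(pc, pose) for pc, pose in zip(clouds, poses)]
```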
The continuous-frame point cloud sequence of length 10, {Pointcloud_Base[i] | i is the frame index, 0<i<=10}, is voxelized to obtain the voxelized features of the 10 frames, {Voxel_Base[i] | i is the frame index, 0<i<=10}. The voxelization ranges on the X, Y and Z axes are [0 m, 70.4 m], [-40 m, 40 m] and [-3 m, 1 m] respectively, the size of each voxel is [0.05 m, 0.05 m, 0.1 m], and each voxel feature is the mean of all points within the voxel. The voxelized feature size is C*D*W*H, where C is the number of feature channels, D the height, W the width and H the length; in this embodiment it is 3*40*1600*1408.
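A minimal mean-feature voxelization sketch under exactly these ranges and voxel sizes (it produces a dense C*D*W*H grid for clarity; a practical pipeline would keep only the non-empty voxels):

```python
import numpy as np

RANGES = np.array([[0.0, 70.4], [-40.0, 40.0], [-3.0, 1.0]])  # X, Y, Z extents (m)
VOXEL = np.array([0.05, 0.05, 0.1])                           # voxel size (m)

def voxelize_mean(points: np.ndarray) -> np.ndarray:
    """Mean x/y/z of the points in each voxel -> C=3 feature channels.

    points: (N, 3) array in C_Base. Returns a dense C*D*W*H grid
    (3*40*1600*1408 here, about 1 GB as float32 -- a sketch, not production code).
    """
    inside = np.all((points >= RANGES[:, 0]) & (points < RANGES[:, 1]), axis=1)
    pts = points[inside]
    idx = ((pts - RANGES[:, 0]) / VOXEL).astype(np.int64)        # X, Y, Z bin indices
    dims = tuple(np.round((RANGES[:, 1] - RANGES[:, 0]) / VOXEL).astype(int))
    flat = np.ravel_multi_index((idx[:, 0], idx[:, 1], idx[:, 2]), dims)
    order = np.argsort(flat)
    uniq, starts, counts = np.unique(flat[order], return_index=True, return_counts=True)
    sums = np.add.reduceat(pts[order], starts, axis=0)           # per-voxel coordinate sums
    grid = np.zeros((*dims, 3), dtype=np.float32)                # axes: (H=X, W=Y, D=Z, C)
    grid.reshape(-1, 3)[uniq] = sums / counts[:, None]           # per-voxel means
    return grid.transpose(3, 2, 1, 0)                            # C*D*W*H = 3*40*1600*1408
```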
The voxelized feature sequence {Voxel_Base[i] | i is the frame index, 0<i<=10} is passed through the sparse 3D_Conv for feature extraction, giving the feature tensor sequence {F_Base[i] | i is the frame index, 0<i<=10} of shape 64*2*200*176. The network structure of the sparse 3D_Conv is shown in Fig. 2 and consists of a set of sub-convolution modules, each composed of a submanifold convolution layer, a normalization layer and a ReLU layer; the specific network parameters are shown in the following table:
[Table: network parameters of the sparse 3D_Conv sub-convolution modules]
F_Base[i] is the output of the voxel feature extraction module.
Step 2: construct the Crossmodal_Attention module.
The input is two feature tensors, X_a and X_b (the choice of tensors is set in Step 3, which calls Step 2).
Crossmodal Attention(X_a,X_b)=Conv(Y(X_a,X_b),Y(X_b,X_a))
Y(X_a,X_b)=softmax_col(Q_a*Trans(K_b)/sqrt(d))*V_b
where Q_a=X_a*W_Q serves as the Query, K_b=X_b*W_K as the Key, and V_b=X_b*W_V as the Value; W_Q, W_K and W_V are trainable weight matrices; d is the dimension of Q_a and K_b; Trans() is the matrix transposition function; softmax_col() normalizes a matrix column-wise.
Y(X_b,X_a)=softmax_col(Q_b*Trans(K_a)/sqrt(d))*V_a
where Q_b=X_b*W_Q serves as the Query, K_a=X_a*W_K as the Key, and V_a=X_a*W_V as the Value; d is the dimension of Q_b and K_a; softmax normalizes a vector.
Conv() is a convolutional neural network function: Y(X_a,X_b) and Y(X_b,X_a) are concatenated (Concat) and then fused through a 1*1 convolutional neural network, giving the feature tensor Crossmodal_Attention(X_a,X_b) of shape 64*(200*176*2).
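A minimal PyTorch sketch of this bidirectional matched fusion (a hypothetical module; projecting along the flattened D*W*H axis is one plausible reading of the formulas, and the full-size W_V of this embodiment would be very large, so the usage example below uses a small length):

```python
import torch
import torch.nn as nn

class CrossmodalAttention(nn.Module):
    def __init__(self, c: int = 64, length: int = 512, d: int = 64):
        super().__init__()
        # Trainable W_Q, W_K, W_V, shared by both attention directions.
        self.w_q = nn.Linear(length, d, bias=False)
        self.w_k = nn.Linear(length, d, bias=False)
        self.w_v = nn.Linear(length, length, bias=False)
        self.fuse = nn.Conv1d(2 * c, c, kernel_size=1)  # 1*1 fusion convolution

    def attend(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # Y(X_a,X_b) = softmax_col(Q_a * Trans(K_b) / sqrt(d)) * V_b
        q_a, k_b, v_b = self.w_q(x_a), self.w_k(x_b), self.w_v(x_b)
        scores = q_a @ k_b.transpose(-1, -2) / (k_b.shape[-1] ** 0.5)
        return torch.softmax(scores, dim=-2) @ v_b  # column-wise softmax

    def forward(self, x_a: torch.Tensor, x_b: torch.Tensor) -> torch.Tensor:
        # x_a, x_b: (batch, C, D*W*H); output: (batch, C, D*W*H)
        y_ab, y_ba = self.attend(x_a, x_b), self.attend(x_b, x_a)
        return self.fuse(torch.cat([y_ab, y_ba], dim=1))

# Usage with a reduced length; the embodiment's tensors would be (1, 64, 2*200*176).
attn = CrossmodalAttention(c=64, length=512)
x_a, x_b = torch.randn(1, 64, 512), torch.randn(1, 64, 512)
fused = attn(x_a, x_b)  # (1, 64, 512)
```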
Step 3: construct the Transformer module.
The input is the continuous-frame feature tensor sequence of length 10, {F_Base[i] | i is the frame index, 0<i<=10}, which is reshaped into the feature sequence {F_Base_seq[i] | i is the frame index, 0<i<=10} of shape 64*(2*200*176).
Crossmodal_Attention is used to match and fuse the feature sequence {F_Base_seq[i] | i is the frame index, 0<i<=10}: when j=1, F_Base_fusion_seq[1,2]=Crossmodal_Attention(F_Base_seq[1],F_Base_seq[2]); when 1<j<10, F_Base_fusion_seq[j,j+1]=Crossmodal_Attention(F_Base_fusion_seq[j-1,j],F_Base_seq[j+1]), where j is the frame index and Crossmodal_Attention is the multi-frame fusion module; the feature tensor F_Base_fusion_seq[10-1,10] is the output of the Transformer module.
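This left-to-right recursion can be sketched as follows (assuming a cross_attn callable such as the CrossmodalAttention module sketched above; Python lists are 0-indexed while the text indexes frames from 1):

```python
def fuse_sequence(f_base_seq, cross_attn):
    """Fold a length-N list of frame tensors left to right:
    ((f1 + f2) + f3) + ... -> F_Base_fusion_seq[N-1, N]."""
    fused = cross_attn(f_base_seq[0], f_base_seq[1])  # frames 1 and 2 in the text
    for f_next in f_base_seq[2:]:                     # frames 3..N
        fused = cross_attn(fused, f_next)
    return fused
```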
Step 4: construct the recognition module.
The input is F_Base_fusion_seq[10-1,10], which is reshaped into the feature tensor F_Base_fusion of shape (C*D)*W*H, in this embodiment 128*200*176. A convolutional neural network performs feature extraction on F_Base_fusion and outputs the detection information of the target, including the three-dimensional coordinates hm of the target center point in the C_Base coordinate system, the length-width-height dim of the target, the movement direction diret of the target center point, the offset of the target center point, the height z of the target, and the category information of the target. The target category information includes two-wheel motion, four-wheel motion, biped motion and quadruped motion; for the kitti data, cars are assigned to four-wheel motion, pedestrians to biped motion, and cyclists to two-wheel motion. The network structure of the convolutional neural network is shown in Fig. 3, and the specific network parameters are shown in the following table:
Network layer | Kernel size | Stride | Padding | Channels | Input size | Output size
Conv2d(hm) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 4*200*176
Conv2d(offset) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 2*200*176
Conv2d(diret) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 2*200*176
Conv2d(z) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 2*200*176
Conv2d(dim) | 3*3 | 1*1*1 | 0*0*0 | 64 | 128*200*176 | 3*200*176
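For illustration, the five branches in this table could be sketched as follows (hypothetical module and branch names; padding=1 is assumed so that the 200*176 map size is preserved, as the table's output sizes imply):

```python
import torch.nn as nn

class RecognitionHead(nn.Module):
    def __init__(self, c_in: int = 128):
        super().__init__()
        # One 3*3 Conv2d branch per output, with channel counts per the table.
        self.heads = nn.ModuleDict({
            "hm": nn.Conv2d(c_in, 4, 3, padding=1),      # center-point heatmap
            "offset": nn.Conv2d(c_in, 2, 3, padding=1),  # center-point offset
            "diret": nn.Conv2d(c_in, 2, 3, padding=1),   # sin/cos of direction
            "z": nn.Conv2d(c_in, 2, 3, padding=1),       # target height
            "dim": nn.Conv2d(c_in, 3, 3, padding=1),     # length/width/height
        })

    def forward(self, f_base_fusion):  # (batch, 128, 200, 176)
        return {name: head(f_base_fusion) for name, head in self.heads.items()}
```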
Step 5: as shown in Fig. 4, connect the modules and train.
The neural network is trained with the kitti training set data: detection of the target center point uses the Focal_loss loss function; detection of the movement direction of the target center point regresses its sine and cosine values using the L1_loss loss function; regression of the target center point offset uses the L1_Loss loss function; regression of the target's length, width and height and of its Z-axis coordinate uses the SmoothL1_loss loss function. The losses of the different detection branches are assigned different weights. Finally, a trained model is obtained.
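A minimal sketch of this weighted multi-branch objective (the branch weights are illustrative placeholders, and torchvision's standard sigmoid focal loss stands in for the Focal_loss named above):

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss  # stand-in for the patent's Focal_loss

WEIGHTS = {"hm": 1.0, "diret": 1.0, "offset": 1.0, "z": 1.0, "dim": 1.0}  # placeholders

def detection_loss(pred: dict, target: dict):
    # Per-branch losses as described above, combined with per-branch weights.
    losses = {
        "hm": sigmoid_focal_loss(pred["hm"], target["hm"], reduction="mean"),
        "diret": F.l1_loss(pred["diret"], target["diret"]),     # sin/cos regression
        "offset": F.l1_loss(pred["offset"], target["offset"]),
        "z": F.smooth_l1_loss(pred["z"], target["z"]),
        "dim": F.smooth_l1_loss(pred["dim"], target["dim"]),
    }
    return sum(WEIGHTS[name] * value for name, value in losses.items())
```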
Step 6: inference test.
Load the trained model and run inference on the neural network with the kitti validation set data.
The moving target detection system and method based on multi-frame point clouds of this embodiment is compared with the currently popular pure point cloud three-dimensional target detection schemes PointPillars, PointRCNN and Second. Under the same training set and model parameter optimization method, the 3D mAP of each category on the validation set is compared in the following table:
Method | Car | Pedestrian | Cyclist
PointPillars | 89.65372 | 72.65376 | 86.88952
PointRCNN | 94.78256 | 73.66579 | 88.63552
Second | 93.37265 | 73.22698 | 88.98336
Ours | 97.34768 | 80.45791 | 92.36704
As the table shows, the present invention achieves a considerable improvement in three-dimensional target detection accuracy over the existing mainstream methods, while its overall runtime increases by only 15 ms, ensuring the real-time performance of three-dimensional target detection.
The above embodiments are only intended to illustrate the technical scheme of the present invention, not to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical schemes recorded in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or replacements do not take the essence of the corresponding technical schemes outside the scope of the technical schemes of the embodiments of the present invention.

Claims (9)

  1. A moving target detection system based on multi-frame point clouds, comprising a voxel feature extraction module, a conversion module and a recognition module, characterized in that the conversion module comprises a cross-modal attention module;
    the voxel feature extraction module voxelizes a continuous-frame point cloud sequence and extracts a feature tensor sequence;
    the conversion module takes the feature tensor sequence and, through the cross-modal attention module, fuses the first feature tensor with the second feature tensor, fuses the result with the third feature tensor, fuses that result with the fourth feature tensor, and so on, yielding the final fused feature tensor; the cross-modal attention module matches and fuses two feature tensors according to an attention mechanism and then fuses them through a convolutional neural network to obtain a fused feature tensor;
    the recognition module performs feature extraction on the final fused feature tensor and outputs the detection information of the target;
    the matched fusion of the cross-modal attention module is as follows:
    Y(X_a,X_b)=softmax_col(Q_a*Trans(K_b)/sqrt(d))*V_b
    Y(X_b,X_a)=softmax_col(Q_b*Trans(K_a)/sqrt(d))*V_a
    where Q_a=X_a*W_Q and Q_b=X_b*W_Q are the Queries of the attention mechanism, K_a=X_a*W_K and K_b=X_b*W_K are the Keys, V_a=X_a*W_V and V_b=X_b*W_V are the Values, X_a and X_b are the two feature tensors to be fused, W_Q, W_K and W_V are trainable weight matrices, d is the dimension of Q_a and K_b and of Q_b and K_a respectively, Trans() is the matrix transposition operation, and softmax_col() normalizes a matrix column-wise;
    Y(X_a,X_b) and Y(X_b,X_a) are then fused through a convolutional neural network to obtain the fused feature tensor:
    Crossmodal Attention(X_a,X_b)=Conv(Y(X_a,X_b),Y(X_b,X_a))
    where Conv() denotes a convolutional neural network.
  2. The moving target detection system based on multi-frame point clouds according to claim 1, characterized in that the voxel feature extraction module, according to the pose corresponding to each frame, transforms the continuous-frame point cloud sequence into the earth coordinate system and voxelizes the transformed sequence; the earth coordinate system is a Cartesian orthogonal coordinate system with a fixed preset origin relative to the earth, with the forward direction of the first frame of point cloud data as the positive X-axis, the rightward direction as the positive Y-axis, and the upward direction as the positive Z-axis.
  3. The moving target detection system based on multi-frame point clouds according to claim 1, characterized in that the voxelization constructs a voxel size and voxelization range and takes the mean of the points within each voxel as the voxel feature.
  4. The moving target detection system based on multi-frame point clouds according to claim 1, characterized in that the feature tensor extraction applies a sparse convolution module to the voxelized features to obtain the feature tensors; the sparse convolution module comprises a set of sub-convolution modules, each comprising a submanifold convolution layer, a normalization layer and a ReLU layer.
  5. The moving target detection system based on multi-frame point clouds according to claim 1, characterized in that the conversion module reshapes a feature tensor of shape C*D*W*H into a feature tensor of size C*(D*W*H), where C is the number of feature channels, D the height, W the width and H the length, and then matches and fuses the reshaped feature tensor sequence.
  6. The moving target detection system based on multi-frame point clouds according to claim 1, characterized in that the feature tensor sequence is {F_Base_seq[i], 0<i<=N}, where i is the frame index and N the number of frames; the feature tensors in the sequence are matched and fused to obtain the fused feature tensor F_Base_fusion_seq[j,j+1], where j is a frame index, 0<j<=N: when j=1, feature tensors F_Base_seq[j] and F_Base_seq[j+1] are fused; when 1<j<N, the fused feature tensor F_Base_fusion_seq[j-1,j] is iteratively fused with F_Base_seq[j+1]; the final fused feature tensor F_Base_fusion_seq[N-1,N] is output.
  7. The moving target detection system based on multi-frame point clouds according to claim 5, characterized in that the recognition module reshapes the final fused feature tensor into a feature tensor of shape (C*D)*W*H, then performs feature extraction on the reshaped tensor and outputs the detection information of the target.
  8. The moving target detection system based on multi-frame point clouds according to claim 1, characterized in that the recognition module, through a set of convolutional neural networks, obtains the coordinates of the target center point, the movement direction of the target center point, the offset of the target center point, the length-width-height of the target, the height of the target and the category information of the target; in the training phase, detection of the coordinates of the target center point uses the Focal_loss loss function; detection of the movement direction of the target center point regresses its sine and cosine values using the L1_loss loss function; regression of the target center point offset uses the L1_Loss loss function; regression of the predicted trajectory of the target center point uses the L1_Loss loss function; regression of the target's length, width and height and of the target height uses the SmoothL1_loss loss function; the losses of the different detection branches are assigned different weights; a trained model is finally obtained.
  9. A target detection method using the moving target detection system based on multi-frame point clouds according to claim 1, characterized by comprising the following steps:
    S1, construct the voxel feature extraction module, conversion module, recognition module and cross-modal attention module;
    S2, train the model with training set data;
    S3, make predictions with the trained model.
PCT/CN2022/098356 2021-12-02 2022-06-13 Moving target detection system and method based on multi-frame point clouds WO2023098018A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/338,328 US11900618B2 (en) 2021-12-02 2023-06-20 System and method for detecting moving target based on multi-frame point cloud

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111456208.0A CN113870318B (zh) 2021-12-02 2021-12-02 Moving target detection system and method based on multi-frame point clouds
CN202111456208.0 2021-12-02

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/338,328 Continuation US11900618B2 (en) 2021-12-02 2023-06-20 System and method for detecting moving target based on multi-frame point cloud

Publications (1)

Publication Number Publication Date
WO2023098018A1 (zh)

Family

ID=78985530

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/098356 WO2023098018A1 (zh) 2021-12-02 2022-06-13 一种基于多帧点云的运动目标检测系统和方法

Country Status (3)

Country Link
US (1) US11900618B2 (zh)
CN (1) CN113870318B (zh)
WO (1) WO2023098018A1 (zh)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870318B (zh) 2021-12-02 2022-03-25 之江实验室 Moving target detection system and method based on multi-frame point clouds
CN114067371B (zh) * 2022-01-18 2022-09-13 之江实验室 Cross-modal pedestrian trajectory generative prediction framework, method and apparatus
CN114322994B (zh) * 2022-03-10 2022-07-01 之江实验室 Multi-point-cloud map fusion method and apparatus based on offline global optimization
CN114494248B (zh) * 2022-04-01 2022-08-05 之江实验室 Three-dimensional target detection system and method based on point clouds and images from different viewpoints


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10970518B1 (en) * 2017-11-14 2021-04-06 Apple Inc. Voxel-based feature learning network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190080210A1 (en) * 2017-09-13 2019-03-14 Hrl Laboratories, Llc Independent component analysis of tensors for sensor data fusion and reconstruction
CN111429514A * 2020-03-11 2020-07-17 浙江大学 Lidar 3D real-time target detection method fusing multi-frame temporal point clouds
CN112731339A * 2021-01-04 2021-04-30 东风汽车股份有限公司 Three-dimensional target detection system based on laser point cloud and detection method thereof
CN113379709A * 2021-06-16 2021-09-10 浙江工业大学 Three-dimensional target detection method based on sparse multi-scale voxel feature fusion
CN113870318A * 2021-12-02 2021-12-31 之江实验室 Moving target detection system and method based on multi-frame point clouds

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI HUA; ZHU MING; WANG BO; WANG JIARONG; SUN DEYAO: "Two-Level Progressive Attention Convolutional Network for Fine-Grained Image Recognition", IEEE ACCESS, IEEE, USA, vol. 8, 2 June 2020 (2020-06-02), USA , pages 104985 - 104995, XP011792972, DOI: 10.1109/ACCESS.2020.2999722 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665019A (zh) * 2023-07-31 2023-08-29 山东交通学院 Multi-axis interactive multi-dimensional attention network for vehicle re-identification
CN116665019B (zh) * 2023-07-31 2023-09-29 山东交通学院 Multi-axis interactive multi-dimensional attention network for vehicle re-identification
CN116664874A (zh) * 2023-08-02 2023-08-29 安徽大学 Single-stage fine-grained lightweight point cloud 3D target detection system and method
CN116664874B (zh) * 2023-08-02 2023-10-20 安徽大学 Single-stage fine-grained lightweight point cloud 3D target detection system and method
CN117014633A (zh) * 2023-10-07 2023-11-07 深圳大学 Cross-modal data compression method, apparatus, device and medium
CN117014633B (zh) * 2023-10-07 2024-04-05 深圳大学 Cross-modal data compression method, apparatus, device and medium
CN117392396A (zh) * 2023-12-08 2024-01-12 安徽蔚来智驾科技有限公司 Cross-modal target state detection method, device, intelligent device and medium
CN117392396B (zh) * 2023-12-08 2024-03-05 安徽蔚来智驾科技有限公司 Cross-modal target state detection method, device, intelligent device and medium

Also Published As

Publication number Publication date
US20230351618A1 (en) 2023-11-02
US11900618B2 (en) 2024-02-13
CN113870318A (zh) 2021-12-31
CN113870318B (zh) 2022-03-25

Similar Documents

Publication Publication Date Title
WO2023098018A1 (zh) Moving target detection system and method based on multi-frame point clouds
Ye et al. Tpcn: Temporal point cloud networks for motion forecasting
Chandio et al. Precise single-stage detector
Rani et al. Object detection and recognition using contour based edge detection and fast R-CNN
CN113536232B (zh) Normal distribution transform method for laser point cloud localization in autonomous driving
Xia et al. Scpnet: Semantic scene completion on point cloud
CN110032952B (zh) Road boundary point detection method based on deep learning
CN112633088B (zh) Power station capacity estimation method based on photovoltaic module recognition in aerial images
He et al. Real-time vehicle detection from short-range aerial image with compressed mobilenet
CN114494248B (zh) Three-dimensional target detection system and method based on point clouds and images from different viewpoints
Song et al. Msfanet: A light weight object detector based on context aggregation and attention mechanism for autonomous mining truck
Reuse et al. About the ambiguity of data augmentation for 3d object detection in autonomous driving
Li et al. RoadFormer: Duplex Transformer for RGB-normal semantic road scene parsing
Li et al. An end-to-end multi-task learning model for drivable road detection via edge refinement and geometric deformation
Song et al. GraphAlign: Enhancing accurate feature alignment by graph matching for multi-modal 3D object detection
Piewak et al. Analyzing the cross-sensor portability of neural network architectures for LiDAR-based semantic labeling
CN114821508A (zh) Road three-dimensional target detection method based on implicit context learning
CN115546594A (zh) Real-time target detection method based on lidar and camera data fusion
Tian et al. Jyolo: Joint point cloud for autonomous driving 3d object detection
CN113887462A (zh) 3D target detection apparatus and method based on multi-frame point cloud data
Lian et al. Study on obstacle detection and recognition method based on stereo vision and convolutional neural network
Li et al. PAT: Point cloud analysis with local filter embedding in transformer
Chung et al. Object Detection Algorithm Based on Improved YOLOv7 for UAV Images
CN113379672B (zh) Cell image segmentation method based on deep learning
Chen et al. Multi-view 3D object detection based on point cloud enhancement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22899838

Country of ref document: EP

Kind code of ref document: A1