CN111695402B - Tools and methods for annotating human poses in 3D point cloud data

Info

Publication number: CN111695402B
Application number: CN202010171054.XA
Authority: CN (China)
Prior art keywords: point cloud, cloud data, points, annotation, processors
Other versions: CN111695402A (application publication)
Other languages: Chinese (zh)
Inventors: S·博通吉克, 丁司昊, A·瓦林
Original and current assignee: Volvo Car Corp
Publication events: publication of application CN111695402A; grant published as CN111695402B
Legal status: Active (granted)

Classifications

    • G06V40/103 Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06F3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G01S17/04 Systems determining the presence of a target
    • G01S17/86 Combinations of lidar systems with systems other than lidar, radar or sonar, e.g. with direction finders
    • G01S17/89 Lidar systems specially adapted for mapping or imaging
    • G01S17/931 Lidar systems specially adapted for anti-collision purposes of land vehicles
    • G01S7/4802 Analysis of echo signal for target characterisation; target signature; target cross-section
    • G01S7/4808 Evaluating distance, position or velocity data
    • G06F18/2413 Classification techniques based on distances to training or reference patterns
    • G06F3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06T15/20 Perspective computation
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V10/764 Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/82 Image or video recognition or understanding using neural networks
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06T2200/24 Indexing scheme involving graphical user interfaces [GUIs]
    • G06T2207/10028 Range image; depth image; 3D point clouds
    • G06T2207/20081 Training; learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20101 Interactive definition of point of interest, landmark or seed
    • G06T2207/20132 Image cropping
    • G06T2207/30196 Human being; person
    • G06T2219/004 Annotating, labelling

Abstract

Methods and apparatus for annotating point cloud data. The apparatus may be configured to: cause point cloud data to be displayed; mark points in the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body; move one or more of the annotation points in response to user input to define a human pose and create annotated point cloud data; and output the annotated point cloud data.

Description

Tools and methods for annotating human poses in 3D point cloud data

Priority

This application claims priority to U.S. Application No. 16/692,901, filed November 22, 2019, and U.S. Provisional Application No. 62/817,400, filed March 12, 2019, the entire contents of each of which are incorporated herein by reference.

Technical field

This application relates to pose estimation and pose annotation in computer-vision-based human figure detection.

Background

Pose estimation is a computer vision technique for detecting human figures in image or video data. In addition to detecting the presence of a human figure, computer vision techniques can also determine the position and orientation (i.e., pose) of the figure's limbs. Pose estimation may be useful in many fields, including autonomous driving. For example, a person's pose may be used to determine the attention and intention of a person (e.g., a pedestrian, a traffic officer, etc.). An autonomous driving application for an automobile may use the person's intention and attention, predicted or inferred from the estimated pose, to determine driving behavior.

Summary

In the examples described below, this application describes techniques and devices for estimating the pose of one or more persons from a point cloud produced by a LiDAR (Light Detection and Ranging) sensor or other similar sensor. In some examples, the estimated pose of the one or more persons may be used to make driving decisions for an autonomous vehicle. However, the techniques of this disclosure are not limited to autonomous driving applications and may be used to estimate human poses for any number of applications in which pose estimation may be useful. By using the output of a LiDAR sensor, as opposed to, for example, a camera sensor, pose estimation can be performed quickly in difficult environments, including low-light environments.

A computing system may be configured to receive point cloud data from a LiDAR sensor or other similar sensor. The computing system may also be configured to convert the point cloud data into a structured data format, such as a frame of voxels (volume pixels). The computing system may then process the voxelized frame using a deep neural network. The deep neural network may be configured with a model that determines whether a person is present. The deep neural network may also perform regression to estimate the pose of each of the one or more detected persons. In some examples, the computing system makes the person determination and the pose estimation sequentially: the computing system first detects a person using the deep neural network, and then estimates the person's pose using the deep neural network. In other examples, the computing system performs the person determination and pose estimation in parallel: for each voxel, the computing system simultaneously determines the presence of a person and the person's corresponding pose. If the deep neural network determines that no person is present in a voxel, the computing system discards the estimated pose.

The deep neural network may be configured to process the voxelized frame using one or more three-dimensional (3D) convolutional layers followed by one or more two-dimensional (2D) convolutional layers. 3D convolutional layers generally provide more accurate person detection and pose estimation, while 2D convolutional layers generally provide faster person detection and pose estimation. By using a combination of 3D and 2D convolutional layers in the deep neural network, person detection and pose estimation can be performed with the desired accuracy while also maintaining a speed useful for autonomous driving applications.

In another example, this disclosure describes techniques for annotating point cloud data. To train a deep neural network to estimate the pose of a person in point cloud data, the deep neural network may be configured and modified by processing a training set of point cloud data. The training set of point cloud data has previously been labeled (e.g., by manual labeling) with the accurate positions and poses of the persons within the point cloud. This prior labeling of poses in point cloud data may be referred to as annotation. Techniques exist for annotating human poses in two-dimensional images. Annotating point cloud data, however, is quite different. First, point cloud data is three-dimensional. Furthermore, point cloud data is sparse relative to two-dimensional image data.

This disclosure describes methods, devices, and software for annotating point cloud data. A user may annotate a point cloud using the techniques of this disclosure to mark one or more poses found in the point cloud data. The annotated point cloud data may then be used to train a neural network to more accurately identify and label poses in point cloud data in real time.

In one example, this disclosure describes a method comprising: causing point cloud data to be displayed; marking points in the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body; moving one or more of the annotation points in response to user input to define a human pose and create annotated point cloud data; and outputting the annotated point cloud data.

In another example, this disclosure describes a device comprising a memory configured to store point cloud data, and one or more processors in communication with the memory, the one or more processors configured to: cause the point cloud data to be displayed; mark points in the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body; move one or more of the annotation points in response to user input to define a human pose and create annotated point cloud data; and output the annotated point cloud data.

In another example, this disclosure describes a device comprising means for causing point cloud data to be displayed; means for marking points in the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body; means for moving one or more of the annotation points in response to user input to define a human pose and create annotated point cloud data; and means for outputting the annotated point cloud data.

In another example, this disclosure describes a non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to: display point cloud data; mark points in the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body; move one or more of the annotation points in response to user input to define a human pose and create annotated point cloud data; and output the annotated point cloud data.

The details of one or more examples of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description, the drawings, and the claims.

Brief description of the drawings

FIG. 1 is a conceptual diagram showing an example operating environment for the techniques of this disclosure.

FIG. 2 is a block diagram showing an example device configured to perform the techniques of this disclosure.

FIG. 3 is a block diagram showing a processing flow of one example of this disclosure.

FIG. 4 is a conceptual diagram showing a parallel processing flow using a deep neural network according to one example of this disclosure.

FIG. 5 is a conceptual diagram showing a sequential processing flow using a deep neural network according to one example of this disclosure.

FIG. 6 is a conceptual diagram illustrating an example anchor skeleton.

FIG. 7 is a conceptual diagram illustrating an example point cloud including multiple classified skeletons with estimated poses.

FIG. 8 is a flowchart showing example operations of a device configured to perform pose estimation according to one example of this disclosure.

FIG. 9 is a block diagram showing an example device configured to perform point cloud annotation according to the techniques of this disclosure.

FIG. 10 is a conceptual user interface diagram illustrating an input point cloud for annotation.

FIG. 11 is a conceptual user interface diagram illustrating a cropped point cloud for annotation.

FIG. 12 is a conceptual diagram illustrating an example skeleton for annotation.

FIG. 13 is a conceptual user interface diagram illustrating an estimated annotation of a point cloud.

FIG. 14 is a conceptual user interface diagram illustrating an annotated point cloud.

FIG. 15 is a flowchart illustrating example operations of an annotation tool according to one example of this disclosure.

Detailed description

Pose estimation is a computer vision technique for detecting human figures in images or video. In addition to detecting the presence of a human figure, computer vision techniques can also determine the position and orientation (i.e., pose) of the figure's limbs. Pose estimation is useful in many fields, including autonomous driving. For example, a person's pose may be used to determine the attention and intention of a person (e.g., a pedestrian, a traffic officer, etc.) or the needs of a person (e.g., a pedestrian raising a hand to hail a taxi). An autonomous driving application for an automobile may use the person's intention and attention, predicted from the estimated pose, to determine driving behavior.

In some examples, pose estimation is performed on image data received from a camera sensor. Such data has several drawbacks. For example, if the output from the camera sensor does not include depth information, it may be difficult to discern the relative positions of people in the image. Even if the output from the camera sensor does include depth information, performing pose estimation in a dark environment may be difficult or impossible.

This disclosure describes techniques for performing pose estimation using point cloud data, such as the point cloud data produced by a LiDAR sensor. The point cloud output from a LiDAR sensor provides a 3D map of objects near the sensor. Accordingly, depth information is available. In addition, unlike a camera sensor, a LiDAR sensor can generate point clouds in dark environments. The techniques of this disclosure include processing a point cloud from a LiDAR sensor using a deep neural network to detect the presence of a person near the sensor and to estimate the person's pose in order to make autonomous driving decisions.

FIG. 1 is a conceptual diagram showing an example operating environment for the techniques of this disclosure. In one example of this disclosure, automobile 2 may include components configured to perform pose estimation. In this example, automobile 2 may include LiDAR sensor 10, computing system 14, and, optionally, camera 16.

The techniques of this disclosure are described with reference to automotive applications, including autonomous driving applications. However, it should be understood that the techniques of this disclosure for person detection and pose estimation may be used in other contexts.

Automobile 2 may be any type of passenger vehicle. LiDAR sensor 10 may be mounted to automobile 2 using mount 12. In other examples, LiDAR sensor 10 may be mounted to automobile 2 in other configurations, or integrated in or carried by a structure of the automobile, such as a bumper, a side panel, a windshield, or the like. In addition, automobile 2 may be configured to use multiple LiDAR sensors. As will be explained in more detail below, computing system 14 may be configured to receive point cloud data from LiDAR sensor 10 and to determine the positions and poses of persons in the field of view of LiDAR sensor 10.

LiDAR sensor 10 includes a laser configured to emit laser pulses. LiDAR sensor 10 further includes a receiver to receive laser light reflected from objects near LiDAR sensor 10. LiDAR sensor 10 measures distance to an object by illuminating the object with pulsed laser light and measuring the reflected pulses. Differences in the return times and wavelengths of the reflected pulses are used to determine a 3D representation of one or more objects (e.g., persons).
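
By way of illustration only (this sketch is not part of the patent text), the distance measurement described above reduces to the standard time-of-flight relationship: range equals the speed of light times the round-trip pulse time, halved. The function name and values below are assumptions.

```python
# Illustrative time-of-flight range calculation (not from the patent).
SPEED_OF_LIGHT = 299_792_458.0  # meters per second

def range_from_return_time(round_trip_seconds: float) -> float:
    """Distance to a reflecting object, given the round-trip pulse time."""
    return SPEED_OF_LIGHT * round_trip_seconds / 2.0  # out and back, so halve

# A pulse returning after 200 ns corresponds to roughly 30 m.
print(range_from_return_time(200e-9))  # ~29.98
```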

LiDAR sensor 10 may also include a global positioning system (GPS) sensor or similar sensor to determine the exact physical position of the sensor and of the objects sensed from the reflected laser light. LiDAR sensor 10 may also be configured to detect additional information, such as intensity. The intensity of a point in the point cloud may indicate the reflectivity of the object detected by LiDAR sensor 10. Typically, the 3D representation captured by LiDAR sensor 10 is stored in the form of a point cloud. A point cloud is a collection of points that represents a 3D shape or feature. Each point has its own set of X, Y, and Z coordinates and, in some cases, additional attributes (e.g., GPS position and intensity). The point cloud resulting from the LiDAR collection method may be saved and/or transmitted to computing system 14.
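
For concreteness, one plausible in-memory layout for such a point cloud is a structured array with per-point coordinates and intensity. The layout below is an illustrative assumption; the patent does not prescribe a storage format.

```python
import numpy as np

# Hypothetical per-point record: X, Y, Z in meters plus an intensity value.
POINT_DTYPE = np.dtype([
    ("x", np.float32), ("y", np.float32), ("z", np.float32),
    ("intensity", np.float32),
])

# A tiny synthetic point cloud of four returns in sensor coordinates.
points = np.zeros(4, dtype=POINT_DTYPE)
points["x"] = [1.2, 1.3, 5.0, 5.1]
points["y"] = [0.1, 0.2, -2.0, -2.1]
points["z"] = [0.5, 1.4, 0.3, 1.6]
points["intensity"] = [0.7, 0.6, 0.2, 0.3]
```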

Although LiDAR sensors are described in this disclosure, the techniques described herein for pose estimation may be used with the output of any sensor that operates in low light and/or outputs point cloud data. Additional sensor types that may be used with the techniques of this disclosure include, for example, radar, ultrasonic, camera/imaging, and/or sonar sensors.

Computing system 14 may be connected to LiDAR sensor 10 through wired or wireless communication techniques. The computing system may include one or more processors configured to receive the point cloud from LiDAR sensor 10. As will be explained in more detail below, computing system 14 may be configured to perform pose estimation. For example, computing system 14 may be configured to: receive a point cloud from LiDAR sensor 10, the point cloud including a plurality of points representing positions of objects relative to the LiDAR sensor; process the point cloud to produce a voxelized frame including a plurality of voxels; process the voxelized frame using a deep neural network to determine one or more persons relative to the LiDAR sensor and a pose of each of the one or more persons; and output the determined positions of the one or more persons and the determined pose of each of the one or more persons. The techniques of this disclosure are not limited to detection and pose estimation for persons (e.g., pedestrians, cyclists, etc.), but may also be used for pose detection of animals (e.g., dogs, cats, etc.).

Mount 12 may include one or more cameras 16. The use of a mount is merely one example; camera 16 may be positioned at any suitable location on automobile 2. Automobile 2 may further include additional cameras not shown in FIG. 1. Computing system 14 may be connected to camera 16 to receive image data. In one example of this disclosure, computing system 14 may be further configured to perform pose estimation using camera-based techniques. In such an example, computing system 14 may be configured to estimate the pose of one or more persons using both the camera-based techniques and the LiDAR-based techniques described in this disclosure. Computing system 14 may be configured to assign a weight to each pose determined by the camera-based technique and by the LiDAR-based technique, and to determine the final pose of the person based on a weighted average of the determined poses. Computing system 14 may be configured to determine the weights based on the confidence of each technique. For example, the LiDAR-based technique may have higher-confidence accuracy in a low-light environment than the camera-based technique.
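
A confidence-weighted average of the two estimates could look like the following sketch. The function name, array shapes, and the specific weights are illustrative assumptions; the patent does not specify how the weighting is computed.

```python
import numpy as np

def fuse_poses(lidar_pose: np.ndarray, camera_pose: np.ndarray,
               w_lidar: float, w_camera: float) -> np.ndarray:
    """Confidence-weighted average of two (K, 3) keypoint arrays."""
    return (w_lidar * lidar_pose + w_camera * camera_pose) / (w_lidar + w_camera)

# Example: in low light, the LiDAR estimate is weighted more heavily.
lidar_pose = np.random.rand(14, 3)    # 14 keypoints, XYZ each
camera_pose = np.random.rand(14, 3)
final_pose = fuse_poses(lidar_pose, camera_pose, w_lidar=0.8, w_camera=0.2)
```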

FIG. 2 is a block diagram showing an example device configured to perform the techniques of this disclosure. In particular, FIG. 2 shows an example of computing system 14 of FIG. 1 in more detail. Again, in some examples, computing system 14 may be part of automobile 2. In other examples, however, computing system 14 may be a stand-alone system, or may be integrated into other devices for use in other applications that may benefit from pose estimation.

Computing system 14 includes microprocessor 22 in communication with memory 24. In some examples, computing system 14 may include multiple microprocessors. Microprocessor 22 may be implemented as fixed-function processing circuitry, programmable processing circuitry, or a combination thereof. Fixed-function circuitry refers to circuitry that provides particular functionality and is preset in the operations that can be performed. Programmable circuitry refers to circuitry that can be programmed to perform various tasks and provide flexible functionality in the operations that can be performed. For example, programmable circuitry may execute software or firmware that causes the programmable circuitry to operate in a manner defined by the instructions of the software or firmware. Fixed-function circuitry may execute software instructions (e.g., to receive parameters or output parameters), but the types of operations that fixed-function processing circuitry performs are generally immutable. In some examples, one or more of the units may be distinct circuit blocks (fixed-function or programmable), and in some examples the one or more units may be integrated circuits.

In the example of FIG. 2, microprocessor 22 may be configured to execute one or more sets of instructions in LiDAR-based pose estimation module 40 to perform pose estimation in accordance with the techniques of this disclosure. The instructions that define LiDAR-based pose estimation module 40 may be stored in memory 24. In some examples, the instructions that define LiDAR-based pose estimation module 40 may be downloaded to memory 24 over a wired or wireless network.

In some examples, memory 24 may be a temporary memory, meaning that a primary purpose of memory 24 is not long-term storage. Memory 24 may be configured as volatile memory for short-term storage of information, and therefore does not retain stored contents if powered off. Examples of volatile memories include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and other forms of volatile memory known in the art.

Memory 24 may include one or more non-transitory computer-readable storage media. Memory 24 may be configured to store larger amounts of information than is typically stored by volatile memory. Memory 24 may further be configured as non-volatile storage space for long-term storage of information, retaining information after power on/off cycles. Examples of non-volatile memory include magnetic hard disks, optical discs, flash memory, and forms of electrically programmable memory (EPROM) or electrically erasable and programmable memory (EEPROM). Memory 24 may store program instructions (e.g., LiDAR-based pose estimation module 40) and/or information (e.g., point cloud 30 and pose and position of detected person data 32) that, when executed, cause microprocessor 22 to perform the techniques of this disclosure.

The following techniques of this disclosure will be described with reference to microprocessor 22 executing various software modules. However, it should be understood that each of the software modules described herein may also be implemented in dedicated hardware, firmware, software, or any combination of hardware, software, and firmware.

LiDAR-based pose estimation module 40 may include pre-processing unit 42, deep neural network (DNN) 44, and post-processing unit 46. LiDAR-based pose estimation module 40 is configured to receive point cloud 30 from a LiDAR sensor (e.g., LiDAR sensor 10 of FIG. 1). Pre-processing unit 42 is configured to turn the unstructured raw input (i.e., point cloud 30) into structured frames (e.g., matrix data) so that deep neural network 44 can process the input data.

Pre-processing unit 42 may be configured to process point cloud 30 into structured frames in a number of ways. In one example, pre-processing unit 42 may be configured to convert the point cloud into voxels (volume pixels). Pre-processing unit 42 may be configured to perform this voxelization according to a predetermined data structure for the voxels. For example, each voxel may be defined by the dimensions of a three-dimensional (3D) bin (e.g., expressed in X, Y, and Z coordinates) and by the type of data stored for the 3D bin. For example, each 3D bin (i.e., voxel) may include data indicating the number of points from point cloud 30 located in the bin, the positions of the points from point cloud 30 in the bin, and the intensities of those points. Other examples of data that may be stored in a voxel include the mean and variance of the height, width, and length (X, Y, Z coordinates) of the point cloud within, or even near, the voxel; the mean and variance of the intensity/reflectance; and other statistics. In some examples, a voxel may include zero points from point cloud 30, one point from point cloud 30, or multiple points from point cloud 30. Using predetermined bins may be referred to as manual voxelization. In other examples, pre-processing unit 42 may be configured to voxelize point cloud 30 in an adaptive manner, for example by using a neural network that takes the raw point cloud 30 as input and outputs structured (voxelized) frames.
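
A minimal sketch of the manual voxelization described above follows. The grid size, bin dimensions, and the particular statistics stored per bin (point count and mean intensity) are illustrative assumptions; the patent allows many different per-voxel statistics.

```python
import numpy as np

def voxelize(points: np.ndarray, bin_size=(0.2, 0.2, 0.2), grid=(200, 200, 20)):
    """Bin an (N, 4) array of [x, y, z, intensity] rows into a fixed 3D grid.

    Each voxel here stores a point count and the mean intensity of its points;
    a real system might also store mean/variance of position, and so on.
    """
    counts = np.zeros(grid, dtype=np.int32)
    intensity_sum = np.zeros(grid, dtype=np.float32)
    idx = np.floor(points[:, :3] / np.asarray(bin_size)).astype(int)
    inside = np.all((idx >= 0) & (idx < np.asarray(grid)), axis=1)
    for (i, j, k), inten in zip(idx[inside], points[inside, 3]):
        counts[i, j, k] += 1
        intensity_sum[i, j, k] += inten
    mean_intensity = np.divide(intensity_sum, counts,
                               out=np.zeros_like(intensity_sum),
                               where=counts > 0)
    return counts, mean_intensity
```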

Deep neural network 44 receives the voxelized frames from pre-processing unit 42. A deep neural network is a machine learning algorithm. Deep neural network 44 may be configured with multiple processing layers, each layer configured to determine and/or extract features from the input data (in this case, the voxelized frames of point cloud 30). Each successive layer of deep neural network 44 may be configured to use the output from the previous layer as input.

In some examples, deep neural network 44 may be configured as a convolutional deep neural network. A convolutional deep neural network is a type of deep, feed-forward neural network. Each layer of a convolutional deep neural network may be referred to as a convolutional layer. A convolutional layer applies a convolution operation to the input (e.g., the voxels of a voxelized frame) and passes the result to the next layer. Deep neural network 44 may be configured with 3D and 2D convolutional layers. The 3D convolutional layers provide more accurate feature extraction (e.g., more accurate identification of persons and corresponding poses), while the 2D convolutional layers provide faster feature extraction compared to the 3D convolutional layers. Deep neural network 44 may be configured to first process the voxelized frame using one or more 3D convolutional layers, and then to continue processing the voxelized frame using one or more 2D convolutional layers. The 2D convolutional layers may be configured to process data from the voxelized frame only in the X and Y directions (i.e., not in the Z direction). The number of 3D and 2D convolutional layers, as well as the split point between the layers, determines the trade-off between the speed and the accuracy of the pose estimation. By using a combination of 3D and 2D convolutional layers in deep neural network 44, person detection and pose estimation can be performed with the desired accuracy while also maintaining a speed useful for autonomous driving applications.

Deep neural network 44 is configured to analyze the voxelized frame and produce two outputs for each voxel. One output may be referred to as a classification. The classification indicates whether a person is present in the voxel being analyzed. The other output may be referred to as a pose estimate, produced by regression. If a person is present in the voxel, the regression determines the pose (i.e., the keypoints) of that person. As will be explained in more detail below, deep neural network 44 may be configured to perform the classification and regression techniques in a serial or parallel manner.
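
The patent does not disclose layer counts or channel sizes, but the 3D-then-2D arrangement with classification and regression outputs can be sketched as follows. This is a PyTorch sketch under assumed shapes; folding the Z axis into the channel dimension at the 3D-to-2D transition is one common way of realizing "processing only in the X and Y directions", not necessarily the patented design.

```python
import torch
import torch.nn as nn

class PoseNetSketch(nn.Module):
    """Illustrative 3D-then-2D backbone with classification/regression heads."""

    def __init__(self, num_keypoints: int = 14, z_bins: int = 20):
        super().__init__()
        self.conv3d = nn.Sequential(            # operates on X, Y, and Z
            nn.Conv3d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Fold the Z axis into the channel dimension so later layers are 2D.
        self.conv2d = nn.Sequential(            # operates on X and Y only
            nn.Conv2d(16 * z_bins, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.cls_head = nn.Conv2d(64, 1, kernel_size=1)  # person present?
        self.reg_head = nn.Conv2d(64, num_keypoints * 3, kernel_size=1)  # XYZ offsets

    def forward(self, vox):                     # vox: (B, 1, Z, Y, X), Z == z_bins
        f3 = self.conv3d(vox)                   # (B, 16, Z, Y, X)
        b, c, z, y, x = f3.shape
        f2 = self.conv2d(f3.reshape(b, c * z, y, x))
        return self.cls_head(f2), self.reg_head(f2)
```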

Deep neural network 44 may be configured to process each voxel through DNN model 48. DNN model 48 defines the number of 3D and 2D convolutional layers and the functions performed by each layer. DNN model 48 may be trained with a large number of data-label pairs. In a data-label pair, the data is voxelized point cloud data and the label is a possible 3D pose. DNN model 48 is trained by manually annotating (e.g., labeling) point cloud data and then using the labeled data to train deep neural network 44. The output of deep neural network 44 is compared to the expected output for the given labeled data. A technician may then adjust DNN model 48 to find the optimal set of weights for the layers of deep neural network 44, so that, given a pre-annotated point cloud, the desired label is predicted when the point cloud is processed by deep neural network 44. DNN model 48 may be predetermined and may be updated periodically.
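
The patent does not name the training losses. One conventional choice for a detector with classification and regression heads is a binary cross-entropy classification loss over all positions plus a smooth-L1 regression loss over positive positions only, as in the sketch below (which reuses the hypothetical PoseNetSketch from above; the loss choices are assumptions).

```python
import torch
import torch.nn.functional as F

def training_step(model, vox, cls_target, reg_target, optimizer):
    """One optimization step on a batch of voxelized frames.

    cls_target: (B, 1, Y, X) float tensor in {0, 1} - person at this anchor?
    reg_target: (B, K*3, Y, X)                      - keypoint offsets for positives.
    """
    cls_logits, reg_out = model(vox)
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_target)
    pos = cls_target > 0.5                 # regress only where a person is present
    mask = pos.expand_as(reg_out)
    reg_loss = F.smooth_l1_loss(reg_out[mask], reg_target[mask]) if mask.any() else 0.0
    loss = cls_loss + reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```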

Deep neural network 44 may be configured to produce classification and regression results for each anchor position. In one example, the deep neural network may be configured to treat the center of a voxel as the anchor position. For each anchor position, deep neural network 44 may be configured to compare the data stored in the voxel with one or more predetermined anchor skeletons (also called standard or canonical skeletons). An anchor skeleton may be defined by a plurality of keypoints. In one example, the anchor skeleton is defined by fourteen joints and/or keypoints: head, neck, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, left waist, right waist, left knee, right knee, left foot, and right foot. In general, the keypoints may correspond to features or structures of human anatomy (e.g., points on the human body).

During processing by deep neural network 44, an anchor skeleton is activated (i.e., classified as positive for the presence of a person) if the overlap between the bounding box of the anchor skeleton and the bounding box of any ground truth skeleton (i.e., the data present in the voxel) satisfies a threshold condition. For example, if the overlap of the bounding boxes of the anchor skeleton and the voxel is above a certain threshold (e.g., 0.5), the anchor skeleton is activated for that voxel and the presence of a person is detected. The threshold may be a measure of the amount of overlap (e.g., intersection-over-union (IoU)). Deep neural network 44 may perform the classification based on comparisons with one or more different anchor skeletons. Deep neural network 44 may also be implemented to perform regression that encodes the difference between the anchor skeleton and the true skeleton (i.e., the data in the actual voxel). Deep neural network 44 may be configured to encode this difference for each of the plurality of keypoints defined by the anchor skeleton. The differences between the keypoints of the anchor skeleton and the data in the voxel represent the actual pose of the person detected during classification. The deep neural network may then be configured to provide the classification (e.g., the determined positions of the one or more persons) and the determined pose of each of the one or more persons to post-processing unit 46. When multiple persons are detected in the point cloud, multiple anchor skeletons are activated, enabling multi-person pose estimation.
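
The activation test and the offset encoding described above can be sketched as follows. The axis-aligned box representation and the simple additive offset encoding are assumptions for illustration; only the example threshold of 0.5 comes from the text.

```python
import numpy as np

def iou_3d(box_a, box_b):
    """IoU of two axis-aligned 3D boxes, each given as (min_xyz, max_xyz)."""
    lo = np.maximum(box_a[0], box_b[0])
    hi = np.minimum(box_a[1], box_b[1])
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    vol = lambda b: np.prod(b[1] - b[0])
    return inter / (vol(box_a) + vol(box_b) - inter)

# Regression targets as keypoint offsets relative to the anchor skeleton.
def encode(gt_keypoints, anchor_keypoints):
    return gt_keypoints - anchor_keypoints          # (14, 3) offsets

def decode(offsets, anchor_keypoints):
    return anchor_keypoints + offsets               # recover the estimated pose

# An anchor is "activated" when its box overlaps a ground truth box enough.
ACTIVATION_THRESHOLD = 0.5  # example value from the text
```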

Post-processing unit 46 may be configured to convert the output of deep neural network 44 into the final output. For example, post-processing unit 46 may be configured to perform non-maximum suppression on the classifications and estimated poses produced by deep neural network 44 and to produce the final positions and poses of the detected persons. Non-maximum suppression is an edge-thinning technique. In some cases, deep neural network 44 will classify persons and estimate poses for a dense group of voxels in which only one person is actually present. That is, in some cases, the deep neural network will detect overlapping duplicates of the same person. Post-processing unit 46 may use non-maximum suppression techniques to remove the duplicate skeletons. Post-processing unit 46 outputs pose and position of detected person data 32. Pose and position of detected person data 32 may include the position of the person detected by LiDAR-based pose estimation module 40 (e.g., expressed in GPS coordinates) as well as the skeletal pose that defines the person (e.g., the positions of the keypoints). Pose and position of detected person data 32 may be stored in memory 24, sent to autonomous driving application 52, other applications 54, or camera-based pose estimation application 56, or transmitted from computing system 14 to another computing system.
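
A greedy non-maximum suppression pass over the detected skeletons might look like the following sketch. It reuses the hypothetical iou_3d from the earlier sketch, and the detection tuple layout is an assumption.

```python
def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over (score, box, keypoints) tuples.

    Keeps the highest-scoring skeleton in each cluster of overlapping
    detections and drops the duplicates, as described for post-processing
    unit 46. Assumes iou_3d (from the earlier sketch) is in scope.
    """
    detections = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for det in detections:
        if all(iou_3d(det[1], k[1]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```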

In one example, autonomous driving application 52 may be configured to receive pose and position of detected person data 32 and to predict or determine the intention and/or attention, or other behavioral cues, of the identified person in order to make autonomous driving decisions.

In other examples, camera-based pose estimation application 56 may receive pose and position of detected person data 32. Camera-based pose estimation application 56 may be configured to determine the pose of one or more persons using image data produced by camera 16 (FIG. 1). Camera-based pose estimation application 56 may also be configured to assign a weight to each pose determined by the camera-based technique and by the LiDAR-based technique, and to determine the final pose of the person based on a weighted average of the determined poses. Camera-based pose estimation application 56 may be configured to determine the weights based on the confidence of each technique. For example, the LiDAR-based technique may have higher-confidence accuracy in a low-light environment than the camera-based technique.

Other applications 54 represent various other contexts in which pose and position of detected person data 32 may be used. For example, the pose and position output by LiDAR-based pose estimation module 40 may be used in applications including: body language recognition; action understanding (e.g., traffic officers, police officers, emergency services personnel, or other persons signaling or directing traffic); attention and intention detection (e.g., pedestrians waiting to cross or crossing the street); film; animation; gaming; robotics; human-computer interaction; machine learning; virtual reality; alternate reality; surveillance; abnormal behavior detection; and public safety.

FIG. 3 is a block diagram showing a processing flow of one example of this disclosure. As shown in FIG. 3, LiDAR sensor 10 may be configured to capture point cloud 30, which is the raw input to LiDAR-based pose estimation module 40. LiDAR-based pose estimation module 40 processes point cloud 30 with pre-processing unit 42 (voxelization) to produce voxelized frames. Deep neural network 44 then processes the voxelized frames to produce classifications of one or more persons (e.g., the positions of the one or more persons) and one or more poses for the classified one or more persons. A person's pose is defined by the positions of a plurality of keypoints of a skeleton. The output of deep neural network 44 is a preliminary 3D pose. Post-processing unit 46 processes the preliminary 3D poses with a non-maximum suppression algorithm to produce the output 3D poses.

FIG. 4 is a conceptual diagram showing a parallel processing flow using a deep neural network according to one example of this disclosure. As shown in FIG. 4, point cloud 30 is first converted into a voxelized frame including a plurality of voxels. In this example, deep neural network 44 processes each voxel 70 of the voxelized frame. Deep neural network 44 processes voxel 70 using one or more 3D convolutional layers 72. 3D convolutional layer 74 represents the last layer that operates on 3D voxel data. After 3D convolutional layer 74, deep neural network 44 processes voxel 70 with one or more 2D convolutional layers 76. 2D convolutional layers 76 operate on voxel data in only two dimensions (e.g., XY data). 2D convolutional layer 78 represents the last 2D convolutional layer, which outputs both the classification and the pose estimate. In the example of FIG. 4, the layers of deep neural network 44 are configured to classify and estimate the pose for each voxel in parallel. That is, the layers of deep neural network 44 may be configured to classify and estimate poses for more than one voxel at the same time. If deep neural network 44 determines that a voxel is not classified as a person, any estimated pose may be discarded.

FIG. 5 is a conceptual diagram showing a sequential processing flow using a deep neural network according to one example of the present disclosure. In the example of FIG. 5, 3D convolutional layers 72 and 74 and 2D convolutional layers 76 and 78 are configured to classify an input voxel as human or non-human. If 2D convolutional layer 78 does not classify the voxel as a person, processing ends. If 2D convolutional layer 78 does classify the voxel as a person, deep neural network 44 processes the input voxel using 3D convolutional layers 80 and 82 and 2D convolutional layers 84 and 86 to estimate the pose of the classified person. That is, deep neural network 44 may be configured to use separate neural networks for classification and pose estimation. In this example, the classification and pose estimation processes are performed sequentially.

FIG. 6 is a conceptual diagram showing an exemplary skeleton. Skeleton 100 may represent a predetermined anchor skeleton or the pose of a real skeleton estimated using the disclosed techniques described above. In one example of the present disclosure, skeleton 100 may be defined by a plurality of key points and/or joints. In the example of FIG. 6, skeleton 100 includes 14 key points. As shown in FIG. 6, skeleton 100 is defined by a head key point 102, a neck key point 104, a left shoulder key point 108, a right shoulder key point 106, a left elbow key point 112, a right elbow key point 110, a left hand key point 116, a right hand key point 114, a left hip key point 120, a right hip key point 118, a left knee key point 124, a right knee key point 122, a left foot key point 128, and a right foot key point 126. To determine a pose, microprocessor 22 (see FIG. 2) may be configured to determine the position (e.g., the position in 3D space) of each key point of skeleton 100. That is, the positions of the key points of skeleton 100 relative to one another define the pose of skeleton 100, and therefore the pose of the person detected from the point cloud.
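As a concrete illustration, a 14-key-point pose of this kind can be stored as an ordered array of 3D positions; the names and ordering below are a plausible encoding assumed for illustration, not one prescribed by the patent.

```python
import numpy as np

# Ordered names for the 14 key points of FIG. 6 (illustrative ordering).
KEYPOINTS = [
    "head", "neck",
    "right_shoulder", "right_elbow", "right_hand",
    "left_shoulder", "left_elbow", "left_hand",
    "right_hip", "right_knee", "right_foot",
    "left_hip", "left_knee", "left_foot",
]

# A pose is then a (14, 3) array of XYZ positions; the positions of the
# rows relative to one another define the pose of the skeleton.
pose = np.zeros((len(KEYPOINTS), 3))  # one (x, y, z) row per key point
```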

For other applications, more or fewer key points may be used. The more key points used to define skeleton 100, the more unique poses that can be estimated. However, more key points may also result in longer processing times when estimating a pose.

FIG. 7 is a conceptual diagram showing an exemplary point cloud 30 with multiple classified skeletons having estimated poses. As shown in FIG. 7, point cloud 30 is shown with a visualization of three detected skeletons 140, 142, and 144. The skeletons are shown with different poses, which result from different positions of the 14 key points from FIG. 6. Note that skeleton 140 shows an example of a skeleton that has not been processed by the non-maximum suppression algorithm. Rather than showing a single skeleton, skeleton 140 is actually multiple overlapping skeletons. In some examples of the present disclosure, computing system 14 may be further configured to produce a visualization of the detected skeletons, such as the visualization shown in FIG. 7.

FIG. 8 is a flowchart showing exemplary operations of a device configured to perform pose estimation according to one example of the present disclosure. One or more processors, including microprocessor 22 of computing system 14, may be configured to perform the techniques shown in FIG. 8. As described above, in some examples, computing system 14 may be part of automobile 2. In this example, automobile 2 may be configured to use the pose estimates produced by computing system 14 to make autonomous driving decisions. However, the techniques of the present disclosure are not so limited. Any processor or processing circuitry may be configured to perform the techniques of FIG. 8 for pose estimation for any number of applications, including AR/VR, gaming, HCI, surveillance and monitoring, and the like.

In one example of the present disclosure, computing system 14 may include a memory 24 configured to receive a point cloud 30 (see FIG. 2) from LiDAR sensor 10 (see FIG. 1). Computing system 14 may further include one or more processors implemented in circuitry (e.g., microprocessor 22 of FIG. 2) in communication with the memory. Microprocessor 22 may be configured to receive the point cloud from LiDAR sensor 10 (800). The point cloud includes a plurality of points representing the positions of objects relative to LiDAR sensor 10. Microprocessor 22 may be further configured to process the point cloud to produce a voxelized frame comprising a plurality of voxels (802). In one example of the present disclosure, each voxel of the voxelized frame includes a data structure indicating the presence or absence of points from the point cloud in that voxel.

Microprocessor 22 may also be configured to process the voxelized frame using one or more 3D convolutional layers of a deep neural network (804), and to process the voxelized frame using one or more 2D convolutional layers of the deep neural network (806). Microprocessor 22 processes the voxelized frame using the 3D and 2D convolutional layers to determine one or more persons relative to the LiDAR sensor, as well as the pose of each of the one or more persons. Microprocessor 22 may then output the determined positions of the one or more persons and the determined pose of each of the one or more persons (808).

In one example, microprocessor 22 may be configured to determine, for a first voxel of the voxelized frame, whether a person is present, and, based on that determination, activate an anchor skeleton for the first voxel, wherein the data represented in the first voxel is defined as the real skeleton. Microprocessor 22 may be configured to determine the presence of persons and the poses of such persons either sequentially or in parallel. In one example, microprocessor 22 may be configured to determine a difference between the real skeleton and the anchor skeleton in parallel with determining whether a person is present, estimate the pose of the real skeleton based on the difference, and output the pose in the case that the anchor skeleton is activated. In another example, microprocessor 22 may be configured to determine the difference between the real skeleton and the anchor skeleton in the case that the anchor skeleton is activated, estimate the pose of the real skeleton based on the difference, and output the pose.

The anchor skeleton is defined by a plurality of key points. To determine the difference between the real skeleton and the anchor skeleton, microprocessor 22 may be configured to determine the difference between the real skeleton and the anchor skeleton at each key point.
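A minimal sketch of this per-key-point difference, assuming both skeletons are stored as (num_keypoints, 3) arrays of key-point positions:

```python
import numpy as np

def skeleton_offsets(real: np.ndarray, anchor: np.ndarray) -> np.ndarray:
    """Per-key-point (dx, dy, dz) difference between a real skeleton
    and an anchor skeleton, both shaped (num_keypoints, 3)."""
    return real - anchor

def apply_offsets(anchor: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Recover the estimated pose of the real skeleton by displacing
    the anchor skeleton by the (regressed) offsets."""
    return anchor + offsets
```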

In another example of the present disclosure, microprocessor 22 may also be configured to process the determined one or more persons relative to the LiDAR sensor, and the pose of each of the one or more persons, using a non-maximum suppression technique in order to remove duplicates of the one or more persons.
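A minimal sketch of one way such suppression could work, assuming each detection carries a confidence score and two skeletons count as duplicates when their mean key-point distance falls below a threshold; the scoring and distance measure are assumptions, not the patent's exact algorithm.

```python
import numpy as np

def skeleton_nms(skeletons, scores, dist_thresh=0.5):
    """Greedy non-maximum suppression over skeletons (each an
    (num_keypoints, 3) array).

    Keeps the highest-scoring skeleton and suppresses any remaining
    skeleton whose mean key-point distance to a kept one is below
    dist_thresh, removing duplicate detections of the same person.
    """
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        duplicate = any(
            np.linalg.norm(skeletons[i] - skeletons[j], axis=1).mean()
            < dist_thresh
            for j in kept
        )
        if not duplicate:
            kept.append(i)
    return [skeletons[i] for i in kept]
```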

In other examples of the present disclosure, the pose estimation techniques of this disclosure may be extended over a series of frames to detect a sequence of poses that may constitute a certain action (e.g., waving, walking, running, etc.). Such action recognition may use temporal information (e.g., LiDAR point cloud data from multiple time steps) to perform the recognition. Accordingly, in one example, DNN 44 may be configured to process multiple voxelized frames to determine at least one person relative to the LiDAR sensor, as well as a sequence of poses for the at least one person. DNN 44 may then determine an action for the at least one person from the sequence of poses. Two example ways of implementing action recognition are described below.

In a first example, DNN 44 may be configured to stack and/or concatenate a fixed number of outputs for each frame of point cloud 30 into a single data sample. DNN 44 may feed the single data sample into a classifier to classify the action category. At frame index t, DNN 44 may be configured to use a time window of size w to produce a single sample, which is the combination of the w outputs from frames t-w+1 through t. DNN 44 may be configured to include a classifier that is a (multi-class) deep neural network or any type of machine learning model, such as a support vector machine (SVM).
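A minimal sketch of this windowing step, assuming each per-frame output is a flattened pose vector and the combined sample is a simple concatenation of the w most recent outputs:

```python
import numpy as np

def window_sample(per_frame_outputs, t: int, w: int) -> np.ndarray:
    """Concatenate the outputs from frames t-w+1 .. t into one sample
    that can be fed to an action classifier (e.g., a DNN or an SVM)."""
    window = per_frame_outputs[t - w + 1 : t + 1]
    return np.concatenate([np.ravel(o) for o in window])
```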

In another example, DNN 44 may be configured to use the per-frame outputs in a sequential manner. For example, DNN 44 may be configured to feed each frame's output to a recurrent neural network and determine a prediction of the action at each frame or after a certain number of frames.
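A minimal sketch of this sequential alternative, assuming the per-frame pose output is flattened and fed to a recurrent layer whose hidden state is classified into an action at each frame; the GRU, layer sizes, and number of action classes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionRNN(nn.Module):
    """Consumes one flattened per-frame pose output at a time and
    predicts an action class at each frame."""

    def __init__(self, pose_dim=14 * 3, hidden=64, num_actions=6):
        super().__init__()
        self.gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_actions)

    def forward(self, pose_seq: torch.Tensor) -> torch.Tensor:
        # pose_seq: (batch, frames, pose_dim)
        h, _ = self.gru(pose_seq)
        return self.cls(h)  # (batch, frames, num_actions): one prediction per frame
```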

Accordingly, instead of, or in addition to, per-frame pose estimation (e.g., skeleton output), DNN 44 may be configured to stitch the outputs into batches or feed the outputs sequentially to obtain higher-level action recognition. Some possible categories of actions to recognize include standing, walking, running, bicycling, skateboarding, waving, and so on.

Other examples and combinations of the techniques of the present disclosure are described below.

Embodiment 1. A method for pose estimation, the method comprising: receiving a point cloud from a LiDAR sensor, the point cloud comprising a plurality of points representing positions of objects relative to the LiDAR sensor; processing the point cloud to produce a voxelized frame comprising a plurality of voxels; processing the voxelized frame using a deep neural network to determine one or more persons relative to the LiDAR sensor and a pose of the one or more persons; and outputting the determined positions of the one or more persons and the determined pose of each of the one or more persons.

Embodiment 2. The method of Embodiment 1, wherein each voxel of the voxelized frame includes a data structure indicating the presence or absence of points from the point cloud in that voxel.

Embodiment 3. The method of Embodiment 1 or 2, wherein processing the voxelized frame using a deep neural network comprises processing the voxelized frame using a convolutional deep neural network, wherein the convolutional deep neural network comprises one or more three-dimensional convolutional layers followed by one or more two-dimensional convolutional layers.

Embodiment 4. The method of any combination of Embodiments 1 to 3, wherein processing the voxelized frame using a deep neural network comprises: determining, for a first voxel of the voxelized frame, whether a person is present; and based on the determination, activating an anchor skeleton for the first voxel, wherein the data represented in the first voxel is defined as a real skeleton (ground truth skeleton).

Embodiment 5. The method of any combination of Embodiments 1 to 4, further comprising: determining a difference between the real skeleton and the anchor skeleton in parallel with determining whether the person is present; estimating a pose of the real skeleton based on the difference; and outputting the pose in the case that the anchor skeleton is activated.

Embodiment 6. The method of any combination of Embodiments 1 to 4, further comprising: in the case that the anchor skeleton is activated, determining a difference between the real skeleton and the anchor skeleton; estimating a pose of the real skeleton based on the difference; and outputting the pose.

Embodiment 7. The method of any combination of Embodiments 1 to 6, wherein the anchor skeleton is defined by a plurality of key points.

Embodiment 8. The method of any combination of Embodiments 1 to 7, wherein determining the difference between the real skeleton and the anchor skeleton comprises determining the difference between the real skeleton and the anchor skeleton at each key point.

Embodiment 9. The method of any combination of Embodiments 1 to 8, further comprising: processing the determined one or more persons relative to the LiDAR sensor, and the pose of each of the one or more persons, using a non-maximum suppression technique to remove duplicates of the one or more persons.

Embodiment 10. A device configured to perform pose estimation, the device comprising: a memory configured to receive a point cloud from a LiDAR sensor; and one or more processors implemented in circuitry, the one or more processors in communication with the memory and configured to: receive the point cloud from the LiDAR sensor, the point cloud comprising a plurality of points representing positions of objects relative to the LiDAR sensor; process the point cloud to produce a voxelized frame comprising a plurality of voxels; process the voxelized frame using a deep neural network to determine one or more persons relative to the LiDAR sensor and a pose of each of the one or more persons; and output the determined positions of the one or more persons and the determined pose of each of the one or more persons.

Embodiment 11. The device of Embodiment 10, wherein each voxel of the voxelized frame includes a data structure indicating the presence or absence of points from the point cloud in that voxel.

Embodiment 12. The device of any combination of Embodiments 10 to 11, wherein, to process the voxelized frame using a deep neural network, the one or more processors are further configured to process the voxelized frame using a convolutional deep neural network, wherein the convolutional deep neural network comprises one or more three-dimensional convolutional layers followed by one or more two-dimensional convolutional layers.

Embodiment 13. The device of any combination of Embodiments 10 to 12, wherein, to process the voxelized frame using a deep neural network, the one or more processors are further configured to: determine, for a first voxel of the voxelized frame, whether a person is present; and based on the determination, activate an anchor skeleton for the first voxel, wherein the data represented in the first voxel is defined as the real skeleton.

Embodiment 14. The device of any combination of Embodiments 10 to 13, wherein the one or more processors are further configured to: determine a difference between the real skeleton and the anchor skeleton in parallel with determining whether the person is present; estimate a pose of the real skeleton based on the difference; and output the pose in the case that the anchor skeleton is activated.

Embodiment 15. The device of any combination of Embodiments 10 to 13, wherein the one or more processors are further configured to: determine a difference between the real skeleton and the anchor skeleton in the case that the anchor skeleton is activated; estimate a pose of the real skeleton based on the difference; and output the pose.

Embodiment 16. The device of any combination of Embodiments 10 to 15, wherein the anchor skeleton is defined by a plurality of key points.

Embodiment 17. The device of any combination of Embodiments 10 to 16, wherein, to determine the difference between the real skeleton and the anchor skeleton, the one or more processors are further configured to determine the difference between the real skeleton and the anchor skeleton at each of the key points.

Embodiment 18. The device of any combination of Embodiments 10 to 17, wherein the one or more processors are further configured to process the determined one or more persons relative to the LiDAR sensor, and the pose of each of the one or more persons, using a non-maximum suppression technique to remove duplicates of the one or more persons.

Embodiment 19. The device of any combination of Embodiments 10 to 18, wherein the device comprises an automobile having the LiDAR sensor.

Embodiment 20. A device configured to perform pose estimation, the device comprising: means for receiving a point cloud from a LiDAR sensor, the point cloud comprising a plurality of points representing positions of objects relative to the LiDAR sensor; means for processing the point cloud to produce a voxelized frame comprising a plurality of voxels; means for processing the voxelized frame using a deep neural network to determine one or more persons relative to the LiDAR sensor and a pose of each of the one or more persons; and means for outputting the determined positions of the one or more persons and the determined pose of each of the one or more persons.

Embodiment 21. A device configured to perform pose estimation, the device comprising means for performing any combination of the steps of the methods of Embodiments 1 to 9.

Embodiment 22. A non-transitory computer-readable medium configured to store instructions that, when executed, cause one or more processors to: receive a point cloud from a LiDAR sensor, the point cloud comprising a plurality of points representing positions of objects relative to the LiDAR sensor; process the point cloud to produce a voxelized frame comprising a plurality of voxels; process the voxelized frame using a deep neural network to determine one or more persons relative to the LiDAR sensor and a pose of each of the one or more persons; and output the determined positions of the one or more persons and the determined pose of each of the one or more persons.

In another example, this disclosure describes techniques for annotating point cloud data. To train a deep neural network to estimate the poses of persons in point cloud data, the deep neural network may be configured and refined by processing a training set of point cloud data. The training set of point cloud data is labeled in advance (e.g., by manual labeling) with the exact positions and poses of the persons within the point clouds. This advance labeling of poses in point cloud data may be referred to as annotation. Techniques exist for annotating the poses of persons in two-dimensional images. However, annotating point cloud data is considerably different. First, point cloud data is three-dimensional. In addition, point cloud data is sparse relative to two-dimensional image data.

This disclosure describes a method, a device, and a software tool for annotating point cloud data. A user may use the techniques of this disclosure to annotate a point cloud in order to label one or more poses found in the point cloud data. The annotated point cloud data may then be used to train a neural network to more accurately identify and label poses in point cloud data in real time.

FIG. 9 is a block diagram showing an exemplary computing system 214 configured to perform the point cloud annotation techniques of this disclosure. Computing system 214 may be implemented with, for example, a desktop computer, a notebook computer, a tablet computer, or any type of computing device. Computing system 214 includes a processor 222, a memory 224, and one or more input devices 218. In some examples, computing system 214 may include multiple processors 222.

Processor 222 may be implemented as fixed-function processing circuitry, programmable processing circuitry, or a combination thereof. Fixed-function circuitry refers to circuitry that provides particular functionality and is preset in the operations that it can perform. Programmable circuitry refers to circuitry that can be programmed to perform various tasks and provide flexible functionality in the operations that it can perform. For example, programmable circuitry may execute software or firmware that causes the programmable circuitry to operate in a manner defined by the instructions of the software or firmware. Fixed-function circuitry may execute software instructions (e.g., to receive parameters or output parameters), but the types of operations that fixed-function processing circuitry performs are generally immutable. In some examples, one or more of the units may be distinct circuit blocks (fixed-function or programmable), and in some examples, one or more of the units may be integrated circuits.

Computing system 214 may be configured to generate information for display on a display device 216. For example, as will be described in more detail below, computing system 214 may generate a graphical user interface (GUI) 250 and cause GUI 250 to be displayed on display device 216. A user may interact with GUI 250, for example through input device 218, to annotate point cloud data. In some examples, display device 216 is part of computing system 214, while in other examples display device 216 may be separate from computing system 214. Display device 216 may be implemented with any electronic display, such as a liquid crystal display (LCD), a light-emitting diode (LED) display, or an organic light-emitting diode (OLED) display.

Input device 218 is a device configured to receive user commands or other information. In some examples, input device 218 is part of computing system 214, while in other examples input device 218 may be separate from computing system 214. Input device 218 may include any device for inputting information or commands, such as a keyboard, a microphone, a cursor control device, or a touchscreen.

In accordance with the techniques of this disclosure, processor 222 may be configured to execute a set of instructions of an annotation tool 242 to perform point cloud annotation according to the techniques of this disclosure. The instructions defining annotation tool 242 may be stored in memory 224. In some examples, the instructions defining annotation tool 242 may be downloaded to memory 224 over a wired or wireless network.

In some examples, memory 224 may be a temporary memory, meaning that a primary purpose of memory 224 is not long-term storage. Memory 224 may be configured as volatile memory for short-term storage of information, and therefore does not retain stored contents if powered off. Examples of volatile memories include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), and other forms of volatile memory known in the art.

Memory 224 may include one or more non-transitory computer-readable storage media. Memory 224 may be configured to store larger amounts of information than is typically stored by volatile memory. Memory 224 may further be configured as non-volatile memory space for long-term storage of information, retaining information after power on/off cycles. Examples of non-volatile memories include magnetic hard disks, optical discs, flash memory, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable memory (EEPROM). Memory 224 may store program instructions (e.g., annotation tool 242) that, when executed, cause processor 222 to perform the techniques of this disclosure, and/or information (e.g., point cloud training data 230 and annotated point clouds 232).

The following techniques of this disclosure will be described with reference to processor 222 executing various software modules. However, it should be understood that each of the software modules described herein may also be implemented in dedicated hardware, firmware, software, or any combination of hardware, software, and firmware.

In accordance with the techniques of this disclosure, processor 222, by executing annotation tool 242, may be configured to load point cloud training data 230. Point cloud training data 230 may include one or more frames of point cloud data, e.g., point cloud data captured by a LiDAR sensor or any other type of sensor that captures point cloud data. Annotation tool 242 may be configured to generate a GUI 250 that includes one or more frames of point cloud training data 230, and to cause display device 216 to display GUI 250. A user may then interact with GUI 250 to annotate a frame of point cloud training data 230 with a human pose in order to define the pose of a person that may be present in the point cloud data. After annotation, annotation tool 242 may be configured to output the annotated point cloud 232. The annotated point cloud 232 may be stored in memory 224 and/or downloaded external to computing system 214. The annotated point cloud 232 may then be used to train a deep neural network configured to estimate human poses from point cloud data (e.g., deep neural network 44 of FIG. 2). Deep neural network 244 of FIG. 9 represents a version of deep neural network 44 before and/or during the training process. FIGS. 10-14 show various examples of GUI 250 that may be generated by annotation tool 242. The operation of annotation tool 242 with respect to the generated GUI 250 and user input is discussed in more detail below.

FIG. 10 is a conceptual user interface diagram showing an input point cloud for annotation. Annotation tool 242 may cause display device 216 to display GUI 250 including a point cloud frame 252. For example, annotation tool 242 may generate GUI 250 in response to user input to load a point cloud frame 252 of point cloud training data 230 (see FIG. 9). Annotation tool 242 may be configured to load and display point cloud frame 252 in response to user interaction with import data controls 254. As shown in FIG. 10, point cloud frame 252 may include points 264, which may correspond to one or more persons captured in the point cloud data.

Import data controls 254 include a load match button, a load cloud(s) button, and a load skel(eton) button. When the user selects the load match button, annotation tool 242 opens a file browser dialog and reads a user-selected file, containing matched pairs of images and point clouds, from a corresponding directory of matched image/point-cloud pairs. In this example, annotation tool 242 is able to load files from a directory that includes both images and matching point clouds. In this way, paired images and point clouds can be viewed at the same time. In other examples, annotation tool 242 may load only one of an image or a point cloud.

When the user selects the load cloud button, annotation tool 242 opens a file browser dialog and populates a list with the point cloud files available in a user-selected directory. The user may then select a point cloud to view from the populated drop-down list.

When the user selects the load skeleton button, annotation tool 242 opens a file browser dialog and loads any previously annotated skeletons from a user-selected file. The user may then edit any previously annotated skeletons.

In some examples, annotation tool 242 may be configured to load multiple frames of point cloud training data 230 in response to user input. The annotation tool may initially display a single point cloud frame 252 of the multiple frames. Annotation tool 242 may further generate one or more video controls 256 that cause annotation tool 242 to display each of the multiple frames of point cloud data in sequence (e.g., like a video). In the example of FIG. 10, video control buttons 256 include play and stop buttons, but more controls (e.g., pause, rewind, fast-forward, etc.) may be provided in other examples.

Annotation tool 242 may also include edit data controls 262 in GUI 250. A user may interact with edit data controls 262 to change the amount of data, or the region, of point cloud frame 252 displayed in GUI 250. The user may specify the region of point cloud frame 252 to be displayed by specifying minimum (min) and maximum (max) extents in the horizontal (x-lims), vertical (y-lims), and depth (z-lims) directions. Annotation tool 242 may be configured to crop point cloud frame 252 in response to user input in edit data controls 262 and to display the cropped region of point cloud frame 252. Edit data controls 262 may further include a rotation (rot) button that changes the angle from which point cloud frame 252 is viewed. A size button in edit data controls 262 changes the displayed size of each point in the point cloud.
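A minimal sketch of the cropping operation driven by the x-lims/y-lims/z-lims controls, assuming the point cloud frame is stored as an (N, 3) array:

```python
import numpy as np

def crop_frame(points: np.ndarray, x_lims, y_lims, z_lims) -> np.ndarray:
    """Return only the points inside the user-specified min/max limits
    for the horizontal (X), vertical (Y), and depth (Z) directions."""
    lo = np.array([x_lims[0], y_lims[0], z_lims[0]])
    hi = np.array([x_lims[1], y_lims[1], z_lims[1]])
    mask = np.all((points >= lo) & (points <= hi), axis=1)
    return points[mask]
```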

In the example above, by manipulating edit data controls 262, a user may use annotation tool 242 to manually crop the region of point cloud frame 252 to be displayed. Cropping point cloud frame 252, e.g., around one or more potential persons in the data, may make annotating point cloud frame 252 easier. In other examples, rather than having the user manually crop point cloud frame 252, annotation tool 242 may be configured to automatically identify regions of interest in point cloud frame 252 and automatically crop point cloud frame 252 to display only the identified regions of interest. In one example, annotation tool 242 may be configured to identify regions of interest by detecting which regions of point cloud frame 252 include data indicative of a person on whom a pose may be annotated. In one example, annotation tool 242 may provide point cloud frame 252 to deep neural network 244 (FIG. 9) in order to identify the regions of interest. Deep neural network 244 may be a deep neural network that identifies persons and estimates poses from point cloud data in the same manner as deep neural network 44 described above (see FIG. 2). Deep neural network 244 may be a fully trained deep neural network configured for pose estimation, or may be a deep neural network being trained with the annotated point clouds 232 (see FIG. 9) produced by annotation tool 242. Deep neural network 244 may provide an indication of the regions of interest to annotation tool 242, and annotation tool 242 may crop point cloud frame 252 to, or around, the indicated regions of interest.

FIG. 11 is a conceptual user interface diagram showing a cropped point cloud for annotation. In FIG. 11, annotation tool 242 displays the cropped region of point cloud frame 252 through GUI 250 in a top view window 266, a side view window 268, and a front view window 270. In top view window 266, annotation tool 242 displays the cropped region of point cloud frame 252 from directly overhead. In side view window 268, annotation tool 242 displays the cropped region of point cloud frame 252 from an angle (e.g., 90 degrees) relative to a predetermined front angle. In front view window 270, annotation tool 242 displays the cropped region of point cloud frame 252 from the predetermined front angle. In other examples, annotation tool 242 may display the cropped region of point cloud frame 252 from more or fewer angles and/or viewpoints (including isometric viewpoints). In addition, annotation tool 242 is not limited to displaying the cropped region of point cloud frame 252 from different angles; annotation tool 242 may also display the entire frame 252 from various angles.

Returning to FIG. 10, annotation tool 242 may further include an annotation control 260 (e.g., a button labeled "Skeleton"). When the user selects the skeleton button of annotation control 260, annotation tool 242 may mark points in point cloud frame 252 with a plurality of annotation points. That is, annotation tool 242 may overlay a skeleton defined by a plurality of annotation points, where the annotation points correspond to points on a human body.

FIG. 12 is a conceptual diagram showing an exemplary skeleton for annotation. Skeleton 400 represents an exemplary skeleton with which annotation tool 242 may mark the points of point cloud frame 252. In the example of FIG. 12, skeleton 400 is defined by 14 annotation points, each corresponding to a human joint or other human anatomical feature. In other examples, skeleton 400 may include more or fewer annotation points. As will be explained below, a user may manipulate and move one or more of the annotation points of skeleton 400 to define the pose of a person represented by the points in point cloud frame 252.

In FIG. 12, skeleton 400 faces the viewer. As such, the "right-side" limbs are shown on the left side of FIG. 12. Skeleton 400 is defined by a top-of-head annotation point 402 (1), a center-of-neck annotation point 404 (2), a left shoulder annotation point 408 (10), a right shoulder annotation point 406 (5), a left elbow annotation point 412 (11), a right elbow annotation point 410 (6), a left hand annotation point 416 (12), a right hand annotation point 414 (7), a left hip annotation point 420 (4), a right hip annotation point 418 (3), a left knee annotation point 424 (13), a right knee annotation point 422 (8), a left foot annotation point 428 (14), and a right foot annotation point 426 (9). The numbers in parentheses next to the reference numerals of the annotation points correspond to the selection buttons of annotation tool 242 shown in FIGS. 13 and 14.

In addition to displaying the annotation points of skeleton 400, annotation tool 242 may also display lines between the annotation points to delineate the limbs and/or major body parts of skeleton 400. For example, annotation tool 242 may display a line between top-of-head annotation point 402 and center-of-neck annotation point 404 to delineate the head. Annotation tool 242 may display a line extending from center-of-neck annotation point 404 through right shoulder annotation point 406 and right elbow annotation point 410 and ending at right hand annotation point 414 to delineate the right arm. Annotation tool 242 may display a line extending from center-of-neck annotation point 404 through left shoulder annotation point 408 and left elbow annotation point 412 and ending at left hand annotation point 416 to delineate the left arm. Annotation tool 242 may display lines from center-of-neck annotation point 404 to left hip annotation point 420, to right hip annotation point 418, and back to center-of-neck annotation point 404 to delineate the torso. Annotation tool 242 may display a line from right hip annotation point 418 through right knee annotation point 422 and ending at right foot annotation point 426 to delineate the right leg. Annotation tool 242 may display a line from left hip annotation point 420 through left knee annotation point 424 and ending at left foot annotation point 428 to delineate the left leg.
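This limb structure can be captured as an edge list over the selection numbers 1-14 of FIG. 12; the sketch below is one such encoding assumed for illustration, not a structure specified by the patent.

```python
# Limbs of skeleton 400 as (from, to) pairs of the selection numbers
# shown in parentheses in FIG. 12 (1 = top of head, 2 = center of neck,
# 3/4 = right/left hip, 5-7 = right shoulder/elbow/hand,
# 8-9 = right knee/foot, 10-12 = left shoulder/elbow/hand,
# 13-14 = left knee/foot).
LIMBS = {
    "head":      [(1, 2)],
    "right_arm": [(2, 5), (5, 6), (6, 7)],
    "left_arm":  [(2, 10), (10, 11), (11, 12)],
    "torso":     [(2, 4), (4, 3), (3, 2)],
    "right_leg": [(3, 8), (8, 9)],
    "left_leg":  [(4, 13), (13, 14)],
}

def limb_segments(points_by_id, limbs=LIMBS):
    """Yield (limb_name, start_xyz, end_xyz) line segments to draw,
    given a dict mapping selection numbers to 3D positions."""
    for name, edges in limbs.items():
        for a, b in edges:
            yield name, points_by_id[a], points_by_id[b]
```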

As shown in FIG. 12, annotation tool 242 may use different line widths and/or dash styles for different limbs. In this way, the user can more easily distinguish the limbs of skeleton 400 from one another in order to select the appropriate annotation points for manipulation. In other examples, annotation tool 242 may use different colors, rather than line widths or dash styles, to distinguish the different limbs. For example, the lines between the annotation points delineating the head may be blue, the lines between the annotation points delineating the right arm may be green, the lines between the annotation points delineating the left arm may be red, the lines between the annotation points delineating the torso may be yellow, the lines between the annotation points delineating the right leg may be magenta, and the lines between the annotation points delineating the left leg may be cyan. Of course, other colors may be used.

FIG. 13 is a conceptual user interface diagram showing an estimated annotation of a point cloud. In FIG. 13, annotation tool 242 marks points in point cloud frame 252 with the annotation points of skeleton 400. Annotation tool 242 shows skeleton 400 in each of top view 266, side view 268, and front view 270 in GUI 250. Initially, annotation tool 242 may display skeleton 400 at a default position and in a default pose. As can be seen in FIG. 13, the default position of skeleton 400 does not match the actual pose of the person depicted in the point cloud data of point cloud frame 252. The user may manipulate and/or move one or more of the annotation points of skeleton 400 to match the actual pose of one or more persons present in the data of point cloud frame 252, thereby producing an annotated point cloud.

In the example of FIG. 13, annotation tool 242 displays a single skeleton 400. In other examples, if multiple poses are to be annotated in point cloud frame 252, annotation tool 242 may generate and display multiple skeletons. In one example, annotation tool 242 may place skeleton 400 at a default position, e.g., in the center of each view. In other examples, the user may manually position skeleton 400 (e.g., by moving all annotation points together) at a location. In still other examples, annotation tool 242 may automatically determine the positions of one or more persons in point cloud frame 252 and place skeleton 400 at an automatically determined position. In one example, annotation tool 242 may provide point cloud frame 252 to deep neural network 244. Deep neural network 244 may be a deep neural network configured to estimate poses in the same manner as deep neural network 44 described above (see FIG. 2). Deep neural network 244 may be a fully trained deep neural network configured for pose estimation, or may be a deep neural network being trained with the annotated point clouds 232 (see FIG. 9) produced by annotation tool 242. Deep neural network 244 may determine the positions of persons in point cloud frame 252 and indicate the positions of such persons to annotation tool 242. Annotation tool 242 may then display the default skeleton 400 at a position indicated by deep neural network 244.

Annotation tool 242 may also display skeleton 400 in a default pose. That is, annotation tool 242 may display the annotation points of the skeleton in default orientations relative to one another. FIG. 13 shows one example of a default pose for skeleton 400. Of course, annotation tool 242 may generate other default poses. In other examples, rather than using a default pose, annotation tool 242 may estimate the position and pose of the person in point cloud frame 252. In one example, annotation tool 242 may provide point cloud frame 252 to deep neural network 244. Deep neural network 244 may be a deep neural network configured to estimate poses in the same manner as deep neural network 44 described above (see FIG. 2). Deep neural network 244 may be a fully trained deep neural network configured for pose estimation, or may be a deep neural network being trained with the annotated point clouds 232 (see FIG. 9) produced by annotation tool 242. Deep neural network 244 may determine estimated positions and poses of the persons found in point cloud frame 252 and indicate those positions and poses to annotation tool 242.

Annotation tool 242 may then display the default skeleton 400 at the position and in the pose indicated by deep neural network 244. Such an estimated pose will typically not be completely accurate. However, the estimated pose produced by deep neural network 244 may be closer to the actual pose found in the point cloud data than the default pose. Accordingly, the user of annotation tool 242 may start from a pose that is closer to the actual pose to be annotated, making the manual process of annotating the pose easier and faster.

Annotation tool 242 may provide the user with several different tools for manipulating skeleton 400 into a position representing the actual pose of the person found in point cloud frame 252. Annotation tool 242 may generate a select area button 272. The user may activate select area button 272 and then select which of top view 266, side view 268, or front view 270 the user will interact with. Depending on the position and orientation of the person captured in point cloud frame 252, manipulation of skeleton 400 may be easier in different views. Annotation point controls 274 allow the user to select particular annotation points to control. The numbers 1-14 of the checkboxes in annotation point controls 274 correspond to the parenthetical numbers of the annotation points shown in FIG. 12. The user may select one or more annotation points by selecting the corresponding checkboxes. Annotation tool 242 may then move one or more of the selected annotation points in response to user input to define the pose of the person and create the annotated point cloud data 232 (see FIG. 9).

In one example, annotation tool 242 may move a selected annotation point in response to user interaction with a mouse (e.g., click and drag). In other examples, annotation tool 242 may move one or more annotation points in response to user interaction with a rotation control 280 and/or position controls 282. In the example of FIG. 13, rotation control 280 is a slider control that causes annotation tool 242 to rotate all of the annotation points of skeleton 400 about the top-of-head annotation point. Position controls 282 include individual sliders for the horizontal (X), vertical (Y), and depth (Z) dimensions of point cloud frame 252. In response to the user moving a slider, annotation tool 242 moves the selected annotation points along the corresponding dimension within point cloud frame 252. Although the example of FIG. 13 shows slider controls, other control types may be used, including text entry of specific coordinates (e.g., (X, Y, Z) coordinates) within point cloud frame 252.

In one example of this disclosure, annotation tool 242 may assign a unique identifier to each annotation point in a skeleton. In this way, the pose and annotation points of a particular skeleton can be tracked across multiple frames. In addition, in conjunction with the action recognition techniques discussed above, annotation tool 242 may also include an action category label (e.g., for the person's pose) for each frame and each skeleton.

In one example of this disclosure, to increase the precision of positioning annotation points, annotation tool 242 may be configured to allow only a single annotation point to be selected at a time. Once a single annotation point is selected, the user may cause annotation tool 242 to move only the single selected annotation point. For example, once one of the checkboxes of annotation point controls 274 is selected, annotation tool 242 may make the other checkboxes unavailable for selection. To move another annotation point, the user may first deselect the currently selected annotation point. While a single selected annotation point is moved, annotation tool 242 keeps all other annotation points stationary. In other examples, annotation tool 242 may allow multiple annotation points to be selected. In such an example, annotation tool 242 will move only the selected annotation points (e.g., the one or more selected annotation points) in response to user input; unselected annotation points will remain stationary.

That is, in one example, annotation tool 242 receives a selection of a single annotation point of the plurality of annotation points and, in response to user input, moves only the single selected annotation point to define part of the human pose. In other examples, annotation tool 242 receives a selection of two or more of the plurality of annotation points and, in response to user input, moves only the two or more selected annotation points to define part of the human pose. The user may type notes about the annotated point cloud frame 252 in an annotation window 276. When annotation is complete, the user may activate a save control 278 to save the final position of each annotation point. The annotated point cloud 232 (see FIG. 9) may then be saved in memory (e.g., in a .json file) and subsequently used to train a neural network (e.g., neural network 244).
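A minimal sketch of what saving the final annotation might look like, assuming a simple .json layout with one entry per annotation point; the field names and schema are illustrative, as the patent does not specify the file format beyond .json.

```python
import json

def save_annotation(path, frame_id, keypoints, notes=""):
    """Write the final positions of the annotation points to a .json
    file that can later be used as training data for the network.

    keypoints: dict mapping selection number (1-14) to [x, y, z].
    """
    record = {
        "frame": frame_id,
        "keypoints": {str(k): list(map(float, v)) for k, v in keypoints.items()},
        "notes": notes,
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
```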

FIG. 14 is a conceptual user interface diagram showing an annotated point cloud. FIG. 14 shows the pose of skeleton 400 from FIG. 13 after user manipulation. As can be seen in FIG. 14, the annotation points of skeleton 400 have been moved to positions (i.e., a pose) that more closely match the actual pose of the person captured in point cloud 252.

FIG. 15 is a flowchart showing example operations of an annotation tool according to one example of this disclosure. The technique of FIG. 15 may be performed by processor 222 executing the instructions of annotation tool 242 (see FIG. 9). The technique of FIG. 15 depicts one exemplary process for annotating a single frame of point cloud data. The process of FIG. 15 may be repeated for multiple frames of point cloud data.

Annotation tool 242 may load and display point cloud data (900). For example, annotation tool 242 may load point cloud data from a point cloud file (e.g., point cloud training data 230 of FIG. 9) and may cause display device 216 to display GUI 250, where GUI 250 includes the loaded point cloud data. Annotation tool 242 may then determine, automatically or through user input, whether to crop the point cloud (902). If so, annotation tool 242 displays the cropped point cloud from one or more viewing angles (904). Annotation tool 242 may then mark points in the displayed point cloud (e.g., the cropped point cloud) with annotation points (906). If the point cloud data is not cropped (902), annotation tool 242 likewise marks points in the displayed point cloud (e.g., the entire point cloud) with annotation points (906).

Annotation tool 242 then waits until an annotation point is selected (908). Once an annotation point is selected, annotation tool 242 moves the selected annotation point in response to user input (910). Annotation tool 242 then checks for user input indicating that annotation is complete (912). If so, annotation tool 242 outputs the annotated point cloud (914). If not, annotation tool 242 waits for another annotation point to be selected (908) and repeats the movement process in response to user input (910).
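A minimal sketch of this FIG. 15 flow follows, assuming a hypothetical `ui` object that stands in for GUI 250's event handling; all of its methods are assumptions made for illustration, not the tool's implementation.

```python
def annotate_frame(point_cloud, ui):
    """Follows the FIG. 15 flow, steps 900-914."""
    ui.display(point_cloud)                          # 900: load and display
    if ui.should_crop(point_cloud):                  # 902: crop decision (auto or user)
        point_cloud = ui.crop(point_cloud)
        ui.display_views(point_cloud)                # 904: show crop from one or more views
    skeleton = ui.mark_initial_points(point_cloud)   # 906: place annotation points

    while True:
        uid = ui.wait_for_selection()                # 908: wait for a point to be selected
        ui.move_point(skeleton, uid)                 # 910: move it per user input
        if ui.annotation_complete():                 # 912: user indicates annotation done?
            return skeleton                          # 914: output annotated point cloud
```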

It will be recognized that, depending on the example, certain acts or events of any of the techniques described herein may be performed in a different sequence, may be added, merged, or left out altogether (e.g., not all described acts or events are necessary to practice the techniques). Moreover, in certain examples, acts or events may be performed concurrently rather than sequentially, e.g., through multi-threaded processing, interrupt processing, or multiple processors.

In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. Computer-readable media may include computer-readable storage media, which correspond to tangible media such as data storage media, or communication media, including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) non-transitory, tangible computer-readable storage media or (2) communication media such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium.

By way of example and not limitation, such computer-readable data storage media can include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Combinations of the above should also be included within the scope of computer-readable media.

Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general-purpose microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), complex programmable logic devices (CPLDs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. In addition, the techniques could be fully implemented in one or more circuits or logic elements.

The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including an integrated circuit (IC) or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units.

Any changes and/or modifications to the methods and apparatus of the disclosed technology that are known to those of ordinary skill in the art are within the scope of the invention. Various examples of the invention have been described. These and other embodiments are within the scope of the following claims.

Claims (20)

1. A method for annotating a human pose in 3D point cloud data, the method comprising:
causing, by one or more processors, display of at least one frame of a plurality of frames of point cloud data;
marking, by the one or more processors, points in the at least one frame of the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body;
causing, by the one or more processors, a video display of the plurality of frames of the point cloud data, wherein the video display is controllable to advance, fast-forward, pause, and rewind;
moving, by the one or more processors and in response to user input, one or more of the annotation points to define a human pose and create annotated point cloud data; and
outputting, by the one or more processors, the annotated point cloud data for training by a neural network.

2. The method of claim 1, wherein marking, by the one or more processors, points in the at least one frame of the point cloud data with the plurality of annotation points comprises:
estimating, by the one or more processors, a location of a potential human pose in the at least one frame of the point cloud data; and
marking, by the one or more processors, the annotation points to correspond to the estimated location of the potential human pose.

3. The method of claim 1, wherein moving, by the one or more processors and in response to user input, one or more of the annotation points to define the human pose and create the annotated point cloud data comprises:
receiving, by the one or more processors, a selection of a single annotation point of the plurality of annotation points; and
moving, by the one or more processors and in response to user input, only the single annotation point to define a portion of the human pose.

4. The method of claim 1, wherein moving, by the one or more processors and in response to user input, one or more of the annotation points to define the human pose and create the annotated point cloud data comprises:
receiving, by the one or more processors, a selection of two or more annotation points of the plurality of annotation points; and
moving, by the one or more processors and in response to user input, only the two or more annotation points to define a portion of the human pose.

5. The method of claim 1, further comprising:
displaying an image corresponding to the at least one frame of the point cloud data while displaying the point cloud data;
cropping, by the one or more processors, the point cloud data and the image around a potential human pose region in the point cloud data; and
causing, by the one or more processors, display of the cropped region.

6. The method of claim 5, further comprising:
causing, by the one or more processors, display of the cropped region from a plurality of viewing angles.

7. The method of claim 1, wherein the plurality of annotation points includes annotation points corresponding to a top of a head, a center of a neck, a right hip, a left hip, a right shoulder, a right elbow, a right hand, a right knee, a right foot, a left shoulder, a left elbow, a left hand, a left knee, and a left foot, and wherein groups of the annotation points correspond to limbs of a person, the method further comprising:
causing, by the one or more processors, display of lines between the annotation points to define the limbs, including displaying different limbs using different colors.

8. The method of claim 1, further comprising:
adding, by the one or more processors, an action label to each frame of point cloud data of the plurality of frames of the point cloud data and to the human pose.

9. The method of claim 1, further comprising:
training, by the one or more processors, the neural network with the annotated point cloud data, wherein the neural network is configured to estimate a human pose from LiDAR point cloud data.

10. An apparatus for annotating a human pose in 3D point cloud data, comprising:
a memory configured to store point cloud data; and
one or more processors in communication with the memory, the one or more processors configured to:
cause display of at least one frame of a plurality of frames of the point cloud data;
mark points in the at least one frame of the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body;
cause a video display of the plurality of frames of the point cloud data, wherein the video display is controllable to advance, fast-forward, pause, and rewind;
move, in response to user input, one or more of the annotation points to define a human pose and create annotated point cloud data; and
output the annotated point cloud data for training by a neural network.

11. The apparatus of claim 10, wherein, to mark points in the at least one frame of the point cloud data with the plurality of annotation points, the one or more processors are further configured to:
estimate a location of a potential human pose in the at least one frame of the point cloud data; and
mark the annotation points to correspond to the estimated location of the potential human pose.

12. The apparatus of claim 10, wherein, to move, in response to user input, one or more of the annotation points to define the human pose and create the annotated point cloud data, the one or more processors are further configured to:
receive a selection of a single annotation point of the plurality of annotation points; and
move, in response to user input, only the single annotation point to define a portion of the human pose.

13. The apparatus of claim 10, wherein, to move, in response to user input, one or more of the annotation points to define the human pose and create the annotated point cloud data, the one or more processors are further configured to:
receive a selection of two or more annotation points of the plurality of annotation points; and
move, in response to user input, only the two or more annotation points to define a portion of the human pose.

14. The apparatus of claim 10, wherein the one or more processors are further configured to:
display an image corresponding to the at least one frame of the point cloud data while displaying the point cloud data;
crop the point cloud data and the image around a potential human pose region in the point cloud data; and
cause display of the cropped region.

15. The apparatus of claim 14, wherein the one or more processors are further configured to:
cause display of the cropped region from a plurality of viewing angles.

16. The apparatus of claim 10, wherein the plurality of annotation points includes annotation points corresponding to a top of a head, a center of a neck, a right hip, a left hip, a right shoulder, a right elbow, a right hand, a right knee, a right foot, a left shoulder, a left elbow, a left hand, a left knee, and a left foot, and wherein groups of the annotation points correspond to limbs of a person, and wherein the one or more processors are further configured to:
cause display of lines between the annotation points to define the limbs, including causing different limbs to be displayed using different colors.

17. The apparatus of claim 10, wherein the one or more processors are further configured to:
add an action label to each frame of point cloud data of the plurality of frames of the point cloud data and to the human pose.

18. The apparatus of claim 10, wherein the one or more processors are further configured to:
train the neural network with the annotated point cloud data, wherein the neural network is configured to estimate a human pose from LiDAR point cloud data.

19. An apparatus for annotating a human pose in 3D point cloud data, comprising:
means for causing display of at least one frame of a plurality of frames of point cloud data;
means for marking points in the at least one frame of the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body;
means for causing a video display of the plurality of frames of the point cloud data, wherein the video display is controllable to advance, fast-forward, pause, and rewind;
means for moving, in response to user input, one or more of the annotation points to define a human pose and create annotated point cloud data; and
means for outputting the annotated point cloud data for training by a neural network.

20. A non-transitory computer-readable storage medium storing instructions that, when executed, cause one or more processors to:
cause display of at least one frame of a plurality of frames of point cloud data;
mark points in the at least one frame of the point cloud data with a plurality of annotation points, the plurality of annotation points corresponding to points on a human body;
cause a video display of the plurality of frames of the point cloud data, wherein the video display is controllable to advance, fast-forward, pause, and rewind;
move, in response to user input, one or more of the annotation points to define a human pose and create annotated point cloud data; and
output the annotated point cloud data for training by a neural network.
CN202010171054.XA 2019-03-12 2020-03-12 Tools and methods for annotating human poses in 3D point cloud data Active CN111695402B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962817400P 2019-03-12 2019-03-12
US62/817,400 2019-03-12
US16/692,901 US11308639B2 (en) 2019-03-12 2019-11-22 Tool and method for annotating a human pose in 3D point cloud data
US16/692,901 2019-11-22

Publications (2)

Publication Number Publication Date
CN111695402A CN111695402A (en) 2020-09-22
CN111695402B true CN111695402B (en) 2023-09-08

Family

ID=69804514

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010171054.XA Active CN111695402B (en) 2019-03-12 2020-03-12 Tools and methods for annotating human poses in 3D point cloud data

Country Status (3)

Country Link
US (1) US11308639B2 (en)
EP (1) EP3709134B1 (en)
CN (1) CN111695402B (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11494938B2 (en) * 2018-05-15 2022-11-08 Northeastern University Multi-person pose estimation using skeleton prediction
US11495070B2 (en) * 2019-09-10 2022-11-08 Orion Entrance Control, Inc. Method and system for providing access control
JP7216679B2 (en) * 2020-02-21 2023-02-01 株式会社日立ハイテク Information processing device and judgment result output method
TWI733616B (en) * 2020-11-04 2021-07-11 財團法人資訊工業策進會 Reconition system of human body posture, reconition method of human body posture, and non-transitory computer readable storage medium
US20210110606A1 (en) * 2020-12-23 2021-04-15 Intel Corporation Natural and immersive data-annotation system for space-time artificial intelligence in robotics and smart-spaces
CN112686979B (en) * 2021-03-22 2021-06-01 中智行科技有限公司 Simulated pedestrian animation generation method and device and electronic equipment
CN113361333B (en) * 2021-05-17 2022-09-27 重庆邮电大学 A non-contact cycling motion state monitoring method and system
CN115436894A (en) * 2021-06-01 2022-12-06 富士通株式会社 Key point identification device and method based on wireless radar signals
CN113223181B (en) * 2021-06-02 2022-12-23 广东工业大学 A Pose Estimation Method for Weakly Textured Objects
CN113341402A (en) * 2021-07-15 2021-09-03 哈尔滨工程大学 Sonar device for sonar monitoring robot
CN113705445B (en) * 2021-08-27 2023-08-04 深圳龙岗智能视听研究院 Method and equipment for recognizing human body posture based on event camera
CN113963192B (en) * 2021-09-22 2025-03-25 森思泰克河北科技有限公司 Fall detection method, device and electronic device
CN114091601B (en) * 2021-11-18 2023-05-05 业成科技(成都)有限公司 Sensor fusion method for detecting personnel condition
EP4513233A4 (en) * 2022-04-21 2025-07-30 Sony Semiconductor Solutions Corp Information processing device and program
CN114913603B (en) * 2022-05-25 2026-01-23 南京南自信息技术有限公司 Single person posture estimation system based on key point regression and working method thereof
CN115965788B (en) * 2023-01-12 2023-07-28 黑龙江工程学院 Point cloud semantic segmentation method based on multi-view image structural feature attention convolution
US12466422B2 (en) * 2023-01-30 2025-11-11 Ford Global Technologies, Llc Large animal detection and intervention in a vehicle
US20240310523A1 (en) * 2023-03-16 2024-09-19 Ford Global Technologies, Llc Systems and methods for in-cabin monitoring with liveliness detection
CN116699551B (en) * 2023-06-17 2026-01-09 华北理工大学 An Adaptive Human Body 3D Point Cloud Generation Method Based on 4D Millimeter-Wave Radar

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106688013A (en) * 2014-09-19 2017-05-17 高通股份有限公司 System and method of pose estimation
CN107871129A (en) * 2016-09-27 2018-04-03 北京百度网讯科技有限公司 Method and apparatus for handling cloud data
CN108062526A (en) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 A kind of estimation method of human posture and mobile terminal
US9984499B1 (en) * 2015-11-30 2018-05-29 Snap Inc. Image and point cloud based tracking and in augmented reality systems
CN108734120A (en) * 2018-05-15 2018-11-02 百度在线网络技术(北京)有限公司 Method, device and equipment for labeling image and computer readable storage medium
JP2018189510A (en) * 2017-05-08 2018-11-29 株式会社マイクロ・テクニカ Method and apparatus for estimating position and orientation of three-dimensional object
CN109086683A (en) * 2018-07-11 2018-12-25 清华大学 A kind of manpower posture homing method and system based on cloud semantically enhancement

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070174769A1 (en) 2006-01-24 2007-07-26 Sdgi Holdings, Inc. System and method of mapping images of the spine
US9571816B2 (en) 2012-11-16 2017-02-14 Microsoft Technology Licensing, Llc Associating an object with a subject
US9485540B2 (en) * 2014-06-03 2016-11-01 Disney Enterprises, Inc. System and method for multi-device video image display and modification
US11120478B2 (en) 2015-01-12 2021-09-14 Ebay Inc. Joint-based item recognition
GB2537681B (en) 2015-04-24 2018-04-25 Univ Oxford Innovation Ltd A method of detecting objects within a 3D environment
US20170046865A1 (en) * 2015-08-14 2017-02-16 Lucasfilm Entertainment Company Ltd. Animation motion capture using three-dimensional scanner data
US10282663B2 (en) 2015-08-15 2019-05-07 Salesforce.Com, Inc. Three-dimensional (3D) convolution with 3D batch normalization
US9898858B2 (en) 2016-05-18 2018-02-20 Siemens Healthcare Gmbh Human body representation with non-rigid parts in an imaging system
US10451405B2 (en) 2016-11-22 2019-10-22 Symbol Technologies, Llc Dimensioning system for, and method of, dimensioning freight in motion along an unconstrained path in a venue
ES2927177T3 (en) 2017-02-07 2022-11-03 Veo Robotics Inc Workspace safety monitoring and equipment control
US10539676B2 (en) 2017-03-22 2020-01-21 Here Global B.V. Method, apparatus and computer program product for mapping and modeling a three dimensional structure
US10444759B2 (en) 2017-06-14 2019-10-15 Zoox, Inc. Voxel based ground plane estimation and object segmentation
US11475351B2 (en) 2017-11-15 2022-10-18 Uatc, Llc Systems and methods for object detection, tracking, and motion prediction
GB201804082D0 (en) * 2018-03-14 2018-04-25 Five Ai Ltd Image annotation
US10977827B2 (en) 2018-03-27 2021-04-13 J. William Mauchly Multiview estimation of 6D pose
US10839266B2 (en) 2018-03-30 2020-11-17 Intel Corporation Distributed object detection processing
CN108898063B (en) 2018-06-04 2021-05-04 大连大学 Human body posture recognition device and method based on full convolution neural network
US11217006B2 (en) 2018-10-29 2022-01-04 Verizon Patent And Licensing Inc. Methods and systems for performing 3D simulation based on a 2D video image
AU2018282435B1 (en) 2018-11-09 2020-02-06 Beijing Didi Infinity Technology And Development Co., Ltd. Vehicle positioning system using LiDAR

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106688013A (en) * 2014-09-19 2017-05-17 高通股份有限公司 System and method of pose estimation
US9984499B1 (en) * 2015-11-30 2018-05-29 Snap Inc. Image and point cloud based tracking and in augmented reality systems
CN107871129A (en) * 2016-09-27 2018-04-03 北京百度网讯科技有限公司 Method and apparatus for handling cloud data
JP2018189510A (en) * 2017-05-08 2018-11-29 株式会社マイクロ・テクニカ Method and apparatus for estimating position and orientation of three-dimensional object
CN108062526A (en) * 2017-12-15 2018-05-22 厦门美图之家科技有限公司 A kind of estimation method of human posture and mobile terminal
CN108734120A (en) * 2018-05-15 2018-11-02 百度在线网络技术(北京)有限公司 Method, device and equipment for labeling image and computer readable storage medium
CN109086683A (en) * 2018-07-11 2018-12-25 清华大学 A kind of manpower posture homing method and system based on cloud semantically enhancement

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on 3D Hand Pose Modeling and Its Interactive System; Li Haoxin; China Masters' Theses Full-text Database, Information Science and Technology (No. 2); I138-2958 *

Also Published As

Publication number Publication date
CN111695402A (en) 2020-09-22
US11308639B2 (en) 2022-04-19
EP3709134B1 (en) 2024-09-04
EP3709134A1 (en) 2020-09-16
US20200294266A1 (en) 2020-09-17

Similar Documents

Publication Publication Date Title
CN111695402B (en) Tools and methods for annotating human poses in 3D point cloud data
US11043005B2 (en) Lidar-based multi-person pose estimation
KR102677044B1 (en) Image processing methods, apparatus and devices, and storage media
US20230351795A1 (en) Determining associations between objects and persons using machine learning models
Possatti et al. Traffic light recognition using deep learning and prior maps for autonomous cars
JP7254823B2 (en) Neural networks for object detection and characterization
US12067471B2 (en) Searching an autonomous vehicle sensor data repository based on context embedding
US10837788B1 (en) Techniques for identifying vehicles and persons
US10809081B1 (en) User interface and augmented reality for identifying vehicles and persons
JP7011578B2 (en) Methods and systems for monitoring driving behavior
US12073575B2 (en) Object-centric three-dimensional auto labeling of point cloud data
CN112233221B (en) Three-dimensional map reconstruction system and method based on instant positioning and map construction
Pravallika et al. Deep learning frontiers in 3D object detection: a comprehensive review for autonomous driving
EP3814981B1 (en) Method and device for computer vision
US20230377160A1 (en) Method and electronic device for achieving accurate point cloud segmentation
CN115440001B (en) Child following care method, device, following robot and storage medium
CN115565072A (en) Road garbage recognition and positioning method and device, electronic equipment and medium
Aswini et al. Drone object detection using deep learning algorithms
CN118736535A (en) Visual detection of hands on the steering wheel
Du et al. A Lightweight UAV Visual Obstacle Avoidance Algorithm Based on Improved YOLOv8.
US11846514B1 (en) User interface and augmented reality for representing vehicles and persons
US20250265823A1 (en) Device and method for generating training data for an object detector
Ahmed et al. A novel hybrid deep learning algorithm for object and lane detection in autonomous driving
US20250060481A1 (en) Image and lidar adaptive transformer for fusion-based perception
Jain et al. Gestarlite: An on-device pointing finger based gestural interface for smartphones and video see-through head-mounts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant