CN117373012A - Method, apparatus and medium for performing multi-tasking point cloud awareness


Info

Publication number: CN117373012A
Application number: CN202310981387.2A
Authority: CN (China)
Other languages: Chinese (zh)
Inventors: 陈新元, 宋帆, 吴子章, 蒋伟平
Current Assignee: Zongmu Technology Shanghai Co Ltd
Original Assignee: Zongmu Technology Shanghai Co Ltd
Priority / Filing date: 2023-08-04
Publication date: 2024-01-09
Legal status: Pending
Prior art keywords: point cloud, task, feature, BEV

Classifications

    • G06V20/64: Scenes; scene-specific elements; type of objects; three-dimensional objects
    • G01S17/89: Lidar systems specially adapted for mapping or imaging
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/08: Learning methods
    • G06V10/764: Image or video recognition or understanding using classification, e.g. of video objects
    • G06V10/82: Image or video recognition or understanding using neural networks
    • G06V20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Abstract

The application discloses a method for performing multi-tasking point cloud sensing, comprising: inputting point cloud data associated with a 3D object, acquired by a lidar, into a backbone network with shared network parameters to perform point-based feature extraction and generate corresponding shared point cloud features; performing a first task regarding point cloud semantic segmentation based on the shared point cloud features; generating a specified feature based on the shared point cloud features; performing one or more tasks regarding the 3D object based on the specified feature; normalizing the loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object; and performing gradient back-propagation and gradient updates on the backbone network based on the normalized loss functions. An apparatus for performing multi-tasking point cloud sensing, as well as numerous other aspects, are also disclosed.

Description

Method, apparatus and medium for performing multi-tasking point cloud awareness
Technical Field
The present invention relates to the field of autonomous driving perception, and more particularly to a method, apparatus, and medium for performing multi-tasking point cloud sensing.
Background
Unlike image data, which lacks depth information and is strongly affected by illumination changes, lidar (Lidar) point cloud data provides precise depth together with structural and spatial information for relative positioning, and is far less sensitive to illumination changes. For lidar-based point cloud perception tasks such as 3D object detection, 3D object tracking, and point cloud semantic segmentation, a separate network can be designed for each specific task according to the characteristics of point cloud data, thereby improving the metrics of each individual task. In general, however, such networks are tailored to their respective perception tasks and cannot be organically linked together for training or inference. In addition, practical embedded systems impose strict limits on resource consumption and real-time performance, so deploying a separately connected network for each task is difficult to realize on an embedded system. Current point cloud perception networks are designed for specific perception tasks, and their perception capabilities cannot be correlated and mutually improved across tasks.
Accordingly, there is a need for an improved method, apparatus, and medium for performing multi-tasking point cloud sensing.
Disclosure of Invention
The following presents a simplified summary of one or more aspects in order to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In order to solve the above problems, the present invention proposes a method, an apparatus and a medium for performing multi-tasking point cloud sensing.
An aspect of the present application provides a method for performing multi-tasking point cloud sensing, the method comprising: inputting point cloud data associated with a 3D object, acquired by a lidar, into a backbone network with shared network parameters to perform point-based feature extraction and generate corresponding shared point cloud features; performing a first task regarding point cloud semantic segmentation based on the shared point cloud features; generating a specified feature based on the shared point cloud features; performing one or more tasks regarding the 3D object based on the specified feature; normalizing the loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object; and performing gradient back-propagation and gradient updates on the backbone network based on the normalized loss functions.
Preferably, the point cloud data associated with the 3D object includes first coordinate information, second coordinate information, third coordinate information, and reflectivity information of each point.
Preferably, generating the specified feature based on the shared point cloud features comprises generating a Bird's Eye View (BEV) feature based on the shared point cloud features, and performing one or more tasks regarding the 3D object based on the specified feature comprises: performing a second task regarding detection of the 3D object, a third task regarding tracking of the 3D object, or both, based on the BEV features.
Preferably, generating the BEV features comprises performing pillar-based feature extraction on the shared point cloud features to generate the BEV features.
Preferably, performing the second task regarding detection of the 3D object comprises: inputting the BEV features into a multi-layer perceptron to predict, by regression, first information of the 3D object, wherein the first information includes at least a projection point of the center point of the 3D object onto a BEV map, an offset, a height of the 3D object, a width of the 3D object, a length of the 3D object, a heading angle of the 3D object, and a velocity of the 3D object; and wherein the BEV map is generated based on the BEV features.
Preferably, performing the third task regarding tracking of the 3D object comprises: inputting the BEV features into a multi-layer perceptron to predict, by regression, second information of the 3D object, wherein the second information includes at least parameters regarding the motion of the 3D object and an allocation matrix.
Preferably, the pillar-based feature extraction is performed using the point cloud object detection network PointPillars.
Preferably, performing the first task regarding point cloud semantic segmentation comprises: inputting the shared point cloud features into a multi-layer perceptron after point-by-point feature interaction; and performing classification on each point using the multi-layer perceptron to predict the class of each point.
Preferably, normalizing the loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object comprises: calculating a first loss function of the first task, a second loss function of the second task, or a third loss function of the third task, respectively; and normalizing the first, second, and third loss functions by predetermined weights.
Another aspect of the present application provides an apparatus for performing multi-tasking point cloud sensing, comprising a processor and a memory, the memory storing program instructions; the processor executes program instructions to implement any of the methods for performing multi-tasking point cloud sensing described above.
Yet another aspect of the present application provides a non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform any of the methods for performing multi-tasking point cloud sensing described above.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Additional aspects, features, and/or advantages of the embodiments will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
So that the manner in which the above recited features of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only certain typical aspects of this invention and are therefore not to be considered limiting of its scope, for the description may admit to other equally effective aspects. In the drawings, like reference numerals are given like designations throughout. It is noted that the drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn to scale for illustrative purposes.
Fig. 1 illustrates an example of a method for performing multi-tasking point cloud sensing according to an embodiment of the present invention.
Fig. 2A-2D illustrate examples of further methods for performing multi-tasking point cloud sensing according to an embodiment of the present invention.
FIG. 3 is a block diagram of an embodiment of a device that supports performing multi-tasking point cloud sensing according to an embodiment of the present invention.
FIG. 4 is an example of an exemplary functional apparatus for performing multi-tasking point cloud sensing according to an embodiment.
Detailed Description
The present invention will be further described in detail below with reference to specific embodiments and the accompanying drawings, in order to make the objects, technical solutions, and advantages of the present invention more apparent. In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the described exemplary embodiments. It will be apparent, however, to one skilled in the art that the described embodiments may be practiced without some or all of these specific details. In other exemplary embodiments, well-known structures or processing steps have not been described in detail in order to avoid unnecessarily obscuring the concepts of the present disclosure.
In the present specification, unless otherwise indicated, the term "A or B" as used throughout refers to "A and B" as well as "A or B", and does not mean that A and B are mutually exclusive.
Unlike image data, which lacks depth information and is strongly affected by illumination changes, lidar point cloud data provides accurate depth together with structural and spatial information for relative positioning, is less sensitive to illumination changes, and is therefore more robust. For lidar-based point cloud perception tasks such as 3D object detection, 3D object tracking, and point cloud semantic segmentation, a separate network can be designed for each specific task according to the characteristics of point cloud data, thereby improving the metrics of each individual task. In general, however, these networks are tailored to their own perception tasks and cannot be organically linked together for training or inference. Moreover, practical embedded systems impose strict requirements on resource consumption and real-time performance, so connecting such separate networks is difficult to implement on an embedded system.
Current point cloud perception networks are designed for specific perception tasks, and their perception capabilities cannot be correlated and mutually improved across tasks. A lidar-based multi-task point cloud perception network improves system redundancy, and the above problems are greatly alleviated in such a network. The shared point cloud feature extraction network helps reduce network inference time and resource consumption; the multiple perception tasks can be designed to be relatively independent yet mutually reinforcing; and the results of the different tasks can be obtained with a single inference pass, meeting the real-time and resource-consumption requirements of embedded systems. Meanwhile, the intermediate results of 3D object detection can also assist the 3D object tracking task, so that all the perception tasks are improved together. Methods, apparatuses, and media for performing multi-tasking point cloud sensing according to the present invention are explained below with reference to figs. 1 to 4.
Fig. 1 illustrates an example of a method 100 for performing multi-tasking point cloud sensing according to an embodiment of the present invention.
As shown in fig. 1, the method for performing multi-tasking point cloud sensing includes: in step 110, point cloud data associated with the 3D object acquired by the lidar is input into a backbone network of shared network parameters to perform point-based feature extraction and generate corresponding shared point cloud features.
In embodiments of the present application, a vehicle may have a laser radar (LIDAR) for detecting objects and measuring distances to those objects. The LIDAR is typically mounted on the roof of the vehicle; however, if there are multiple LIDAR units, they may be oriented around the front, rear, and sides of the vehicle. The vehicle may have various other location-related systems, various wireless communication interfaces (such as WAN, WLAN, and V2X), RADAR (typically in the front bumper), and sonar (typically on both sides of the vehicle, if present). Various wheel and driveline sensors may also be present, such as tire pressure sensors, accelerometers, gyroscopes, and wheel rotation detectors and/or counters. In an embodiment, distance measurements and relative positions determined via various sensors (such as LIDAR, RADAR, cameras, GNSS, and sonar) may be combined with vehicle size and shape information and with information about sensor positions to determine the distance and relative position between the surfaces of different vehicles, such that the distance or vector from a sensor to another vehicle, or between two different sensors, is adjusted to account for the positioning of the sensors on each vehicle. It should be understood that the above description is intended to provide examples of various sensors in embodiments of a vehicle including a system with autonomous driving capability, and is not intended to be limiting.
In embodiments of the present application, point cloud data refers to a set of vectors in a three-dimensional coordinate system. The scanned data are recorded and stored in the form of points, each of which contains three-dimensional coordinates and may carry other attribute information, such as color, reflectivity, and intensity. Point cloud data are generally acquired by devices such as laser scanners, cameras, three-dimensional scanners, and lidar, and can be used in applications such as three-dimensional modeling, scene reconstruction, robot navigation, virtual reality, and augmented reality. The main characteristics of point cloud data are high-precision, high-resolution, and high-dimensional geometric information that can intuitively represent the shape, surface, texture, and other properties of objects in space.
In the embodiment of the application, the lidar can be used to collect the point cloud data, offering fast measurement, high accuracy, and reliable identification. In embodiments of the present application, point cloud data associated with a 3D object may include first coordinate information (e.g., x-coordinates), second coordinate information (e.g., y-coordinates), third coordinate information (e.g., z-coordinates), and reflectivity information for each point. For example, the four features (the x, y, z coordinates and the reflectivity) of the point cloud data may be read as the features of each radar point; point-based feature extraction is performed on each point's features through a backbone network with shared parameters, and a corresponding feature map is output. The backbone network is a model whose single set of parameters is shared by all perception tasks; its main structure may be a PointNet++ network for extracting point-wise features of the point cloud (i.e., the shared point cloud features), thereby providing the shared point cloud features required by the subsequent perception tasks.
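As a purely illustrative sketch of such a shared backbone (PyTorch is assumed; the module name, layer widths, and the use of a plain per-point MLP in place of the PointNet++ structure named above are all assumptions, not details taken from the patent):

```python
# Hypothetical sketch of a shared per-point backbone; layer sizes are assumptions,
# and a plain per-point MLP stands in for the PointNet++ structure mentioned above.
import torch
import torch.nn as nn

class SharedPointBackbone(nn.Module):
    def __init__(self, in_channels: int = 4, feat_channels: int = 64):
        super().__init__()
        # Shared per-point MLP: every perception task reuses this one set of parameters.
        self.mlp = nn.Sequential(
            nn.Linear(in_channels, 32), nn.ReLU(),
            nn.Linear(32, feat_channels), nn.ReLU(),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        # points: (N, 4) tensor of x, y, z, reflectivity for one lidar sweep.
        # returns: (N, feat_channels) shared point cloud features.
        return self.mlp(points)

# Example: 10,000 lidar points, each with x, y, z and reflectivity.
backbone = SharedPointBackbone()
shared_features = backbone(torch.randn(10000, 4))  # -> (10000, 64)
```

The key property this sketch illustrates is that a single set of backbone parameters produces the per-point features consumed by every downstream task.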
As shown in fig. 1, the method for performing multi-tasking point cloud sensing includes: at 120, a first task regarding point cloud semantic segmentation is performed based on the shared point cloud features.
In an embodiment of the present application, performing the first task regarding point cloud semantic segmentation includes: inputting the shared point cloud features into a multi-layer perceptron after point-by-point feature interaction; and performing classification on each point using the multi-layer perceptron to predict the class of each point. For example, for the point cloud semantic segmentation task, the point-based shared features output by the backbone network may be used directly; the features of each point interact through a multi-layer perceptron (MLP), and a classification task is performed on each point, thereby predicting the class of each point. For example, each point, line segment, or object may be assigned a category to identify certain types of objects (e.g., vehicles, roads, signs, or buildings). Additionally or alternatively, while examples of performing a first task regarding point cloud semantic segmentation are described herein, in embodiments of the present application other tasks may also be performed based on the shared point cloud features without exceeding the scope of the present application.
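A minimal sketch of such a per-point classification head is given below (the number of classes and the layer widths are assumptions, and the point-by-point feature interaction step is simplified away):

```python
# Hypothetical per-point semantic segmentation head; the class count and layer
# widths are assumptions, not values taken from the patent.
import torch
import torch.nn as nn

class SemanticSegHead(nn.Module):
    def __init__(self, feat_channels: int = 64, num_classes: int = 5):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_channels, 64), nn.ReLU(),
            nn.Linear(64, num_classes),      # per-point class logits
        )

    def forward(self, shared_features: torch.Tensor) -> torch.Tensor:
        # shared_features: (N, feat_channels) from the shared backbone.
        # returns: (N, num_classes) logits; argmax gives the predicted class of each point.
        return self.mlp(shared_features)

seg_head = SemanticSegHead()
logits = seg_head(torch.randn(10000, 64))
point_classes = logits.argmax(dim=-1)  # e.g. vehicle, road, sign, building, other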
As shown in fig. 1, the method for performing multi-tasking point cloud sensing includes: at 130, a designated feature is generated based on the shared point cloud feature, and at 140, one or more tasks are performed with respect to the 3D object based on the designated feature.
In an embodiment of the present application, the above specified features include a Bird's Eye View (BEV) feature and the like, and performing one or more tasks regarding the 3D object based on the specified features includes: performing a second task regarding detection of the 3D object, a third task regarding tracking of the 3D object, or both, based on the BEV features.
In an embodiment of the present application, generating BEV features includes performing pillar-based feature extraction on the shared point cloud features to generate the BEV features. For example, when 3D information of an object needs to be regressed based on BEV features, pillar-based feature extraction in the style of PointPillars may be performed on the extracted shared point cloud features to output the BEV features. BEV features are a set of feature descriptions under a bird's-eye-view perspective; they reflect the semantic and edge features exhibited by the surrounding environment while ignoring height. Under the bird's-eye view, temporal and spatial information can be fused more easily, and synchronization and fusion of information across cameras and multi-modal sensors are also convenient. The extracted BEV features can support a variety of autonomous driving perception tasks, including 3D object detection, 3D object tracking, and map semantic segmentation. It should be appreciated that other ways known to those skilled in the art may also be used to extract BEV features.
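The following sketch illustrates one plausible pillar-pooling step from per-point features to a BEV feature map (the grid extent, resolution, and mean pooling per pillar are assumptions; PointPillars itself uses a learned pillar feature network rather than this simplified pooling):

```python
# Hypothetical pillar-based BEV feature extraction; grid extent, resolution and
# mean pooling are assumptions standing in for the PointPillars-style extraction.
import torch

def pillar_bev_features(points, point_feats, x_range=(-50.0, 50.0),
                        y_range=(-50.0, 50.0), grid=(200, 200)):
    # points: (N, 4) raw x, y, z, reflectivity; point_feats: (N, C) shared features.
    H, W = grid
    xs = ((points[:, 0] - x_range[0]) / (x_range[1] - x_range[0]) * W).long().clamp(0, W - 1)
    ys = ((points[:, 1] - y_range[0]) / (y_range[1] - y_range[0]) * H).long().clamp(0, H - 1)
    cell = ys * W + xs                               # flat pillar index of each point

    C = point_feats.shape[1]
    bev = torch.zeros(H * W, C)
    count = torch.zeros(H * W, 1)
    bev.index_add_(0, cell, point_feats)             # sum features per pillar
    count.index_add_(0, cell, torch.ones(len(points), 1))
    bev = bev / count.clamp(min=1)                   # mean-pool each pillar
    return bev.view(H, W, C).permute(2, 0, 1)        # (C, H, W) BEV feature map

bev_feat = pillar_bev_features(torch.randn(10000, 4) * 20, torch.randn(10000, 64))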
In embodiments of the present application, the BEV features described above may be used to generate a BEV (bird's-eye-view) map.
In an embodiment of the present application, performing the second task regarding detection of the 3D object includes: inputting the BEV features into a multi-layer perceptron (MLP) to predict, by regression, first information of the 3D object, wherein the first information includes at least the projection point of the center point of the 3D object onto the BEV map, an offset, the height of the 3D object, the width of the 3D object, the length of the 3D object, the heading angle of the 3D object, and the velocity of the 3D object. The features included in the first information are advantageous for performing 3D object detection in BEV space. It should be appreciated that a multi-layer perceptron of a different type from that used for the point cloud semantic segmentation task may be used here.
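A hedged sketch of such a detection head follows (the channel layout of one center score, two offset channels, three size channels, one heading channel, and two velocity channels is an assumption chosen to cover the quantities listed above; a 1x1 convolution is used here as a per-BEV-cell multi-layer perceptron):

```python
# Hypothetical 3D detection head over BEV features; the channel layout is an
# assumption, not the patent's exact design.
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, bev_channels: int = 64):
        super().__init__()
        # A 1x1 convolution acts as a multi-layer perceptron applied at every BEV cell.
        self.head = nn.Sequential(
            nn.Conv2d(bev_channels, 64, kernel_size=1), nn.ReLU(),
            nn.Conv2d(64, 1 + 2 + 3 + 1 + 2, kernel_size=1),
        )

    def forward(self, bev: torch.Tensor) -> dict:
        # bev: (B, C, H, W) BEV feature map.
        out = self.head(bev)
        return {
            "center":   out[:, 0:1],   # projection of the 3D box center onto the BEV map
            "offset":   out[:, 1:3],   # sub-cell offset of that projection point
            "size":     out[:, 3:6],   # height, width, length of the 3D object
            "heading":  out[:, 6:7],   # heading angle
            "velocity": out[:, 7:9],   # velocity of the 3D object
        }

det_head = DetectionHead()
preds = det_head(torch.randn(1, 64, 200, 200))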
In an embodiment of the present application, performing the third task regarding tracking of the 3D object includes: inputting the BEV features into a multi-layer perceptron to predict, by regression, second information of the 3D object, wherein the second information includes at least parameters regarding the motion of the 3D object and an allocation matrix of the 3D object. It should be appreciated that the multi-layer perceptron used here for tracking the 3D object may be different from the multi-layer perceptron used for detecting the 3D object. For example, the 3D object tracking task also requires the BEV features and matches targets against the results of 3D object detection; the previously obtained BEV features may therefore be reused in order to reduce computation and resource occupation. The BEV features are then used to regress the motion of the target and the allocation matrix used to match targets, thereby completing the target tracking task. The parameters regarding the motion of the 3D object comprise a series of parameters recording the motion state of the target, for example the position, length, width, height, and heading angle at which the target appears at the next moment. The allocation matrix is a matching-degree matrix over all targets in adjacent frames; it reflects the pairwise similarity between targets and facilitates matching of objects across adjacent frames. Together, the parameters regarding the motion of the 3D object and the allocation matrix satisfy the requirements of 3D object tracking.
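One possible reading of this tracking head is sketched below (the use of per-target features pooled at detected centers, the 7-dimensional motion vector, and the similarity-softmax construction of the allocation matrix are assumptions, not the patent's exact design):

```python
# Hypothetical tracking head; one plausible interpretation of the description above.
import torch
import torch.nn as nn

class TrackingHead(nn.Module):
    def __init__(self, bev_channels: int = 64, embed_dim: int = 32):
        super().__init__()
        # Motion branch: next-frame position, size and heading of each target.
        self.motion = nn.Sequential(nn.Linear(bev_channels, 64), nn.ReLU(),
                                    nn.Linear(64, 7))  # x, y, z, l, w, h, heading
        # Embedding branch used to score matches between adjacent frames.
        self.embed = nn.Linear(bev_channels, embed_dim)

    def forward(self, feats_prev: torch.Tensor, feats_curr: torch.Tensor):
        # feats_prev: (M, C), feats_curr: (N, C) BEV features pooled at target centers.
        motion = self.motion(feats_curr)
        e_prev = nn.functional.normalize(self.embed(feats_prev), dim=-1)
        e_curr = nn.functional.normalize(self.embed(feats_curr), dim=-1)
        # Allocation matrix: similarity of every previous target to every current one.
        allocation = torch.softmax(e_prev @ e_curr.t(), dim=-1)  # (M, N)
        return motion, allocation

track_head = TrackingHead()
motion, allocation = track_head(torch.randn(6, 64), torch.randn(7, 64))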
In embodiments of the present application, one or more of the first task regarding point cloud semantic segmentation, the second task regarding detection of the 3D object, and the third task regarding tracking of the 3D object may be configured/selected to be performed via a human-machine interaction interface on the vehicle. Additionally or alternatively, while examples of performing a second task regarding detection of a 3D object and/or a third task regarding tracking of a 3D object are described herein, in embodiments of the present application one or more other tasks may also be performed based on the specified features, in particular the BEV features, without exceeding the scope of the present application.
As shown in fig. 1, the method for performing multi-tasking point cloud sensing includes: at 150, normalizing the loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object, and at 160, performing gradient back-propagation and gradient updates on the backbone network based on the normalized loss functions.
In an embodiment of the present application, normalizing the loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object comprises: respectively calculating a first loss function of the first task, a second loss function of the second task, or a third loss function of the third task; and normalizing the first, second, and third loss functions by predetermined weights. For example, the loss functions between the predicted values and the ground-truth values of the different tasks may be calculated, the loss functions of the three perception tasks may then be normalized according to the weights, and finally gradient back-propagation and gradient updates are performed on the network simultaneously. It should be appreciated that the above weights may be preconfigured parameters.
For example, the loss function in this application can be expressed as:
L = α1·L1 + α2·L2 + α3·L3,
where L1, L2, and L3 respectively denote the first loss function of the first task, the second loss function of the second task, and the third loss function of the third task, and α1, α2, α3 respectively denote the weights of the three loss functions. Specifically, the first loss function is exemplified by:
L1 = α11·L_(BEV center projection point & offset) + α12·L_(length, width, height) + α13·L_(heading angle) + α14·L_(velocity),
where L_(BEV center projection point & offset), L_(length, width, height), L_(heading angle), and L_(velocity) respectively denote the loss functions between the regression-predicted values and the ground-truth values of the position and offset of the target's center projection point on the BEV map and of the target's length, width, height, heading angle, and velocity, and α11, α12, α13, α14 denote the respective weights of these loss functions, which may be preconfigured. The second loss function is exemplified by:
L2 = α21·L_(motion) + α22·L_(allocation matrix),
where L_(motion) and L_(allocation matrix) respectively denote the loss functions between the regression-predicted values and the ground-truth values of the target's motion and allocation matrix, and α21, α22 denote the respective weights, which may be preconfigured. The third loss function is exemplified by:
L3 = α31·L_(category),
where L_(category) denotes the loss function between the predicted class and the ground-truth class of each point, and α31 denotes its weight, which may be preconfigured. It should be appreciated that the first, second, or third loss function may be normalized using a softmax function or the like in this application.
It should be appreciated that, in case step 140 involves only one of the second task regarding detection of the 3D object or the third task regarding tracking of the 3D object, only the second or the third loss function needs to be calculated. Additionally or alternatively, in case one or more of the first task regarding point cloud semantic segmentation, the second task regarding detection of the 3D object, and the third task regarding tracking of the 3D object is configured/selected to be performed via a human-machine interaction interface on the vehicle, only the corresponding loss functions may accordingly be calculated.
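For illustration, a minimal training-step sketch of the weighted loss combination and the single joint backward pass is given below (the weight values and the placeholder loss tensors are assumptions; in practice each loss would be computed from the corresponding head's predictions and ground truth):

```python
# Hypothetical sketch of the weighted loss normalization and joint backward pass;
# weights and the placeholder losses are assumptions.
import torch

def combined_loss(l_det: torch.Tensor, l_track: torch.Tensor, l_seg: torch.Tensor,
                  alphas=(1.0, 1.0, 1.0)):
    a1, a2, a3 = alphas                      # preconfigured per-task weights (placeholder values)
    return a1 * l_det + a2 * l_track + a3 * l_seg

# One training step: combine the per-task losses by weight, then back-propagate once.
l_det, l_track, l_seg = (torch.rand(1, requires_grad=True) for _ in range(3))
loss = combined_loss(l_det, l_track, l_seg, alphas=(0.5, 0.3, 0.2))
loss.backward()   # single backward pass; with real losses, gradients reach all branches
                  # and the shared backbone. An optimizer step (assumed) would then update
                  # the shared parameters.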
Fig. 2A-2D illustrate examples of further methods for performing multi-tasking point cloud sensing according to an embodiment of the present invention.
As shown in fig. 2A, the neural network 200 of the embodiment of the present application is composed of a backbone network 280, a semantic segmentation branch 210, an object detection branch 220, and an object tracking branch 230. The backbone network 280 is formed by multiple layers of convolutional neural networks; the network parameters of the backbone network 280 are shared by the semantic segmentation branch 210, the object detection branch 220, and the object tracking branch 230, and the backbone network 280 outputs shared point cloud features through a shared 3D encoder for use by the subsequent semantic segmentation branch 210, object detection branch 220, and object tracking branch 230. In embodiments of the present application, the semantic segmentation branch 210, the object detection branch 220, and the object tracking branch 230 may be parallel neural networks (e.g., each may include one or more multi-layer perceptrons). Gradient back-propagation and gradient updates may be performed on the backbone network 280 based on the normalized result of the loss functions of the semantic segmentation branch 210, the object detection branch 220, and the object tracking branch 230.
As shown in fig. 2B, the semantic segmentation branch 210 performs point-by-point feature interaction on the shared point cloud features, inputs them into a first multi-layer perceptron, and performs classification on each point using that multi-layer perceptron, i.e., predicts the class of each point.
As shown in fig. 2C, the object detection branch 220 generates BEV features based on the shared point cloud features and inputs the BEV features into a second multi-layer perceptron to predict, by regression, first information of the 3D object, wherein the first information includes at least the projection point of the center point of the 3D object onto the BEV map, an offset, the height of the 3D object, the width of the 3D object, the length of the 3D object, the heading angle of the 3D object, and the velocity of the 3D object.
As shown in fig. 2D, the object tracking branch 230 obtains BEV features based on the shared point cloud features (e.g., the BEV features generated by the object detection branch 220 may be reused, thereby saving resources) and inputs the BEV features into a third multi-layer perceptron to predict, by regression, second information of the 3D object, wherein the second information includes at least parameters regarding the motion of the 3D object and an allocation matrix of the 3D object.
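To tie the branches together, the sketch below shows the single-pass data flow in which one backbone forward produces the shared point cloud features and one BEV map is reused by both the detection and tracking branches (all modules are tiny stand-ins for the heads sketched earlier; shapes and channel counts are assumptions):

```python
# Hypothetical single-pass flow through the shared backbone and the three branches.
import torch
import torch.nn as nn

backbone   = nn.Linear(4, 64)                 # stand-in for the shared per-point backbone
seg_head   = nn.Linear(64, 5)                 # stand-in for the semantic segmentation branch
det_head   = nn.Conv2d(64, 9, kernel_size=1)  # stand-in for the 3D detection branch
track_head = nn.Linear(64, 7)                 # stand-in for the motion part of the tracking branch

points = torch.randn(10000, 4)                # one lidar sweep: x, y, z, reflectivity
shared = backbone(points)                     # single backbone forward pass, shared by all tasks
seg_logits = seg_head(shared)                 # task 1: per-point semantic segmentation

# Placeholder for the pillar-pooled BEV map built from `shared` (see the earlier sketch).
bev = torch.zeros(1, 64, 200, 200)
det_out = det_head(bev)                       # task 2: 3D detection on the BEV map

# Task 3 reuses the same BEV map: pool features at detected centers, then regress motion.
target_feats = bev[0, :, :8, 0].t()           # (8, 64) stand-in for pooled per-target features
motion = track_head(target_feats)             # task 3: 3D tracking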
Fig. 3 is a block diagram of an embodiment of a device 300 supporting performing multi-tasking point cloud sensing according to an embodiment of the present invention. It should be noted that fig. 3 is intended merely to provide a generalized illustration of various components, any or all of which may be utilized as appropriate. It may be noted that in some examples, the components illustrated by fig. 3 may be localized to a single physical device and/or distributed among various networked devices, which may be located at different physical locations on a vehicle or other entity, for example.
Device 300 is shown as including hardware elements that may be electrically coupled via bus 305 (or may be otherwise in communication as appropriate). The hardware elements may include processing unit(s) 310, which may include, but are not limited to, one or more general purpose processors, one or more special purpose processors, such as Digital Signal Processing (DSP) chips, graphics acceleration processors, application Specific Integrated Circuits (ASICs), etc., and/or other processing structures or devices.
The device 300 may also include one or more input devices 370, which may include devices related to user interfaces (e.g., touch screens, touch pads, microphones, keys, dials, switches, etc.) and/or devices related to navigation, autopilot, etc. Similarly, the one or more output devices 315 may relate to devices that interact with a user (e.g., via a display, light Emitting Diode (LED), speaker, etc.) and/or that are related to navigation, autopilot, etc.
The device 300 may also include a wireless communication interface 330, which may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset (such as Bluetooth® devices, WiFi devices, WiMax devices, WAN devices, and/or various cellular devices), and so forth. The wireless communication interface 330 may enable the device 300 to communicate with other devices. This may include various forms of communication of the previously described embodiments, and as such it may be capable of transmitting direct communications, broadcasting wireless signals, receiving direct and/or broadcast wireless signals, and so forth. Accordingly, the wireless communication interface 330 may be capable of transmitting and/or receiving RF signals over various RF channels/bands. Communication using the wireless communication interface 330 may be performed via one or more wireless communication antennas 332 that transmit and/or receive wireless signals 334.
The device 300 may further include sensor(s) 340. The sensors 340 may include, but are not limited to, one or more inertial sensors and/or other sensors (e.g., lidar, accelerometer, gyroscope, camera, magnetometer, altimeter, microphone, proximity sensor, light sensor, barometer, etc.). The sensors 340 may be used, for example, to determine certain real-time characteristics of the vehicle, such as position, speed, and acceleration.
The device 300 may further include a memory 360 and/or be in communication with the memory 360. Memory 360 may include, but is not limited to, local and/or network accessible storage, disk drives, arrays of drives, optical storage devices, solid state storage devices such as Random Access Memory (RAM) and/or Read Only Memory (ROM), which may be programmable, flash updateable, and the like. Such storage devices may be configured to enable any suitable data storage, including but not limited to various file systems, database structures, and/or the like.
Memory 360 of device 300 may also include software elements (not shown in fig. 3) including an operating system, device drivers, executable libraries, and/or other code (such as one or more application programs), which may include computer programs provided by the various embodiments, and/or may be designed to implement the methods as described herein, and/or configure the systems as described herein. Software applications stored in memory 360 and executed by processing unit(s) 310 may be used to implement the functionality of a vehicle, as described herein. Further, one or more of the procedures described with respect to the method(s) discussed herein may be implemented as code and/or instructions in memory 360 executable by device 300 (and/or processing unit(s) 310 or DSP 320 within device 300), including the functions illustrated in the method of fig. 4 described below. In an aspect, such code and/or instructions may be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.
Fig. 4 is an example of an exemplary functional apparatus 400 that performs multi-tasking point cloud sensing according to an embodiment. Alternative embodiments may vary the functionality described in the blocks illustrated in fig. 4 by combining, separating, or otherwise varying the functionality. The apparatus 400 of fig. 4 illustrates how the functionality of the vehicle described above (e.g., with respect to fig. 1) may be implemented according to an embodiment. As such, the means for performing the functionality of one or more of the blocks illustrated in fig. 4 may include hardware and/or software components of a vehicle (as previously mentioned in fig. 1) that may include one or more components of the apparatus 300 illustrated in fig. 3 and described above.
At block 410, the functionality includes inputting point cloud data associated with the 3D object acquired by the lidar into a backbone network of shared network parameters to perform point-based feature extraction and generate corresponding shared point cloud features. Preferably, the point cloud data associated with the 3D object may include first coordinate information, second coordinate information, third coordinate information, and reflectivity information of each point. Feature extraction module 410 for performing the functionality of block 410 may include one or more software and/or hardware components of a device, such as bus 305, processing unit(s) 310, memory 360, and/or other software and/or hardware components of device 300 illustrated in fig. 3.
At block 420, the functionality includes generating a specified feature based on the shared point cloud features. Preferably, the specified feature comprises a Bird's Eye View (BEV) feature. Preferably, the BEV features are generated by performing pillar-based feature extraction on the shared point cloud features using the point cloud object detection network PointPillars. The BEV feature generation module 420 for performing the functionality of block 420 may include one or more software and/or hardware components of a device, such as bus 305, processing unit(s) 310, memory 360, and/or other software and/or hardware components of device 300 illustrated in fig. 3.
At block 430, the functionality includes one or more of: performing a first task regarding point cloud semantic segmentation based on the shared point cloud features, and performing a second task regarding detection of the 3D object or a third task regarding tracking of the 3D object based on the BEV features. Preferably, performing the first task regarding point cloud semantic segmentation comprises: inputting the shared point cloud features into a multi-layer perceptron after point-by-point feature interaction; and performing classification on each point using the multi-layer perceptron to predict the class of each point. Preferably, performing the second task regarding detection of the 3D object comprises: inputting the BEV features into a multi-layer perceptron to predict, by regression, first information of the 3D object, wherein the first information includes at least the projection point of the center point of the 3D object onto the BEV map, an offset, the height of the 3D object, the width of the 3D object, the length of the 3D object, the heading angle of the 3D object, and the velocity of the 3D object, and wherein the BEV map is generated based on the BEV features. Preferably, performing the third task regarding tracking of the 3D object comprises: inputting the BEV features into a multi-layer perceptron to predict, by regression, second information of the 3D object, wherein the second information includes at least parameters regarding the motion of the 3D object and an allocation matrix of the 3D object. Preferably, one or more of the first task regarding point cloud semantic segmentation, the second task regarding detection of the 3D object, and the third task regarding tracking of the 3D object may be configured/selected to be performed via a human-machine interaction interface on the vehicle. The task execution module 430 for performing the functionality of block 430 may include one or more software and/or hardware components of a device, such as bus 305, processing unit(s) 310, memory 360, and/or other software and/or hardware components of device 300 illustrated in fig. 3.
At block 440, the functionality includes normalizing the loss functions of one or more of the first task, the second task, or the third task. Preferably, the first loss function of the first task, the second loss function of the second task, or the third loss function of the third task may be calculated, respectively, and the first, second, and third loss functions may be normalized by predetermined weights. Preferably, only the corresponding loss functions may be calculated and normalized based on the selection or configuration. The loss function calculation module 440 for performing the functionality of block 440 may include one or more software and/or hardware components of the device, such as bus 305, processing unit(s) 310, memory 360, and/or other software and/or hardware components of device 300 illustrated in fig. 3.
At block 450, the functionality includes performing gradient back propagation and gradient updating on the backbone network based on the normalized loss function. The inverse update module 450 for performing the functionality of block 450 may include one or more software and/or hardware components of a device, such as bus 305, processing unit(s) 310, memory 360, and/or other software and/or hardware components of device 300 illustrated in fig. 3.
Moreover, embodiments of the present application also disclose a computer-readable storage medium comprising computer-executable instructions stored thereon, which when executed by a processor, cause the processor to perform the methods of the embodiments herein.
Further, embodiments of the present application disclose an apparatus comprising a processor and a memory storing computer-executable instructions that, when executed by the processor, cause the processor to perform the methods of the embodiments herein.
In addition, the embodiments of the present application also disclose an apparatus for performing multi-tasking point cloud sensing, the apparatus comprising means for implementing the methods of the embodiments herein. In one aspect, the apparatus comprises: means for inputting point cloud data associated with the 3D object acquired by the lidar into a backbone network with shared network parameters to perform point-based feature extraction and generate corresponding shared point cloud features; means for performing a first task regarding point cloud semantic segmentation based on the shared point cloud features; means for generating a specified feature based on the shared point cloud features; means for performing one or more tasks regarding the 3D object based on the specified feature; means for normalizing the loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object; means for performing gradient back-propagation and gradient updates on the backbone network based on the normalized loss functions; means for generating a Bird's Eye View (BEV) feature based on the shared point cloud features; means for performing a second task regarding detection of the 3D object, a third task regarding tracking of the 3D object, or both, based on the BEV features; and so forth.
The method, apparatus, and medium for performing multi-tasking point cloud sensing according to the present invention have been described above, and have at least the following advantages over the prior art:
(1) The network structure is simple, resource occupation is low, and multiplexing efficiency is high; the results of three perception tasks can be obtained with a single inference pass, and the perception tasks are interconnected and mutually reinforcing;
(2) Based on the shared-backbone strategy, each point cloud perception task is trained and predicted relatively independently while remaining connected, meeting the resource-consumption and real-time requirements of practical embedded systems;
(3) The lidar-point-cloud-based multi-task perception method is applicable to a wide range of scenarios and is not affected by weather or illumination; and
(4) The algorithm is concise and clear, and improves both the 3D point cloud perception capability and the redundancy of the perception system.
Reference throughout this specification to "an embodiment" means that a particular described feature, structure, or characteristic is included in at least one embodiment. Thus, the use of such phrases may not merely refer to one embodiment. Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
The various steps and modules of the methods and apparatus described above may be implemented in hardware, software, or a combination thereof. If implemented in hardware, the various illustrative steps, modules, and circuits described in connection with this disclosure may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic component, a hardware component, or any combination thereof. A general purpose processor may be a processor, microprocessor, controller, microcontroller, state machine, or the like. If implemented in software, the various illustrative steps, modules, described in connection with this disclosure may be stored on a computer readable medium or transmitted as one or more instructions or code. Software modules implementing various operations of the present disclosure may reside in storage media such as RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, removable disk, CD-ROM, cloud storage, etc. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium, as well as execute corresponding program modules to implement the various steps of the present disclosure. Moreover, software-based embodiments may be uploaded, downloaded, or accessed remotely via suitable communication means. Such suitable communication means include, for example, the internet, world wide web, intranet, software applications, cable (including fiber optic cable), magnetic communications, electromagnetic communications (including RF microwave and infrared communications), electronic communications, or other such communication means.
The numerical values given in the embodiments are only examples and are not intended to limit the scope of the present invention. Furthermore, as an overall solution, other components or steps not listed in the claims or the specification of the present invention are not excluded. Moreover, the singular reference to a component does not exclude the plural of such a component.
It is also noted that the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. Additionally, the order of the operations may be rearranged.
The disclosed methods, apparatus, and systems should not be limited in any way. Rather, the present disclosure encompasses all novel and non-obvious features and aspects of the various disclosed embodiments (both alone and in various combinations and subcombinations with one another). The disclosed methods, apparatus and systems are not limited to any specific aspect or feature or combination thereof, nor do any of the disclosed embodiments require that any one or more specific advantages be present or that certain or all technical problems be solved.
The present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the invention and the scope of the appended claims, which are all within the scope of the invention.
One skilled in the relevant art will recognize that the embodiments may be practiced without one or more of the specific details, or with other methods, resources, materials, etc. In other instances, well-known structures, resources, or operations have not been shown or described in detail merely to avoid obscuring aspects of the embodiments.
While embodiments and applications have been illustrated and described, it is to be understood that the embodiments are not limited to the precise configuration and resources described above. Various modifications, substitutions, and improvements apparent to those skilled in the art may be made in the arrangement, operation, and details of the methods and systems disclosed herein without departing from the scope of the claimed embodiments.
The terms "and," "or," and/or "as used herein may include various meanings that are also expected to depend at least in part on the context in which such terms are used. Generally, or, if used in connection with a list, such as A, B or C, is intended to mean A, B and C (inclusive meaning as used herein) and A, B or C (exclusive meaning as used herein). Furthermore, the terms "one or more" as used herein may be used to describe any feature, structure, or characteristic in the singular or may be used to describe a plurality of features, structures, or characteristics or some other combination thereof. However, it should be noted that this is merely an illustrative example and claimed subject matter is not limited to this example.
While there has been illustrated and described what are presently considered to be example features, it will be understood by those skilled in the art that various other modifications may be made, and equivalents may be substituted, without departing from claimed subject matter. Additionally, many modifications may be made to adapt a particular situation to the teachings of the claimed subject matter without departing from the central concept described herein.

Claims (11)

1. A method for performing multi-tasking point cloud sensing, comprising:
inputting point cloud data associated with a 3D object acquired by a lidar into a backbone network with shared network parameters to perform point-based feature extraction and generate corresponding shared point cloud features;
performing a first task regarding point cloud semantic segmentation based on the shared point cloud features;
generating a specified feature based on the shared point cloud features;
performing one or more tasks regarding the 3D object based on the specified feature;
normalizing loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object; and
performing gradient back-propagation and gradient updating on the backbone network based on the normalized loss functions.
2. The method of claim 1, wherein the point cloud data associated with a 3D object includes first coordinate information, second coordinate information, third coordinate information, and reflectivity information for each point.
3. The method of claim 1, wherein,
generating the specified feature based on the shared point cloud feature includes generating a Bird's Eye View (BEV) feature based on the shared point cloud feature, and
wherein performing one or more tasks regarding the 3D object based on the specified feature includes: performing a second task regarding detection of the 3D object, a third task regarding tracking of the 3D object, or both, based on the BEV features.
4. The method of claim 3, wherein generating the BEV features comprises
performing pillar-based feature extraction on the shared point cloud features to generate the BEV features.
5. The method of claim 4, wherein performing a second task regarding detection of the 3D object comprises:
inputting the BEV features into a multi-layer perceptron to predict, by regression, first information of the 3D object, and
wherein the first information includes at least a projection point of a center point of the 3D object onto a BEV map, an offset, a height of the 3D object, a width of the 3D object, a length of the 3D object, a heading angle of the 3D object, and a velocity of the 3D object;
and wherein the BEV map is generated based on the BEV features.
6. The method of claim 4, wherein performing the third task with respect to tracking of the 3D object comprises:
inputting the BEV features into a multi-layer perceptron to predict, by regression, second information of the 3D object, and
wherein the second information comprises at least parameters regarding the motion of the 3D object and an allocation matrix of the 3D object.
7. The method of claim 4, wherein the pillar-based feature extraction is performed using the point cloud object detection network PointPillars.
8. The method of claim 1, wherein performing the first task with respect to point cloud semantic segmentation comprises:
inputting the shared point cloud features into a multi-layer perceptron after point-by-point feature interaction is carried out; and
performing classification on each point using the multi-layer perceptron to predict the class of each point.
9. The method of claim 3, wherein normalizing the loss functions of the first task regarding point cloud semantic segmentation and of the one or more tasks regarding the 3D object comprises:
respectively calculating a first loss function of the first task, a second loss function of the second task or a third loss function of the third task;
normalizing the first, second, and third loss functions by predetermined weights.
10. An apparatus for performing multi-tasking point cloud sensing, comprising a processor and a memory, the memory storing program instructions; the processor executes the program instructions to implement the method for performing multi-tasking point cloud sensing according to any one of claims 1 to 9.
11. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the method for performing multi-tasking point cloud sensing according to any one of claims 1 to 9.
Priority Applications (1)

Application number: CN202310981387.2A
Priority date / Filing date: 2023-08-04
Title: Method, apparatus and medium for performing multi-tasking point cloud awareness
Status: Pending

Publications (1)

Publication number: CN117373012A
Publication date: 2024-01-09

Family ID: 89402949



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination