CN117152696A - Method for generating input data for machine learning model - Google Patents

Method for generating input data for machine learning model

Info

Publication number
CN117152696A
Authority
CN
China
Prior art keywords
sensor
target sensor
point cloud
points
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310644690.3A
Other languages
Chinese (zh)
Inventor
T·纽恩贝格
F·法翁
T·米查尔克
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Robert Bosch GmbH
Original Assignee
Robert Bosch GmbH
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Robert Bosch GmbH filed Critical Robert Bosch GmbH
Publication of CN117152696A publication Critical patent/CN117152696A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/50 Context or environment of the image
    • G06V20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/588 Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T19/00 Manipulating 3D models or images for computer graphics
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/0475 Generative networks
    • G06N3/08 Learning methods
    • G06N3/09 Supervised learning
    • G06N3/094 Adversarial learning
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/10 Geometric effects
    • G06T15/20 Perspective computation
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds
    • G06V10/34 Smoothing or thinning of the pattern; Morphological operations; Skeletonisation
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V20/582 Recognition of traffic signs
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20036 Morphological image processing
    • G06T2207/20081 Training; Learning
    • G06T2207/20228 Disparity calculation for image-based rendering
    • G06T2210/00 Indexing scheme for image generation or computer graphics
    • G06T2210/56 Particle system, point based geometry or rendering

Abstract

According to various embodiments, a method for generating input data for a machine learning model is described, the method having: determining, for at least one sensor, a point cloud having points of surfaces detected by the sensor in the surroundings of the sensor; generating a preliminary target sensor point cloud for the target sensor by converting, for the at least one sensor, the points of the determined point cloud into points from the perspective of the target sensor in accordance with the relative position of the target sensor with respect to the at least one sensor; generating a target sensor point cloud for the target sensor by means of the preliminary target sensor point cloud, wherein points that would not be detected by the target sensor because of one or more surfaces for which points are present in the preliminary target sensor point cloud are removed from the target sensor point cloud; and using the target sensor point cloud as input for the machine learning model.

Description

Method for generating input data for machine learning model
Technical Field
The present disclosure relates to a method for generating input data for a machine learning model.
Background
As the complexity of perception tasks increases, increasingly complex machine learning models, such as neural networks with complex architectures, are used, and training these increasingly complex models requires ever larger amounts of annotated training data. These training data must also be highly diverse and cover as many situations as possible in order to give the system the best possible ability to generalize to unknown data and to avoid overfitting. For this purpose, among other things, large measurement campaigns are planned and carried out, in which a measurement vehicle is used to record large amounts of data in many different situations and places. The training input data recorded in this way are then (manually) annotated with the associated desired output data of the perception task, i.e. with training output data, the so-called ground-truth data. Creating such a data set accordingly involves considerable time and cost.
Accordingly, it is desirable to be able to train multiple machine learning models using such data sets.
Disclosure of Invention
According to various embodiments, there is provided a method for generating input data for a machine learning model, the method having: determining, for at least one sensor, a point cloud having points of surfaces detected by the sensor in the surroundings of the sensor; generating a preliminary target sensor point cloud for the target sensor by converting, for the at least one sensor, the points of the determined point cloud into points from the perspective of the target sensor in accordance with the relative position of the target sensor with respect to the at least one sensor; generating a target sensor point cloud for the target sensor by means of the preliminary target sensor point cloud, wherein points that would not be detected by the target sensor because of one or more surfaces for which points are present in the preliminary target sensor point cloud are removed from the target sensor point cloud; and using the target sensor point cloud as input for the machine learning model.
The above method enables training of, and inference with, the machine learning model when the machine learning model is to be trained on, or applied to, sensor data from the perspective of the target sensor, but only sensor data from the perspective of at least one other sensor is available.
It should be noted that the following use case is also possible: the target sensor point cloud is generated from one or more point clouds other than the preliminary target sensor point cloud. For example, the preliminary target sensor point cloud is generated from a lidar (LiDAR) point cloud, the surfaces are determined from the preliminary target sensor point cloud, and the target sensor point cloud is then generated from one or more radar point clouds, taking these surfaces into account (i.e., points that cannot be detected by the (radar) target sensor because of the determined surfaces are not included or are removed). However, the target sensor point cloud may also be generated simply from the preliminary target sensor point cloud by removing occluded surface points and, where necessary, supplementing surface points.
Various embodiments are described below.
Embodiment 1 is a method for generating input data for a machine learning model, as described above.
Embodiment 2 is the method of embodiment 1, wherein, when generating the target sensor point cloud, the preliminary target sensor point cloud is supplemented with points of surfaces for which surface points are already contained in the preliminary target sensor point cloud and which would be detectable by the target sensor given the presence of the corresponding surface.
In this way, more realistic input data are generated from the perspective of the target sensor, since these input data contain the points that would actually be detected by the target sensor.
Embodiment 3 is the method of embodiment 1 or 2, having: generating the target sensor point cloud by: generating a depth image representing the positions of the points of the preliminary target sensor point cloud from the perspective of the target sensor; subjecting the depth image to a morphological opening operation; and generating the target sensor point cloud from the depth image subjected to the morphological opening operation.
In this way, it can be determined efficiently which points are visible to the target sensor: the morphological opening corrects the depth information for occluded points in the depth image so that only points that can actually be detected by the target sensor are represented in the depth image.
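By way of illustration only (this sketch is not part of the disclosed embodiments), the morphological opening of embodiment 3 could look as follows; the use of NumPy/OpenCV, the function name and the kernel size are assumptions made for the sketch.

```python
# Minimal sketch of embodiment 3: morphological opening of a depth image
# (assumptions: NumPy/OpenCV, empty pixels hold a large standard value D).
import cv2
import numpy as np

def open_depth_image(depth: np.ndarray, kernel_size: int = 3) -> np.ndarray:
    """Apply a morphological opening (erosion followed by dilation).

    Erosion (a local minimum filter) pulls isolated far/empty pixels towards the
    distance of the surrounding nearer surface; the subsequent dilation (a local
    maximum filter) restores the extent of the remaining structures.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    return cv2.morphologyEx(depth.astype(np.float32), cv2.MORPH_OPEN, kernel)
```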
Embodiment 4 is the method of embodiment 1 or 2, having: generating the target sensor point cloud by: generating a parallax image (disparity image) representing the positions of the points of the preliminary target sensor point cloud from the perspective of the target sensor; subjecting the parallax image to a morphological closing operation; and generating the target sensor point cloud from the parallax image subjected to the morphological closing operation.
As with the depth image, using a parallax image in this way makes it possible to determine efficiently which points are visible to the target sensor.
Embodiment 5 is the method of any one of embodiments 1 to 4, having: determining, for each sensor of a plurality of sensors, a respective point cloud having points of surfaces detected by that sensor in its surroundings; generating the preliminary target sensor point cloud by converting, for each sensor of the plurality of sensors, the points of the respective determined point cloud into points from the perspective of the target sensor in accordance with the relative position of the target sensor with respect to that sensor; and combining the converted points into the preliminary target sensor point cloud.
In this way, a more complete picture can be obtained for the target sensor, because surface points that cannot be detected by individual sensors of the plurality of sensors may be detectable (e.g., visible) by the target sensor.
Embodiment 6 is the method of any one of embodiments 1 to 5, having: generating a respective target sensor point cloud for each target sensor of a target sensor array having a plurality of target sensors; and using the generated target sensor point clouds as input for the machine learning model.
Thus, with the above-described method, sensor data of a plurality of target sensors may also be generated (each of which is generated as described above, i.e., the method is performed, for example, a plurality of times, and the results are combined). In other words, instead of simulating sensor data for a single sensor, sensor data for an entire sensor array are simulated.
Embodiment 7 is the method of any one of embodiments 1 to 5, having: the machine learning model is trained by means of the target sensor point cloud to process sensor data from the perspective of the target sensor.
The target sensor point cloud may be used, in particular, to train the machine learning model. In this way, the machine learning model can be trained for the target sensor even if no sensor data have been recorded from the perspective of the target sensor.
Embodiment 8 is the method of embodiment 7, having: acquiring ground truth information of points of the point cloud determined by the at least one sensor; converting the ground truth information into ground truth information of points of the target sensor point cloud; and training the machine learning model with supervised learning by means of the target sensor point cloud as training input data and the converted ground truth information.
In this way, the machine learning model can be trained in a supervised manner, by means of ground truth information that was generated for or is available for one or more original point clouds, to process sensor data from the perspective of the target sensor.
Embodiment 9 is the method of embodiment 7 or 8, having: generating respective target sensor point clouds for a target sensor array having a plurality of target sensors; and training the machine learning model by means of the generated target sensor point cloud.
In this way, a machine learning model may be trained to process sensor data for the sensor array.
Thus, with the above-described method, sensor data of a plurality of target sensors may also be generated (wherein each is generated as described above, i.e., the method is performed, for example, a plurality of times, and the results are combined), and these sensor data may then be used to train a machine learning model. In other words, instead of simulating sensor data for a single sensor, sensor data for an entire sensor array are simulated.
Embodiment 10 is a sensor data processing system that is set up to perform the method according to any one of embodiments 1 to 9.
Embodiment 11 is a computer program having instructions that, when executed by a processor, cause: the processor performs the method according to any one of embodiments 1 to 9.
Embodiment 12 is a computer-readable medium storing instructions that, when executed by a processor, cause: the processor performs the method according to any one of embodiments 1 to 9.
Drawings
In the drawings, like reference numerals generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the application. In the following description, various aspects are described with reference to the following drawings.
Fig. 1 shows a vehicle.
FIG. 2 illustrates an example of training a sensor array and positioning of a target sensor on a vehicle.
Fig. 3A and 3B illustrate simulations of a target sensor training dataset from an original training dataset.
FIG. 4 illustrates a flow diagram presenting a method for generating input data for a machine learning model, in accordance with an embodiment.
Detailed Description
The following detailed description refers to the accompanying drawings that illustrate, for purposes of explanation, specific details and aspects of the present disclosure in which the application may be practiced. Other aspects may be utilized and structural, logical, and electrical changes may be made without departing from the scope of the present application. The different aspects of the disclosure are not necessarily mutually exclusive, as some aspects of the disclosure may be combined with one or more other aspects of the disclosure in order to form new aspects.
Various examples are described in more detail below.
In the case of machine learning, a function that maps input data to output data is learned. In the case of supervised learning (e.g., training a neural network or some other model), the function is determined from an input data set (also referred to as a training data set) for which a desired output (e.g., a desired classification of the input data) is predefined for each input, such that the function reproduces this assignment from input to output as well as possible.
Examples of applications of such machine-learned functions are object detection (possibly including classification) in a digital image, or semantic segmentation of a digital image, for example for autonomous driving, as illustrated in Fig. 1.
Fig. 1 shows a (e.g., autonomous) vehicle 101.
It should be noted that, in the following, an image or image data is understood very generally as a collection of data representing one or more objects or patterns. The image data may be provided by sensors that measure visible or invisible light, such as infrared or ultraviolet light, ultrasound or radar waves, or other electromagnetic or acoustic signals.
In the example of Fig. 1, a vehicle 101, for example a passenger car (PKW) or a truck (LKW), is equipped with a vehicle control device 102.
The vehicle control device 102 has data processing components such as a processor (e.g., CPU (central processing unit)) 103 and a memory 104 for storing control software according to which the vehicle control device 102 operates and data processed by the processor 103.
For example, the stored control software (computer program) has instructions which, when executed by the processor 103, cause the processor 103 to implement a machine learning (ML) model 107.
The data stored in the memory 104 may include, for example, sensor data detected by one or more sensors 105. The one or more sensors 105 may scan the surroundings of the vehicle 101, for example by means of a laser radar (LiDAR).
Based on the sensor data, the vehicle control device 102 may determine whether and which objects are present in the surroundings of the vehicle 101, for example stationary objects such as traffic signs or road markings, or movable objects such as pedestrians, animals and other vehicles.
Then, the vehicle 101 may be controlled by the vehicle control device 102 in accordance with the result of the object determination. In this way, the vehicle control device 102 may control, for example, the actuator 106 (e.g., a brake) to control the speed of the vehicle, for example, to brake the vehicle. Thus, the vehicle control device 102 may take on the tasks of a driver assistance system (advanced driving assistance system (Advanced Driver Assistance System), ADAS) and/or autonomous driving (Autonomous Driving, AD).
For such tasks, an accurate perception and representation of the surroundings of the vehicle is usually required. Typical problems are, for example, object detection, semantic segmentation or the determination of an occupancy grid from the sensor data of one or more sensors (lidar, radar, video, etc.) 105. In particular, data-driven ML methods such as deep learning have recently made great progress in these fields. Such methods are characterized in that the parameters of the perception system, i.e. of the ML model for perception tasks such as object detection or semantic segmentation, can be trained using supervised learning, i.e. on the basis of an annotated data set consisting of input data and the associated desired output data, the so-called ground truth.
A major challenge in designing and training ML models is the desired generalization ability of the ML model, i.e., in the case of perception tasks, the ability of a trained perception system to provide correct results even for input data or situations that differ from the training input data or situations of the training data set. On the one hand, these differences may be fundamental: for example, a video object detector trained only with red vehicles will typically be unable to recognize blue vehicles. On the other hand, subtler differences, such as a slight rotation or a reduced resolution of the sensor providing the training data relative to the sensor providing the input data in later operation (i.e., at inference time), can also cause the quality of the results of the ML model to degrade significantly.
To improve this generalization ability and to increase the amount of training data cost-effectively, data augmentation methods can be used, with which further training data are derived from the existing data. Known methods are, for example, geometric transformations or distortions (Verzerrung) of the sensor data, or the addition of disturbances, in order to generate additional training input data for the ML model. Nevertheless, the generalization ability of ML-based perception systems remains a challenge, because not even augmentation can model nearly all conceivable variations of the input data.
Because the creation of training data sets is so labor- and cost-intensive, it is desirable to use the available training data sets in as many applications and projects as possible. At the same time, however, optimal training of the perception system requires annotated data whose characteristics and distribution are as close as possible to the input data encountered during actual use of the system (i.e., at inference time).
The broad applicability of training data sets is thus limited, since different test vehicles and sensor configurations are often used in different projects and applications. In particular, the data of the different test vehicles/projects may differ from each other, or from the data of the training data set, in the following points:
different mounting positions of the sensors (extrinsic calibration);
different intrinsic sensor characteristics despite structurally identical sensor models (intrinsic calibration);
a different number of sensors;
different sensor models, each with a different:
o range,
o resolution,
o noise behavior, etc.
Even if a vehicle similar to the one used in the measurement campaign, or even the same vehicle, is used, the data thus obtained may differ in some of the points mentioned from annotated training data from measurement campaigns carried out in the past. Furthermore, the characteristics and distribution of the data in the application may also change over time, for example because the final number of sensors, the specific sensor model or the exact mounting positions of these sensors are not fixed from the beginning. It is also possible that the sensor envisaged for the application is not available at all at the start of the project and only becomes available in the course of the project.
One method for domain adaptation of training data is the use of generative adversarial networks (GANs), with which a mapping of data from the domain of the training data set to the target domain of the perception system is learned. To learn this mapping, however, a further annotated data set from the target domain is required in addition to the original training data set, although this additional data set may be smaller in scope. Moreover, the extended network structure of the perception system required for this significantly increases the complexity of training, and annotated data from the target domain still has to be available, which leads to additional costs, especially in the case of variable sensor configurations.
Therefore, according to various embodiments, a method is provided that makes it possible to train an ML model (in particular for a perception task, i.e., an ML-based perception system) specifically for a particular sensor (or a sensor array with a plurality of sensors), where the training data set used for this training does not have to have been recorded by the same sensor (or the same sensor array) as the one used in the actual operation of the perception system.
In the following, for simplicity, reference is made to a target sensor, even though it may be replaced by an array of target sensors; this target sensor provides the input data that the ML model should be able to process. Conversely, a target sensor array is also understood to include the possibility that the array comprises only one (target) sensor. The one or more sensors used to record the sensor data (e.g., during a measurement campaign) that are then used for training are referred to as training sensors.
In order to use the sensor data (e.g., raw sensor data) provided by the training sensors to train the ML model to process the data of the target sensor in actual operation, these sensor data and, where relevant for the respective application, the associated annotations of the training data set are transformed for this training so that their characteristics and distribution approximate the input data expected in actual operation as closely as possible. The perception system is thus trained with simulated data that correspond better to the data of the target application. As a result, the input data in actual operation differ less from the training data, and the perception system does not need to generalize as much.
With this approach, available data sets can more easily be used to train the perception system in applications with other sensor (array) configurations. In this way, the considerable effort of creating new training data sets can be avoided, especially when the sensor (array) configuration may still change significantly, while at the same time optimal training of the perception system is achieved.
Similarly, with the described method, the quality of the results of the perception system in applications where no specific training data set is available can be significantly improved, for example because the corresponding sensing means are not yet available at all or because creating the training data set is uneconomical.
In the following, as an embodiment, the training of a deep-learning object detector (as an example of the ML model 107) using a lidar sensor array (as an example of the sensors 105) is described.
Fig. 2 shows an example of an array of training sensors 201 (e.g., on the roof of a vehicle) and positioning of target sensors 202 on a vehicle 200.
That is, in this embodiment, sensor data (with associated annotations) that is or has been provided by lidar sensor array 201 (i.e., a training sensor or training sensor array) should be used to train a target sensor-specific deep learning object detector, where target sensor 202 in this example differs from the training sensor not only in terms of mounting location but also in terms of angular resolution and field of view.
The annotated sensor data provided by the training sensors are referred to as the raw or initial training data set. The target sensor training data set for the ML model is generated from the raw training data set such that the ML model is able to perform the corresponding task (object detection in this example) when it receives sensor data provided by the target sensor 202 as input data; in this embodiment, the perception system is thus trained specifically for the sensor data of the target sensor 202 in the front bumper.
Thus, in this example, the raw training data set consists of 3D point clouds (one for each training sensor 201) and a list of associated ground-truth object detections, where each detection contains various object properties (such as a classification), but at least an object position.
Fig. 3A and 3B illustrate simulations of a target sensor training dataset from an original training dataset.
For simplicity of illustration, only two-dimensional cross sections through the respective point clouds are considered here, i.e. only a single elevation angle.
First, the respective 3D point clouds of the training sensors 201 are converted into the coordinate system of the target sensor, i.e., into a coordinate system whose origin and orientation correspond to the (assumed) mounting position of the target sensor, using the known mounting positions of the respective sensors 201, 202 and, if necessary, taking into account the different measurement times and the measured ego-motion of the vehicle. The result is a preliminary version of the target sensor training data set. In this way, the points of the 3D point clouds of the training sensors are converted into the perspective of the target sensor.
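As an illustration only (not part of the original disclosure), this rigid transformation into the target sensor frame can be sketched as follows; the assumption that the extrinsic calibrations are available as 4x4 homogeneous matrices and the variable names are made up for the sketch.

```python
# Minimal sketch: transforming a training sensor point cloud into the frame of
# the (assumed) target sensor mounting position (assumption: NumPy, 4x4
# homogeneous transformation matrices from the extrinsic calibration).
import numpy as np

def to_target_frame(points_train: np.ndarray,
                    T_target_from_train: np.ndarray) -> np.ndarray:
    """points_train: (N, 3) points in the training sensor frame."""
    pts_h = np.hstack([points_train, np.ones((points_train.shape[0], 1))])  # (N, 4)
    return (T_target_from_train @ pts_h.T).T[:, :3]
```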
In the present example, it is assumed that the training sensors (here two training sensors 306 are exemplified) have detected points 301, 302 on the surfaces 303, 304 of the two objects.
In the illustration of fig. 3A, these points have been converted into the coordinate system of the target sensor 305.
During this conversion, all 3D points of the point clouds that lie outside the field of view 307 of the target sensor are discarded directly (i.e., not included in, or removed from, the target sensor training data set). Initially, all remaining 3D points are included in the target sensor training data set. For this preliminary version of the target sensor training data set, a 2D depth image is then created (here only a single row of the depth image, because of the two-dimensional cross section through the point cloud), in which the horizontal axis corresponds to azimuth and the vertical axis to elevation. If a point of the point cloud of one of the training sensors 306 (and thus also of the current version of the target sensor training data set) lies at the corresponding (angular) position (from the perspective of the target sensor 305, i.e., in the coordinate system of the target sensor 305), the pixel value describes its distance to the origin of coordinates (i.e., e.g., the position of the target sensor 305). If no point of the point cloud of one of the training sensors 306 lies at the corresponding (angular) position, the pixel value has a predefined standard value (e.g., a value representing infinity).
The discretization of the axes of the depth image is chosen according to the desired angular resolution of the target sensor. The resolution is illustrated here by the lines 308 radiating from the target sensor position: each sector of the field of view 307 delimited by two such lines 308 corresponds to one pixel value (here in the row of the depth image for the elevation angle under consideration; the same applies analogously to other elevation angles).
If more than one 3D point of the point clouds of the training sensors falls within the same discrete pixel of the depth image (i.e., within the same sector for a particular elevation angle), all points except the one with the smallest distance to the origin of coordinates are discarded. In this example, these are the points 302. That is, the depth image initially contains pixel values for the points 301 (pixel values that differ from the standard value).
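By way of illustration only, filling such a depth image from the preliminary target sensor point cloud can be sketched as follows; NumPy, equidistant angular bins, the field-of-view parameters and the standard value D are assumptions of the sketch.

```python
# Minimal sketch: spherical projection of the preliminary target sensor point
# cloud into a depth image, keeping the smallest distance per pixel
# (assumptions: NumPy, symmetric field of view, "no measurement" value D).
import numpy as np

def build_depth_image(points, az_bins, el_bins, fov_az, fov_el, D=1e6):
    """points: (N, 3) array in the target sensor frame; returns (el_bins, az_bins)."""
    rng = np.linalg.norm(points, axis=1)
    az = np.arctan2(points[:, 1], points[:, 0])
    el = np.arcsin(points[:, 2] / np.maximum(rng, 1e-9))

    # keep only points inside the target sensor's field of view
    keep = (np.abs(az) <= fov_az / 2) & (np.abs(el) <= fov_el / 2)
    az, el, rng = az[keep], el[keep], rng[keep]

    u = np.clip(((az + fov_az / 2) / fov_az * az_bins).astype(int), 0, az_bins - 1)
    v = np.clip(((el + fov_el / 2) / fov_el * el_bins).astype(int), 0, el_bins - 1)

    depth = np.full((el_bins, az_bins), D, dtype=np.float32)
    # if several points fall into the same pixel, keep the smallest distance
    np.minimum.at(depth, (v, u), rng.astype(np.float32))
    return depth
```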
Especially when a training sensor has a low angular resolution compared to the target sensor, the depth image filled in this way still contains a number of pixels with no distance value entered. For these pixels, the standard value, for example an arbitrarily large distance value denoted below by D, is entered accordingly. In some cases, however, this is not appropriate: for example, in sector 310, in which neither of the points 301, 302 lies, there should actually be a depth value corresponding to the surface 303, which is represented by points 301 elsewhere, even though the point cloud from the training sensors contains no point 301 within sector 310. Furthermore, because of possible parallax due to the different mounting positions of the training sensor and the target sensor, a 3D point may be selected that is actually occluded by other objects from the perspective of the target sensor: in the example shown, in sectors 311 and 312 the depth value of a point 302 on the rear surface 304 would be entered, even though this point is actually occluded by the front surface 303.
In order to correct both cases of erroneous pixel values in the depth image (a standard value although a surface is present; or the distance value of a point that is actually occluded), the depth image is subjected to a morphological opening, i.e., the morphological operations erosion and dilation are applied in sequence using a structuring element.
The result is shown in Fig. 3B: pixel values for points 312 on the front surface 303 are supplemented, so that in particular the pixel values of the points 309 occluded by this surface are overwritten.
The size of the structuring element is chosen according to the angular resolution of the target sensor 305 or the resolution of the training sensors 306.
It should be noted that the points 312 newly generated by interpolation in this way may be included in the target sensor training data set, or they may be used only to remove the points 309 occluded by them from the target sensor training data set (or to prevent such points from being generated for the target sensor training data set). In the second case, for example, all 3D points whose distance (i.e., the pixel value of their associated pixel) is significantly reduced by the morphological opening (e.g., a negative change greater than a threshold value) are discarded.
A final target sensor training data set can be generated from the depth image as follows: for each occupied pixel of the depth image, a 3D point of the target sensor training data set is calculated from the azimuth, elevation and distance of that pixel, and the entire point cloud of the target sensor is constructed in this way. Alternatively, for pixels that have an associated, non-discarded 3D point from the preliminary version of the target sensor training data set, that 3D point can be adopted directly. For pixels that still contain the distance value D after the morphological opening, no 3D point is calculated, since no surface was detected there by the training sensors.
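For illustration only, the back-projection of the opened depth image into the final target sensor point cloud and the occlusion test mentioned above can be sketched as follows; the threshold for a "significantly reduced" distance and the value D are assumptions.

```python
# Minimal sketch: deriving the target sensor point cloud from the
# morphologically opened depth image (assumptions: NumPy, bin centres used as
# ray directions, "no measurement" value D).
import numpy as np

def depth_image_to_points(depth_opened, fov_az, fov_el, D=1e6):
    """Back-project every occupied pixel (distance < D) into a 3D point."""
    el_bins, az_bins = depth_opened.shape
    v, u = np.nonzero(depth_opened < D)
    rng = depth_opened[v, u]
    az = (u + 0.5) / az_bins * fov_az - fov_az / 2
    el = (v + 0.5) / el_bins * fov_el - fov_el / 2
    x = rng * np.cos(el) * np.cos(az)
    y = rng * np.cos(el) * np.sin(az)
    z = rng * np.sin(el)
    return np.stack([x, y, z], axis=1)

def occluded_pixels(depth_before, depth_opened, threshold):
    """Pixels whose distance dropped by more than `threshold` through the
    opening; 3D points falling into these pixels can be discarded as occluded."""
    return (depth_before - depth_opened) > threshold
```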
As an alternative to using a depth image with a morphological opening, equivalent results can be obtained by entering the 3D points into a parallax image (= reciprocal of the distance) and then performing a morphological closing (i.e., dilation followed by erosion). This is done analogously to the use of the depth image described above.
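Purely as an illustration, the parallax-image variant could look as follows; deriving the parallax image from the depth image and treating empty pixels as zero parallax are assumptions of the sketch.

```python
# Minimal sketch of the parallax (disparity) variant: closing instead of opening
# (assumptions: NumPy/OpenCV, empty pixels carry parallax 0, i.e. the reciprocal
# of an effectively infinite distance D).
import cv2
import numpy as np

def close_parallax_image(depth, kernel_size=3, D=1e6):
    parallax = np.where(depth < D, 1.0 / depth, 0.0).astype(np.float32)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    closed = cv2.morphologyEx(parallax, cv2.MORPH_CLOSE, kernel)  # dilation, then erosion
    return np.where(closed > 0, 1.0 / closed, D)  # back to a depth image
```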
The positions of the annotated ground-truth object detections of the original training data set are converted into the coordinate system of the target sensor in the same way as the 3D point clouds. Detections that lie outside the field of view of the target sensor can be discarded directly. For the remaining object detections, it is checked whether the simulated 3D point cloud of the target sensor (i.e., the input data of the target sensor training data set) contains points of the detected object, for example by comparing these 3D points with the 3D object positions.
Correspondingly, object detections for which the simulated point cloud of the target sensor contains no 3D points are also discarded.
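As an illustration only, the conversion and filtering of the ground-truth detections can be sketched as follows; representing a detection by a single 3D position and the radius used for the point check are assumptions of the sketch.

```python
# Minimal sketch: converting ground-truth object positions into the target
# sensor frame and keeping only detections that lie in the field of view and
# are supported by simulated points (assumptions: NumPy, illustrative radius).
import numpy as np

def convert_detections(det_xyz, T_target_from_train, sim_points,
                       fov_az, fov_el, radius=2.0):
    det_h = np.hstack([det_xyz, np.ones((len(det_xyz), 1))])
    det_t = (T_target_from_train @ det_h.T).T[:, :3]

    rng = np.maximum(np.linalg.norm(det_t, axis=1), 1e-9)
    az = np.arctan2(det_t[:, 1], det_t[:, 0])
    el = np.arcsin(det_t[:, 2] / rng)
    in_fov = (np.abs(az) <= fov_az / 2) & (np.abs(el) <= fov_el / 2)

    # keep only detections with at least one simulated point nearby
    has_points = np.array(
        [np.any(np.linalg.norm(sim_points - d, axis=1) < radius) for d in det_t],
        dtype=bool)
    return det_t[in_fov & has_points]
```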
To train a target-sensor-specific deep-learning object detector, a target sensor training data set is simulated from the raw training data set as described above. This can either be done once in advance, with the target sensor training data set thus generated being stored, or it can be performed as a preprocessing step whenever the original training data are accessed during training. All other aspects of the training (architecture, (hyper)parameters, etc.) can be handled exactly as in training without this simulation. Only for data augmentation must it be ensured that no data or labels are generated outside the field of view of the target sensor.
To train an object detector for a target sensor array having a plurality of sensors, the above-described generation (simulation) of the target sensor training data set is performed for each target sensor of the target sensor array, i.e., several times in the case of several target sensors. In this way, specific training data are simulated for each individual target sensor, which can then be used together to train the detector for the entire target sensor array.
The method described above for training a sensor(-array)-specific perception system, using the task of object detection on lidar sensor data as an example, can be extended to other task areas. For semantic segmentation of lidar sensor data, the 3D point cloud of the target sensor can be simulated from the 3D point clouds of the training sensors in the same way as described for object detection. The associated point-wise semantic labels can be generated from the original labels by maintaining, alongside the depth image, an image of the same size containing the semantic labels. By defining a precedence (Rangfolge) of the semantic labels, this image can likewise be processed by the morphological opening, and the semantic labels for the training input data of the target sensor can then be read from it.
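As an illustration only, carrying the semantic labels through the same opening could be sketched as follows; mapping the labels to integer ranks so that grayscale morphology respects the chosen precedence is an assumption, and the rank table is illustrative.

```python
# Minimal sketch: morphological opening of a label image via a label precedence
# (assumptions: NumPy/OpenCV, ranks fit into uint8, rank table chosen by the user).
import cv2
import numpy as np

def open_label_image(label_img, rank_of_label, kernel_size=3):
    to_rank = np.vectorize(rank_of_label.get)
    ranks = to_rank(label_img).astype(np.uint8)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    opened = cv2.morphologyEx(ranks, cv2.MORPH_OPEN, kernel)
    # min/max filters only produce values already present, so the mapping inverts
    to_label = np.vectorize({r: l for l, r in rank_of_label.items()}.get)
    return to_label(opened)
```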
The described method can also be extended to video-based object recognition and segmentation applications, provided that, in addition to the video data, 3D data are available in the training data set, for example from lidar sensors or from stereoscopic video recordings. For this purpose, the measured image intensities can be assigned to the 3D points by means of the intrinsic and extrinsic sensor calibration. The described simulation method can then be applied, whereby, in addition to the depth image, an image of the same size containing the image intensities is carried along and processed further. From the simulated 3D points with their associated simulated image intensities, a simulated image of the target sensor can then be generated. Annotated labels of the target sensor can be simulated in the same way.
The described method can also be extended to radar-based object recognition and segmentation applications, provided that, in addition to the radar data, dense 3D data are available in the training data set, for example from lidar sensors or from stereoscopic video recordings. Instead of the radar data of the training data set, which are relatively sparse due to the measurement principle and therefore unsuitable for modelling surfaces in the surroundings, the available dense 3D data are used to model the surfaces of the surroundings from the perspective of the target sensor according to the method described above. This model can then be used to discard radar measurements of the raw training data set that are occluded in the coordinate system of the target sensor, and to generate new measurements. When selecting 3D radar measurements, the characteristics of the (radar) target sensor, such as the total number of measured 3D points or the spatial and spectral separability of the reflections, must be taken into account. The velocity of the original 3D point can be used as an approximation of the Doppler velocity of the simulated 3D point, as long as the mounting position of the target sensor does not differ significantly from that of the training sensor.
The method described above for simulating (raw) sensor data of a target sensor can also be used online, for example in a vehicle or, more generally, in a robotic device, in order to simulate a target sensor that is not installed and to use and evaluate it in actual operation. This means that the target sensor data set is not necessarily generated from the (training) input data of an original training data set, but more generally from an original input data set (where these input data are provided with ground truth information for training, but may also be input data without ground truth information for inference; correspondingly, the one or more sensors providing the original input data set are not necessarily "training" sensors, i.e., they do not necessarily provide training data).
With the described method, raw data of a sensor that is not installed can be simulated from the raw data of the training sensors. However, the method cannot simulate information in areas in which the training sensors have not detected any data at all. The field of view of the target sensor should therefore be covered by the field of view of the training sensors; otherwise, the simulated data remain empty in these areas and may thus differ from the data of an actually present target sensor. Apart from that, the described method also makes it possible to simulate target sensors whose assumed mounting position does not lie between the installed training sensors in order to train the perception system, for example a target sensor in the front bumper based on sensor data recorded with lidar sensors on the roof of the vehicle, as shown in Fig. 2. Thus, as long as the target sensor is sufficiently close to the training sensor array, improved performance can be achieved with the described method through sensor-specific training.
In summary, according to various embodiments, a method as shown in fig. 4 is provided.
FIG. 4 illustrates a flow diagram 400 presenting a method for generating input data for a machine learning model, in accordance with an embodiment.
At 401, for at least one sensor, a point cloud having points of a surface in the surrounding of the sensor detected by the sensor is determined.
At 402, a preliminary target sensor point cloud is generated for the target sensor by converting points of the determined point cloud to points from a perspective of the target sensor in accordance with a relative position of the target sensor with respect to the at least one sensor.
At 403, a target sensor point cloud is generated for the target sensor by means of the preliminary target sensor point cloud, wherein points in the target sensor point cloud that are not detected by the target sensor due to one or more surfaces for which points are present in the preliminary target sensor point cloud are removed.
At 404, the target sensor point cloud is used as input for the machine learning model.
The method of Fig. 4 may be performed by one or more computers having one or more data processing units. The term "data processing unit" may be understood as any type of entity capable of processing data or signals. For example, such data or signals may be processed according to at least one (i.e., one or more) specific function performed by the data processing unit. A data processing unit may comprise, or be formed from, an analog circuit, a digital circuit, a logic circuit, a microprocessor, a microcontroller, a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), an integrated circuit of a programmable gate array (FPGA), or any combination thereof. Any other way of implementing the corresponding functions described in more detail herein may also be understood as a data processing unit or logic circuit arrangement. One or more of the method steps described in detail herein may be performed (e.g., implemented) by a data processing unit through one or more specific functions carried out by the data processing unit.
The method of Fig. 4 may be used, for example, to generate control signals for a robotic device (from the output of the machine learning model). The term "robotic device" may be understood as referring to any technical system (with a mechanical part whose movement is controlled), such as a computer-controlled machine, a vehicle, a household appliance, a power tool, a manufacturing machine, a personal assistant or an access control system. By means of the input data, the machine learning model can be trained to generate output data on the basis of which such a technical system is controlled; the machine learning model is then used accordingly to control the technical system.
Various embodiments may receive and use sensor signals from various sensors, such as video, radar, lidar, ultrasound, motion, thermal imaging, and the like. The machine learning model processes the sensor data. This may include classification of the sensor data or semantic segmentation of the sensor data, for example in order to detect the presence of objects (in the environment in which these sensor data were obtained). Embodiments may be used to train a machine learning system and to control a robot, e.g., a robotic manipulator, autonomously in order to accomplish various manipulation tasks in various scenarios. In particular, embodiments are applicable to the control and monitoring of the execution of manipulation tasks, for example on an assembly line.
Although specific embodiments have been presented and described herein, those skilled in the art will recognize that the specific embodiments shown and described may be replaced by a variety of alternative and/or equivalent implementations without departing from the scope of the present application. This application is intended to cover any adaptations or variations of the specific embodiments discussed herein. It is therefore intended that this application be limited only by the claims and their equivalents.

Claims (12)

1. A method for generating input data for a machine learning model, the method having:
determining, for at least one sensor, a point cloud having points of a surface in the ambient environment of the sensor detected by the sensor;
generating a preliminary target sensor point cloud for the target sensor by converting, for the at least one sensor, the points of the determined point cloud into points from the perspective of the target sensor in accordance with the relative position of the target sensor with respect to the at least one sensor;
generating a target sensor point cloud for the target sensor by means of the preliminary target sensor point cloud, wherein points that would not be detected by the target sensor because of one or more surfaces for which points are present in the preliminary target sensor point cloud are removed from the target sensor point cloud; and
using the target sensor point cloud as input for the machine learning model.
2. The method of claim 1, wherein, when generating the target sensor point cloud, the preliminary target sensor point cloud is supplemented with points of surfaces for which surface points are already contained in the preliminary target sensor point cloud and which would be detectable by the target sensor given the presence of the corresponding surface.
3. The method according to claim 1 or 2, having: generating the target sensor point cloud by: generating a depth image representing the positions of the points of the preliminary target sensor point cloud from the perspective of the target sensor; subjecting the depth image to a morphological opening operation; and generating the target sensor point cloud from the depth image subjected to the morphological opening operation.
4. The method according to claim 1 or 2, having: generating the target sensor point cloud by: generating a parallax image representing the positions of the points of the preliminary target sensor point cloud from the perspective of the target sensor; subjecting the parallax image to a morphological closing operation; and generating the target sensor point cloud from the parallax image subjected to the morphological closing operation.
5. The method according to any one of claims 1 to 4, having: determining, for each sensor of a plurality of sensors, a respective point cloud having points of surfaces detected by that sensor in its surroundings; generating the preliminary target sensor point cloud by converting, for each sensor of the plurality of sensors, the points of the respective determined point cloud into points from the perspective of the target sensor in accordance with the relative position of the target sensor with respect to that sensor; and combining the converted points into the preliminary target sensor point cloud.
6. The method according to any one of claims 1 to 5, having: generating a respective target sensor point cloud for each target sensor of a target sensor array having a plurality of target sensors; and using the generated target sensor point clouds as input for the machine learning model.
7. The method according to any one of claims 1 to 5, having: the machine learning model is trained by means of the target sensor point cloud to process sensor data from the perspective of the target sensor.
8. The method according to claim 7, the method having: acquiring ground truth information of points of the point cloud determined by the at least one sensor; converting the ground truth information into ground truth information of points of the target sensor point cloud; and training the machine learning model with supervised learning by means of the target sensor point cloud as training input data and the converted ground truth information.
9. The method according to claim 7 or 8, the method having: generating respective target sensor point clouds for a target sensor array having a plurality of target sensors; and training the machine learning model by means of the generated target sensor point cloud.
10. A sensor data processing system set up to perform the method according to any one of claims 1 to 9.
11. A computer program having instructions which, when executed by a processor, cause: the processor performs the method according to any one of claims 1 to 9.
12. A computer readable medium storing instructions that when executed by a processor cause: the processor performs the method according to any one of claims 1 to 9.
CN202310644690.3A 2022-06-01 2023-06-01 Method for generating input data for machine learning model Pending CN117152696A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DE102022205572.1A DE102022205572A1 (en) 2022-06-01 2022-06-01 Method for generating input data for a machine learning model
DE102022205572.1 2022-06-01

Publications (1)

Publication Number Publication Date
CN117152696A 2023-12-01

Family

ID=88790632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310644690.3A Pending CN117152696A (en) 2022-06-01 2023-06-01 Method for generating input data for machine learning model

Country Status (3)

Country Link
US (1) US20230394757A1 (en)
CN (1) CN117152696A (en)
DE (1) DE102022205572A1 (en)

Also Published As

Publication number Publication date
DE102022205572A1 (en) 2023-12-07
US20230394757A1 (en) 2023-12-07

Similar Documents

Publication Publication Date Title
JP7254823B2 (en) Neural networks for object detection and characterization
JP7239703B2 (en) Object classification using extraterritorial context
US20190294177A1 (en) Data augmentation using computer simulated objects for autonomous control systems
Nieto et al. Real-time lane tracking using Rao-Blackwellized particle filter
JP2021089724A (en) 3d auto-labeling with structural and physical constraints
Dey et al. VESPA: A framework for optimizing heterogeneous sensor placement and orientation for autonomous vehicles
US20220108544A1 (en) Object detection apparatus, system and method
US11967103B2 (en) Multi-modal 3-D pose estimation
EP3903232A1 (en) Realistic sensor simulation and probabilistic measurement correction
US20220277581A1 (en) Hand pose estimation method, device and storage medium
KR20180027242A (en) Apparatus and method for environment mapping of an unmanned vehicle
CN115346192A (en) Data fusion method, system, equipment and medium based on multi-source sensor perception
WO2022171428A1 (en) Computer-implemented method for generating reliability indications for computer vision
CN114595738A (en) Method for generating training data for recognition model and method for generating recognition model
CN112800822A (en) 3D automatic tagging with structural and physical constraints
Stäcker et al. RC-BEVFusion: A Plug-In Module for Radar-Camera Bird's Eye View Feature Fusion
US20220237897A1 (en) Computer-implemented method for analyzing relevance of visual parameters for training a computer vision model
CN117152696A (en) Method for generating input data for machine learning model
WO2022157157A1 (en) Radar perception
Alhamwi et al. Real time vision system for obstacle detection and localization on FPGA
Tousi et al. A new approach to estimate depth of cars using a monocular image
US20230267749A1 (en) System and method of segmenting free space based on electromagnetic waves
US11669980B2 (en) Optical flow based motion detection
Miekkala 3D object detection using lidar point clouds and 2D image object detection
CN117157682A (en) Machine labeling of photographic images

Legal Events

Date Code Title Description
PB01 Publication