EP4138048A1 - Training of 3D lane detection models for automotive applications - Google Patents
Training of 3D lane detection models for automotive applications
- Publication number: EP4138048A1 (application EP21191605.1A)
- Authority: EP (European Patent Office)
- Prior art keywords: lane, image, lane boundaries, boundaries, neural network
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V 20/588: Recognition of the road, e.g. of lane markings; recognition of the vehicle driving pattern in relation to the road
- G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06V 10/7753: Incorporation of unlabelled data, e.g. multiple instance learning [MIL]
- G06V 10/82: Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
- G06V 20/64: Scenes; scene-specific elements; type of objects; three-dimensional objects
Definitions
- the present invention relates, in general, to the field of autonomous and semi-autonomous vehicles, and in particular to a method for unsupervised and semi-supervised training of 3D lane detection models.
- ADAS: advanced driver-assistance systems, e.g. adaptive cruise control (ACC), forward collision warning, etc.
- ADS: Automated Driving System
- An ADS may be construed as a complex combination of various components that can be defined as systems where perception, decision making, and operation of the vehicle are performed by electronics and machinery instead of a human driver, and as introduction of automation into road traffic. This includes handling of the vehicle, destination, as well as awareness of surroundings. While the automated system has control over the vehicle, it allows the human operator to leave all or at least some responsibilities to the system.
- An ADS commonly combines a variety of sensors to perceive the vehicle's surroundings, such as e.g. radar, LIDAR, sonar, camera, navigation system e.g. GPS, odometer and/or inertial measurement units (IMUs), upon which advanced control systems may interpret sensory information to identify appropriate navigation paths, as well as obstacles, free-space areas, and/or relevant signage.
- Deep Learning (DL) is a promising technology in the area of perception, for example in order to detect and classify objects in images, video streams and/or LIDAR point clouds.
- the problem with DL, however, is that it quickly reaches some level of performance, but then extremely large amounts of data are required to achieve truly high performance.
- Annotating millions of images is expensive, and hence many initiatives are taken in the autonomous driving field to reduce this cost through semi-automated annotation and learning efficiently from the annotated data.
- Lane detection is employed in various ADS features, such as ACC, Lane Departure Warning (LDW), Lane Change Assist (LCA), aside from being a necessary function for fully autonomous vehicles.
- There are two main approaches to lane detection (which may also be referred to as lane marker detection or lane boundary detection).
- One approach involves the determination of the vehicle's position relative to the lanes (lane markers), which can be computed by using maps created offline and by monitoring the vehicle's position in the map.
- the other approach relies on the use of the on-board perception system, which can allow for detecting lanes during run-time without relying on any offline created map.
- the first approach (offline approach) to lane detection may for example be accomplished by combining GPS data, Inertial Measurement Unit (IMU) data, and an HD-map.
- the run-time lane detection methods are most often based on perception data from monocular cameras.
- these methods generally include four main steps. Firstly, there is a local lane feature extraction, followed by lane model fitting and spatial transformation (i.e. image-to-world correspondence). Lastly, there is a temporal aggregation step (i.e. tracking lanes via filtering approaches).
- the majority of the camera-based lane detection methods treat the problem as a 2D task in the image plane.
- the semantic segmentation approach is among the most popular deep learning methods for lane detection using monocular cameras.
- a deep convolutional neural network is trained to find the pixels in an image that correspond to a lane, and then some method of lane model fitting is used to create lane instances from the detected pixels.
- the lanes are oftentimes projected to 3D coordinates under a "flat earth” assumption.
- such methods may struggle with drawbacks in terms of providing inaccurate estimates of the 3D position of the lanes when the road is "hilly", “banked” or “curvy”, even if each pixel in the image is classified correctly.
- This object is achieved by means of a method for training an artificial neural network configured for 3D lane detection based on unlabelled image data from a vehicle-mounted camera, a computer-readable storage medium, an apparatus, a vehicle comprising such an apparatus, a remote server comprising such an apparatus, and a cloud environment comprising one or more such remote servers, as defined in the appended independent claims.
- the term exemplary is in the present context to be understood as serving as an instance, example or illustration.
- a method for training an artificial neural network configured for 3D lane detection based on unlabelled image data from a vehicle-mounted camera.
- the method comprises generating, by means of the artificial neural network, a first set of 3D lane boundaries in a first coordinate system based on a first image captured by the vehicle-mounted camera.
- the method further comprises generating, by means of the artificial neural network, a second set of 3D lane boundaries in a second coordinate system based on a second image captured by the vehicle mounted camera.
- the second image is captured at a later moment in time compared to the first image and the first image and the second image contain at least partly overlapping road portions.
- the method comprises transforming at least one of the second set of 3D lane boundaries and the first set of 3D lane boundaries based on positional data associated with the first image and the second image, such that the first set of 3D lane boundaries and the second set of 3D lane boundaries have a common coordinate system.
- the method further comprises evaluating the first set of 3D lane boundaries against the second set of 3D lane boundaries in the common coordinate system in order to find matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- the method comprises updating one or more model parameters of the artificial neural network based on a spatio-temporal consistency loss between the found matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
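- By way of illustration, one unsupervised training step built on the above steps could look as sketched below (a minimal PyTorch-style sketch; the helper functions transform_to_common_frame, match_lanes and consistency_loss are hypothetical placeholders, sketched further below, and not part of the claimed method as such):

```python
# Minimal sketch of one unsupervised training step based on the
# spatio-temporal consistency idea. The helper functions are hypothetical
# placeholders; sketches of them are given further below.
def training_step(model, optimizer, image_t, image_t1, ego_motion):
    lanes_t, conf_t = model(image_t)     # first set of 3D lane boundaries (frame t)
    lanes_t1, conf_t1 = model(image_t1)  # second set of 3D lane boundaries (frame t+1)

    # Transform the later prediction into the coordinate system of frame t,
    # using positional data (ego motion between the two images).
    lanes_t1_in_t = transform_to_common_frame(lanes_t1, ego_motion)

    # Find matching lane pairs and compute the spatio-temporal consistency loss.
    matches = match_lanes(lanes_t, lanes_t1_in_t, d_th=1.0)
    loss = consistency_loss(lanes_t, lanes_t1_in_t, conf_t, conf_t1, matches)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```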
- a (non-transitory) computer-readable storage medium storing one or more programs configured to be executed by one or more processors of a processing system, the one or more programs comprising instructions for performing the method according to any one of the embodiments disclosed herein.
- the term "non-transitory" is intended to describe a computer-readable storage medium (or "memory") excluding propagating electromagnetic signals, but is not intended to otherwise limit the type of physical computer-readable storage device that is encompassed by the phrase computer-readable medium or memory.
- the terms “non-transitory computer readable medium” or “tangible memory” are intended to encompass types of storage devices that do not necessarily store information permanently, including for example, random access memory (RAM).
- Program instructions and data stored on a tangible computer-accessible storage medium in non-transitory form may further be transmitted by transmission media or signals such as electrical, electromagnetic, or digital signals, which may be conveyed via a communication medium such as a network and/or a wireless link.
- the term “non-transitory”, as used herein is a limitation of the medium itself (i.e., tangible, not a signal) as opposed to a limitation on data storage persistency (e.g., RAM vs. ROM).
- apparatus for training an artificial neural network configured for 3D lane detection based on unlabelled image data from a vehicle-mounted camera.
- the apparatus comprises control circuitry configured to generate, by means of the artificial neural network, a first set of 3D lane boundaries in a first coordinate system based on a first image captured by the vehicle-mounted camera.
- the control circuitry is further configured to generate, by means of the artificial neural network, a second set of 3D lane boundaries in a second coordinate system based on a second image captured by the vehicle mounted camera.
- the second image is captured at a later moment in time compared to the first image and the first image and the second image contain at least partly overlapping road portions.
- control circuitry is configured to transform at least one of the second set of 3D lane boundaries and the first set of 3D lane boundaries based on positional data associated with the first image and the second image, such that the first set of 3D lane boundaries and the second set of 3D lane boundaries have a common coordinate system.
- the positional data may for example be obtained from a localization system (e.g. a GPS unit) of the vehicle.
- the control circuitry is further configured to evaluate the first set of 3D lane boundaries against the second set of 3D lane boundaries in the common coordinate system in order to find matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- control circuitry is configured to update one or more model parameters of the artificial neural network based on a spatio-temporal consistency loss between the found matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- control circuitry is configured to update the one or more parameters of the artificial neural network by penalizing the artificial neural network based on the spatio-temporal consistency loss between matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- control circuitry is configured to determine an average distance between each lane of the first set of 3D lane boundaries and each lane of the second set of 3D lane boundaries in the common coordinate system.
- the control circuitry may be further configured to compare the determined average distances against an average distance threshold, and determine each lane pair having a determined average distance below the average distance threshold as a matching lane pair.
- the artificial neural network is configured to generate a set of anchor lines where each anchor line is associated with a confidence value indicative of a presence of a lane boundary at the corresponding anchor line.
- control circuitry is configured to disregard any anchor line associated with a confidence value below a confidence value threshold in the transformation step.
- control circuitry is configured to update the one or more model parameters of the artificial neural network further based on the confidence values of any anchor lines associated with matching lane pairs and any anchor lines associated with non-matching lane pairs, such that any anchor lines associated with matching lane pairs should predict a normalized confidence value of 1.0 and any anchor lines associated with non-matching lane pairs should predict a normalized confidence value of 0.0.
- a vehicle comprising a camera configured to capture images of at least a portion of a surrounding environment of the vehicle.
- the vehicle further comprises a localization system configured to monitor a geographical position of the vehicle, and an apparatus according to any one of the embodiments disclosed herein.
- a remote server comprising the apparatus according to any one of the embodiments of the third aspect disclosed herein.
- a cloud environment comprising one or more remote servers according to any one of the embodiments of the fifth aspect disclosed herein.
- the present invention is at least partly motivated by the recent success of 3D lane detection models (end-to-end data-driven approaches), as well as semi-supervised learning methods. Accordingly, the present inventors hereby introduce a spatio-temporal consistency loss for 3D lane detection that is used to train suitable artificial neural networks (may also be referred to as self-learning models) on unlabelled data in a semi-supervised fashion.
- there are two main coordinate systems used in the present disclosure, C_cam and C_road (see e.g. Fig. 6).
- the first coordinate system C_cam is simply defined as the system with the camera in the origin and oriented with the forward direction pointing in the direction of the camera.
- the second coordinate system C_road lies straight beneath C_cam but is oriented such that the forward direction is aligned with the road surface. This means that the two systems differ by both a rotation and a height offset due to the pitch with which the camera is mounted.
- the transformation between C_cam and C_road is hence done by rotating the system by the pitch of the camera such that the forward direction points along the road surface and then translating the system down to the road.
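- As a concrete illustration of this transformation, a point expressed in C_cam may be mapped to C_road by rotating it by the camera pitch and translating it down by the camera mounting height (a minimal NumPy sketch under assumed axis and sign conventions; the actual conventions must match the camera calibration at hand):

```python
import numpy as np

def cam_to_road(points_cam, pitch_rad, cam_height):
    """Map Nx3 points from C_cam to C_road.

    Assumed conventions: x to the right, y forward, z up; the camera is
    pitched about the x-axis and mounted cam_height above the road plane.
    """
    c, s = np.cos(pitch_rad), np.sin(pitch_rad)
    # Rotate about the x-axis so that the forward direction follows the road surface.
    rot_x = np.array([[1.0, 0.0, 0.0],
                      [0.0,   c,  -s],
                      [0.0,   s,   c]])
    points_road = points_cam @ rot_x.T
    points_road[:, 2] -= cam_height  # translate the points down to road level
    return points_road
```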
- A schematic overview of an example of a "suitable artificial neural network" or suitable self-learning model is illustrated in Fig. 1.
- the depicted architecture is known as "3D-lanenet" and is fully described by Garnett N. et al. (2019), 3d-lanenet: End-to-end 3d multiple lane detection.
- the model comprises a dual pathway that includes convolutional, max pooling and dense layers together with a projective transformation layer that transforms the feature maps from image-view to top-view.
- the network contains a set of layers that are clustered into L-layers, and specific details are listed in Table 1 below containing information about each layer of the network.
- Information passed to the network is split up and processed in two parallel pathways via the so called dual-pathway backbone.
- the model can be divided into four quadrants, which may be referred to as Image-view pathway, Road plane prediction branch, Top-view pathway, and Lane prediction head.
- Table 1: Information about the dimensionality of the L-layers.
- the first part of the network is the first of the two parallel pathways, the Image-view pathway.
- This part of the network takes in the RGB-channels of the input image and propagates the information through several convolutional and max pooling layers, following the structure of the first part of the standard VGG16 network.
- the filters are swept over the image in the original image 2D-space.
- information about the original image is preserved and sent to the Road plane prediction branch.
- the signal from the L-layers[3, 4, 5, 6] is sent to the Top-view pathway via the Projective transformation layer which is described in detail later.
- From the Image-view pathway, the output features of L-layer 6 are sent to the Road plane prediction branch, which goes through a similar process of convolving and max pooling as the Image-view pathway. The information is then sent to L-layers 10 and 11, which are two fully connected layers resulting in two values: the predicted position (height) and pitch of the camera (see Fig. 1). The predicted pitch and height are then sent together with the outputs from L-layers[3, 4, 5, 6] to the Projective transformation layer.
- the Projective transformation layer is able to create feature maps corresponding spatially to a virtual top-view by performing a differentiable sampling of the input features corresponding to the image plane, without affecting the number of input channels.
- the second path of the parallel pathway is the Top-view pathway where all features are represented in a top-view plane.
- the signals are convolved and max pooled with the addition of concatenation of the projected outputs of the feature maps from the L-layers[3, 4, 5, 6]. This means that the spatial features gathered in both the image-view plane and the top-view plane are combined.
- the signal is then sent to the Lane prediction head where the final predictions of the 3D-Lanes are made.
- the final part of the network is the Lane Prediction head, which receives the output features of L-layer 15 and further convolves that information.
- the final output of L-layer 25 are the predicted 3D-Lanes represented with lane anchors.
- a lane is represented by being connected to the "anchor" closest to the lane.
- the shape of the lane is described by offsets in height and width from the connected anchor line at different predefined values (y) in the forward direction.
- y may be set to 6.5, 10, 15, 20, 30, 40, 50, 60, 80, 100 meters in front of the vehicle.
- 3D-lanenet essentially casts the lane detection task as an object detection task; it works in a similar fashion to one-stage object detectors by using the concept of anchors to make predictions (i.e. to generate 3D lanes).
- the predictions (x_j^i, z_j^i) correspond to the point in 3D space given by (X_A^i + x_j^i, y_j, z_j^i) in the coordinate system C_road.
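- A short sketch of how such anchor-based predictions may be decoded into 3D points in C_road is given below (assuming per-anchor arrays of lateral offsets and heights at the predefined longitudinal positions; a simplified reading of the representation described above):

```python
import numpy as np

# Predefined longitudinal sampling positions (meters in front of the vehicle).
Y_STEPS = np.array([6.5, 10, 15, 20, 30, 40, 50, 60, 80, 100])

def decode_anchor(x_offsets, z_values, anchor_x):
    """Decode one anchor's predictions into an Nx3 polyline in C_road.

    x_offsets, z_values: arrays of length len(Y_STEPS) predicted by the network
    (lateral offsets and heights at each y_j).
    anchor_x: lateral position X_A of the associated anchor line.
    """
    xs = anchor_x + np.asarray(x_offsets)  # X_A^i + x_j^i
    zs = np.asarray(z_values)              # z_j^i
    return np.stack([xs, Y_STEPS, zs], axis=1)
```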
- the 3D-lanenet architecture is merely one example out of several possible suitable artificial neural networks, and should not be construed as limiting to the present invention.
- Another example of an artificial neural network or model configured for 3D lane detection is Gen-lanenet (see e.g. Guo Y. et al. (2020) Gen-LaneNet: A Generalized and Scalable Approach for 3D Lane Detection for further details of that architecture).
- Fig. 2 is a schematic flowchart representation of a method S100 for training an artificial neural network (ANN) configured for 3D lane detection based on unlabelled image data from a vehicle-mounted camera in accordance with an embodiment of the invention.
- the herein proposed method of unsupervised training is at least partly based on the realization that one can assume a consistency of 3D lanes in video sequences, and therefore utilize the notion of a spatio-temporal consistency loss in order to train the ANN.
- the method S100 comprises generating S101, by means of the artificial neural network, a first set of (predicted/estimated) 3D lane boundaries (may also be referred to as 3D lanes) in a first coordinate system based on a first image captured by the vehicle-mounted camera.
- the first image may for example be referred to as the image taken at time t.
- the method S100 comprises generating S102, by means of the artificial neural network, a second set of (predicted/estimated) 3D lane boundaries in a second coordinate system based on a second image captured by the vehicle mounted camera.
- the second image is captured at a later moment in time compared to the first image.
- the first image and the second image contain at least partly overlapping road portions. Accordingly, the second image may be referred to as image taken at time t+1.
- the first and second coordinate systems may be in the aforementioned C_road coordinate system (see e.g. Fig. 6) where the forward direction is aligned with the road surface, but with different origins.
- the method S100 further comprises transforming S103 at least one of the second set of 3D lane boundaries and the first set of 3D lane boundaries based on positional data associated with the first image and the second image, such that the first set of 3D lane boundaries and the second set of 3D lane boundaries have a common coordinate system.
- the positional data may for example be processed Global Navigation Satellite System (GNSS) data obtained from an on-board localization system (e.g. a GPS unit).
- GNSS Global Navigation Satellite System
- the second set of 3D lane boundaries, which may be provided in C_road at time t+1, is transformed S103 to the coordinate system of the first set of 3D lane boundaries, i.e. to C_road at time t.
- the method S100 comprises evaluating S104 the first set of 3D lane boundaries against the second set of 3D lane boundaries in the common coordinate system in order to find matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries. This may for example be done by formulating the evaluation process S104 as a bipartite matching problem and using a min-cost-flow solver.
- the step of evaluating S104 the first set of 3D lane boundaries against the second set of 3D lane boundaries in the common coordinate system comprises determining S106 an average distance between each lane of the first set of 3D lane boundaries and each lane of the second set of 3D lane boundaries in the common coordinate system. Then, the determined S106 average distances may be compared S108 against an average distance threshold, and each lane pair having a determined S106 average distance below the average distance threshold is determined S108 as a matching lane pair.
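- One possible realization of this evaluation step is sketched below: pairwise average distances between the lanes (assumed to be re-sampled at common longitudinal positions, as described further below) form a cost matrix, a one-to-one assignment is computed (here with SciPy's Hungarian algorithm as a stand-in for a min-cost-flow solver), and pairs whose average distance exceeds the threshold are rejected:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_lanes(lanes_a, lanes_b, d_th):
    """Find matching lane pairs between two lists of lanes.

    lanes_a, lanes_b: lists of (q, 3) arrays re-sampled at common y positions.
    Returns a list of (i, j) index pairs with average distance below d_th.
    Uses the Hungarian algorithm as a stand-in for a min-cost-flow solver.
    """
    cost = np.zeros((len(lanes_a), len(lanes_b)))
    for i, lane_a in enumerate(lanes_a):
        for j, lane_b in enumerate(lanes_b):
            # Average Euclidean distance over the common sampling positions.
            cost[i, j] = np.linalg.norm(lane_a - lane_b, axis=1).mean()
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < d_th]
```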
- the method S100 comprises updating S105 one or more model parameters of the artificial neural network based on a spatio-temporal consistency loss between the found matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- the artificial neural network is trained S105 based on the spatio-temporal consistency loss between the found matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- the step of updating S105 one or more parameters of the artificial neural network comprises penalizing the artificial neural network based on the spatio-temporal consistency loss between matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- the ANN may alternatively be updated by "rewarding" the ANN instead of “penalizing” the ANN based on specific implementations.
- the spatio-temporal consistency loss is, in accordance with some embodiments, based on a Euclidean distance between matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- the herein disclosed method for unsupervised training is at least partly based on the assumption of consistency of 3D lanes in video sequences (i.e. a series of images).
- in other words, the "predicted" 3D lanes (i.e. the 3D lane boundaries generated by the ANN) should be consistent between images that depict overlapping road portions.
- the herein proposed method leverages this fact and constitutes a framework for training an ANN configured to generate 3D lane boundaries from image data to make consistent predictions on unlabelled video data.
- the first image and the second image are captured at different geographical positions. More specifically, the first image and the second image may be captured at different geographical positions along the vehicle's traveling direction such that the two images have overlapping content.
- lanes visible in one frame may not always be visible in the following/subsequent frame.
- highways appear to be a road type well suited for imposing consistency.
- the fact that the lane topologies are usually constant over far distances on highways makes it possible to impose consistency between frames/images even though the vehicle has significant movement between the frames/images.
- the first image and the second image are taken/captured while the vehicle is traveling on a highway or motorway.
- other scenarios are still viable and applicable with the concepts disclosed herein.
- the first and second images are taken/captured while the vehicle is traveling in an urban environment.
- different requirements may apply in different scenarios, e.g. reducing a distance threshold for the "common parts" (i.e. overlapping portions) of the two images for images taken in urban environments as compared to images taken while the vehicle is traveling on a highway.
- the artificial neural network is configured to generate a set of anchor lines where each anchor line is associated with a confidence value indicative of a presence of a lane boundary at the corresponding anchor line.
- the method S100 further comprises disregarding any anchor line associated with a confidence value below a confidence value threshold in the transformation step S103.
- the step of updating S105 the one or more model parameters of the artificial neural network is further based on the confidence values of any anchor lines associated with matching lane pairs and any anchor lines associated with non-matching lane pairs.
- the one or more parameters of the ANN are updated such that any anchor lines associated with matching lane pairs should predict a normalized confidence value of 1.0 and any anchor lines associated with non-matching lane pairs should predict a normalized confidence value of 0.0. This may for example be done by using a cross-entropy loss function.
- Fig. 3 illustrates a schematic flowchart indicative of how the spatio-temporal consistency loss between two sets of predicted/generated 3D lane boundaries is computed.
- a first image/frame t and a second image/frame t+1 are provided as input to the artificial neural network 303, which outputs a first set of 3D lane boundaries 304 and a second set of 3D lane boundaries 305.
- At least one of the generated sets of 3D lane boundaries 304, 305 is transformed S103 so as to obtain a common coordinate system, such that they can be evaluated against each other.
- the resulting overlap and offset d_{i,j} between the two sets of 3D lane boundaries are schematically indicated in box 306 and with an enlarged view in box 307.
- the ANN 303 is configured to generate a set of anchor lines where each anchor line is associated with a confidence value (p) indicative of a presence of a lane boundary at the corresponding anchor line.
- the remaining predictions are then transformed into 3D space by using the fact that the prediction (x_j^i, z_j^i) corresponds to the point in 3D space given by (X_A^i + x_j^i, y_j, z_j^i) in the coordinate system C_road.
- the predictions of frame t+1 302 are transformed S103 to C_road of frame t 301 and the lanes are re-sampled at common longitudinal positions by piecewise linear interpolation (see e.g. ref. 306 in Fig. 3).
- the re-sampled lane that originated from predictions p^i and (x^i, z^i) is now defined by a new set of points in 3D space given by (x_ext^i, y_ext, z_ext^i), where each of these vectors is of some length q and y_ext is a vector of pre-determined sampling positions.
- the re-sampled lane that originated from the predictions p̃^i and (x̃^i, z̃^i) of frame t+1 302 is denoted by (x̃_ext^i, y_ext, z̃_ext^i).
- the longitudinal positions given by y_ext are the same for the predictions of both frames.
- y_ext only corresponds to points that lie within the common region (indicated by box 306) of the lanes from both frames 301, 302. This may for example be in the region of 50-100 meters in front of the vehicle. This will however depend on the distance travelled by the vehicle between the two frames 301, 302.
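- A minimal sketch of this re-sampling step, using piecewise linear interpolation of the lateral and height coordinates over the common longitudinal positions y_ext (NumPy-based; assumes the lane's y-coordinates can be sorted into increasing order):

```python
import numpy as np

def resample_lane(lane_xyz, y_ext):
    """Re-sample a lane polyline at the longitudinal positions y_ext.

    lane_xyz: (N, 3) array with columns (x, y, z) in C_road.
    y_ext: 1D array of pre-determined sampling positions within the common region.
    Returns a (len(y_ext), 3) array (x_ext, y_ext, z_ext).
    """
    x, y, z = lane_xyz[:, 0], lane_xyz[:, 1], lane_xyz[:, 2]
    order = np.argsort(y)  # np.interp expects increasing sample positions
    x_ext = np.interp(y_ext, y[order], x[order])
    z_ext = np.interp(y_ext, y[order], z[order])
    return np.stack([x_ext, np.asarray(y_ext), z_ext], axis=1)
```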
- This definition of the cost of matching two lanes is used by the min-cost-flow solver and the global optimal matching is defined as the one that matches the set of lanes (or lane boundaries) with the lowest total cost.
- L_match = Σ_{(m, n) ∈ V} L_match(m, n)
- V is the set of all valid matches.
- U 1 and U 2 are the set of all anchor lines of frame t 301 and frame t+1 302 respectively that did not find a valid match.
- the total spatio-temporal consistency loss may accordingly be formulated as the sum of the penalty on the matched anchor lines and non-matched anchor lines.
- L_c = L_match + L_nomatch
- the anchor lines that correspond to valid lane matches result in a penalizing/rewarding of the ANN 303 so as to predict a confidence of 1.0, and all other anchor lines result in a penalizing/rewarding so as to predict a confidence of 0.0, using the cross-entropy loss.
- the underlying assumption of penalizing/rewarding the confidence predictions in this manner is that lanes (or lane boundaries) that found a valid match over two frames are likely to be correct, while the lanes that do not are likely to be incorrect predictions.
- the geometric part of the spatio-temporal consistency loss penalizes the predictions of the network 303 implicitly by computing the loss based on, for example, x_ext^i and x̃_ext^i rather than the predictions x^i and x̃^i.
- the points of y ext that lie outside of the common region are removed, such that both predictions of frame t 301 and frame t+1 302 exist for all the remaining points in y ext .
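- Putting the pieces together, the total spatio-temporal consistency loss may be sketched as a geometric term over the matched, re-sampled lanes plus a cross-entropy term on the anchor confidences (PyTorch-style sketch; the exact distance measure and the weighting between the two terms are implementation choices not prescribed here):

```python
import torch
import torch.nn.functional as F

def consistency_loss(lanes_t, lanes_t1, conf_t, conf_t1, matches):
    """Sketch of L_c = L_match + L_nomatch.

    lanes_t, lanes_t1: lists of (q, 3) tensors re-sampled at common y positions.
    conf_t, conf_t1: 1D tensors of normalized anchor confidences in (0, 1).
    matches: list of (i, j) index pairs of matched lanes/anchors.
    """
    # Geometric term: mean Euclidean distance between matched, re-sampled lanes.
    if matches:
        l_geo = torch.stack(
            [torch.norm(lanes_t[i] - lanes_t1[j], dim=1).mean() for i, j in matches]
        ).mean()
    else:
        l_geo = torch.zeros(())

    # Confidence term: matched anchors towards 1.0, all other anchors towards 0.0.
    target_t = torch.zeros_like(conf_t)
    target_t1 = torch.zeros_like(conf_t1)
    for i, j in matches:
        target_t[i] = 1.0
        target_t1[j] = 1.0
    l_conf = F.binary_cross_entropy(conf_t, target_t) + \
             F.binary_cross_entropy(conf_t1, target_t1)
    return l_geo + l_conf
```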
- the described spatio-temporal consistency loss can be said to entail two hyperparameters, namely the distance threshold d_th that defines which matches are valid and the probability threshold p_th (may also be referred to as confidence value threshold) that defines which predictions are considered as "positive" predictions.
- the distance threshold may be set low enough such that the predictions of two adjacent lanes can never be considered valid, meaning that d_th is set smaller than a (typical) half-lane width.
- d_th should not be too close to zero either, because then essentially no lanes will constitute valid matches.
- the average distance threshold d_th is set to a value above zero and below a lane width (of the road captured by the images 301, 302).
- the average distance threshold d_th may be set separately for "close range" errors (e.g. within 0-40 meters from the vehicle) and "far range" errors (e.g. within 40-100 meters from the vehicle).
- the average distance threshold d_th may comprise a plurality of average distance thresholds based on a distance from the vehicle.
- predictions from the consecutive frames/images could "supervise” each other differently in comparison to having equal importance when penalizing the network if lanes do not align well in 3D space. For example, since it can be assumed that the close range predictions are more accurate than the far range predictions one could argue that it would be better to assign greater importance to the close range predictions, and thus penalize the predictions of the first frame more than those of the second frame. Moreover, in some embodiments, only the lane geometry predictions of the first frame are penalized and the predictions of the second frame are not penalized. This is because it may be assumed that the far range predictions of the first frame cannot supervise the more accurate close range predictions of the second frame.
- since the spatio-temporal consistency loss as described herein is based on an assumption that the predicted (and matched) lanes or lane boundaries describe the underlying ground truth lanes as well as possible, one can set the probability threshold p_th to the value that resulted in the maximum F-score on a validation set.
- the F-score is a common metric used for measuring the performance of object detection models. This binary classification metric is based on the two measures that can be calculated from a confusion matrix, namely the precision and recall.
- the basic underlying principles and uses of the F-score are considered to be readily understood by the person skilled in the art, and will for the sake of brevity and conciseness not be further elaborated upon herein.
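- For reference, a minimal F1-score computation from true positive, false positive and false negative counts (the standard definition is assumed here):

```python
def f1_score(tp, fp, fn):
    """F1-score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```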
- the teachings herein may be expanded to more than two frames/images, such as e.g. 5, 10, or 15 consecutive images in a video sequence.
- the teachings herein are analogously applicable for road edge detection.
- the ANN is configured for 3D road edge detection (may also be referred to as 3D road boundary detection) based on unlabelled image data from a vehicle-mounted camera.
- the methods are analogous in that they follow the same principles, with the difference that the features related to "3D lane boundaries" are replaced with "3D road edges".
- the road edges and the lane boundaries may be treated simultaneously or as alternatives.
- Before the network is trained on unlabelled data, one may initialize the network and train it until convergence on a labelled dataset before training it further on the unlabelled dataset as described above. This may be referred to as a semi-supervised training scheme and is schematically depicted by the flowchart S500 in Fig. 5.
- the method S400 (schematically depicted in Fig. 4) includes using accumulated dense depth maps 413 (from aggregated LiDAR point clouds) together with 2D lane instance annotations 411, 412 and then applying some post processing S403, S404 to refine the extracted lanes.
- the depth maps 413 are accordingly used as ground truth depths of the pixels in the associated images 410 captured by the monocular camera.
- the 2D lane instances may be annotated as polygons 411 or polylines 412.
- the associated depth map of all pixels belonging to each lane instance may be considered to form 3D point clouds that represent the lanes (step S401).
- the method S400 may comprise combining the 2D polygon lane annotations 411 and projected depth maps to construct 3D lane point clouds for each lane instance using the camera intrinsic matrix.
- the point clouds are divided S401 into bins of a desired size (e.g. 20 cm) in the forward direction and a median (or mean) value of each populated bin is computed.
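- A sketch of these two steps, back-projecting annotated lane pixels with the camera intrinsic matrix and the depth map and then taking the per-bin median along the forward direction (assuming a pinhole camera model with depth given along the camera z-axis; binning is done in the camera frame before any conversion to C_road):

```python
import numpy as np

def lane_pixels_to_3d(uv, depth, K):
    """Back-project N pixel coordinates (N, 2) with depths (N,) to (N, 3) camera points."""
    ones = np.ones((uv.shape[0], 1))
    rays = np.linalg.inv(K) @ np.hstack([uv, ones]).T  # (3, N) normalized rays
    return (rays * depth).T                            # scale each ray by its depth

def bin_lane_points(points, bin_size=0.2):
    """Median 3D point per bin of size bin_size (meters) along the forward (z) axis."""
    bins = np.floor(points[:, 2] / bin_size).astype(int)
    return np.array([np.median(points[bins == b], axis=0) for b in np.unique(bins)])
```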
- as a filtering process S404 for outliers, one can for example use the density-based clustering algorithm DBSCAN. Additionally, one can employ RANSAC to filter any outliers that DBSCAN missed.
- the method S400 may comprise combining S402 the 2D polyline lane annotations 412 and projected depth maps 413 to construct 3D lane point clouds for each lane instance using the camera intrinsic matrix. Once this is done, the 3D position of each lane instance may be determined for each lane.
- a filtering step S404 may be employed to remove outliers with DBSCAN and RANSAC.
- since the depth maps from the accumulated LiDAR point clouds 413 are likely to include some noise, there may be some need for filtering S404 the 3D lane point clouds for outliers. Due to the regularity of the lanes' geometry one can assume that the lanes' lateral and vertical position change slowly over short distances in the forward direction. In order to leverage this assumption, the lateral and vertical axes may be scaled up such that DBSCAN is more sensitive to changes in these dimensions. Therefore, if some points of a lane have a different vertical or lateral position than the most nearby points (in the forward direction), these are classified as outliers. DBSCAN is effective in filtering outliers when the density of the outliers is much smaller than the density of the inliers.
- DBSCAN may advantageously be applied in two different ranges separately. Once in the close range (e.g. any range or subrange between 0 and 40 meters from the vehicle) and once in the far range (e.g. any range or subrange between 40 and 100 meters from the vehicle), using different hyper parameters for the algorithm to match the general density of the lanes in both regions. Thereafter, RANSAC may be applied to filter any outliers that DBSCAN may have missed.
- RANSAC instead leverages the fact that lanes locally are well estimated by lines and removes any points that deviate too much from the best-fit line. Hence RANSAC can remove dense clusters of faulty points. Moreover, RANSAC may be applied separately at different distance intervals in the forward direction (e.g. 0-33 meters, 33-67 meters, and >67 meters). This may be advantageous to ensure that not many points are removed simply due to lane curvature or other violations of a straight-line assumption.
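- The outlier filtering may, under these assumptions, be sketched as follows using scikit-learn's DBSCAN (with the lateral and vertical axes scaled up) followed by a RANSAC line fit of the lateral and vertical coordinates against the forward coordinate (parameter values are illustrative only, and the per-interval application of DBSCAN and RANSAC described above is omitted for brevity):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.linear_model import RANSACRegressor

def filter_lane_points(points, scale=(5.0, 1.0, 5.0), eps=1.0, min_samples=5):
    """Remove outliers from a lane point cloud (N, 3) with columns (x lateral, y forward, z vertical)."""
    # DBSCAN on scaled points: scaling up x and z makes the clustering more
    # sensitive to lateral/vertical deviations, as discussed above.
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points * np.asarray(scale))
    points = points[labels != -1]  # drop points classified as noise

    # RANSAC: fit x and z as (locally) linear functions of the forward coordinate y
    # and keep only the points that are inliers of both fits.
    inliers = np.ones(len(points), dtype=bool)
    for column in (0, 2):
        ransac = RANSACRegressor().fit(points[:, [1]], points[:, column])
        inliers &= ransac.inlier_mask_
    return points[inliers]
```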
- Fig. 5 illustrates a schematic flowchart representation of a method S500 for semi-supervised training of an artificial neural network configured for 3D lane detection in accordance with some embodiments.
- the method comprises initializing the artificial neural network by setting some appropriate initialization values. Alternatively, one can use an already initialized ANN.
- the method S500 comprises training S502 the ANN on a labelled dataset (e.g. a dataset generated by the method depicted in Fig. 4 ) until convergence is reached.
- a labelled dataset e.g. a dataset generated by the method depicted in Fig. 4
- the method S500 further comprises making S503 predictions on unlabelled data in order to find matches between predicted/generated 3D lane boundaries in each image pair (as described in the foregoing in reference to Figs. 2 and 3 ). The found matches will then define how the ANN should be updated (penalized/rewarded) with the spatio-temporal consistency loss during training on this unlabelled data. Furthermore, the method S500 comprises selecting or adding S504 the unlabelled dataset for which the ANN found a sufficient number of matches (e.g. above a threshold) to the training set, and training S505 the ANN in a semi-supervised fashion based on the labelled and the unlabelled data (i.e. the formed training set). The training set accordingly comprises the labelled and the "cherry-picked" unlabelled data.
- the method S500 then comprises iterating steps S503, S504 and S505 until the unlabelled dataset is exhausted or until no further improvements to the ANN are made.
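- Pseudocode-like Python for the iterative scheme S500 is sketched below (the helper functions and the match-count threshold are hypothetical placeholders, not prescribed by the present disclosure):

```python
def semi_supervised_training(model, labelled_ds, unlabelled_pairs, min_matches=2):
    """Sketch of the semi-supervised scheme S500; helpers are placeholders."""
    train_until_convergence(model, labelled_ds)      # S501/S502: initialize and pre-train
    training_set = list(labelled_ds)
    while unlabelled_pairs:
        selected, remaining = [], []
        for pair in unlabelled_pairs:
            matches = predict_and_match(model, pair)  # S503: predict and match 3D lanes
            if len(matches) >= min_matches:           # S504: keep pairs with enough matches
                selected.append((pair, matches))
            else:
                remaining.append(pair)
        if not selected:
            break                                     # no further improvement expected
        training_set += selected
        unlabelled_pairs = remaining
        train_one_round(model, training_set)          # S505: semi-supervised training round
    return model
```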
- Executable instructions for performing these functions are, optionally, included in a non-transitory computer-readable storage medium or other computer program product configured for execution by one or more processors.
- Fig. 6 is a schematic side-view of a vehicle 1 comprising an apparatus 10 for training an artificial neural network configured for 3D lane detection based on unlabelled image data from a vehicle-mounted camera 7.
- the apparatus 10 comprises control circuitry 11 configured to generate, by means of the artificial neural network, a first set of 3D lane boundaries in a first coordinate system based on a first image captured by the vehicle-mounted camera.
- the control circuitry 11 is further configured to generate, by means of the artificial neural network, a second set of 3D lane boundaries in a second coordinate system based on a second image captured by the vehicle mounted camera.
- the second image is captured at a later moment in time compared to the first image and the first image and the second image contain at least partly overlapping road portions.
- control circuitry 11 is configured to transform at least one of the second set of 3D lane boundaries and the first set of 3D lane boundaries based on positional data associated with the first image and the second image, such that the first set of 3D lane boundaries and the second set of 3D lane boundaries have a common coordinate system.
- the positional data may for example be obtained from a localization system 5 (e.g. a GPS unit) of the vehicle 1.
- the control circuitry 11 is further configured to evaluate the first set of 3D lane boundaries against the second set of 3D lane boundaries in the common coordinate system in order to find matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- the control circuitry 11 is configured to update one or more model parameters of the artificial neural network based on a spatio-temporal consistency loss between the found matching lane pairs of the first set of 3D lane boundaries and the second set of 3D lane boundaries.
- the training of the ANN may accordingly be employed in a federated learning scheme where the local updates to the model parameters of the ANN may be transmitted to a central entity connected to a plurality of vehicles, where they may be consolidated in order to perform a "global" update of the ANN based on the received local updates.
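- In such a federated scheme the central entity may, for example, consolidate the locally updated model parameters by simple averaging (a FedAvg-style sketch; the present disclosure does not prescribe a particular aggregation rule):

```python
import torch

def aggregate_local_updates(local_state_dicts):
    """Average locally updated model parameters into a new global state dict."""
    global_state = {}
    for name in local_state_dicts[0]:
        global_state[name] = torch.stack(
            [state[name].float() for state in local_state_dicts]
        ).mean(dim=0)
    return global_state
```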
- the apparatus 10 need not necessarily be provided in a vehicle, but may be employed either in a remote server, or in a plurality of remote servers, for instance in a plurality of servers in communication with each other, a so-called cloud solution.
- the apparatus 10 may be in the form of (e.g., implemented as) a central service, a cloud service (provided by a cloud environment, e.g., a cloud computing system, comprising one or more remote servers, e.g., distributed cloud computing resources).
- the necessary training set (labelled and/or unlabeled image data) may accordingly be provided in a memory device connected to the apparatus.
- the vehicle 1 further comprises a perception system 6 and a localization system 5.
- a perception system 6 is in the present context to be understood as a system responsible for acquiring raw sensor data from on-board sensors such as cameras, LIDARs, RADARs and ultrasonic sensors, and converting this raw data into scene understanding.
- the vehicle 1 has at least one vehicle-mounted camera 6 for capturing images of (at least a portion of) a surrounding environment of the vehicle.
- the localization system 5 is configured to monitor a geographical position and heading of the vehicle, and may be in the form of a Global Navigation Satellite System (GNSS), such as a GPS. However, the localization system may alternatively be realized as a Real Time Kinematics (RTK) GPS in order to improve accuracy.
- the vehicle 1 may have access to a digital map (e.g. a HD-map), either in the form of a locally stored digital map or via a remote data repository accessible via an external communication network 2 (e.g. as a data stream).
- the access to the digital map may for example be provided by the localization system 5.
- the vehicle 1 may be connected to external network(s) 2 via for instance a wireless link (e.g. for transmitting and receiving one or more model parameters).
- the same or some other wireless link may be used to communicate with other vehicles in the vicinity of the vehicle or with local infrastructure elements.
- Cellular communication technologies may be used for long-range communication such as to external networks and, if the cellular communication technology used has low latency, it may also be used for communication between vehicles, vehicle-to-vehicle (V2V), and/or vehicle-to-infrastructure (V2X).
- Examples of cellular radio technologies are GSM, GPRS, EDGE, LTE, 5G, 5G NR, and so on, also including future cellular solutions.
- Wireless Local Area Network (WLAN) solutions may also be used for communication over shorter ranges.
- ETSI is working on cellular standards for vehicle communication and for instance 5G is considered as a suitable solution due to the low latency and efficient handling of high bandwidths and communication channels.
- a non-transitory computer-readable storage medium storing one or more programs configured to be executed by one or more processors of a vehicle control system, the one or more programs comprising instructions for performing the method according to any one of the above-discussed embodiments.
- a cloud computing system can be configured to perform any of the methods presented herein.
- the cloud computing system may comprise distributed cloud computing resources that jointly perform the methods presented herein under control of one or more computer program products.
- a computer-accessible medium may include any tangible or non-transitory storage media or memory media such as electronic, magnetic, or optical media, e.g., a disk or CD/DVD-ROM coupled to the computer system via a bus.
- tangible and non-transitory are intended to describe a computer-readable storage medium (or “memory”) excluding propagating electromagnetic signals, but are not intended to otherwise limit the type of physical computer-readable storage device that is encompassed by the phrase computer-readable medium or memory.
- the terms “non-transitory computer-readable medium” or “tangible memory” are intended to encompass types of storage devices that do not necessarily store information permanently, including for example, random access memory (RAM).
- Program instructions and data stored on a tangible computer-accessible storage medium in non-transitory form may further be transmitted by transmission media or signals such as electrical, electromagnetic, or digital signals, which may be conveyed via a communication medium such as a network and/or a wireless link.
- the processor(s) 11 may be or include any number of hardware components for conducting data or signal processing or for executing computer code stored in memory 12.
- the device 10 has an associated memory 12, and the memory 12 may be one or more devices for storing data and/or computer code for completing or facilitating the various methods described in the present description.
- the memory may include volatile memory or non-volatile memory.
- the memory 12 may include database components, object code components, script components, or any other type of information structure for supporting the various activities of the present description. According to an exemplary embodiment, any distributed or local memory device may be utilized with the systems and methods of this description.
- the memory 12 is communicably connected to the processor 11 (e.g., via a circuit or any other wired, wireless, or network connection) and includes computer code for executing one or more processes described herein.
- the sensor interface 14 may also provide the possibility to acquire sensor data directly or via dedicated sensor control circuitry 6 in the vehicle.
- the communication/antenna interface 13 may further provide the possibility to send output to a remote location (e.g. remote operator or control centre) by means of the antenna 5.
- some sensors in the vehicle may communicate with the apparatus 10 using a local network setup, such as CAN bus, I2C, Ethernet, optical fibres, and so on.
- the communication interface 13 may be arranged to communicate with other control functions of the vehicle and may thus be seen as control interface also; however, a separate control interface (not shown) may be provided.
- Local communication within the vehicle may also be of a wireless type with protocols such as WiFi, LoRa, Zigbee, Bluetooth, or similar mid/short range technologies.
- parts of the described solution may be implemented either in the vehicle, in a system located external to the vehicle, or in a combination internal and external to the vehicle; for instance in a server in communication with the vehicle, a so-called cloud solution.
- image data may be sent to an external system and that system performs the methods disclosed in the foregoing.
- the different features and steps of the embodiments may be combined in other combinations than those described.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
EP21191605.1A (EP4138048A1, de) | 2021-08-17 | 2021-08-17 | Training of 3D lane detection models for automotive applications
US17/884,716 (US20230074419A1, en) | 2021-08-17 | 2022-08-10 | Training of 3D lane detection models for automotive applications
CN202210988020.9A (CN115705720A, zh) | 2021-08-17 | 2022-08-17 | Training of 3D lane detection models for automotive applications
Publications (1)
Publication Number | Publication Date
---|---
EP4138048A1 (de) | 2023-02-22
Family
ID=77367295
Legal Events
Date | Code | Description
---|---|---
| PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase (original code: 0009012)
| STAA | Status: the application has been published
| AK | Designated contracting states (kind code of ref document: A1): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR
| STAA | Status: request for examination was made
2023-08-16 | 17P | Request for examination filed
| RBV | Designated contracting states (corrected): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR