WO2020156923A2

WO2020156923A2 - Map and method for creating a map

Info

Publication number: WO2020156923A2
Application number: PCT/EP2020/051634
Authority: WO
Inventors: Vassilios PANAGIOTOU; Roland Preiss; Pavol MICHALIK; Johannes W.H. RABE
Original assignee: Harman Becker Automotive Systems Gmbh
Priority date: 2019-01-30
Filing date: 2020-01-23
Publication date: 2020-08-06
Also published as: WO2020156923A3; DE112020000590T5

Abstract

A method comprises capturing at least two images with an image unit, wherein each of the at least two images is captured from a different position of the image unit, wherein a first section of the surroundings of the image unit is captured by each of the at least two images, and wherein a section of an elongated landmark is captured in the first section, and generating at least two observation planes, wherein each observation plane is essentially orthogonal to the elongated landmark and crosses the elongated landmark at a certain point

Description

MAP AND METHOD FOR CREATING A MAP

BACKGROUND

1. Technical Field

[0001] The disclosure relates to a map and a method for creating such map, in particular a map for advanced driver-assistance systems (ADAS) and highly automated driving (HAD).

2. Related Art

[0002] The position of a point landmark in a 3D space can generally be triangulated when the respective point has been observed in camera images from multiple viewpoints. For this purpose, it is generally necessary to determine which observed points correspond to each other (“data association”). Different approaches to determine the position and orientation of short line segments are known and are generally implemented in similar ways. Such approaches generally work well for line segments with visible ends and/or when the camera is not moving along these segments. However, if the camera (which e.g. is build into a moving vehicle) is moving along a landmark of large extension in the moving direction, this becomes difficult. Landmarks of this type could be polylines or curves, such as lane-lines (e.g. the road markings separating two drivable road lanes), or the road boundary, or a surface with limited lateral extension, such as the road surface or a lane-line whose width is relevant. Therefore, there is a need for a reliable and robust map and a method for creating such a map which allows the sampling of a 3D space at desired positions and which allows to determine the intersection of any landmarks having a large extension in at least one dimension, even when moving along the landmark.

SUMMARY

[0003] A method includes capturing at least two images with an image unit, wherein each of the at least two images is captured from a different position of the image unit, wherein a first section of the surroundings of the image unit is captured by each of the at least two images, and wherein a section of an elongated landmark is captured in the first section, and generating at least two observation planes, wherein each observation plane is essentially orthogonal to the elongated landmark and crosses the elongated landmark at a certain point.

[0004] A method for generating a map includes generating a local map in response to a triggering event, uploading the local map to a central server, ingesting the local map into the central server and linking the local map with at least one additional information, and incorporating the local map into a global map stored on the central server, thereby creating an updated global map.

[0005] Other systems, methods, features and advantages will be, or will become, apparent to one with skill in the art upon examination of the following detailed description and appended figures. It is intended that all such additional systems, methods, features and advantages be included within this description, be within the scope of the invention, and be protected by the following claims.

BRIEF DESCRIPTION OF THE DRAWINGS

[0006] The system may be better understood with reference to the following drawings and description. In the figures, like referenced numerals designate corresponding parts throughout the different views.

[0007] Figure 1 is a local map fragment comprising landmarks, sequence poses and relationships among them.

[0008] Figure 2 is a schematic representation of a parametric MAP Estimator system. [0009] Figure 3 is a schematic overview of a mapping pipeline.

[0010] Figure 4 is an example of a lane detection input.

[0011] Figure 5 exemplarily illustrates a dense map representation.

[0012] Figure 6 exemplarily illustrates a sparse map representation. [0013] Figure 7 exemplarily illustrates another sparse map representation.

[0014] Figure 8 exemplarily illustrates an observation of a straight road from two different camera positions.

[0015] Figure 9 exemplarily illustrates a lane-line as observed from a first camera position, a corresponding image in an image plane and a corresponding lane-line wedge.

[0016] Figure 10 exemplarily illustrates a lane-line as observed from a first and a second camera position and corresponding observation planes.

[0017] Figure 11 schematically illustrates different multi lane scenarios.

[0018] Figure 12 schematically illustrates estimated lane-lines in parts of different KITTI raw sequences.

[0019] Figure 13 schematically illustrates reprojected and intersected observations of line segments in an observation plane.

DETAILED DESCRIPTION

[0020] Advanced driver-assistance systems (ADAS) are systems that help the driver of a vehicle in the driving process. Some ADAS comprise a safe human-machine interface, which is intended to increase car safety and, more generally, road safety. Many road accidents occur due to human error. Advanced driver-assistance systems are systems developed to automate, adapt and enhance vehicle systems for safety and better driving. Advanced driver assistance systems generally reduce road fatalities, by minimizing the human error. Many safety features are designed to avoid collisions and accidents by offering technologies that alert the driver to potential problems, or to avoid collisions by implementing safeguards and taking over control of the vehicle. Adaptive features may automate lighting, provide adaptive cruise control and collision avoidance, pedestrian crash avoidance mitigation (PCAM), incorporate satnav/traffic warnings, connect to smartphones, alert a driver to other vehicles or dangers, provide lane departure warning, or automatic lane centering, or indicating when objects such as other vehicles are located in the blind spots. Highly automated driving (HAD) forms a link between assisted and autonomous driving. The driver is not only supported but also individual driving tasks are adapted to meet the needs of specific situations. Differences between ADAS and HAD are gradual. Advanced driver assistance systems, highly automated driving as well as automated driving applications generally require up-to-date and accurate maps.

[0021] A map may generally be defined as a digital representation of the real environment such that the contained information can be used for various purposes, e.g., visualization, or localization. Different kinds of maps are generally known which differ from each other by the type of the content and the creation and deployment models. So- called feature maps describe the environment in terms of landmarks. Landmarks are either an abstract representation of real world objects (so-called semantic landmarks) or they represent salient properties of real world objects as observed through (vision) sensors, e.g., various types of salient image features (e.g., ORB, SURF, SIFT, etc.). Landmarks generally have a geometric representation and are placed in a common coordinate system. Further, landmarks satisfy certain types of mutual geometric constraints such as, e.g., incidence, coincidence, co-planarity, parallelism.

[0022] A map can generally be created ahead-of-time by a specialized map provider or online map service (e.g., google maps®, TomTom®, ViaMichelin®, etc.). Usually, the provider owns the map data and also the pipeline to deploy the maps. Alternatively, maps can be created on the fly by a Simultaneous Localization and Mapping (SLAM) system integrated in the vehicle or other mobile agent (e.g., other vehicles). In this latter case, the mapping pipeline may consist of the following steps: (1) collect the map data from a fleet of agents, (2) create a global map model from a fusion of compatible local maps (3) distribute the global map to the fleet and (4) let the agent use the global map as additional sensor, i.e. source of information for SLAM. Such maps may also be referred to as crowd- sourced reference maps.

[0023] Such crowd-sourced maps generally offer a lot of commercial opportunities and plenty of challenges in technology and science. In the following, some of the challenges will be described in further detail. Specifically, the terminology used herein will be finalized and the architecture of the mapping pipeline will be described in further detail. Further, a method is described for tracking and mapping lane-line markings. The results achieved with this method will be described and it will be outlined how this method may be generalized for other mutually constrained landmarks such as road surface, road markings or traffic signs.

[0024] Maps may be distinguished in local maps and global maps. Local maps are built by a fleet of mobile agents from monocular images and position information. That is, e.g., of a plurality of cameras that are each mounted to one of a plurality of vehicles, each camera captures one or more monocular images and provides this information to a central server which creates local maps from the plurality of captured images. The resulting local maps are sparse, though intermediate dense or hybrid representations may exist internally. A global map may result from a fusion process of local map fragments uploaded by the plurality of agents. The global map is sparse. It consists of geometric representations of landmarks which represent real world objects.

[0025] Every mobile agent may have a physical copy of the global map (reference map) on board (e.g., stored on a storage device of a vehicle). The reference map serves as another source of observations for the mobile agent who“mapmatches” the observations from other sensors with this map. The agents run an on-board Visual Semantic SLAM service which simultaneously estimates a map and a sequence of 6-DOF (six degrees of freedom) poses over a period of time. In the present example, it is assumed that the plurality of agents are each equipped with low cost sensors. For example, in a minimal configuration each agent (vehicle) may comprise a reference map, a monocular camera, and a global positioning system (GPS). The local map estimation generally is a continuous process. The uploads of local maps are discrete events initiated by a predefined set of rules or by an explicit command of a fleet operator. The format of the local map corresponds roughly to the graph illustrated in Figure 1. These map fragments represent the relevant parts of the local maps and are used as observations in the global map conflation process. The stored relations between the landmarks L and sequence of poses P (so-called observation relations) are helpful for resolving data association ambiguities during the global map fusion process. Typically, local map fragments have a size of up to a few mega bytes. Each fragment is associated with metadata which identifies the region and the time interval of the estimation. This will be described in further detail below. [0026] It will be assumed that the model for global map estimation is very similar to the estimation model for local maps. For example, the Parametric maximum a posteriori (MAP) Estimation framework may be used, to estimate both categories of maps, as will be described in further detail below.

[0027] In the following, a sensor reading in an estimation system represents an observation. The relationship between estimated variables (here: the map and the poses) and an observation is referred to as observation model. The set of sensors of a mobile agent is not fixed. For each new sensor type a new observation model may be defined and added to the system. The software implementation of such a system is referred to as a Parametric MAP Estimator, as schematically illustrated in Figure 2 Figure 2 illustrates a schematic representation of the parametric MAP Estimator system: The raw observations Z are augmented by semantic feature extractors yielding semantic observations Z0. In a parameterizable tracking and mapping process the model variables X are estimated from observations Z and Z0 via the observation model h which defines the relationship between model variables and observations.

[0028] The attribute parametric refers to the ability of describing the estimator by a set of parameters which define the sensors and the observation models, the precision and accuracy data, used solver engine, the execution environment, threading model etc. It is assumed that the distinct observation models hi compose into a globally valid relationship Z = h(X) such that the MAP probability P(Xj|Z) can be calculated. Such probability calculations are generally known and will not be described in further detail herein. The applied approach to factor graph based formulation of the MAP problem will be described in further detail below. The map conflation cloud-service which calculates the global map XG as a MAP estimate given the local maps (= observations) ZLi is an instance of the parametric MAP estimator. Thus, this piece of software enables deployment of a function in a cloud service as well as in an embedded system. In the following, the architecture and the data-flow model of the mapping pipeline will be briefly described.

[0029] The map ingestion, conflation and distribution pipeline may be conceived as a distributed cloud application. It may consist of a set of mutually independent services which expose APIs (application programming interfaces) for data ingress/egress and for consumption and raising of domain events. The invocation of a service may be controlled either by a domain event (e.g.“new agent was added”) or by an explicit query (e.g. “retrieve all fragments for a specified region”). The services can be deployed on any public cloud or run on-premise assuming that proper outscaling technologies are available. The workflow is schematically illustrated in Figure 3. [0030] Local maps upload: The agents upload local maps according to the configuration. Typical events which cause data upload are, e.g.,“end of trip reached”, “amount of data in local map exceeds limit”,“operator override for earliest possible upload”, etc..

[0031] Local maps ingestion: The data ingress into a distributed eventually consistent storage system. The payload (aka local map) is stored along with obligatory attributes (“region”,“time range”) and a set of optional attributes. The data is indexed by region and time-range in a volume grid-like schema, the so-called spatiotemporal columns.

[0032] Local and global map data fusion: The map fusion service either queries the spatiotemporal database for new map fragments or reacts to domain event“new fragment uploaded for region X/time range Y” and updates all compatible regions of the global map. The global map is indexed following the same spatiotemporal column format.

[0033] Global map distribution: The distribution of the updated global map follows either the“push” model in which the data consumers (= agents) register for notification about available updates, or the agents query the map distribution service for existing updates.

[0034] Global map update: Finally, the agents replace the reference map used for localization by an updated version of a global map.

[0035] The functionality of the whole pipeline relies on robust tracking and mapping of semantic landmarks from geo-referenced images. For autonomous ground vehicles which travel on public roads, the most significant semantic features are the road surface, road boundaries, lane-lines or road markings. These landmarks are omnipresent in urban and highway scenarios, do not undergo very frequent changes and are relatively easy to model. Therefore, in the following, a method will be described for observing and estimating lane-lines. It will further be described, how this method generalizes for other types of“continuous” features.

[0036] As compared to widely used point landmark maps based on salient features, different environment and observation models may be used together with a different method for data association for landmarks that have a physical extent in at least one dimension, like lane-lines or road surface. Several approaches are known, wherein edges are described by edgelet landmarks. These are parts of an edge described by a 3D point and a 3D direction. While they seem well-suited for edges with clear boundaries such as occur in indoor scenarios and building surfaces, the association problem is less clear for curves which are mostly traveled along rather than across. Other approaches focus on indoor cases, or consider street scenarios as well, but maps mostly line segments on building fronts. Other methods propose mapping the road boundary using a particle filter, but limited to the two-dimensional case. Other methods are known which use a GraphSLAM approach to estimate lanes based on camera images and radar objects but also do not consider road height either. Other approaches propose mapping of road- markings such as arrows, diamonds, and dashes or the complete road surface. Localization on a map containing lines on the road surface such as lane-markings, curbs, and stop lines is generally known. Some maps model dashed lines with the beginning and end of each segment. The map generation, however, may be based on a setup containing multiple cameras and a laser scanner. One method for 6 DOF pose and 3D curve estimation is known which, however, relies on an IMU and a stereo camera.

[0037] The method described herein for the mapping of lane-line structures is built on the so-called ORB-SLAM system that estimates the camera poses xLi and the position of point landmarks lLj based on ORB features using a graph optimization approach, both in local world coordinates (denoted by L). In the following, adaptations that enable the system to consider position information and lane-lines in the SLAM process are discussed.

[0038] Formulating SLAM as a MAP problem and solving it using graph optimization methods is widely accepted. All parameters to be estimated, e.g. camera poses and landmark positions, are interpreted as the nodes of a graph. Measurements from all available sensors are added as edges between the corresponding nodes that represent constraints; for example, the observation of a landmark j in a certain camera frame associated with camera pose i connects the nodes i and j and contains the information where in the image the landmark has been observed. Any edge is associated with an error function ey. Then, the error over all edges is minimized: (i)

rmation matrix, i.e the inverse of the covariance matrix of the measurement connecting the nodes i and j.

[0039] This optimization problem is in general nonlinear and multiple libraries exist that allow for efficient solving of problems of this type using the Gauss-Newton or Levenberg- Marquardt algorithm while making use of the typical sparsity of the resulting graph.

[0040] Position information is available with 6 degrees of freedom in geodetic coordinates. The first position measurement is used to define the transformation TLG from geodetic to local Euclidean coordinates. The translational part t_{p0S i} of the positional measurement Z_{p0S i} associated with the camera pose estimate

can be used to formulate a constraint in the graph based on the error function

where is the translational part of the camera pose x .

[0041] A deep learning-based computer vision approach is used to detect lane-lines in the video image. The artificial neural network (ANN) outputs pixel masks which are then abstracted to line segments in the image space. The ANN component is located in the“Extractors” layer of the Parametric MAP Estimator, see Figure 2. An Example output for one image frame from KITTI raw sequence 201 l_09_30_drive_0016 is depicted in Figure 4. Multiple lane-lines are detected but e.g. the right boundary is only partially detected, the border of the keepout area is not an independent region but distorts the near left boundary. On the far left, an arrow on the road surface is detected as a lane-line. [0042] Maps may generally also be distinguished in sparse maps and dense maps, for example. “Sparse” in this context refers to the amount of landmarks found per volumetric unit of the map coordinate system. In general, in a sparse map the number of semantic landmarks per unit may be orders of magnitude lower than the number of landmarks of salient image features, which results in the sparse vs. dense distinction. Sparse maps are generally understandable for humans and may include road-related features such as traffic signs, lane-lines, or road markings, for example. Sparse maps generally require advanced matching methods. Sparse maps may be reduced in residential areas, and may be very complex in downtown areas. Sparse maps are often useful in highway and rural situations. An example of a sparse map is schematically illustrated in Figure 6. Figure 6 illustrates a top view of a single street. Two vehicles are traveling on the street in different directions. The street has a plurality of different lanes that are separated by lane-lines 204. The lane-lines 204, which represent the landmarks of interest in the example of Figure 6, are marked in solid lines. Other landmarks of interest may be road signs, as is exemplarily illustrated in Figure 7. In particular, different points of the road sign which define the form and type of the road sign may represent the landmarks 206.

[0043] Dense maps, on the other hand may include salient features such as corners, or edges, for example. This is exemplarily illustrated in Figure 5. Different landmarks 202 (salient features) are marked in the representation illustrated in Figure 5 such as corners or edges of buildings, windows, vehicles, sidewalks, etc., which match with descriptors. Dense maps are frequently used for urban situations. However, dense maps are rarely found in highway and rural situations, as they are comparably difficult to triangulate.

[0044] Both sparse maps and dense maps have certain drawbacks and weaknesses. Dense maps, for example, may require large storage space and bandwidth. The content of dense maps is not directly related to driving situations, as dense maps describe corners and edges of buildings, sidewalks, etc., instead of lane boundaries, speed limits, etc. Dense maps often contain landmarks in the environment that are influenced by several factors such as time and season, for example. Dense maps often are not intuitive and are not easily understandable for humans, and are therefore much harder to maintain as compared to sparse maps. Sparse maps, on the other hand, often are less robust and reliable than dense maps, especially in areas with only few semantic landmarks.

[0045] As will be described in the following, the weaknesses of dense maps and sparse maps may be eliminated or at least reduced by combining sparse and dense maps and providing a hybrid map which combines the strengths of standalone dense or sparse maps, eliminates their weaknesses, and therefore provides an improved, superior and complete substrate for many of the autonomous driving functions. The weaknesses of dense maps may be eliminated by combining it with a sparse map. This may result in a reduction of the storage and bandwidth requirements, as sparse maps generally provide more information for the required functions such as highways or rural roads, for example. The robustness of the dense maps improves when taking into account the semantic map landmarks of the sparse map. The hybrid map in general is much more intuitive and understandable for humans as dense maps. By combining sparse maps with dense maps, on the other hand, the map becomes more reliable and robust concerning the mapping and the localization, even in areas with only few semantic landmarks.

[0046] The hybrid map that will be described in the following combines sparse maps including point landmarks with dense maps including semantic landmarks. Optionally, the hybrid map may further comprise extensions. The two mutually compatible, but orthogonal, map types dense maps, which consist of 3D point clouds reconstructed from salient visual features, and sparse maps, which consist of geometric representations (e.g., 3D or 2D geometric representations) of semantic features representing real world objects such as traffic signs, road surface markings (e.g. lane lines) or the road surface, are considered as the substantial ingredients of the hybrid map. For example, the positions of such semantic features may be represented 3 -dimensionally, while other characteristics of the semantic features may be represented 2-dimensionally (e.g., the front side of a traffic sign may be represented as a flat object without the pole it is mounted on). Hybrid maps serve as an essential part of the corresponding systematic approach to modeling, tracking and (re)mapping of features, which enables the integration of corresponding observation models into various SLAM processes. In addition, the extension of quality indicators are considered, which indicate the content distribution and quality of the hybrid map, according to the mapped region and application needs. Various autonomous driving applications such as, e.g., localization, pose estimation, may utilize the hybrid maps for their needs.

[0047] The position of a point landmark in 3D space can be triangulated when the respective point has been observed in camera images from multiple viewpoints. For this purpose, it is generally necessary to find which observed points correspond to each other (“data association”). Several different approaches for determining the position and orientation of short line segments are known which each function in a similar way. Such approaches generally are well suited for line segments with visible ends and/or when the camera is not moving along these segments. If the camera (which, e.g., is build into a moving vehicle) is moving along a landmark of large extension (elongated landmark) in the moving direction, however, this may become difficult. Landmarks of this type could be polylines or curves, such as lane-lines (e.g. the road-markings separating two drivable road lanes), or a road boundary, or a surface with limited lateral extent, such as the road surface or a lane-line whose width is relevant, for example. With such landmarks, the association of the observed part of the landmark is difficult, as is schematically illustrated in Figure 8.

[0048] Figure 8 schematically illustrates a camera 300. The camera 300 may be mounted on a vehicle (not illustrated) that is moving in a driving direction. The camera 300 may capture a first image at a first point in time tl, and a second image at a second point in time t2. Between the first point in time tl and the second point in time t2, the camera position moves in the driving direction, depending on the speed of the vehicle. The images captured at different points in time, therefore, depict different sections of the surroundings. For example, the first picture may capture the environment within a distance dl of the first position of the camera 300, and the second picture may capture the environment within a distance d2 of the second position of the camera 300.

[0049] If the camera 300 moves constantly along a straight road segment, the images captured by the camera 300 at different points in time may look identical for multiple camera positions. However, the observed part of the lane-lines is different. Often, it is not possible to determine which part of the complete lane-line is observed if the camera pose is not perfectly known in advance. [0050] This problem is generally not limited to perfectly straight roads but also arises for all road segments with a constant curvature if the camera 300 is moving along the segment. If the curvature changes, such an association is possible but usually not very stable. If the camera 300 moves with a certain (non-zero) angle with respect to such lane lines, the different positions can be calculated by means of difficult processes that do not necessarily allow for proper handling of observation uncertainties. The same happens for surfaces such as the road surface.

[0051] The main difference between mapping point landmarks and polyline segments is the data association. Through feature descriptors, it is possible to associate which key point from one frame could correspond to a key point in another frame. When driving on a straight road and detecting lane boundaries in the image frame, the detected line segments might look exactly equal from multiple positions - even though from every position, a slightly different segment of the real-world lane boundary is detected (see Figure 8).

[0052] A key point in an image is the projection of a point in the 3D world that lies anywhere on a ray starting at the optical axis of the camera and passing through its 2D point in the image plane. Observing the same point from multiple camera positions allows for intersecting the corresponding rays to estimate its 3D position (triangulation). Similarly, a straight line segment in an image is the projection of a line segment in the 3D world that lies somewhere within a wedge, i.e. the part of a plane between the two rays corresponding to its endpoints (see, e.g., Figure 9). However, its orientation within the wedge is unclear. Furthermore, intersecting multiple wedges is not useful as it is not clear whether multiple observations correspond to the same part of the line in the 3D world - as depicted in Figure 8. To make this problem tractable, the world is sampled using observation planes. Observation planes are virtual planes, or parts of planes with limited extension, that are standing close to vertically in the vehicle’s path and are passed by the vehicle while driving. Observation planes allow for sampling the 3D space at certain positions and find the intersection of any landmark with large extension in one dimension with the observation plane. The orientation of the observation planes may be close to orthogonal or orthogonal to the landmarks to be observed, e.g., vertical and across the road for lane-lines or the road surface. The general idea of observation planes is depicted in Figure 10. Assuming the camera poses are well known and further assuming ideal lane line observations, from camera pose A, the white part of the complete lane-line can be observed - due to field of view, maximum detection distance, etc. not the complete lane line is detectable. From this, it is known that the real lane-line lies somewhere within the wedge starting from camera pose A. Analogously, a different segment of the lane-line is observed from camera pose B.

[0053] When considering one of the observed lane-lines 302 as shown in Figure 8, it can be seen that any observed line segment in 2D image space is the projection of a line segment in the 3D world that lies somewhere within a part of a plane that is bounded by the rays on which the 3D correspondences of its endpoints lie. This is visualized in Figure 10 and denoted as“wedge” in the following. If one perfectly knew the ground surface and camera poses, one could reconstruct the observed line segment (white part observed from camera pose A) in 3D already. However, neither the ground surface nor the camera position are known.

[0054] When observing a certain segment multiple times, using observation planes, and having an initial estimate for the pose, the intersections of the reprojections of multiple observations with the observation planes (yellow solid and red dashed parts of the observation planes in 10) can be found. It is known that the actual position of the intersection of the real lane-line with the observation plane (if they intersect) lies within both of these intersections of reprojections, it is further known that the actual position of the intersection must lie in the intersecting region of these intersections of reprojections. By having multiple observations from different distances to the observation planes, the resulting intersecting region reduces and the estimate for the position becomes more accurate.

[0055] For this motivation, only the side-view may be considered, reducing the observations and observation planes to ID-line segments. Actually, the observations are line segments in 2D and the observation planes 2D surfaces. If the camera is moving perfectly parallel to perfectly straight structures, the same happens and many observations need to be correctly associated to reduce the region within which the actual lane-line point lies. For any other cases, the reprojected lane-line observations intersect in a single point - if the observations were perfect and the camera poses were perfectly known, compare Figure 12. As neither is the case, they tend to mutually intersect in points within a certain region.

[0056] When estimating the actual camera poses and lane-line points in an optimization problem, one can formulate a constraint that all reprojected lane-line observations are supposed to intersect in a single point - i.e. that each of the estimated lane-line points is the point that has a minimal distance to the reprojected lane-line observations. Solving this optimization problem will estimate the camera poses and lane line points that agree best with the observations. The actual lane-line can then be reconstructed using the points sampled in the observation planes. [0057] In the observation problem, the uncertainties of the observations should ideally be considered. These uncertainties need to follow the transformation into the observation planes as well, i.e. observations from larger distance are considered with higher uncertainty.

[0058] The described method can further be used for: other polyline or curve landmarks with large extent in one direction observations of such landmarks that are more generic than line segments (e.g. polylines, curves, splines, pixel masks) surface landmarks with large extent in one direction, e.g. road surfaces. If a model is available for the shape of the actual landmark, e.g. that it is smooth, this information can be introduced as additional constraints in the optimization problem. [0059] For association of multiple observations of such landmarks, i.e. for finding which observations correspond to the same lane-line, the reprojected observations may be clustered in the observation planes. As for the uncertainties in the optimization problem, the uncertainties of the observations - but also the uncertainties of the initial pose estimates - may be considered in the association step. Observation planes can be placed in front of the camera in expectation that the camera will move through it and/or observe landmarks that intersect with the observation plane.

[0060] Alternatively, observation planes may be placed at the current camera pose estimate. This may be beneficial if a model is available for the pose of the camera above ground, e.g. if the camera is mounted on a moving vehicle and the camera height and orientation in standstill are known. Then this information can be used in the association method and as initial estimates of the landmark positions.

[0061] To stabilize the estimation of multiple coplanar landmarks with large extent in one direction, one can consider using additional information such as road-markings such as direction arrows, HOV (High-occupancy vehicle lane) diamonds etc. which are known to coincide with the ground surface, salient features detected within the observed road surface, or the height of other detected objects e.g. from other traffic participants.

[0062] Observations in a 2D camera image and their uncertainties can be compared to observations in polar coordinates. By reprojecting and intersecting with a flat observation plane, the shape of a previously symmetric uncertainty (a probability density function) changes and becomes asymmetric. This effect can be reduced by replacing observation planes with general observation surfaces such as spheres or cylinders.

[0063] A virtual plane, e.g., observation plane 1, is arranged in front of both camera poses A and B (its second dimension is along the vehicle y-axis). Observation plane 1 is defined by the transformation T_PiL E M^{4 4} that transforms a 3D point in homogeneous representation from local world coordinates x^L to observation plane coordinates x^pl and the plane z^pl = 0 in these coordinates. Intersecting both wedges with observation plane 1 leads to the red dashed and orange solid parts of the observation plane illustrated in Figure 10, respectively (i.e. the images of the lane-line observations in the observation plane). It can be seen that the actual lane-line passes through the observation plane in the area where both line segments overlap. The same holds for observation plane 2. With more observations from more different poses, the overlapping section can be reduced and the lane-point, i.e. the actual point of the lane-line intersecting the observation plane can be reconstructed. In the 2D observation plane, these line segments are typically oriented somewhat diagonally and, when a vehicle is not driving perfectly parallel to a perfectly straight lane-line, intersect in a single point. In the degenerate case - parallel to a straight line -, the images of the lane-line observations are segments of the same line and do in general not restrict the lane-point to a single point.

[0064] The challenge in data association is generally twofold: Observations may have to be sorted out where the detected lane-line does not actually intersect the observation plane, e.g. when the ground at the observation plane is outside the field-of- view or occluded. Additionally, it may be necessary to associate line segments with line segments, where it is not immediately apparent whether they correspond to the same part of a longer line or curve. Due to these considerations and the possible degenerate case, the camera extrinsic parameters may be used for data association: As height and roll in standstill are generally known and vary only slightly during driving, the position of the ground plane at the observation plane can be estimated if the observation plane coincides with the camera pose. As lane-lines coincide with the ground, only line segments in the observation plane that intersect with the estimated ground plane are considered for data association, which can then be performed in ID along the ground plane estimate. A simple hierarchical clustering algorithm may be employed based on the Mahalanobis distance.

[0065] All images of line segments in the observation plane that correspond to the same part of the line segment in the world intersect in a single point if the line detections and pose estimates are perfect. This intersection point can be interpreted as the 2D point with the smallest distance to all line segments. In the case of real line detections and pose estimates, the images of the line segments intersect in multiple distinct points. Therefore, it may be desirable to find the point that best fits to the observations. Hence, the constraint imposed by a line segment observation z_{( k} ^l onto the camera pose x_t and the lane-point position ΐz uses the distance from the lane-point to the image of the line segment in the observation plane as its error function

where di (p, 1) is the distance vector from point p onto the line segment 1. The lane-point position

is projected into camera coordinates by

where K_t E M^{3 4} is the projection matrix for the camera at pose i, and z_c the normalization factor for the z-component of the homogeneous coordinates. The error is weighted using an information matrix based on the measurement covariance of the observed line segment and the recovered covariance of the pose estimate.

[0066] To prove the suitability of the algorithm in practice, experiments were performed on sequences from the KITTI raw data set. From this data, the left greyscale camera and the almost ideal OxTS (Oxford Technical Solutions) position were used. These experiments were performed within an ORB-SLAM environment. ORB-SLAM is a versatile and accurate SLAM solution for Monocular, Stereo and RGB-D cameras. It is able to compute in real-time the camera trajectory and a sparse 3D reconstruction of the scene in a wide variety of environments, ranging from small hand-held sequences of a desk to a car driven around several city blocks. It is able to close large loops and perform global relocalisation in real-time and from wide baselines. It includes an automatic and robust initialization from planar and non-planar scenes. Therefore, the pose graph contains binary constraints between pose and ORB landmarks, pose and lane-points, and unary position constraints for the poses. To properly model the information matrix for the lane-point constraints, the uncertainty of the pose estimates is required. While g²o (g²o is an open-source C++ framework for optimizing graph-based nonlinear error functions) provides means to recover covariances from the graph, practice has shown that this procedure is not fast enough to determine updated covariances in every iteration and lane line data association step when using the available solvers. Hence, a fixed information matrix was used for the lane-point constraints for these experiments. In the multi-lane scenario 2011_09_30_drive_0016, it is possible to detect and map up to 7 lane-lines, including the markings on the side of a bicycle crossing, see also Figure 11a) with a single drive. Except for the far left (top) lane-line, the estimated points are reasonably close to the markings in the aerial image. Parts of the boundary of the keepout area can also be modeled; the ends are missing as they are not distinguished from the straight lane (beginning) or only rarely detected (end), respectively.

[0067] Figure 11 schematically illustrates estimated lane-lines in parts of different KITTI raw sequences. Cyan: Lane-lines reconstructed from lane-points (yellow), magenta: vehicle trajectory.

[0068] Due to the vehicle moving very straight along a straight road, it comes close to the degenerate case in this scenario: The images of the line segments in the observation plane are collinear rather than intersecting in a single point. Hence, the system can only trust its initial height estimate based on the ground assumption - despite this not being explicitly modeled. This is the reason for the far left lane-line being shifted by around half a lane: The banking on that part of the road is different than at the ego-lane, and hence the lane-points are initialized too high, leading to the lateral offset. This effect occurs similarly, but less significantly, also on the second and third lane from the left. As highways are an interesting scenario to have mapped lane-lines, the algorithm has also been run on scenario 201 l_10_03_drive_0042 of Figure 11 b). This is the KITTI raw equivalence of KITTI odometry sequence 01 - for which difficulties are known to arise in running it reliably and in real-time. With an increased number of features, reduced playback speed, and support from position and lane-line information, reliable results may be achieved at least for the beginning of the sequence. However, it may be noted that the pose tracking and motion estimation part of ORB-SLAM were not changed and hence also fail in case of too few useful ORB features.

[0069] In this scenario, the estimated lane-lines initially agree quite well with the satellite image and model the curve well, see Figure 11 b), where the vehicle starts on the top right. After the curve, the association is partially off and leads to some outliers in the lane-line estimate. Large parts of the estimated lane-lines resemble the shape of the lane lines in the aerial image but are shifted to the left (bottom). Another multi-lane scenario is sequence 2011_09_26_drive_0051 (see Figure 12). Due to some redesign of this intersection, it was compared with older, less clear satellite images, see Figure 12. Most available lane-lines are mapped quite accurately if they can be detected by the vision frontend. Again, the far left (top) lane exhibits an offset, which might also be due to an error in the ground plane estimate.

[0070] Figure 12 illustrates estimated lane-lines in a part of a KITTI raw sequence. Blue: Lane-lines reconstructed from lane-points, vehicle trajectory from left to right.

[0071] A multi-agent semantic mapping pipeline has been presented herein that allows for flexible sensor setups in the agents using a parametric MAP estimator. The exemplary generation of local lane-line maps requires as only input position and monocular camera. The ability of the algorithm may be proven on KITTI data by extending the pose graph ORB-SLAM system. The method can map up to seven lane- lines in a single drive. Significant improvements in accuracy may be expected by not only relying on a ground plane assumption but instead incorporating other road surface elements such as arrows, asphalt imperfections, ends of dashes etc. into the estimation. Using the proposed observation planes, this is possible in a straightforward manner: After defining the model of any surface-related landmarks, the observation models for the semantic observations Z’ and the constraints defined by co-incidence, co-planarity etc. can be defined. The additional observations lead to a robust estimation of the 3D road surface and stabilize any landmarks that are related to it. Regarding the lane-line SLAM, the highly accurate OXTS data from KITTI was used raw as position input for the experiments on real data. The algorithm, however, may also be used based on low-cost GPS input. A high accuracy reference map may be generated, to evaluate the online lane line estimation as well as the overall pipeline in a quantitative manner.

[0072] A cloud-based pipeline for creation of a globally consistent 3D map from crowd-sourced local maps has been described above. The local maps are created from a stream of monocular images and position data. Two mutually compatible but orthogonal map types have been considered: dense maps which consist of 3D point clouds reconstructed from salient visual features, and sparse maps which consist of 3D geometric representation of semantic features representing real world objects such as traffic signs, road surface markings (e.g. lane lines) or the road surface. A systematic approach has been presented to modeling, tracking and mapping of semantic features, which enables integration of the corresponding observation models into a GraphSLAM process. Furthermore, a pipeline has been described for ingestion, conflation and distribution of maps suitable to serve vehicle fleets on any scale.

[0073] Autonomous Driving Hybrid Maps are the combination of dense (point landmarks) and sparse maps (semantic landmarks) with optional extensions. These two mutually compatible but orthogonal map types are considered, namely dense maps which consist of 3D point clouds reconstructed from salient visual features, and sparse maps which consist of 3D geometric representations of semantic features representing real world objects such as traffic signs, road surface markings (e.g. lane-lines) or the road surface, as the ingredients of the hybrid map. Hybrid maps may serve as an essential part of the corresponding systematic approach to modeling, tracking and (re)mapping of features, which enables the integration of corresponding observation models into various SLAM processes. In addition the extension of quality indicators is considered, which indicate the content distribution and quality of the hybrid map, according to the mapped region and application needs. Various autonomous driving applications such as e.g. localization, pose estimation, may utilize the hybrid maps for their needs.

[0074] The method described above may be encoded in a computer-readable medium such as a CD ROM, disk, flash memory, RAM or ROM, an electromagnetic signal, or other machine-readable medium as instructions for execution by a processor. Alternatively or additionally, any type of logic may be utilized and may be implemented as analog or digital logic using hardware, such as one or more integrated circuits (including amplifiers, adders, delays, and filters); or one or more processors executing amplification, adding, delaying, and filtering instructions; or in software in an application programming interface (API) or in a Dynamic Link Library (DLL), functions available in a shared memory or defined as local or remote procedure calls; or as a combination of hardware and software.

[0075] The method may be implemented by software and/or firmware stored on or in a computer-readable medium, machine-readable medium, propagated- signal medium, and/or signal-bearing medium. The media may comprise any device that contains, stores, communicates, propagates, or transports executable instructions for use by or in connection with an instruction executable system, apparatus, or device. The machine- readable medium may selectively be, but is not limited to, an electronic, magnetic, optical, electromagnetic, or infrared signal or a semiconductor system, apparatus, device, or propagation medium. A non-exhaustive list of examples of a machine-readable medium includes: a magnetic or optical disk, a volatile memory such as a Random Access Memory “RAM,” a Read-Only Memory“ROM,” an Erasable Programmable Read-Only Memory (i.e., EPROM) or Flash memory, or an optical fiber. A machine-readable medium may also include a tangible medium upon which executable instructions are printed, as the logic may be electronically stored as an image or in another format (e.g., through an optical scan), then compiled, and/or interpreted or otherwise processed. The processed medium may then be stored in a computer and/or machine memory. [0076] The systems may include additional or different logic and may be implemented in many different ways. A controller may be implemented as a microprocessor, microcontroller, application specific integrated circuit (ASIC), discrete logic, or a combination of other types of circuits or logic. Similarly, memories may be DRAM, SRAM, Flash, or other types of memory. Parameters (e.g., conditions and thresholds) and other data structures may be separately stored and managed, may be incorporated into a single memory or database, or may be logically and physically organized in many different ways. Programs and instruction sets may be parts of a single program, separate programs, or distributed across several memories and processors. The systems may be included in a wide variety of electronic devices, including a cellular phone, a headset, a hands-free set, a speakerphone, communication interface, or an infotainment system.

[0077] The description of embodiments has been presented for purposes of illustration and description. Suitable modifications and variations to the embodiments may be performed in light of the above description or may be acquired from practicing the methods. For example, unless otherwise noted, one or more of the described methods may be performed by a suitable device and/or combination of devices. The described methods and associated actions may also be performed in various orders in addition to the order described in this application, in parallel, and/or simultaneously. The described systems are exemplary in nature, and may include additional elements and/or omit elements.

[0078] As used in this application, an element or step recited in the singular and proceeded with the word“a” or“an” should be understood as not excluding plural of said elements or steps, unless such exclusion is stated. Furthermore, references to“one embodiment” or“one example” of the present disclosure are not intended to be interpreted as excluding the existence of additional embodiments that also incorporate the recited features. The terms“first,”“second,” and“third,” etc. are used merely as labels, and are not intended to impose numerical requirements or a particular positional order on their objects.

[0079] While various embodiments of the invention have been described, it will be apparent to those of ordinary skilled in the art that many more embodiments and implementations are possible within the scope of the invention. In particular, the skilled person will recognize the interchangeability of various features from different embodiments. Although these techniques and systems have been disclosed in the context of certain embodiments and examples, it will be understood that these techniques and systems may be extended beyond the specifically disclosed embodiments to other embodiments and/or uses and obvious modifications thereof.

Claims

CLAIMS:

1. A method comprising:

capturing at least two images with an image unit, wherein each of the at least two images is captured from a different position of the image unit, wherein a first section of the surroundings of the image unit is captured by each of the at least two images, and wherein a section of an elongated landmark is captured in the first section;

generating at least two observation planes, wherein each observation plane is essentially orthogonal to the elongated landmark and crosses the elongated landmark at a certain point.

2. The method of claim 1, wherein the image unit is a monocular camera.

3. The method of claim 1 or 2, wherein the image unit is mounted on a vehicle.

4. The method of claim 3, wherein the vehicle passes each of the at least two observation planes while moving along.

5. The method of claim 3 or 4, wherein the vehicle moves along a road, and wherein each of the at least two observation planes is arranged vertical to and across the road.

6. The method of any of claims 1 to 5, wherein a segment of the elongated landmark is the projection of a line segment in the 3D world that lies within a wedge.

7. The method of claim 6, wherein the wedge is formed by a plane between two rays extending between the image unit and the edges of the corresponding image.

8. The method of any of claims 6 and 7, wherein the orientation of the segment of the elongated landmark is determined by the generation of the at least two observation planes.

9. The method of any of the preceding claims, wherein each of the at least two observation planes crosses the elongated landmark at a certain point which is visible in each of the at least two images.

10. A method for generating a map, the method comprising:

generating a local map in response to a triggering event;

uploading the local map to a central server;

ingesting the local map into the central server and linking the local map with at least one additional information; and

incorporating the local map into a global map stored on the central server, thereby creating an updated global map.

11. The method of claim 10, wherein the at least one additional information includes at least one of a region where the local map was generated, and a time when the local map was generated.

12. The method of claim 10 or 11, wherein the local map is generated by a vehicle system.

13. The method of any of claim 12, further comprising

providing the updated global map to the vehicle system.

14. The method of claim 13, further comprising replacing a global map stored within the vehicle system with the updated global map.

15. The method of any of claims 10 to 14, wherein the local map comprises information obtained from a monocular image.

16. The method of any of claims 10 to 15, wherein the global map is created by fusing a plurality of uploaded local map fragments.

17. The method of any of claims 10 to 16, wherein the triggering event comprises at least one of reaching the end of a trip, the amount of data in a local map exceeding a specified limit, and operator override for earliest possible upload.

18. The method of any of claims 10 to 17, wherein the local map is generated as a hybrid map including point landmarks and semantic landmarks.

19. The method of claim 18, wherein the hybrid map is generated combining characteristics of a dense map and characteristics of a sparse map.

20. The method of claim 19, wherein the characteristics of a dense map comprise 3D point clouds reconstructed from salient visual features, and wherein the characteristics of a sparse map comprise geometric representations of semantic features representing real world objects.