WO2022141240A1 - Determining vehicle positions for autonomous driving based on monocular vision and semantic map - Google Patents

Determining vehicle positions for autonomous driving based on monocular vision and semantic map

Info

Publication number
WO2022141240A1
Authority
WO
WIPO (PCT)
Prior art keywords
camera
objects
determining
map
captured image
Prior art date
Application number
PCT/CN2020/141587
Other languages
French (fr)
Inventor
Cansen JIANG
Liang HENG
Original Assignee
SZ DJI Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by SZ DJI Technology Co., Ltd.
Priority to PCT/CN2020/141587
Publication of WO2022141240A1 publication Critical patent/WO2022141240A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30244 Camera pose
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • G06T 2207/30256 Lane; Road marking

Definitions

  • the present disclosure relates generally to self-driving technology and self-driving vehicles and, more particularly, to systems, apparatus, and methods for determining vehicle positions for autonomous driving based on camera data and semantic map data in real time.
  • Self-driving technology is capable of sensing the surrounding environment and generating real-time instructions to safely drive a movable object, such as a self-driving vehicle, with little or no human interaction.
  • the self-driving vehicle can be equipped with one or more sensors to gather information from the environment, such as radar, LiDAR, sonar, camera (s) , global positioning system (GPS) , inertial measurement units (IMU) , and/or odometry, etc. Based on various sensory data obtained from the one or more sensors, the self-driving vehicle needs to determine real-time position and generate instructions for navigation.
  • One aspect of the present disclosure provides a method for determining position information of an autonomous vehicle.
  • the method includes identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle.
  • the image includes at least a portion of an environment surrounding the vehicle during the movement.
  • the method also includes retrieving position data associated with one or more predetermined objects from a map of the environment.
  • the one or more predetermined objects correspond to the one or more objects identified in the captured image.
  • the method further includes determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
  • Another aspect of the present disclosure provides an apparatus. The apparatus includes one or more processors, and memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to perform operations including identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle.
  • the image includes at least a portion of an environment surrounding the vehicle during the movement.
  • The operations also include retrieving position data associated with one or more predetermined objects from a map of the environment.
  • the one or more predetermined objects correspond to the one or more objects identified in the captured image.
  • The operations further include determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
  • Another aspect of the present disclosure provides a non-transitory computer-readable medium with instructions stored thereon that, when executed by a processor, cause the processor to perform operations including identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle.
  • the image includes at least a portion of an environment surrounding the vehicle during the movement.
  • The operations also include retrieving position data associated with one or more predetermined objects from a map of the environment.
  • the one or more predetermined objects correspond to the one or more objects identified in the captured image.
  • The operations further include determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
  • FIG. 1 shows an exemplary environment for applying self-driving technology, in accordance with embodiments of the present disclosure.
  • FIG. 2 shows a block diagram of an exemplary apparatus configured in accordance with embodiments of the present disclosure.
  • FIG. 3 shows a block diagram of an exemplary process of determining a position of an autonomous vehicle, in accordance with embodiments of the present disclosure.
  • FIG. 4A shows an exemplary diagrammatic representation of an image captured by a camera during movement of a vehicle, in accordance with embodiments of the present disclosure.
  • FIG. 4B shows an exemplary diagrammatic representation of an image processed based on the captured image of FIG. 4A, in accordance with embodiments of the present disclosure.
  • FIG. 5A shows an exemplary diagrammatic representation of an image captured by a camera during movement of a vehicle, in accordance with embodiments of the present disclosure.
  • FIG. 5B shows exemplary pixels associated with lane lines in a bird's eye view transformed from pixels corresponding to lane lines, in accordance with some embodiments of the present disclosure.
  • FIG. 5C shows exemplary lines extracted and parameterized from pixels corresponding to lane lines in a bird's eye view to represent lane lines in a captured image, in accordance with some embodiments.
  • FIG. 5D shows an exemplary diagrammatic representation of an image in a perspective view including lines transformed from extracted lines in the bird's eye view corresponding to lane lines and curb lines, in accordance with embodiments of the present disclosure.
  • FIG. 6 is an exemplary diagrammatic representation of an image for matching lane lines extracted from a captured image and lane lines obtained from HD map data in a camera coordinate system, in accordance with some embodiments of the present disclosure.
  • FIG. 7 is an exemplary diagrammatic representation of an image for matching light poles extracted from a captured image and light poles obtained from HD map data and projected into a camera view image, in accordance with some embodiments of the present disclosure.
  • FIG. 8 is an exemplary map generated and updated based on real-time position of an autonomous vehicle determined during movement, in accordance with some embodiments of the present disclosure.
  • FIG. 9 shows a flow diagram of an exemplary process of determining a position of a camera onboard an autonomous vehicle, in accordance with some embodiments of the present disclosure.
  • Self-driving technology requires determining positions of self-driving vehicles in real time.
  • For example, the Global Positioning System (GPS) is commonly used to provide location information, but its accuracy (e.g., 2-3 meters, or more than 10 meters in some environments) is generally insufficient for self-driving applications.
  • Some self-driving technology uses a LiDAR system installed on the self-driving vehicles to provide location information with much higher accuracy (e.g., in a range of several centimeters) .
  • the cost for a LiDAR system is high.
  • Self-driving technology may also have high requirements for data storage and maintenance to handle large amounts of data in high precision maps, such as a three-dimensional (3D) point cloud map.
  • maps used in LiDAR technology may include about 200 MB of data per kilometer on the map.
  • large amounts of data require systems with high computing power and/or special computing capability, such as graphics processing units (GPUs) for parallel computing, which can greatly increase the cost of the system.
  • some self-driving technology can only determine the vehicle positions with three degrees of freedom (3 DOF) including x, y, and yaw, but cannot estimate other pose information, such as z, roll, and pitch.
  • the self-driving technology described herein includes comparing real-time object (s) information obtained in the images captured by the camera with off-line information associated with the corresponding object (s) in the semantic maps for determining pose information of the self-driving vehicle.
  • Methods disclosed herein include object metadata indexing, image transformation (e.g., between a perspective view and a bird's-eye view), and vanishing point alignment (e.g., lane line matching).
  • the methods do not need to consider complex road conditions and driving conditions, and the objects on the road do not need to be parameterized (e.g., represented by parametric equations) , thus improving the robustness and accuracy of the methods or algorithms implementing the methods disclosed herein.
  • the apparatus and systems described herein do not require expensive hardware.
  • a monocular camera is less expensive than a LiDAR system.
  • The methods provide an accuracy of about 10 cm, which is substantially higher than that of a GPS system (e.g., several meters to ten meters).
  • the methods described herein determine real-time position using objects in the semantic map along with topology information, such as lane lines, road sign outlines, and street light poles, which usually involve a small amount of data, such as 10 KB per kilometer on the map. As such, it is more convenient for data storage, real-time map loading and updating, and data transmission such as via various wireless communication network (s) .
  • the methods, apparatus, and systems described herein can estimate six degrees of freedom (6 DOF) pose information of the self-driving vehicle by a decoupling method.
  • Each group of the 6 DOF pose information can be determined separately and independently from the other pose information.
  • This decoupling method can provide lower algorithm complexity and lower power consumption, with sufficient accuracy to provide accurate global pose information for navigation and route planning of the self-driving vehicle.
  • FIG. 1 shows an exemplary environment 100 for applying self-driving technology (also known as autonomous driving or driverless technology) , in accordance with embodiments of the present disclosure.
  • environment 100 includes a movable object, such as a self-driving vehicle or a vehicle 102 (also known as an autonomous vehicle or driverless vehicle) , that is capable of communicatively connecting to one or more electronic devices including a mobile device 140 (e.g., a mobile phone) , a server 110 (e.g., cloud-based server) , and one or more map provider servers 130-1, ... 130-k via a network 120 in order to exchange information with one another and/or other additional devices and systems.
  • environment 100 includes a road 101 on which vehicle 102 autonomously moves, and one or more stationary objects on or along road 101 that can be used for determining positions, such as pose information, of vehicle 102.
  • the one or more objects can be used as landmarks that are not moving or changing relative to road 101.
  • the one or more stationary objects may be included in a map (e.g., a commercial or publicly accessible map of an area) including road 101.
  • Some examples of stationary objects include a road sign 103, a lane line 104, light poles 106, buildings, trees, etc., as shown in FIG. 1.
  • a driving scene on road 101 in FIG. 1 may be created based on an image captured by a camera onboard vehicle 102 and one or more items of positional information of vehicle 102 relative to the captured image.
  • the driving scene may be generated based on a map of an area including road 101.
  • one or more map provider servers 130-1, ... 130-k may be associated with one or more service providers that can provide map data, such as high definition (HD) maps, used for navigating vehicle 102.
  • An HD map may include multiple layers of content including a geometric layer, such as a 3D point cloud map representing geometric information of a surrounding environment, and a semantic map layer including various types of traffic-related objects used for navigating vehicle 102, such as lane lines 104, road sign 103, light poles 106, intersections, traffic lights, etc.
  • The objects may contain metadata indicating other parameters associated with the objects, such as speed limits or other restrictions.
  • the HD map may further include a real-time traffic layer including traffic information such as traffic conditions, speeds, highway checkpoints, etc. The multiple map layers may be aligned in the 3D space to provide detailed navigation information.
  • network 120 may be any combination of wired and wireless local area network (LAN) and/or wide area network (WAN) , such as an intranet, an extranet, and the internet. Any suitable communication techniques can be implemented by network 120, such as local area network (LAN) , wide area network (WAN) (e.g., the Internet) , cloud environment, telecommunications network (e.g., 3G, 4G, 5G) , WiFi, Bluetooth, radiofrequency (RF) , infrared (IR) , or any other communications techniques. In some embodiments, network 120 is capable of providing communications between one or more electronic devices, as discussed in the present disclosure.
  • vehicle 102 is capable of transmitting data (e.g., image data, positional data, and/or motion data) detected by one or more sensors onboard vehicle 102, such as a camera 107 (e.g., a monocular camera) , an odometer, and/or inertial measurement unit (IMU) sensors, in real time during movement of vehicle 102, via network 120, to mobile device 140 and/or server 110 that are configured to process the data.
  • a camera 107 onboard vehicle 102 may capture images while vehicle 102 moves on road 101, as shown in FIG. 1.
  • Vehicle 102 may retrieve semantic maps from one or more map provider servers 130-1, ... 130-k.
  • vehicle 102 may, while moving, transmit the captured images, positional data, and/or motion data in real-time to mobile device 140 and/or server 110 via network 120 for processing.
  • Mobile device 140 and/or server 110 may obtain semantic map data from one or more map provider servers 130-1, ... 130-k via network 120, and further determine the pose information of vehicle 102.
  • the determined pose information of vehicle 102 can be used to generate instructions for autonomous driving.
  • the determined pose information of vehicle 102 and the autonomous driving instructions can be communicated in real-time among vehicle 102, mobile device 140, and/or cloud-based server 110 via network 120.
  • the autonomous driving instructions can be transmitted in real time from mobile device 140 and/or cloud-based server 110 to vehicle 102.
  • vehicle 102 includes a sensing system which may include one or more onboard sensors (not shown) .
  • the sensing system may include sensors for determining positional information, velocity information, and acceleration information relating to vehicle 102 and/or target locations or objects (e.g., obstacles) .
  • Components of the sensing system may be configured to generate data and information for use (e.g., processed by an onboard controller or another device in communication with vehicle 102) in determining additional information about vehicle 102, its components, and/or its targets.
  • The sensing system may include sensory devices such as a positioning sensor for a positioning system (e.g., GPS, GLONASS, Galileo, Beidou, GAGAN, RTK, etc.).
  • the sensing system may also include sensors configured to provide data or information relating to the surrounding environment, such as weather information (e.g., temperature, pressure, humidity, etc. ) , lighting conditions (e.g., light-source frequencies) , air constituents, or nearby obstacles (e.g., objects, buildings, trees, people, other vehicles, etc. ) .
  • Camera 107 is configured to gather data that may be used to generate images or videos of the surrounding environment. As disclosed herein, image data obtained from camera 107 may be processed and compared with object information extracted from a semantic map to determine pose information of vehicle 102.
  • camera 107 includes a photographic camera, a video camera, an infrared imaging device, an ultraviolet imaging device, an x-ray device, an ultrasonic imaging device, or a radar device. Camera 107 may be a monocular camera. Camera 107 may include a wide-angle lens.
  • vehicle 102 includes a plurality of cameras that are placed on multiple sides, such as front, rear, left, and right sides, of vehicle 102. The images captured by the cameras facing different sides of vehicle 102 may be stitched together to form a wide-angle view (e.g., a panoramic view or a 360° view) of the surrounding environment.
  • camera 107 may be directly mounted to vehicle 102, such as fixedly connected, fastened, attached, rigidly connected, or placed in another way to be firmly connected and not readily movable relative to vehicle 102. Camera 107 may be aimed in a direction that can capture views of objects on the road, such as lane lines and/or light poles, that can be used for determining pose information of vehicle 102. In some embodiments, camera 107 may be connected or attached to vehicle 102 via a carrier (not shown) , which may allow for one or more degrees of relative movement between camera 107 and vehicle 102.
  • the carrier may be adjustable or movable in accordance with movement of vehicle 102 so as to capture a view including one or more objects used for determining the pose information of vehicle 102 in real time during the movement of vehicle 102.
  • When camera 107 is attached to vehicle 102 via a carrier, a relative position between camera 107 and vehicle 102 can be determined, so that pose information of one of camera 107 and vehicle 102 can be determined based on pose information of the other.
  • the position of camera 107 is determined using image (s) captured by camera 107 to determine the pose information of vehicle 102.
  • the position of camera 107 can be used to represent the position of vehicle 102 in computing and generating instructions used for autonomous driving. As such, the position of camera 107 and the position of vehicle 102 may be used interchangeably.
  • vehicle 102 includes a communication system 150 that may be configured to enable communication of data, information, autonomous driving instructions, and/or other types of signals between an onboard controller of vehicle 102 and one or more off-board devices, such as mobile device 140, server 110, map provider server (s) 130, or another suitable entity.
  • Communication system 150 may include one or more onboard components configured to send and/or receive signals, such as receivers, transmitters, or transceivers, that are configured for one-way or two-way communication.
  • the onboard components of communication system 150 may be configured to communicate with off-board devices via one or more communication networks, such as radio, cellular, Bluetooth, Wi-Fi, RFID, and/or other types of communication networks usable to transmit signals indicative of data, information, commands, and/or other signals.
  • communication system 150 may be configured to enable communication with off-board devices, such as server 110 and/or mobile device 140, for providing autonomous driving instructions or other commands (e.g., to override the autonomous driving instructions during an emergency situation) for controlling vehicle 102.
  • vehicle 102 includes an onboard controller (not shown) that is configured to communicate with various devices onboard vehicle 102, such as communication system 150, camera 107, and other sensors.
  • the onboard controller may also communicate with a positioning system (e.g., a global navigation satellite system (GNSS) , GPS, or odometer, etc. ) to receive data indicating the location of vehicle 102.
  • the onboard controller may communicate with various other types of devices, including a barometer, an inertial measurement unit (IMU) , a transponder, or the like, to obtain positioning information and velocity information of vehicle 102.
  • the onboard controller may also provide control signals for controlling the movement of vehicle 102.
  • the onboard controller may include circuits and modules configured to process image data captured by camera 107 and/or perform other functions discussed herein.
  • Although the movable object is illustrated in the present disclosure using vehicle 102 as an example, the movable object could instead be provided as any other suitable object, device, mechanism, system, or machine configured to travel on or within a suitable medium (e.g., surface, air, water, rails, space, underground, etc.).
  • the movable object may also be another type of movable object (e.g., wheeled objects, nautical objects, locomotive objects, other aerial objects, etc. ) .
  • vehicle 102 refers to a self-driving vehicle configured to be operated and/or controlled autonomously based on data collected by one or more sensors (e.g., camera 107, IMU, and/or an odometer, etc. ) onboard vehicle 102 and semantic map data (e.g., obtained from map provider server (s) 130) .
  • Vehicle 102 may be configured to receive manual instructions under certain circumstances (e.g., a dangerous road condition or an emergency situation, etc.) from an onboard or off-board operator.
  • one or more off-board devices may be configured to receive and process image (s) captured by camera 107, and other data such as positional data, velocity data, acceleration data, sensory data, and information relating to vehicle 102, its components, and/or its surrounding environment.
  • the off-board device (s) can generate and communicate signals associated with autonomous driving to the onboard controller of vehicle 102.
  • the off-board devices can include a cellular phone, a smartphone, a tablet, a personal digital assistant, a game console, a mobile device, a wearable device, a virtual reality (VR) /augmented reality (AR) headset, a laptop computer, a cloud computing server, or any other suitable computing device.
  • the off-board device (s) may be configured to perform one or more functionalities or sub-functionalities associated with autonomous driving in addition to or in combination with vehicle 102.
  • server 110 may participate in image processing and algorithm computing to facilitate the process for determining vehicle position.
  • the off-board device (s) may include one or more communication devices, such as antennas or other devices, configured to send and/or receive signals via network 120 and can sufficiently support real-time communication with vehicle 102 with minimum latency.
  • In some embodiments, an off-board device, such as mobile device 140, may include a display device. The display device may be a multifunctional display device configured to display information as well as receive user input, such as via an interactive graphical user interface (GUI) for receiving one or more user inputs.
  • The off-board device(s) (e.g., mobile device 140), server 110, or vehicle 102 may also include a display device configured to display position data and/or a navigation path of vehicle 102 in conjunction with a map received from map provider server 130 to show the real-time location and movement of vehicle 102 on the map.
  • the display device may be an integral component, e.g., attached or fixed, to the corresponding device.
  • The display device may be electronically connectable to (and disconnectable from) the corresponding device (e.g., via a connection port or a wireless communication link) and/or otherwise connectable to the corresponding device via a mounting device, such as by a clamping, clipping, clasping, hooking, adhering, or other type of mounting device.
  • the off-board device (s) may also include one or more input devices configured to receive input (e.g., audio data containing speech commands, user input on a keyboard or a touch screen, body gestures, eye gaze controls, etc. ) from a user, and generate instructions communicable to the onboard controller of vehicle 102.
  • the off-board device (s) may be used to receive user inputs of other information, such as manual control settings, automated control settings, control assistance settings, and/or photography settings.
  • the off-board devices can generate instructions based on the user input and transmit the instructions to the onboard controller to manually control vehicle 102 (e.g., to override autonomous driving instructions in emergency) . It is understood that different combinations or layouts of input devices for an off-board device are possible and within the scope of this disclosure.
  • FIG. 2 shows an exemplary block diagram of an apparatus 200 configured in accordance with embodiments of the present disclosure.
  • apparatus 200 can be included in one of the devices discussed with reference to FIG. 1, such as vehicle 102, mobile device 140, or server 110.
  • Apparatus 200 includes one or more processors 202 for executing modules, programs, and/or instructions stored in a memory 212 and thereby performing predefined operations, one or more network or other communications interfaces 208, and one or more communication buses 210 for interconnecting these components.
  • Apparatus 200 may also include a user interface 203 comprising one or more input devices 204 (e.g., a keyboard, mouse, touchscreen, microphone, physical sticks, levers, switches, wearable apparatus, touchable display, and/or buttons) and one or more output devices 206 (e.g., a display or speaker) .
  • When apparatus 200 is included in vehicle 102, apparatus 200 also includes a sensor system 207 onboard vehicle 102, including camera 107, an odometer, a GPS, and/or inertial measurement unit (IMU) sensors, etc., as described herein.
  • Processors 202 may be any suitable hardware processor, such as an image processor, an image processing engine, an image-processing chip, a graphics processing unit (GPU), a microprocessor, a micro-controller, a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Memory 212 may include high-speed random access memory, such as DRAM, SRAM, or other random access solid state memory devices.
  • memory 212 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices.
  • memory 212 includes one or more storage devices remotely located from processor (s) 202.
  • Memory 212 or alternatively one or more storage devices (e.g., one or more nonvolatile storage devices) within memory 212, includes a non-transitory computer readable storage medium.
  • memory 212 or the computer readable storage medium of memory 212 stores one or more computer program instructions (e.g., modules 220) , and a database 240, or a subset thereof that are configured to cause processor (s) 202 to perform one or more steps of processes, as described below with reference to FIGs. 3-9.
  • Memory 212 may also store map data obtained from map provider server (s) 130, and image (s) captured by camera 107.
  • memory 212 of apparatus 200 may include an operating system 214 that includes procedures for handling various basic system services and for performing hardware dependent tasks.
  • Apparatus 200 may further include a network communication module 216 that is used for connecting apparatus 200 to other electronic devices via communication interface 208 and one or more communication networks 120 (wired or wireless) , such as the Internet, other wide area networks, local area networks, metropolitan area networks, etc. as discussed with reference to FIG. 1.
  • Modules 220 each comprise program instructions for execution by processor(s) 202 to perform a variety of functions. More particularly, modules 220 include an image obtaining and processing module 222 configured to receive and process image(s) captured by camera 107 onboard vehicle 102. For example, image obtaining and processing module 222 may be configured to parse the captured image(s) to identify one or more object(s) based on metadata associated with pixels of the object(s). Image obtaining and processing module 222 may be configured to extract and parameterize visual representations in the captured image(s), such as lines representing lane lines on the road.
  • Image obtaining and processing module 222 may also be configured to transform a perspective view of a captured image to a bird's eye view, or transform the bird's eye view back to the perspective view.
  • modules 220 include a map obtaining and processing module 224 configured to receive HD map (s) for autonomous navigation from a map provider server 130, and identify objects from a semantic layer of the HD map corresponding to those found in the captured image (s) .
  • Modules 220 include a position determination module 230, including a plurality of sub-modules for determining multiple groups of pose information separately and independently, including but not limited to a height (z) determination module 231, a pitch and yaw determination module 232, a horizontal pose (x) and roll determination module 233, and a vertical pose (y) determination module 234.
  • modules 220 include a sensor fusion module 226 configured to perform sensor fusion process (es) on the determined pose information based on sensory data obtained from multiple sensors onboard vehicle 102, such as IMU, odometer, GPS etc.
  • Sensor fusion includes a process of merging data obtained from different sources to provide a global position result for vehicle 102 with reduced error and noise, and with improved accuracy and certainty.
  • latitude, longitude, and altitude of vehicle 102 at different locations and at different time points can be determined based on image data and map semantic data as described for process 300.
  • GPS data can also be used to track and calculate the latitude, longitude, and altitude of vehicle 102 independently.
  • the latitude, longitude, or altitude, separately or in combination, from these two sets of position data obtained from different sources can be fused correspondingly to provide more robust and accurate latitude, longitude, or altitude values for vehicle 102.
  • The attitude information, including roll, yaw, and pitch of vehicle 102, can be determined based on image data and map semantic data as described for process 300.
  • IMU data can also be used to calculate the attitude information.
  • Sensor fusion module 226 can apply any suitable algorithms, such as the Central Limit Theorem, a Kalman Filter, Bayesian Networks, or a convolutional neural network (CNN), etc.
  • modules 220 also include an instruction generation module 228 configured to generate instructions, including autonomous driving instructions and manual control commands, for navigation, path planning, or other functions based on determined positions of vehicle 102.
  • database 240 stores map data 242 including semantic object information 244, image data 246 from image (s) captured by camera 107, vehicle control data 248 including system settings, autonomous driving settings, safety settings etc., and user data 250 including user account information, user activity data, user preference settings, etc.
  • Modules 220 and database 240 are further described with reference to example processes shown in FIGs. 3-9 of the present disclosure. It is appreciated that modules 220 and/or database 240 are not limited to the scope of the example processes discussed herein. Modules 220 may further be configured to cause processor(s) 202 to perform other suitable functions, and database 240 may store information needed to perform such other suitable functions.
  • FIG. 3 shows a block diagram of an exemplary process 300 of determining a position of an autonomous vehicle (e.g., vehicle 102, or camera 107 onboard vehicle 102) , in accordance with embodiments of the present disclosure.
  • a position determined at a particular time point includes multiple groups of pose information of vehicle 102 at that particular time point.
  • the position may be determined based on image (s) captured by camera 107 (e.g., provided as a monovision camera) onboard the autonomous vehicle and object information included in a semantic map.
  • process 300 may be performed by one or more modules 220 and database 240 of apparatus 200 shown in FIG. 2.
  • one or more steps or processes included in process 300 may be performed by vehicle 102, mobile device 140, server 110, or combinations thereof.
  • one or more images of a surrounding environment during movement of vehicle 102 are obtained and processed, for example, by image obtaining and processing module 222.
  • the one or more images (e.g., an image 146 in FIG. 1) are captured by camera 107 when vehicle 102 is moving on a road (e.g., road 101, FIG. 1) .
  • a field of view of camera 107 may represent a driver's view.
  • the field of view of camera 107 may include one or more objects associated with the road, such as lane lines 104 for controlling and guiding traffic moving on road 101, road sign 103, light poles 106 along road 101, buildings, and/or trees, etc.
  • image 146 includes lane lines 104' and light poles 106' respectively corresponding to lane lines 104 and light poles 106 as shown in FIG. 1.
  • camera 107 may capture images at a predetermined frequency.
  • a captured image can be parsed based on metadata associated with pixels in the image to identify different objects in the image.
  • each pixel in the captured image may be associated with a label indicating a semantic meaning of a corresponding object. For example, in the captured image, pixels associated with the ground may be marked with a ground category, pixels associated with lane lines may be marked with a lane line category, and pixels associated with light poles may be marked with a light pole category.
  • image obtaining and processing module 222 may be configured to extract the semantic information associated with the pixels in the captured image, and identify the objects, such as lane lines 104 and light poles 106 in FIG. 1, based on the semantic information for determining positions of vehicle 102.
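  • As an illustrative sketch (not part of the original disclosure), the per-pixel category filtering described above can be expressed as follows in Python; the label map layout and the numeric category IDs are assumptions, since the disclosure does not specify a particular segmentation encoding.

```python
import numpy as np

# Hypothetical category IDs; the actual encoding of per-pixel labels
# depends on the segmentation model, which the disclosure does not specify.
GROUND_ID, LANE_LINE_ID, LIGHT_POLE_ID = 0, 1, 2

def extract_category_pixels(label_map: np.ndarray, category_id: int) -> np.ndarray:
    """Return (row, col) coordinates of all pixels labeled with category_id.

    label_map is an H x W array in which each pixel stores the semantic
    label of the corresponding object (ground, lane line, light pole, ...).
    """
    rows, cols = np.nonzero(label_map == category_id)
    return np.stack([rows, cols], axis=1)

# Usage: collect the pixels belonging to lane lines and light poles
# from the label map of one captured frame.
label_map = np.zeros((720, 1280), dtype=np.uint8)      # placeholder labels
lane_pixels = extract_category_pixels(label_map, LANE_LINE_ID)
pole_pixels = extract_category_pixels(label_map, LIGHT_POLE_ID)
```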
  • The captured image may be processed by image obtaining and processing module 222 onboard vehicle 102. Additionally or alternatively, the captured image may be transmitted from vehicle 102 to another device shown in FIG. 1, such as server 110 or mobile device 140, in real time for processing by image obtaining and processing module 222 on the corresponding device.
  • FIG. 4A shows an exemplary diagrammatic representation of an image 410 captured by camera 107 during movement of vehicle 102, in accordance with embodiments of the present disclosure.
  • image 410 includes, among other objects in the surrounding environment, lane lines 404 on the road and light poles 406 along the road.
  • Image obtaining and processing module 222 may be configured to parse image 410 using semantic information associated with pixels of image 410.
  • FIG. 4B shows an exemplary diagrammatic representation of an image 420 processed based on image 410 of FIG. 4A, in accordance with embodiments of the present disclosure. For example, different objects may be grouped and represented using different visual representations in the processed image 420, as shown in FIG. 4B.
  • image obtaining and processing module 222 can be configured to extract and parameterize lines 426 corresponding to light poles in image 420 in FIG. 4B.
  • a Hough transform may be used to analyze the pixels associated with all the light poles to identify and determine parameters associated with lines 426 that most closely represent light poles 406 in image 410. For example, for each pixel associated with the light pole category in the image, it is determined whether there can be one or more lines representing possible light poles passing through the pixel. The possible lines of all pixels associated with light poles may be superimposed. The lines with the greatest overlay values or with overlay values greater than a predetermined threshold, such as lines 426 in image 420, can be extracted as most likely representing light poles 406 in image 410.
  • the parameters of lines 426 can also be obtained in this process.
  • Other objects, such as lane lines or objects in other shapes, can also be extracted and parameterized using a similar process, such as the Hough transform described herein.
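  • A minimal sketch of such a line-extraction step is shown below, using OpenCV's probabilistic Hough transform on a binary mask of light-pole pixels; the vote threshold and length parameters are illustrative assumptions rather than values from the disclosure.

```python
import cv2
import numpy as np

def extract_pole_lines(pole_mask: np.ndarray,
                       votes_threshold: int = 50,
                       min_length: float = 40.0,
                       max_gap: float = 10.0) -> np.ndarray:
    """Fit line segments to a binary mask of light-pole pixels.

    The probabilistic Hough transform accumulates votes for candidate lines
    passing through the masked pixels and keeps only lines whose accumulated
    support exceeds votes_threshold, mirroring the thresholding on overlay
    values described above.
    """
    mask = (pole_mask > 0).astype(np.uint8) * 255
    segments = cv2.HoughLinesP(mask, rho=1, theta=np.pi / 180,
                               threshold=votes_threshold,
                               minLineLength=min_length, maxLineGap=max_gap)
    # Each row is (x1, y1, x2, y2); an empty result means nothing passed the threshold.
    return np.zeros((0, 4)) if segments is None else segments.reshape(-1, 4)
```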
  • FIG. 5A shows an exemplary diagrammatic representation of an image 500 captured by camera 107 during movement of vehicle 102, in accordance with embodiments of the present disclosure.
  • image 500 includes, among other objects in the surrounding environment, lane lines 504 on the road, curb lines 501 along the road, light poles 506 along the road, and a road sign 503.
  • image 500 captured by camera 107 shows a perspective view of the nearby environment of vehicle 102 (e.g., as a driver’s view from the moving vehicle on the road) .
  • image obtaining and processing module 222 may be configured to mathematically transform the perspective view of image 500 to a bird’s eye view (also referred to as a top-down view) .
  • certain calibration parameters associated with vehicle 102 and/or camera 107 may be used during the mathematical transformation, such as coordinates of camera 107 including, but not limited to, a height of camera 107 from the ground and an angle between camera 107 and the ground (e.g., a pitch angle of the camera) .
  • the transformation process may include extracting a region of interest from image 500 (e.g., a region of the road including multiple lane lines 504 and/or curb lines 501) , shifting a coordinate system of image 500 to a two-dimensional plane of a top-down view, rotating the image view by applying a matrix multiplication operation to pixels associated with lane lines, and projecting the image onto the two-dimensional plane.
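  • A minimal sketch of the perspective-to-bird's-eye-view transformation is shown below, assuming the four corners of the road region of interest have already been chosen (e.g., from the camera height and pitch angle); the corner ordering and output size are assumptions made for illustration.

```python
import cv2
import numpy as np

def to_birds_eye(image: np.ndarray,
                 src_quad: np.ndarray,
                 dst_size: tuple = (400, 600)) -> tuple:
    """Warp a road region of interest into a top-down (bird's eye) view.

    src_quad holds four pixel corners of the road region in the perspective
    image, ordered bottom-left, bottom-right, top-right, top-left; they are
    mapped onto a rectangle so that lane lines become (approximately)
    parallel vertical lines in the warped image.
    """
    w, h = dst_size
    dst_quad = np.float32([[0, h], [w, h], [w, 0], [0, 0]])
    homography = cv2.getPerspectiveTransform(np.float32(src_quad), dst_quad)
    top_down = cv2.warpPerspective(image, homography, dst_size)
    return top_down, homography

# The inverse homography (np.linalg.inv(homography)) maps lines extracted in
# the bird's eye view back into the perspective view, as done before locating
# the vanishing point.
```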
  • FIG. 5B shows exemplary pixels associated with lane lines in a bird’s eye view (e.g., image 520 in FIG. 5B) transformed from pixels corresponding to lane lines in a perspective view of an image captured by camera 107 (e.g., lane lines 504 in image 500 in FIG. 5A) , in accordance with some embodiments of the present disclosure.
  • Image 520 may further include pixels associated with curb lines in the bird’s eye view.
  • lane lines 504 and/or curb lines 501 in image 500 can first be identified using semantic information in the metadata of pixels respectively associated with lane lines and curb lines. A mathematical transformation process can then be applied to the identified pixels.
  • lines that appear to intersect in the perspective view can be parallel in the corresponding bird’s eye view.
  • the transformed and projected pixels associated with the lane lines and curb lines in the bird’s eye view appear to align to form substantially parallel line patterns.
  • an actual distance between adjacent lines remains the same during the transformation from the perspective view to the bird’s eye view, even though visually the distance may appear to be different depending on the height of the viewer in the bird’s eye view (e.g., the higher the view, the closer the adjacent lines appear) .
  • image obtaining and processing module 222 may be configured to extract and parameterize lines corresponding to the lane lines and/or the curb lines in the bird’s eye view 520 in FIG. 5B. In some embodiments, similar methods such as discussed for step 310 for extracting and parametrizing lines 426 for light poles 406 can be used.
  • FIG. 5C shows exemplary lines 534 extracted and parameterized from pixels corresponding to the lane lines in a bird’s eye view (e.g., image 520) to represent the lane lines in the captured image (e.g., lane lines 504 in captured image 500) , in accordance with some embodiments.
  • lines 531 may be extracted and parameterized from pixels corresponding to the curb lines.
  • the Hough transform applied in step 310 can be used to extract lines that are highly likely to correspond to lane lines 504 and/or curb lines 501. For example, for each pixel shown in FIG. 5B, lines going through the pixel and possibly corresponding to lane lines and/or curb lines are obtained.
  • the possible lines corresponding to lane lines and/or curb lines for all pixels in image 520 are determined and superimposed.
  • the lines with the greatest overlay values or with overlay values greater than a predetermined threshold, such as lines 534 and lines 531 in image 530, can be extracted as most closely representative of lane lines 504 and curb lines 501, respectively, in image 500.
  • the parameters of lines 534 and lines 531 can also be obtained in this process. It is beneficial to perform lane line extraction and parameterization in the bird’s eye view because the lines representing lanes are parallel and do not intersect in a distance. As such, it is more accurate and efficient while less complex to compute and determine the location and the parameters associated with the lane lines and/or curb lines using the bird’s eye view.
  • Image obtaining and processing module 222 may be configured to perform a perspective projection to transform the extracted lines in the bird’s eye view to lines in the perspective view.
  • FIG. 5D shows an exemplary diagrammatic representation of an image 540 in a perspective view including lines transformed from extracted lines in the bird’s eye view (e.g., lines 534 and 531 in image 530) corresponding to lane lines (e.g., lane lines 504) and curb lines (e.g., curb lines 501) , in accordance with embodiments of the present disclosure.
  • lines 534 in the bird’s eye view are transformed to lines 544 in the perspective view
  • lines 531 in the bird’s eye view are transformed to lines 541 in the perspective view in image 540.
  • Image 540 may be substantially similar to captured image 500 in FIG. 5A, with the transformed lines 544 and 541 (e.g., dotted lines 544 and 541) being superimposed on the respectively corresponding lane lines 504 and curb lines 501 (e.g., dashed lines 504 and 501).
  • transformed lines 544 and 541 intersect at a point 550 (also referred to as a vanishing point 550) in a distance in the perspective view 540 in FIG. 5D.
  • The process of determining vanishing point 550 as described in steps 320-330 with reference to FIGs. 5A-5D, which involves transformation of an image (e.g., captured image 500) or a portion of the image (e.g., a region including lane lines 504 and/or curb lines 501) from a perspective view (e.g., in FIG. 5A) to a bird’s eye view (e.g., in FIGs. 5B-5C) and then back to a perspective view (e.g., FIG. 5D), is beneficial. It is more accurate and efficient to extract and parameterize the lines corresponding to lane lines and/or curb lines in the bird’s eye view (e.g., as shown in FIG. 5C). As a result, the position of the vanishing point (also referred to as the intersection point) of the extracted lines, such as vanishing point 550 in FIG. 5D, can be more accurately and efficiently determined with a less complex computing process.
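  • The vanishing point itself can be computed as the intersection of two extracted lines. A minimal sketch using homogeneous coordinates is shown below; the endpoint values in the usage example are hypothetical.

```python
import numpy as np

def line_through(p1, p2) -> np.ndarray:
    """Homogeneous line through two image points given as (x, y)."""
    return np.cross([p1[0], p1[1], 1.0], [p2[0], p2[1], 1.0])

def vanishing_point(line_a: np.ndarray, line_b: np.ndarray):
    """Intersection of two homogeneous lines; None if they are parallel."""
    p = np.cross(line_a, line_b)
    if abs(p[2]) < 1e-9:          # lines are parallel in the image
        return None
    return p[0] / p[2], p[1] / p[2]

# Usage: two lane lines, each given by two endpoints after the bird's-eye
# lines have been projected back into the perspective view.
left_lane = line_through((300.0, 700.0), (610.0, 420.0))
right_lane = line_through((980.0, 700.0), (660.0, 420.0))
vp = vanishing_point(left_lane, right_lane)   # approximate vanishing point
```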
  • In step 340, a height value associated with camera 107 can be estimated.
  • the height value associated with camera 107 is estimated by height determination module 231.
  • the height value of camera 107 may be determined based on the height information of the lane lines and camera calibration parameter information.
  • In some embodiments, the height information of the lane lines (e.g., lane lines 504) may be expressed in a geographic coordinate system.
  • the geographic coordinate system may be a three-dimensional (3D) reference system including a latitude, a longitude, and an elevation for every location on the Earth to be specified by a set of numbers, letters, or symbols.
  • the height information of portions of lane lines 504 (e.g., within a range of 3 meters from camera 107) in the geographic coordinate system can be extracted from the HD map corresponding to the environment shown in image 500.
  • a point in a location or environment containing the lane lines 504, such as a point or a location in the current city where vehicle 102 is traveling can be used as an origin of the geographic coordinate system.
  • three-dimensional (3D) interpolation can be used to identify more data points along or associated with the lane lines, including a point on the ground above which camera 107 is currently located.
  • The height information of such a point in the geographic coordinate system can thus be obtained.
  • a height of the camera is determined based on position data, such as height information, of a predetermined object, such as a lane line, in a captured image.
  • camera parameters such as internal or intrinsic parameters, may be used to determine image coordinates of one or more points in the captured image, given the spatial position of the one or more points with reference to the camera.
  • In some embodiments, based on the coordinates (e.g., elevations) of pixels associated with lane lines 504 and the camera parameters, the height value (e.g., an elevation) of camera 107 relative to the ground can be determined in a camera coordinate system (e.g., a coordinate system having its origin at the camera center) or an image coordinate system (e.g., including the position of pixels transformed from the position of the ground in the camera coordinate system, for example, with the positions of pixels of the ground being represented by horizontal, vertical, and height values of each pixel in the image). The height value (e.g., an altitude or an elevation) of camera 107 in the geographic coordinate system can then be determined accordingly.
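  • One possible way to realize the height estimation of step 340, not necessarily the exact computation of the disclosure, is to fit a ground plane to nearby lane-line points from the HD map (already expressed in the camera frame) and take the camera height as the distance from the camera center to that plane:

```python
import numpy as np

def camera_height_from_ground_points(ground_points_cam: np.ndarray) -> float:
    """Estimate camera height as the distance from the camera origin to a
    plane fitted to nearby road-surface points expressed in the camera frame.

    ground_points_cam is an N x 3 array of 3D points on the road surface
    (e.g., interpolated lane-line points within a few meters of the camera),
    already transformed into the camera coordinate system.
    """
    centroid = ground_points_cam.mean(axis=0)
    # The plane normal is the right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(ground_points_cam - centroid)
    normal = vt[-1]
    # Distance from the origin (camera center) to the fitted plane.
    return float(abs(np.dot(normal, centroid)))
```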
  • In step 350, pitch and yaw angles associated with vehicle 102 or camera 107 can be estimated.
  • the pitch and yaw angles are estimated by pitch and yaw determination module 232.
  • semantic information from the HD map can be retrieved in step 352 and used in step 350.
  • Semantic information of objects within 200 meters of vehicle 102 is obtained in accordance with the vehicle position information. For example, initial vehicle position information can be determined based on the GPS information of the vehicle, and during movement, the vehicle position can be determined based on sensory data obtained from the IMU and the odometer.
  • the lane lines corresponding to lane lines 504 can be retrieved from the semantic layer of the HD map in step 304.
  • the HD map information may be retrieved from map provider server 130 as shown in FIG. 1.
  • Semantic information of the HD map information may be retrieved by map obtaining and processing module 224 of apparatus 200.
  • the position information of the objects obtained from the HD map can be converted, by map obtaining and process module 224, from the geographic coordinate system into the camera coordinate system (or the image coordinate system) using one or more camera parameters, including extrinsic parameters and intrinsic parameters such as optical, geometric, and/or digital characteristics of the camera.
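  • A minimal sketch of this conversion is shown below: 3D map points, already expressed in a local metric world frame, are projected into the image with the camera extrinsics and intrinsics. A pinhole model without lens distortion is an assumption made for brevity.

```python
import numpy as np

def project_map_points(points_world: np.ndarray,
                       rotation_cw: np.ndarray,
                       translation_cw: np.ndarray,
                       intrinsics: np.ndarray) -> np.ndarray:
    """Project 3D map points into pixel coordinates.

    rotation_cw and translation_cw are the extrinsics mapping world
    coordinates into the camera frame: X_cam = R @ X_world + t.
    intrinsics is the 3x3 camera matrix K.
    """
    points_cam = points_world @ rotation_cw.T + translation_cw
    in_front = points_cam[:, 2] > 1e-6          # keep points ahead of the camera
    points_cam = points_cam[in_front]
    pixels_h = points_cam @ intrinsics.T        # homogeneous pixel coordinates
    return pixels_h[:, :2] / pixels_h[:, 2:3]   # dehomogenize to (u, v)
```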
  • FIG. 6 is an exemplary diagrammatic representation of an image 600 for matching the lane lines (e.g., lines 544) extracted from the captured image (e.g., image 500) and the lane lines (e.g., lines 604 and 614) obtained from the HD map data (e.g., in step 352) in the camera coordinate system (or the image coordinate system) , in accordance with some embodiments of the present disclosure.
  • lines 544 are extracted from the bird’s eye view and transformed to the camera coordinate system or the image coordinate system (e.g., as described with reference to FIGs. 5A-5D and shown as lines 544 in the camera perspective view in FIG. 5D) .
  • lines 604 and 614 correspond to lane lines 504 and are identified in an HD map and converted to the camera coordinate system or the image coordinate system using the camera parameters. Further, a vanishing point 605 (or an intersection point) between the lines 604, or a vanishing point 615 between the lines 614 can be determined in image 600.
  • The yaw and pitch angles of camera 107 can affect the positions of the lane lines projected into the camera coordinate system or the image coordinate system in image 600 (e.g., lines 604 and 614) and the position of the corresponding vanishing point (e.g., vanishing points 605 and 615). Therefore, by adjusting the yaw and pitch angles of the camera, the vanishing point of the lane lines projected from the HD map can be adjusted to overlap with the vanishing point determined from the lane lines extracted from the captured image (e.g., as determined in step 330).
  • Lines 544 extracted from the bird’s eye view corresponding to lane lines 504 are shown as dotted lines, and the corresponding vanishing point 550 is shown as a solid dot.
  • The lines 604 and 614 projected from the lane lines in the HD map are shown as dash-dotted lines, and the corresponding vanishing points 605 and 615 are shown as empty circles.
  • Lines 604 with vanishing point 605 may be projected from the lane lines obtained from the HD map using an initial pair of pitch and yaw angles (θ0, ψ0). It can be determined, as shown in FIG. 6, that vanishing point 605 does not coincide well with vanishing point 550.
  • pitch and yaw determination module 232 can iterate the adjustment of pitch and yaw angles of camera 107 to gradually align the vanishing points.
  • Lines 614 and corresponding vanishing point 615 may be obtained at pitch and yaw angles of (θm, ψm).
  • When vanishing point 615 substantially coincides with vanishing point 550, it is determined that camera 107 has the current pitch and yaw angles, e.g., (θm, ψm).
  • the vanishing point alignment discussed in step 350 can use various iterative optimization algorithms to achieve efficient convergence. For example, a fast-gradient method may be used for the iterative optimization.
  • vibration during movement of vehicle 102 may affect the accuracy of the height value. Errors caused by such motions during movement may be considered and compensated for in the calculation of pitch and yaw angles.
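  • A simplified sketch of the vanishing-point alignment of step 350 is given below. It assumes the HD-map road direction is available as a unit vector, roll is held at zero (it is estimated separately in step 360), and a plain numerical-gradient descent stands in for the fast-gradient method mentioned above; the intrinsics and measured vanishing point in the usage example are hypothetical.

```python
import numpy as np

def rotation_from_pitch_yaw(pitch: float, yaw: float) -> np.ndarray:
    """World-to-camera rotation built from pitch (about x) and yaw (about y).
    Roll is estimated separately in step 360, so it is held at zero here."""
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    rot_x = np.array([[1.0, 0.0, 0.0], [0.0, cp, -sp], [0.0, sp, cp]])
    rot_y = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    return rot_y @ rot_x

def predicted_vanishing_point(road_dir_world, pitch, yaw, intrinsics):
    """Vanishing point of the HD-map road direction under candidate angles."""
    d_cam = rotation_from_pitch_yaw(pitch, yaw) @ road_dir_world
    u = intrinsics @ d_cam
    return u[:2] / u[2]

def align_pitch_yaw(vp_measured, road_dir_world, intrinsics,
                    pitch0=0.0, yaw0=0.0, steps=300, lr=1e-7, eps=1e-5):
    """Iteratively adjust (pitch, yaw) so the predicted vanishing point of the
    projected HD-map lane lines coincides with the vanishing point measured
    from the captured image (vanishing point 550)."""
    def cost(pitch, yaw):
        diff = predicted_vanishing_point(road_dir_world, pitch, yaw, intrinsics) - vp_measured
        return float(diff @ diff)

    pitch, yaw = pitch0, yaw0
    for _ in range(steps):
        g_pitch = (cost(pitch + eps, yaw) - cost(pitch - eps, yaw)) / (2.0 * eps)
        g_yaw = (cost(pitch, yaw + eps) - cost(pitch, yaw - eps)) / (2.0 * eps)
        pitch -= lr * g_pitch
        yaw -= lr * g_yaw
    return pitch, yaw

# Usage (illustrative values): a focal length of ~1000 px and a road
# direction pointing straight ahead in the local world frame.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
vp_550 = np.array([655.0, 340.0])
pitch_m, yaw_m = align_pitch_yaw(vp_550, np.array([0.0, 0.0, 1.0]), K)
```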
  • In step 360, a roll angle and a horizontal position (e.g., a lateral position) of vehicle 102 or camera 107 can be estimated.
  • the roll angle and a horizontal position are estimated by roll and horizontal pose determination module 233.
  • the horizontal position may correspond to a relative position of vehicle 102 or camera 107 along a direction traversing the direction of the road.
  • The roll angle (φ) and the horizontal position (x) may be determined using a two-layer exhaustive search (e.g., a brute-force search) algorithm for matching one or more objects that have been identified from the captured image with corresponding objects retrieved from the HD map in the bird's eye view.
  • lines 534 corresponding to lane lines 504 are extracted from image 500 captured by camera 107 and projected in the bird’s eye view image 530 in FIG. 5C.
  • the exhaustive search algorithm may use a dynamic model of the vehicle body as a constraint, and assume a position deviation between two image frames (e.g., at 50 ms apart) does not exceed 30 cm.
  • The search range of the lateral pose may be set to ±50 cm, ±30 cm, ±20 cm, ±10 cm, etc., and the search resolution may be 10 cm, 5 cm, 2 cm, 1 cm, etc.
  • The range of roll variation may be ±5.0 degrees, ±2.0 degrees, ±1.0 degree, ±0.5 degree, etc., and the resolution may be 0.1 degree, 0.2 degree, 0.3 degree, etc. It is appreciated that the above parameters are only examples, for illustrative purposes. Different search ranges and resolutions can be set to achieve more accurate pose estimation; a brute-force search over roll and lateral position is sketched below.
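  • A sketch of this two-layer exhaustive search is given below. The scoring callback is an assumption: it would, for example, count the bird's-eye-view lane-line pixels from the captured image that fall on the HD-map lane lines after applying the candidate roll angle and lateral shift.

```python
import numpy as np

def search_roll_and_lateral(score_fn,
                            roll_range=np.deg2rad(1.0), roll_step=np.deg2rad(0.1),
                            x_range=0.30, x_step=0.02):
    """Two-layer exhaustive (brute-force) search over roll and lateral offset.

    score_fn(roll, dx) is assumed to return a matching score for a candidate
    roll angle (radians) and lateral shift (meters). Ranges and steps follow
    the illustrative values above (about +/-1.0 degree / 0.1 degree and
    +/-30 cm / 2 cm).
    """
    rolls = np.arange(-roll_range, roll_range + 1e-12, roll_step)
    offsets = np.arange(-x_range, x_range + 1e-12, x_step)
    best = (-np.inf, 0.0, 0.0)
    for roll in rolls:                 # outer layer: roll angle
        for dx in offsets:             # inner layer: lateral position
            score = score_fn(roll, dx)
            if score > best[0]:
                best = (score, roll, dx)
    return best[1], best[2]            # roll and lateral offset with best match
```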
  • In step 370, a vertical value (y) of camera 107 or vehicle 102 may be estimated.
  • vertical value (y) may correspond to a relative position of camera 107 or vehicle 102 at a point along the direction of the road.
  • vertical value (y) can be determined by vertical pose determination module 234.
  • light poles along the road may be used for estimating the vertical value (y) . It is appreciated that while light poles are used as examples for illustrating the process of determining the vertical value (y) , other objects (e.g., road signs, buildings, etc. ) that can be captured by the camera and retrieved from the HD map can also be used for determining the vertical value.
  • FIG. 7 is an exemplary diagrammatic representation of an image 700 for matching the light poles (e.g., light poles 706) extracted from the captured image and the light poles (e.g., lines 716) obtained from the HD map data and projected into the camera view image 700, in accordance with some embodiments of the present disclosure.
  • the head of vehicle 102 consistently points in the direction in which the road extends.
  • the vertical positioning of the vehicle may be determined relative to the position of objects along the road, such as street light poles and road signs, etc.
  • the street light poles can be extracted from captured image 700 using the method described in step 310. For example, pixels in the image corresponding to the light poles can be identified based on the metadata of the pixels.
  • position information of light poles corresponding to light poles 706 can be retrieved from the semantic layer of the HD map in step 372.
  • the identified light poles 706 from the captured image 700 can then be matched with corresponding light poles extracted from the HD map in step 372 to estimate the vertical position.
  • during the exhaustive search (e.g., brute-force search), vertical pose determination module 234 can change the vertical position of the camera to adjust the locations of the light poles obtained from the HD map and projected in the camera view.
  • a latitude and a longitude of vehicle 102 in the geographic coordinate system can further be determined based on its horizontal position (x) and vertical position (y) relative to a certain point in image 700, where this point has a latitude and a longitude in the geographic coordinate system that are known from the HD map.
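By way of a non-limiting illustration of this last conversion, the sketch below applies a flat-earth approximation to turn a small east/north offset from a mapped reference point into latitude and longitude. It assumes the (x, y) offsets have already been rotated into east/north components using the road heading from the HD map; the variable names and example coordinates are illustrative only.

```python
import math

EARTH_RADIUS_M = 6378137.0  # WGS-84 equatorial radius

def offset_to_lat_lon(ref_lat_deg, ref_lon_deg, east_m, north_m):
    """Convert a small east/north offset (meters) from a reference point with
    known latitude/longitude into geographic coordinates. The flat-earth
    approximation is adequate for offsets of up to a few hundred meters."""
    dlat = math.degrees(north_m / EARTH_RADIUS_M)
    dlon = math.degrees(east_m / (EARTH_RADIUS_M * math.cos(math.radians(ref_lat_deg))))
    return ref_lat_deg + dlat, ref_lon_deg + dlon

# Example: the vehicle is found 1.2 m east and 8.5 m north of a mapped lane-line point.
lat, lon = offset_to_lat_lon(22.5431, 114.0579, 1.2, 8.5)
```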
  • the exhaustive search algorithm may use a dynamic model of the vehicle body and wheel speed information as constraints to achieve convergence by searching in a small range.
  • the estimated search range may be ±100 cm, ±70 cm, ±50 cm, ±30 cm, etc.
  • the search step may be 20 cm, 10 cm, 5 cm, etc.
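By way of a non-limiting illustration, the one-dimensional search over the along-road offset may be sketched as follows. The projection helper, the nearest-pole cost, and the numeric range and step are editorial assumptions rather than the disclosed implementation.

```python
import numpy as np

def search_longitudinal_offset(map_pole_xy, detected_pole_u, project_to_image,
                               search_range=0.5, step=0.05):
    """Brute-force search over the along-road offset (meters).

    map_pole_xy:      (N, 2) pole positions from the HD map in a road-aligned
                      frame (first column lateral, second column along the road).
    detected_pole_u:  horizontal image coordinates (pixels) of pole lines
                      detected in the captured image.
    project_to_image: callable mapping road-frame points to image columns for
                      the current camera pose (height, pitch, yaw, roll and the
                      lateral position are assumed to be estimated already).
    """
    best_y, best_cost = 0.0, np.inf
    for dy in np.arange(-search_range, search_range + 1e-9, step):
        shifted = map_pole_xy + np.array([0.0, dy])
        projected_u = np.asarray(project_to_image(shifted))   # (N,) image columns
        # Match each projected pole to the nearest detected pole column.
        d = np.abs(projected_u[:, None] - np.asarray(detected_pole_u)[None, :])
        cost = d.min(axis=1).mean()
        if cost < best_cost:
            best_y, best_cost = dy, cost
    return best_y
```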
  • Conventionally, the position information of the vehicle, including the 6 DOF parameters, is determined together in a coupled manner. That is, the 6 DOF parameters are dependent on each other and are determined in one process. As a result, the calculation process is slow and requires substantial computing power.
  • the present disclosure uses separate and independent groupings to optimize the parameters in each individual group (as discussed for process 300). Accordingly, one or two parameters in a group can be determined independently and separately, requiring a less complex computation process involving a small volume of data. Further, the 6 DOF parameters can be determined using the semantic information of the objects in the HD map with high precision, e.g., at centimeter level, in process 300. As a result, the calculation process described in the present disclosure can be more accurate, more efficient, and more cost-effective, without requiring complex and expensive computing resources.
  • in step 380, pose information (e.g., in 6 DOF, including height (z), pitch (θ), yaw (ψ), roll (φ), horizontal pose (x), and vertical pose (y)) in different groups, determined separately in steps 340, 350, 360, and 370, and sensory data obtained from multiple sensors onboard vehicle 102 can be merged together to obtain global pose information of camera 107 or vehicle 102.
  • sensory data can be obtained from IMU (e.g., vehicle inertial navigation data) and odometer (e.g., wheel speed data) respectively, to obtain angular acceleration, angular speed, and vehicle speed during vehicle movement.
  • sensor fusion module 226 of apparatus 200 may be used for performing sensor fusion in step 380.
  • sensor fusion module 226 can use any suitable algorithm, such as a Kalman Filter (e.g., Extended Kalman Filter, or Error-State Kalman Filter (ESKF) ) , for performing sensor fusion to obtain a more stable and accurate output in step 390.
  • sensor fusion module 226 can apply any suitable algorithms, such as a Central Limit Theorem, a Kalman Filter, a Bayesian network, or a convolutional neural network (CNN), etc., to the pose data and the sensory data to obtain global position results of vehicle 102 with improved accuracy.
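As a non-limiting illustration of the filtering idea, the sketch below performs a single scalar Kalman update that fuses an odometry-propagated estimate of one pose parameter with the corresponding vision-and-map measurement. An Extended or Error-State Kalman Filter over the full 6 DOF state would be structured analogously; the numeric variances shown are illustrative only.

```python
def kalman_update(x_pred, p_pred, z_meas, r_meas):
    """One scalar Kalman update.

    x_pred, p_pred: predicted state (e.g., lateral position propagated from
                    wheel-speed/IMU data) and its variance.
    z_meas, r_meas: measurement from the image/HD-map matching and its variance.
    Returns the fused estimate and its (reduced) variance.
    """
    k = p_pred / (p_pred + r_meas)            # Kalman gain
    x_fused = x_pred + k * (z_meas - x_pred)
    p_fused = (1.0 - k) * p_pred
    return x_fused, p_fused

# Example: odometry predicts x = 1.30 m (variance 0.04); the vision/map match
# gives 1.22 m (variance 0.01). The fused estimate is about 1.236 m with
# variance 0.008, i.e., more certain than either input alone.
x, p = kalman_update(1.30, 0.04, 1.22, 0.01)
```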
  • the global position of vehicle 102 output in step 390 can be used, by instruction generation module 228, to generate instructions for navigating autonomous vehicle 102.
  • sensor fusion processes are performed on data obtained from disparate sources, such as pose information calculated based on image data and map semantic data, and sensory data detected by sensors (e.g., GPS, IMU).
  • the determination of pose information according to process 300 is independent of noise and errors from the various sensors.
  • the global position of vehicle 102 output in step 390 is therefore more accurate and robust than positions determined by conventional methods.
  • FIG. 8 is an exemplary map 800 generated and updated based on the real-time position of vehicle 102 determined during movement of vehicle 102, in accordance with some embodiments of the present disclosure.
  • map 800 includes lane lines 804, light poles 806, and/or other objects associated with the road extracted from the semantic layer of the HD map, and projected into a bird’s eye view, as shown in FIG. 8.
  • Other information such as traffic information from the real-time traffic layer of the HD map can also be obtained to generate map 800.
  • the position of vehicle 102 determined in process 300 can be shown in map 800 and updated in real time.
  • map 800 can also show real-time autonomous driving information related to route planning, obstacle avoidance, user notification, etc.
  • map 800 may also include a GUI for receiving user input, such as adding a stop or changing a destination by selecting a location on map 800, to provide a user interactive experience.
  • FIG. 9 shows a flow diagram of an exemplary process 900 of determining a position of camera 107 or vehicle 102, in accordance with some embodiments of the present disclosure.
  • process 900 may be performed by one or more modules 220 and database 240 of apparatus 200 shown in FIG. 2.
  • one or more steps of process 900 may be performed by modules in vehicle 102, mobile device 140, server 110, or combinations thereof.
  • vehicle position can be determined by modules onboard vehicle 102 based on image data captured by camera 107 onboard vehicle 102. Instructions for autonomous driving can also be generated onboard vehicle 102.
  • image data captured by camera 107 can be transmitted to mobile device 140 or server 110, and vehicle position can be determined by mobile device 140 or server 110 and transmitted to vehicle 102 in real time.
  • one or more objects such as lane lines (lane lines 104, 404, 504, 704, 804) and/or light poles (light poles 106, 406, 506, 706, 806) , are identified in an image (e.g., image 410, 500, 700) captured by a camera (e.g., camera 107) onboard a vehicle (e.g., vehicle 102) during movement of the vehicle.
  • the captured image includes at least a portion of an environment surrounding the vehicle during the movement.
  • the one or more objects in the captured image are identified in accordance with semantic metadata of pixels associated with the one or more objects in the captured image.
  • camera 107 is a monovision camera, and the image captured by the camera is a monocular image.
  • in step 920, position data associated with one or more predetermined objects corresponding to the one or more objects identified in the captured image (in step 910) is retrieved from a map (e.g., a HD map retrieved from map provider server 130) of the environment.
  • location information of the predetermined objects, such as lane lines and light poles, can be obtained from the HD map of the environment (e.g., from the semantic layer).
  • the map of the environment in the vicinity of vehicle 102 may be retrieved in accordance with a location of vehicle 102, which can be determined based on sensor data obtained from sensors (e.g., IMU, odometer, GPS, etc. ) onboard vehicle 102.
  • the HD map can be obtained from different sources (e.g., any map provider server 130) in various suitable data formats.
  • the map data may be requested and fetched using API calls.
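By way of a non-limiting illustration, retrieving nearby semantic objects can be sketched as a simple spatial filter over the map's semantic layer. The object schema, field names, and radius below are editorial assumptions, since HD-map providers expose different formats and query interfaces.

```python
from dataclasses import dataclass

@dataclass
class MapObject:
    kind: str          # e.g., "lane_line", "light_pole", "road_sign"
    points_en: list    # polyline vertices (east, north) in a local frame, meters

def objects_near(map_objects, vehicle_en, radius_m=100.0,
                 kinds=("lane_line", "light_pole")):
    """Return semantic-layer objects of the requested kinds having at least one
    vertex within radius_m of the approximate vehicle position obtained from
    onboard sensors (e.g., GPS/IMU/odometer)."""
    ve, vn = vehicle_en
    selected = []
    for obj in map_objects:
        if obj.kind not in kinds:
            continue
        if any((e - ve) ** 2 + (n - vn) ** 2 <= radius_m ** 2 for e, n in obj.points_en):
            selected.append(obj)
    return selected

# Example usage with a hypothetical in-memory semantic layer:
# nearby = objects_near(semantic_layer_objects, vehicle_en=(312.4, -87.9))
```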
  • the one or more predetermined objects in the map include a plurality of predefined linear objects in the environment where the vehicle moves.
  • the plurality of predefined linear objects include a plurality of predefined lines, such as lane lines, on a road on which the vehicle moves.
  • the one or more predetermined objects in the map include a pre-established object on a side of the road, such as light poles, road signs, buildings, etc., along the road.
  • the one or more predetermined objects may also be above the road, such as traffic lights etc.
  • the objects used for determining the vertical pose (y) (e.g., in step 370) may not be blocked by moving vehicles on the road.
  • one or more pose information items associated with the camera (e.g., camera 107) or the vehicle (e.g., vehicle 102), such as 6 DOF parameters including height (z), pitch (θ), yaw (ψ), roll (φ), horizontal pose (x), and vertical pose (y), can be determined.
  • the pose information items may be determined in accordance with matching the one or more objects identified in the captured image (e.g., in step 910) with the corresponding one or more predetermined objects retrieved from the map (e.g., in step 920) .
  • a height of the camera may be determined (e.g., as described in step 340) based on position data (e.g., height information in the geographical coordinate system) of a predetermined object (e.g., a linear object, such as a lane line) obtained from the map and position data (e.g., height information in the camera view system) of the corresponding object (e.g., lane lines) extracted in the captured image.
  • a height of the camera may be determined based on position data (e.g., height information) of a predetermined object (e.g., a linear object such as a lane line) in the captured image.
  • a height of the camera may be determined based on position data of a predetermined object in the captured image and one or more parameters of the camera.
  • the predetermined object may include a linear object, such as a lane line.
  • the one or more parameters may include an optical parameter, a preset pose parameter, or a calibration parameter of the camera.
  • a height of the camera may be determined based on one or more parameters of the cameras as discussed herein and a height of a road on which the image is captured, where the height of the road is determined based on position data of an object in the image.
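As a non-limiting geometric illustration of how a height value can follow from a single ground feature and camera parameters, the sketch below assumes an ideal pinhole camera with zero roll, a flat road, a known pitch angle, and a known longitudinal distance to a ground feature roughly straight ahead (e.g., a lane-marking endpoint whose distance is taken from the map). It shows one possible variant of the idea, not the disclosed computation.

```python
import math

def camera_height_from_ground_point(v_px, cy_px, focal_px, pitch_deg, distance_m):
    """Estimate camera height above a flat road from one ground feature.

    v_px:       image row of the ground feature (pixels, increasing downward).
    cy_px:      principal-point row of the camera.
    focal_px:   focal length in pixels.
    pitch_deg:  downward pitch of the optical axis relative to the road.
    distance_m: horizontal distance to the feature, e.g., taken from the HD map.

    For a zero-roll pinhole camera, the ray to the feature is depressed from the
    horizontal by pitch + atan((v - cy) / f), so height = distance * tan(angle).
    """
    ray_angle = math.radians(pitch_deg) + math.atan((v_px - cy_px) / focal_px)
    return distance_m * math.tan(ray_angle)

# Example: a lane-marking corner 20 m ahead appears 80 px below the principal
# point with f = 1000 px and a 2-degree pitch, giving a height of about 2.3 m.
h = camera_height_from_ground_point(v_px=620, cy_px=540, focal_px=1000.0,
                                    pitch_deg=2.0, distance_m=20.0)
```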
  • a yaw angle and a pitch angle of the camera may be determined (e.g., as described in step 350) in accordance with matching a first vanishing point (e.g., vanishing point 550) associated with two road lanes (e.g., lane lines 544) in the captured image and a second vanishing point (e.g., vanishing point 615) associated with two predefined lines in the map corresponding to the two road lanes.
  • a vanishing point may be associated with an intersection of two lines.
  • the first vanishing point may be determined by determining positions of the two road lanes (e.g., lines 534 in FIG. 5C) in a bird’s eye view image (e.g., image 530) transformed from the captured image (e.g., image 500), and then determining the first vanishing point in a perspective view (e.g., image 540 or 600) transformed from the bird’s eye view image.
  • the second vanishing point may be determined based on projecting position data associated with the two predefined lines in the map onto the perspective view of the camera, such as lines 614 in image 600.
  • two road lanes may be fitted based on data obtained from the captured image and/or the map. For example, semantic data associated with the road lanes may be obtained from the map for fitting the two road lanes. In another example, pixel values and associated position information corresponding to the road lanes may be extracted from the captured image for fitting the two road lanes. In some embodiments, positions of the two road lanes can be determined based on the fitting result, and the first vanishing point can be further determined based on the positions of the two road lanes.
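By way of a non-limiting illustration, the sketch below computes the intersection of two fitted lane lines in homogeneous image coordinates and then recovers approximate pitch and yaw from the vanishing point under an ideal zero-roll, small-angle pinhole model. The sign conventions and closed-form expressions are editorial simplifications of the matching idea, not the disclosed formulation.

```python
import numpy as np

def vanishing_point(p1, p2, q1, q2):
    """Intersection of two image lines, each given by two pixel points,
    computed with homogeneous coordinates; assumes the lines are not parallel
    in the image."""
    l1 = np.cross([*p1, 1.0], [*p2, 1.0])
    l2 = np.cross([*q1, 1.0], [*q2, 1.0])
    v = np.cross(l1, l2)
    return v[0] / v[2], v[1] / v[2]        # (u, v) pixel coordinates

def pitch_yaw_from_vp(vp, fx, fy, cx, cy):
    """Zero-roll, small-angle pinhole model: the vanishing point (u, v) of lines
    parallel to the road direction satisfies approximately
        u = cx - fx * tan(yaw),   v = cy - fy * tan(pitch),
    where pitch and yaw are the camera's rotations away from the road direction
    (sign conventions vary between implementations)."""
    u, v = vp
    yaw = np.degrees(np.arctan2(cx - u, fx))
    pitch = np.degrees(np.arctan2(cy - v, fy))
    return pitch, yaw

# Example: two lane lines, each fitted through two pixel points in the perspective view.
vp = vanishing_point((640, 700), (700, 500), (1280, 700), (1000, 500))
pitch, yaw = pitch_yaw_from_vp(vp, fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
```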
  • a horizontal position and/or a roll angle of the camera may be determined (e.g., as described in step 360) in accordance with matching one or more linear objects, such as one or more lanes (e.g., lines 534 in the bird’s eye view 530 in FIG. 5C) extracted from the captured image with one or more corresponding predefined linear objects, such as lines on the road in the map (e.g., extracted from the semantic layer of the HD map and projected to the bird’s eye view).
  • a vertical position of the camera may be determined (e.g., as described in step 370) in accordance with matching one or more vertical objects, such as facility or utility objects, e.g., light poles (e.g., poles 706) , or fences, trees, buildings, footbridge, etc., in the captured image (e.g., image 700) with one or more of the pre-established objects on the side of the road in the map (e.g., lines 716 extracted from the HD map and projected onto the perspective view 700) .
  • sensor fusion may be performed (e.g., as described in step 380) to merge the determined one or more pose information items (e.g., the 6 DOF parameters) of the camera.
  • sensor data obtained from sensors onboard the vehicle, such as IMU, odometer, etc., can also be merged by the sensor fusion process to obtain a more accurate global position of camera 107 or vehicle 102.
  • instructions may be generated to operate the vehicle in real time, e.g., including autonomous driving, route planning/updating, obstacle avoidance, providing user notification, etc., during the movement of the vehicle based on the determined one or more pose information items.
  • the map information (e.g., information associated with various predetermined objects, such as semantic information and/or position information) as discussed herein may be collected by sensors on vehicle 102 and/or on one or more other vehicles different from the vehicle 102, such as a vehicle used for road calibration, road construction, road maintenance, and/or road mapping.
  • the map information can also be downloaded prior to the trip or in real time from any suitable network 120, for example, from one or more other vehicles previously or currently travelling on the corresponding segments of the road.
  • the map information may be shared and/or pushed to multiple vehicles periodically.
  • the map information may be updated (e.g., by iteration, correction, editing, replacement, overwriting, etc.).
  • if a vehicle fails to collect certain map information with sufficient accuracy, the vehicle can report such failure and broadcast to or notify one or more other vehicles to collect the missing information with sufficient accuracy to be used for the map information.
  • the map information may be collected by any other suitable type of movable object, such as an unmanned aerial vehicle (UAV) .

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Traffic Control Systems (AREA)
  • Navigation (AREA)

Abstract

Method, apparatus (200), and non-transitory computer-readable medium for determining position of an autonomous vehicle. The method includes identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle, wherein the image includes at least a portion of an environment surrounding the vehicle during the movement (910). The method also includes retrieving position data associated with one or more predetermined objects from a map of the environment, wherein the one or more predetermined objects correspond to the one or more objects identified in the captured image (920). The method further includes determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map (930).

Description

DETERMINING VEHICLE POSITIONS FOR AUTONOMOUS DRIVING BASED ON MONOCULAR VISION AND SEMANTIC MAP
COPYRIGHT NOTICE
A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
TECHNICAL FIELD
The present disclosure relates generally to self-driving technology and self-driving vehicles and, more particularly, to systems, apparatus, and methods for determining vehicle positions for autonomous driving based on camera data and semantic map data in real time.
BACKGROUND
Self-driving technology is capable of sensing the surrounding environment and generating real-time instructions to safely drive a movable object, such as a self-driving vehicle, with little or no human interaction. The self-driving vehicle can be equipped with one or more sensors to gather information from the environment, such as radar, LiDAR, sonar, camera (s) , global positioning system (GPS) , inertial measurement units (IMU) , and/or odometry, etc. Based on various sensory data obtained from the one or more sensors, the self-driving vehicle needs to determine real-time position and generate instructions for navigation.
However, there exists a need for a system, an apparatus, and a method for positioning and navigating self-driving vehicles in real time with reduced cost, improved accuracy, and enhanced safety.
SUMMARY
Consistent with embodiments of the present disclosure, a method is provided for determining position information of an autonomous vehicle. The method includes identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle. The image includes at least a portion of an environment surrounding the vehicle during the movement. The method also includes retrieving position data associated with one or more predetermined objects from a map of the environment. The one or more predetermined objects correspond to the one or more objects identified in the captured image. The method further includes determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
There is also provided an apparatus configured to determine position information of an autonomous vehicle. The apparatus includes one or more processors, and memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to perform operations including identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle. The image includes at least a portion of an environment surrounding the vehicle during the movement. The operations also include retrieving position data associated with one or more predetermined objects from a map of the environment. The one or more predetermined objects correspond to the one or more objects identified in the captured image. The operations further include determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
There is further provided a non-transitory computer-readable medium with instructions stored thereon that, when executed by a processor, cause the processor to perform operations including identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle. The image includes at least a portion of an environment surrounding the vehicle during the movement. The operations also include retrieving position data associated with one or more predetermined objects from a map of the environment. The one or more predetermined objects correspond to the one or more objects identified in the captured image. The operations further include determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed. Other features and advantages of the present invention will become apparent by a review of the specification, claims, and appended figures.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows an exemplary environment for applying self-driving technology, in accordance with embodiments of the present disclosure.
FIG. 2 shows a block diagram of an exemplary apparatus configured in accordance with embodiments of the present disclosure.
FIG. 3 shows a block diagram of an exemplary process of determining a position of an autonomous vehicle, in accordance with embodiments of the present disclosure.
FIG. 4A shows an exemplary diagrammatic representation of an image captured by a camera during movement of a vehicle, in accordance with embodiments of the present disclosure.
FIG. 4B shows an exemplary diagrammatic representation of an image processed based on the captured image of FIG. 4A, in accordance with embodiments of the present disclosure.
FIG. 5A shows an exemplary diagrammatic representation of an image captured by a camera during movement of a vehicle, in accordance with embodiments of the present disclosure.
FIG. 5B shows exemplary pixels associated with lane lines in a bird's eye view transformed from pixels corresponding to lane lines, in accordance with some embodiments of the present disclosure.
FIG. 5C shows exemplary lines extracted and parameterized from pixels corresponding to lane lines in a bird's eye view to represent lane lines in a captured image, in accordance with some embodiments.
FIG. 5D shows an exemplary diagrammatic representation of an image in a perspective view including lines transformed from extracted lines in the bird's eye view corresponding to lane lines and curb lines, in accordance with embodiments of the present disclosure.
FIG. 6 is an exemplary diagrammatic representation of an image for matching lane lines extracted from a captured image and lane lines obtained from HD map data in a camera coordinate system, in accordance with some embodiments of the present disclosure.
FIG. 7 is an exemplary diagrammatic representation of an image for matching light poles extracted from a captured image and light poles obtained from HD map data and projected into a camera view image, in accordance with some embodiments of the present disclosure.
FIG. 8 is an exemplary map generated and updated based on real-time position of an autonomous vehicle determined during movement, in accordance with some embodiments of the present disclosure.
FIG. 9 shows a flow diagram of an exemplary process of determining a position of a camera onboard an autonomous vehicle, in accordance with some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers refer to the same or similar parts. While several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the components illustrated in the drawings. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.
Self-driving technology requires determining positions of self-driving vehicles in real time. The Global Positioning System (GPS) can be used for providing location information in navigation. However, the accuracy (e.g., 2-3 meters or more than 10 meters) of GPS is too low to satisfy the requirement of self-driving technology. Some self-driving technology uses a LiDAR system installed on the self-driving vehicles to provide location information with much higher accuracy (e.g., in a range of several centimeters) . However, the cost for a LiDAR system is high. Self-driving technology may also have high requirements for data storage and maintenance to handle large amounts of data in high precision maps, such as a three-dimensional (3D) point cloud map. For example, maps used in LiDAR technology may include about 200  MB of data per kilometer on the map. In addition, large amounts of data require systems with high computing power and/or special computing capability, such as graphics processing units (GPUs) for parallel computing, which can greatly increase the cost of the system. Further, some self-driving technology can only determine the vehicle positions with three degrees of freedom (3 DOF) including x, y, and yaw, but cannot estimate other pose information, such as z, roll, and pitch.
Consistent with embodiments of the present disclosure, there are provided methods, apparatus, and systems for determining real-time positions (e.g., pose information) of self-driving vehicles based on image (s) captured by a camera, such as a monocular camera, onboard the self-driving vehicle, and positions of objects in semantic maps, such as lane lines, road signs, or light poles, that only take up a small amount of data of the semantic maps. In some embodiments, the self-driving technology described herein includes comparing real-time object (s) information obtained in the images captured by the camera with off-line information associated with the corresponding object (s) in the semantic maps for determining pose information of the self-driving vehicle. For example, methods disclosed herein include object metadata indexing, image transformation (e.g., between a perspective view and a bird's -eye view) , and vanishing points aligning (e.g., lane lines matching) . As such, the methods do not need to consider complex road conditions and driving conditions, and the objects on the road do not need to be parameterized (e.g., represented by parametric equations) , thus improving the robustness and accuracy of the methods or algorithms implementing the methods disclosed herein.
Further, the apparatus and systems described herein do not require expensive hardware. For example, a monocular camera is less expensive than a LiDAR system. Meanwhile, the methods provide a higher accuracy of about 10 cm than a GPS system (e.g.,  several to ten meters) . The methods described herein determine real-time position using objects in the semantic map along with topology information, such as lane lines, road sign outlines, and street light poles, which usually involve a small amount of data, such as 10 KB per kilometer on the map. As such, it is more convenient for data storage, real-time map loading and updating, and data transmission such as via various wireless communication network (s) . Additionally, the methods, apparatus, and systems described herein can estimate six degrees of freedom (6 DOF) pose information of the self-driving vehicle by a decoupling method. For example, one or more of the 6 DOF pose information can be determined separately and independently from other pose information. This decoupling method can provide lower algorithm complexity and lower power consumption, with sufficient accuracy to provide accurate global pose information for navigation and route planning of the self-driving vehicle.
FIG. 1 shows an exemplary environment 100 for applying self-driving technology (also known as autonomous driving or driverless technology) , in accordance with embodiments of the present disclosure. In some embodiments, environment 100 includes a movable object, such as a self-driving vehicle or a vehicle 102 (also known as an autonomous vehicle or driverless vehicle) , that is capable of communicatively connecting to one or more electronic devices including a mobile device 140 (e.g., a mobile phone) , a server 110 (e.g., cloud-based server) , and one or more map provider servers 130-1, ... 130-k via a network 120 in order to exchange information with one another and/or other additional devices and systems. In some embodiments, environment 100 includes a road 101 on which vehicle 102 autonomously moves, and one or more stationary objects on or along road 101 that can be used for determining positions, such as pose information, of vehicle 102. For example, the one or more objects can be used as landmarks that are not moving or changing relative to road 101. The one or more  stationary objects may be included in a map (e.g., a commercial or publicly accessible map of an area) including road 101. Some examples of stationary objects include a road sign 103, a lane line 104, light poles 106, buildings, trees, etc., as shown in FIG. 1. In some embodiments, a driving scene on road 101 in FIG. 1 may be created based on an image captured by a camera onboard vehicle 102 and one or more items of positional information of vehicle 102 relative to the captured image. In some embodiments, the driving scene may be generated based on a map of an area including road 101.
In some embodiments, one or more map provider servers 130-1, ... 130-k may be associated with one or more service providers that can provide map data, such as high definition (HD) maps, used for navigating vehicle 102. In some embodiments, an HD map may include multiple layers of content including a geometric layer, such as a 3D point cloud map representing geometric information of a surrounding environment, and a sematic map layer including various types of traffic-related objects used for navigating vehicle 102, such as lane lines 104, road sign 103, light poles 106, intersections, traffic lights, etc. In some embodiments, in the sematic map layer, the objects contain metadata associated with other parameters associated with respect to objects, such as speed limits, or other restrictions. In some embodiments, the HD map may further include a real-time traffic layer including traffic information such as traffic conditions, speeds, highway checkpoints, etc. The multiple map layers may be aligned in the 3D space to provide detailed navigation information.
In some embodiments, network 120 may be any combination of wired and wireless local area network (LAN) and/or wide area network (WAN) , such as an intranet, an extranet, and the internet. Any suitable communication techniques can be implemented by network 120, such as local area network (LAN) , wide area network (WAN) (e.g., the Internet) , cloud environment,  telecommunications network (e.g., 3G, 4G, 5G) , WiFi, Bluetooth, radiofrequency (RF) , infrared (IR) , or any other communications techniques. In some embodiments, network 120 is capable of providing communications between one or more electronic devices, as discussed in the present disclosure.
In some embodiments, vehicle 102 is capable of transmitting data (e.g., image data, positional data, and/or motion data) detected by one or more sensors onboard vehicle 102, such as a camera 107 (e.g., a monocular camera) , an odometer, and/or inertial measurement unit (IMU) sensors, in real time during movement of vehicle 102, via network 120, to mobile device 140 and/or server 110 that are configured to process the data. For example, camera 107 onboard vehicle 102 may capture images while vehicle 102 moves on road 101, as shown in FIG. 1. In some embodiments, vehicle 102 may retrieve sematic maps from one or more map provider servers 130-1, ... 130-k via network 120, and process the captured images, positional data, and/or motion data to determine the real-time pose information of vehicle 102 during its movement. In some embodiments, vehicle 102 may, while moving, transmit the captured images, positional data, and/or motion data in real-time to mobile device 140 and/or server 110 via network 120 for processing. Mobile device 140 and/or server 110 may obtain semantic map data from one or more map provider servers 130-1, ... 130-k via network 120, and further determine the pose information of vehicle 102. The determined pose information of vehicle 102 can be used to generate instructions for autonomous driving. In some embodiments, the determined pose information of vehicle 102 and the autonomous driving instructions can be communicated in real-time among vehicle 102, mobile device 140, and/or cloud-based server 110 via network 120. For example, the autonomous driving instructions can be transmitted in real time from mobile device 140 and/or cloud-based server 110 to vehicle 102.
In some embodiments, vehicle 102 includes a sensing system which may include one or more onboard sensors (not shown) . For instance, the sensing system may include sensors for determining positional information, velocity information, and acceleration information relating to vehicle 102 and/or target locations or objects (e.g., obstacles) . Components of the sensing system may be configured to generate data and information for use (e.g., processed by an onboard controller or another device in communication with vehicle 102) in determining additional information about vehicle 102, its components, and/or its targets. For example, the sensing system may include sensory devices such as a positioning sensor for a positioning system (e.g., GPS, GLONASS, Galileo, Beidou, GAGAN, RTK, etc. ) , motion sensors, inertial sensors (e.g., IMU sensors, MIMU sensors, etc. ) , proximity sensors, odometer, camera 107, etc. In some embodiments, the sensing system may also include sensors configured to provide data or information relating to the surrounding environment, such as weather information (e.g., temperature, pressure, humidity, etc. ) , lighting conditions (e.g., light-source frequencies) , air constituents, or nearby obstacles (e.g., objects, buildings, trees, people, other vehicles, etc. ) .
In some embodiments, camera 107 is configured to gather data that may be used to generate images or videos of the surrounding environment. As disclosed herein, image data obtained from camera 107 may be processed and compared with object information extracted from a sematic map to determine pose information of vehicle 102. In some embodiments, camera 107 includes a photographic camera, a video camera, an infrared imaging device, an ultraviolet imaging device, an x-ray device, an ultrasonic imaging device, or a radar device. Camera 107 may be a monocular camera. Camera 107 may include a wide-angle lens. In some embodiments, vehicle 102 includes a plurality of cameras that are placed on multiple sides, such as front, rear, left, and right sides, of vehicle 102. The images captured by the cameras facing  different sides of vehicle 102 may be stitched together to form a wide-angle view (e.g., a panoramic view or a 360° view) of the surrounding environment.
In some embodiments, camera 107 may be directly mounted to vehicle 102, such as fixedly connected, fastened, attached, rigidly connected, or placed in another way to be firmly connected and not readily movable relative to vehicle 102. Camera 107 may be aimed in a direction that can capture views of objects on the road, such as lane lines and/or light poles, that can be used for determining pose information of vehicle 102. In some embodiments, camera 107 may be connected or attached to vehicle 102 via a carrier (not shown) , which may allow for one or more degrees of relative movement between camera 107 and vehicle 102. For example, the carrier may be adjustable or movable in accordance with movement of vehicle 102 so as to capture a view including one or more objects used for determining the pose information of vehicle 102 in real time during the movement of vehicle 102. When camera 107 is attached to vehicle 102 via a carrier, a relative position between camera 107 and vehicle 102 can be determined, so that pose information of one of camera 107 and vehicle 102 can be determined based on pose information of the other. In the present disclosure, the position of camera 107 is determined using image (s) captured by camera 107 to determine the pose information of vehicle 102. The position of camera 107 can be used to represent the position of vehicle 102 in computing and generating instructions used for autonomous driving. As such, the position of camera 107 and the position of vehicle 102 may be used interchangeably.
In some embodiments, vehicle 102 includes a communication system 150 that may be configured to enable communication of data, information, autonomous driving instructions, and/or other types of signals between an onboard controller of vehicle 102 and one or more off-board devices, such as mobile device 140, server 110, map provider server (s) 130, or another  suitable entity. Communication system 150 may include one or more onboard components configured to send and/or receive signals, such as receivers, transmitter, or transceivers, that are configured for one-way or two-way communication. The onboard components of communication system 150 may be configured to communicate with off-board devices via one or more communication networks, such as radio, cellular, Bluetooth, Wi-Fi, RFID, and/or other types of communication networks usable to transmit signals indicative of data, information, commands, and/or other signals. For example, communication system 150 may be configured to enable communication with off-board devices, such as server 110 and/or mobile device 140, for providing autonomous driving instructions or other commands (e.g., to override the autonomous driving instructions during an emergency situation) for controlling vehicle 102.
In some embodiments, vehicle 102 includes an onboard controller (not shown) that is configured to communicate with various devices onboard vehicle 102, such as communication system 150, camera 107, and other sensors. The onboard controller may also communicate with a positioning system (e.g., a global navigation satellite system (GNSS) , GPS, or odometer, etc. ) to receive data indicating the location of vehicle 102. The onboard controller may communicate with various other types of devices, including a barometer, an inertial measurement unit (IMU) , a transponder, or the like, to obtain positioning information and velocity information of vehicle 102. The onboard controller may also provide control signals for controlling the movement of vehicle 102. In some embodiments, the onboard controller may include circuits and modules configured to process image data captured by camera 107 and/or perform other functions discussed herein.
It is appreciated that while the movable object is illustrated in the present disclose using vehicle 102 as an example, the movable object could instead be provided as any other  suitable object, device, mechanism, system, or machine configured to travel on or within a suitable medium (e.g., surface, air, water, rails, space, underground, etc. ) . The movable object may also be another type of movable object (e.g., wheeled objects, nautical objects, locomotive objects, other aerial objects, etc. ) . For illustrative purpose, in the present disclosure, vehicle 102 refers to a self-driving vehicle configured to be operated and/or controlled autonomously based on data collected by one or more sensors (e.g., camera 107, IMU, and/or an odometer, etc. ) onboard vehicle 102 and semantic map data (e.g., obtained from map provider server (s) 130) . In some embodiments, although vehicle 102 is operated autonomously, vehicle 102 may be configured to receive manual instructions under certain circumstances (e.g., a dangerous road condition or an emergency situation, etc. ) by an onboard or off-board operator.
In some embodiments, one or more off-board devices, such as server 110 and/or mobile device 140, may be configured to receive and process image (s) captured by camera 107, and other data such as positional data, velocity data, acceleration data, sensory data, and information relating to vehicle 102, its components, and/or its surrounding environment. The off-board device (s) can generate and communicate signals associated with autonomous driving to the onboard controller of vehicle 102. Although not shown, the off-board devices can include a cellular phone, a smartphone, a tablet, a personal digital assistant, a game console, a mobile device, a wearable device, a virtual reality (VR) /augmented reality (AR) headset, a laptop computer, a cloud computing server, or any other suitable computing device. In some embodiments, the off-board device (s) may be configured to perform one or more functionalities or sub-functionalities associated with autonomous driving in addition to or in combination with vehicle 102. For example, server 110 may participate in image processing and algorithm computing to facilitate the process for determining vehicle position. The off-board device (s)  may include one or more communication devices, such as antennas or other devices, configured to send and/or receive signals via network 120 and can sufficiently support real-time communication with vehicle 102 with minimum latency.
In some embodiments, an off-board device (s) , such as mobile device 140, may include a display device (e.g., a display screen 144 that may be a touch screen 144) for displaying information, such as image (s) captured by camera 107, a map received from map provider server 130, and/or signals indicative of information or data relating to movement status of vehicle 102. In some embodiments, the display device may be a multifunctional display device configured to display information as well as receive user input, such as an interactive graphical interface (GUI) for receiving one or more user inputs. In some embodiments, the off-board device (s) , e.g., mobile device 140, may be configured to work in conjunction with a computer application (e.g., an “app” ) to provide an interactive interface for displaying information received from vehicle 102, such as captured (s) images or position data of vehicle 102, in conjunction with a map received from map provider server 130. In some embodiments, server 110 or vehicle 102 may also include a display device configured to display position data and/or navigation path of vehicle 102 in conjunction with a map received from map provider server 130 to show real-time location and movement of vehicle 102 on the map. In some embodiments, the display device may be an integral component, e.g., attached or fixed, to the corresponding device. In other embodiments, display device may be electronically connectable to (and disconnectable from) the corresponding device (e.g., via a connection port or a wireless communication link) and/or otherwise connectable to the corresponding device via a mounting device, such as by a clamping, clipping, clasping, hooking, adhering, or other type of mounting device.
In some embodiments, the off-board device (s) may also include one or more input devices configured to receive input (e.g., audio data containing speech commands, user input on a keyboard or a touch screen, body gestures, eye gaze controls, etc. ) from a user, and generate instructions communicable to the onboard controller of vehicle 102. For example, the off-board device (s) may be used to receive user inputs of other information, such as manual control settings, automated control settings, control assistance settings, and/or photography settings. In some embodiments, the off-board devices can generate instructions based on the user input and transmit the instructions to the onboard controller to manually control vehicle 102 (e.g., to override autonomous driving instructions in emergency) . It is understood that different combinations or layouts of input devices for an off-board device are possible and within the scope of this disclosure.
FIG. 2 shows an exemplary block diagram of an apparatus 200 configured in accordance with embodiments of the present disclosure. In some embodiments, apparatus 200 can be included in one of the devices discussed with reference to FIG. 1, such as vehicle 102, mobile device 140, or server 110. Apparatus 200 includes one or more processors 202 for executing modules, programs, and/or instructions stored in a memory 212 and thereby performing predefined operations, one or more network or other communications interfaces 208, and one or more communication buses 210 for interconnecting these components. Apparatus 200 may also include a user interface 203 comprising one or more input devices 204 (e.g., a keyboard, mouse, touchscreen, microphone, physical sticks, levers, switches, wearable apparatus, touchable display, and/or buttons) and one or more output devices 206 (e.g., a display or speaker) . In some embodiments, when apparatus 200 is included in vehicle 102, apparatus 200  also includes a sensor system 207 onboard vehicle 102, including camera 107, an odometer, a GPS, and/or inertial measurement unit (IMU) sensors, etc. as described herein.
Processors 202 may be any suitable hardware processor, such as an image processor, an image processing engine, an image-processing chip, a graphics-processor (GPU) , a microprocessor, a micro-controller, a central processing unit (CPU) , a network processor (NP) , a digital signal processor (DSP) , an application specific integrated circuit (ASIC) , a field-programmable gate array (FPGA) , or another programmable logic device, discrete gate or transistor logic device, discrete hardware component.
Memory 212 may include high-speed random access memory, such as DRAM, SRAM, or other random access solid state memory devices. In some embodiments, memory 212 includes non-volatile memory, such as one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, or other non-volatile solid state storage devices. In some embodiments, memory 212 includes one or more storage devices remotely located from processor (s) 202. Memory 212, or alternatively one or more storage devices (e.g., one or more nonvolatile storage devices) within memory 212, includes a non-transitory computer readable storage medium. In some implementations, memory 212 or the computer readable storage medium of memory 212 stores one or more computer program instructions (e.g., modules 220) , and a database 240, or a subset thereof that are configured to cause processor (s) 202 to perform one or more steps of processes, as described below with reference to FIGs. 3-9. Memory 212 may also store map data obtained from map provider server (s) 130, and image (s) captured by camera 107.
In some embodiments, memory 212 of apparatus 200 may include an operating system 214 that includes procedures for handling various basic system services and for  performing hardware dependent tasks. Apparatus 200 may further include a network communication module 216 that is used for connecting apparatus 200 to other electronic devices via communication interface 208 and one or more communication networks 120 (wired or wireless) , such as the Internet, other wide area networks, local area networks, metropolitan area networks, etc. as discussed with reference to FIG. 1.
In some embodiments, the modules included in modules 220 each comprise program instructions for execution by processor (s) 202 to perform a variety of functions. More particularly, modules 220 include an image obtaining and processing module 222 configured to receive and process image (s) captured by camera 107 onboard vehicle 102. For example, image obtaining and processing module 222 may be configured to parse the captured image (s) to identify one or more object (s) based on metadata associated with pixels of the object (s) . Image obtaining and processing module 222 may be configured to extract and parameterize visual representations in the captured image (s) , such as lines representing lane lines on the road. Image obtaining and processing module 222 may also be configured to transform a perspective view of a captured image to a bird's eye view, or transform the bird's eye view back to the perspective view. In some embodiments, modules 220 include a map obtaining and processing module 224 configured to receive HD map (s) for autonomous navigation from a map provider server 130, and identify objects from a semantic layer of the HD map corresponding to those found in the captured image (s) . In some embodiments, modules 220 include a position determination module 230, including a plurality of sub-modules for determining multiple groups of pose information separately and independently, including but not limited to a height (z) determination module 231, a pitch and yaw determination module 232, a horizontal pose (x) and roll determination module 233, and a vertical pose (y) determination module 234. In some embodiments, modules 220 include a sensor fusion module 226 configured to perform sensor fusion process (es) on the determined pose information based on sensory data obtained from multiple sensors onboard vehicle 102, such as IMU, odometer, GPS, etc. Sensor fusion includes a process of merging data obtained from different sources to provide a global position result for vehicle 102 with reduced error and noise, and improved accuracy and certainty. For example, latitude, longitude, and altitude of vehicle 102 at different locations and at different time points can be determined based on image data and map semantic data as described for process 300. Meanwhile, GPS data can also be used to track and calculate the latitude, longitude, and altitude of vehicle 102 independently. The latitude, longitude, or altitude, separately or in combination, from these two sets of position data obtained from different sources can be fused correspondingly to provide more robust and accurate latitude, longitude, or altitude values for vehicle 102. Similarly, the attitude information, including roll, yaw, and pitch of vehicle 102, can be determined based on image data and map semantic data as described for process 300. IMU data can also be used to calculate the attitude information. These two sets of attitude data from different sources can be fused to improve accuracy and certainty in determining roll, yaw, or pitch of vehicle 102. Sensor fusion module 226 can apply any suitable algorithms, such as a Central Limit Theorem, a Kalman Filter, a Bayesian network, or a convolutional neural network (CNN) , etc. In some embodiments, modules 220 also include an instruction generation module 228 configured to generate instructions, including autonomous driving instructions and manual control commands, for navigation, path planning, or other functions based on determined positions of vehicle 102.
In some embodiments, database 240 stores map data 242 including semantic object information 244, image data 246 from image (s) captured by camera 107, vehicle control data 248 including  system settings, autonomous driving settings, safety settings etc., and user data 250 including user account information, user activity data, user preference settings, etc.
Details associated with modules 220 and database 240 are further described with reference to example processes shown in FIGs. 3-9 of the present disclosure. It is appreciated that modules 220 and/or database 240 are not limited to the scope of the example processes discussed herein. Modules 220 may further be configured to cause processor (s) 220 to perform other suitable functions, and database 240 may store information needed to perform such other suitable functions.
FIG. 3 shows a block diagram of an exemplary process 300 of determining a position of an autonomous vehicle (e.g., vehicle 102, or camera 107 onboard vehicle 102) , in accordance with embodiments of the present disclosure. In some embodiments, a position determined at a particular time point includes multiple groups of pose information of vehicle 102 at that particular time point. The position may be determined based on image (s) captured by camera 107 (e.g., provided as a monovision camera) onboard the autonomous vehicle and object information included in a semantic map. For purposes of explanation and without limitation, process 300 may be performed by one or more modules 220 and database 240 of apparatus 200 shown in FIG. 2. For example, one or more steps or processes included in process 300 may be performed by vehicle 102, mobile device 140, server 110, or combinations thereof.
In step 302, one or more images of a surrounding environment during movement of vehicle 102 are obtained and processed, for example, by image obtaining and processing module 222. In some embodiments, the one or more images (e.g., an image 146 in FIG. 1) are captured by camera 107 when vehicle 102 is moving on a road (e.g., road 101, FIG. 1) . As shown in image 146 captured by camera 107 and displayed on display device 144, a field of view of  camera 107 may represent a driver's view. For example, the field of view of camera 107 may include one or more objects associated with the road, such as lane lines 104 for controlling and guiding traffic moving on road 101, road sign 103, light poles 106 along road 101, buildings, and/or trees, etc. Accordingly, image 146 includes lane lines 104' and light poles 106' respectively corresponding to lane lines 104 and light poles 106 as shown in FIG. 1. In some embodiments, camera 107 may capture images at a predetermined frequency.
In some embodiments, a captured image can be parsed based on metadata associated with pixels in the image to identify different objects in the image. In some embodiments, each pixel in the captured image may be associated with a label indicating a semantic meaning of a corresponding object. For example, in the captured image, pixels associated with the ground may be marked with a ground category, pixels associated with lane lines may be marked with a lane line category, and pixels associated with light poles may be marked with a light pole category. In some embodiments, image obtaining and processing module 222 may be configured to extract the semantic information associated with the pixels in the captured image, and identify the objects, such as lane lines 104 and light poles 106 in FIG. 1, based on the semantic information for determining positions of vehicle 102. The captured image may be processed by image obtaining and processing module 222 onboard vehicle 102. Additionally or alternatively, the captured image may be transmitted from vehicle 102 to another device shown in FIG. 1, such as server 110 or mobile device 140, in real time for processing by image obtaining and processing module 222 on the corresponding device.
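As a non-limiting illustration, the per-pixel semantic lookup can be sketched as follows, assuming the semantic metadata is available as an integer label image aligned with the captured frame; the label values used here are placeholders.

```python
import numpy as np

# Hypothetical label values carried in the per-pixel semantic metadata.
GROUND, LANE_LINE, LIGHT_POLE = 1, 2, 3

def extract_object_pixels(label_image, category):
    """Return the (row, col) coordinates of all pixels labeled with the
    requested semantic category (e.g., LANE_LINE or LIGHT_POLE)."""
    rows, cols = np.nonzero(label_image == category)
    return np.stack([rows, cols], axis=1)

# Example with a small synthetic label image: one lane-line column.
labels = np.full((4, 6), GROUND, dtype=np.uint8)
labels[:, 2] = LANE_LINE
lane_pixels = extract_object_pixels(labels, LANE_LINE)   # four pixels in column 2
```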
In step 310, based on the parsing result obtained from step 302, the light poles are extracted from the captured image. FIG. 4A shows an exemplary diagrammatic representation of an image 410 captured by camera 107 during movement of vehicle 102, in accordance with  embodiments of the present disclosure. In some embodiments, image 410 includes, among other objects in the surrounding environment, lane lines 404 on the road and light poles 406 along the road. Image obtaining and processing module 222 may be configured to parse image 410 using semantic information associated with pixels of image 410. FIG. 4B shows an example of an exemplary diagrammatic representation of an image 420 processed based on image 410 of FIG. 4A, in accordance with embodiments of the present disclosure. For example, different objects may be grouped and represented using different visual representations in the processed image 420, as shown in FIG. 4B.
In some embodiments, image obtaining and processing module 222 can be configured to extract and parameterize lines 426 corresponding to light poles in image 420 in FIG. 4B. In some embodiments, a Hough transform may be used to analyze the pixels associated with all the light poles to identify and determine parameters associated with lines 426 that most closely represent light poles 406 in image 410. For example, for each pixel associated with the light pole category in the image, it is determined whether there can be one or more lines representing possible light poles passing through the pixel. The possible lines of all pixels associated with light poles may be superimposed. The lines with the greatest overlay values or with overlay values greater than a predetermined threshold, such as lines 426 in image 420, can be extracted as most likely representing light poles 406 in image 410. The parameters of lines 426 can also be obtained in this process. In some embodiments, other objects, such as lane lines or objects in other shapes, can also be extracted and parameterized using the similar process, such as the Hough transform as described herein.
In step 320, the lane lines are extracted from the captured image. FIG. 5A shows an exemplary diagrammatic representation of an image 500 captured by camera 107 during  movement of vehicle 102, in accordance with embodiments of the present disclosure. In some embodiments, image 500 includes, among other objects in the surrounding environment, lane lines 504 on the road, curb lines 501 along the road, light poles 506 along the road, and a road sign 503. In some embodiments, as shown in FIG. 5A, image 500 captured by camera 107 shows a perspective view of the nearby environment of vehicle 102 (e.g., as a driver’s view from the moving vehicle on the road) .
In some embodiments, image obtaining and processing module 222 may be configured to mathematically transform the perspective view of image 500 to a bird’s eye view (also referred to as a top-down view) . In some embodiments, certain calibration parameters associated with vehicle 102 and/or camera 107 may be used during the mathematical transformation, such as coordinates of camera 107 including, but not limited to, a height of camera 107 from the ground and an angle between camera 107 and the ground (e.g., a pitch angle of the camera) . In some examples, the transformation process may include extracting a region of interest from image 500 (e.g., a region of the road including multiple lane lines 504 and/or curb lines 501) , shifting a coordinate system of image 500 to a two-dimensional plane of a top-down view, rotating the image view by applying a matrix multiplication operation to pixels associated with lane lines, and projecting the image onto the two-dimensional plane. It is appreciated that any suitable methods and processes for transforming a perspective view to a bird’s eye view can be applied in the present disclosure.
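A common way to realize this transformation, assuming the region of interest and its top-down destination are known (for example from the camera height and pitch, or from a one-time calibration), is a planar homography. The sketch below uses OpenCV; the point correspondences are illustrative placeholders:

```python
import cv2
import numpy as np

def warp_to_birds_eye(image, src_pts, dst_pts, out_size):
    """Warp a road region from the camera's perspective view to a top-down view.

    src_pts: four pixel corners of the road region of interest in the image.
    dst_pts: where those corners should land in the bird's eye view (a rectangle
             whose scale sets the meters-per-pixel resolution).
    The correspondences implicitly encode the camera height and pitch.
    """
    H = cv2.getPerspectiveTransform(np.float32(src_pts), np.float32(dst_pts))
    return cv2.warpPerspective(image, H, out_size), H

# Illustrative correspondences for a 1280x720 frame (assumed values)
src = [(550, 460), (730, 460), (1180, 700), (100, 700)]
dst = [(300, 0), (980, 0), (980, 720), (300, 720)]
```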
FIG. 5B shows exemplary pixels associated with lane lines in a bird’s eye view (e.g., image 520 in FIG. 5B) transformed from pixels corresponding to lane lines in a perspective view of an image captured by camera 107 (e.g., lane lines 504 in image 500 in FIG. 5A) , in accordance with some embodiments of the present disclosure. Image 520 may further include  pixels associated with curb lines in the bird’s eye view. In some embodiments, lane lines 504 and/or curb lines 501 in image 500 can first be identified using semantic information in the metadata of pixels respectively associated with lane lines and curb lines. A mathematical transformation process can then be applied to the identified pixels. During the transformation, lines that appear to intersect in the perspective view can be parallel in the corresponding bird’s eye view. As shown in FIG. 5B, the transformed and projected pixels associated with the lane lines and curb lines in the bird’s eye view appear to align to form substantially parallel line patterns. In some embodiments, an actual distance between adjacent lines remains the same during the transformation from the perspective view to the bird’s eye view, even though visually the distance may appear to be different depending on the height of the viewer in the bird’s eye view (e.g., the higher the view, the closer the adjacent lines appear) .
In some embodiments, image obtaining and processing module 222 may be configured to extract and parameterize lines corresponding to the lane lines and/or the curb lines in the bird’s eye view 520 in FIG. 5B. In some embodiments, similar methods such as discussed for step 310 for extracting and parametrizing lines 426 for light poles 406 can be used.
FIG. 5C shows exemplary lines 534 extracted and parameterized from pixels corresponding to the lane lines in a bird’s eye view (e.g., image 520) to represent the lane lines in the captured image (e.g., lane lines 504 in captured image 500), in accordance with some embodiments. Similarly, lines 531 may be extracted and parameterized from pixels corresponding to the curb lines. In some embodiments, the Hough transform applied in step 310 can be used to extract lines that are highly likely to correspond to lane lines 504 and/or curb lines 501. For example, for each pixel shown in FIG. 5B, lines going through the pixel and possibly corresponding to lane lines and/or curb lines are obtained. Then the possible lines corresponding to lane lines and/or curb lines for all pixels in image 520 are determined and superimposed. The lines with the greatest overlay values, or with overlay values greater than a predetermined threshold, such as lines 534 and lines 531 in image 530, can be extracted as most closely representative of lane lines 504 and curb lines 501, respectively, in image 500. The parameters of lines 534 and lines 531 can also be obtained in this process. It is beneficial to perform lane line extraction and parameterization in the bird’s eye view because the lines representing lanes are parallel and do not intersect in the distance. As such, determining the locations and parameters of the lane lines and/or curb lines in the bird’s eye view is more accurate, more efficient, and less computationally complex.
In step 330, image obtaining and processing module 222 may be configured to perform a perspective projection to transform the extracted lines in the bird’s eye view to lines in the perspective view. FIG. 5D shows an exemplary diagrammatic representation of an image 540 in a perspective view including lines transformed from extracted lines in the bird’s eye view (e.g., lines 534 and 531 in image 530) corresponding to lane lines (e.g., lane lines 504) and curb lines (e.g., curb lines 501), in accordance with embodiments of the present disclosure. In some embodiments as shown in FIG. 5D, lines 534 in the bird’s eye view are transformed to lines 544 in the perspective view, and lines 531 in the bird’s eye view are transformed to lines 541 in the perspective view in image 540. Image 540 may be substantially similar to captured image 500 in FIG. 5A, with the transformed lines 544 and 541 (e.g., dotted lines 544 and 541) superimposed on the corresponding lane lines 504 and curb lines 501 (e.g., dashed lines 504 and 501). In some embodiments, transformed lines 544 and 541 intersect at a point 550 (also referred to as a vanishing point 550) in the distance in the perspective view 540 in FIG. 5D.
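Once two lane lines are available in the perspective view, the vanishing point is simply their intersection, which is convenient to compute in homogeneous coordinates. A short sketch (pixel values are illustrative):

```python
import numpy as np

def vanishing_point(line_a, line_b):
    """Intersect two image lines, each given by two (x, y) points; returns the
    vanishing point (x, y), or None if the lines are parallel in the image."""
    def homog_line(p, q):
        return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])
    vp = np.cross(homog_line(*line_a), homog_line(*line_b))
    if abs(vp[2]) < 1e-9:
        return None
    return vp[0] / vp[2], vp[1] / vp[2]

# Two lane lines, each described by two points along it (assumed pixels)
print(vanishing_point(((100, 700), (560, 420)), ((1180, 700), (700, 420))))
```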
It is appreciated that the determination of vanishing point 550 as described in steps 320-330 with reference to FIGs. 5A-5D, involving transformation of an image (e.g., captured image 500) or a portion of the image (e.g., a region including lane lines 504 and/or curb lines 501) from a perspective view (e.g., in FIG. 5A) , to a bird’s eye view (e.g., in FIGs. 5B-5C) , then back to a perspective view (e.g., FIG. 5D) , is beneficial. It is more accurate and efficient to extract and parameterize the lines corresponding to lane lines and/or curb lines in the bird’s eye view (e.g., as shown in FIG. 5C) . As a result, the position of the vanishing point (also referred to as the intersection point) of the extracted lines, such as vanishing point 550 in FIG. 5D, can be more accurately and efficiently determined with a less complex computing process.
Referring back to FIG. 3, in step 340, a height value associated with camera 107 (or vehicle 102) can be estimated. In some embodiments, the height value associated with camera 107 (or vehicle 102) is estimated by height determination module 231. The height value of camera 107 may be determined based on the height information of the lane lines and camera calibration parameter information. In some embodiments, the height information of the lane lines (e.g., lane lines 504) may include coordinates of spots on the lane lines (e.g., elevations) in a geographic coordinate system. The geographic coordinate system may be a three-dimensional (3D) reference system in which every location on the Earth is specified by a latitude, a longitude, and an elevation expressed as a set of numbers, letters, or symbols. For example, the height information of portions of lane lines 504 (e.g., within a range of 3 meters from camera 107) in the geographic coordinate system can be extracted from the HD map corresponding to the environment shown in image 500. In some embodiments, a point in a location or environment containing the lane lines 504, such as a point or a location in the current city where vehicle 102 is traveling, can be used as an origin of the geographic coordinate system. Then, three-dimensional (3D) interpolation can be used to identify more data points along or associated with the lane lines, including a point on the ground above which camera 107 is currently located. The height information of such a spot in the geographical coordinate system can be obtained. In some embodiments, a height of the camera is determined based on position data, such as height information, of a predetermined object, such as a lane line, in a captured image.
Further, camera parameters, such as internal or intrinsic parameters, may be used to determine image coordinates of one or more points in the captured image, given the spatial position of the one or more points with reference to the camera. For example, the coordinates of pixels associated with lane lines 504 (e.g., elevations) can be determined in a coordinate system within the captured image 500. Accordingly, the height value (e.g., an elevation) of camera 107 relative to the ground can be determined in the coordinate system within the captured image.
Based on the height value of camera 107 relative to the ground in a camera coordinate system (e.g., a coordinate system having its origin at the camera center) or an image coordinate system (e.g., a coordinate system in which the position of the ground is represented by the horizontal, vertical, and height values of each corresponding pixel in the image), and the height value of the ground in the geographical coordinate system, the height value (e.g., an altitude or an elevation) of camera 107 in the geographic coordinate system can be determined.
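In other words, the camera altitude can be obtained by interpolating the ground elevation under the camera from nearby lane-line points in the HD map and adding the camera's height above the ground. A minimal sketch, assuming the map points are already expressed as (x, y, elevation) in a local frame and using simple inverse-distance weighting for the interpolation:

```python
import numpy as np

def ground_elevation_at(xy, lane_pts_xyz, k=4):
    """Interpolate the ground elevation under the camera from nearby lane-line
    points (x, y, elevation) taken from the HD map, using inverse-distance
    weighting over the k nearest points."""
    pts = np.asarray(lane_pts_xyz, dtype=float)
    d = np.linalg.norm(pts[:, :2] - np.asarray(xy, dtype=float), axis=1)
    idx = np.argsort(d)[:k]
    w = 1.0 / np.maximum(d[idx], 1e-6)
    return float(np.sum(w * pts[idx, 2]) / np.sum(w))

# Assumed sample values: map lane points and camera height above the ground
lane_pts = [(10.0, 0.0, 32.1), (10.0, 3.5, 32.2), (13.0, 0.0, 32.3), (13.0, 3.5, 32.4)]
height_above_ground = 1.6   # meters, from calibration / image geometry
camera_altitude = ground_elevation_at((11.5, 1.7), lane_pts) + height_above_ground
print(round(camera_altitude, 2))
```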
In step 350, pitch and yaw angles associated with vehicle 102 or camera 107 can be estimated. In some embodiments, the pitch and yaw angles are estimated by pitch and yaw determination module 232. In some embodiments as shown in FIG. 3, semantic information from the HD map can be retrieved in step 352 and used in step 350. In some embodiments, semantic information of objects within 200 meters of vehicle 102 is obtained in accordance with the vehicle position information. For example, initial vehicle position information can be determined based on the GPS information of the vehicle, and during movement, the vehicle position can be determined based on sensory data obtained from the IMU and the odometer. After estimating the height value of camera 107 in step 340, the lane lines corresponding to lane lines 504 can be retrieved from the semantic layer of the HD map in step 304. The HD map information may be retrieved from map provider server 130 as shown in FIG. 1. Semantic information of the HD map information may be retrieved by map obtaining and processing module 224 of apparatus 200. The position information of the objects obtained from the HD map can be converted, by map obtaining and processing module 224, from the geographic coordinate system into the camera coordinate system (or the image coordinate system) using one or more camera parameters, including extrinsic parameters and intrinsic parameters such as optical, geometric, and/or digital characteristics of the camera.
FIG. 6 is an exemplary diagrammatic representation of an image 600 for matching the lane lines (e.g., lines 544) extracted from the captured image (e.g., image 500) and the lane lines (e.g., lines 604 and 614) obtained from the HD map data (e.g., in step 352) in the camera coordinate system (or the image coordinate system) , in accordance with some embodiments of the present disclosure. In some embodiments, lines 544 are extracted from the bird’s eye view and transformed to the camera coordinate system or the image coordinate system (e.g., as described with reference to FIGs. 5A-5D and shown as lines 544 in the camera perspective view in FIG. 5D) . In some embodiments,  lines  604 and 614 correspond to lane lines 504 and are identified in an HD map and converted to the camera coordinate system or the image coordinate system using the camera parameters. Further, a vanishing point 605 (or an intersection point)  between the lines 604, or a vanishing point 615 between the lines 614 can be determined in image 600.
In some embodiments, the yaw and pitch angles of camera 107 can affect the positions of the lane lines projected into the camera coordinate system or the image coordinate system in image 600 (e.g., lines 604 and 614) and the position of the corresponding vanishing point (e.g., vanishing points 605 and 615). Therefore, by adjusting the yaw and pitch angles of the camera, the vanishing point of the lane lines projected from the HD map can be adjusted to overlap with the vanishing point determined from the lane lines extracted from the captured image (e.g., as determined in step 330).
As shown in FIG. 6, lines 544 extracted from the bird’s eye view corresponding to lane lines 504 are shown as dotted lines, and the corresponding vanishing point 550 is shown as a solid dot. The lines 604 and 614 projected from the lane lines in the HD map are shown as dash-dotted lines, and the corresponding vanishing point 605 or 615 is shown as an empty circle. For example, lines 604 with vanishing point 605 may be projected from the lane lines obtained from the HD map using an initial pair of pitch and yaw angles (θ_0, ψ_0). It can be determined, as shown in FIG. 6, that vanishing point 605 does not coincide well with vanishing point 550. For example, a deviation between the two points may be above a predetermined threshold. Accordingly, pitch and yaw determination module 232 can iteratively adjust the pitch and yaw angles of camera 107 to gradually align the vanishing points. For example, lines 614 and the corresponding vanishing point 615 may be obtained at pitch and yaw angles of (θ_m, ψ_m). When it is determined that vanishing point 615 substantially coincides with vanishing point 550, it is determined that camera 107 has the current pitch and yaw angles, e.g., (θ_m, ψ_m).
In some embodiments, the vanishing point alignment discussed in step 350 can use various iterative optimization algorithms to achieve efficient convergence. For example, a fast-gradient method may be used for the iterative optimization. In some embodiments, vibration during movement of vehicle 102 may affect the accuracy of the height value. Errors caused by such motions during movement may be considered and compensated for in the calculation of pitch and yaw angles.
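One way to implement this alignment is to note that, for a pinhole camera with intrinsics K, the vanishing point of a road direction d taken from the HD map is the projection of K·R(pitch, yaw)·d, and to minimize its distance to the vanishing point measured in the image. The sketch below uses a generic derivative-free optimizer rather than the fast-gradient method named above; the rotation convention, intrinsics, and initial values are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def rotation(pitch, yaw):
    """Rotation built from pitch (about x) and yaw (about y); roll is handled later."""
    cp, sp, cy, sy = np.cos(pitch), np.sin(pitch), np.cos(yaw), np.sin(yaw)
    rx = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    return rx @ ry

def projected_vanishing_point(K, pitch, yaw, road_dir):
    """Vanishing point of the HD-map lane direction for a candidate pitch/yaw."""
    v = K @ rotation(pitch, yaw) @ road_dir
    return v[:2] / v[2]

def align_pitch_yaw(K, road_dir, vp_observed, init=(0.0, 0.0)):
    """Find the pitch/yaw that bring the projected and observed vanishing
    points into coincidence (within the optimizer's tolerance)."""
    cost = lambda t: float(np.sum(
        (projected_vanishing_point(K, t[0], t[1], road_dir) - vp_observed) ** 2))
    return minimize(cost, np.array(init), method="Nelder-Mead").x

# Assumed intrinsics, road direction (camera z-forward), and measured point
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
print(align_pitch_yaw(K, np.array([0.0, 0.0, 1.0]), np.array([655.0, 340.0])))
```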
Referring back to FIG. 3, in step 360, a roll angle and a horizontal position (e.g., a lateral position) of vehicle 102 or camera 107 can be estimated. In some embodiments, the roll angle and the horizontal position are estimated by roll and horizontal pose determination module 233. In some embodiments, the horizontal position may correspond to a relative position of vehicle 102 or camera 107 along a direction transverse to the direction of the road. In some embodiments, the roll angle (φ) and the horizontal position (x) may be determined using a two-layer exhaustive search (e.g., a brute-force search) algorithm for matching one or more objects that have been identified from the captured image with corresponding objects retrieved from the HD map in the bird’s eye view. For example, lines 534 corresponding to lane lines 504 are extracted from image 500 captured by camera 107 and projected in the bird’s eye view image 530 in FIG. 5C. In some embodiments, lines corresponding to lane lines 504 can be obtained from the HD map and then projected to the bird’s eye view in the same coordinate system of image 530 using an initial roll angle and horizontal position (φ = φ_0, x = x_0) of camera 107. Initially, the two sets of lines may deviate from each other significantly. By using the exhaustive search algorithm, roll and horizontal pose determination module 233 can change the roll angle and the horizontal position of the camera to adjust the locations of the lines obtained from the HD map and projected in the bird’s eye view. When these lines coincide with lines 534 in the bird’s eye view, it is determined that camera 107 has the corresponding roll angle and horizontal position (φ = φ_m, x = x_m).
In some embodiments, the exhaustive search algorithm may use a dynamic model of the vehicle body as a constraint and assume that the position deviation between two image frames (e.g., captured 50 ms apart) does not exceed 30 cm. The search range of the lateral pose may be set to ±50 cm, ±30 cm, ±20 cm, ±10 cm, etc., and the search resolution may be 10 cm, 5 cm, 2 cm, 1 cm, etc. The range of roll variation may be ±5.0 degrees, ±2.0 degrees, ±1.0 degree, ±0.5 degree, etc., and the resolution may be 0.1 degree, 0.2 degree, 0.3 degree, etc. It is appreciated that the above parameters are only examples for illustrative purposes. Different search ranges and resolutions can be set to achieve more accurate pose estimation.
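A minimal sketch of such a two-layer search is shown below, assuming a helper project_map_to_bev(roll_deg, x_offset) that projects the HD-map lane lines into the same bird's eye frame as the extracted lanes (that helper, the matching cost, and the default ranges are assumptions mirroring the example values above):

```python
import numpy as np

def grid_search_roll_lateral(extracted_bev_pts, project_map_to_bev,
                             roll_range=0.5, roll_step=0.1,
                             x_range=0.30, x_step=0.02):
    """Exhaustively search roll (degrees) and lateral offset (meters) for the
    candidate whose projected HD-map lane lines best match the lanes extracted
    from the bird's eye view, using a chamfer-style nearest-point cost."""
    best_roll, best_x, best_cost = 0.0, 0.0, np.inf
    for roll in np.arange(-roll_range, roll_range + 1e-9, roll_step):
        for x_off in np.arange(-x_range, x_range + 1e-9, x_step):
            proj = project_map_to_bev(roll, x_off)          # (N, 2) points
            d = np.linalg.norm(proj[:, None, :] - extracted_bev_pts[None, :, :], axis=2)
            cost = d.min(axis=1).mean()
            if cost < best_cost:
                best_roll, best_x, best_cost = roll, x_off, cost
    return best_roll, best_x
```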
In step 370, a vertical value (y) of camera 107 or vehicle 102 may be estimated. In some embodiments, vertical value (y) may correspond to a relative position of camera 107 or vehicle 102 at a point along the direction of the road. In some embodiments, vertical value (y) can be determined by vertical pose determination module 234. In some embodiments, light poles along the road may be used for estimating the vertical value (y) . It is appreciated that while light poles are used as examples for illustrating the process of determining the vertical value (y) , other objects (e.g., road signs, buildings, etc. ) that can be captured by the camera and retrieved from the HD map can also be used for determining the vertical value.
FIG. 7 is an exemplary diagrammatic representation of an image 700 for matching the light poles (e.g., light poles 706) extracted from the captured image and the light poles (e.g., lines 716) obtained from the HD map data and projected into the camera view image 700, in accordance with some embodiments of the present disclosure. During automatic driving, it is generally assumed that the head of vehicle 102 consistently points in the direction in which the  road extends. The vertical positioning of the vehicle may be determined relative to the position of objects along the road, such as street light poles and road signs, etc. In some embodiments, the street light poles can be extracted from captured image 700 using the method described in step 310. For example, pixels in the image corresponding to the light poles can be identified based on the metadata of the pixels. In some embodiments, position information of light poles corresponding to light poles 706 can be retrieved from the semantic layer of the HD map in step 372. The identified light poles 706 from the captured image 700 can then be matched with corresponding light poles extracted from the HD map in step 372 to estimate the vertical position.
In some embodiments, similar to step 360, the exhaustive search (e.g., brute-force search) algorithm can be used to match the light poles. In some embodiments, lines corresponding to light poles 706 can be obtained from the HD map and then projected onto the camera view in image 700 using an initial vertical value (y = y_0) of camera 107. Initially, the projected light poles may be significantly misaligned with light poles 706 in captured image 700. By using the exhaustive search algorithm, vertical pose determination module 234 can change the vertical position of the camera to adjust the locations of the light poles obtained from the HD map and projected in the camera view. When the projected light poles, e.g., light poles 716, substantially coincide with light poles 706 in the camera view, it is determined that camera 107 has the corresponding vertical position (y = y_m). In some embodiments, a latitude and a longitude of vehicle 102 in the geographic coordinate system can further be determined based on its horizontal position (x) and vertical position (y) relative to a certain point in image 700, where this point has a latitude and a longitude in the geographic coordinate system that are known from the HD map.
In some embodiments, the exhaustive search algorithm may use a dynamic model of the vehicle body and wheel speed information as constraints to achieve convergence by searching in a small range. For example, the estimated search range may be ± 100 cm, ± 70 cm, ± 50 cm, ± 30 cm, etc., and the search step may be 20 cm, 10 cm, 5 cm, etc.
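The longitudinal search is the one-dimensional analogue of the previous step. A brief sketch, again assuming a helper that projects the HD-map light poles into the image for a candidate offset:

```python
import numpy as np

def search_longitudinal(pole_pixels_observed, project_poles_to_image,
                        y_range=0.5, y_step=0.05):
    """Pick the longitudinal offset (meters) for which the projected HD-map
    light poles land closest to the poles observed in the captured image."""
    best_y, best_cost = 0.0, np.inf
    for y_off in np.arange(-y_range, y_range + 1e-9, y_step):
        proj = project_poles_to_image(y_off)                # (N, 2) pixels
        d = np.linalg.norm(proj[:, None, :] - pole_pixels_observed[None, :, :], axis=2)
        cost = d.min(axis=1).mean()
        if cost < best_cost:
            best_y, best_cost = y_off, cost
    return best_y
```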
Conventionally, the position information of the vehicle, including the 6 DOF parameters, is determined together in a coupled manner. That is, the 6 DOF parameters are dependent on each other and determined in one process. As a result, the calculation process is slow and requires substantial computing power. The present disclosure uses separate and independent groupings and optimizes the parameters in each individual group (as discussed for process 300). Accordingly, one or two parameters in a group can be determined independently and separately, requiring a less complex computation process involving a smaller volume of data. Further, the 6 DOF parameters can be determined using the semantic information of the objects in the HD map with high precision, e.g., at centimeter level, in process 300. As a result, the calculation process described in the present disclosure can be more accurate, more efficient, and more cost-effective, without requiring complex and expensive computing resources.
In step 380, pose information (e.g., in 6 DOF, including height (z), pitch (θ), yaw (ψ), roll (φ), horizontal pose (x), and vertical pose (y)) in different groups determined separately in steps 340, 350, 360, and 370, and sensory data obtained from multiple sensors onboard vehicle 102, can be merged together to obtain global pose information of camera 107 or vehicle 102. For example, as shown in steps 382 and 384, sensory data can be obtained from the IMU (e.g., vehicle inertial navigation data) and the odometer (e.g., wheel speed data), respectively, to obtain angular acceleration, angular speed, and vehicle speed during vehicle movement. Other sensory data, such as GPS data, can also be obtained. In some embodiments, sensor fusion module 226 of apparatus 200 may be used for performing sensor fusion in step 380. In some embodiments, sensor fusion module 226 can use any suitable algorithm, such as a Kalman Filter (e.g., an Extended Kalman Filter or an Error-State Kalman Filter (ESKF)), for performing sensor fusion to obtain a more stable and accurate output in step 390. For example, as explained above, sensor fusion module 226 can apply any suitable algorithm, such as a Central Limit Theorem, a Kalman Filter, a Bayesian network, or a convolutional neural network (CNN), to the pose data and the sensory data to obtain global position results of vehicle 102 with improved accuracy. In some embodiments, the global position of vehicle 102 output in step 390 can be used, by instruction generation module 228, to generate instructions for navigating autonomous vehicle 102. Because the sensor fusion process is performed on data obtained from disparate sources, such as pose information calculated based on image data and map semantic data, and sensory data detected by sensors (e.g., GPS, IMU), the determination of pose information according to process 300 is robust to noise and errors from the individual sensors. After performing sensor fusion in step 380, the global position of vehicle 102 output in step 390 is more accurate and robust than that of conventional methods.
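To illustrate the flavor of this fusion step without reproducing a full 6-DOF error-state filter, the toy filter below fuses a single position coordinate from the vision/map pipeline with wheel-speed odometry using a standard Kalman predict/update cycle; the state layout and noise values are assumptions:

```python
import numpy as np

class SimplePoseFuser:
    """Toy 1-D Kalman filter: vision/map position measurements corrected by an
    odometry-driven constant-velocity prediction."""

    def __init__(self, q=0.05, r=0.10):
        self.x = np.zeros(2)           # state: [position, velocity]
        self.P = np.eye(2)             # state covariance
        self.Q = q * np.eye(2)         # process noise (odometry/IMU uncertainty)
        self.R = np.array([[r]])       # measurement noise (vision/map pose)

    def predict(self, dt, wheel_speed):
        F = np.array([[1.0, dt], [0.0, 1.0]])
        self.x = F @ self.x
        self.x[1] = wheel_speed        # velocity taken from the odometer
        self.P = F @ self.P @ F.T + self.Q

    def update(self, measured_position):
        H = np.array([[1.0, 0.0]])
        innovation = measured_position - H @ self.x
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.x = self.x + (K @ innovation).ravel()
        self.P = (np.eye(2) - K @ H) @ self.P

fuser = SimplePoseFuser()
fuser.predict(dt=0.05, wheel_speed=10.0)   # 50 ms step at 10 m/s
fuser.update(measured_position=0.52)       # position from process 300
print(fuser.x)
```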
FIG. 8 is an exemplary map 800 generated and updated based on the real-time position of vehicle 102 determined during movement of vehicle 102, in accordance with some embodiments of the present disclosure. In some embodiments, map 800 includes lane lines 804, light poles 806, and/or other objects associated with the road extracted from the semantic layer of the HD map and projected into a bird’s eye view, as shown in FIG. 8. Other information, such as traffic information from the real-time traffic layer of the HD map, can also be obtained to generate map 800. In some embodiments, the position of vehicle 102 determined in process 300 can be shown and updated in real time in map 800. In some embodiments, map 800 can also show real-time autonomous driving information related to route planning, obstacle avoidance, user notification, etc. In some embodiments, map 800 may also include a GUI for receiving user input, such as adding a stop or changing a destination by selecting a location on map 800, to provide an interactive user experience.
FIG. 9 shows a flow diagram of an exemplary process 900 of determining a position of camera 107 or vehicle 102, in accordance with some embodiments of the present disclosure. For purposes of explanation and without limitation, process 900 may be performed by one or more modules 220 and database 240 of apparatus 200 shown in FIG. 2. For example, one or more steps of process 900 may be performed by modules in vehicle 102, mobile device 140, server 110, or combinations thereof. In some examples, vehicle position can be determined by modules onboard vehicle 102 based on image data captured by camera 107 onboard vehicle 102. Instructions for autonomous driving can also be generated onboard vehicle 102. In some other examples, image data captured by camera 107 can be transmitted to mobile device 140 or server 110, and vehicle position can be determined by mobile device 140 or server 110 and transmitted to vehicle 102 in real time.
In step 910, one or more objects, such as lane lines ( lane lines  104, 404, 504, 704, 804) and/or light poles ( light poles  106, 406, 506, 706, 806) , are identified in an image (e.g.,  image  410, 500, 700) captured by a camera (e.g., camera 107) onboard a vehicle (e.g., vehicle 102) during movement of the vehicle. In some embodiments, the captured image includes at least a portion of an environment surrounding the vehicle during the movement. In some embodiments, the one or more objects in the captured image are identified in accordance with semantic metadata of pixels associated with the one or more objects in the captured image. In some  embodiments, camera 107 is a monovision camera, and the image captured by the camera is a monocular image.
In step 920, position data associated with one or more predetermined objects corresponding to the one or more objects identified in the captured image (in step 910) is retrieved from a map (e.g., an HD map retrieved from map provider server 130) of the environment. For example, location information of the predetermined objects, such as lane lines and light poles, can be obtained from the HD map of the environment (e.g., from the semantic layer). In some embodiments, the map of the environment in the vicinity of vehicle 102 may be retrieved in accordance with a location of vehicle 102, which can be determined based on sensor data obtained from sensors (e.g., IMU, odometer, GPS, etc.) onboard vehicle 102. In some embodiments, the HD map can be obtained from different sources (e.g., any map provider server 130) in various suitable data formats. The map data may be requested and fetched using API calls, for example as sketched below. In some embodiments, the one or more predetermined objects in the map include a plurality of predefined linear objects in the environment where the vehicle moves. In some embodiments, the plurality of predefined linear objects include a plurality of predefined lines, such as lane lines, on a road on which the vehicle moves. In some embodiments, the one or more predetermined objects in the map include a pre-established object on a side of the road, such as light poles, road signs, buildings, etc., along the road. The one or more predetermined objects may also be above the road, such as traffic lights, etc. The objects used for determining the vertical pose (y) (e.g., in step 370) are preferably not blocked by moving vehicles on the road.
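As a purely illustrative example of such an API call, the snippet below requests the semantic layer around a rough GPS fix over HTTP; the endpoint, parameters, and response format are hypothetical and do not correspond to any particular map provider:

```python
import requests

def fetch_semantic_layer(lat, lon, radius_m=200,
                         endpoint="https://example-map-provider.invalid/api/v1/semantic"):
    """Fetch lane lines and light poles around the vehicle's rough position.
    The endpoint and schema are hypothetical placeholders."""
    resp = requests.get(endpoint, params={
        "lat": lat, "lon": lon, "radius": radius_m,
        "layers": "lane_lines,light_poles",
    }, timeout=1.0)
    resp.raise_for_status()
    return resp.json()
```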
In step 930, one or more pose information items associated with the camera (camera 107) or the vehicle (e.g., vehicle 102) , such as 6 DOF parameters including height (z) , pitch (θ) , yaw (ψ) , roll (φ) , horizontal pose (x) , and vertical pose (y) can be determined. In some  embodiments, the pose information items may be determined in accordance with matching the one or more objects identified in the captured image (e.g., in step 910) with the corresponding one or more predetermined objects retrieved from the map (e.g., in step 920) .
In some embodiments, a height of the camera may be determined (e.g., as described in step 340) based on position data (e.g., height information in the geographical coordinate system) of a predetermined object (e.g., a linear object, such as a lane line) obtained from the map and position data (e.g., height information in the camera view system) of the corresponding object (e.g., lane lines) extracted from the captured image. In some embodiments, a height of the camera may be determined based on position data (e.g., height information) of a predetermined object (e.g., a linear object such as a lane line) in the captured image. In some embodiments, a height of the camera may be determined based on position data of a predetermined object in the captured image and one or more parameters of the camera. The predetermined object may include a linear object, such as a lane line, and the one or more parameters may include an optical parameter, a preset pose parameter, or a calibration parameter of the camera. In some embodiments, a height of the camera may be determined based on one or more parameters of the camera as discussed herein and a height of a road on which the image is captured, where the height of the road is determined based on position data of an object in the image.
In some embodiments, a yaw angle and a pitch angle of the camera may be determined (e.g., as described in step 350) in accordance with matching a first vanishing point (e.g., vanishing point 550) associated with two road lanes (e.g., lane lines 544) in the captured image and a second vanishing point (e.g., vanishing point 615) associated with two predefined lines in the map corresponding to the two road lanes. A vanishing point may be associated with an intersection of two lines. In some embodiments, the first vanishing point may be determined  by determining positions of the two road lanes (e.g., lines 534 in FIG. 5C) extracted from a bird’s eye view image (e.g., image 530) transformed from the captured image (e.g., image 500) , and determining the first vanishing point in a perspective view (e.g., image 540 or 600) of the camera transformed from the bird’s eye view image. In some embodiments, the second vanishing point may be determined based on projecting position data associated with the two predefined lines in the map onto the perspective view of the camera, such as lines 614 in image 600.
In some embodiments, two road lanes may be fitted based on data obtained from the captured image and/or the map. For example, semantic data associated with the road lanes may be obtained from the map for fitting the two road lanes. In another example, pixel values and associated position information corresponding to the road lanes may be extracted from the captured image for fitting the two road lanes. In some embodiments, positions of the two road lanes can be determined based on the fitting result, and the first vanishing point can be further determined based on the positions of the two road lanes.
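One simple way to fit a road lane from its extracted pixels is a least-squares polynomial fit in the bird's eye view, where the lane is nearly vertical and is well conditioned as x expressed as a function of y. A sketch with assumed pixel values:

```python
import numpy as np

def fit_lane(pixels_xy, degree=1):
    """Least-squares fit of a lane line to its pixel coordinates in the
    bird's eye view, modeled as x = f(y); returns polynomial coefficients."""
    pts = np.asarray(pixels_xy, dtype=float)
    return np.polyfit(pts[:, 1], pts[:, 0], degree)

# Two lanes fitted from their extracted pixels (assumed values); their fitted
# positions can then be used to locate the first vanishing point as above.
left = fit_lane([(300, 0), (302, 240), (301, 480), (300, 720)])
right = fit_lane([(980, 0), (978, 240), (979, 480), (980, 720)])
print(left, right)
```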
In some embodiments, a horizontal position and/or a roll angle of the camera may be determined (e.g., as described in step 360) in accordance with matching one or more linear objects, such as one or more lanes (e.g., lines 534 in the bird’s eye view 530 in FIG. 5C), extracted from the captured image with one or more corresponding predefined linear objects, such as lines on the road in the map (e.g., extracted from the semantic layer of the HD map and projected to the bird’s eye view).
In some embodiments, a vertical position of the camera may be determined (e.g., as described in step 370) in accordance with matching one or more vertical objects, such as facility or utility objects, e.g., light poles (e.g., poles 706), fences, trees, buildings, footbridges, etc., in the captured image (e.g., image 700) with one or more of the pre-established objects on the side of the road in the map (e.g., lines 716 extracted from the HD map and projected onto the perspective view 700).
In some embodiments, after obtaining the 6 DOF parameters including height (z) , pitch (θ) , yaw (ψ) , roll (φ) , horizontal pose (x) , and vertical pose (y) , sensor fusion may be performed (e.g., as described in step 380) to merge the determined one or more pose information items (e.g., the 6 DOF parameters) of the camera. In some embodiments, sensory data obtained from sensors onboard the vehicle, such as IMU, odometer, etc., can also be merged by the sensor fusion process to obtain a more accurate global position of camera 107 or vehicle 102. In some embodiments, instructions may be generated to operate the vehicle in real time, e.g., including autonomous driving, route planning/updating, obstacle avoidance, providing user notification, etc., during the movement of the vehicle based on the determined one or more pose information items.
In some embodiments, the map information (e.g., information associated with various predetermined objects, such as semantic information and/or position information) as discussed herein may be collected by sensors on vehicle 102 and/or on one or more other vehicles different from the vehicle 102, such as a vehicle used for road calibration, road construction, road maintenance, and/or road mapping. The map information can also be downloaded prior to the trip or in real time from any suitable network 120, for example, from one or more other vehicles previously or currently travelling on the corresponding segments of the road. The map information may be shared and/or pushed to multiple vehicles periodically. In some embodiments, the map information may be updated (e.g., by iteration, correction, editing, replacement, overwriting, etc. ) via interaction among systems of one or more vehicles running on  corresponding segments of the road. For example, if one vehicle fails to accurately collect certain information of a segment of the road, the vehicle can report such failure and broadcast to or notify one or more other vehicles to collect the missing information with sufficient accuracy to be used for the map information. In some embodiments, additionally or alternatively, the map information may be collected by any other suitable type of movable object, such as an unmanned aerial vehicle (UAV) .
It is to be understood that the disclosed embodiments are not necessarily limited in their application to the details of construction and the arrangement of the components set forth in the following description and/or illustrated in the drawings and/or the examples. The disclosed embodiments are capable of variations, or of being practiced or carried out in various ways. The types of user control as discussed in the present disclosure can be equally applied to other types of movable objects or any suitable object, device, mechanism, system, or machine configured to travel on or within a suitable medium, such as a surface, air, water, rails, space, underground, etc. It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed devices and systems. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed devices and systems. It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims (63)

  1. A method, comprising:
    identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle, the image including at least a portion of an environment surrounding the vehicle during the movement;
    retrieving position data associated with one or more predetermined objects from a map of the environment, the one or more predetermined objects corresponding to the one or more objects identified in the captured image; and
    determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
  2. The method of claim 1, wherein the one or more predetermined objects in the map include a plurality of predefined linear objects in the environment where the vehicle moves.
  3. The method of claim 1, wherein the one or more predetermined objects in the map include a plurality of predefined lines on a road on which the vehicle moves.
  4. The method of claim 1, wherein the one or more predetermined objects in the map include a pre-established object on a side of the road or above the road.
  5. The method of claim 1, wherein the one or more objects in the captured image are identified in accordance with semantic metadata of pixels associated with the one or more objects in the captured image.
  6. The method of claim 1, wherein the one or more pose information items include a height, a yaw angle, a pitch angle, a horizontal position, a vertical position, and a roll angle of the camera.
  7. The method of claim 1, wherein the one or more pose information items associated with the camera are determined in accordance with matching a first vanishing point associated with two linear objects in the captured image with a second vanishing point associated with two predefined lines in the map corresponding to the two linear objects.
  8. The method of claim 1, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of an object in the captured image, the object including a linear object.
  9. The method of claim 1, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of an object in the captured image, and one or more parameters of the camera, wherein the object includes a linear object, and the one or more parameters include an optical parameter or a preset pose parameter of the camera.
  10. The method of claim 1, further comprising:
    determining a height of a road on which the image is captured based on position data of an object in the image,
    wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on the height of the road and one or more parameters of the camera.
  11. The method of claim 1, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of a predetermined object from the map and position data of the corresponding object in the captured image.
  12. The method of claim 1, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    matching a first vanishing point associated with two road lanes in the captured image and a second vanishing point associated with two predefined lines in the map corresponding to the two road lanes; and
    determining a yaw angle and/or a pitch angle of the camera in accordance with matching the first vanishing point and the second vanishing point.
  13. The method of claim 12, further comprising:
    determining the first vanishing point by:
    determining positions of the two road lanes in a bird's eye view image transformed from the captured image; and
    determining the first vanishing point in a perspective view of the camera transformed from the bird's eye view image.
  14. The method of claim 12, further comprising:
    fitting the two road lanes based on data obtained from the captured image and/or the map;
    determining positions of the two road lanes based on the fitting result; and
    determining the first vanishing point based on the positions of the two road lanes.
  15. The method of claim 13, wherein the second vanishing point is determined based on projecting position data associated with the two predefined lines in the map to the perspective view of the camera.
  16. The method of claim 1, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    determining a horizontal position and/or a roll angle of the camera in accordance with matching one or more linear objects in the captured image with one or more corresponding predefined linear objects on the road in the map.
  17. The method of claim 1, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    determining a vertical position of the camera in accordance with matching one or more vertical objects in the captured image with one or more of the pre-established objects on the side of the road in the map.
  18. The method of claim 1, wherein the image captured by the camera is a monocular image.
  19. The method of claim 1, wherein the map of the environment is obtained in accordance with a location of the vehicle detected by one or more sensors.
  20. The method of claim 1, further comprising:
    performing a sensor fusion of the determined one or more pose information items of the camera based on corresponding sensory data detected by one or more sensors onboard the vehicle using a Kalman Filter algorithm.
  21. The method of claim 1, further comprising:
    generating instructions to operate the vehicle in real time during the movement of the vehicle based on the determined one or more pose information items.
  22. An apparatus communicatively coupled to a vehicle, comprising:
    one or more processors; and
    memory coupled to the one or more processors and storing instructions that, when executed by the one or more processors, cause the apparatus to:
    identify one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle, the image including at least a portion of an environment surrounding the vehicle during the movement;
    retrieve position data associated with one or more predetermined objects from a map of the environment, the one or more predetermined objects corresponding to the one or more objects identified in the captured image; and
    determine one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
  23. The apparatus of claim 22, wherein the one or more predetermined objects in the map include a plurality of predefined linear objects in the environment where the vehicle moves.
  24. The apparatus of claim 22, wherein the one or more predetermined objects in the map include a plurality of predefined lines on a road on which the vehicle moves.
  25. The apparatus of claim 22, wherein the one or more predetermined objects in the map include a pre-established object on a side of the road or above the road.
  26. The apparatus of claim 22, wherein the one or more objects in the captured image are identified in accordance with semantic metadata of pixels associated with the one or more objects in the captured image.
  27. The apparatus of claim 22, wherein the one or more pose information items include a height, a yaw angle, a pitch angle, a horizontal position, a vertical position, and a roll angle of the camera.
  28. The apparatus of claim 22, wherein the one or more pose information items associated with the camera are determined in accordance with matching a first vanishing point associated with two linear objects in the captured image with a second vanishing point associated with two predefined lines in the map corresponding to the two linear objects.
  29. The apparatus of claim 22, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of an object in the captured image, the object including a linear object.
  30. The apparatus of claim 22, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of an object in the captured image, and one or more parameters of the camera, wherein the object includes a linear object, and the one or more parameters include an optical parameter or a preset pose parameter of the camera.
  31. The apparatus of claim 22, wherein the memory further stores instructions for:
    determining a height of a road on which the image is captured based on position data of an object in the image,
    wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on the height of the road and one or more parameters of the camera.
  32. The apparatus of claim 22, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of a predetermined object from the map and position data of the corresponding object in the captured image.
  33. The apparatus of claim 22, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    matching a first vanishing point associated with two road lanes in the captured image and a second vanishing point associated with two predefined lines in the map corresponding to the two road lanes; and
    determining a yaw angle and/or a pitch angle of the camera in accordance with matching the first vanishing point and the second vanishing point.
  34. The apparatus of claim 33, wherein the memory further stores instructions for:
    determining the first vanishing point by:
    determining positions of the two road lanes in a bird's eye view image transformed from the captured image; and
    determining the first vanishing point in a perspective view of the camera transformed from the bird's eye view image.
  35. The apparatus of claim 33, wherein the memory further stores instructions for:
    fitting the two road lanes based on data obtained from the captured image and/or the map;
    determining positions of the two road lanes based on the fitting result; and
    determining the first vanishing point based on the positions of the two road lanes.
  36. The apparatus of claim 34, wherein the second vanishing point is determined based on projecting position data associated with the two predefined lines in the map to the perspective view of the camera.
  37. The apparatus of claim 22, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    determining a horizontal position and/or a roll angle of the camera in accordance with matching one or more linear objects in the captured image with one or more corresponding predefined linear objects on the road in the map.
  38. The apparatus of claim 22, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    determining a vertical position of the camera in accordance with matching one or more vertical objects in the captured image with one or more of the pre-established objects on the side of the road in the map.
  39. The apparatus of claim 22, wherein the image captured by the camera is a monocular image.
  40. The apparatus of claim 22, wherein the map of the environment is obtained in accordance with a location of the vehicle detected by one or more sensors.
  41. The apparatus of claim 22, wherein the memory further stores instructions for:
    performing a sensor fusion of the determined one or more pose information items of the camera based on corresponding sensory data detected by one or more sensors onboard the vehicle using a Kalman Filter algorithm.
  42. The apparatus of claim 22, wherein the memory further stores instructions for:
    generating instructions to operate the vehicle in real time during the movement of the vehicle based on the determined one or more pose information items.
  43. A non-transitory computer-readable medium with instructions stored thereon, that when executed by a processor, cause the processor to perform operations comprising:
    identifying one or more objects in an image captured by a camera onboard a vehicle during movement of the vehicle, the image including at least a portion of an environment surrounding the vehicle during the movement;
    retrieving position data associated with one or more predetermined objects from a map of the environment, the one or more predetermined objects corresponding to the one or more objects identified in the captured image; and
    determining one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map.
  44. The non-transitory computer-readable medium of claim 43, wherein the one or more predetermined objects in the map include a plurality of predefined linear objects in the environment where the vehicle moves.
  45. The non-transitory computer-readable medium of claim 43, wherein the one or more predetermined objects in the map include a plurality of predefined lines on a road on which the vehicle moves.
  46. The non-transitory computer-readable medium of claim 43, wherein the one or more predetermined objects in the map include a pre-established object on a side of the road or above the road.
  47. The non-transitory computer-readable medium of claim 43, wherein the one or more objects in the captured image are identified in accordance with semantic metadata of pixels associated with the one or more objects in the captured image.
  48. The non-transitory computer-readable medium of claim 43, wherein the one or more pose information items include a height, a yaw angle, a pitch angle, a horizontal position, a vertical position, and a roll angle of the camera.
  49. The non-transitory computer-readable medium of claim 43, wherein the one or more pose information items associated with the camera are determined in accordance with matching a first vanishing point associated with two linear objects in the captured image with a second vanishing point associated with two predefined lines in the map corresponding to the two linear objects.
  50. The non-transitory computer-readable medium of claim 43, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of an object in the captured image, the object including a linear object.
  51. The non-transitory computer-readable medium of claim 43, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of an object in the captured image, and one or more parameters of the camera, wherein the object includes a linear object, and the one or more parameters include an optical parameter or a preset pose parameter of the camera.
  52. The non-transitory computer-readable medium of claim 43, further storing instructions for:
    determining a height of a road on which the image is captured based on position data of an object in the image,
    wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on the height of the road and one or more parameters of the camera.
  53. The non-transitory computer-readable medium of claim 43, wherein determining the one or more pose information items associated with the camera comprises:
    determining a height of the camera based on position data of a predetermined object from the map and position data of the corresponding object in the captured image.
  54. The non-transitory computer-readable medium of claim 43, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    matching a first vanishing point associated with two road lanes in the captured image and a second vanishing point associated with two predefined lines in the map corresponding to the two road lanes; and
    determining a yaw angle and/or a pitch angle of the camera in accordance with matching the first vanishing point and the second vanishing point.
  55. The non-transitory computer-readable medium of claim 54, further storing instructions for:
    determining the first vanishing point by:
    determining positions of the two road lanes in a bird's eye view image transformed from the captured image; and
    determining the first vanishing point in a perspective view of the camera transformed from the bird's eye view image.
  56. The non-transitory computer-readable medium of claim 54, further storing instructions for:
    fitting the two road lanes based on data obtained from the captured image and/or the map;
    determining positions of the two road lanes based on the fitting result; and
    determining the first vanishing point based on the positions of the two road lanes.
  57. The non-transitory computer-readable medium of claim 55, wherein the second vanishing point is determined based on projecting position data associated with the two predefined lines in the map to the perspective view of the camera.
  58. The non-transitory computer-readable medium of claim 43, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    determining a horizontal position and/or a roll angle of the camera in accordance with matching one or more linear objects in the captured image with one or more corresponding predefined linear objects on the road in the map.
  59. The non-transitory computer-readable medium of claim 43, wherein determining the one or more pose information items associated with the camera in accordance with matching the one or more objects in the captured image with the corresponding one or more predetermined objects in the map comprises:
    determining a vertical position of the camera in accordance with matching one or more vertical objects in the captured image with one or more of the pre-established objects on the side of the road in the map.
  60. The non-transitory computer-readable medium of claim 43, wherein the image captured by the camera is a monocular image.
  61. The non-transitory computer-readable medium of claim 43, wherein the map of the environment is obtained in accordance with a location of the vehicle detected by one or more sensors.
  62. The non-transitory computer-readable medium of claim 43, further storing instructions for:
    performing a sensor fusion of the determined one or more pose information items of the camera based on corresponding sensory data detected by one or more sensors onboard the vehicle using a Kalman Filter algorithm.
  63. The non-transitory computer-readable medium of claim 43, further storing instructions for:
    generating instructions to operate the vehicle in real time during the movement of the vehicle based on the determined one or more pose information items.
PCT/CN2020/141587 2020-12-30 2020-12-30 Determining vehicle positions for autonomous driving based on monocular vision and semantic map WO2022141240A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/141587 WO2022141240A1 (en) 2020-12-30 2020-12-30 Determining vehicle positions for autonomous driving based on monocular vision and semantic map

Publications (1)

Publication Number Publication Date
WO2022141240A1 true WO2022141240A1 (en) 2022-07-07

Family

ID=82260018

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/141587 WO2022141240A1 (en) 2020-12-30 2020-12-30 Determining vehicle positions for autonomous driving based on monocular vision and semantic map

Country Status (1)

Country Link
WO (1) WO2022141240A1 (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102768042A (en) * 2012-07-11 2012-11-07 清华大学 Visual-inertial combined navigation method
CN103175524A (en) * 2013-02-20 2013-06-26 清华大学 Visual-sense-based aircraft position and attitude determination method under mark-free environment
CN105593776A (en) * 2013-08-01 2016-05-18 日产自动车株式会社 Method for operating a floor-cleaning device and floor-cleaning device
EP2854104A1 (en) * 2013-09-25 2015-04-01 Technische Universität München Semi-dense simultaneous localization and mapping
CN107063189A (en) * 2017-01-19 2017-08-18 上海勤融信息科技有限公司 The alignment system and method for view-based access control model
CN108062776A (en) * 2018-01-03 2018-05-22 百度在线网络技术(北京)有限公司 Camera Attitude Tracking method and apparatus
CN111145248A (en) * 2018-11-06 2020-05-12 北京地平线机器人技术研发有限公司 Pose information determination method and device and electronic equipment
CN111768443A (en) * 2019-07-23 2020-10-13 北京京东尚科信息技术有限公司 Image processing method and device based on mobile camera

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294204A (en) * 2022-10-10 2022-11-04 浙江光珀智能科技有限公司 Outdoor target positioning method and system
CN115880673A (en) * 2023-02-22 2023-03-31 西南石油大学 Obstacle avoidance method and system based on computer vision

Similar Documents

Publication Publication Date Title
CN108572663B (en) Target tracking
EP3759562B1 (en) Camera based localization for autonomous vehicles
US20210063162A1 (en) Systems and methods for vehicle navigation
JP7073315B2 (en) Vehicles, vehicle positioning systems, and vehicle positioning methods
US11354820B2 (en) Image based localization system
US11067693B2 (en) System and method for calibrating a LIDAR and a camera together using semantic segmentation
US10409288B2 (en) Systems and methods for projecting a location of a nearby object into a map according to a camera image
US20220282989A1 (en) Fully aligned junctions
US20200018618A1 (en) Systems and methods for annotating maps to improve sensor calibration
KR102425272B1 (en) Method and system for determining a position relative to a digital map
JP2024045389A (en) Lane mapping and navigation
JP2020115136A (en) Sparse map for autonomous vehicle navigation
WO2021038294A1 (en) Systems and methods for identifying potential communication impediments
US20230386323A1 (en) Updating maps based on traffic object detection
CA3002308A1 (en) Device and method for autonomous localisation
US11680801B2 (en) Navigation based on partially occluded pedestrians
JP2018533721A (en) Method and system for generating and using localization reference data
US20210158546A1 (en) Updated point cloud registration pipeline based on admm algorithm for autonomous vehicles
CN117848356A (en) Techniques for collaborative mapping between unmanned aerial vehicles and ground vehicles
CN112558608A (en) Vehicle-mounted machine cooperative control and path optimization method based on unmanned aerial vehicle assistance
US11688082B2 (en) Coordinate gradient method for point cloud registration for autonomous vehicles
US11474193B2 (en) Camera calibration for localization
WO2022141240A1 (en) Determining vehicle positions for autonomous driving based on monocular vision and semantic map
JP2022027593A (en) Positioning method and device for movable equipment, and movable equipment
EP3885706A1 (en) Real-time rem localization error correction by consumer automated vehicles

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20967581

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967581

Country of ref document: EP

Kind code of ref document: A1