CN117094363A - Vehicle neural network localization - Google Patents

Vehicle neural network localization

Info

Publication number
CN117094363A
CN117094363A
Authority
CN
China
Prior art keywords
vehicle
semantic
image
bird
eye view
Prior art date
Legal status
Pending
Application number
CN202210505339.1A
Other languages
Chinese (zh)
Inventor
M·沃达拉
S·什里瓦斯塔瓦
普纳杰·查克拉瓦蒂
Current Assignee
Ford Global Technologies LLC
Original Assignee
Ford Global Technologies LLC
Priority date
Filing date
Publication date
Application filed by Ford Global Technologies LLC
Priority to CN202210505339.1A
Publication of CN117094363A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons

Abstract

The present disclosure provides "vehicle neural network localization." A plurality of temporally successive vehicle sensor images are received as input to a variational autoencoder neural network that outputs an average semantic bird's-eye view image including respective pixels determined by averaging semantic class values of corresponding pixels in respective ones of the plurality of temporally successive vehicle sensor images. From a plurality of topological nodes each specifying a respective real world location, a topological node closest to the vehicle and a three-degree-of-freedom pose of the vehicle relative to that closest node are determined based on the average semantic bird's-eye view image. A real world three-degree-of-freedom pose of the vehicle is determined by combining the three-degree-of-freedom pose of the vehicle relative to the closest topological node and the real world position of that node.

Description

Vehicle neural network localization
Technical Field
The present disclosure relates to neural networks in vehicles.
Background
The vehicle may be equipped with a computing device, network, sensors, and controller to acquire data regarding the environment of the vehicle and operate the vehicle based on the data. The vehicle sensors may provide data regarding a route to be traveled and objects to be avoided in the environment of the vehicle. While the vehicle is operating on the road, the operation of the vehicle may rely on acquiring accurate and timely data about objects in the environment of the vehicle.
Disclosure of Invention
A system includes a computer including a processor and a memory storing instructions executable by the processor to receive a plurality of temporally successive vehicle sensor images as input to a variational autoencoder neural network that outputs an average semantic bird's-eye view image including respective pixels determined by averaging semantic class values of corresponding pixels in respective ones of the plurality of temporally successive vehicle sensor images. The instructions further include instructions for: from a plurality of topological nodes each specifying a respective real world location, a topological node closest to the vehicle and a three degree of freedom pose of the vehicle relative to the topological node closest to the vehicle are determined based on the average semantic bird's eye view image. The instructions further include instructions for: a real world three-degree-of-freedom pose of the vehicle is determined by combining the three-degree-of-freedom pose of the vehicle relative to the topological node and the real world position of the topological node closest to the vehicle.
The instructions may also include instructions for: the average semantic bird's-eye view image is generated based on rendering semantic point cloud images of an environment surrounding the vehicle into a two-dimensional plane.
The instructions may also include instructions for: the semantic point cloud image is generated based on combining a semantic image including regions marked by region type and a stereoscopic point cloud image including regions marked by region distance relative to the vehicle.
The instructions may also include instructions for: the stereoscopic point cloud image is generated based on a pair of stereoscopic images acquired by sensors in the vehicle.
The instructions may also include instructions for: the semantic image is generated based on a single stereoscopic image acquired by a sensor in the vehicle.
The zone types may include roads, sidewalks, vehicles, buildings, and vegetation.
The instructions may also include instructions for: the topological node is determined by acquiring a point cloud image with a stereo camera and determining the position of the point cloud image in real world coordinates using visual odometry.
The real world three degree of freedom pose of the vehicle may be determined in coordinates based on orthogonal x- and y-axes and a yaw rotation about a z-axis orthogonal to the x- and y-axes.
The instructions may also include instructions for: the variational autoencoder neural network is trained to output the average semantic bird's-eye view image using a plurality of modified semantic bird's-eye view images.
The instructions may also include instructions for: each of the plurality of modified semantic bird's-eye view images is generated based on at least one of translating or rotating the semantic bird's-eye view image.
The variational autoencoder neural network may determine the three degree of freedom pose of the vehicle relative to the topological node closest to the vehicle by outputting latent variables to a fully connected layer.
The variational autoencoder neural network may determine the topology node closest to the vehicle by inputting latent variables of the average semantic bird's-eye view to a nearest neighbor classifier trained to determine the topology node closest to the vehicle.
A method includes receiving a plurality of temporally successive vehicle sensor images as input to a variational autoencoder neural network that outputs an average semantic bird's-eye view image including respective pixels determined by averaging semantic class values of corresponding pixels in respective ones of the plurality of temporally successive vehicle sensor images. The method further includes determining, from a plurality of topological nodes each specifying a respective real world location, a topological node closest to the vehicle and a three degree of freedom pose of the vehicle relative to the topological node closest to the vehicle based on the average semantic bird's eye view image. The method further includes determining a real world three-degree-of-freedom pose of the vehicle by combining the three-degree-of-freedom pose of the vehicle relative to the topological node and the real world position of the topological node closest to the vehicle.
The method may also include generating the average semantic bird's-eye view image based on rendering semantic point cloud images of an environment surrounding the vehicle into a two-dimensional plane.
The method may further include generating the semantic point cloud image based on combining the semantic image including the region marked by the region type and the stereoscopic point cloud image including the region marked by the region distance relative to the vehicle.
The method may also include determining the topological node by acquiring a point cloud image with a stereo camera and determining a location of the point cloud image in real world coordinates using visual odometry.
The real world three degree of freedom pose of the vehicle may be determined in coordinates based on orthogonal x- and y-axes and a yaw rotation about a z-axis orthogonal to the x- and y-axes.
The method may further include training the variational autoencoder neural network to output the average semantic bird's-eye view image using a plurality of modified semantic bird's-eye view images.
The variational autoencoder neural network may determine the three degree of freedom pose of the vehicle relative to the topological node closest to the vehicle by outputting latent variables to a fully connected layer.
The variational autoencoder neural network may determine the topology node closest to the vehicle by inputting latent variables of the average semantic bird's-eye view to a nearest neighbor classifier trained to determine the topology node closest to the vehicle.
Also disclosed herein is a computing device programmed to perform any of the above method steps. Also disclosed herein is a computer program product comprising a computer readable medium storing instructions executable by a computer processor to perform any of the above method steps.
A vehicle computer in the vehicle may be programmed to acquire data regarding the environment surrounding the vehicle and use the data to determine a path over which to operate the vehicle in an autonomous or semi-autonomous mode. The vehicle may operate on the road based on the path by determining commands to direct the powertrain, brake, and steering components of the vehicle to operate the vehicle to travel along the path. The data about the environment may include the location of one or more objects in the environment surrounding the vehicle (such as vehicles and pedestrians, etc.), and may be used by the vehicle computer to operate the vehicle.
Determining the path may include solving a localization problem. Localization includes determining a three degree of freedom (DoF) pose of the vehicle relative to a map of an environment surrounding the vehicle. The three DoF pose includes a position in two orthogonal coordinates (e.g., x and y) and an orientation as one rotation about the axis of a third orthogonal coordinate (e.g., yaw). Locating the vehicle relative to the map and perceiving objects in the environment surrounding the vehicle may permit the vehicle computer to determine a path over which the vehicle may travel to reach a destination on the map while avoiding contact with objects in the environment surrounding the vehicle. The path may be a polynomial function determined to maintain lateral and longitudinal acceleration of the vehicle within upper and lower limits as the vehicle travels on the path.
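As an illustration of the three DoF pose used throughout this description, the following minimal Python sketch represents a pose as two planar coordinates plus a yaw angle; the class and field names are illustrative, not part of the disclosure.

    from dataclasses import dataclass
    import math

    @dataclass
    class Pose3DoF:
        """Three degree-of-freedom pose: position (x, y) and yaw about the z-axis."""
        x: float    # longitudinal position, e.g., meters
        y: float    # lateral position, e.g., meters
        yaw: float  # rotation about the z-axis, radians

        def distance_to(self, other: "Pose3DoF") -> float:
            # Euclidean distance in the x-y plane, ignoring yaw.
            return math.hypot(self.x - other.x, self.y - other.y)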
Solving the localization problem for vehicle routing may begin by recognizing that a vehicle generally travels the same routes repeatedly. The techniques disclosed herein exploit such predictable travel patterns by creating a topological map of repeatedly traveled routes that a vehicle computer can use to solve the localization problem with less expensive equipment and fewer computing resources than would otherwise be required to determine the path of the vehicle. The technology described herein localizes the vehicle in its environment by first determining a topological map of a route that the vehicle is to travel. A route is defined as a path describing the continuous position of a vehicle as it travels from one point on a map (typically on a road) to a second point. A topological map is a map that includes a set of nodes, each containing location and image data that can be used by a vehicle computer to determine the location of the vehicle and the locations of objects in the environment surrounding the vehicle. Each node includes three DoF data for a location along the route and an average semantic bird's eye view image for that location. The three DoF data and the average semantic bird's eye view image are used to train a neural network to input temporally successive images acquired by sensors included in the vehicle and to output data identifying the node of the topological map closest to the vehicle and a three DoF pose of the vehicle relative to the topological map. The techniques disclosed herein improve localization by determining the three DoF pose of the vehicle based on the average semantic bird's eye view image, which can be done regardless of weather and lighting conditions surrounding the vehicle.
Drawings
Fig. 1 is a diagram of an exemplary communication infrastructure system.
Fig. 2 is a diagram of an exemplary illustration of a topological map.
Fig. 3 is an illustration of an exemplary stereoscopic image.
Fig. 4 is a diagram of an exemplary point cloud image.
Fig. 5A is an illustration of an exemplary semantic bird's eye view image.
Fig. 5B is an illustration of an exemplary average semantic bird's-eye view image generated from the semantic bird's-eye view image of fig. 5A.
FIG. 6 is an exemplary node system that generates topology nodes.
Fig. 7 is an example of a topology convolutional neural network.
FIG. 8 is a flow chart of an exemplary process for operating a vehicle based on three degrees of freedom positioning.
Detailed Description
Referring to fig. 1-3, an exemplary control system 100 includes a vehicle 105. The first computer 110 in the vehicle 105 receives data from the sensors 115. The first computer 110 is programmed to receive a plurality of temporally successive sensor 115 images as input to a variational autoencoder neural network that outputs an average semantic bird's-eye view image comprising respective pixels determined by averaging semantic class values of corresponding pixels in respective ones of the plurality of temporally successive vehicle sensor images. The first computer 110 is further programmed to determine, from a plurality of topological nodes each specifying a respective real world location, a topological node closest to the vehicle 105 and a three degree of freedom pose of the vehicle 105 relative to the topological node closest to the vehicle 105 based on the average semantic bird's eye view image. The first computer 110 is further programmed to determine a real world three-degree-of-freedom pose of the vehicle 105 by combining the three-degree-of-freedom pose of the vehicle 105 relative to the topological node and a real world position of the topological node closest to the vehicle 105. The first computer 110 may then generate a path for the vehicle 105 based on the real world three degree of freedom pose and operate the vehicle 105 along the path.
Turning now to fig. 1, a vehicle 105 includes a first computer 110, vehicle sensors 115, actuators 120 for actuating various vehicle components 125, and a vehicle communication module 130. The communication module 130 allows the first computer 110 to communicate with the remote server computer 140 and/or other vehicles, for example, via messaging or broadcast protocols (such as Dedicated Short Range Communications (DSRC), cellular, and/or other protocols that may support vehicle-to-vehicle, vehicle-to-infrastructure, vehicle-to-cloud communications, etc.), and/or via the packet network 135.
The first computer 110 includes a processor and memory such as are known. The memory includes one or more forms of computer-readable media and stores instructions executable by the first computer 110 for performing various operations, including operations as disclosed herein. The first computer 110 may also include two or more computing devices that cooperate to perform operations of the vehicle 105, including operations as described herein. Further, the first computer 110 may be a general purpose computer having a processor and memory as described above and/or may include dedicated electronic circuitry including an ASIC manufactured for specific operations, such as an ASIC for processing sensor data and/or transmitting sensor data. In another example, the first computer 110 may include an FPGA (field programmable gate array), which is an integrated circuit manufactured to be configurable by a user. Typically, digital and mixed signal systems such as FPGAs and ASICs are described using hardware description languages such as VHDL (very high speed integrated circuit hardware description language) in electronic design automation. For example, ASICs are manufactured based on VHDL programming provided prior to manufacture, while logic components within FPGAs may be configured based on VHDL programming stored, for example, in a memory electrically connected to FPGA circuitry. In some examples, a combination of processors, ASICs, and/or FPGA circuitry may be included in the first computer 110.
The first computer 110 may operate the vehicle 105 in an autonomous mode, a semi-autonomous mode, or a non-autonomous (manual) mode. For purposes of this disclosure, an autonomous mode is defined as a mode in which each of vehicle 105 propulsion, braking, and steering is controlled by the first computer 110; in a semi-autonomous mode, the first computer 110 controls one or two of propulsion, braking, and steering of the vehicle 105; in a non-autonomous mode, a human operator controls each of vehicle 105 propulsion, braking, and steering.
The first computer 110 may include programming to operate one or more of vehicle 105 braking, propulsion (e.g., controlling acceleration of the vehicle 105 by controlling one or more of an internal combustion engine, an electric motor, a hybrid engine, etc.), steering, shifting, climate control, interior and/or exterior lights, horn, doors, etc., and to determine whether and when the first computer 110 (rather than a human operator) is to control such operations.
The first computer 110 may include or be communicatively coupled to more than one processor, such as included in an Electronic Controller Unit (ECU) or the like (e.g., transmission controller, brake controller, steering controller, etc.) included in the vehicle 105 for monitoring and/or controlling various vehicle components 125, for example, via a vehicle communication network, such as a communication bus as described further below. The first computer 110 is typically arranged for communication over a vehicle communication network, which may include a bus in the vehicle 105, such as a Controller Area Network (CAN) or the like, and/or other wired and/or wireless mechanisms.
Via the vehicle 105 network, the first computer 110 may transmit and/or receive messages (e.g., CAN messages) to and/or from various devices in the vehicle 105 (e.g., sensors 115, actuators 120, ECU, etc.). Alternatively or additionally, where the first computer 110 actually includes a plurality of devices, a vehicle communication network may be used for communication between the devices represented in this disclosure as the first computer 110. Further, as mentioned below, various controllers and/or sensors 115 may provide data to the first computer 110 via the vehicle communication network.
The vehicle 105 sensors 115 may include a variety of devices such as are known for providing data to the first computer 110. For example, the sensors 115 may include light detection and ranging (lidar) sensors 115 or the like disposed on top of the vehicle 105, behind a front windshield of the vehicle 105, around the vehicle 105, etc., that provide the relative position, size, and shape of objects around the vehicle 105. As another example, one or more radar sensors 115 secured to the bumper of the vehicle 105 may provide data to provide the location of objects, other vehicles, etc. relative to the location of the vehicle 105. Alternatively or additionally, the sensor 115 may also include, for example, a camera sensor 115 (e.g., front view, side view, etc.) that provides an image from an area surrounding the vehicle 105. In the context of the present disclosure, an object is a physical (i.e., substance) item that has a mass and that can be represented by a physical phenomenon (e.g., light or other electromagnetic waves or sound, etc.) that can be detected by the sensor 115. Thus, the vehicle 105, as well as other items discussed herein, fall within the definition of "object" herein.
The first computer 110 is programmed to receive data from the one or more sensors 115 substantially continuously, periodically, and/or upon direction from the remote server computer 140, etc. The data may include, for example, a location of the vehicle 105. The location data specifies one or more points on the ground and may be of known form, such as geographic coordinates, such as latitude and longitude coordinates, obtained via a navigation system using a Global Positioning System (GPS) as is known. Additionally or alternatively, the data may include a location of an object (e.g., vehicle, sign, tree, etc.) relative to the vehicle 105. As one example, the data may be image data of an environment surrounding the vehicle 105. In such examples, the image data may include one or more objects (e.g., vehicles, trees, buildings, etc.) and/or signs (e.g., lane markers) on or along the road on which the vehicle 105 is currently operating. Image data herein means digital image data that may be acquired by the camera sensor 115, including, for example, pixels having intensity values and color values. The sensor 115 may be mounted to any suitable location in or on the vehicle 105, for example, on a bumper of the vehicle 105, on a roof of the vehicle 105, etc., to collect an image of the environment surrounding the vehicle 105.
The vehicle 105 actuators 120 are implemented via circuits, chips, or other electronic and/or mechanical components that may actuate various vehicle subsystems according to appropriate control signals as is known. The actuators 120 may be used to control components 125 including braking, acceleration, and steering of the vehicle 105.
In the context of the present disclosure, the vehicle component 125 is one or more hardware components adapted to perform mechanical or electromechanical functions or operations, such as moving the vehicle 105, decelerating or stopping the vehicle 105, steering the vehicle 105, and the like. Non-limiting examples of components 125 include propulsion components (which include, for example, an internal combustion engine and/or an electric motor, etc.), transmission components, steering components (which may include, for example, one or more of a steering wheel, a steering rack, etc.), suspension components 125 (which may include, for example, one or more of a damper (e.g., a shock absorber or a strut), bushings, springs, control arms, ball joints, links, etc.), braking components, parking assist components, adaptive cruise control components, adaptive steering components, one or more passive restraint systems (e.g., airbags), movable seats, etc.
In addition, the first computer 110 may be configured to communicate with devices external to the vehicle 105 via the vehicle-to-vehicle communication module 130 or interface, such as by vehicle-to-vehicle (V2V) or vehicle-to-infrastructure (V2X) wireless communication (cellular and/or DSRC, etc.) with another vehicle and/or remote server computer 140 (typically via direct radio frequency communication). The communication module 130 may include one or more mechanisms by which a computer of the vehicle may communicate, such as a transceiver, including any desired combination of wireless (e.g., cellular, satellite, microwave, and radio frequency) communication mechanisms, as well as any desired network topology (or topologies when multiple communication mechanisms are utilized). Exemplary communications provided via the communications module 130 include cellular, Bluetooth, IEEE 802.11, Dedicated Short Range Communications (DSRC), and/or Wide Area Networks (WAN) including the Internet, which provide data communications services.
Network 135 represents one or more mechanisms by which first computer 110 may communicate with a remote computing device (e.g., remote server computer 140, another vehicle computer, etc.). Thus, the network 135 may be one or more of a variety of wired or wireless communication mechanisms, including any desired combination of wired (e.g., cable and fiber) and/or wireless (e.g., cellular, satellite, microwave, and radio frequency) communication mechanisms, as well as any desired network topology (or topologies where multiple communication mechanisms are utilized). Exemplary communication networks include wireless communication networks providing data communication services (e.g., using Bluetooth®, Bluetooth® Low Energy (BLE), IEEE 802.11, vehicle-to-vehicle (V2V) such as Dedicated Short Range Communications (DSRC), etc.), Local Area Networks (LANs), and/or Wide Area Networks (WANs) including the Internet.
The remote server computer 140 may be a conventional computing device programmed to provide operations such as those disclosed herein, i.e., including one or more processors and one or more memories. Further, the remote server computer 140 may be accessed via a network 135 (e.g., the Internet, a cellular network, and/or some other wide area network).
The control system 100 may include a mapping vehicle 145. The mapping vehicle 145 can include a second (i.e., mapping vehicle) computer 150. The second computer 150 includes a second processor and a second memory, such as are known. The second memory includes one or more forms of computer-readable media and stores instructions executable by the second computer 150 for performing various operations, including operations as disclosed herein.
Additionally, the mapping vehicle 145 may include sensors, actuators for actuating various vehicle components, and a vehicle communication module. The sensors, actuators for actuating the various vehicle components, and the vehicle communication module generally have features in common with the sensors 115, actuators 120 for actuating the various host vehicle components 125, and the vehicle communication module 130, and thus will not be described further to avoid redundancy.
Fig. 2 is an illustration of a topological map 200. The topological map 200 is a map comprising a set of nodes 202, each node comprising real world coordinate data about the location of the node 202 and an average semantic bird's eye view image 504 of the node 202 derived from stereoscopic video data (as discussed below with respect to fig. 5B). The topological map is generated by processing stereoscopic video data of the route to form a plurality of nodes 202, as discussed below. For example, the topological map 200 may be constructed by determining topology nodes 202 along routes or roads 204, 206 using visual odometry. The terms road and route are used interchangeably herein. A topological map 200 as in fig. 2 may be formed by adding nodes 202 (shown as circles on the roads or routes 204, 206) to a street map. The topological map 200 may be stored in a memory of, for example, a remote server computer 140 and provided to the vehicles 105, 145, for example, via the network 135.
Each node 202 is located on a road 204, 206 along which the vehicle 105, 145 may travel. The nodes 202 are positioned such that the distance between adjacent nodes 202 is one meter to 10 meters. Spacing the nodes 202 in this manner permits the position of the vehicles 105, 145 to be determined to within a few centimeters (e.g., one centimeter to 25 centimeters) in the x- and y-directions (i.e., lateral and longitudinal directions) relative to the roads 204, 206 while maintaining a limit on the amount of data required to represent the roads 204, 206. Roads 204, 206 are mapped using a mapping vehicle 145 equipped with stereoscopic cameras to obtain stereoscopic video data for each node 202 along roads 204, 206. The mapping vehicle 145 may then generate a semantic point cloud image 402 for each node 202 based on the stereoscopic video data, as discussed below. Additionally, the mapping vehicle 145 may generate an average semantic bird's eye view image 504 with corresponding feature points for each node 202 in the topological map 200 using latent variables from a neural network, as discussed below. Alternatively, the mapping vehicle 145 may provide stereoscopic video data to the remote server computer 140, which may generate the semantic point cloud image 402 and the average semantic bird's eye view image 504 of the corresponding node 202.
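A sketch of how node 202 data might be represented and accumulated under the one meter to 10 meter spacing described above; all names and types here are illustrative assumptions, not part of the disclosure.

    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class TopologicalNode:
        node_id: int
        pose: tuple          # real world three DoF pose of the node: (x, y, yaw)
        avg_bev: np.ndarray  # average semantic bird's-eye view image for the node

    class TopologicalMap:
        def __init__(self, min_spacing_m: float = 1.0, max_spacing_m: float = 10.0):
            self.nodes = []
            self.min_spacing_m = min_spacing_m
            self.max_spacing_m = max_spacing_m

        def add_node(self, pose, avg_bev) -> None:
            # Only add a node once the mapping vehicle has traveled at least the
            # minimum spacing from the previous node.
            if self.nodes:
                px, py, _ = self.nodes[-1].pose
                if np.hypot(pose[0] - px, pose[1] - py) < self.min_spacing_m:
                    return
            self.nodes.append(TopologicalNode(len(self.nodes), pose, avg_bev))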
In this context, a "semantic point cloud image" is a point cloud image that includes tags that identify regions within the image that correspond to objects. In this context, a "point cloud image" is point cloud data that includes a distance or range from a point in the image. In other words, a semantic point cloud is a point cloud image in which point cloud data corresponding to distances is also tagged with semantic class values to identify the type of object or region. Areas so marked may include roads, sidewalks, vehicles, pedestrians, buildings, vegetation, and the like. A semantic class value is an integer corresponding to a type of object or region.
Fig. 3 is an illustration of a pair of stereoscopic images 302, 304. The stereo images 302, 304 may be acquired by the stereo camera sensor 115, wherein both cameras are arranged to view the same scene at lateral intervals. The lateral spacing (also referred to as the baseline) causes the camera to generate images, where the corresponding points in each image will be shifted relative to the image by an amount that is a function of the lateral spacing of the camera and the distance of the points in space from the camera. Because the lateral spacing of the cameras can be precisely determined, a direct geometric transformation can result in a distance from a point in the image, for example, as described further below.
Fig. 4 is an illustration of a semantic point cloud image 402 generated from an average stereoscopic point cloud image and a semantic image. In this context, a "semantic image" is an image labeled with semantic class values that identify the type of object or region within the image. That is, each pixel in the image is labeled with a semantic class value corresponding to the type of object or region detected in the pixel, as discussed further below. In this context, an "average stereoscopic point cloud image" is a stereoscopic point cloud image that includes pixels corresponding to an average distance from points corresponding to respective pixels in each of the plurality of stereoscopic point cloud images to the stereoscopic camera sensor 115.
An average stereoscopic point cloud image is generated from a plurality of temporally successive stereoscopic image pairs (i.e., stereoscopic video data). Although the pixel values in a pair of stereo images 302, 304 correspond to the amount of light received by the stereo camera sensor 115, in a stereo point cloud image the pixel values correspond to the distance from the point corresponding to the pixel to the stereo camera sensor 115. In other words, the stereoscopic point cloud image includes an area marked by an area distance relative to the stereoscopic camera sensor 115.
A stereoscopic point cloud image may be constructed from a pair of stereoscopic images 302, 304 based on stereo disparity. Stereo disparity is defined as the difference in the positions of corresponding feature points in a pair of stereo images 302, 304. Corresponding feature points are defined as locations in the pair of stereoscopic images 302, 304 that share similar pixel values, including the areas around the locations. For example, corners, edges, and textures in a pair of stereo images 302, 304 may be corresponding feature points. The feature points may be determined by known machine vision techniques that process regions in the image to find pixel locations defined by patterns of abrupt changes in pixel values (e.g., edges, corners, and texture). The pattern of pixel values around the feature points may be compared between the pair of stereoscopic images to identify corresponding feature points that appear in both images. The difference in position of corresponding feature points relative to the image pixel array is the measured stereo disparity. Feature point detection may use machine vision techniques such as, for example, Speeded Up Robust Features (SURF).
Once the corresponding feature points in the pair of stereo images 302, 304 are identified by determining a similar arrangement of pixel values, the distance from the stereo camera sensor 115 to the feature points may be determined. Because the distance between the two cameras is determined by the fixed mount to which the cameras are attached, a baseline is established that permits the distance from the camera to the corresponding feature point to be determined by triangulation. For a feature point P = (xₚ, yₚ, zₚ) in the overlapping fields of view of the pair of stereoscopic image sensors, appearing at image positions P(u₁, v₁) in the first stereo image 302 and P(u₂, v₂) in the second stereo image 304, the equations for determining the distance from the stereo camera sensor 115 based on the stereo disparity are given by:

d = u₁ - u₂ (1)

zₚ = (f · b) / d (2)

where d is the stereo disparity, i.e., the difference u₁ - u₂ between the feature coordinate positions in the x-direction, b is the baseline between the centers of the two cameras, and f is the common focal length of the two cameras. The distances from the plurality of corresponding feature points determined in this manner may be combined into a stereoscopic point cloud image. A stereoscopic point cloud image may be generated for each pair of stereoscopic images 302, 304. The stereoscopic point cloud images may then be combined, for example, by averaging the distances from each of a plurality of corresponding feature points in the respective stereoscopic point cloud images, to generate an average stereoscopic point cloud image.
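A short numerical sketch of equations (1) and (2), assuming rectified stereo images and already-matched feature points; the function and parameter names are illustrative.

    import numpy as np

    def depth_from_disparity(u_left, u_right, focal_length_px, baseline_m):
        """Compute depth z = f * b / d for matched feature points.

        u_left, u_right: x-coordinates of corresponding feature points in the
        first and second stereo images (pixels).
        focal_length_px: common focal length f of the two cameras (pixels).
        baseline_m: baseline b between the camera centers (meters).
        """
        d = np.asarray(u_left, dtype=float) - np.asarray(u_right, dtype=float)  # equation (1)
        d = np.where(np.abs(d) < 1e-6, np.nan, d)  # guard against zero disparity
        return focal_length_px * baseline_m / d    # equation (2)

    def average_point_clouds(distance_stack):
        """Average distances of corresponding feature points across several
        temporally successive point clouds (shape: num_frames x num_points)."""
        return np.nanmean(np.asarray(distance_stack, dtype=float), axis=0)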
The average stereoscopic point cloud image may also be determined by training a convolutional neural network (CNN) to determine an average stereoscopic point cloud image from a plurality of temporally successive stereoscopic image pairs 302, 304. The convolutional neural network includes a plurality of convolutional layers followed by a plurality of fully connected layers. The convolutional layers may determine feature points that pass as latent variables to the fully connected layers, which calculate the equivalent of equations (1) and (2). The CNN may be trained to determine an average stereoscopic point cloud image from a plurality of temporally successive stereoscopic image pairs 302, 304 using a training dataset comprising stereoscopic image pairs 302, 304 and ground truth point cloud images that have been determined based on equations (1) and (2) using feature points and geometric processing. Ground truth is data corresponding to the correct results output from the CNN, i.e., data that correctly represents the real world state, where the ground truth data is obtained from a source that is independent of the CNN. During training, the ground truth is compared with the results output from the CNN to determine when the CNN outputs correct results. For example, the ground truth for point cloud data may be determined by manually selecting corresponding feature points in a pair of stereo images and manually calculating distances based on the measured baseline and camera focal length to form ground truth point cloud data.
In addition to distance, pixel values of the semantic point cloud image 402 may also correspond to regions from the semantic image. That is, in the semantic point cloud image 402, objects corresponding to roads, vehicles, trees, and buildings adjacent to the roads have been marked to identify pixel areas in the semantic point cloud image 402 that correspond to the marked objects in the semantic image. The semantic image includes regions marked by region type, such as vehicles, roads, buildings, plants, and the like. The semantic image is generated from one of the stereo image pairs 302, 304. For example, one of the RGB images included in the stereo image pair 302, 304 may be input to a Convolutional Neural Network (CNN) that has been trained to segment the image. Image segmentation is a machine vision technique that marks objects in image data. That is, the CNN may be programmed to segment and classify objects based on the connected regions of pixels in the RGB image data.
The connected regions may be classified by tagging each connected region with one of a number of different semantic class values corresponding to the object. As set forth above, each semantic class value is an integer corresponding to a type of object or region. The semantic class values may be selected by the CNN based on the size, shape, and location of objects in the RGB image. For example, CNNs may include different semantic class values, such as different makes and models of vehicles, different types of terrain (e.g., grass, mud, gravel, etc.), different types of vegetation (e.g., trees, shrubs, bushes, etc.), and so forth. The CNN may mark objects in the input image, and the labels may then be combined with the point cloud image, as discussed below.
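A sketch of how integer semantic class values from a segmentation mask might be attached to per-pixel distances to form a semantic point cloud image; the class-value assignments and array shapes are assumptions for illustration only.

    import numpy as np

    # Illustrative semantic class values; the actual integer assignments are
    # implementation-specific.
    SEMANTIC_CLASSES = {0: "road", 1: "sidewalk", 2: "vehicle", 3: "building", 4: "vegetation"}

    def make_semantic_point_cloud(depth_image, segmentation_mask):
        """Combine a per-pixel depth image with a per-pixel semantic mask.

        depth_image:       (H, W) array of distances to the stereo camera sensor.
        segmentation_mask: (H, W) array of integer semantic class values.
        Returns an (H, W, 2) array: channel 0 = distance, channel 1 = class value.
        """
        assert depth_image.shape == segmentation_mask.shape
        return np.stack([depth_image.astype(float),
                         segmentation_mask.astype(float)], axis=-1)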
The CNN may be trained to mark regions in the RGB image data by first constructing a training data set, where the RGB image is manually marked by a human using image processing software to draw a boundary around the object and fill the boundary with pixel values corresponding to the object. The manually marked RGB image is a ground truth to be compared with the output of the CNN. The dataset may comprise more than 1000 RGB images with corresponding ground truth. The CNN is performed a plurality of times with the same RGB image as an input while changing a parameter set that manages operations of the convolution layer and the full connection layer included in the CNN. The parameter sets are ranked according to how similar the output is to the corresponding ground truth. The highest scoring parameter set on the training data set is retained as the parameter set to be used in operating the trained CNN.
Fig. 5A is an illustration of a semantic bird's eye view image 502 generated from an exemplary semantic point cloud image. The semantic bird's eye view image 502 is a two-dimensional (2D) image generated by rendering a semantic point cloud image. Rendering may generate a semantic bird's eye view image of the semantic point cloud image by determining a virtual camera perspective that projects the semantic point cloud image onto a 2D plane.
The virtual camera may be provided by programming the computer 110, 140, 150 to generate a 2D semantic bird's eye view image from the semantic point cloud image. The computer 110, 140, 150 may generate virtual light rays that travel from the virtual image sensor through the virtual lens, thereby following the laws of physics as if the image sensor and lens were physical objects. The computer 110, 140, 150 inserts data into a virtual image sensor that corresponds to the appearance of the portion of the semantic point cloud image that light rays that are emitted by the feature points of the semantic point cloud image and travel through the physical lens will produce on the physical image sensor. By positioning the virtual camera at a selected position and orientation relative to the semantic point cloud image, a 2D semantic bird's-eye view image corresponding to a selected perspective relative to the vehicle 105, 145 may be generated.
The virtual camera view angle includes position and orientation data of an optical axis of the virtual camera and data about magnification of a lens of the virtual camera. The virtual camera perspective is determined based on the position and orientation of the virtual camera relative to the vehicles 105, 145. The position of the virtual camera is selected to be above the vehicle 105, 145 and on the y-axis of the semantic point cloud image. In addition, the orientation of the virtual camera corresponds to the orientation of the vehicle 105, 145. That is, the perspective of the virtual camera is a top view of the environment included in the semantic point cloud image. Projecting the semantic point cloud image onto the 2D plane corresponds to determining which feature points of the semantic point cloud image are visible to a camera that acquired an image of the semantic point cloud image from the selected location and orientation. Because the semantic bird's-eye view image 502 is generated from the semantic point cloud image based on the virtual camera at the selected location and orientation, data about the location and orientation of the feature points shown in the semantic bird's-eye view image 502 is known.
Alternatively, the semantic bird's eye view image 502 may be constructed from the semantic point cloud image based on coordinates of feature points in the semantic point cloud image. In particular, the computer 110, 140, 150 may draw the x-coordinate and z-coordinate of each feature in the semantic point cloud image into a 2D plane. In this case, the x-coordinate may be plotted along a horizontal axis (e.g., substantially parallel to the vehicle transverse axis) and the z-coordinate may be plotted along a vertical axis (e.g., substantially parallel to the vehicle longitudinal axis). The semantic bird's eye view image 502 includes a field of view within which all feature points of the semantic point cloud image are drawn. The field of view is defined by a sensor 115 (e.g., a stereo camera).
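A minimal sketch of the coordinate-plotting alternative just described, assuming labeled feature points in camera-frame coordinates (x lateral, y up, z forward); the grid size and resolution are illustrative.

    import numpy as np

    def semantic_birds_eye_view(points_xyz, class_values, grid_m=40.0, cell_m=0.2):
        """Project labeled 3D feature points onto a 2D bird's-eye view grid.

        points_xyz:   (N, 3) array of feature point coordinates.
        class_values: (N,) array of integer semantic class values.
        Returns a (cells, cells) image whose pixels hold semantic class values.
        """
        cells = int(grid_m / cell_m)
        bev = np.zeros((cells, cells), dtype=np.int32)
        # x is drawn along the horizontal axis, z along the vertical axis.
        cols = np.clip(((points_xyz[:, 0] + grid_m / 2) / cell_m).astype(int), 0, cells - 1)
        rows = np.clip(cells - 1 - (points_xyz[:, 2] / cell_m).astype(int), 0, cells - 1)
        bev[rows, cols] = class_values
        return bev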
Fig. 5B is an illustration of an average semantic bird's-eye view image 504 generated from the semantic bird's-eye view image 502. The "average semantic bird's-eye view image" is a semantic bird's-eye view image including pixels corresponding to an average distance from points corresponding to respective pixels in each of the semantic bird's-eye view image and the plurality of modified semantic bird's-eye view images to the stereo camera sensor 115 and average semantic class values corresponding to respective pixels in each of the semantic bird's-eye view image and the plurality of modified semantic bird's-eye view images. That is, corresponding pixel values in the semantic bird's-eye view image and the plurality of modified semantic bird's-eye view images are averaged to generate an average semantic bird's-eye view image.
An average semantic bird's-eye view image 504 is generated from the plurality of modified semantic bird's-eye view images. The computer 110, 140, 150 may generate a plurality of modified semantic bird's-eye view images by transforming the semantic bird's-eye view image 502 of the node 202. For example, the computer 110, 140, 150 may translate the field of view of the semantic bird's-eye view image 502 relative to the feature points, e.g., along at least one of the x-axis or the z-axis, such that some of the feature points are outside the field of view of the modified semantic bird's-eye view image. Additionally or alternatively, the computer 110, 140, 150 may rotate the field of view of the semantic bird's-eye view image 502 relative to the feature points, e.g., about the y-axis, such that some of the feature points are outside the field of view of the modified semantic bird's-eye view image.
In other words, the computer 110, 140, 150 may update the position (e.g., by translating a predetermined amount along the x-axis) and/or orientation (e.g., by rotating a predetermined amount about the y-axis) of the virtual camera relative to the semantic point cloud image. After updating the position and/or orientation of the virtual camera, the computer 110, 140, 150 may obtain a modified semantic bird's eye view. The computers 110, 140, 150 may generate any suitable number of modified semantic bird's eye view images.
The modified semantic bird's-eye view image is then combined with the semantic bird's-eye view image 502 of node 202 to construct an average semantic bird's-eye view image 504 of node 202. Specifically, the computers 110, 140, 150 determine respective pixels in the average semantic bird's-eye view image 504 by averaging semantic class values and distances of the respective pixels in the semantic bird's-eye view image 502 and the respective modified semantic bird's-eye view image. The computer 110, 140, 150 may then include the average semantic bird's eye view image 504 with the node 202 data.
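One way the averaging of a semantic bird's-eye view image with translated and rotated copies might be sketched, using SciPy image transforms; the shift and rotation amounts are illustrative, and a distance channel could be averaged the same way.

    import numpy as np
    from scipy import ndimage

    def average_semantic_bev(bev, shifts_px=((0, 5), (0, -5), (5, 0)), angles_deg=(5.0, -5.0)):
        """Average a semantic BEV image with translated and rotated copies of itself."""
        modified = [bev.astype(float)]
        for dy, dx in shifts_px:
            # Translate the field of view relative to the feature points.
            modified.append(ndimage.shift(bev.astype(float), (dy, dx), order=0, cval=0.0))
        for angle in angles_deg:
            # Rotate the field of view relative to the feature points.
            modified.append(ndimage.rotate(bev.astype(float), angle, reshape=False, order=0, cval=0.0))
        # Corresponding pixel values are averaged across the original and modified images.
        return np.mean(np.stack(modified, axis=0), axis=0)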
Fig. 6 is an illustration of a node system 600 that generates node 202 data from a STEREO image pair (STEREO) 602 acquired as a mapping vehicle 145 equipped with STEREO video sensors travels along roads 204, 206 to be mapped. The node system 600 may be implemented as software operating on the second computer 150. In this case, the second computer 150 may include the node 202 data in the topology map 200 and provide the topology map 200 to the remote server computer 140, for example, via the network 135. As another example, node system 600 may be implemented as software operating on remote server computer 140. In this case, the remote server computer 140 may generate the node 202 data and include the node 202 data in the topology map 200. The remote server computer 140 may provide the topology map 200 to the vehicles 105, 145, for example, via the network 135.
When the mapping vehicle 145 has traveled along the roads 204, 206 a specified distance from the previous node 202 (e.g., the specified distance may be one meter to 10 meters), the second computer 150 or the remote server computer 140 may create a new node 202 and add it to the topological map 200. Each node 202 in the topological map 200 includes an average semantic bird's-eye view image 504 and a three DoF pose corresponding to the position of the node 202 on the topological map 200.
As the mapping vehicle 145 travels along the routes 204, 206, the mapping vehicle 145 acquires stereoscopic image pairs 602, i.e., stereoscopic video data, via a stereoscopic video sensor. The second computer 150 may then input the stereoscopic image pairs 602 into the node system 600. The stereo image pairs 602 are processed by a Point Cloud Processor (PCP) 604 to form an average stereo point cloud image by determining the three-dimensional positions of corresponding feature points based on the stereo disparity between the images of each stereo image pair 602. The PCP 604 may be a CNN as discussed above with respect to fig. 4.
Additionally, the stereo image pairs 602 are passed to an image segmentation processor (SIS) 606. The SIS 606 is a CNN trained to mark regions in RGB image data, as discussed above with respect to fig. 4; it processes the RGB images one at a time and segments one of the RGB images in each stereoscopic image pair 602 to generate a semantic image.
The semantic image is passed to a point cloud labeling processor (PCL) 610, where the average stereo point cloud image from the PCP 604 is combined with the semantic image generated from the same stereo image pairs 602 to form a semantic point cloud image 612. For example, in fig. 4, roads 404, vehicles 406, 408, buildings 410, 412, 414, and plants 416, 418 have been marked, making the stereoscopic point cloud image a semantic point cloud image 402.
The semantic point cloud image 612 is input into a bird's eye view processor (BEV) 614, where a semantic bird's eye view image 616 is generated from the semantic point cloud image 612. For example, the semantic point cloud image 612 may be rendered to generate a semantic bird's eye view image 616 in the 2D plane by determining the position and orientation of the virtual camera, as discussed above. As another example, the x-coordinates and z-coordinates of feature points in the semantic point cloud image 612 may be drawn to generate a semantic bird's eye view image 616, as discussed above.
The semantic bird's-eye view image 616 is input into a bird's-eye view modification processor (BEVM) 618, where an average semantic bird's-eye view image 620 is generated from the semantic bird's-eye view image 616. For example, BEVM 618 may generate a plurality of modified semantic bird's-eye view images from the semantic bird's-eye view image 616, e.g., by translating and/or rotating the semantic bird's-eye view image 616 of node 202, as discussed above with respect to fig. 5B. The modified semantic bird's-eye view images are then combined with the semantic bird's-eye view image 616 of node 202 to construct an average semantic bird's-eye view image 620 of the node, as discussed above with respect to fig. 5B.
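Tying the stages of fig. 6 together, the following sketch shows the overall node-generation flow; every function passed in is an illustrative stand-in for the PCP 604, SIS 606, PCL 610, BEV 614, BEVM 618, and VO 608 blocks rather than an actual implementation.

    def build_node(stereo_pairs, node_id,
                   point_cloud_processor,    # PCP 604: stereo pairs -> average point cloud
                   segmenter,                # SIS 606: RGB image -> semantic image
                   label_point_cloud,        # PCL 610: point cloud + semantic image -> semantic point cloud
                   to_birds_eye_view,        # BEV 614: semantic point cloud -> semantic BEV
                   average_birds_eye_view,   # BEVM 618: semantic BEV -> average semantic BEV
                   visual_odometry):         # VO 608: stereo pairs -> three DoF pose
        avg_cloud = point_cloud_processor(stereo_pairs)
        semantic_img = segmenter(stereo_pairs[-1][0])   # segment one RGB image of a pair
        semantic_cloud = label_point_cloud(avg_cloud, semantic_img)
        bev = to_birds_eye_view(semantic_cloud)
        avg_bev = average_birds_eye_view(bev)
        pose = visual_odometry(stereo_pairs)            # three DoF pose 622 in global coordinates
        return {"id": node_id, "pose": pose, "avg_bev": avg_bev}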
Additionally, the plurality of stereoscopic images 602 are input to a visual odometry processor (VO) 608. Stereo visual odometry is a technique for determining a three DoF (3 DoF) pose 622 of the mapping vehicle 145 based on determining the change in the positions of feature points extracted from images as the mapping vehicle 145 moves through a scene. Visual odometry may be performed by a trained variational autoencoder (VAE). A VAE is a neural network that includes an encoder, a decoder, and a loss function. The VAE may be trained to input image data, encode the image data to form latent variables corresponding to an encoded representation of the input image data, and decode the latent variables to output an image comprising portions of the input image data modified in a deterministic manner. The VAE may be trained by determining a loss function that measures the accuracy with which the VAE encodes and decodes image data. Once the VAE is trained, the encoder portion or "head" may be removed from the VAE and used to form latent variables corresponding to the input image. The latent variables formed by the encoder may be processed by a decoder section to derive additional types of data (e.g., three DoF data describing the pose of the camera that acquired the input image), as discussed below.
Visual odometry is a known technique for determining three DoF data from a series of successive images. Visual odometry can be performed by training the VAE to input paired stereo images and output three DoF data. The VAE determines corresponding feature points in successive images and calculates the change in sensor position between the images. The three DoF pose of the camera may be determined by triangulating two or more sets of feature points to determine the translation and rotation that establish the reference frame of the sensor in global coordinates. The VAE may be trained by determining ground truth using an inertial measurement unit (IMU) and a real-time kinematic enhanced global positioning system (GPS-RTK).
The VAE includes an encoder, a decoder, and a loss function. The encoder inputs image data and encodes the input image data as latent variables. The latent variables are then decoded to form a three DoF pose of the mapping vehicle 145 based on the input image data. The loss function is used to train the encoder and decoder by comparing the three DoF pose 622 determined for an input image against ground truth data to determine whether the three DoF pose 622 is a valid pose of the vehicle on the road. The visual odometry processor 608 determines a three DoF pose 622 based on the plurality of stereo image pairs 602 acquired as the mapping vehicle 145 travels along the route to be topologically mapped. The three DoF pose 622 locates the mapping vehicle 145 relative to global coordinates. The computers 140, 150 may then include the three DoF pose 622 with the node 202 data.
Fig. 7 is an illustration of a topology CNN 700. The topology CNN 700 is one type of VAE. The topology CNN 700 is a neural network that may be trained to input a plurality of temporally successive images 702 (e.g., stereoscopic video data) and output an average semantic bird's eye view image 710. The VAE includes an encoder (EN) 704 with convolutional layers that encode the input images 702 into latent variables (LATs) 706, and a decoder (DEC) 708 that decodes the latent variables 706 into an average semantic bird's-eye view image 710 using fully connected and convolutional layers. The VAE may be trained using average semantic bird's eye view images manually marked as ground truth by a human operator. For example, the VAE may be trained using average semantic bird's-eye view images labeled for a node and for neighboring nodes. The ground truth may be compared to the output from the VAE to train the VAE to correctly label the average semantic bird's eye view image of the node.
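For concreteness, a minimal PyTorch-style sketch of such a VAE with illustrative layer sizes; the input is assumed to be three temporally successive RGB images stacked along the channel dimension, and nothing here is the patented architecture.

    import torch
    from torch import nn

    class SemanticBevVAE(nn.Module):
        """Sketch: temporally stacked images in, average semantic BEV image out."""

        def __init__(self, in_channels=9, latent_dim=128, bev_channels=1):
            super().__init__()
            self.encoder = nn.Sequential(               # EN 704: convolutional layers
                nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
            self.fc_mu = nn.Linear(128, latent_dim)      # latent variables 706 (mean)
            self.fc_logvar = nn.Linear(128, latent_dim)  # latent variables 706 (log-variance)
            self.decoder = nn.Sequential(                # DEC 708: fully connected + deconvolution
                nn.Linear(latent_dim, 128 * 8 * 8), nn.ReLU(), nn.Unflatten(1, (128, 8, 8)),
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, bev_channels, 4, stride=2, padding=1))

        def forward(self, x):
            h = self.encoder(x)
            mu, logvar = self.fc_mu(h), self.fc_logvar(h)
            z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization
            return self.decoder(z), mu, logvar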
Since the stereoscopic image obtained by the vehicle 105 may be different from the stereoscopic image obtained by the mapping vehicle 145 at the node 202 (e.g., due to deviations in sensor calibration between vehicles, deviations in vehicle position on the road when the stereoscopic image was obtained, etc.), the decoder 708 may be separated from the rest of the VAE and the average semantic bird's eye view image 710 may be input to the encoder 704. The encoder 704 may then encode the average semantic bird's-eye view image 710 as latent variables 706 for determining the node 202 closest to the vehicle 105 and a three DoF pose of the vehicle 105 relative to that closest node 202. Encoding the average semantic bird's-eye view image as latent variables allows the first computer 110 to identify the topological node 202 closest to the vehicle 105 and the three DoF pose of the vehicle 105 relative to the nearest topological node 202, regardless of any changes between the stereoscopic image obtained by the vehicle 105 and the stereoscopic image obtained by the mapping vehicle 145 at the corresponding node 202. The nearest topological node 202 is defined as the topological node whose three DoF location has the smallest Euclidean distance from the three DoF location of the vehicle 105.
The first computer 110 may obtain a plurality of temporally continuous images, such as stereoscopic video data, and may input the plurality of temporally continuous images into the topology CNN 700 trained to output an average semantic bird's-eye view image 710 based on the plurality of temporally continuous images. The first computer 110 may then input the average semantic bird's-eye view image 710 into the CNN 700 after separating the decoder 708, such that the encoder 704 outputs the latent variables 706 of the average semantic bird's-eye view image 710. After generating the latent variables 706, the first computer 110 may, for example, input the latent variables 706 to a nearest neighbor classifier that includes programming for comparing the latent variables 706 to the latent variables of the average semantic bird's-eye view image of each of the topology nodes 202. For example, the classifier may use machine learning techniques in which latent variables labeled as representing various topological nodes are provided to a machine learning program for training the classifier. Once trained, the classifier may accept the latent variables as input and then provide as output an identification of the topology node closest to the vehicle 105. Additionally, the first computer 110 may input the latent variables 706 to a fully connected layer that processes the latent variables 706 to output a three DoF pose of the vehicle 105 relative to the closest node 202.
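A sketch of the two heads just described, with illustrative dimensions: a plain nearest-neighbor lookup over stored node latent variables, and a fully connected layer that regresses the three DoF pose of the vehicle relative to the closest node. A library k-NN classifier could equally be used; nothing here is prescribed by the disclosure.

    import numpy as np
    import torch
    from torch import nn

    def closest_node(latent, node_latents, node_ids):
        """Compare the latent variables 706 of the current average semantic BEV
        against the stored latent variables of each node 202."""
        dists = np.linalg.norm(node_latents - latent[None, :], axis=1)
        return node_ids[int(np.argmin(dists))]

    # Illustrative fully connected head mapping latent variables to the
    # three DoF pose (x, y, yaw) of the vehicle relative to the closest node.
    pose_head = nn.Sequential(
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, 3))

    def relative_pose(latent_tensor):
        x, y, yaw = pose_head(latent_tensor)
        return float(x), float(y), float(yaw)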
After determining the closest node 202 and the three DoF pose of the vehicle 105 relative to the closest node 202, the first computer 110 may determine the three DoF pose of the vehicle 105 in real world coordinates using the following equation:

P_V^G = P_N^G ⊕ P_V^N

where P_V^G is the three DoF pose of the vehicle 105 in real world coordinates measured relative to the origin of the topological map 200, P_N^G is the three DoF pose of the closest node 202 measured relative to the origin of the topological map 200, P_V^N is the three DoF pose of the vehicle 105 measured relative to the nearest topological node 202, and ⊕ denotes composition of the three DoF poses.
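A minimal sketch of the pose composition denoted by ⊕ above, treating each pose as a planar rigid transform (x, y, yaw); this is an illustration, not the claimed computation.

    import math

    def compose_pose(node_pose, relative_pose):
        """Combine the node's real world three DoF pose with the vehicle's pose
        relative to that node to obtain the vehicle's real world three DoF pose."""
        xn, yn, yaw_n = node_pose        # closest node 202, global coordinates
        xr, yr, yaw_r = relative_pose    # vehicle 105 relative to the node
        # Rotate the relative offset into the global frame, then translate.
        xg = xn + xr * math.cos(yaw_n) - yr * math.sin(yaw_n)
        yg = yn + xr * math.sin(yaw_n) + yr * math.cos(yaw_n)
        yaw_g = (yaw_n + yaw_r + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
        return xg, yg, yaw_g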
After determining the three DoF pose in real world coordinates of the vehicle 105, the first computer 110 may, for example, generate a path along which to operate the vehicle 105. The first computer 110 may then actuate one or more vehicle components 125 to operate the vehicle 105 along the path. As used herein, a "path" is a set of points, e.g., specified as coordinates relative to a vehicle coordinate system and/or as geographic coordinates, that the first computer 110 is programmed to determine using a conventional navigation and/or path planning algorithm. The path may be specified according to one or more path polynomials. A path polynomial is a polynomial function of degree three or less that describes the movement of the vehicle over the ground surface. The movement of a vehicle on a road is described by a multi-dimensional state vector that includes vehicle position, orientation, speed, and acceleration. In particular, the vehicle motion vector may include positions in x, y, and z, yaw, pitch, roll, yaw rate, pitch rate, roll rate, heading speed, and heading acceleration. The path polynomial may be determined, for example, by fitting a polynomial function to the successive 2D positions relative to the ground surface included in the vehicle motion vector.
Further, for example, the path polynomial p(x) is a model that predicts the path as a line depicted by a polynomial equation. The path polynomial p(x) predicts the path for a predetermined upcoming distance x (e.g., measured in meters) by determining a lateral coordinate p:

$p(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3$  (3)

where $a_0$ is an offset, i.e., the lateral distance between the path and the centerline of the host vehicle 105 at the upcoming distance x, $a_1$ is a heading angle of the path, $a_2$ is a curvature of the path, and $a_3$ is a rate of change of curvature of the path.
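For example, the lateral coordinate of equation (3) can be evaluated as follows; the coefficient values are made up for illustration.

```python
# Small illustrative example of evaluating the path polynomial of
# equation (3) to get the lateral coordinate p at an upcoming distance x
# (in meters).
def path_lateral_offset(x, a0, a1, a2, a3):
    """Lateral coordinate p(x) = a0 + a1*x + a2*x^2 + a3*x^3."""
    return a0 + a1 * x + a2 * x**2 + a3 * x**3

# Example: small offset and heading, gentle curvature.
coeffs = dict(a0=0.1, a1=0.02, a2=0.001, a3=-0.00001)
for x in (0.0, 10.0, 25.0, 50.0):
    print(x, path_lateral_offset(x, **coeffs))
```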
The techniques described herein improve vehicle localization by generating and processing bird's eye view images to improve the estimate of the three DoF pose of the vehicle 105. The bird's eye view images improve the ability of the vehicle computer 110 to locate objects in the environment around the vehicle 105 regardless of weather and/or lighting conditions, which allows the vehicle 105 to be located despite changes in the environmental conditions at a node 202 after the node 202 data was acquired. Furthermore, the techniques described herein improve computational efficiency by processing 2D bird's eye view images as compared to 3D semantic point cloud images. Furthermore, CNN 700 does not require a set of ground truth data for each node 202 under each environmental condition in order to be trained to output an average bird's eye view image of the node 202 based on temporally successive stereoscopic images. That is, CNN 700 may output an average semantic bird's eye view image of a node 202 based on temporally successive stereoscopic images obtained in environmental (i.e., weather and/or lighting) conditions that are not included in the ground truth data. In other words, the ground truth data need not include ground truth data for each node 202 under each environmental condition, thereby reducing the amount of ground truth data required to train CNN 700.
Fig. 8 is an illustration of an exemplary process 800 for determining a three DoF pose of vehicle 105 based on a plurality of temporally successive stereoscopic images (i.e., stereoscopic video data). Process 800 begins in block 805. The process 800 may be performed by a first computer 110 included in the vehicle 105 executing program instructions stored in its memory.
In block 805, the topological map 200 of the roads 204, 206 is determined by traversing the roads with a mobile platform equipped with a stereo camera, as discussed with respect to fig. 2. For example, the mapping vehicle 145 may traverse the roads 204, 206 and obtain stereoscopic video data. Alternatively, any mobile platform (e.g., robot, drone, boat, etc.) may be used to determine the route. The topology map 200 includes a plurality of nodes 202, wherein each node 202 includes a three DoF location and an average semantic bird's eye view image 504.
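A possible in-memory representation of such a map is sketched below; the field names and types are assumptions made for illustration, not taken from this disclosure.

```python
# Sketch of a possible node record for topological map 200: each node
# stores a three DoF pose in map coordinates, an average semantic
# bird's-eye view image, and the latent variables of that image for
# nearest-neighbor lookup.
from dataclasses import dataclass, field
import numpy as np

@dataclass
class TopologicalNode:
    node_id: int
    pose: tuple          # (x, y, yaw) in map/real-world coordinates
    avg_bev: np.ndarray  # average semantic bird's-eye view image 504
    latent: np.ndarray   # latent variables 706 of avg_bev

@dataclass
class TopologicalMap:
    nodes: list = field(default_factory=list)  # nodes along roads 204, 206
```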
The second computer 150 may identify the nodes 202 in the topological map 200, or the second computer 150 may provide the stereoscopic video data to the remote server computer 140, which may be programmed to identify the nodes 202. A plurality of temporally successive stereoscopic images from the stereoscopic video data are processed by the computer 140, 150 to produce a semantic point cloud image 402, wherein distances to points in the image are determined and the points are grouped and labeled, as discussed with respect to fig. 4. For example, the semantic point cloud image 402 may include labels for roads, vehicles, pedestrians, buildings, and vegetation. The computer 140, 150 may determine a three DoF pose of a node 202 based on the semantic point cloud image 402, as discussed above. Additionally, the semantic point cloud image 402 is processed by the computer 140, 150 to produce a semantic bird's eye view image 502 of the node 202, as discussed above. The computer 140, 150 then generates an average semantic bird's eye view image based on the semantic bird's eye view image 502 and a plurality of modified semantic bird's eye view images, as discussed above. Process 800 continues in block 810.
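As a rough sketch of the rendering step, the snippet below projects labeled 3D points onto a ground-plane grid; the grid size, resolution, frame conventions, and class encoding are assumptions.

```python
# Hedged sketch of rendering a semantic point cloud into a 2D bird's-eye
# view grid: each 3D point with a semantic label is projected onto the
# ground plane and written into the corresponding grid cell.
import numpy as np

def point_cloud_to_bev(points_xyz, labels, grid_m=40.0, res_m=0.2):
    """points_xyz: (N, 3) points in the vehicle frame (x forward, y left).
    labels: (N,) integer semantic class per point (0 = unknown).
    Returns an (H, W) grid of class labels seen from above."""
    size = int(grid_m / res_m)
    bev = np.zeros((size, size), dtype=np.int32)
    # Shift so the vehicle sits at the center of the grid.
    cols = ((points_xyz[:, 0] + grid_m / 2) / res_m).astype(int)
    rows = ((points_xyz[:, 1] + grid_m / 2) / res_m).astype(int)
    valid = (cols >= 0) & (cols < size) & (rows >= 0) & (rows < size)
    bev[rows[valid], cols[valid]] = labels[valid]
    return bev
```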
In block 810, the first computer 110 in the vehicle 105 trains the topology CNN 700 to input a plurality of temporally successive stereoscopic images and to output an average semantic bird's eye view image 504 of a node 202 in the topological map 200, as discussed with respect to fig. 7. Topology CNN 700 may also be trained to output the topological node 202 nearest to the vehicle 105 that acquired the plurality of temporally successive stereoscopic images and the three DoF pose of the vehicle 105 relative to that node 202. Process 800 continues in block 815.
In block 815, the first computer 110 uses the trained topology CNN 700 to determine the topological node 202 in the topological map 200 nearest to the vehicle 105 and the three DoF pose of the vehicle 105 relative to that nearest node 202. For example, the first computer 110 may obtain a plurality of temporally successive stereoscopic images, i.e., stereoscopic video data, of the environment around the vehicle 105 while operating the vehicle 105 along the roads 204, 206. The first computer 110 may then input the plurality of temporally successive stereoscopic images to the topology CNN 700.
Upon receiving the average semantic bird's eye view image 504 for the node 202, the first computer 110 may input the average semantic bird's eye view image 504 into topology CNN 700 with the decoder 708 separated, such that the encoder 704 of topology CNN 700 outputs latent variables 706 corresponding to the average semantic bird's eye view image 504. The first computer 110 may then process the latent variables 706 to determine the topological node 202 nearest to the vehicle 105 and the three DoF pose of the vehicle 105 relative to the nearest topological node 202, as discussed above. First computer 110 may then determine the three DoF pose of the vehicle 105 relative to the topological map 200 by combining the three DoF pose of the vehicle 105 relative to the closest node 202 and the three DoF pose of the closest node 202 relative to the topological map 200, as discussed above. Process 800 continues in block 820.
In block 820, the first computer 110 may operate the vehicle 105 using the three DoF pose of the vehicle 105 relative to the topological map 200. That is, after locating the vehicle 105, the first computer 110 may determine a path along which to operate the vehicle 105, as discussed above. The first computer 110 may then actuate one or more vehicle components 125, such as braking, steering, and/or propulsion, to move the vehicle 105 along the path. After block 820, process 800 ends.
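Tying blocks 815 through 820 together, the sketch below reuses the hypothetical helpers from the earlier snippets (TopologyVAE, PoseHead, closest_node, compose_pose, TopologicalMap); it illustrates the described flow rather than the patented implementation, and for simplicity it folds the re-encoding of the average semantic bird's eye view image into a single forward pass.

```python
# End-to-end localization sketch under the assumptions stated above.
import torch

def localize(frames, vae, pose_head, topo_map):
    """frames: stacked temporally successive stereo images, shape
    (1, C, H, W) matching the VAE's expected input channels.
    Returns the vehicle's (x, y, yaw) in map coordinates and the index
    of the closest topological node."""
    with torch.no_grad():
        # The forward pass yields the average semantic BEV and its latents;
        # in the description above the average BEV is re-encoded through
        # encoder 704, which this toy sketch folds into a single pass.
        avg_bev, mu, _ = vae(frames)
        node_latents = torch.stack(
            [torch.as_tensor(n.latent, dtype=mu.dtype) for n in topo_map.nodes])
        idx = closest_node(mu.squeeze(0), node_latents)        # block 815
        rel = tuple(pose_head(mu).squeeze(0).tolist())         # pose w.r.t. node
    vehicle_pose = compose_pose(topo_map.nodes[idx].pose, rel)  # equation (2)
    return vehicle_pose, idx
```

The returned pose could then feed a conventional path planner, as in block 820.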
As used herein, the adverb "substantially" means that the shape, structure, measurement, quantity, time, etc. may deviate from the exactly described geometry, distance, measurement, quantity, time, etc. due to imperfections in materials, machining, manufacturing, data transmission, calculation speeds, etc.
In general, the described computing systems and/or devices may employ any of a number of computer operating systems, including, but by no means limited to, versions and/or varieties of the Ford SYNC application, the AppLink/Smart Device Link middleware, the Microsoft Automotive operating system, the Microsoft Windows operating system, the Unix operating system (e.g., the Solaris operating system), the AIX UNIX operating system published by International Business Machines of Armonk, New York, the Linux operating system, the Mac OSX and iOS operating systems published by Apple Inc. of Cupertino, California, the BlackBerry OS published by BlackBerry, Ltd. of Waterloo, Canada, the Android operating system developed by Google, Inc. and the Open Handset Alliance, or the QNX CAR Platform for Infotainment offered by QNX Software Systems. Examples of computing devices include, without limitation, an on-board first computer, a computer workstation, a server, a desktop, notebook, laptop, or handheld computer, or some other computing system and/or device.
Computers and computing devices generally include computer-executable instructions that may be executable by one or more computing devices such as those listed above. Computer-executable instructions may be compiled or interpreted from computer programs created using a variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java™, C, C++, Matlab, Simulink, Stateflow, Visual Basic, JavaScript, Perl, HTML, etc. Some of these applications may be compiled and executed on a virtual machine, such as the Java Virtual Machine, the Dalvik virtual machine, or the like. In general, a processor (e.g., a microprocessor) receives instructions, e.g., from a memory, a computer-readable medium, etc., and executes these instructions, thereby performing one or more processes, including one or more of the processes described herein. Such instructions and other data may be stored and transmitted using a variety of computer-readable media. A file in a computing device is generally a collection of data stored on a computer-readable medium, such as a storage medium, a random access memory, etc.
The memory may include computer-readable media (also referred to as processor-readable media) including any non-transitory (e.g., tangible) media that participates in providing data (e.g., instructions) that may be read by a computer (e.g., by a processor of a computer). Such a medium may take many forms, including but not limited to, non-volatile media and volatile media. Non-volatile media may include, for example, optical or magnetic disks, and other persistent memory. Volatile media may include, for example, dynamic Random Access Memory (DRAM), which typically constitutes a main memory. Such instructions may be transmitted by one or more transmission media, including coaxial cables, copper wire and fiber optics, including the wires that comprise a system bus coupled to the processor of the ECU. Common forms of computer-readable media include, for example, a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, RAM, PROM, EPROM, a flash EEPROM, any other memory chip or cartridge, or any other medium from which a computer can read.
Databases, data repositories, or other data stores described herein may include various kinds of mechanisms for storing, accessing, and retrieving various kinds of data, including a hierarchical database, a set of files in a file system, an application database in a proprietary format, a relational database management system (RDBMS), etc. Each such data store is generally included within a computing device employing a computer operating system such as one of those mentioned above, and is accessed via a network in any one or more of a variety of manners. A file system may be accessible from a computer operating system, and may include files stored in various formats. An RDBMS generally employs the Structured Query Language (SQL) in addition to a language for creating, storing, editing, and executing stored procedures, such as the PL/SQL language.
In some examples, system elements may be implemented as computer-readable instructions (e.g., software) on one or more computing devices (e.g., servers, personal computers, etc.), stored on a computer-readable medium (e.g., disk, memory, etc.) associated therewith. The computer program product may include such instructions stored on a computer-readable medium for performing the functions described herein.
With respect to the media, processes, systems, methods, heuristics, etc. described herein, it should be understood that, while the steps of such processes, etc. have been described as occurring in a certain ordered sequence, such processes could be practiced by executing the described steps in an order different than that described herein. It should also be understood that certain steps may be performed concurrently, other steps may be added, or certain steps described herein may be omitted. In other words, the description of the processes herein is provided for the purpose of illustrating certain embodiments and should not be construed as limiting the claims in any way.
Accordingly, it is to be understood that the above description is intended to be illustrative, and not restrictive. Many embodiments and applications other than the examples provided will be apparent to those of skill in the art upon reading the above description. The scope of the invention should be determined, not with reference to the above description, but should instead be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled. It is contemplated and anticipated that the technology discussed herein will evolve in the future, and that the disclosed systems and methods will be incorporated into such future embodiments. In summary, it is to be understood that the invention is capable of modification and variation and is limited only by the following claims.
Unless explicitly indicated to the contrary herein, all terms used in the claims are intended to be given their ordinary and customary meaning as understood by those skilled in the art. In particular, the use of singular articles such as "a," "an," "the," and the like are to be construed to recite one or more of the indicated elements unless a claim recites an explicit limitation to the contrary.
According to the present invention there is provided a system having a computer comprising a processor and a memory, the memory storing instructions executable by the processor to: receiving a plurality of temporally successive vehicle sensor images as input to a variational automatic encoder neural network, the variational automatic encoder neural network outputting an average semantic bird's-eye view image comprising respective pixels determined by averaging semantic class values of corresponding pixels in respective ones of the plurality of temporally successive vehicle sensor images; determining, from a plurality of topological nodes each specifying a respective real world location, a topological node closest to the vehicle and a three-degree-of-freedom pose of the vehicle relative to the topological node closest to the vehicle based on the average semantic bird's eye view image; and determining a real world three degree of freedom pose of the vehicle by combining the three degree of freedom pose of the vehicle relative to the topological node and the real world position of the topological node closest to the vehicle.
According to one embodiment, the instructions further comprise instructions for: the average semantic bird's-eye view image is generated based on rendering semantic point cloud images of an environment surrounding the vehicle into a two-dimensional plane.
According to one embodiment, the instructions further comprise instructions for: the semantic point cloud image is generated based on combining a semantic image including regions marked by region type and a stereoscopic point cloud image including regions marked by region distance relative to the vehicle.
According to one embodiment, the instructions further comprise instructions for: the stereoscopic point cloud image is generated based on a pair of stereoscopic images acquired by sensors in the vehicle.
According to one embodiment, the instructions further comprise instructions for: the semantic image is generated based on a single stereoscopic image acquired by a sensor in the vehicle.
According to one embodiment, the region types include roads, sidewalks, vehicles, buildings, and vegetation.
According to one embodiment, the instructions further comprise instructions for: the topological node is determined by acquiring a point cloud image with a stereo camera and determining the position of the point cloud image in real world coordinates using visual odometry.
According to one embodiment, the real world three degree of freedom pose of the vehicle is determined in coordinates based on orthogonal x-and y-axes and yaw rotation about a z-axis orthogonal to the x-and y-axes.
According to one embodiment, the instructions further comprise instructions for: the variational automatic encoder neural network is trained to output the average semantic bird's-eye view image using a plurality of modified semantic bird's-eye view images.
According to one embodiment, the instructions further comprise instructions for: each of the plurality of modified semantic bird's-eye view images is generated based on at least one of translating or rotating the semantic bird's-eye view image.
According to one embodiment, the variational automatic encoder neural network determines the three degree of freedom pose of the vehicle relative to the topological node closest to the vehicle by outputting latent variables to a fully connected layer.
According to one embodiment, the variational automatic encoder neural network determines the topology node closest to the vehicle by inputting latent variables of the average semantic bird's-eye view to a nearest neighbor classifier trained to determine the topology node closest to the vehicle.
According to the invention, a method comprises: receiving a plurality of temporally successive vehicle sensor images as input to a variational automatic encoder neural network, the variational automatic encoder neural network outputting an average semantic bird's-eye view image comprising respective pixels determined by averaging semantic class values of corresponding pixels in respective ones of the plurality of temporally successive vehicle sensor images; determining, from a plurality of topological nodes each specifying a respective real world location, a topological node closest to the vehicle and a three-degree-of-freedom pose of the vehicle relative to the topological node closest to the vehicle based on the average semantic bird's eye view image; and determining a real world three degree of freedom pose of the vehicle by combining the three degree of freedom pose of the vehicle relative to the topological node and the real world position of the topological node closest to the vehicle.
In one aspect of the invention, the method includes generating the average semantic bird's-eye view image based on rendering semantic point cloud images of an environment surrounding the vehicle into a two-dimensional plane.
In one aspect of the invention, the method includes generating the semantic point cloud image based on combining a semantic image including regions marked by region type and a stereoscopic point cloud image including regions marked by region distance relative to the vehicle.
In one aspect of the invention, the method includes determining the topological node by acquiring a point cloud image with a stereo camera and determining a position of the point cloud image in real world coordinates using visual odometry.
In one aspect of the invention, the real world three degree of freedom pose of the vehicle is determined in coordinates based on orthogonal x-and y-axes and yaw rotation about a z-axis orthogonal to the x-and y-axes.
In one aspect of the invention, the method includes training the variational automatic encoder neural network to output the average semantic bird's-eye view image using a plurality of modified semantic bird's-eye view images.
In one aspect of the invention, the variational automatic encoder neural network determines the three degree of freedom pose of the vehicle relative to the topological node closest to the vehicle by outputting latent variables to a fully connected layer.
In one aspect of the invention, the variational automatic encoder neural network determines the topology node closest to the vehicle by inputting latent variables of the average semantic bird's eye view into a nearest neighbor classifier trained to determine the topology node closest to the vehicle.

Claims (15)

1. A method, the method comprising:
receiving a plurality of temporally successive vehicle sensor images as input to a variational automatic encoder neural network, the variational automatic encoder neural network outputting an average semantic bird's-eye view image comprising respective pixels determined by averaging semantic class values of corresponding pixels in respective ones of the plurality of temporally successive vehicle sensor images;
determining, from a plurality of topological nodes each specifying a respective real world location, a topological node closest to the vehicle and a three-degree-of-freedom pose of the vehicle relative to the topological node closest to the vehicle based on the average semantic bird's eye view image; and
determining a real world three-degree-of-freedom pose of the vehicle by combining the three-degree-of-freedom pose of the vehicle relative to the topological node and the real world position of the topological node closest to the vehicle.
2. The method of claim 1, further comprising generating the average semantic bird's-eye view image based on rendering semantic point cloud images of an environment surrounding the vehicle into a two-dimensional plane.
3. The method of claim 2, further comprising generating the semantic point cloud image based on combining a semantic image comprising regions marked by region type and a stereoscopic point cloud image comprising regions marked by region distance relative to the vehicle.
4. The method of claim 3, further comprising generating the stereoscopic point cloud image based on a pair of stereoscopic images acquired by sensors in the vehicle.
5. The method of claim 3, further comprising generating the semantic image based on a single stereoscopic image acquired by a sensor in the vehicle.
6. The method of claim 3, wherein the region types comprise a road, a sidewalk, a vehicle, a building, and vegetation.
7. The method of claim 1, further comprising determining the topological node by acquiring a point cloud image with a stereo camera and determining a location of the point cloud image in real world coordinates using visual odometry.
8. The method of claim 1, wherein the real world three degree of freedom pose of the vehicle is determined in coordinates based on orthogonal x-and y-axes and yaw rotation about a z-axis orthogonal to the x-and y-axes.
9. The method of claim 1, further comprising training the variational automatic encoder neural network to output the average semantic bird's-eye view image using a plurality of modified semantic bird's-eye view images.
10. The method of claim 9, further comprising generating each of the plurality of modified semantic bird's-eye view images based on at least one of translating or rotating the semantic bird's-eye view image.
11. The method of claim 1, wherein the variational automatic encoder neural network determines the three degree of freedom pose of the vehicle relative to the topological node closest to the vehicle by outputting latent variables to a fully connected layer.
12. The method of claim 1, wherein the variational automatic encoder neural network determines the topology node closest to the vehicle by inputting latent variables of the average semantic bird's eye view to a nearest neighbor classifier trained to determine the topology node closest to the vehicle.
13. A computer programmed to perform the method of any one of claims 1 to 12.
14. A computer program product comprising instructions for performing the method of any of claims 1 to 12.
15. A vehicle comprising a computer programmed to perform the method of any one of claims 1 to 12.