WO2020060478A1 - System and method for training virtual traffic agents - Google Patents

System and method for training virtual traffic agents

Info

Publication number
WO2020060478A1
Authority
WO
WIPO (PCT)
Prior art keywords
traffic agent
reward
neural network
traffic
agent
Prior art date
Application number
PCT/SG2018/050477
Other languages
French (fr)
Inventor
Oliver Michael GRIMM
Intakhab Mehboob KHAN
Yuanbo XIANG
Ali Raza
Original Assignee
Sixan Pte Ltd
Priority date
Filing date
Publication date
Application filed by Sixan Pte Ltd filed Critical Sixan Pte Ltd
Priority to PCT/SG2018/050477 priority Critical patent/WO2020060478A1/en
Priority to PCT/SG2018/050640 priority patent/WO2020060480A1/en
Publication of WO2020060478A1 publication Critical patent/WO2020060478A1/en


Classifications

    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B 13/0265 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion
    • G05B 13/027 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric the criterion being a learning criterion using neural networks only
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 40/00 - Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W 40/02 - Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models related to ambient conditions
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W 60/00 - Drive control systems specially adapted for autonomous road vehicles
    • B60W 60/001 - Planning or execution of driving tasks
    • G - PHYSICS
    • G05 - CONTROLLING; REGULATING
    • G05B - CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B 13/00 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B 13/02 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
    • G05B 13/04 - Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/50 - Context or environment of the image
    • G06V 20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/588 - Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
    • G - PHYSICS
    • G09 - EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B - EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B 9/00 - Simulators for teaching or training purposes
    • G09B 9/02 - Simulators for teaching or training purposes for teaching control of vehicles or other craft
    • G09B 9/04 - Simulators for teaching or training purposes for teaching control of vehicles or other craft for teaching control of land vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Automation & Control Theory (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Mathematical Physics (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Human Computer Interaction (AREA)
  • Business, Economics & Management (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • Traffic Control Systems (AREA)

Abstract

A computing system for training a traffic agent navigating through a simulation environment, comprising a traffic agent learning system. The traffic agent learning system comprises instructions for: generating a neural network model based on a plurality of predetermined network parameters for emulating a target real world driving competency of a driver; receiving perception input data descriptive of objects perceived by the traffic agent characterizing a current state of an environment; receiving a reward or penalty based on one or more reward parameters associated with the target real world driving competency of a driver; evaluating the performance of the traffic agent by comparing a first mathematical indicator obtained from the performance of the traffic agent with a second mathematical indicator based on data obtained from a real-world source, wherein a close or approximate match of the first mathematical indicator and the second mathematical indicator results in a trained traffic agent.

Description

SYSTEM AND METHOD FOR TRAINING VIRTUAL TRAFFIC AGENTS
Technical Field
[0001] Various embodiments of the present invention generally relate to systems, methods and computer-implemented methods, for training virtual traffic agents, and more specifically, but not exclusively, for training virtual traffic agents for the purpose of testing an autonomous vehicle system that controls a vehicle.
Background
[0002] Autonomous vehicle systems are tested and validated in real world conditions where a physical vehicle controlled by the autonomous vehicle system is driven on physical roads. Such testing can be difficult, expensive and at times even dangerous to other road users. There are limitations to testing an autonomous vehicle in a real world environment since an extremely large number of driving hours may need to be accumulated in order to properly train and evaluate autonomous vehicle systems. It has been recommended that autonomous vehicles should log 11 billion miles of road test data to reach an acceptable safety threshold.
[0003] With the emergence of autonomous vehicle technologies, there is a need for multiple and diversified eco-systems for training, evaluating and validating autonomous vehicle systems that control autonomous vehicles. One developing area for testing and validating autonomous vehicle systems involves the development and deployment of driving simulators. Driving simulators provide advantages in mileage data collection efficiency, road conditions dataset diversity and corresponding sensor data accuracy. They also enable efficient, continuous and unlimited data collection in a virtual environment with relatively low operational costs, all of which have helped to speed up development of autonomous vehicle technologies. However, existing driving simulators are limited and lack the complexity and depth of real world urban driving conditions.
Summary of the Invention
[0004] Throughout this document, unless otherwise indicated to the contrary, the terms "comprising", "consisting of", and the like, are to be construed as non-exhaustive, or in other words, as meaning "including, but not limited to".
[0005] In accordance with a first embodiment of the invention, there is a computing system for training a traffic agent navigating through a simulation environment comprising:
one or more processors;
a memory device coupled to the one or more processors;
a traffic agent learning system stored in the memory device and configured to be executed by the one or more processors, the traffic agent learning system comprising instructions for:
generating a neural network model based on a plurality of predetermined network parameters for emulating a target real world driving competency of a driver;
receiving perception input data descriptive of objects perceived by the traffic agent characterizing a current state of an environment;
selecting an action to be performed by the traffic agent based on the current state of the environment, the action to be performed being operably communicative with a driving controller of the traffic agent;
receiving a reward or penalty based on one or more reward parameters associated with the target real world driving competency of a driver;
updating the plurality of predetermined network parameters based on the reward or penalty received so as to minimize loss of the neural network model;
evaluating the performance of the traffic agent by comparing a first mathematical indicator obtained from the performance of the traffic agent with a second mathematical indicator based on data obtained from a real-world source, wherein a close or approximate match of the first mathematical indicator and the second mathematical indicator results in a trained traffic agent.
[0006] Preferably, the perception input data comprises sensor data from sensors located on the traffic agent and training map data.
[0007] Preferably, the training map data comprises a two-dimensional map representation constructed in an endless loop and information that describes the location of objects within the surrounding environment of the traffic agent.
[0008] Preferably, the plurality of predetermined network parameters for emulating the target real world driving competency of a driver comprises:
a first layer convolutional neural network (CNN) for generating an encoded representation of the perception input data,
an intermediate neural network including a Fully Convolutional Network (FCN) layer in combination with a stack of LSTM (Long Short Term Memory) layers for processing the encoded representation of the perception input data to generate an intermediate representation; and
an output neural network including a plurality of layers comprising a set of functions based on Double Duel Deep Q Networks algorithm for processing the intermediate representation to generate the action to be performed by the traffic agent.
[0009] Preferably, the stack of LSTM layers further comprises a plurality of LSTM layers, wherein a first LSTM layer processes and outputs the encoded representation of the perception input data for further processing as input by a second LSTM layer and a third LSTM layer processes the output of the second LSTM layer.
[0010] Preferably, the first mathematical indicator includes at least one of average reward, average speed, average acceleration or number of legal actions taken for a certain number of actions.
[0011] Preferably, the step of evaluating the performance of the traffic agent further comprises navigating the traffic agent through a testing map data that is different from the training map data.
[0012] Preferably, the reward or penalty awarded based on one or more reward parameters associated with the target real world driving competency of a driver includes quantitative and qualitative indicators of driving competency.
[0013] Preferably, the one or more reward parameters include any one or more of the following: relative velocity, relative acceleration, frequent lane changes, increases in relative acceleration or following distance from a vehicle in front of the traffic agent.
[0014] Preferably, the step of updating the plurality of predetermined network parameters based on the reward or penalty received further comprises calibrating each of the weights of neural network layers using back propagation algorithm so as to achieve minimum loss on the perception input data.
[0015] Preferably, the step of updating the plurality of predetermined network parameters further comprises calibrating each of the weights of neural network layers using back propagation algorithm to iteratively improve identifying a target and to iteratively assist the neural network model to converge on the identified target.
[0016] In accordance with a second embodiment of the invention, there is a computer-implemented method for training a traffic agent navigating through a simulation environment comprising:
generating a neural network model based on a plurality of predetermined network parameters for emulating a target real world driving competency of a driver;
receiving perception input data descriptive of objects perceived by the traffic agent characterizing a current state of an environment;
selecting an action to be performed by the traffic agent based on the current state of the environment, the action to be performed being operably communicative with a driving controller of the traffic agent;
receiving a reward or penalty based on one or more reward parameters associated with the target real world driving competency of a driver;
updating the plurality of predetermined network parameters based on the reward or penalty received so as to minimize loss of the neural network model;
evaluating the performance of the traffic agent by comparing a first mathematical indicator obtained from the performance of the traffic agent with a second mathematical indicator based on data obtained from a real-world source, wherein a close or approximate match of the first mathematical indicator and the second mathematical indicator results in a trained traffic agent.
[0017] Preferably, the perception input data comprises sensor data from sensors located on the traffic agent and training map data.
[0018] Preferably, the training map data comprises a two-dimensional map representation constructed in an endless loop and information that describes the location of objects within the surrounding environment of the traffic agent.
[0019] Preferably, the plurality of predetermined network parameters for emulating the target real world driving competency of a driver comprises:
a first layer convolutional neural network (CNN) for generating an encoded representation of the perception input data,
an intermediate neural network including a Fully Convolutional Network (FCN) layer in combination with a stack of LSTM (Long Short Term Memory) layers for processing the encoded representation of the perception input data to generate an intermediate representation; and
an output neural network including a plurality of layers comprising a set of functions based on Double Duel Deep Q Networks algorithm for processing the intermediate representation to generate the action to be performed by the traffic agent.
[0020] Preferably, the stack of LSTM layers further comprises a plurality of LSTM layers, wherein a first LSTM layer processes and outputs the encoded representation of the perception input data for further processing as input by a second LSTM layer and a third LSTM layer processes the output of the second LSTM layer.
[0021] Preferably, the first mathematical indicator includes at least one of: average reward, average speed, average acceleration or number of legal actions taken for a certain number of actions.
[0022] Preferably, the step of evaluating the performance of the traffic agent further comprises navigating the traffic agent through a testing map data that is different from the training map data.
[0023] Preferably, the reward or penalty awarded based on one or more reward parameters associated with the target real world driving competency of a driver includes quantitative and qualitative indicators of driving competency.
[0024] Preferably, the one or more reward parameters include any one or more of the following: relative velocity, relative acceleration, frequent lane changes, increases in relative acceleration or following distance from a vehicle in front of the traffic agent.
[0025] Preferably, the step of updating the plurality of predetermined network parameters based on the reward or penalty received further comprises calibrating each of the weights of neural network layers using back propagation algorithm so as to achieve minimum loss on the perception input data.
[0026] Preferably, the step of updating the plurality of predetermined network parameters further comprises calibrating each of the weights of neural network layers using back propagation algorithm to iteratively improve identifying a target and to iteratively assist the neural network model to converge on the identified target.
Brief Description of the Drawings
[0027] In the drawings, like reference characters generally refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the invention. The dimensions of the various features or elements may be arbitrarily expanded or reduced for clarity. In the following description, various embodiments of the invention are described with reference to the following drawings, in which:
[0028] FIG. 1A depicts a block diagram of an example computing system architecture for training a traffic agent according to various embodiments;
[0029] FIG. 1B depicts a block diagram of an example traffic agent learning system for a virtual traffic agent according to various embodiments;
[0030] FIG. 2 depicts a flow chart diagram of an example method for training a traffic agent according to various embodiments;
[0031] FIG. 3 depicts a graphical diagram of components of an example training map data according to various embodiments;
[0032] FIG. 4 depicts a graphical diagram and description of an example perception input data according to various embodiments;
[0033] FIG. 5 depicts a graphical diagram of components of a traffic agent learning system focusing on evaluation according to various embodiments;
[0034] FIG. 6 depicts a block diagram of an example setup of a policy neural network model according to various embodiments; and
[0035] FIG. 7 depicts a graphical diagram of components of a traffic agent learning system focusing on rewarding according to various embodiments.
Detailed Description
[0036] The following detailed description refers to the accompanying drawings that show, by way of illustration, specific details and embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. Other embodiments may be utilized and structural, and logical changes may be made without departing from the scope of the invention. The various embodiments are not necessarily mutually exclusive, as some embodiments can be combined with one or more other embodiments to form new embodiments.
[0037] In the specification the term "comprising" shall be understood to have a broad meaning similar to the term "including" and will be understood to imply the inclusion of a stated integer or step or group of integers or steps but not the exclusion of any other integer or step or group of integers or steps. This definition also applies to variations on the term "comprising" such as "comprise" and "comprises".
[0038] In order that the invention may be readily understood and put into practical effect, particular embodiments will now be described by way of examples and not limitations, and with reference to the figures. It will be understood that any property described herein for a specific system may also hold for any system described herein. It will be understood that any property described herein for a specific method may also hold for any method described herein. Furthermore, it will be understood that for any system or method described herein, not necessarily all the components or steps described must be enclosed in the system or method, but only some (but not all) components or steps may be enclosed.
[0039] The term "configured" herein may be understood as in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions, it means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operations or actions.
[0040] To achieve the stated features, advantages and objects, the present disclosure is directed to systems and methods that make use of computer hardware and software to train a virtual traffic agent navigating through a simulation environment using reinforcement learning algorithms and techniques. A virtual traffic agent can for example be a car, truck, bus, bike or motor bike. Once a virtual traffic agent has been trained in a way that replicates human driving behavior for a predetermined geographical area, the goal is to inject one or more trained virtual traffic agents into a simulation environment representing the predetermined geographical area where they can interact, cooperate with and challenge an autonomous vehicle system controlling an autonomous vehicle under test. The goal is to test the limits and weaknesses of the autonomous vehicle system, especially in adverse and dangerous traffic scenarios that are attributable to assertive or aggressive driving behaviors.
[0041] The disclosed systems and methods have a technical effect and benefit of providing an improvement to autonomous vehicle computing technology. For example, by utilizing the disclosed systems and methods, autonomous vehicle systems can avoid rules-based or hand-crafted rules systems, which can be less effective and flexible for decisions made by the autonomous vehicle systems. It also greatly reduces the research time needed relative to development of hand-crafted rules. For example, a designer would need to exhaustively derive several models of how different vehicles would need to react in different scenarios, which can be challenging given all the possible scenarios that an autonomous vehicle may encounter. Additionally, there is the benefit of significant scalability and customizability since a plurality of various ride scenarios can be simulated for a plurality of geographical locations with the traffic agents without moving real vehicles in the real world. For example, how an autonomous vehicle would be driven around other traffic agents may be different in Germany as opposed to China due to differing driving cultures and specific country rules. By using the disclosed traffic agent learning systems as described herein, a traffic agent can be trained on appropriate input data at a massive scale, and can also be easily revised as new data is made available. This also significantly reduces the hardware resources and human resources that are used for training, evaluating and testing autonomous vehicle systems. There is also a significant reduction in risk since the process is conducted in a simulation environment. Damage, accidents and loss of life can be prevented and avoided by simulating risky scenarios in a simulation environment instead of the real world.
[0042] Various embodiments are provided for systems, and various embodiments are provided for methods. It will be understood that basic properties of the systems also hold for the methods and vice versa. Therefore, for sake of brevity, duplicate description of such properties may be omitted.
[0043] FIG. 1A illustrates a block diagram of an example architecture of a computing system 10 for training a virtual traffic agent according to various embodiments. The computing system 10 includes a server 11 in communication with one or more client devices 12 via a communications network (not shown) which operably communicates with a traffic agent learning system 100 and a driving controller 160 for controlling the driving operations of the traffic agent. The client device 12 can comprise a personal computer, a portable computing device such as a laptop, a television, a mobile phone, or any other appropriate storage and/or communication device to exchange data via a web browser and/or communications network. The computing system 10 includes a Central Processing Unit (CPU) or a processor 15 which executes instructions contained in programs such as the traffic agent learning system and stored in storage devices (not shown). The processor 15 may provide the central processing unit (CPU) functions of a computing device on one or more integrated circuits. As used herein, the term 'processor' broadly refers to and is not limited to a single or multi-core general purpose processor, a special purpose processor, a conventional processor, a graphical processing unit, a digital signal processor (DSP), a plurality of microprocessors, one or more microprocessors in association with a DSP core, a controller, a microcontroller, one or more Application Specific Integrated Circuits (ASICs), one or more Field Programmable Gate Array (FPGA) circuits, any other type of integrated circuit, a system on a chip (SOC), and/or a state machine.
[0044] As used herein, one or more client devices 12 may exchange information via any communication network, such as a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a proprietary network, and/or Internet Protocol (IP) network such as the Internet, an Intranet or an extranet. Each client device 12, module or component within the system may be connected over a network or may be directly connected. A person skilled in the art will recognize that the terms 'network', 'computer network' and 'online' may be used interchangeably and do not imply a particular network embodiment. In general, any type of network may be used to implement the online or computer networked embodiment of the present invention. The network may be maintained by a server or a combination of servers or the network may be serverless. Additionally, any type of protocol (for example, HTTP, FTP, ICMP, UDP, WAP, SIP, H.323, NDMP, TCP/IP) may be used to communicate across the network. The devices as described herein may communicate via one or more such communication networks.
[0045] The computing system 10 includes a memory 16 for storing data and software instructions which are executed by the processor 15 and may control operation of various aspects of the computing system 10. The memory 16 can include one or more databases 180 and image processing software, as well as the traffic agent learning system 100 as described herein, which further includes a neural network or a deep neural network. The memory 16 used in the embodiments may include a volatile memory, for example a DRAM (Dynamic Random Access Memory) or a non-volatile memory, for example a PROM (Programmable Read Only Memory), an EPROM (Erasable PROM), EEPROM (Electrically Erasable PROM), or a flash memory, e.g., a floating gate memory, a charge trapping memory, an MRAM (Magnetoresistive Random Access Memory) or a PCRAM (Phase Change Random Access Memory). Alternatively or operably communicating with the memory 16, the computing system also includes a solid state drive (SSD) 13 that stores data and software instructions to be executed by the processor 15. A solid state drive 13 is a data storage device that uses a NAND-based flash memory to store data and is a form of non-volatile memory. Once data is written to the flash memory, and if no power is supplied to the flash memory, the data is still retained in the flash memory. The memory 16 and/or the SSD 13 includes a database 180 for storing at least one of perception input data 110, training map data 120, testing map data 130, and reward parameters 140, each of which provides an input to the traffic agent learning system 100.
[0046] FIG. 1B illustrates a block diagram of an example traffic agent learning system 100 for training a virtual traffic agent 20 (refer to FIG. 4). The presently disclosed methods and systems of training a traffic agent 20 are based on reinforcement learning methods. Reinforcement learning is a category of machine learning in which a machine (agent) learns a policy specifying which action to take in any situation (state), in order to maximize the expected reward according to a reward function. Reinforcement learning methods typically compute a value function expressing the expected longer term reward of a state, and may also compute a predictive model of the environment in terms of sequences of states, actions and rewards. While powerful, reinforcement learning agents are dependent on the crafting of an appropriate reward function towards some task or goal and are further dependent on hand-tuned parameters controlling learning rates, future reward discount factors, and exploit-vs-explore trade-offs. The traffic agent learning system 100 may be trained through exposure to various navigational states, and when the system applies the policy, it provides a reward based on a reward function that is designed to reward desired navigational or driver behavior. Based on the reward received, the system may learn the policy and becomes trained in producing desired navigational actions or desired driving competencies. For example, the learning system 100 may observe the current state s_t ∈ S and decide on an action a_t ∈ A based on a policy π: S → Δ(A). Based on the decided action, the environment moves to the next state s_{t+1} ∈ S for observation by the learning system 100. For each action developed in response to the observed state, the feedback to the learning system is a reward signal r_1, r_2, .... The goal of reinforcement learning is to find a policy π.
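By way of illustration only, the state-action-reward loop described above can be sketched in a few lines of Python. The SimulationEnvironment and Policy interfaces, their method names and the episode length are assumptions made for this sketch and are not part of the present disclosure.

```python
# Minimal sketch of the reinforcement learning loop described above.
# SimulationEnvironment and Policy are hypothetical interfaces used only for
# illustration; their names and methods are not defined in this disclosure.

class SimulationEnvironment:
    def reset(self):
        """Return the initial perception state s_0."""
        raise NotImplementedError

    def step(self, action):
        """Apply the chosen action and return (next_state, reward)."""
        raise NotImplementedError


def run_episode(env, policy, num_steps=1000):
    state = env.reset()
    total_reward = 0.0
    for t in range(num_steps):
        action = policy.select_action(state)               # a_t from pi(s_t)
        next_state, reward = env.step(action)              # environment moves to s_{t+1}
        policy.observe(state, action, reward, next_state)  # reward signal fed back
        total_reward += reward
        state = next_state
    return total_reward
```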
[0047] One of the goals of training the learning system 100 is to produce a traffic agent capable of emulating driving competencies or driving behavior that are close to human driving behaviors typical of a geographical region. A geographical region can include specific countries, or regions or cities within specific countries. For example, driving competencies can be based on how drivers interact or negotiate with other road users, road infrastructure or road geometry, such as the level of aggressiveness, distractedness, driving styles, or driving maneuvers. In another example, a road user in Germany will behave differently from a road user in China and the driving behavior or driving style in terms of aggressiveness and distraction can differ greatly. When a traffic agent 20 is trained for its driver behavior pattern, it is trained to behave in an unpredictable manner and, in certain traffic scenarios, it can provoke dangerous incidents or even cause accidents. One or more trained traffic agents can be injected into a simulation environment representing a geographical area, and movement of the traffic agents can be controlled according to driver behavior data representing driver behavior patterns exhibited by drivers in that particular geographical area. Adapting the movement of simulated vehicles according to driver behavior patterns in a geographical region will significantly enhance the realism of the simulation used in testing autonomous vehicle driving systems that are made for that geographical region, with the aim of identifying and eliminating the limits and weaknesses of the autonomous vehicle driving systems. A traffic agent 20 can for example be a car, truck, bus, bike or motor bike.
[0048] Another goal of the learning system 100 is to train the traffic agent 20 for selecting actions that are specific to certain advanced driver assist functions. For example, a traffic agent 20 can be trained to emulate human-like driving behavior when following other road users in a traffic jam. This requires the traffic agent 20 to be exposed to numerous relevant scenarios comprising different environmental states, stationary and dynamic road elements and road users. In another example, a traffic agent 20 can also be trained in other driving competencies that emulate real-world drivers, for example, cooperative or defensive driving behavior that allows traffic agents to communicate with each other by giving way to each other, or, conversely, uncooperative behavior such as refusing to let another road user merge from a highway entry ramp.
[0049] In an embodiment of the present invention, the traffic agent learning system 100 obtains perception input data 110, training map data 120, testing map data 130 and reward parameters 140, as input from the database 180. FIG. 3 illustrates examples of the training map data 120. The training map data 120 includes the synthesis of data received from a training map database 123 and a road infrastructure database 121. The training map database 123 comprises a specially constructed training map 124 in both two-dimensional and three-dimensional format. The training map 124 is constructed in an endless loop so that the traffic agent can hypothetically drive in an endless loop for an innumerable number of cycles. The training map data 120 is constructed with equally distributed stationary and movable road elements and road users taken from the road infrastructure database 121. Stationary road elements include traffic signs, poles, road barriers, buildings, monuments, structures, natural objects, vegetation and/or a terrain surface, and movable road elements include ground vehicles, pedestrians, animals, or the like. The road infrastructure database 121 includes a road infrastructure generator 122 which generates road infrastructure elements such as basic rules and balanced elements in order to provide an unbiased training. The basic rules are guidelines for the construction of road infrastructure and stationary road elements and can include legislation, regulations, policies, design standards, and/or recommendations of a country, state, region, province or the like. The guidelines can, for example, relate to maximum possible road curvatures, minimum lane width or styles of road markings. The basic rules can be categorized by countries, regions, or cities. For the purposes of training a virtual traffic agent, road elements will need to comply with the basic rules. Other road infrastructure elements can include variances of road sections with varying road curvatures, traffic signs or different numbers of road lanes. The road infrastructure database 121 also includes balanced elements for the provision of the unbiased training of a virtual traffic agent 20. In ensuring balanced elements during training, a traffic agent 20 should be trained on motorways that are representative of a typical motorway map. A typical motorway map would provide all differing types of possible road curvatures that are in line with the basic rules of the particular country, state, region, city or province. In contrast, an example of unbalanced elements in the training of a traffic agent 20 would be a training map that is limited to a straight road with only two lanes in each direction. In this example, the traffic agent 20 will learn how to drive perfectly on this road, but will fail when it drives on a curvy, three lane motorway as it has not experienced driving on such a road before. The training map data 120 is constructed with the aim of providing an unbiased training and assumes that the traffic agent 20 is experiencing all possible unique states with an equal probability.
[0050] The testing map data 130 is utilized for testing the performance of the traffic agent 20 when the training process has concluded or when evaluating the performance of the traffic agent (under Evaluator 155 or step 260 of FIG. 2). The testing map data 130 differs from the training map data 120 in terms of the map layout and sequences of scenes comprising positions of stationary and movable elements. The performance of the traffic agent 20 is expected to be high when states of the perception were already experienced during the training process.
[0051] The perception input data 110 includes sensor data received from the one or more sensors that are coupled to or otherwise included within a traffic agent 20 or an autonomous vehicle driving system. For example, the one or more sensors can include a Light Detection and Ranging (LIDAR) system, a Radio Detection and Ranging (RADAR) system, one or more image capture devices and/or other sensors. The perception input data 110 can include information that describes the location of objects within the surrounding environment of the traffic agent 20. Additionally, the perception input data 110 can retrieve or otherwise obtain map data (described herein below) that provides detailed information about the surrounding environment of the traffic agent 20, in both 2-dimensional and 3-dimensional representation. The perception input data 110 that is provided as input to the traffic agent learning system 100 is based on a 2-dimensional map representation. The perception input data 110 includes a discretized occupancy grid of the traffic agent's 20 surroundings according to the sensors' field of view. Alternatively, the perception input data 110 can also mimic the limited perception view of humans, for example, by taking into account the conventional blind spot of drivers. The traffic agent receives an awareness of drivable and non-drivable areas which is updated on a frame-by-frame basis. The perception input data 110 also includes a representation of stationary road elements and movable road elements and/or any other map data that provides information that assists the traffic agent learning system 100 in comprehending and perceiving its surrounding environment and its relationship thereto. Some examples of static map elements are traffic signs, poles, road barriers and vegetation. Some examples of dynamic objects are vehicles, pedestrians, bikes and animals. An important subset of the perception input data 110 is an area where other objects shall be avoided. The traffic agent's surroundings can be discretized up to a grid level of 10 cm by 10 cm, or even finer. The chosen resolution defines the number of input parameters per state and impacts the calculation effort and the required preciseness of stationary or movable elements.
[0052] Referring to FIG. 4, the perception input data 110 can identify one or more objects that are proximate to the traffic agent 20 based on sensor data received from the one or more sensors on the traffic agent (not shown) and the training map data 120 or the testing map data 130. In some implementations, the perception input data 110 can include state data that describes a current state of an object proximate to the traffic agent 20. For example, the state data for each object can describe an estimate of the object’s current location (also known as position), current speed (or velocity), current acceleration, current orientation, class (for example, car versus pedestrian versus motorbike versus bicycle), and/or any other state information.
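As a hypothetical illustration of how such perception input data could be discretized, the short Python sketch below rasterizes nearby object positions into an occupancy grid around the traffic agent. The 0.1 m cell size reflects the 10 cm grid level mentioned above; the function name, the 50 m perception radius and the object fields are assumptions for illustration only.

```python
import numpy as np

# Illustrative occupancy-grid construction around the traffic agent.
# Cells marked 1 correspond to areas where other objects shall be avoided.

def build_occupancy_grid(agent_xy, objects, radius_m=50.0, cell_m=0.1):
    size = int(2 * radius_m / cell_m)            # cells per side (1000 x 1000 here)
    grid = np.zeros((size, size), dtype=np.uint8)
    for obj in objects:                          # obj: dict with 'x', 'y' in metres
        dx = obj["x"] - agent_xy[0]
        dy = obj["y"] - agent_xy[1]
        if abs(dx) < radius_m and abs(dy) < radius_m:
            col = int((dx + radius_m) / cell_m)
            row = int((dy + radius_m) / cell_m)
            grid[row, col] = 1                   # occupied cell
    return grid

grid = build_occupancy_grid((0.0, 0.0), [{"x": 3.2, "y": -1.5}])
```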
[0053] Referring again to FIG. 1B, the traffic agent learning system 100 is an example of a system implemented as a computer program on one or more computers in one or more locations in which the components and techniques described herein below are implemented. The traffic agent learning system 100 includes a neural network model 151, an action selector 152, a reward generator 153, a neural network model updater 154 and an evaluator 155. The neural network model 151 is configured to receive perception input data 110 at each of multiple time steps t from the simulation environment. The neural network model 151 processes the perception input data 110 in accordance with a set of network parameters of the neural network model 151 to generate a policy output that the traffic agent learning system 100 uses to determine an action (via the action selector 152) to be performed by the traffic agent at a time step t. The perception input data 110 is representative of the current state of the simulation environment at each time step t and an action selector 152 selects an action to be performed by the traffic agent 20 in response to the perception input data 110 received. For example, if the traffic agent is a simulated vehicle navigating through a simulation environment, the selected action may be passed to a driving controller 160 as control inputs to control the simulated vehicle. The action selected can be from one of the choices: keep, accelerate, decelerate, drive left or drive right. At each time step t, the traffic agent 20 receives a reward from the reward generator 153 at time step t+1 based on the current state of the environment and the action selected. For example, the reward generator 153 may generate a reward based on progress toward the traffic agent accomplishing one or more goals. Based on the reward generated by the reward generator 153, the neural network model updater 154 modifies the weights of the individual neurons of the neural network model so that the neural network model 151 is updated to reflect the modified weights and to generate updated parameters for the neural network model for a current state of the environment. The goal is to carefully design the reward parameters 140 and to allow the traffic agent to navigate the training map for innumerable time steps. This will ultimately train the neural network model 151 representing an optimally trained traffic agent 20.
[0054] When the traffic agent 20 has undergone the training phase representative of the traffic agent 20 navigating through a simulation environment characterized by the training map data 120, an evaluator 155 will proceed to judge the performance of the traffic agent 20 during an evaluation phase. This is done through a comprehensive set of predefined mathematical indicators. For example, if the predefined mathematical indicators are indicative of driver behavior patterns, the evaluator 155 will compare each of the mathematical indicators associated with the traffic agent 20 with the driving behavior data collected from real drivers or vehicle trajectories extracted from GPS data or video data. Examples of mathematical indicators evaluated are average reward, average speed, average acceleration, and number of legal actions taken for a certain number of actions. The evaluator 155 will utilize testing map data 130, which is different from the training map data 120 utilized during the training phase. This is to avoid overfitting that may have taken place during the training phase. In the evaluation phase, the traffic agent 20 will continue to drive on the predefined testing map and will be exposed to two different types of states. The first type is states that it has already experienced during the training phase and the second type is unknown states that are characterized by new features such as new road curvatures or spatially or temporally different constellations of road users.
[0055] FIG. 2 illustrates a flow diagram of an example method of training a traffic agent according to embodiments of the present disclosure. At step 200, the training process is initialized by determining the parameters of the perception input data 110. The parameters of the perception input data 110 may include the resolution of the grid and/or the perception radius of the traffic agent 20. The parameters are chosen based on the environmental perception of the traffic agent 20 and how it perceives its surrounding environment. For example, factors affecting the choice of parameters can include the following: (i) the dimensions of the perception in front, to the left, to the right, and to the rear of the vehicle, (ii) weather conditions which can affect the dimensions of the perception of the traffic agent, and (iii) predetermined perception areas that exclude areas occluded by conventional blind spots. For example, if the goal is to train a traffic agent 20 that mimics the driving style of a driver resident in a city in China, the perception radius of the traffic agent 20 may be reduced in order to provoke a more aggressive driving style. Conversely, if the goal is to train a traffic agent 20 that mimics the driving style of a driver resident in a city in Europe, the perception radius of the traffic agent 20 may be increased to emulate a less aggressive driving style.
[0056] At step 210, a neural network model is generated based on a set of predetermined network parameters. To set up a neural network, the layers and artificial neurons are first created and joined as a network of successive layers. The number of layers and the number of neurons in each layer need to be carefully decided as they play a crucial and sensitive role in the learning of the neural network. The weights of the neural network are then initialized based on different randomization techniques, which will give random predictions at the start. Different types and sizes of layers can be used and combined to provide different learning outcomes and behaviors. The last layer is the output layer of the neural network where the loss is calculated. The goal of the learning process is to minimize the loss or error of the network over different samples of ground truth data. The neural network is then trained over batches of the training data, test data or ground truth data and the weights of the neural network are then updated using the backpropagation algorithm. With each successive training step, the goal is to calibrate tens of thousands or millions of weights in a way that the minimum loss is obtained on the training data and test data.
[0057] Designing the dimensions and parameters of the neural network mostly follows a heuristic approach but is also based on theoretical assumptions and proofs. It requires an iterative probing and evaluation of the outcome. Generally speaking, the more complex the problem at hand, the more layers and neurons per layer are needed. But more layers and neurons also mean more parameters to train, which makes the learning slow. A very big network can also lead to overfitting problems in which the neural network memorizes the trained data instead of generalizing the patterns, since the network is especially well trained for scenes experienced during training on the training map. However, when driving on a different map and experiencing unseen traffic scenes, the network might perform badly. In other words, the neural network would memorize the training data instead of generalizing the problem. On the other hand, a small network can lead to underfitting, i.e. the network not being able to model the problem at all. Therefore, careful and intensive adjustment of the network parameters is required to come up with a good neural network architecture and size for the problem at hand. After intensive probing and adjustment of the network parameters of the neural network, and for the specific purpose of training a traffic agent 20 to mimic a target real world driving competency, an example neural network architecture may be implemented as shown in FIG. 6.
[0058] Referring back to FIG. 2, step 220 marks the start of the training phase and the traffic agent learning system 100 obtains perception input data 110 that is based on training map data 120 perceived by the traffic agent 20. The perception input data 110 is a combination of the sensor data received from the one or more sensors that are coupled to or otherwise included within the traffic agent 20 and a 2-dimensional map representation of information that describes the location of objects within the surrounding environment of the traffic agent 20. The perception input data 110 is therefore a combined realistic perception of the surrounding environment as perceived by the traffic agent 20. The training map data 120 may be representative of the 2-dimensional map representation. For example, for the goal of training a traffic agent to mimic the real world driving behavior of a driver resident in a city in China, the training map data 120 includes movable and stationary road elements that are compliant with legislation, policies, guidelines, or design standards that are related to road infrastructure or traffic rules in China. For example, the training map data 120 for a city in China may include maximum possible road curvatures or minimum lane widths for stationary road elements such as side roads, or it may include minimum or maximum lengths and widths of movable road elements such as a typical Sport Utility Vehicle (SUV) or trucks.
[0059] At step 230, the neural network model 151 processes the perception input data 110 characterizing the current state of the environment and selects an action to be performed by the traffic agent that is interacting with the simulation environment at time step t. If the traffic agent 20 is a simulated vehicle, the selected action may be passed to a driving controller 160 that controls the speed and direction of the simulated vehicle. For example, the action to be performed can be one of the following: keep, accelerate, decelerate, drive left, drive right.
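The sketch below shows one possible way of turning the policy output into one of the five driving actions. The patent names the candidate actions but not a particular selection rule, so the epsilon-greedy exploration shown here is an assumption for illustration.

```python
import random

ACTIONS = ["keep", "accelerate", "decelerate", "drive_left", "drive_right"]

# Epsilon-greedy selection over the five candidate actions. The exploration
# rate is an illustrative assumption, not a value taken from the disclosure.

def select_action(q_values, epsilon=0.1):
    """q_values: one estimated value per action, produced by the policy network."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)                           # explore
    best = max(range(len(ACTIONS)), key=lambda i: q_values[i])
    return ACTIONS[best]                                        # exploit

chosen = select_action([0.2, 0.8, -0.1, 0.05, 0.3])             # usually "accelerate"
```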
[0060] At step 240, a reward or penalty is received based on the current state of the environment and the action selected by the traffic agent at time step t. For example, the traffic agent 20 may receive a reward or a penalty for a given time step, t+1, based on progress toward the traffic agent 20 accomplishing one or more goals. The reward parameters 140 are predetermined and can be selected based on the type of real world driving competency or driving behavior pattern that the traffic agent 20 is intended to emulate. For example, if an aggressive driver behavior pattern is to be emulated, some indicators of aggression, both qualitative and quantitative, may be used. Qualitatively, the perceived level of attention, i.e. potential impairment of the driver, distraction, being asleep, or refusing to cooperate with other road users, may be used as reward parameters 140. Quantitatively, navigational actions may be selected or developed based on indicators of aggression. For example, the relative velocity, relative acceleration, frequent lane changes, increases in relative acceleration, or following distance from the front vehicle may be used as reward parameters 140.
[0061] At step 250, based on the reward or penalty generated by the reward generator 153, the neural network model updater 154 modifies the weights of the individual neurons of the neural network model so that the neural network model 151 is updated to reflect the modifications to the weights of the individual neurons and to generate updated parameters for the neural network model for a current state of the environment. In contrast with supervised learning, where a target (ground truth) is known and the task is to make the neural network model converge to the target, the target in reinforcement learning is typically not known and the traffic agent 20 explores the environment and states to find the target. There are therefore two tasks occurring concurrently: (i) to iteratively improve the target, and (ii) to iteratively make the neural network model converge to the target. In one embodiment, based on the reward or penalty awarded, the following equation is used to update the target of the neural network model:
Y_t^DQN = R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t)

where R_{t+1} is the reward or penalty given for time step t, γ is the future reward discount factor, and the remaining term is the discounted future reward, so that Y_t represents the goal of the learning process. With each training step, the goal is to try to make the neural network model converge to Y_t by using different evolutionary methods.
[0062] As mentioned above, the goal of the learning process is to minimize the loss or error of the neural network model over different samples of perception input data. The neural network model is then trained over batches of perception input data and the weights of the neural network are updated using the backpropagation algorithm. With each successive training step, the goal is to calibrate tens of thousands or even millions of weights of individual neurons in order to obtain a minimum loss on the perception input data. Steps 220, 230, 240 and 250 are steps conducted during the training phase and are iteratively processed during the training phase. Once the losses or errors of the neural network model over different samples of perception input data are minimized accordingly, or in other words, once a target is known and the neural network model converges to the target, the training phase concludes and the evaluation phase in step 260 will be initiated.
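A small numerical sketch of the target and loss computation described above may help make the update concrete. The discount factor, the reward value and the Q-value estimates are illustrative numbers only; in practice they come from the reward generator and the neural network model.

```python
import numpy as np

# Sketch of the Q-learning target from the equation above and the squared error
# that backpropagation drives toward a minimum. Values are illustrative only.

gamma = 0.99                                   # future reward discount factor
reward = 1.0                                   # R_{t+1} from the reward generator

q_next = np.array([0.4, 1.2, 0.1, 0.3, 0.7])   # Q(S_{t+1}, a; theta) for the 5 actions
target = reward + gamma * q_next.max()         # Y_t = R_{t+1} + gamma * max_a Q(...)

q_current = 1.5                                # Q(S_t, a_t; theta) for the taken action
loss = (target - q_current) ** 2               # minimized via backpropagation
```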
[0063] At step 260, the performance of the traffic agent 20 is evaluated based on navigating through testing map data 130. The testing map data 130 is different from the training map data 120 in terms of map layout and sequences. In the evaluation phase, the traffic agent 20 will be exposed to experienced states (i.e. states already experienced during training) and unknown states (i.e. exploratory states that are characterized by new road curvatures). The evaluation phase at step 260 is described in further detail below.
[0064] At step 270, and as described herein below, when the predefined mathematical indicators characterized by the target competency or driver behavior pattern approximates or matches the mathematical indicators of real-world driving competency or driver behavior pattern data of the target geographical area, the traffic agent 20 will be considered optimally trained.
[0065] FIG. 5 illustrates an abstract and representative example of a neural network architecture employed by an embodiment of the invention. It illustrates the relationship between the traffic agent 20, the simulation environment 400 and the evaluator 155. The traffic agent learning system 100 selects actions to be performed by the traffic agent 20 that is interacting with the simulation environment 400 at each of multiple time steps t. The actions to be performed can be one of the following: keep, accelerate, decelerate, drive left, drive right. In order for the traffic agent 20 to interact with the simulation environment 400, the traffic agent 20 receives perception input data 110 characterizing the current state of the environment and selects an action to be performed by the agent in response to the perception input data 110. At each time step t, a reward is received based on the current state of the environment and the action of the traffic agent at time step t. For example, the traffic agent 20 may receive a reward for a given time step, t+1, based on progress toward the traffic agent 20 accomplishing one or more goals. The simulation environment 400 can be a simulated environment that is based on a training map represented by training map data 120 or a testing map represented by testing map data 130. In some implementations, the type of simulation environment selected depends on the phase the traffic agent is in. If the traffic agent 20 is in the training phase, as indicated by steps 220, 230, 240 and 250 above, the simulation environment will be based on the training map data 120. If the traffic agent 20 is in the evaluation phase, as indicated by step 260 above, the simulation environment will be based on the testing map data 130, which is different from the training map data 120. When the traffic agent 20 is in the evaluation phase, the evaluator 155 will proceed to judge the performance of the traffic agent 20. This is done through a comprehensive set of predefined mathematical indicators. For example, if the predefined mathematical indicators are indicative of driver behavior patterns, the evaluator 155 will compare each of the mathematical indicators associated with the traffic agent 20 with the mathematical indicators obtained from real world driving behavior data collected from real drivers or vehicle trajectories extracted from GPS data or video data. Examples of mathematical indicators evaluated are average reward, average speed, average acceleration, and number of legal actions taken for a certain number of actions. If the mathematical indicators associated with the traffic agent 20 are close to or match the mathematical indicators obtained from real-world driving behavior data, this will result in a trained traffic agent 20 that emulates the real-world driving competency or driver behavior pattern of a predetermined geographical area. If the mathematical indicators associated with the traffic agent 20 are not close to the mathematical indicators obtained from real-world driving behavior data, the traffic agent 20 will continue to navigate the testing map and continue to tune and optimize the reward parameters and network parameters of the neural network to achieve its intended driver behavior pattern.
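A possible reading of this evaluation step is sketched below: indicators logged for the traffic agent are compared against reference indicators derived from real-world driving data, and the agent is treated as trained when every indicator falls within a tolerance. The indicator names follow the description above, while the 10% tolerance and the dictionary layout are assumptions for illustration.

```python
# Illustrative comparison of agent indicators against real-world reference
# indicators. The tolerance threshold is an assumption, not a disclosed value.

def is_trained(agent_stats, real_world_stats, tolerance=0.10):
    indicators = ["average_reward", "average_speed",
                  "average_acceleration", "legal_action_ratio"]
    for key in indicators:
        reference = real_world_stats[key]
        if reference == 0:
            continue                            # skip undefined relative deviation
        deviation = abs(agent_stats[key] - reference) / abs(reference)
        if deviation > tolerance:
            return False                        # keep training / tuning rewards
    return True                                 # close or approximate match

trained = is_trained(
    {"average_reward": 0.92, "average_speed": 13.4,
     "average_acceleration": 0.80, "legal_action_ratio": 0.97},
    {"average_reward": 1.00, "average_speed": 13.9,
     "average_acceleration": 0.75, "legal_action_ratio": 0.99},
)
```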
[0066] In general, the traffic agent learning system 100 may employ machine learning models or techniques that are based on supervised, unsupervised or reinforcement learning, or combinations thereof. In particular, the traffic agent 20 may be trained using any reasonable reinforcement learning technique, for example Q-learning or policy gradients. Example neural network architectures include a combination of Convolutional Neural Network (CNN) layers, Fully Connected Network (FCN) layers and Long Short Term Memory (LSTM) layers, trained based on the Double Duel Deep Q Network (DDQN) algorithm or other types of learning algorithms. Reinforcement learning uses the formal framework of the Markov Decision Process (MDP) to define the interaction between a learning agent and its environment in terms of states, actions and rewards. FIG. 6 illustrates an abstract and representative setup of a neural network model according to an embodiment of the invention. The neural network model 151 includes a convolutional neural network (CNN) that generates an encoded representation of the perception input data 110, an intermediate neural network that processes the encoded representation of the perception input data 110 to generate an intermediate representation, and an output neural network that processes the intermediate representation to generate the policy output. The intermediate neural network may include a Fully Convolutional Network (FCN) in combination with a long short term memory (LSTM) network or a stack of LSTM networks. A stack of LSTM layers is an ordered set of multiple LSTM layers, where the first LSTM layer processes the encoded representation, builds its memory based on that representation, and converts the representation into another hyper dimension for further processing by the next LSTM layer. Each subsequent LSTM layer therefore processes the output of the previous LSTM layer. This multilayer approach allows the network to learn the desired policy step by step, where the initial layers learn low-level features and the successive layers build on top of those low-level features to learn higher-level features and eventually predict the output. The output neural network, which is in fact a set of different types of neural layers, may include an objective function or a set of functions from the Double Duel Deep Q Networks learning algorithm or other forms of learning algorithms.
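A minimal sketch of the kind of architecture described in paragraph [0066] is shown below, assuming the perception input data 110 is rasterised into a sequence of single-channel 84x84 grids and implemented in PyTorch. The layer sizes, the two-layer LSTM stack and the dueling value/advantage head are assumptions; the dueling head illustrates only the architectural ("Duel") aspect of the DDQN algorithm, while the "Double" aspect would be realised in the training target using a separate target network.

```python
# Minimal architecture sketch, assuming the perception input is rasterised into a
# sequence of single-channel 84x84 grids; all sizes below are assumptions.
import torch
import torch.nn as nn

class TrafficAgentNetwork(nn.Module):
    def __init__(self, num_actions: int = 5, lstm_layers: int = 2, hidden: int = 256):
        super().__init__()
        # CNN encoder: produces an encoded representation of the perception input.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, hidden), nn.ReLU(),
        )
        # Intermediate network: a stack of LSTM layers processing the encoding.
        self.lstm = nn.LSTM(hidden, hidden, num_layers=lstm_layers, batch_first=True)
        # Output network: dueling value/advantage streams combined into Q-values.
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, num_actions)

    def forward(self, frames):                       # frames: (batch, seq, 1, 84, 84)
        b, t = frames.shape[:2]
        enc = self.encoder(frames.reshape(b * t, *frames.shape[2:]))
        enc = enc.reshape(b, t, -1)
        out, _ = self.lstm(enc)                      # memory over time steps
        last = out[:, -1]                            # representation at the last step
        adv = self.advantage(last)
        q = self.value(last) + adv - adv.mean(dim=1, keepdim=True)
        return q                                     # one Q-value per action
```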
[0067] FIG. 7 illustrates a graphical diagram of some of the components of the traffic agent learning system 100. In particular, it illustrates the relationship between the traffic agent 20, the simulation environment 400, the reward generator 153, the reward parameters 140 and the neural network model 151. As described above, the traffic agent learning system 100 selects actions (one of keep, accelerate, decelerate, drive left or drive right) to be performed by the traffic agent 20, which interacts with the simulation environment 400 at each time step t. In order for the traffic agent 20 to interact with the simulation environment 400, the traffic agent 20 receives perception input data 110 characterizing the current state of the environment and selects an action to be performed in response to the perception input data 110. At each time step t, a reward is received based on the current state of the environment and the action taken by the traffic agent at time step t. Based on the reward generated by the reward generator 153, the neural network model updater 154 (not shown) modifies the weights of the individual neurons of the neural network model 151, thereby generating updated parameters of the neural network model 151 for the current state of the environment. The goal is to carefully design the reward parameters 140 and to allow the traffic agent to navigate the training map for a very large number of time steps. This ultimately trains the neural network model to represent an optimally trained traffic agent 20. The reward parameters 140 are predetermined and can be selected based on the type of driving competency or driving behavior pattern that the traffic agent 20 is intended to emulate. For example, if an aggressive driver behavior pattern is to be emulated, indicators of aggression, both qualitative and quantitative, may be used. Qualitatively, the perceived level of attention, for example potential impairment, distraction or drowsiness of the driver, may be used as reward parameters 140. Quantitatively, navigational actions may be selected or developed based on indicators of aggression; for example, relative velocity, relative acceleration, increases in relative acceleration and following distance may be used as reward parameters 140.
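As a purely hypothetical sketch of how reward parameters 140 for an aggressive driver behavior pattern could be combined into a per-step reward, the function below weights relative velocity, relative acceleration, following distance and lane changes. The weights, thresholds and parameter names are illustrative assumptions, not values taken from the disclosure.

```python
# Illustrative reward sketch; the weights and thresholds below are hypothetical
# reward parameters 140 chosen to reward an "aggressive" driver behavior pattern.
def aggressive_driving_reward(relative_velocity: float,
                              relative_acceleration: float,
                              following_distance: float,
                              lane_change: bool,
                              reward_params: dict) -> float:
    reward = 0.0
    # Reward closing speed on the vehicle in front (an indicator of aggression).
    reward += reward_params["w_rel_velocity"] * relative_velocity
    # Reward increases in relative acceleration.
    reward += reward_params["w_rel_acceleration"] * max(relative_acceleration, 0.0)
    # Penalise keeping a large following distance.
    if following_distance > reward_params["comfort_gap_m"]:
        reward -= reward_params["w_gap_penalty"]
    # Small bonus for a lane change, so frequent lane changes accumulate reward.
    if lane_change:
        reward += reward_params["w_lane_change"]
    return reward

# Hypothetical reward parameters for an aggressive behaviour pattern.
params = {"w_rel_velocity": 0.05, "w_rel_acceleration": 0.1,
          "comfort_gap_m": 20.0, "w_gap_penalty": 0.5, "w_lane_change": 0.2}
print(aggressive_driving_reward(3.0, 0.4, 12.0, True, params))
```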
[0068] While the invention has been particularly shown and described with reference to specific embodiments, it should be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. The scope of the invention is thus indicated by the appended claims and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced.

Claims

1. A computing system for training a traffic agent navigating through a simulation environment comprising:
one or more processors;
a memory device coupled to the one or more processors;
a traffic agent learning system stored in the memory device and configured to be executed by the one or more processors, the traffic agent learning system comprising instructions for:
generating a neural network model based on a plurality of predetermined network parameters for emulating a target real world driving competency of a driver;
receiving perception input data descriptive of objects perceived by the traffic agent characterizing a current state of an environment;
selecting an action to be performed by the traffic agent based on the current state of the environment, the action to be performed being operably communicative with a driving controller of the traffic agent;
receiving a reward or penalty based on one or more reward parameters associated with the target real world driving competency of a driver;
updating the plurality of predetermined network parameters based on the reward or penalty received so as to minimize loss of the neural network model;
evaluating the performance of the traffic agent by comparing a first mathematical indicator obtained from the performance of the traffic agent with a second mathematical indicator based on data obtained from a real-world source characterizing the target real world driving competency of the driver, wherein a close or approximate match of the first mathematical indicator and the second mathematical indicator results in a trained traffic agent.
2. The computing system according to claim 1 wherein the perception input data comprises sensor data from sensors located on the traffic agent and training map data.
3. The computing system according to claim 2, wherein the training map data comprises a two-dimensional map representation constructed in an endless loop and information that describes the location of objects within the surrounding environment of the traffic agent.
4. The computing system according to claim 1, wherein the plurality of predetermined network parameters for emulating the target real world driving competency of a driver comprises:
a first layer convolutional neural network (CNN) for generating an encoded representation of the perception input data,
an intermediate neural network including a Fully Convolutional Network (FCN) layer in combination with a stack of LSTM (Long Short Term Memory) layers for processing the encoded representation of the perception input data to generate an intermediate representation and;
an output neural network including a plurality of layers comprising a set of functions based on Double Duel Deep Q Networks algorithm for processing the intermediate representation to generate the action to be performed by the traffic agent.
5. The computing system according to claim 4, wherein the stack of LSTM layers further comprises a plurality of LSTM layers, wherein a first LSTM layer processes and outputs the encoded representation of the perception input data for further processing as input by a second LSTM layer and a third LSTM layer processes the output of the second LSTM layer.
6. The computing system according to any one of claims 1 or 2 wherein the first mathematical indicator includes at least one of: average reward, average speed, average acceleration or number of legal actions taken for a certain number of actions.
7. The computing system according to claim 3 wherein the step of evaluating the performance of the traffic agent further comprises navigating the traffic agent through a testing map data that is different from the training map data.
8. The computing system according to claim 1, wherein the reward or penalty awarded based on one or more reward parameters associated with the target real world driving competency of a driver include quantitative and qualitative indicators of driving competency.
9. The computing system according to claim 8 wherein the one or more reward parameters include any one or more of the following: relative velocity, relative acceleration, frequent lane changes, increases in relative acceleration or following distance from a vehicle in front of the traffic agent.
10. The computing system according to claims 1 and 4, wherein the step of updating the plurality of predetermined network parameters further comprises calibrating each of the weights of neural network layers using back propagation algorithm in response to the reward or penalty received so as to achieve minimum loss on the perception input data.
11. The computing system according to claim 1, wherein the step of updating the plurality of predetermined network parameters further comprises calibrating each of the weights of neural network layers using back propagation algorithm to iteratively improve identifying a target and to iteratively assist the neural network model to converge on the identified target.
12. A computer-implemented method for training a traffic agent navigating through a simulation environment comprising:
generating a neural network model based on a plurality of predetermined network parameters for emulating a target real world driving competency of a driver;
receiving perception input data descriptive of objects perceived by the traffic agent characterizing a current state of an environment;
selecting an action to be performed by the traffic agent based on the current state of the environment, the action to be performed being operably communicative with a driving controller of the traffic agent;
receiving a reward or penalty based on one or more reward parameters associated with the target real world driving competency of a driver;
updating the plurality of predetermined network parameters based on the reward or penalty received so as to minimize loss of the neural network model;
evaluating the performance of the traffic agent by comparing a first mathematical indicator obtained from the performance of the traffic agent with a second mathematical indicator based on data obtained from a real-world source characterizing the target real world driving competency of the driver, wherein a close or approximate match of the first mathematical indicator and the second mathematical indicator results in a trained traffic agent.
13. The computer-implemented method according to claim 12 wherein the perception input data comprises sensor data from sensors located on the traffic agent and training map data.
14. The computer-implemented method according to claim 13, wherein the training map data comprises a two-dimensional map representation constructed in an endless loop and information that describes the location of objects within the surrounding environment of the traffic agent.
15. The computer-implemented method according to claim 12, wherein the plurality of predetermined network parameters for emulating the target real world driving competency of a driver comprises:
a first layer convolutional neural network (CNN) for generating an encoded representation of the perception input data,
an intermediate neural network including a Fully Convolutional Network (FCN) layer in combination with a stack of LSTM (Long Short Term Memory) layers for processing the encoded representation of the perception input data to generate an intermediate representation and;
an output neural network including a plurality of layers comprising a set of functions based on Double Duel Deep Q Networks algorithm for processing the intermediate representation to generate the action to be performed by the traffic agent.
16. The computer-implemented method according to claim 15, wherein the stack of LSTM layers further comprises a plurality of LSTM layers, wherein a first LSTM layer processes and outputs the encoded representation of the perception input data for further processing as input by a second LSTM layer and a third LSTM layer processes the output of the second LSTM layer.
17. The computer-implemented method according to any one of claims 12 or 13 wherein the first mathematical indicator includes at least one of: average reward, average speed, average acceleration or number of legal actions taken for a certain number of actions.
18. The computer-implemented method according to claim 14 wherein the step of evaluating the performance of the traffic agent further comprises navigating the traffic agent through a testing map data that is different from the training map data.
19. The computer-implemented method according to claim 12, wherein the reward or penalty awarded based on one or more reward parameters associated with the target real world driving competency of a driver include quantitative and qualitative indicators of driving competency.
20. The computer-implemented method according to claim 19 wherein the one or more reward parameters include any one or more of the following: relative velocity, relative acceleration, frequent lane changes, increases in relative acceleration or following distance from a vehicle in front of the traffic agent.
21. The computer-implemented method according to claims 12 and 15, wherein the step of updating the plurality of predetermined network parameters based on the reward or penalty received further comprises calibrating each of the weights of neural network layers using back propagation algorithm so as to achieve minimum loss on the perception input data.
22. The computer-implemented method according to claims 12 and 15, wherein the step of updating the plurality of predetermined network parameters further comprises calibrating each of the weights of neural network layers using back propagation algorithm to iteratively improve identifying a target and to iteratively assist the neural network model to converge on the identified target.
PCT/SG2018/050477 2018-09-18 2018-09-18 System and method for training virtual traffic agents WO2020060478A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/SG2018/050477 WO2020060478A1 (en) 2018-09-18 2018-09-18 System and method for training virtual traffic agents
PCT/SG2018/050640 WO2020060480A1 (en) 2018-09-18 2018-12-31 System and method for generating a scenario template

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG2018/050477 WO2020060478A1 (en) 2018-09-18 2018-09-18 System and method for training virtual traffic agents

Publications (1)

Publication Number Publication Date
WO2020060478A1 true WO2020060478A1 (en) 2020-03-26

Family

ID=69887715

Family Applications (2)

Application Number Title Priority Date Filing Date
PCT/SG2018/050477 WO2020060478A1 (en) 2018-09-18 2018-09-18 System and method for training virtual traffic agents
PCT/SG2018/050640 WO2020060480A1 (en) 2018-09-18 2018-12-31 System and method for generating a scenario template

Family Applications After (1)

Application Number Title Priority Date Filing Date
PCT/SG2018/050640 WO2020060480A1 (en) 2018-09-18 2018-12-31 System and method for generating a scenario template

Country Status (1)

Country Link
WO (2) WO2020060478A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111409788B (en) * 2020-04-17 2021-07-16 大连海事大学 Unmanned ship autonomous navigation capability testing method and system
CN112513951A (en) * 2020-10-28 2021-03-16 华为技术有限公司 Scene file acquisition method and device
DE102022107845A1 (en) 2022-04-01 2023-10-05 Dr. Ing. H.C. F. Porsche Aktiengesellschaft Method, system and computer program product for selecting concrete scenarios

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014174447A (en) * 2013-03-12 2014-09-22 Japan Automobile Research Institute Vehicle dangerous scene reproducer, and method of use thereof
US20160210382A1 (en) * 2015-01-21 2016-07-21 Ford Global Technologies, Llc Autonomous driving refined in virtual environments
WO2017210222A1 (en) * 2016-05-30 2017-12-07 Faraday&Future Inc. Generating and fusing traffic scenarios for automated driving systems
CN107153363B (en) * 2017-05-08 2020-11-03 百度在线网络技术(北京)有限公司 Simulation test method, device, equipment and readable medium for unmanned vehicle

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2016156236A1 (en) * 2015-03-31 2016-10-06 Sony Corporation Method and electronic device
US20170132334A1 (en) * 2015-11-05 2017-05-11 Zoox, Inc. Simulation system and methods for autonomous vehicles
WO2018002910A1 (en) * 2016-06-28 2018-01-04 Cognata Ltd. Realistic 3d virtual world creation and simulation for training automated driving systems
WO2018147874A1 (en) * 2017-02-10 2018-08-16 Nissan North America, Inc. Autonomous vehicle operational management including operating a partially observable markov decision process model instance
US10019011B1 (en) * 2017-10-09 2018-07-10 Uber Technologies, Inc. Autonomous vehicles featuring machine-learned yield model

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111581887A (en) * 2020-05-16 2020-08-25 郑州轻工业大学 Unmanned vehicle intelligent training method based on simulation learning in virtual environment
CN111581887B (en) * 2020-05-16 2023-04-07 郑州轻工业大学 Unmanned vehicle intelligent training method based on simulation learning in virtual environment
CN113254336A (en) * 2021-05-24 2021-08-13 公安部道路交通安全研究中心 Method and system for simulation test of traffic regulation compliance of automatic driving automobile
CN113254336B (en) * 2021-05-24 2022-11-08 公安部道路交通安全研究中心 Method and system for simulation test of traffic regulation compliance of automatic driving automobile
CN116070783A (en) * 2023-03-07 2023-05-05 北京航空航天大学 Learning type energy management method of hybrid transmission system under commute section
CN116070783B (en) * 2023-03-07 2023-05-30 北京航空航天大学 Learning type energy management method of hybrid transmission system under commute section
CN117698685A (en) * 2024-02-06 2024-03-15 北京航空航天大学 Dynamic scene-oriented hybrid electric vehicle self-adaptive energy management method
CN117698685B (en) * 2024-02-06 2024-04-09 北京航空航天大学 Dynamic scene-oriented hybrid electric vehicle self-adaptive energy management method

Also Published As

Publication number Publication date
WO2020060480A1 (en) 2020-03-26

Similar Documents

Publication Publication Date Title
WO2020060478A1 (en) System and method for training virtual traffic agents
US11797407B2 (en) Systems and methods for generating synthetic sensor data via machine learning
US11755396B2 (en) Generating autonomous vehicle simulation data from logged data
US11513523B1 (en) Automated vehicle artificial intelligence training based on simulations
CN112997128B (en) Method, device and system for generating automatic driving scene
JP7345639B2 (en) Multi-agent simulation
CN114638148A (en) Safe and extensible model for culture-sensitive driving of automated vehicles
US20210056863A1 (en) Hybrid models for dynamic agents in a simulation environment
Lu et al. Learning configurations of operating environment of autonomous vehicles to maximize their collisions
KR20190134199A (en) Traffic System Modeling Method Built-in Machine Learning using Traffic Big data
Haq et al. Many-objective reinforcement learning for online testing of dnn-enabled systems
WO2021058626A1 (en) Controlling agents using causally correct environment models
Darapaneni et al. Autonomous car driving using deep learning
Arbabi et al. Planning for autonomous driving via interaction-aware probabilistic action policies
Wen et al. Modeling human driver behaviors when following autonomous vehicles: An inverse reinforcement learning approach
Lu et al. DeepQTest: Testing Autonomous Driving Systems with Reinforcement Learning and Real-world Weather Data
Nabhan Models and algorithms for the exploration of the space of scenarios: toward the validation of the autonomous vehicle
US20230082365A1 (en) Generating simulated agent trajectories using parallel beam search
WO2023123130A1 (en) Method and apparatus for autonomous driving system, electronic device and medium
Patil et al. Simulating the concept of self-driving cars using deep-Q learning
Paduraru et al. Transfer learning of cars behaviors from reality to simulation applications
Vajdea et al. Educational Driving Through Intelligent Traffic Simulation
Liu et al. Research on Navigation Algorithm of Unmanned Ground Vehicle Based on Imitation Learning and Curiosity Driven
Borghi Deep Learning in Games to Improve Autonomous Driving
Sandsjö et al. Edge Case searching for Autonomous Vehicles

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18934322

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18934322

Country of ref document: EP

Kind code of ref document: A1