GB2625511A - A computer-implemented method for navigating an agent through an environment for the purpose of data collection - Google Patents
A computer-implemented method for navigating an agent through an environment for the purpose of data collection
- Publication number
- GB2625511A (application GB2218281.0A / GB202218281A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- data
- agent
- machine learning
- learning model
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0011—Planning or execution of driving tasks involving control alternatives for a single driving scenario, e.g. planning several paths to avoid obstacles
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0025—Planning or execution of driving tasks specially adapted for specific operations
- B60W60/00259—Surveillance operations
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W30/00—Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
- B60W30/10—Path keeping
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W2556/00—Input parameters relating to data
- B60W2556/20—Data confidence level
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W2556/00—Input parameters relating to data
- B60W2556/40—High definition maps
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W2556/00—Input parameters relating to data
- B60W2556/45—External transmission of data to or from the vehicle
- B60W2556/50—External transmission of data to or from the vehicle of positioning data, e.g. GPS [Global Positioning System] data
Abstract
In order to improve navigation of an agent through an environment for the purpose of data collection, a computer-implemented method is proposed. After map data is provided (S1) as a graph structure, the agent fills (S2) the graph structure with sensor data. The sensor data is evaluated (S3) in order to obtain a label confidence score for each label at each node of the graph structure. A data processing system trains (S4) a first machine learning model using reinforcement learning based on the label confidence scores, such that the first machine learning model is able to determine (S5) an action to be taken by the agent at each node based on input data from the agent. The first machine learning model is deployed for access by the agent. The agent collates input data at its current node and feeds these input data to the previously trained first machine learning model. The first machine learning model determines a next action for the agent, which the agent performs (S6).
Description
TITLE
A computer-implemented method for navigating an agent through an environment for the purpose of data collection
TECHNICAL FIELD
The invention relates to a method for determining a route of an agent through an environment for the purpose of data collection.
BACKGROUND
WO 2021/132791 A1 discloses a method by which a server intelligently provides content in an autonomous vehicle in an autonomous driving system, and the autonomous driving system.
DE 10 2020 129 802 A1 discloses a computer, including a processor and a memory, the memory including instructions to be executed by the processor to identify patterns in high anticipation scenarios, such as video sequences, based on user identification. The vehicle will be within a specified distance of an object, and user identification is determined by viewing portions of a respective video sequence.
SUMMARY OF THE INVENTION
It is the object of the invention to improve navigation of an agent through an environment for the purpose of data collection.
The invention provides a computer-implemented method for determining a route of an agent through an environment for data acquisition, the method comprising: a) providing map data that describes the environment and is configured as a graph structure comprising nodes and edges, wherein each node represents a location and each edge represents an action that can be taken at the node; b) with the agent: recording sensor data at a node and transmitting the sensor data from the agent to a data processing system where the sensor data is stored in association with that node in the graph structure; c) with the data processing system: evaluating the sensor data in order to obtain a label confidence score for each label at each node; d) with the data processing system: training a first machine learning model using reinforcement learning based on the label confidence scores such that the first machine learning model is trained to determine an action to be taken at each node based on input data from the agent, and deploying the trained first machine learning model for access by the agent; e) with the agent: at a current node collating input data and feeding the input data to the trained first machine learning model in order to determine a next action; and f) the agent performing the next action determined in step e).
Starting with the map data, nodes and edges are defined. The agent collects sensor data that are stored in the graph structure so that each node is associated with the sensor data. The idea is that the data processing system determines the next step of the agent through the map data. Since the action of the agent is influenced by the label confidence scores, the agent is guided to nodes which are associated with less reliable data. This allows the agent to follow a pathway that is formed by a sequence of nodes, where the nodes are more likely those that require an increase in label confidence scores. Consequently, the agent is now able to capture data of greater significance. In other words, the agent preferably captures less data, but data that improves the performance of the first machine learning model trained with said data by a larger amount compared to the conventional approach.
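The graph structure described above, with per-node sensor storage, can be sketched in a few lines; this is an illustrative Python sketch, in which the node ids, field names, and the `record` helper are hypothetical and not taken from the patent:

```python
# Minimal sketch of the map-data graph structure: each node is a location,
# each outgoing edge an action available there (hypothetical names).
map_graph = {
    "n0": {"edges": {"left": "n1", "straight": "n2"},   # action -> next node
           "sensor_data": [], "label_confidence": {}},
    "n1": {"edges": {"straight": "n2", "u_turn": "n0"},
           "sensor_data": [], "label_confidence": {}},
    "n2": {"edges": {"right": "n0"},
           "sensor_data": [], "label_confidence": {}},
}

def record(graph, node, sample):
    """Store a sensor sample in association with the node (step b)."""
    graph[node]["sensor_data"].append(sample)

record(map_graph, "n0", {"image": "frame_0001.png"})
```

Storing the samples directly on the nodes is what later allows the label confidence scores to be attached to the same graph.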
Preferably, the agent is a vehicle or a mobile robot navigating through the environment. While the agent may in general be of any configuration, it is preferred that the agent is a vehicle or a mobile robot that navigates through the environment described by the map data, e.g. by driving or flying through a city.
Preferably, step a) comprises providing map data by subdividing at least one of the edges connecting two nodes into a plurality of sub edges and inserting sub nodes between the sub edges according to a predetermined interval. A larger amount of sensor data can be captured. However, in contrast to conventional approaches, the increased amount is also of greater significance, due to the data being collected in a more targeted manner.
Preferably, in step b) the agent comprises an imaging sensor and records image data as the sensor data and/or wherein in step b) the agent comprises a non-imaging sensor and records non-imaging data as the sensor data. The sensor data may be image data from cameras and the like. The sensor data may be non-imaging data like distance measurements of LIDAR, RADAR or ultrasonics. It is particularly advantageous to capture both kinds of data since they are usually complementary and allow correction of errors in one kind of data by also checking the other kind of data.
Preferably, step c) comprises evaluating the sensor data by performing semantic segmentation, object detection, and/or object classification of the sensor data in order to obtain at least one label and the associated label confidence score. It should be noted that the evaluation may be extended to tasks other than the aforementioned ones, which are typically applied to RGB images or to data from other sensors like 3D LIDAR.
Preferably, step c) comprises evaluating the sensor data by performing pseudo-labelling of the sensor data with a second machine learning model trained for pseudo-labelling and outputting a deviation from the pseudo-labelling confidence score as the label confidence score. It is also possible to generate pseudo-labels for the sensor data by preferably using already labelled data from a previous run of the method and unlabelled data of the current run. The second machine learning model can be trained in a known manner with the already labelled data. With this approach it is possible to steadily increase the accuracy of the pseudo-labelling, thereby allowing also improved label confidence scores for the first machine learning model.
Preferably, in step c) the label confidence score is determined as an average label confidence score of all confidence scores associated with a particular label.
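The averaging in step c) can be illustrated with a minimal sketch; the function and variable names here are hypothetical:

```python
def average_label_confidence(scores):
    """Average the confidence scores recorded for one label at a node
    (step c); `scores` is a list of per-detection confidences."""
    return sum(scores) / len(scores)

# Hypothetical per-label confidence scores collected at a single node.
node_scores = {"person": [0.9, 0.7], "vehicle": [0.6, 0.8, 0.7]}
node_confidence = {label: average_label_confidence(s)
                   for label, s in node_scores.items()}
```

The resulting per-label averages are what gets stored in the graph database for the node.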
Preferably, in step d) the reinforcement learning is performed by deep Q-learning or Proximal Policy Optimization (PPO).
Preferably, in step d) the first machine learning model is transferred to the agent. The first machine learning model may be trained in a multi access edge computing (MAEC) system and subsequently transferred to the agent. With this the computational burden of training can be performed by a different system and the computational burden for the agent can be reduced.
Preferably, in step e) the collated input data include any of a current node, visited nodes, and features. The input data serve as a basis for the decision, which action should be performed next.
Preferably, in step e), the next action determined by the first machine learning model is replaced with a random action with a predetermined probability. With this approach, the agent is able to explore a greater portion of the map data, i.e. the graph structure. The risk of getting stuck or of avoiding certain parts of the graph for too long can be reduced.
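This random-action replacement is the familiar epsilon-greedy scheme; a minimal sketch, assuming a fixed probability `epsilon` and a list of available actions (all names hypothetical):

```python
import random

def choose_action(model_action, actions, epsilon, rng=random):
    """With probability `epsilon`, replace the model's chosen action with
    a random one so the agent keeps exploring the graph (step e)."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return model_action

# Usage sketch: with epsilon = 0.1, roughly 10% of steps are random.
rng = random.Random(0)
picks = [choose_action("straight", ["left", "straight", "right"], 0.1, rng)
         for _ in range(1000)]
```

In practice `epsilon` would be chosen, and possibly decayed over time, according to the exploration-exploitation trade-off mentioned later in the description.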
The invention provides a system comprising an agent and a data processing system, wherein the agent and the data processing system comprise means to perform respective steps of a preferred method.
Preferably, the data processing system is configured as a distributed system.
The invention provides a computer program comprising instructions that, upon execution by the system, performs a preferred method.
The invention provides a machine readable data storage, data carrier or data carrier signal comprising the computer program.
Data acquisition, especially in the amounts useful for machine learning, poses some challenges. For example, a dataset may contain a comparably high amount of bad data or lack good data, while storage is typically limited. Data annotation, especially when done manually, may overwhelm the people doing it due to short timelines, which may decrease quality. Thus, even if more data is captured, it is difficult to understand and convey which data points/situations are needed to improve the models. It is also a challenge to design a data capturing pathway that enables taking more samples in those relevant regions. The improved pathway allows providing data of significance, thereby reducing the resources and time spent on data capturing.
In some embodiments the agent is configured as a vehicle or a mobile robot. In some embodiments the vehicle is a semi-autonomous or autonomous vehicle. In some embodiments the vehicle is equipped with a sensor suite. The sensor suite may be chosen from a group consisting of imaging sensors, such as cameras, or non-imaging sensors that include but are not limited to LIDAR, RADAR, an inertial measurement unit (IMU), a satellite navigation device, or vehicle-to-anything (V2X) technology.
In some embodiments the vehicle comprises an on-board device (OBD). The OBD may include a transceiver, a data processing unit, a display (e.g. for displaying real time info such as road charges, accidents, and the like), a payment processor, and/or a touch screen.
In some embodiments the OBD may connect to a multi access edge computing (MAEC) system.
In some embodiments, smart road users are involved. The smart road users, such as pedestrians, cyclists, motorcyclists, drivers, pets, etc. may be equipped with wearables or a mobile app on a personal device. In some embodiments a satellite navigation system is used for locating the smart road user. In some embodiments information about the smart road user is provided, the information including but not limited to user type (such as pedestrian, cyclist, driver,...), object dimension or general size, movement direction, movement speed or similar motion related quantities. In some embodiments the smart road user may be notified using, for example, sound, visual notification, or haptic notification, e.g., by vibrating. In some embodiments the smart road user may be connected to the MAEC system.
In some embodiments, smart infrastructure is involved. The smart infrastructure may include lamp posts, overhead bridges or any other infrastructure component. In some embodiments the infrastructure component may be equipped with imaging sensors, such as cameras, and/or non-imaging sensors that include but are not limited to LIDAR, RADAR, an inertial measurement unit (IMU), a satellite navigation device, or vehicle-to-anything (V2X) technology. The smart infrastructure component may include a transceiver for connecting with the MAEC system.
In some embodiments, a MAEC system is involved. The MAEC system may also be designated as low latency cloud. The MAEC system is configured for data processing, object tracking, object detection and classification, localization, collision warning, and/or traffic rule violation detection.
In some embodiments, the tasks mentioned below are specific to semantic segmentation and/or object detection on image data. It is possible to extend the tasks to any additional task related to the data that is captured by any of the sensors or sources mentioned above, such as object detection in point cloud data from LIDAR.
In some embodiments the performance or confidence of the model is determined by obtaining the confidence score of the predictions of the obtained data against the deployed model. In some embodiments the performance or confidence of the model is determined by computing the similarity of predictions with respect to the labels predicted using pseudo-labelling by a much heavier (and thus slower) model on the same data. As an example of a heavier model for object detection tasks, any state of the art object detection model can be used. One specific example is the Swin Transformer network described in Z. Liu et al., "Swin Transformer V2: Scaling Up Capacity and Resolution".
The model can be deployed in the vehicle itself or in the MAEC system. It is also possible to distribute the model between a plurality of data processing devices.
For mapping the nodes and edges information, any map may be used, such as OpenStreetMap.
In one approach it can be assumed that the graph information of the map is generally available (nodes' information and edges' information along with length of the edges).
In some embodiments, each edge is split into multiple sub-edges based on the precision with which the data (e.g. images) are to be captured. For example, when taking a sample every 20 meters travelled (this interval can be adjusted based on the precision needed), the edge connecting two nodes can be split so as to include more separate nodes. As a result, exactly one sample can be captured for and in each node. Once the entire graph is populated with this data, the graph may include nodes as locations and edges as the actions that can be taken from each node.
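The subdivision of an edge into sub-edges at a fixed sampling interval can be sketched as follows; the sub-node naming scheme is a hypothetical choice, not prescribed by the patent:

```python
import math

def subdivide_edge(u, v, length_m, interval_m=20.0):
    """Split the edge (u, v) of the given length into sub-edges so that
    one sample is taken roughly every `interval_m` metres; the inserted
    sub-node names (e.g. "A-B#1") are hypothetical."""
    n_sub = max(1, math.ceil(length_m / interval_m))
    nodes = [u] + [f"{u}-{v}#{i}" for i in range(1, n_sub)] + [v]
    return list(zip(nodes, nodes[1:]))  # the resulting sub-edges

# A 70 m edge sampled every 20 m yields 4 sub-edges (3 inserted sub-nodes).
sub_edges = subdivide_edge("A", "B", length_m=70.0)
```

Edges shorter than the interval are left intact, so every node still gets exactly one sample.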
In some embodiments, all the data recorded at these nodes are sent to the MAEC system or a central database for storing. Notwithstanding where the data are stored, the data may be stored in a graph database.
In some embodiments, the data received at the MAEC system, can also be associated with the weather condition (sunny, cloudy, rainy), time of day (morning, afternoon, evening, night), type of day (weekday, weekend) or any additional useful features.
In some embodiments, all the stored data can be evaluated against the existing best model. In some embodiments, data evaluation and corresponding model update can be carried out in a periodic fashion (say, once every month) or upon request.
In some embodiments, e.g., regarding the semantic segmentation task, for each image, during evaluation, pixelwise annotated data are obtained. For example, all pixels with people inside are classified as person, cars are classified as vehicles and so on. All the individual pixel classifications can be associated with a confidence score. Preferably, the confidence score of each label is taken and the average confidence score for that particular label is determined and stored in the graph database for each node. In some embodiments, instead of the confidence score, the performance score against the pseudo-labelling approach by a heavier but slower model can be used. Over time, the graph database will include the confidence score for all labels for this task at each node. In some embodiments, the average score of all label types is stored in the database. In some embodiments this evaluation is performed for all different tasks in consideration.
In some embodiments, the confidence score evaluated for the nodes of the graph is used to plan a path. The new path should prioritize visiting nodes which did not perform well (low confidence) and vice versa. This is equivalent to optimizing a reward inversely proportional to the confidence scores.
In some embodiments, a neural network is used to represent the reinforcement learning model. In some embodiments the optimization problem is reformulated into a Markov Decision Process (MDP) which is represented by <S, A, P, R>, where S is the set of states, A is the set of Actions, P is the Transition Probability and R is the set of Rewards.
The neural network is preferably configured as a fully-connected multi-layer perceptron (MLP) with preferably two hidden layers. Each hidden layer may include 256 units/neurons. Preferably the nonlinear tanh function is used as the activation function. The output layer is preferably a SoftMax layer over the action space (A). The model will take the individual states as the inputs (S) and the output will be the actions (A).
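A toy forward pass of such an MLP policy, with tanh hidden layers and a softmax output over the action space, might look like the following; the layer sizes are small stand-ins for the 256-unit layers, and the parameter initialization is purely illustrative:

```python
import numpy as np

def mlp_policy(state, params):
    """Forward pass of the policy network sketched above: tanh hidden
    layers followed by a softmax over the action space."""
    h = state
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)          # hidden layers with tanh activation
    W, b = params[-1]
    logits = h @ W + b
    e = np.exp(logits - logits.max())   # numerically stable softmax
    return e / e.sum()                  # action probabilities

# Toy network: 8-dim state, two 16-unit hidden layers, 4 actions.
rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]
params = [(rng.normal(size=(a, b)) * 0.1, np.zeros(b))
          for a, b in zip(sizes, sizes[1:])]
probs = mlp_policy(rng.normal(size=8), params)
```

The real state vector would be the long one-hot encoding described below rather than an 8-dimensional toy input.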
In some embodiments the Bellman equation is used to determine the reward for a particular state:

New Q(S, A) = Q(S, A) + α [R(S, A) + γ max Q'(S', A') − Q(S, A)]

where Q is the quality, α is the learning rate, R is the reward and γ is the discount factor. The unprimed state S and action A refer to the current situation and the primed state S' and actions A' refer to the next step.
Typically the state S represents the current position of an agent in the environment. Here, each state comprises node information, information about previously visited nodes, weather information, time of day, other features etc. If, for example, there are N = 10,000 nodes, a one-hot representation is preferably used for the current node and the previously visited node information, since nodes are a categorical value; with, e.g., w = 4 values to represent weather and t = 4 values to represent time of day, there will be 2*N + w + t = 20008 binary values to represent a single state. This can be increased by adding other features. In other words, there is preferably a one-hot representation for each node, the previously visited node information, and possibly additional features. In addition there are preferably one-hot represented global features, i.e. features that are independent of the nodes, such as the weather condition, the time of day, or possibly other global features.
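The stated vector size can be checked with a small sketch of the one-hot state encoding; the ordering of the blocks within the vector is an assumption made here for illustration:

```python
def state_length(n_nodes, n_weather=4, n_time=4):
    """Length of the one-hot state vector: one-hot current node plus
    one-hot previously visited node plus global features."""
    return 2 * n_nodes + n_weather + n_time

def encode_state(node, prev_node, weather, time_of_day, n_nodes):
    """Build the binary state vector (block order is a hypothetical
    layout: [current node | previous node | weather | time of day])."""
    s = [0] * state_length(n_nodes)
    s[node] = 1
    s[n_nodes + prev_node] = 1
    s[2 * n_nodes + weather] = 1
    s[2 * n_nodes + 4 + time_of_day] = 1
    return s
```

With N = 10,000 this reproduces the 20,008 binary values mentioned above.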
The action A represents the step taken by the agent at each state. The action can be represented by which turn the agent has to take when being in a particular state. For example, if the maximum number of actions is 4, then at a single node, 1 may represent go left, 2 go straight, 3 go right, and 4 take a U-turn. The number of actions usually depends on the number of pathways connecting the node and may be larger or smaller.
Preferably, once a node has been visited, the reward for revisiting the same node is zero.
In some embodiments, the Q table may be populated by running the Bellman equation. The process may be stopped when the values converge. Due to the length of the state variable, preferably deep Q-learning is used to approximate the Q value of each state-action pair (S, A). In such a case, preferably Monte-Carlo sampling is used for trial-based learning, or it is possible to provide a teacher policy. Preferably, once the Q-table is populated or the deep Q-learning model is trained, it is known exactly which action to take at each node to maximize the reward. Preferably, the previously-visited flag will penalize remaining in the same state and suggest an action to be taken.
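On a toy graph, populating a Q-table by repeatedly applying the Bellman update looks like this; the transitions, rewards, and hyperparameters are illustrative choices, not values from the patent:

```python
# Toy tabular Q-learning on a 3-node line graph (0 - 1 - 2).
transitions = {(0, "right"): 1, (1, "right"): 2,
               (1, "left"): 0, (2, "left"): 1}   # (state, action) -> next state
reward = {0: 0.0, 1: 0.5, 2: 1.0}                # e.g. higher where confidence is low
alpha, gamma = 0.5, 0.9                          # learning rate, discount factor

Q = {sa: 0.0 for sa in transitions}
for _ in range(200):                             # sweep the Bellman update to convergence
    for (s, a), s2 in transitions.items():
        best_next = max((Q[sa] for sa in Q if sa[0] == s2), default=0.0)
        Q[(s, a)] += alpha * (reward[s2] + gamma * best_next - Q[(s, a)])

# Once converged, the best action at each node can simply be read off.
best_action_at_1 = max(("left", "right"), key=lambda a: Q[(1, a)])
```

Because node 2 carries the highest reward, the converged table steers the agent right from node 1, which is exactly the "visit low-confidence nodes" behaviour the method aims at.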
The stepwise Reward R is inversely proportional to the confidence score.
R(S) = 1 / C(S)

where C is the confidence function. The transition probability P represents the dynamics of the MDP, which is sampled using the prior distributions of the existing dataset in the server, e.g., a Gaussian distribution of the time to travel between nodes, or Gaussian distributions of weather changes given time and location.
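A direct reading of this reward is sketched below; the small epsilon guard against a zero confidence score is an added assumption, not part of the stated formula:

```python
def reward(confidence, eps=1e-6):
    """Stepwise reward R(S) = 1 / C(S); `eps` guards against division
    by a zero confidence score (an assumption added here)."""
    return 1.0 / max(confidence, eps)
```

Low-confidence nodes thus yield large rewards, so the learned policy is drawn towards exactly the locations where the current model performs worst.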
In some embodiments, a policy is trained which maximizes the expected sum of rewards until convergence using any reinforcement learning algorithm. One good choice is Proximal Policy Optimization (PPO), which gives very good and stable performance for discrete action-space problems.
In some embodiments, once the model is trained, the model is transferred to the end-user. The model may be updated regularly at a predetermined interval, e.g. every month.
In some embodiments, information such as current node, visited nodes, time of day, or weather conditions are collected at the user end and input into the model to get the best action to be taken at the present node and iteratively to maximize the reward within a given timeframe.
In some embodiments, at every node, a random action may be taken based on a predetermined probability factor, to encourage the model to explore the map as well to get more data from unexplored places. The probability factor may be decided based on an exploration-exploitation trade-off.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention are described in more detail with reference to the accompanying schematic drawings.
The Fig. depicts an embodiment of a method for determining a route of an agent through an environment for data acquisition.
DETAILED DESCRIPTION OF EMBODIMENT
Referring to the Fig., a method for determining a route of an agent through an environment for data acquisition comprises a map data provision step S1. In this step, map data that describes the environment is provided. The map data is configured as a graph structure and includes a plurality of nodes and edges. Each node represents a location and each edge represents an action that can be performed by the agent when it is at that particular node. Some edges may be subdivided into a plurality of sub edges of smaller length according to a predetermined interval. The subdividing includes inserting sub nodes that are connected to the sub edges.
In a data recording step S2, the agent that is preferably configured as a vehicle records sensor data with one of its imaging sensors, e.g., a camera, or non-imaging sensors, e.g., LIDAR, RADAR, IMU, etc. The sensor data, e.g. image data, are transmitted from the agent to a data processing system. In particular, the sensor data that were recorded by the agent at a specific node is stored as associated with that particular node in the graph structure.
In an evaluation step S3, the data processing system that may be configured as a MAEC system processes the received sensor data. The data processing system evaluates the sensor data using semantic segmentation, object detection and/or object classification in order to produce labels for objects that are included in the sensor data.
The data processing system determines a label confidence score for each label that was recognized in the sensor data at each node that is associated with the respective sensor data. The label confidence score may also be determined based on a second machine learning model that is more detailed than a first machine learning model described below. The second machine learning model may be any state of the art object detection model. The second machine learning model is preferably a Swin Transformer network. The second machine learning model may be of the same type or general configuration as the first machine learning model. However, the second machine learning model preferably includes more layers and/or is trained using a larger or different dataset. In this case the label confidence score is determined by comparing the performance of the label predictions of different machine learning models that solve the same task.
The data processing system stores the average of all confidence scores for a particular label in the respective node. In addition, it is possible to store the average over all label confidence scores at all nodes for each particular label.
When a sufficient amount of the graph structure is filled with the sensor data, a training step S4 is carried out by the data processing system. In this step, a first machine learning model is trained by reinforcement learning, such as Proximal Policy Optimization (PPO) or deep Q-learning, based on the label confidence scores. The first machine learning model is preferably a deep neural network, such as a deep Q-network. The first machine learning model can be configured as a multi-layer perceptron. The first machine learning model is preferably a less complex machine learning model compared to the second machine learning model. The first machine learning model is trained to determine a next action that is to be taken by the agent when at a particular node and based on input data recorded there. After training the first machine learning model by the data processing system, the first machine learning model is deployed so that it can be accessed by the agent. This may happen by transferring the first machine learning model to the agent via a communications network, or by allowing the agent to send input data to the data processing system and processing the input data with the first machine learning model stored therein.
The method comprises an action determination step S5, in which the agent collates input data from the current node at which it is located. The input data may include the weather condition, time of day, type of day, or similar features, as well as the current node and the visited nodes. The input data are fed to the previously trained first machine learning model, which determines, based on the input data, the next action to be taken by the agent. In addition, it is possible that, at random with a predetermined probability, the next action is a random action and the first machine learning model is bypassed.
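Step S5 can be sketched as below. All names are hypothetical, and the model is a stand-in callable; only the collated input features (current node, visited nodes, contextual features) and the random-bypass mechanism follow the description.

```python
# Hypothetical sketch of step S5: collate input data at the current node
# and either query the trained first model or, with a small predetermined
# probability, bypass it and take a random action instead.
import random

def determine_next_action(model, current_node, visited, features,
                          actions, random_prob=0.05, rng=random):
    """Return the next action, occasionally bypassing the model at random."""
    if rng.random() < random_prob:
        return rng.choice(actions)               # random exploration action
    inputs = {
        "current_node": current_node,
        "visited_nodes": sorted(visited),
        **features,                              # e.g. weather, time of day
    }
    return model(inputs)

action = determine_next_action(
    model=lambda inputs: "move_north",           # dummy stand-in for the model
    current_node=3,
    visited={0, 1, 3},
    features={"weather": "rain", "time_of_day": "night"},
    actions=["move_north", "move_south"],
    random_prob=0.0,                             # bypass disabled for the demo
)
```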
The method further comprises an action step S6, in which the agent performs the action that was determined in the previous step and moves to the next node.
The steps S5 and S6 are repeated until the agent has visited all nodes or a predetermined condition is met.
With this method, the agent can be made to visit nodes having a lower confidence score more often for data acquisition, which enables improvement of the first machine learning model by selectively collecting more relevant training data.
In order to improve navigation of an agent through an environment for the purpose of data collection a computer-implemented method is proposed. After providing map data in a graph structure, the agent fills the graph structure with sensor data. The sensor data is evaluated in order to obtain a label confidence score for each label at each node of the graph structure. A data processing system trains a first machine learning model using reinforcement learning based on the label confidence scores such that the first machine learning model is able to determine an action to be taken by the agent at each node based on input data from the agent. The first machine learning model is deployed for access by the agent. The agent collates input data at the current node at which it currently is and feeds these input data to the previously trained first machine learning model. The first machine learning model determines a next action for the agent, which the agent performs.
Claims (15)
- 1. A computer-implemented method for determining a route of an agent through an environment for data acquisition, the method comprising:
a) providing map data that describes the environment and is configured as a graph structure comprising nodes and edges, wherein each node represents a location and each edge represents an action that can be taken at the node;
b) with the agent: recording sensor data at a node and transmitting the sensor data from the agent to a data processing system where the sensor data is stored in association with that node in the graph structure;
c) with the data processing system: evaluating the sensor data in order to obtain a label confidence score for each label at each node;
d) with the data processing system: training a first machine learning model using reinforcement learning based on the label confidence scores such that the first machine learning model is trained to determine an action to be taken at each node based on input data from the agent, and deploying the trained first machine learning model for access by the agent;
e) with the agent: at a current node, collating input data and feeding the input data to the first machine learning model trained in step d) in order to determine a next action; and
f) the agent performing the next action determined in step e).
- 2. The method according to claim 1, wherein the agent is a vehicle or a mobile robot navigating through the environment.
- 3. The method according to any of the preceding claims, wherein step a) comprises providing map data by subdividing at least one of the edges connecting two nodes into a plurality of sub-edges and inserting sub-nodes between the sub-edges according to a predetermined interval.
- 4. The method according to any of the preceding claims, wherein in step b) the agent comprises an imaging sensor and records image data as the sensor data and/or wherein in step b) the agent comprises a non-imaging sensor and records non-imaging data as the sensor data.
- 5. The method according to any of the preceding claims, wherein step c) comprises evaluating the sensor data by performing semantic segmentation, object detection, and/or object classification of the sensor data in order to obtain at least one label and the associated label confidence score.
- 6. The method according to any of the preceding claims, wherein step c) comprises evaluating the sensor data by performing pseudo-labelling of the sensor data with a second machine learning model that was trained for pseudo-labelling and outputting a deviation from the pseudo-labelling confidence score as the label confidence score.
- 7. The method according to any of the preceding claims, wherein in step c) the label confidence score is determined as an average label confidence score of all confidence scores associated with a particular label.
- 8. The method according to any of the preceding claims, wherein in step d) the reinforcement learning is performed by deep Q-learning or Proximal Policy Optimization (PPO).
- 9. The method according to any of the preceding claims, wherein in step d) the first machine learning model is transferred to the agent.
- 10. The method according to any of the preceding claims, wherein in step e) the collated input data include any of a current node, visited nodes, and features.
- 11. The method according to any of the preceding claims wherein in step e), the next action determined by the first machine learning model is replaced with a random action with a predetermined probability.
- 12. A system comprising an agent and a data processing system, wherein the agent and the data processing system comprise means to perform respective steps of a method according to any of the preceding claims.
- 13. The system according to claim 12, wherein the data processing system is configured as a distributed system.
- 14. A computer program comprising instructions that, upon execution by a system according to claim 12 or 13, perform a method according to any of claims 1 to 11.
- 15. A machine readable data storage, data carrier or data carrier signal comprising the computer program of claim 14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2218281.0A GB2625511A (en) | 2022-12-06 | 2022-12-06 | A computer-implemented method for navigating an agent through an environment for the purpose of data collection |
Publications (2)
Publication Number | Publication Date |
---|---|
GB202218281D0 GB202218281D0 (en) | 2023-01-18 |
GB2625511A true GB2625511A (en) | 2024-06-26 |
Family
ID=84926667
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB2218281.0A Pending GB2625511A (en) | 2022-12-06 | 2022-12-06 | A computer-implemented method for navigating an agent through an environment for the purpose of data collection |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2625511A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180373980A1 (en) * | 2017-06-27 | 2018-12-27 | drive.ai Inc. | Method for training and refining an artificial intelligence |
US20190286153A1 (en) * | 2018-03-15 | 2019-09-19 | Nvidia Corporation | Determining drivable free-space for autonomous vehicles |
US20210309248A1 (en) * | 2020-04-01 | 2021-10-07 | Nvidia Corporation | Using Image Augmentation with Simulated Objects for Training Machine Learning Models in Autonomous Driving Applications |
Also Published As
Publication number | Publication date |
---|---|
GB202218281D0 (en) | 2023-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10563993B1 (en) | System and method for routing using intersection costs | |
EP3726439B1 (en) | Edge learning | |
US11373115B2 (en) | Asynchronous parameter aggregation for machine learning | |
EP3959576A1 (en) | Methods and systems for trajectory forecasting with recurrent neural networks using inertial behavioral rollout | |
US11993291B2 (en) | Neural networks for navigation of autonomous vehicles based upon predicted human intents | |
US11181918B2 (en) | Moving traffic obstacle detection and avoidance | |
CN108021858A (en) | Mobile object recognition methods and object flow analysis method | |
US11615266B2 (en) | Adaptive sampling of stimuli for training of machine learning based models for predicting hidden context of traffic entities for navigating autonomous vehicles | |
CN112015847A (en) | Obstacle trajectory prediction method and device, storage medium and electronic equipment | |
JP2023526329A (en) | Scenario Identification for Validation and Training of Machine Learning Based Models for Autonomous Vehicles | |
US11987265B1 (en) | Agent trajectory prediction using target locations | |
CN113128381A (en) | Obstacle trajectory prediction method, system and computer storage medium | |
US20220309521A1 (en) | Computing a vehicle interest index | |
Wang et al. | Deep understanding of big geospatial data for self-driving: Data, technologies, and systems | |
Sharma et al. | Deep Learning-Based Object Detection and Classification for Autonomous Vehicles in Different Weather Scenarios of Quebec, Canada | |
US12012118B2 (en) | Generating training datasets for training machine learning based models for predicting behavior of traffic entities for navigating autonomous vehicles | |
US20220198196A1 (en) | Providing access to an autonomous vehicle based on user's detected interest | |
GB2625511A (en) | A computer-implemented method for navigating an agent through an environment for the purpose of data collection | |
Abdelhalim et al. | Computer vision for transit travel time prediction: an end-to-end framework using roadside urban imagery | |
US12097878B2 (en) | Generating training data for machine learning based models for autonomous vehicles | |
Zhong | Occlusion-aware Perception and Planning for Automated Vehicles | |
Jorden et al. | Cloud–based Adaptive Traffic Signal System using Amazon AWS | |
Nino | Context and Behavioral Analysis for Pedestrians in the Domain of Self-Driving | |
Luciani et al. | Street direction classification using implicit vehicle crowdsensing and deep learning | |
Erdamar et al. | Smart City Solution: AI-based Analysis of Public Parking Spaces |