CN116415627A - Training method, device and system for target network for automatic driving - Google Patents

Training method, device and system for target network for automatic driving

Info

Publication number
CN116415627A
Authority
CN
China
Prior art keywords: network, information, decision result, state information, actor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310228539.1A
Other languages
Chinese (zh)
Inventor
张恒彰
王天成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Human Horizons Shanghai Autopilot Technology Co Ltd
Original Assignee
Human Horizons Shanghai Autopilot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Human Horizons Shanghai Autopilot Technology Co Ltd filed Critical Human Horizons Shanghai Autopilot Technology Co Ltd
Priority to CN202310228539.1A priority Critical patent/CN116415627A/en
Publication of CN116415627A publication Critical patent/CN116415627A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W60/00Drive control systems specially adapted for autonomous road vehicles
    • B60W60/001Planning or execution of driving tasks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Automation & Control Theory (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Mechanical Engineering (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application provides a training method, device and system for a target network for automatic driving. The method comprises the following steps: generating, by a simulation environment processing module, first state information and environment identification information corresponding to the first state information, wherein the environment identification information is used for representing driving information within a preset range of a target vehicle; inputting the first state information into a hierarchical graph neural network to obtain a first coding feature and a first decoding feature corresponding to the first state information; inputting the first coding feature into a multi-layer perceptron to obtain decoding information corresponding to the first coding feature; determining a first loss value of the hierarchical graph neural network according to the environment identification information and the decoding information; determining a second loss value of the first actor network; and updating parameters of the hierarchical graph neural network and the first actor network according to the second loss value. According to the technique of the application, the ability of the hierarchical graph neural network to learn key information is improved, thereby improving the decision performance of the target network.

Description

Training method, device and system for target network for automatic driving
Technical Field
The present application relates to the field of automatic driving, and in particular, to a training method, device, and system for a target network for automatic driving.
Background
For training scenarios of a target network for automatic driving, deep reinforcement learning is increasingly adopted to train the target network. In the related art, when a target network is trained based on the PPO (Proximal Policy Optimization) algorithm, a deep learning network such as a convolutional neural network or a multi-layer perceptron is generally adopted as the front-end network of the actor network for learning environment information. However, such networks are poor at capturing the important environment information that influences decision behavior, so the resulting target network cannot produce accurate decision results, which affects the decision performance of the target network.
Disclosure of Invention
The embodiment of the application provides a training method, device and system for an automatic driving target network, so as to solve the problems in the related art, and the technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a training method for a target network for automatic driving, including:
Generating first state information and environment identification information corresponding to the first state information by using a simulation environment processing module, wherein the environment identification information is used for representing driving information in a preset range of a target vehicle;
inputting the first state information into a hierarchical graph neural network to obtain a first coding feature and a first decoding feature corresponding to the first state information; inputting the first coding feature into a first actor network to obtain a first decision result, obtaining second state information after the first decision result is executed, a reward value corresponding to the first decision result and a judgment result of whether a decision process is terminated by using a simulation environment processing module, and constructing training data;
inputting the first coding feature into a multi-layer perceptron to obtain decoding information corresponding to the first coding feature; based on the training data, a second decision result corresponding to the first state information is obtained through a second actor network, and a dominance value corresponding to the first decision result is obtained through a commentary network and a dominance function;
determining a first loss value of the hierarchical graph neural network according to the environment identification information and the decoding information; and determining a second loss value for the first actor network based on the first loss value, the comparison of the first decision result and the second decision result, the first decoding feature and the dominance value;
And updating parameters of the hierarchical graph neural network and the first actor network according to the second loss value until a target hierarchical graph neural network and a target network which meet preset conditions are obtained.
In one embodiment, the driving information within the preset range of the target vehicle includes lane information in which the target vehicle is currently located, determination information of whether other vehicles exist within the preset range of the target vehicle, and speed information of the other vehicles within the preset range.
In one embodiment, determining a first loss value for the hierarchical graph neural network based on the environment identification information and the decoding information includes:
and determining a first loss value of the hierarchical graph neural network by utilizing a mean square error formula according to the environment identification information and the decoding information.
In one embodiment, determining a second loss value for the first actor network based on the first loss value, the comparison of the first decision result and the second decision result, the first decoding feature and the dominance value comprises:
and inputting the first loss value, the comparison result of the first decision result and the second decision result, the first decoding feature and the dominance value into an actor network loss function to obtain a second loss value of the first actor network.
In one embodiment, based on the training data, obtaining a second decision result corresponding to the first state information through a second actor network, and obtaining a dominance value corresponding to the first decision result through a commentary network and a dominance function, the method includes:
inputting the first state information in the training data into a hierarchical graph neural network to obtain a second coding feature and a second decoding feature corresponding to the first state information;
inputting the second coding feature into a second actor network to obtain a second decision result corresponding to the first state information;
inputting the second coding feature into the critic network to obtain the state value corresponding to the first state information, and determining the dominance value corresponding to the first decision result by using the dominance function.
In one embodiment, after inputting the second coding feature into the critic network to obtain the state value corresponding to the first state information, and determining the dominance value corresponding to the first decision result by using the dominance function, the method further includes:
determining a third loss value corresponding to the dominance value by using the cost loss function;
and updating parameters of the commentator network according to the third loss value.
In one embodiment, the parameters of the second actor network are obtained by a hard update from the parameters of the first actor network after a preset number of updates, and the initialization parameters of the first actor network and the second actor network are the same.
In one embodiment, the server communicates with the client through a remote procedure call technique.
In a second aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of training a target network for autopilot of embodiments of the present application.
In a third aspect, an embodiment of the present application provides a training system for a target network for automatic driving, configured to execute the training method for a target network for automatic driving according to the embodiment of the present application, where the system includes a client and a server; the client is deployed with a simulation environment processing module, a hierarchical graph neural network and a first actor network, obtains training data based on the first state information, the first decision result, the second state information, the reward value and the judgment result of whether the decision process is terminated, and sends the training data to the server; the server is provided with a first actor network, a second actor network, a critic network and a parameter updating module, where the parameter updating module is used for updating the parameters of the hierarchical graph neural network, the first actor network, the second actor network and the critic network according to the training data and synchronizing the updated network parameters to the client.
According to the technique of the application, the environment identification information corresponding to the first state information is generated by the simulation environment processing module, the decoding information corresponding to the first coding feature output by the hierarchical graph neural network is obtained by the multi-layer perceptron, and the first loss value is determined according to the environment identification information and the decoding information. This guides the hierarchical graph neural network to pay attention to the key information in the state information, namely the driving information within the preset range of the target vehicle, which may include, for example, the lane information in which the target vehicle is currently located, the determination of whether other vehicles exist within the preset range of the target vehicle and the speed information of the other vehicles within the preset range, so that the first decision result obtained by the first actor network from the first coding feature output by the hierarchical graph neural network is more accurate. Secondly, the second loss value of the first actor network is determined according to the first loss value, the comparison result of the first decision result and the second decision result, the first decoding feature and the dominance value, and the parameters of the hierarchical graph neural network and the first actor network are updated according to the second loss value, so that the hierarchical graph neural network and the first actor network can capture the key information more stably and thus make more reasonable decisions. In summary, according to the method of the embodiment of the application, the parameters of the hierarchical graph neural network and of the first actor network can be updated synchronously, and the ability of the hierarchical graph neural network to learn key information is significantly improved, thereby improving the decision performance of the target network.
The foregoing summary is for the purpose of the specification only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will become apparent by reference to the drawings and the following detailed description.
Drawings
In the drawings, the same reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily drawn to scale. It is appreciated that these drawings depict only some embodiments according to the disclosure and are not therefore to be considered limiting of its scope.
FIG. 1 is a flow chart of a training method for an autonomous driving target network according to an embodiment of the present application;
FIG. 2 is a block diagram of a training method for an autonomous driving target network according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training method for an autonomous driving target network according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a training system for an autonomous driving target network according to an embodiment of the present application;
fig. 5 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Hereinafter, only certain exemplary embodiments are briefly described. As will be recognized by those of skill in the pertinent art, the described embodiments may be modified in various different ways without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature and not as restrictive.
Fig. 1 shows a flowchart of a training method for an autonomous driving target network according to an embodiment of the present application. As shown in fig. 1, the method may include the steps of:
step S101: generating first state information and environment identification information corresponding to the first state information by using a simulation environment processing module, wherein the environment identification information is used for representing driving information in a preset range of a target vehicle;
step S102: inputting the first state information into a hierarchical graph neural network to obtain a first coding feature and a first decoding feature corresponding to the first state information; inputting the first coding feature into a first actor network to obtain a first decision result, obtaining second state information after the first decision result is executed, a reward value corresponding to the first decision result and a judgment result of whether a decision process is terminated by using a simulation environment processing module, and constructing training data;
Step S103: inputting the first coding feature into a multi-layer perceptron to obtain decoding information corresponding to the first coding feature; based on the training data, a second decision result corresponding to the first state information is obtained through a second actor network, and a dominance value corresponding to the first decision result is obtained through a commentary network and a dominance function;
step S104: determining a first loss value of the hierarchical graph neural network according to the environment identification information and the decoding information; and determining a second loss value for the first actor network based on the first loss value, the comparison of the first decision result and the second decision result, the first decoding feature and the dominance value;
step S105: and updating parameters of the hierarchical graph neural network and the first actor network according to the second loss value until a target hierarchical graph neural network and a target network which meet preset conditions are obtained.
In the embodiment of the application, as shown in fig. 2, applying the reinforcement learning algorithm to the downstream task involves two major parts: the training environment and reinforcement learning. In the training environment part, the simulation environment processing module constructs observation information (namely, state information at the current moment) from the basic state data in the current environment information according to an observation generation algorithm, and transmits the observation information to the reinforcement learning model. In the reinforcement learning part, the reinforcement learning model generates action information according to the observation information and feeds the action information back to the simulation environment processing module. The simulation environment processing module receives the action information, generates reward information according to the reward generation definition, and feeds the reward information back to the reinforcement learning model so that the reinforcement learning model can update the parameters of the actor network. The reward information may include a reward value, the state information at the next moment, and a judgment result of whether the decision process is terminated.
The reinforcement learning model may adopt any one of a plurality of reinforcement learning algorithms such as Q-Learning, DQN, Actor-Critic, DDPG and PPO. In the following description of the embodiments of the present application, the PPO algorithm is used as an example. The PPO algorithm is a policy-based reinforcement learning algorithm that uses three neural networks (namely, a first actor network, a second actor network, and a critic network). The current state of the agent is input into the neural network to obtain the corresponding action and reward, the state of the agent is updated according to the action, and the weight parameters of the actor network are updated by gradient ascent on an objective function comprising the reward and the action, so as to obtain action decisions that yield a larger overall reward value. In the following description of the present application, the first state information may be understood as the state information at the current moment, and the second state information may be understood as the state information at the next moment after the first decision result is executed.
In the embodiment of the application, the target network is used for outputting a target decision result according to the state information of the target vehicle. The state information comprises the road information where the vehicle is currently located and the driving information within a preset range, where the driving information may specifically comprise whether other vehicles exist within the preset range, the historical track information of other vehicles within a preset duration, and the like. The target decision result output by the target network is used for controlling the target vehicle to execute the corresponding driving behavior. For example, the target decision result may include acceleration, deceleration, changing to the left lane, changing to the right lane, maintaining the current driving behavior, and the like. According to the target decision result generated by the target network, automatic driving of the vehicle in a high-speed driving scenario can be realized.
Illustratively, in step S101, the simulation environment processing module is configured to simulate a driving environment and generate state information based on the driving environment. The simulation environment processing module may be an automatic driving simulation environment for reinforcement learning built on joint simulation of Gym and Carla. Gym is a toolkit for developing and comparing reinforcement learning algorithms, providing a simulation environment and data protocols for reinforcement learning. Carla is an open-source automatic driving simulator; it addresses a series of tasks related to the automatic driving problem with a modular and flexible API (Application Programming Interface), helps to achieve autonomy in automatic driving development, and is a tool that can be conveniently accessed and customized.
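As a non-limiting illustration of such a Gym/Carla-based module, the sketch below shows only the reset/step protocol described above; the simulator-facing helper methods (reset_scene, build_observation, apply_decision, compute_reward, episode_terminated) are hypothetical placeholders, not Carla or Gym API.

```python
import gym
import numpy as np


class HighwayDrivingEnv(gym.Env):
    """Minimal sketch of a Gym-style wrapper around a driving simulator."""

    def __init__(self, simulator):
        # `simulator` is a hypothetical object wrapping the Carla side of the
        # joint simulation; its methods stand in for the observation-generation
        # and reward-generation logic of the simulation environment processing module.
        self.simulator = simulator
        self.action_space = gym.spaces.Discrete(5)  # accelerate / decelerate / left / right / keep
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(64,), dtype=np.float32)

    def reset(self):
        self.simulator.reset_scene()
        obs, _env_id_info = self.simulator.build_observation()  # first state information
        return obs

    def step(self, action):
        self.simulator.apply_decision(action)                   # execute the decision result
        obs, env_id_info = self.simulator.build_observation()   # next state information
        reward = self.simulator.compute_reward(action)          # reward value
        done = self.simulator.episode_terminated()              # whether the decision process ends
        return obs, reward, done, {"env_id_info": env_id_info}
```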
The environment identification information may be annotated key information in the first state information, used for indicating driving information within a preset range of the target vehicle. More specifically, in one embodiment, the driving information within the preset range of the target vehicle includes lane information in which the target vehicle is currently located, determination information of whether other vehicles exist within the preset range of the target vehicle, and speed information of the other vehicles within the preset range.
For example, the environment identification information may include driving information within a plurality of preset ranges of the target vehicle, and may specifically include: whether another vehicle exists within 100 meters and within 50 meters in front of the target vehicle and, if so, the speed information of that vehicle; the lane information in which the target vehicle is currently located; and the speed information of other vehicles within 100 meters in front of and 100 meters behind the target vehicle in its current lane and the adjacent lanes.
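For illustration only, this kind of environment identification information could be collected into a small labelled structure and flattened into a supervision target for the auxiliary decoder; the field names and the fixed number of speed slots below are assumptions rather than the patent's exact encoding.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnvIdentificationInfo:
    """Hypothetical container for the environment identification information."""
    ego_lane_index: int                      # lane in which the target vehicle is currently located
    vehicle_within_100m_ahead: bool          # whether another vehicle exists within 100 m ahead
    vehicle_within_50m_ahead: bool           # whether another vehicle exists within 50 m ahead
    nearby_vehicle_speeds: List[float] = field(default_factory=list)  # speeds of vehicles in range

    def to_vector(self, speed_slots: int = 6) -> List[float]:
        # Flatten into a fixed-length vector, padding/truncating the speed list.
        speeds = (self.nearby_vehicle_speeds + [0.0] * speed_slots)[:speed_slots]
        return [float(self.ego_lane_index),
                float(self.vehicle_within_100m_ahead),
                float(self.vehicle_within_50m_ahead)] + speeds
```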
Illustratively, in step S102, the hierarchical graph neural network is used as a front-end network of the first actor network to extract the key information in the first state information, and the extracted feature (the first coding feature) is then input into the first actor network so that the first actor network generates the first decision result based on the key information. The key information may specifically be the driving information within a plurality of preset ranges of the target vehicle.
In one example, the hierarchical graph neural network may specifically be VectorNet. VectorNet comprises an encoding layer and a decoding layer. The encoding layer vectorizes all graph elements (such as map features and dynamic traffic participants) in the input first state information and constructs polyline subgraphs based on the vectorized representation, so as to obtain the coding feature. The decoding layer adopts a Multi-Layer Perceptron (MLP) to aggregate the local features of the different polyline subgraphs, then integrates all trajectories and map features globally, and finally obtains the decoding feature.
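A deliberately simplified, VectorNet-flavoured sketch is given below to make the encoder/decoder split concrete: a polyline-subgraph layer aggregates per-vector features, a global interaction layer mixes polyline features, and an MLP decoder produces the decoding feature. The layer sizes, max-pooling and self-attention choices are assumptions for illustration, not the patent's exact architecture.

```python
import torch
import torch.nn as nn


class PolylineSubgraphLayer(nn.Module):
    """Encodes the vectors of one polyline (map element or agent track)."""

    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())

    def forward(self, vectors):                       # (num_polylines, num_vectors, in_dim)
        node_feat = self.mlp(vectors)
        pooled = node_feat.max(dim=1, keepdim=True).values
        # Concatenate each vector feature with the polyline-level pooled feature.
        return torch.cat([node_feat, pooled.expand_as(node_feat)], dim=-1)


class HierarchicalGraphNet(nn.Module):
    """Toy VectorNet-style encoder/decoder (illustrative only)."""

    def __init__(self, in_dim=8, hidden_dim=64, decode_dim=9):
        super().__init__()
        self.subgraph = PolylineSubgraphLayer(in_dim, hidden_dim)
        self.global_attn = nn.MultiheadAttention(2 * hidden_dim, num_heads=4, batch_first=True)
        self.decoder = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                     nn.ReLU(), nn.Linear(hidden_dim, decode_dim))

    def forward(self, vectors):
        poly_feat = self.subgraph(vectors).max(dim=1).values      # one feature per polyline
        poly_feat = poly_feat.unsqueeze(0)                        # (1, num_polylines, feat)
        global_feat, _ = self.global_attn(poly_feat, poly_feat, poly_feat)
        encoding = global_feat.mean(dim=1).squeeze(0)             # coding feature
        decoding = self.decoder(encoding)                         # decoding feature
        return encoding, decoding
```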
After the first decision result output by the first actor network is obtained, the first decision result may be input into the simulation environment processing module to obtain the second state information of the target vehicle after the first decision result is executed, the reward value corresponding to the first decision result, and the judgment result of whether the current driving scenario is terminated. The reward value is used for evaluating how good the first decision result is, and the judgment of whether the decision process is terminated is made based on the second state information.
In addition, training data is obtained based on the first state information, the first decision result, the reward value, the second state information and the judgment result of whether the decision process is terminated, and the training data is added to the experience pool.
After the first decision result is executed and the second state information is obtained, if the judgment result of whether the decision process is terminated is no, the second state information is input into the first actor network as the state information at the current moment to obtain the first decision result corresponding to the current moment, and this first decision result is input into the simulation environment processing module to obtain the reward value of the first decision result corresponding to the current moment, the state information at the next moment, and the judgment result of whether the decision process is terminated, as output by the simulation environment processing module. On this basis, multiple pieces of training data corresponding respectively to multiple consecutive moments can be obtained, forming a training data set. The training data may be expressed as a sequence, such as: {S_t, A_t, R_t, S_{t+1}, is_done}, where S_t represents the state information corresponding to time t, A_t represents the decision result obtained based on the state information corresponding to time t, R_t represents the reward value of the decision result corresponding to time t, S_{t+1} represents the state information at time t+1, and is_done represents the judgment result of whether the decision process is terminated at time t+1. After the training data is obtained, the training data may be added to the experience pool as the data basis for the subsequent reinforcement learning of the first actor network and the critic network.
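The training data sequence {S_t, A_t, R_t, S_{t+1}, is_done} maps naturally onto a simple transition record and replay buffer; the sketch below is illustrative and the field names are assumed.

```python
import random
from collections import namedtuple, deque

# One training sample: {S_t, A_t, R_t, S_{t+1}, is_done}
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "is_done"])


class ExperiencePool:
    """Minimal experience pool holding the training data sequences."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, is_done):
        self.buffer.append(Transition(state, action, reward, next_state, is_done))

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))
```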
Illustratively, in step S103, the multi-layer perceptron is a feedforward artificial neural network that maps a set of input vectors to a set of output vectors. The multi-layer perceptron comprises an input layer, an intermediate hidden layer and a final output layer; the products of the input elements and the weights, together with the neuron bias, are fed to a summing point. Its main advantage is its ability to quickly solve complex problems.
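For reference, a multi-layer perceptron of the kind described (input layer, hidden layer, output layer, weighted sums plus bias) can be written as the following minimal sketch; the layer sizes are arbitrary placeholders.

```python
import torch.nn as nn


class MultiLayerPerceptron(nn.Module):
    """Input layer -> hidden layer -> output layer, as described above."""

    def __init__(self, in_dim=128, hidden_dim=64, out_dim=9):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),   # weighted sum plus neuron bias
            nn.ReLU(),                       # non-linear activation
            nn.Linear(hidden_dim, out_dim),  # decoding information
        )

    def forward(self, x):
        return self.net(x)
```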
Illustratively, in step S103, training data may be extracted from the experience pool, the first state information in the training data is input into the hierarchical graph neural network to obtain the second coding feature and the second decoding feature corresponding to the first state information, and the second coding feature is input into the second actor network to obtain the second decision result. In addition, the second coding feature is input into the critic network, and the dominance value corresponding to the first decision result is obtained by using the dominance function.
It should be noted that, the initialization parameters of the first actor network are the same as those of the second actor network, and the parameters of the second actor network are updated based on the parameters of the first actor network. For example, after the parameters of the first actor network are updated a preset number of times, the parameters of the second actor network may be updated based on the parameters of the actor network updated a preset number of times. Based on this, the update frequencies of the first actor network and the second actor network are different, and in the subsequent reinforcement learning process, the parameters of the first actor network and the parameters of the second actor network may be different, and the first decision result and the second decision result obtained based on the first state information may be different.
Illustratively, in step S104, the first loss value may be calculated using a mean square error equation. The second loss value may be calculated using a predetermined actor network loss function, wherein the actor network loss function is designed based on the first loss value, the comparison of the first decision result and the second decision result, the first decoding characteristic, and the dominance value.
Illustratively, in step S105, a synchronized parameter update may be performed on the hierarchical graph neural network and the first actor network. The preset condition may be that the average reward values of the first actor network over a preset number of consecutive rounds are all greater than or equal to a desired preset return value. A round runs from the start of decision making to its end: after the first actor network makes a decision based on the initial first state information, if the obtained judgment result is that the current driving scenario is not terminated, the first actor network continues to make decisions based on the second state information, and so on, until the judgment result obtained after some decision indicates that the current driving scenario is terminated, at which point decision making stops and the round ends. One round therefore corresponds to the first decision results made by the first actor network at a plurality of consecutive moments and their corresponding reward values, from which the average reward value of that round can be obtained. If, over a preset number of consecutive rounds, the average reward value of every round is greater than or equal to the desired preset return value, it can be determined that the first actor network satisfies the preset condition.
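The preset condition described above could be checked as in the sketch below, where round_average_rewards holds the average reward value of each completed round; the window size and the expected return value are placeholder numbers.

```python
from collections import deque


def training_converged(round_average_rewards, window=20, expected_return=200.0):
    """True when the average reward of each of the last `window` consecutive
    rounds is greater than or equal to the desired preset return value."""
    recent = deque(round_average_rewards, maxlen=window)  # keeps the most recent rounds
    if len(recent) < window:
        return False
    return all(avg >= expected_return for avg in recent)
```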
In addition, if the updated first actor network does not meet the preset condition, steps S101 to S105 are repeatedly performed until the first actor network meets the preset condition, and the current hierarchical graph neural network and the first actor network are determined as the target hierarchical graph neural network and the target network.
According to the method, the environment identification information corresponding to the first state information is generated by the simulation environment processing module, the decoding information corresponding to the first coding feature output by the hierarchical graph neural network is obtained by the multi-layer perceptron, and the first loss value is determined according to the environment identification information and the decoding information. This guides the hierarchical graph neural network to pay attention to the key information in the state information, namely the driving information within the preset range of the target vehicle, which may include, for example, the lane information in which the target vehicle is currently located, the determination of whether other vehicles exist within the preset range of the target vehicle and the speed information of the other vehicles within the preset range, so that the first decision result obtained by the first actor network from the first coding feature output by the hierarchical graph neural network is more accurate. Secondly, the second loss value of the first actor network is determined according to the first loss value, the comparison result of the first decision result and the second decision result, the first decoding feature and the dominance value, and the parameters of the hierarchical graph neural network and the first actor network are updated according to the second loss value, so that the hierarchical graph neural network and the first actor network can capture the key information more stably and make more reasonable decisions. In summary, according to the method of the embodiment of the application, the parameters of the hierarchical graph neural network and of the first actor network can be updated synchronously, and the ability of the hierarchical graph neural network to learn key information is significantly improved, thereby improving the decision performance of the target network.
In one embodiment, step S104 may include:
step S1041: and determining a first loss value of the hierarchical graph neural network by utilizing a mean square error formula according to the environment identification information and the decoding information.
Illustratively, the mean square error formula is specifically as follows:
first_loss = mean((θ' - θ)^2),
wherein θ' is used for representing the decoding information, and θ is used for representing the environment identification information.
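Read directly, the first loss value is simply the mean squared error between the two quantities; a one-line sketch (tensor shapes assumed to match):

```python
import torch
import torch.nn.functional as F


def first_loss(decoding_info: torch.Tensor, env_id_info: torch.Tensor) -> torch.Tensor:
    """Mean squared error between the decoding information produced by the
    multi-layer perceptron and the environment identification information."""
    return F.mse_loss(decoding_info, env_id_info)
```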
In one embodiment, step S104 may include:
step S1042: and inputting the first loss value, the comparison result of the first decision result and the second decision result, the first decoding feature and the dominance value into an actor network loss function to obtain a second loss value of the first actor network.
Illustratively, the actor network loss function for calculating the second loss value may be as follows:
Actor_loss=mean(-min(surr1,surr2))+aux_loss+node_decoder_loss,
surr1=ratio*advantage,
surr2=min(max(ratio,1-ε),1+ε)*advantage,
node_decoder_loss=Smooth_L1_Loss(aux_pred,aux_true),
Smooth_L1_Loss(aux_pred,aux_true)=0.5*(aux_pred-aux_true)^2/β, if |aux_pred-aux_true|<β; |aux_pred-aux_true|-0.5*β, otherwise,
wherein ratio is used for representing the ratio of the first decision result to the second decision result, advantage is used for representing the dominance value, the hyper-parameter ε is used for representing the truncation (clipping) ratio, with ε ∈ (0, 1); aux_pred is used for representing the first decoding feature, aux_true is used for representing the true value corresponding to the first decoding feature, and the hyper-parameter β is used for representing the threshold on the numerical difference between the first decoding feature and the true value corresponding to the first decoding feature.
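Under the reading that surr1/surr2 form the standard PPO clipped surrogate, the actor loss can be sketched as below; the aux_loss term is taken as an input because its exact form is not spelled out here, and the probability-ratio computation from log-probabilities is an assumption.

```python
import torch
import torch.nn.functional as F


def actor_loss(log_prob_new, log_prob_old, advantage, aux_pred, aux_true,
               aux_loss=0.0, epsilon=0.2, beta=1.0):
    """Illustrative PPO-style second loss value for the first actor network."""
    ratio = torch.exp(log_prob_new - log_prob_old)            # first vs. second actor network
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantage
    node_decoder_loss = F.smooth_l1_loss(aux_pred, aux_true, beta=beta)
    return torch.mean(-torch.min(surr1, surr2)) + aux_loss + node_decoder_loss
```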
In one embodiment, step S103 may include:
Step S1031: inputting the first state information in the training data into a hierarchical graph neural network to obtain a second coding feature and a second decoding feature corresponding to the first state information;
step S1032: inputting the second coding feature into a second actor network to obtain a second decision result corresponding to the first state information;
step S1033: inputting the second coding feature into the critic network to obtain the state value corresponding to the first state information, and determining the dominance value corresponding to the first decision result by using the dominance function.
In the embodiment of the application, the value output by the critic network is used for representing the state value corresponding to the first state information. The dominance function is used for expressing how good the first decision result output by the first actor network is relative to the average policy under the first state information, and can reflect the deviation of the random variable from its mean.
Illustratively, the dominance function A(s, a) may specifically be as follows:
A(s,a)=Q(s,a)-V(s)
wherein Q(s, a) is the state-action value function, used for calculating the expected return obtained after performing action a in state s; V(s) is the state value function, used for calculating the expected return in state s.
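In practice the dominance value is often estimated without an explicit Q-function, for example with a one-step temporal-difference estimate; this is a common simplification and not necessarily the patent's exact computation.

```python
def one_step_advantage(reward, value_s, value_s_next, gamma=0.99, done=False):
    """A(s, a) ~ r + gamma * V(s') - V(s), masking V(s') on terminal steps."""
    q_estimate = reward + gamma * value_s_next * (0.0 if done else 1.0)
    return q_estimate - value_s
```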
In one implementation manner, after inputting the second coding feature into the second actor network to obtain the second decision result corresponding to the first state information, the method in the embodiment of the present application further includes:
determining a third loss value corresponding to the dominance value by using the cost loss function;
and updating parameters of the commentator network according to the third loss value.
Illustratively, the cost loss function may be as follows:
Critic_loss=mean(mse_loss(states,td_target)),
td_target=rewards+γ*next_states*(1-J),
wherein states is used for representing the first state information, next_states is used for representing the second state information, mse_loss is used for calculating the mean square error loss, rewards is used for representing the reward values stored in the experience pool, the hyper-parameter γ is used for representing the decay (discount) rate, and J is used for representing the judgment result of whether the decision process is terminated after the first decision result is executed.
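Read together with the formulas above, the critic update can be sketched as follows; here the state values V(s) and V(s') are assumed to come from the critic network, which the notation above abbreviates as states and next_states.

```python
import torch
import torch.nn.functional as F


def critic_loss(value_states, rewards, value_next_states, done_mask, gamma=0.99):
    """Illustrative third-loss computation for the critic network.

    value_states      : critic output V(s) for the first state information
    value_next_states : critic output V(s') for the second state information
    done_mask         : 1.0 where the decision process terminated, otherwise 0.0
    """
    td_target = rewards + gamma * value_next_states * (1.0 - done_mask)
    return F.mse_loss(value_states, td_target.detach())
```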
In one embodiment, the parameters of the second actor network are obtained by a hard update from the parameters of the first actor network after a preset number of updates, and the initialization parameters of the first actor network and the second actor network are the same.
For example, after the preset number of updates has been performed on the parameters of the first actor network, a hard update may be performed on the parameters of the second actor network according to the parameters of the first actor network after the preset number of updates. A hard update means assigning all parameters of the first actor network, as updated the preset number of times, to the second actor network; that is, after the hard update the parameters of the second actor network are identical to those of the first actor network.
It should be noted that the preset number is not particularly limited and may be set according to practical situations; for example, the preset number may be 100, 200 or any other value.
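With framework modules such as those sketched earlier, the hard update amounts to copying the first actor network's parameters into the second actor network once every preset number of updates, for example:

```python
def maybe_hard_update(first_actor, second_actor, update_step, preset_number=100):
    """Copy all parameters of the first actor network into the second actor
    network once every `preset_number` parameter updates (hard update)."""
    if update_step % preset_number == 0:
        second_actor.load_state_dict(first_actor.state_dict())
```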
In the embodiment of the application, the updating of the parameters of the hierarchical graph neural network, the first actor network, the second actor network and the critic network is performed by the server. Specifically, the server is deployed with a hierarchical graph neural network, a first actor network, a second actor network and a critic network, and can train the first actor network, the second actor network and the critic network by adopting the PPO algorithm. The client is deployed with an agent running the hierarchical graph neural network and the first actor network, and a simulation environment processing module. The simulation environment processing module is used for generating the first state information and the environment identification information corresponding to the first state information, and for inputting the first state information into the agent, so that the hierarchical graph neural network obtains the first coding feature according to the first state information and the first actor network obtains the first decision result according to the first coding feature; the first decision result is then input into the simulation environment processing module to obtain the reward value of the first decision result, the second state information after the first decision result is executed, and the judgment result of whether the decision process is terminated. The client takes the first state information, the first decision result, the reward value and the second state information as training data and sends the training data to a data cache area (namely, the experience pool) of the server. The server updates the parameters of the hierarchical graph neural network, the first actor network, the second actor network and the critic network according to the training data, and feeds the updated parameters back to the client so that the client can adjust the parameters of the hierarchical graph neural network and the first actor network deployed in the agent.
A training method of a target network for automatic driving according to an embodiment of the present application is described below with reference to fig. 3 as a specific example. As shown in fig. 3, the method may employ the PPO algorithm to train the actor networks and the critic network.
Specifically, based on the first coding feature that the hierarchical graph neural network outputs for the first state information s_t and the environment identification information info produced by the simulation environment processing module, the first actor network outputs the corresponding first decision result a_t. The first decision result a_t is then input into the simulation environment processing module again to obtain the second state information s_{t+1} after the first decision result a_t is executed, the reward value r_t, and the judgment result of whether the decision process is terminated. Training data is obtained based on the first state information s_t, the first decision result a_t, the second state information s_{t+1}, the reward value r_t and the judgment result of whether the decision process is terminated, and the training data is stored into the experience pool.
In the reinforcement learning process, training data are extracted from the experience pool, and first state information in the training data are input into the hierarchical graph neural network to obtain second coding features. And then inputting the second coding feature into a second actor network to obtain a second decision result, inputting the second coding feature into a critic network to obtain a state value, and obtaining a dominance value by using a dominance function.
Decoding information is obtained by using the multi-layer perceptron according to the first coding feature output by the hierarchical graph neural network, and the first loss value corresponding to the hierarchical graph neural network is determined by using the mean square error formula according to the decoding information and the environment identification information. The second loss value of the first actor network is then determined by using the actor network loss function according to the first loss value, the comparison result of the first decision result and the second decision result, the first decoding feature and the dominance value.
And finally, synchronously updating parameters of the hierarchical graph neural network and the first actor network according to the first loss value and the second loss value until a target hierarchical graph neural network and a target network which meet preset conditions are obtained.
According to another aspect of the embodiment of the application, there is also provided a training system for a target network for automatic driving, for executing the training method for a target network for automatic driving according to the embodiment of the application. The system comprises a client and a server. Specifically, a simulation environment processing module, a hierarchical graph neural network and a first actor network are deployed on the client; the client obtains training data based on the first state information, the first decision result, the second state information, the reward value and the judgment result of whether the decision process is terminated, and sends the training data to the server. The server is provided with a first actor network, a second actor network, a critic network and a parameter updating module, where the parameter updating module is used for updating the parameters of the hierarchical graph neural network, the first actor network, the second actor network and the critic network according to the training data and synchronizing the updated network parameters to the client.
Illustratively, as shown in FIG. 4, the system may include a Client and a Server. The server comprises a data cache region, a reinforcement learning model training module and a storage module, where the reinforcement learning model training module is used for updating the parameters of the hierarchical graph neural network, the first actor network, the second actor network and the critic network deployed on the server. The client is deployed with an agent running the hierarchical graph neural network and the first actor network, and a simulation environment processing module; the simulation environment processing module is used for generating the first state information and inputting it into the agent, so that the hierarchical graph neural network outputs the first coding feature according to the first state information and the first actor network generates the first decision result according to the first coding feature. The agent inputs the first decision result into the simulation environment processing module to obtain the reward value of the first decision result, the second state information after the first decision result is executed, and the judgment result of whether the decision process is terminated. The client takes the first state information, the first decision result, the reward value and the second state information as training data and sends the training data to a data cache area (namely, the experience pool) of the server. The server updates the parameters of the hierarchical graph neural network, the first actor network, the second actor network and the critic network according to the training data, and feeds the updated parameters back to the client so that the client can adjust the parameters of the hierarchical graph neural network and the first actor network deployed in the agent.
In one embodiment, the server communicates with the client through a remote procedure call technique.
Illustratively, the remote procedure call technique may specifically be gRPC. gRPC is a language-neutral, platform-neutral, open-source remote procedure call framework. By using gRPC, communication between the client and the server can be performed transparently, which simplifies the construction of the communication system.
According to another aspect of the embodiments of the present application, there is also provided a vehicle including an automatic driving apparatus configured with an actor network for automatic driving, the actor network being configured to generate an automatic decision result according to driving environment information, wherein the actor network is generated by using the training method of the target network for automatic driving of the embodiments of the present application.
Fig. 5 shows a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device includes: memory 510 and processor 520, and instructions executable on processor 520 are stored in memory 510. The processor 520, when executing the instructions, implements the training method of the target network for autopilot in the above-described embodiment. The number of memories 510 and processors 520 may be one or more. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the application described and/or claimed herein.
The electronic device may further include a communication interface 530 for communicating with external devices for data interactive transmission. The various devices are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor 520 may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). The bus may be classified as an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in fig. 5, but this does not mean there is only one bus or one type of bus.
Alternatively, in a specific implementation, if the memory 510, the processor 520, and the communication interface 530 are integrated on a chip, the memory 510, the processor 520, and the communication interface 530 may communicate with each other through internal interfaces.
It should be appreciated that the processor may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting an advanced reduced instruction set machine (Advanced RISC Machines, ARM) architecture.
The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, training device, or data center to another via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
Computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, or other magnetic storage media that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present application are achieved, and are not limited herein. The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various changes or substitutions within the technical scope of the present application, and these should be covered in the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A training method for an autonomous driving target network, comprising:
generating first state information and environment identification information corresponding to the first state information by using a simulation environment processing module, wherein the environment identification information is used for representing driving information in a preset range of a target vehicle;
inputting the first state information into a hierarchical graph neural network to obtain a first coding feature and a first decoding feature corresponding to the first state information; inputting the first coding feature into a first actor network to obtain a first decision result, obtaining second state information after the first decision result is executed, a reward value corresponding to the first decision result and a judgment result of whether a decision process is terminated by utilizing the simulation environment processing module, and constructing training data;
Inputting the first coding feature into a multi-layer perceptron to obtain decoding information corresponding to the first coding feature; based on the training data, a second decision result corresponding to the first state information is obtained through a second actor network, and a dominance value corresponding to the first decision result is obtained through a commentator network and a dominance function;
determining a first loss value of the hierarchical graph neural network according to the environment identification information and the decoding information; and determining a second loss value for the first actor network based on the first loss value, the comparison of the first decision result and the second decision result, the first decoding feature, and the dominance value;
and updating parameters of the hierarchical graph neural network and the first actor network according to the second loss value until a target hierarchical graph neural network and a target network which meet preset conditions are obtained.
2. The method according to claim 1, wherein the driving information within the preset range of the target vehicle includes lane information of the lane in which the target vehicle is currently located, information on whether other vehicles exist within the preset range of the target vehicle, and speed information of the other vehicles within the preset range.
3. The method of claim 1, wherein determining a first loss value for the hierarchical graph neural network based on the environment identification information and the decoding information comprises:
and determining a first loss value of the hierarchical graph neural network by utilizing a mean square error formula according to the environment identification information and the decoding information.
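A minimal sketch of the mean squared error computation in this claim, assuming the decoding information and the environment identification information have already been arranged as equally sized tensors (the feature size of 8 is an illustrative placeholder):

```python
# Minimal sketch of the mean squared error first loss; the feature size of 8
# and the random tensors are placeholders for the decoding information and
# the environment identification information.
import torch
import torch.nn.functional as F

decoding_info = torch.randn(1, 8)     # output of the multi-layer perceptron
env_id_info = torch.randn(1, 8)       # environment identification information

first_loss = F.mse_loss(decoding_info, env_id_info)
```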
4. The method of claim 1, wherein determining a second loss value for the first actor network based on the first loss value, the comparison result of the first decision result and the second decision result, the first decoding feature, and the advantage value comprises:
and inputting the first loss value, the comparison result of the first decision result and the second decision result, the first decoding feature, and the advantage value into an actor network loss function to obtain the second loss value of the first actor network.
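The patent does not spell out the actor network loss function, so the sketch below is only one plausible form: a PPO-style clipped surrogate weighted by the advantage value, with the first loss value and an assumed L2 penalty on the first decoding feature added as auxiliary terms. All coefficients and the use of log-probability ratios to model the "comparison of the decision results" are illustrative assumptions.

```python
# One *possible* actor network loss function (assumption, not the patent's
# formula): a PPO-style clipped surrogate weighted by the advantage value,
# plus the first loss value and an assumed L2 penalty on the decoding feature.
import torch


def actor_loss(first_loss, log_prob_first, log_prob_second, advantage,
               decoding_feature, clip_eps=0.2, aux_weight=0.5, reg_weight=1e-3):
    # The "comparison of the first and second decision results" is modelled
    # here as a probability ratio between the two actor networks.
    ratio = torch.exp(log_prob_second - log_prob_first)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = -torch.min(ratio * advantage, clipped * advantage).mean()
    return (policy_term
            + aux_weight * first_loss
            + reg_weight * decoding_feature.pow(2).mean())


second_loss = actor_loss(
    first_loss=torch.tensor(0.1),
    log_prob_first=torch.tensor([-1.2]),
    log_prob_second=torch.tensor([-1.0]),
    advantage=torch.tensor([0.5]),
    decoding_feature=torch.randn(1, 64),
)
```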
5. The method of claim 1, wherein, based on the training data, obtaining a second decision result corresponding to the first state information through a second actor network and obtaining an advantage value corresponding to the first decision result through a critic network and an advantage function comprises:
inputting the first state information in the training data into the hierarchical graph neural network to obtain a second coding feature and a second decoding feature corresponding to the first state information;
inputting the second coding feature into a second actor network to obtain a second decision result corresponding to the first state information;
inputting the second coding feature into a critic network to obtain a state value corresponding to the first state information, and determining an advantage value corresponding to the first decision result by using an advantage function.
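A hedged sketch of the critic and advantage computation follows: the linear critic head, the 64-dimensional coding features, and the TD(0) form of the advantage (r + gamma * V(s') - V(s)) are assumptions, since the claim does not fix the advantage function.

```python
# Hedged sketch of the critic network and advantage value: the linear critic
# head, the 64-dimensional coding features, and the TD(0) advantage form are
# assumptions; the claim itself does not specify them.
import torch
import torch.nn as nn

critic = nn.Linear(64, 1)                         # stand-in critic network
enc_first = torch.randn(1, 64)                    # coding feature of first state
enc_second = torch.randn(1, 64)                   # coding feature of second state
reward, gamma, done = torch.tensor([1.0]), 0.99, False

state_value = critic(enc_first).squeeze(-1)       # state value for first state
next_value = critic(enc_second).squeeze(-1)
advantage = reward + gamma * next_value * (1.0 - float(done)) - state_value
```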
6. The method of claim 5, wherein, after inputting the second coding feature into the critic network to obtain the state value corresponding to the first state information and determining the advantage value corresponding to the first decision result using the advantage function, the method further comprises:
determining a third loss value corresponding to the advantage value by using a value loss function;
and updating parameters of the critic network according to the third loss value.
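A minimal sketch of the critic update: the third loss value is assumed here to be a squared-error value-regression loss against a TD target, and Adam with the shown learning rate is an assumed optimizer choice; neither is specified by the claim.

```python
# Hedged sketch of the critic update: the value-regression form of the third
# loss value and the Adam optimizer with this learning rate are assumptions.
import torch
import torch.nn as nn

critic = nn.Linear(64, 1)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=3e-4)

coding_features = torch.randn(4, 64)   # second coding features (batch of 4)
td_targets = torch.randn(4)            # assumed reward + gamma * V(s') targets

third_loss = (critic(coding_features).squeeze(-1) - td_targets).pow(2).mean()
critic_optimizer.zero_grad()
third_loss.backward()
critic_optimizer.step()
```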
7. The method of any of claims 1-6, wherein the parameters of the second actor network are hard-updated from the parameters of the first actor network after a preset number of updates, and wherein the initialization parameters of the first actor network and the second actor network are the same.
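The hard update of claim 7 could be sketched as follows; the update interval, the linear network shape, and the surrounding training loop are assumptions:

```python
# Sketch of the hard update in claim 7: both actors share the same
# initialization, and the second actor copies the first actor's parameters
# verbatim every HARD_UPDATE_INTERVAL updates (no soft/Polyak averaging).
import torch.nn as nn

first_actor = nn.Linear(64, 3)
second_actor = nn.Linear(64, 3)
second_actor.load_state_dict(first_actor.state_dict())   # same initialization

HARD_UPDATE_INTERVAL = 100   # assumed preset number of updates
for update_step in range(1, 501):
    # ... gradient updates to first_actor would happen here ...
    if update_step % HARD_UPDATE_INTERVAL == 0:
        second_actor.load_state_dict(first_actor.state_dict())
```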
8. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
9. A training system for a target network for automatic driving, characterized in that the system performs the method of any one of claims 1 to 7, the system comprising a client and a server;
the client is deployed with a simulation environment processing module, a hierarchical graph neural network, and a first actor network, obtains training data based on first state information, a first decision result, second state information, a reward value, and a judgment result of whether a decision process is terminated, and sends the training data to the server; the server is deployed with a first actor network, a second actor network, a critic network, and a parameter updating module, wherein the parameter updating module is configured to update parameters of the hierarchical graph neural network, the first actor network, the second actor network, and the critic network according to the training data and to synchronize the updated network parameters to the client.
10. The system of claim 9, wherein the server communicates with the client via a remote procedure call technique.
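For claims 9 and 10, a minimal sketch of the client/server exchange using Python's standard-library XML-RPC purely as a stand-in for the unspecified remote procedure call technique; the function name, payload format, and returned parameter structure are all illustrative assumptions.

```python
# Sketch of the client/server exchange in claims 9-10, using stdlib XML-RPC as
# a stand-in for the unspecified remote procedure call technique. Names and
# payload structure are illustrative assumptions, not the patent's interface.
from xmlrpc.server import SimpleXMLRPCServer


def submit_training_data(batch):
    """Server side: receive training data from the client, run the parameter
    updating module (omitted), and return updated network parameters so the
    client can synchronize its hierarchical graph network and first actor."""
    updated_parameters = {"graph_net": [0.0], "first_actor": [0.0]}  # placeholder
    return updated_parameters


server = SimpleXMLRPCServer(("localhost", 8000), allow_none=True)
server.register_function(submit_training_data)
# server.serve_forever()  # commented out so the sketch terminates when run

# Client side would run as a separate process, e.g.:
#   import xmlrpc.client
#   proxy = xmlrpc.client.ServerProxy("http://localhost:8000", allow_none=True)
#   new_params = proxy.submit_training_data(batch)
```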
CN202310228539.1A 2023-03-08 2023-03-08 Training method, device and system for target network for automatic driving Pending CN116415627A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310228539.1A CN116415627A (en) 2023-03-08 2023-03-08 Training method, device and system for target network for automatic driving

Publications (1)

Publication Number Publication Date
CN116415627A true CN116415627A (en) 2023-07-11

Family

ID=87058962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310228539.1A Pending CN116415627A (en) 2023-03-08 2023-03-08 Training method, device and system for target network for automatic driving

Country Status (1)

Country Link
CN (1) CN116415627A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination