WO2021244207A1 - Method and apparatus for training a driving behavior decision model - Google Patents

Method and apparatus for training a driving behavior decision model

Info

Publication number: WO2021244207A1
Application number: PCT/CN2021/091964
Authority: WO (WIPO (PCT))
Prior art keywords: model, driving behavior, behavior decision, parameter, information
Other languages: English (en), French (fr)
Inventors: 何祥坤, 陈晨, 刘武龙
Original assignee: 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021244207A1


Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02 Control of position or course in two dimensions
    • G05D1/021 Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0212 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0214 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory in accordance with safety or protection criteria, e.g. avoiding hazardous areas
    • G05D1/0221 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D1/0223 Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving speed control of the vehicle
    • G05D1/0231 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
    • G05D1/0238 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors
    • G05D1/024 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using obstacle or wall sensors in combination with a laser
    • G05D1/0242 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using non-visible light signals, e.g. IR or UV signals
    • G05D1/0246 Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
    • G05D1/0257 Control of position or course in two dimensions specially adapted to land vehicles using a radar
    • G05D1/0259 Control of position or course in two dimensions specially adapted to land vehicles using magnetic or electromagnetic means
    • G05D1/0263 Control of position or course in two dimensions specially adapted to land vehicles using magnetic or electromagnetic means using magnetic strips
    • G05D1/0276 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
    • G05D1/0278 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle using satellite positioning signals, e.g. GPS
    • G05D1/028 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle using a RF signal
    • G05D1/0285 Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle using signals transmitted via a public communication network, e.g. GSM network
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks

Definitions

  • This application relates to the field of automatic driving, and more specifically, to a method and device for training a driving behavior decision model.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Autonomous driving is a mainstream application in the field of artificial intelligence.
  • Autonomous driving technology relies on the collaboration of computer vision, radar, monitoring devices, and global positioning systems to allow motor vehicles to drive autonomously without active human operation.
  • Self-driving vehicles use various computing systems to help transport passengers from one location to another. Some autonomous vehicles may require some initial input or continuous input from an operator (such as a navigator, driver, or passenger). The self-driving vehicle allows the operator to switch from the manual mode to the automatic driving mode or a mode in between. Since autonomous driving technology does not require humans to drive motor vehicles, it can theoretically effectively avoid human driving errors, reduce the occurrence of traffic accidents, and improve the efficiency of highway transportation. Therefore, more and more attention is paid to autonomous driving technology.
  • Driving behavior decision-making is an important part of automatic driving technology, which specifically includes selecting an action to be performed for the vehicle (for example, acceleration, deceleration, or steering) according to the state information of the vehicle, and controlling the vehicle according to the selected action to be performed.
  • Driving behavior decisions are usually inferred from driving behavior decision models.
  • Commonly used driving behavior decision models are obtained through reinforcement learning training. However, the training efficiency of existing driving behavior decision models trained through the reinforcement learning method is low.
  • The present application provides a method and device for training a driving behavior decision model, which helps to improve the training efficiency of the driving behavior decision model.
  • In a first aspect, a method for training a driving behavior decision model is provided, which includes: using the driving behavior decision model to make a decision based on state information of a vehicle to obtain driving behavior decision information; sending the driving behavior decision information to a server; receiving a first parameter of an imitation learning model sent by the server, where the first parameter is obtained by the server after training the imitation learning model based on an imitation learning method and using the driving behavior decision information; and adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter.
  • The imitation learning method is a common supervised learning method.
  • During training, the supervised learning method can use the true value (or label) to calculate the loss value of the model (for example, the driving behavior decision model), and use the calculated loss value to adjust the parameters of the model. Therefore, the learning efficiency of the supervised learning method is relatively high.
  • In other words, the supervised learning method can often obtain a model that meets the needs of users in a relatively short time. At the same time, because the true value participates in the training process, a model trained based on the supervised learning method is often also more reliable.
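  • As a hedged illustration of the supervised (imitation) update described above, the sketch below assumes a simple linear decision model trained with a mean-squared-error loss against expert labels; the model form, loss, and learning rate are illustrative assumptions rather than the architecture claimed in this application.

```python
import numpy as np

def imitation_update(weights, states, expert_actions, lr=0.01):
    """One supervised-learning step: compare model output with the
    'true value' (expert/label actions), compute a loss, and use its
    gradient to adjust the model parameters.

    Assumes a linear model action = states @ weights (illustrative only).
    """
    predictions = states @ weights                  # model decisions
    errors = predictions - expert_actions           # difference from labels
    loss = np.mean(errors ** 2)                     # scalar loss value
    grad = 2.0 * states.T @ errors / len(states)    # gradient of MSE w.r.t. weights
    return weights - lr * grad, loss

# Hypothetical usage with random data standing in for logged decisions.
rng = np.random.default_rng(0)
states = rng.normal(size=(32, 4))          # 32 state vectors, 4 features each
expert_actions = rng.normal(size=(32, 2))  # 2-dimensional action labels
weights = np.zeros((4, 2))
weights, loss = imitation_update(weights, states, expert_actions)
```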
  • Since the first parameter is obtained by the server after training the imitation learning model based on the imitation learning method and using the driving behavior decision information, the imitation learning method can ensure the training effect of the imitation learning model. Therefore, adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter can improve the learning efficiency of the driving behavior decision model.
  • The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
  • In some implementations, adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter includes: adjusting, based on a reinforcement learning method, the parameters of the driving behavior decision model according to the driving behavior decision information to obtain a second parameter; and adjusting the second parameter of the driving behavior decision model according to the first parameter.
  • In this way, the parameters of the driving behavior decision model can be adjusted based on the reinforcement learning method to obtain the second parameter, and the second parameter can then be adjusted according to the first parameter, so that the driving behavior decision model has both an online learning capability and an offline learning capability; that is, on the premise that the driving behavior decision model has an online learning capability, its learning efficiency can be further improved.
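  • The application combines an online reinforcement-learning adjustment (yielding the second parameter) with the server-provided first parameter, but this excerpt does not fix the combination rule. The sketch below shows one plausible arrangement, an RL-style step followed by a convex blend with the first parameter; both the update form and the mixing coefficient are assumptions.

```python
import numpy as np

def rl_step(params, grad_estimate, lr=0.01):
    """Online reinforcement-learning adjustment: nudges the current
    parameters along a policy-gradient-style estimate to obtain the
    'second parameter'. The gradient estimate itself is assumed given."""
    return params + lr * grad_estimate

def apply_first_parameter(second_param, first_param, mix=0.1):
    """Adjust the RL-updated (second) parameter with the server-side
    imitation-learning (first) parameter. A convex blend is only one
    plausible rule; the application does not fix a specific formula."""
    return (1.0 - mix) * second_param + mix * first_param

# Hypothetical shapes for a small decision network's weight vector.
rng = np.random.default_rng(1)
params = rng.normal(size=8)
second_param = rl_step(params, grad_estimate=rng.normal(size=8))
first_param = rng.normal(size=8)          # received from the server
params = apply_first_parameter(second_param, first_param)
```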
  • In some implementations, the driving behavior decision model includes a first model and a second model. Adjusting the parameters of the driving behavior decision model according to the driving behavior decision information based on the reinforcement learning method to obtain the second parameter includes: adjusting the parameters of the first model according to the driving behavior decision information based on the reinforcement learning method to obtain the second parameter; and, when a first preset condition is satisfied, updating the parameters of the second model to the second parameter, where the first preset condition is a preset time interval or a preset number of adjustments of the parameters of the first model.
  • Updating the parameters of the second model to the second parameter only when the first preset condition is satisfied avoids frequent adjustment of the parameters of the second model, which would otherwise make the output of the second model unstable; therefore, the reliability of the driving behavior decision information can be improved.
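  • The first-model/second-model arrangement above resembles an online/target split. As a minimal sketch under that assumption, the toy class below adjusts the first model at every step and copies its parameters into the second model only after a preset number of adjustments (the first preset condition); the names and sync interval are illustrative.

```python
import copy

class DecisionModel:
    """Toy stand-in for the driving behavior decision model, holding a
    'first model' that is adjusted online and a 'second model' whose
    parameters are only refreshed when the first preset condition holds."""

    def __init__(self, params, sync_every=100):
        self.first_params = list(params)     # adjusted by RL at every step
        self.second_params = list(params)    # used for actual decisions
        self.sync_every = sync_every         # preset number of adjustments
        self.adjust_count = 0

    def adjust_first(self, delta):
        self.first_params = [p + d for p, d in zip(self.first_params, delta)]
        self.adjust_count += 1
        if self.adjust_count % self.sync_every == 0:   # first preset condition
            self.second_params = copy.deepcopy(self.first_params)

model = DecisionModel(params=[0.0, 0.0, 0.0], sync_every=2)
model.adjust_first([0.1, 0.0, -0.1])
model.adjust_first([0.1, 0.0, -0.1])   # second model refreshed here
```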
  • In some implementations, adjusting the second parameter of the driving behavior decision model according to the first parameter includes: adjusting the parameters of the first model and/or the parameters of the second model according to the first parameter.
  • In this way, the parameters of at least one of the first model and the second model can be flexibly adjusted according to the first parameter.
  • In some implementations, using the driving behavior decision model to make a decision based on the state information to obtain the driving behavior decision information includes: predicting, based on the dynamics model and the kinematics model of the vehicle, the driving behavior of the vehicle at one or more future moments according to the state information to obtain all possible driving behaviors at the one or more moments; and evaluating all the possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information.
  • In this way, the driving behavior decision is made in combination with the dynamics model and the kinematics model of the vehicle, which can improve the rationality of the driving behavior decision information.
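  • As a rough illustration of predicting candidate driving behaviors with a vehicle kinematics model and then scoring them, the following sketch uses a toy kinematic update and a hand-written scoring function in place of the learned decision model; the state variables, step sizes, candidate set, and scoring rule are all illustrative assumptions.

```python
import math

def rollout(state, accel, steer_rate, dt=0.5, steps=4):
    """Predict future states with a simple kinematic bicycle-style model.
    state = (x, y, heading, speed); this stands in for the vehicle
    dynamics/kinematics models of the embodiment."""
    x, y, heading, speed = state
    trajectory = []
    for _ in range(steps):
        speed = max(0.0, speed + accel * dt)
        heading += steer_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        trajectory.append((x, y, heading, speed))
    return trajectory

def evaluate(trajectory, target_speed=10.0):
    """Placeholder for the learned decision model's evaluation: penalise
    deviation from a target speed and lateral offset."""
    _, last_y, _, last_speed = trajectory[-1]
    return -abs(last_speed - target_speed) - abs(last_y)

state = (0.0, 0.0, 0.0, 8.0)
candidates = [(a, s) for a in (-2.0, 0.0, 2.0) for s in (-0.1, 0.0, 0.1)]
best = max(candidates, key=lambda c: evaluate(rollout(state, *c)))
print("chosen (acceleration, steering rate):", best)
```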
  • In some implementations, evaluating all the possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information includes: evaluating all the possible driving behaviors using the second model to obtain the driving behavior decision information.
  • Since the parameters of the first model change relatively frequently, using the second model to make decisions can improve the reliability of the driving behavior decision information.
  • In some implementations, the method further includes: receiving a third parameter of the imitation learning model sent by the server, where the third parameter is obtained after the imitation learning model is trained based on the imitation learning method and using data output by a decision expert system, and the decision expert system is designed according to the driving data of drivers and the dynamic characteristics of the vehicle; and determining the initial parameters of the driving behavior decision model according to the third parameter.
  • Determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent the driving behavior decision model from outputting risky (or unreasonable) driving behavior decision information.
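  • A minimal sketch of initializing the driving behavior decision model from the third parameter of the pre-trained imitation learning model: the received parameters are simply copied in as the initial values rather than using a random initialization. The dictionary layout is a hypothetical example.

```python
import numpy as np

def init_decision_model(third_parameter):
    """Use the parameters of the pre-trained imitation learning model
    (the 'third parameter') as the initial parameters of the driving
    behavior decision model, instead of a random initialisation."""
    return {name: np.array(value, copy=True)
            for name, value in third_parameter.items()}

# Hypothetical parameter dictionary as it might arrive from the server.
third_parameter = {"w1": np.ones((4, 8)), "b1": np.zeros(8)}
decision_model_params = init_decision_model(third_parameter)
```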
  • In some implementations, the first parameter is obtained after the server trains the imitation learning model based on the imitation learning method and using driving behavior decision information that satisfies a second preset condition, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • using the reasonable driving behavior decision-making corresponding to the state information to train the imitation learning model can further improve the training effect of the imitation learning model, thereby further improving the learning efficiency of the driving behavior decision-making model.
  • the second preset condition further includes that the noise of the state information is within a first preset range.
  • When the noise of the state information is within the first preset range, the driving behavior decision information obtained after making a decision based on the state information is more reasonable; training the imitation learning model based on such driving behavior decision information can further improve the training effect of the imitation learning model, so that the learning efficiency of the driving behavior decision model can be further improved.
  • In some implementations, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenarios.
  • Acquiring the state information in the foregoing multiple scenarios makes the training data of the driving behavior decision model (for example, the driving behavior decision information obtained after making decisions based on the state information) cover richer scenarios; training the imitation learning model with such driving behavior decision information can further improve the training effect of the imitation learning model, thereby helping to further improve the learning efficiency of the driving behavior decision model.
  • In some implementations, the second preset condition further includes that, among the plurality of pieces of state information, the difference between the quantity of state information acquired in any one of the plurality of scenarios and the quantity of state information acquired in any other of the plurality of scenarios is within a second preset range.
  • When the difference between the quantity of state information acquired in any one scenario and the quantity acquired in any other scenario is within the second preset range, the amount of training data obtained in each scenario (for example, the driving behavior decision information obtained after making decisions based on the state information) is more balanced; training the imitation learning model based on such driving behavior decision information can ensure the training effect of the imitation learning model, thereby avoiding overfitting of the driving behavior decision model to a certain scenario.
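  • A minimal sketch of how logged decision records might be screened against the second preset condition before server-side training: keep only decisions flagged reasonable, whose state noise falls within the first preset range, and whose per-scenario counts stay roughly balanced. The field names (reasonable, state_noise, scenario) and thresholds are hypothetical.

```python
from collections import defaultdict

def select_training_samples(samples, noise_limit=0.2, max_gap=50):
    """Filter decision records: keep reasonable decisions with bounded
    state noise, then cap per-scenario counts so no scenario exceeds the
    smallest one by more than max_gap samples (the balance condition)."""
    filtered = [s for s in samples
                if s["reasonable"] and s["state_noise"] <= noise_limit]

    by_scene = defaultdict(list)
    for s in filtered:
        by_scene[s["scenario"]].append(s)

    if not by_scene:
        return []
    cap = min(len(v) for v in by_scene.values()) + max_gap
    balanced = []
    for scene_samples in by_scene.values():
        balanced.extend(scene_samples[:cap])
    return balanced

samples = [{"scenario": "highway", "reasonable": True, "state_noise": 0.05},
           {"scenario": "urban", "reasonable": False, "state_noise": 0.01},
           {"scenario": "urban", "reasonable": True, "state_noise": 0.50}]
print(len(select_training_samples(samples)))   # only the first record survives
```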
  • In a second aspect, a method for training a driving behavior decision model is provided, which includes:
  • receiving driving behavior decision information sent by a vehicle, where the driving behavior decision information is obtained after the vehicle uses the driving behavior decision model to make a decision based on the state information of the vehicle; training, based on an imitation learning method, an imitation learning model according to the driving behavior decision information to obtain a first parameter of the imitation learning model, where the first parameter is used to adjust the parameters of the driving behavior decision model; and sending the first parameter to the vehicle.
  • The imitation learning method is a common supervised learning method.
  • During training, the supervised learning method can use the true value (or label) to calculate the loss value of the model (for example, the driving behavior decision model), and use the calculated loss value to adjust the parameters of the model. Therefore, the learning efficiency of the supervised learning method is relatively high.
  • In other words, the supervised learning method can often obtain a model that meets the needs of users in a relatively short time. At the same time, because the true value participates in the training process, a model trained based on the supervised learning method is often also more reliable.
  • Training the imitation learning model according to the driving behavior decision information to obtain the first parameter of the imitation learning model ensures the training effect of the imitation learning model based on the imitation learning method; therefore, the learning efficiency of the driving behavior decision model can be improved.
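  • To make the server-side flow of the second aspect concrete, here is a minimal sketch of one training round: the server takes a batch of uploaded driving behavior decision information, performs a supervised (imitation) update, and returns the resulting first parameter for the vehicle. The linear model, MSE loss, and field names are assumptions for illustration.

```python
import numpy as np

def server_training_round(decision_batch, model_weights, lr=0.05):
    """Server-side round: run an imitation (supervised) update of the
    server-side imitation learning model on the uploaded decision batch
    and return the updated weights as the 'first parameter'."""
    states = np.array([d["state"] for d in decision_batch])
    actions = np.array([d["action"] for d in decision_batch])
    errors = states @ model_weights - actions
    grad = 2.0 * states.T @ errors / len(states)
    first_parameter = model_weights - lr * grad
    return first_parameter

batch = [{"state": [1.0, 0.0], "action": [0.5]},
         {"state": [0.0, 1.0], "action": [-0.5]}]
weights = np.zeros((2, 1))
first_parameter = server_training_round(batch, weights)   # sent to the vehicle
```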
  • The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
  • In some implementations, the method further includes: training the imitation learning model based on the imitation learning method and using data output by a decision expert system to obtain a third parameter of the imitation learning model, where the third parameter is used to determine the initial parameters of the driving behavior decision model, and the decision expert system is designed according to the driving data of drivers and the dynamic characteristics of the vehicle; and sending the third parameter to the vehicle.
  • Determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent the driving behavior decision model from outputting risky (or unreasonable) driving behavior decision information.
  • In some implementations, training the imitation learning model based on the imitation learning method according to the driving behavior decision information to obtain the first parameter of the imitation learning model includes: training the imitation learning model based on the imitation learning method according to driving behavior decision information that satisfies a second preset condition to obtain the first parameter of the imitation learning model, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • using the reasonable driving behavior decision-making corresponding to the state information to train the imitation learning model can further improve the training effect of the imitation learning model, thereby further improving the learning efficiency of the driving behavior decision-making model.
  • the second preset condition further includes that the noise of the state information is within a first preset range.
  • When the noise of the state information is within the first preset range, the driving behavior decision information obtained after making a decision based on the state information is more reasonable; training the imitation learning model based on such driving behavior decision information can further improve the training effect of the imitation learning model, so that the learning efficiency of the driving behavior decision model can be further improved.
  • In some implementations, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenarios.
  • Acquiring the state information in the foregoing multiple scenarios makes the training data of the driving behavior decision model (for example, the driving behavior decision information obtained after making decisions based on the state information) cover richer scenarios; training the imitation learning model with such driving behavior decision information can further improve the training effect of the imitation learning model, thereby helping to further improve the learning efficiency of the driving behavior decision model.
  • In some implementations, the second preset condition further includes that, among the plurality of pieces of state information, the difference between the quantity of state information acquired in any one of the plurality of scenarios and the quantity of state information acquired in any other of the plurality of scenarios is within a second preset range.
  • When the difference between the quantity of state information acquired in any one scenario and the quantity acquired in any other scenario is within the second preset range, the amount of training data obtained in each scenario (for example, the driving behavior decision information obtained after making decisions based on the state information) is more balanced; training the imitation learning model based on such driving behavior decision information can ensure the training effect of the imitation learning model, thereby avoiding overfitting of the driving behavior decision model to a certain scenario.
  • In a third aspect, a device for training a driving behavior decision model is provided, which includes:
  • a decision-making unit, configured to use the driving behavior decision model to make a decision based on the state information of the vehicle to obtain driving behavior decision information; a sending unit, configured to send the driving behavior decision information to a server; a receiving unit, configured to receive a first parameter of an imitation learning model sent by the server, where the first parameter is obtained by the server after training the imitation learning model based on an imitation learning method and using the driving behavior decision information; and an adjustment unit, configured to adjust the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter.
  • The imitation learning method is a common supervised learning method.
  • During training, the supervised learning method can use the true value (or label) to calculate the loss value of the model (for example, the driving behavior decision model), and use the calculated loss value to adjust the parameters of the model. Therefore, the learning efficiency of the supervised learning method is relatively high.
  • In other words, the supervised learning method can often obtain a model that meets the needs of users in a relatively short time. At the same time, because the true value participates in the training process, a model trained based on the supervised learning method is often also more reliable.
  • Since the first parameter is obtained by the server after training the imitation learning model based on the imitation learning method and using the driving behavior decision information, the imitation learning method can ensure the training effect of the imitation learning model. Therefore, adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter can improve the learning efficiency of the driving behavior decision model.
  • The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
  • In some implementations, the adjustment unit is specifically configured to: adjust, based on a reinforcement learning method, the parameters of the driving behavior decision model according to the driving behavior decision information to obtain a second parameter; and adjust the second parameter of the driving behavior decision model according to the first parameter.
  • In this way, the parameters of the driving behavior decision model can be adjusted based on the reinforcement learning method to obtain the second parameter, and the second parameter can then be adjusted according to the first parameter, so that the driving behavior decision model has both an online learning capability and an offline learning capability; that is, on the premise that the driving behavior decision model has an online learning capability, its learning efficiency can be further improved.
  • In some implementations, the driving behavior decision model includes a first model and a second model. The adjustment unit is specifically configured to: adjust, based on the reinforcement learning method, the parameters of the first model according to the driving behavior decision information to obtain the second parameter; and, when a first preset condition is satisfied, update the parameters of the second model to the second parameter, where the first preset condition is a preset time interval or a preset number of adjustments of the parameters of the first model.
  • Updating the parameters of the second model to the second parameter only when the first preset condition is satisfied avoids frequent adjustment of the parameters of the second model, which would otherwise make the output of the second model unstable; therefore, the reliability of the driving behavior decision information can be improved.
  • the adjustment unit is specifically configured to adjust the parameters of the first model and/or the parameters of the second model according to the first parameters.
  • the parameter of at least one of the first model and the second model can be flexibly adjusted according to the first parameter.
  • In some implementations, the decision-making unit is specifically configured to: predict, based on the dynamics model and the kinematics model of the vehicle, the driving behavior of the vehicle at one or more future moments according to the state information to obtain all possible driving behaviors at the one or more moments; and evaluate all the possible driving behaviors using the driving behavior decision model to obtain the driving behavior decision information.
  • In this way, the driving behavior decision is made in combination with the dynamics model and the kinematics model of the vehicle, which can improve the rationality of the driving behavior decision information.
  • In some implementations, the decision-making unit is specifically configured to evaluate all the possible driving behaviors using the second model to obtain the driving behavior decision information.
  • the parameters of the first model change relatively frequently, and the use of the second model to make decisions can improve the reliability of the driving behavior decision information.
  • In some implementations, the receiving unit is further configured to receive a third parameter of the imitation learning model sent by the server, where the third parameter is obtained after the imitation learning model is trained based on the imitation learning method and using data output by a decision expert system, and the decision expert system is designed according to the driving data of drivers and the dynamic characteristics of the vehicle; and the adjustment unit is further configured to determine the initial parameters of the driving behavior decision model according to the third parameter.
  • Determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent the driving behavior decision model from outputting risky (or unreasonable) driving behavior decision information.
  • In some implementations, the first parameter is obtained after the server trains the imitation learning model based on the imitation learning method and using driving behavior decision information that satisfies a second preset condition, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • using the reasonable driving behavior decision-making corresponding to the state information to train the imitation learning model can further improve the training effect of the imitation learning model, thereby further improving the learning efficiency of the driving behavior decision-making model.
  • the second preset condition further includes that the noise of the state information is within a first preset range.
  • When the noise of the state information is within the first preset range, the driving behavior decision information obtained after making a decision based on the state information is more reasonable; training the imitation learning model based on such driving behavior decision information can further improve the training effect of the imitation learning model, so that the learning efficiency of the driving behavior decision model can be further improved.
  • In some implementations, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenarios.
  • Acquiring the state information in the foregoing multiple scenarios makes the training data of the driving behavior decision model (for example, the driving behavior decision information obtained after making decisions based on the state information) cover richer scenarios; training the imitation learning model with such driving behavior decision information can further improve the training effect of the imitation learning model, thereby helping to further improve the learning efficiency of the driving behavior decision model.
  • In some implementations, the second preset condition further includes that, among the plurality of pieces of state information, the difference between the quantity of state information acquired in any one of the plurality of scenarios and the quantity of state information acquired in any other of the plurality of scenarios is within a second preset range.
  • When the difference between the quantity of state information acquired in any one scenario and the quantity acquired in any other scenario is within the second preset range, the amount of training data obtained in each scenario (for example, the driving behavior decision information obtained after making decisions based on the state information) is more balanced; training the imitation learning model based on such driving behavior decision information can ensure the training effect of the imitation learning model, thereby avoiding overfitting of the driving behavior decision model to a certain scenario.
  • In a fourth aspect, a device for training a driving behavior decision model is provided, which includes:
  • a receiving unit, configured to receive driving behavior decision information sent by a vehicle, where the driving behavior decision information is obtained after the vehicle uses the driving behavior decision model to make a decision based on the state information of the vehicle; a training unit, configured to train, based on an imitation learning method, an imitation learning model according to the driving behavior decision information to obtain a first parameter of the imitation learning model, where the first parameter is used to adjust the parameters of the driving behavior decision model; and a sending unit, configured to send the first parameter to the vehicle.
  • The imitation learning method is a common supervised learning method.
  • During training, the supervised learning method can use the true value (or label) to calculate the loss value of the model (for example, the driving behavior decision model), and use the calculated loss value to adjust the parameters of the model. Therefore, the learning efficiency of the supervised learning method is relatively high.
  • In other words, the supervised learning method can often obtain a model that meets the needs of users in a relatively short time. At the same time, because the true value participates in the training process, a model trained based on the supervised learning method is often also more reliable.
  • Training the imitation learning model according to the driving behavior decision information to obtain the first parameter of the imitation learning model ensures the training effect of the imitation learning model based on the imitation learning method; therefore, the learning efficiency of the driving behavior decision model can be improved.
  • The imitation learning method may include supervised learning, generative adversarial networks (GAN), inverse reinforcement learning (IRL), and the like.
  • In some implementations, the training unit is further configured to train the imitation learning model based on the imitation learning method and using data output by a decision expert system to obtain a third parameter of the imitation learning model, where the third parameter is used to determine the initial parameters of the driving behavior decision model, and the decision expert system is designed according to the driving data of drivers and the dynamic characteristics of the vehicle; and the sending unit is further configured to send the third parameter to the vehicle.
  • Determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent the driving behavior decision model from outputting risky (or unreasonable) driving behavior decision information.
  • In some implementations, the training unit is specifically configured to train the imitation learning model based on the imitation learning method according to driving behavior decision information that satisfies a second preset condition to obtain the first parameter of the imitation learning model, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • using the reasonable driving behavior decision-making corresponding to the state information to train the imitation learning model can further improve the training effect of the imitation learning model, thereby further improving the learning efficiency of the driving behavior decision-making model.
  • the second preset condition further includes that the noise of the state information is within a first preset range.
  • When the noise of the state information is within the first preset range, the driving behavior decision information obtained after making a decision based on the state information is more reasonable; training the imitation learning model based on such driving behavior decision information can further improve the training effect of the imitation learning model, so that the learning efficiency of the driving behavior decision model can be further improved.
  • In some implementations, the state information is one of a plurality of pieces of state information, and the second preset condition further includes that the plurality of pieces of state information are acquired in a plurality of scenarios.
  • Acquiring the state information in the foregoing multiple scenarios makes the training data of the driving behavior decision model (for example, the driving behavior decision information obtained after making decisions based on the state information) cover richer scenarios; training the imitation learning model with such driving behavior decision information can further improve the training effect of the imitation learning model, thereby helping to further improve the learning efficiency of the driving behavior decision model.
  • In some implementations, the second preset condition further includes that, among the plurality of pieces of state information, the difference between the quantity of state information acquired in any one of the plurality of scenarios and the quantity of state information acquired in any other of the plurality of scenarios is within a second preset range.
  • When the difference between the quantity of state information acquired in any one scenario and the quantity acquired in any other scenario is within the second preset range, the amount of training data obtained in each scenario (for example, the driving behavior decision information obtained after making decisions based on the state information) is more balanced; training the imitation learning model based on such driving behavior decision information can ensure the training effect of the imitation learning model, thereby avoiding overfitting of the driving behavior decision model to a certain scenario.
  • In a fifth aspect, a device for training a driving behavior decision model is provided, which includes a storage medium and a central processing unit.
  • the storage medium may be a non-volatile storage medium, and a computer executable program is stored in the storage medium.
  • the central processing unit is connected to the non-volatile storage medium, and executes the computer executable program to implement the method in any possible implementation manner of the first aspect.
  • In a sixth aspect, a device for training a driving behavior decision model is provided, which includes a storage medium and a central processing unit.
  • the storage medium may be a non-volatile storage medium, and a computer executable program is stored in the storage medium.
  • the central processing unit is connected to the non-volatile storage medium, and executes the computer executable program to implement the method in any possible implementation manner of the second aspect.
  • In a seventh aspect, a chip is provided, which includes a processor and a data interface.
  • The processor reads instructions stored in a memory through the data interface, and executes the method in any possible implementation manner of the first aspect or any possible implementation manner of the second aspect.
  • the chip may further include a memory in which instructions are stored, and the processor is configured to execute instructions stored on the memory.
  • the processor is configured to execute any possible implementation manner of the first aspect or a method in any possible implementation manner of the second aspect.
  • In an eighth aspect, a computer-readable storage medium is provided, which stores program code for execution by a device, where the program code includes instructions for executing the method in any possible implementation manner of the first aspect or any possible implementation manner of the second aspect.
  • In a ninth aspect, a vehicle is provided, which includes the device for training a driving behavior decision model described in any possible implementation manner of the third aspect or in the fifth aspect.
  • In a tenth aspect, a server is provided, which includes the device for training a driving behavior decision model described in any possible implementation manner of the fourth aspect or in the sixth aspect.
  • Since the first parameter is obtained by the server after training the imitation learning model based on the imitation learning method and using the driving behavior decision information, the imitation learning method can ensure the training effect of the imitation learning model. Therefore, adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter can improve the learning efficiency of the driving behavior decision model.
  • FIG. 1 is a schematic structural diagram of an automatic driving vehicle provided by an embodiment of this application;
  • FIG. 2 is a schematic structural diagram of an automatic driving system provided by an embodiment of this application;
  • FIG. 3 is a schematic structural diagram of a neural network processor provided by an embodiment of this application;
  • FIG. 4 is a schematic diagram of the application of a cloud-side command automatic driving vehicle provided by an embodiment of this application;
  • FIG. 5 is a schematic block diagram of a method for training a driving behavior decision model provided by an embodiment of this application;
  • FIG. 6 is a schematic block diagram of a method for training a driving behavior decision model provided by another embodiment of this application;
  • FIG. 7 is a schematic block diagram of a method for training a driving behavior decision model provided by another embodiment of this application;
  • FIG. 8 is a schematic flowchart of a method for training a driving behavior decision model provided by an embodiment of this application;
  • FIG. 9 is a schematic block diagram of an RBFNN provided by an embodiment of this application;
  • FIG. 10 is a schematic block diagram of an apparatus for training a driving behavior decision model provided by an embodiment of this application;
  • FIG. 11 is a schematic block diagram of an apparatus for training a driving behavior decision model provided by another embodiment of this application;
  • FIG. 12 is a schematic block diagram of an apparatus for training a driving behavior decision model provided by still another embodiment of this application.
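  • FIG. 9 refers to an RBFNN (radial basis function neural network). The detailed embodiment is not included in this excerpt, so the sketch below only shows a generic RBFNN forward pass, Gaussian hidden units followed by a linear output layer, and should not be taken as the exact network used by the decision model; all dimensions are hypothetical.

```python
import numpy as np

def rbfnn_forward(x, centers, widths, output_weights):
    """Generic radial basis function network: hidden activations are
    Gaussian kernels around learned centers, and the output is a linear
    combination of those activations."""
    dists = np.linalg.norm(x[None, :] - centers, axis=1)   # distance to each center
    hidden = np.exp(-(dists ** 2) / (2.0 * widths ** 2))   # Gaussian basis outputs
    return hidden @ output_weights                          # linear output layer

# Hypothetical dimensions: 4-dimensional state, 6 hidden units, 2 outputs.
rng = np.random.default_rng(2)
centers = rng.normal(size=(6, 4))
widths = np.ones(6)
output_weights = rng.normal(size=(6, 2))
print(rbfnn_forward(rng.normal(size=4), centers, widths, output_weights))
```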
  • the technical solutions of the embodiments of the present application can be applied to various vehicles.
  • The vehicle may specifically be a diesel-powered vehicle, a smart electric vehicle, or a hybrid vehicle, or the vehicle may be a vehicle of another power type, which is not limited in the embodiments of the present application.
  • the vehicle in the embodiment of the present application may be an autonomous driving vehicle.
  • the autonomous driving vehicle may be configured with an automatic driving mode, and the automatic driving mode may be a fully automatic driving mode, or may also be a partially automatic driving mode.
  • the embodiment is not limited to this.
  • the vehicle in the embodiment of the present application may also be configured with other driving modes, and the other driving modes may include one or more of multiple driving modes such as sports mode, economy mode, standard mode, snow mode, and hill climbing mode.
  • The automatic driving vehicle can switch between the automatic driving mode and the above-mentioned driving modes (in which the driver drives the vehicle), which is not limited in the embodiments of the present application.
  • FIG. 1 is a functional block diagram of a vehicle 100 provided by an embodiment of the present application.
  • the vehicle 100 is configured in a fully or partially autonomous driving mode.
  • The vehicle 100 can control itself while in the automatic driving mode, and can determine the current state of the vehicle and its surrounding environment through human operation, determine the possible behavior of at least one other vehicle in the surrounding environment, determine the confidence level corresponding to the possibility that the other vehicle performs the possible behavior, and control the vehicle 100 based on the determined information.
  • When the vehicle 100 is in the automatic driving mode, the vehicle 100 can be set to operate without human interaction.
  • the vehicle 100 may include various subsystems, such as a travel system 102, a sensor system 104, a control system 106, one or more peripheral devices 108 and a power supply 110, a computer system 112, and a user interface 116.
  • the vehicle 100 may include more or fewer subsystems, and each subsystem may include multiple elements.
  • each of the subsystems and elements of the vehicle 100 may be wired or wirelessly interconnected.
  • The travel system 102 may include components that provide powered movement for the vehicle 100.
  • the propulsion system 102 may include an engine 118, an energy source 119, a transmission 120, and wheels/tires 121.
  • The engine 118 may be an internal combustion engine, an electric motor, an air compression engine, or a combination of other types of engines, for example, a hybrid engine composed of a gasoline engine and an electric motor, or a hybrid engine composed of an internal combustion engine and an air compression engine.
  • the engine 118 converts the energy source 119 into mechanical energy.
  • Examples of energy sources 119 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electricity.
  • the energy source 119 may also provide energy for other systems of the vehicle 100.
  • the transmission device 120 can transmit mechanical power from the engine 118 to the wheels 121.
  • the transmission device 120 may include a gearbox, a differential, and a drive shaft.
  • the transmission device 120 may further include other devices, such as a clutch.
  • the drive shaft may include one or more shafts that can be coupled to one or more wheels 121.
  • the sensor system 104 may include several sensors that sense information about the environment around the vehicle 100.
  • the sensor system 104 may include a positioning system 122 (the positioning system may be a GPS system, a Beidou system or other positioning systems), an inertial measurement unit (IMU) 124, a radar 126, a laser rangefinder 128, and Camera 130.
  • the sensor system 104 may also include sensors of the internal system of the monitored vehicle 100 (for example, an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge, etc.). Sensor data from one or more of these sensors can be used to detect objects and their corresponding characteristics (position, shape, direction, speed, etc.). Such detection and identification are key functions for the safe operation of the autonomous vehicle 100.
  • the positioning system 122 can be used to estimate the geographic location of the vehicle 100.
  • the IMU 124 is used to sense changes in the position and orientation of the vehicle 100 based on inertial acceleration.
  • the IMU 124 may be a combination of an accelerometer and a gyroscope.
  • the radar 126 may use radio signals to sense objects in the surrounding environment of the vehicle 100. In some embodiments, in addition to sensing the object, the radar 126 may also be used to sense the speed and/or direction of the object.
  • the laser rangefinder 128 can use laser light to sense objects in the environment where the vehicle 100 is located.
  • the laser rangefinder 128 may include one or more laser sources, laser scanners, and one or more detectors, as well as other system components.
  • the camera 130 may be used to capture multiple images of the surrounding environment of the vehicle 100.
  • the camera 130 may be a still camera or a video camera.
  • the control system 106 controls the operation of the vehicle 100 and its components.
  • the control system 106 may include various components, including a steering system 132, a throttle 134, a braking unit 136, a sensor fusion algorithm 138, a computer vision system 140, a route control system 142, and an obstacle avoidance system 144.
  • the steering system 132 is operable to adjust the forward direction of the vehicle 100.
  • it may be a steering wheel system.
  • the throttle 134 is used to control the operating speed of the engine 118 and thereby control the speed of the vehicle 100.
  • the braking unit 136 is used to control the vehicle 100 to decelerate.
  • the braking unit 136 may use friction to slow down the wheels 121.
  • the braking unit 136 may convert the kinetic energy of the wheels 121 into electric current.
  • the braking unit 136 may also take other forms to slow down the rotation speed of the wheels 121 to control the speed of the vehicle 100.
  • the computer vision system 140 may be operable to process and analyze the images captured by the camera 130 in order to identify objects and/or features in the surrounding environment of the vehicle 100.
  • the objects and/or features may include traffic signals, road boundaries, and obstacles.
  • the computer vision system 140 may use object recognition algorithms, Structure from Motion (SFM) algorithms, video tracking, and other computer vision technologies.
  • the computer vision system 140 may be used to map the environment, track objects, estimate the speed of objects, and so on.
  • the route control system 142 is used to determine the travel route of the vehicle 100.
  • The route control system 142 may combine data from the sensor fusion algorithm 138, the GPS 122, and one or more predetermined maps to determine the driving route for the vehicle 100.
  • the obstacle avoidance system 144 is used to identify, evaluate and avoid or otherwise cross over potential obstacles in the environment of the vehicle 100.
  • Of course, the control system 106 may additionally or alternatively include components other than those shown and described, or some of the components shown above may be omitted.
  • the vehicle 100 interacts with external sensors, other vehicles, other computer systems, or users through peripheral devices 108.
  • the peripheral device 108 may include a wireless communication system 146, an onboard computer 148, a microphone 150, and/or a speaker 152.
  • the peripheral device 108 provides a means for the user of the vehicle 100 to interact with the user interface 116.
  • the onboard computer 148 may provide information to the user of the vehicle 100.
  • the user interface 116 can also operate the onboard computer 148 to receive user input.
  • the on-board computer 148 can be operated through a touch screen.
  • the peripheral device 108 may provide a means for the vehicle 100 to communicate with other devices located in the vehicle.
  • the microphone 150 may receive audio (eg, voice commands or other audio input) from a user of the vehicle 100.
  • the speaker 152 may output audio to the user of the vehicle 100.
  • the wireless communication system 146 may wirelessly communicate with one or more devices directly or via a communication network.
  • For example, the wireless communication system 146 may use 3G cellular communication, such as CDMA, EVDO, or GSM/GPRS, 4G cellular communication, such as LTE, or 5G cellular communication.
  • the wireless communication system 146 may use WiFi to communicate with a wireless local area network (WLAN).
  • The wireless communication system 146 can directly communicate with a device using an infrared link, Bluetooth, or ZigBee, or using other wireless protocols, for example, various vehicle communication systems.
• the wireless communication system 146 may include one or more dedicated short-range communications (DSRC) devices, which may enable public and/or private data communications between vehicles and/or roadside stations.
  • the power supply 110 may provide power to various components of the vehicle 100.
  • the power source 110 may be a rechargeable lithium ion or lead-acid battery.
  • One or more battery packs of such batteries may be configured as a power source to provide power to various components of the vehicle 100.
  • the power source 110 and the energy source 119 may be implemented together, such as in some all-electric vehicles.
  • the computer system 112 may include at least one processor 113 that executes instructions 115 stored in a non-transitory computer readable medium such as a data storage device 114.
  • the computer system 112 may also be multiple computing devices that control individual components or subsystems of the vehicle 100 in a distributed manner.
  • the processor 113 may be any conventional processor, such as a commercially available CPU. Alternatively, the processor may be a dedicated device such as an ASIC or other hardware-based processor.
• although FIG. 1 functionally illustrates the processor, memory, and other elements of the computer 110 in the same block, those of ordinary skill in the art should understand that the processor, computer, or memory may actually comprise multiple processors, computers, or memories that may or may not be located in the same physical housing.
• for example, the memory may be a hard disk drive or other storage medium located in a housing different from that of the computer 110. Therefore, a reference to a processor or computer will be understood to include a reference to a collection of processors, computers, or memories that may or may not operate in parallel. Rather than using a single processor to perform the steps described here, some components, such as the steering component and the deceleration component, may each have its own processor that only performs calculations related to that component's function.
  • the processor may be located away from the vehicle and wirelessly communicate with the vehicle.
• some of the processes described herein are executed on a processor disposed in the vehicle, while others are executed by a remote processor, including taking the steps necessary to perform a single manipulation.
  • the data storage device 114 may include instructions 115 (eg, program logic), which may be executed by the processor 113 to perform various functions of the vehicle 100, including those functions described above.
• the data storage device 114 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of the propulsion system 102, the sensor system 104, the control system 106, and the peripheral device 108.
  • the data storage device 114 may also store data, such as road maps, route information, the location, direction, and speed of the vehicle, and other such vehicle data, as well as other information. Such information may be used by the vehicle 100 and the computer system 112 during the operation of the vehicle 100 in autonomous, semi-autonomous, and/or manual modes.
  • the user interface 116 is used to provide information to or receive information from a user of the vehicle 100.
  • the user interface 116 may include one or more input/output devices in the set of peripheral devices 108, such as a wireless communication system 146, an in-vehicle computer 148, a microphone 150, and a speaker 152.
• the computer system 112 may control the functions of the vehicle 100 based on inputs received from various subsystems (for example, the propulsion system 102, the sensor system 104, and the control system 106) and from the user interface 116. For example, the computer system 112 may utilize input from the control system 106 in order to control the steering system 132 to avoid obstacles detected by the sensor system 104 and the obstacle avoidance system 144. In some embodiments, the computer system 112 is operable to provide control of many aspects of the vehicle 100 and its subsystems.
  • one or more of these components described above may be installed or associated with the vehicle 100 separately.
  • the data storage device 114 may exist partially or completely separately from the vehicle 100.
  • the aforementioned components may be communicatively coupled together in a wired and/or wireless manner.
  • FIG. 1 should not be construed as a limitation to the embodiment of the present application.
  • An autonomous vehicle traveling on a road can recognize objects in its surrounding environment to determine the current speed adjustment.
  • the object may be other vehicles, traffic control equipment, or other types of objects.
• each recognized object can be considered independently, and the respective characteristics of the object, such as its current speed, acceleration, and distance from the vehicle, can be used to determine the speed to which the autonomous vehicle is to be adjusted.
• the vehicle 100 or a computing device associated with the vehicle 100 may predict the behavior of the identified object based on the characteristics of the identified object and the state of the surrounding environment (for example, traffic, rain, ice on the road, etc.).
• the behavior of each recognized object may depend on the behavior of the other objects, so all recognized objects can also be considered together to predict the behavior of a single recognized object.
  • the vehicle 100 can adjust its speed based on the predicted behavior of the identified object.
  • an autonomous vehicle can determine what stable state the vehicle will need to adjust to (e.g., accelerate, decelerate, or stop) based on the predicted behavior of the object.
  • other factors may also be considered to determine the speed of the vehicle 100, such as the lateral position of the vehicle 100 on the road on which it is traveling, the curvature of the road, the proximity of static and dynamic objects, and so on.
• the computing device can also provide instructions to modify the steering angle of the vehicle 100 so that the self-driving vehicle follows a given trajectory and/or maintains safe lateral and longitudinal distances from objects near the self-driving vehicle (such as cars in adjacent lanes on the road).
  • the above-mentioned vehicle 100 may be a car, truck, motorcycle, bus, boat, airplane, helicopter, lawn mower, recreational vehicle, playground vehicle, construction equipment, tram, golf cart, train, and trolley, etc.
• the embodiments of the present application are not particularly limited in this regard.
  • Fig. 2 is a schematic diagram of an automatic driving system provided by an embodiment of the present application.
  • the automatic driving system shown in FIG. 2 includes a computer system 101, where the computer system 101 includes a processor 103, and the processor 103 is coupled with a system bus 105.
  • the processor 103 may be one or more processors, where each processor may include one or more processor cores.
• a display adapter (video adapter) 107 can drive a display 109, and the display 109 is coupled to the system bus 105.
  • the system bus 105 is coupled with an input/output (I/O) bus 113 through a bus bridge 111.
  • the I/O interface 115 is coupled to the I/O bus.
• the I/O interface 115 communicates with various I/O devices, such as an input device 117 (for example, a keyboard, a mouse, or a touch screen), a media tray 121 (for example, a CD-ROM or a multimedia interface), a transceiver 123 (which can send and/or receive radio communication signals), a camera 155 (which can capture still and dynamic digital video images), and an external USB interface 125.
  • the interface connected to the I/O interface 115 may be a USB interface.
  • the processor 103 may be any traditional processor, including a reduced instruction set computer (RISC) processor, a complex instruction set computer (CISC) processor, or a combination of the foregoing.
  • the processor may be a dedicated device such as an application specific integrated circuit (ASIC).
  • the processor 103 may be a neural network processor or a combination of a neural network processor and the foregoing traditional processors.
  • the computer system 101 may be located far away from the autonomous vehicle (for example, the computer system 101 may be located in the cloud or a server), and may communicate wirelessly with the autonomous vehicle.
• some of the processes described herein are executed on a processor provided in an autonomous vehicle, while others are executed by a remote processor, including taking actions required to perform a single manipulation.
  • the computer 101 can communicate with the software deployment server 149 through the network interface 129.
  • the network interface 129 is a hardware network interface, such as a network card.
  • the network 127 may be an external network, such as the Internet, or an internal network, such as an Ethernet or a virtual private network (VPN).
  • the network 127 may also be a wireless network, such as a WiFi network, a cellular network, and so on.
• a hard disk drive interface is coupled to the system bus 105, and the hard disk drive interface is connected to a hard disk drive.
  • the system memory 135 is coupled to the system bus 105.
  • the data running in the system memory 135 may include the operating system 137 and application programs 143 of the computer 101.
• the operating system includes a shell 139 and a kernel 141.
  • the shell is an interface between the user and the kernel of the operating system.
  • the shell is the outermost layer of the operating system.
• the shell manages the interaction between the user and the operating system: waiting for the user's input, interpreting the user's input to the operating system, and processing the various outputs of the operating system.
• the kernel 141 is composed of those parts of the operating system that manage memory, files, peripherals, and system resources. It interacts directly with the hardware; the operating system kernel usually runs processes, provides inter-process communication, and provides CPU time-slice management, interrupt handling, memory management, I/O management, and so on.
• the application program 143 includes programs related to driving behavior decision-making, for example, programs for obtaining the state information of the vehicle, making a decision based on the state information of the vehicle to obtain driving behavior decision information (that is, the vehicle's to-be-executed actions, such as acceleration, deceleration, or steering), and controlling the vehicle based on the driving behavior decision information.
• the application program 143 also exists on the system of the software deployment server 149. In one embodiment, when the application program 143 needs to be executed, the computer system 101 may download the application program 143 from the software deployment server 149.
  • the sensor 153 is associated with the computer system 101.
  • the sensor 153 is used to detect the environment around the computer 101.
• the sensor 153 can detect animals, cars, obstacles, and crosswalks. Further, the sensor can also detect the environment surrounding such animals, cars, obstacles, and crosswalks, for example, other animals that appear around an animal, weather conditions, the brightness of the surrounding environment, and so on.
  • the sensor 153 may also be used to obtain status information of the vehicle.
  • the sensor 153 can detect vehicle state information such as the position of the vehicle, the speed of the vehicle, the acceleration of the vehicle, and the posture of the vehicle.
  • the sensor may be a camera, an infrared sensor, a chemical detector, a microphone, etc.
• the application program 143 may make a decision based on the surrounding environment information and/or the state information of the vehicle detected by the sensor 153 to obtain driving behavior decision information, and control the vehicle according to the driving behavior decision information, so as to realize automatic driving of the vehicle.
• the driving behavior decision information can refer to the vehicle's to-be-executed actions, for example, performing one or more of acceleration, deceleration, or steering; or the driving behavior decision information can refer to the vehicle's to-be-selected control mode or control system, for example, selecting one or more of the steering control system, the direct yaw moment control system, or the emergency braking control system.
  • FIG. 3 is a hardware structure diagram of a chip provided by an embodiment of the present application.
  • the chip includes a neural network processor 20.
  • the chip may be in the processor 103 shown in FIG. 2 to make driving behavior decisions based on the state information of the vehicle.
  • the algorithms of each layer in the pre-trained neural network can be implemented in the chip as shown in FIG. 3.
• the method of training the driving behavior decision model and the method of determining the driving behavior in the embodiments of the present application can also be implemented in the chip shown in FIG. 3; this chip may be the same chip as, or a different chip from, the chip that implements the above-mentioned pre-trained neural network, which is not limited in the embodiments of the present application.
• the neural network processor (NPU) 20 is mounted on the host CPU as a coprocessor, and the host CPU distributes tasks.
• the core part of the NPU is the arithmetic circuit 203.
• the arithmetic circuit 203 is controlled by the controller 204 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 203 includes multiple processing units (process engines, PE). In some implementations, the arithmetic circuit 203 is a two-dimensional systolic array. The arithmetic circuit 203 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 203 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the corresponding data of matrix B from the weight memory 202 and caches it on each PE in the arithmetic circuit.
• the arithmetic circuit fetches the data of matrix A from the input memory 201, performs a matrix operation on it with matrix B, and stores the partial result or final result of the obtained matrix in an accumulator 208.
  • the vector calculation unit 207 can perform further processing on the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on.
• the vector calculation unit 207 can be used for network calculations in the non-convolutional/non-FC layers of the neural network, such as pooling, batch normalization, and local response normalization.
  • the vector calculation unit 207 can store the processed output vector in the unified buffer 206.
  • the vector calculation unit 207 may apply a nonlinear function to the output of the arithmetic circuit 203, such as a vector of accumulated values, to generate an activation value.
  • the vector calculation unit 207 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 203, for example, for use in a subsequent layer in a neural network.
  • the unified memory 206 is used to store input data and output data.
• the direct memory access controller (DMAC) 205 transfers the input data in the external memory to the input memory 201 and/or the unified memory 206, stores the weight data in the external memory into the weight memory 202, and stores the data in the unified memory 206 into the external memory.
• a bus interface unit (BIU) 210 is used to implement interaction among the host CPU, the DMAC, and the instruction fetch memory 209 through the bus.
  • An instruction fetch buffer 209 connected to the controller 204 is used to store instructions used by the controller 204;
  • the controller 204 is used to call the instructions cached in the instruction fetch memory 209 to control the working process of the operation accelerator.
  • the unified memory 206, the input memory 201, the weight memory 202, and the instruction fetch memory 209 are all on-chip (On-Chip) memories.
  • the external memory is a memory external to the NPU.
• the external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or another readable and writable memory.
  • the computer system 112 can also receive information from other computer systems or transfer information to other computer systems.
  • the sensor data collected from the sensor system 104 of the vehicle 100 may be transferred to another computer to process the data.
  • data from the computer system 312 may be transmitted to the server 320 on the cloud side via the network for further processing.
• the network and intermediate nodes can include various configurations and protocols, including the Internet, the World Wide Web, intranets, virtual private networks, wide area networks, local area networks, private networks using proprietary communication protocols of one or more companies, Ethernet, WiFi, HTTP, and various combinations of the foregoing.
  • This communication can be performed by any device capable of transferring data to and from other computers, such as modems and wireless interfaces.
  • the server 320 may include a server with multiple computers, for example, a load balancing server group, which exchanges information with different nodes of the network for the purpose of receiving, processing, and transmitting data from the computer system 312.
  • the server can be configured similar to the computer system 312, with a processor 330, a memory 340, instructions 350, and data 360.
• the data 360 of the server 320 may include the parameters of an offline-learned neural network model (for example, a neural network model based on deep learning) and related information of the neural network model (for example, training data of the neural network model and other parameters of the neural network model).
  • the server 320 may receive, detect, store, update, and transmit the parameters of the neural network model learned offline and related information of the neural network model.
  • the parameters of the neural network model for offline learning may include the hyperparameters of the neural network model and other model parameters (or model strategies).
  • the related information of the neural network model may include training data of the neural network model, and other parameters of the neural network model.
  • the server 320 may also use the training data of the neural network model to train the neural network model based on an imitation learning method (ie, offline training or offline learning), so as to update the parameters of the neural network model.
• the driving behavior decision model can have online learning capability, that is, the driving behavior decision model can be continuously trained while it is being used, so that the driving behavior decision model is continuously optimized.
• the reinforcement learning method is a typical unsupervised learning method.
• because no true value (or label) is available, the loss value of the model (for example, the driving behavior decision model) cannot be computed directly, and therefore the learning efficiency of reinforcement learning methods is lower.
• in addition, unlike a supervised learning method, the reinforcement learning method cannot guarantee that the obtained model is reliable.
  • this application proposes a method for training a driving behavior decision model, which can improve the training efficiency of the driving behavior decision model.
• the driving behavior decision model can also have both online learning capability and offline learning capability, that is, on the premise that the driving behavior decision model has online learning capability, the learning efficiency of the driving behavior decision model can be improved.
  • FIG. 5 is a schematic flowchart of a method 500 for training a driving behavior decision model provided by an embodiment of the present application.
  • the method 500 shown in FIG. 5 may include step 510, step 520, step 530, and step 540. It should be understood that the method 500 shown in FIG. 5 is only an example and not a limitation, and the method 500 may include more or fewer steps. This is not limited in the embodiments of the present application, and these steps are respectively described in detail below.
• the method 500 shown in FIG. 5 may be executed by the processor 113 in the vehicle 100 in FIG. 1, by the processor 103 in the automatic driving system in FIG. 2, or by the processor 330 in the server 320 in FIG. 4.
  • S510 Use the driving behavior decision model to make a decision based on the state information of the vehicle to obtain driving behavior decision information.
  • the state information of the vehicle may include the position of the vehicle, the speed of the vehicle, the acceleration of the vehicle, the posture of the vehicle, and other state information of the vehicle.
  • the state information of the vehicle may include preview deviation (for example, lateral preview deviation), the yaw rate of the vehicle, and the longitudinal speed of the vehicle.
  • the state information of the vehicle may be the current state of the vehicle (and/or the current action of the vehicle) in the method 600 of FIG. 6 or the method 700 of FIG. 7.
  • the driving behavior decision information may be used to indicate the actions (or operations) to be performed of the vehicle, for example, to perform one or more of the actions such as acceleration, deceleration, or steering.
  • the driving behavior decision information may also refer to the control mode (or control system) of the vehicle to be selected, for example, selecting one of the steering control system, the direct yaw moment control system, or the emergency braking control system. Or multiple.
  • the initial parameters of the driving behavior decision model may be determined according to the third parameters of the imitation learning model pre-trained based on the imitation learning method.
  • the imitation learning model may be the imitation learning system in the method 700 of FIG. 7 or the method 800 of FIG. 8.
  • the imitation learning method may include supervised learning (supervised learning), generative adversarial network (GAN), inverse reinforcement learning (IRL), etc.
• determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent the driving behavior decision model from outputting risky (or unreasonable) driving behavior decision information.
  • the third parameter may be obtained after the server (or the cloud) pre-trains the imitation learning model based on the imitation learning method.
• the server (or the cloud) may send the third parameter of the imitation learning model to the vehicle (for example, to the automatic driving system in the vehicle or the computer system in the vehicle), and the vehicle may then determine the initial parameters of the driving behavior decision model according to the third parameter of the imitation learning model.
  • the third parameter of the imitation learning model may also be obtained by the vehicle (for example, a processor or a computer system in the vehicle) based on an imitation learning method beforehand.
• the third parameter can be directly used as the initial parameters of the driving behavior decision model; alternatively, part of the third parameter can be used as part of the initial parameters of the driving behavior decision model (the remaining initial parameters of the driving behavior decision model can be determined according to other methods), which is not limited in the embodiments of the present application.
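• As an illustrative sketch only, the following Python code shows one way the initial parameters might be derived from the third parameter, either by copying it completely or by copying only selected entries; the names imitation_params, partial_keys, and the zero initialization of the remaining parameters are assumptions, not part of the claimed method.

```python
import copy
import numpy as np

def init_decision_model_params(imitation_params: dict, partial_keys=None) -> dict:
    """Derive initial parameters of the driving behavior decision model from the
    third parameter of the pre-trained imitation learning model.

    imitation_params: mapping from parameter name to numpy array (the third parameter).
    partial_keys: if None, copy everything; otherwise copy only these keys and
                  initialize the remaining parameters by another rule (here: zeros).
    """
    if partial_keys is None:
        # Use the third parameter directly as the initial parameters.
        return copy.deepcopy(imitation_params)

    init_params = {}
    for name, value in imitation_params.items():
        if name in partial_keys:
            init_params[name] = value.copy()          # reuse the pre-trained parameter
        else:
            init_params[name] = np.zeros_like(value)  # placeholder for another init rule
    return init_params

# Example: initialize both the first model (current network) and the
# second model (target network) from the same pre-trained third parameter.
third_parameter = {"W": np.random.randn(11, 4), "centers": np.random.randn(11, 3)}
first_model_params = init_decision_model_params(third_parameter)
second_model_params = init_decision_model_params(third_parameter)
```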
• the third parameter may be obtained by the server (or the cloud) by training the imitation learning model, based on an imitation learning method, using data output by a decision expert system.
• the decision expert system may be a rule-based decision expert system designed by analyzing the driving data of the driver (for example, operation data of an excellent or professional driver performing an emergency collision avoidance operation and the corresponding operation data of the vehicle) and the dynamic characteristics of the vehicle (for example, the dynamic characteristics of the vehicle tires).
• further, the data output by the decision expert system can be collected and labeled (that is, the data is annotated so that the imitation learning model can use it for imitation learning), and the imitation learning model can then be trained with the labeled data based on the imitation learning method to obtain the third parameter of the imitation learning model.
  • the driving behavior decision model may include a first model and a second model.
  • the first model may be the current network in the method 700 in FIG. 7 or the method 800 in FIG. 8
  • the second model may be the target network in the method 700 in FIG. 7 or the method 800 in FIG. 8.
• the first model and the second model may both be driving behavior decision models based on reinforcement learning, and the initial parameters of the first model and the initial parameters of the second model may be determined according to the third parameter of the imitation learning model pre-trained based on the imitation learning method.
  • the use of the driving behavior decision model to make a decision based on the state information to obtain driving behavior decision information may include:
  • the driving behavior decision model is used to evaluate all possible driving behaviors to obtain the driving behavior decision information.
  • the driving behavior decision is made in combination with the dynamic model and the kinematics model of the vehicle, which can improve the rationality of the driving behavior decision information.
  • the driving behavior of the vehicle at one or more subsequent moments can be predicted at the same time, which is not limited in the embodiment of the present application.
• when the driving behavior decision model includes a first model and a second model, using the driving behavior decision model to evaluate all possible driving behaviors to obtain the driving behavior decision information may include: using the second model to evaluate all possible driving behaviors to obtain the driving behavior decision information.
  • the parameters of the first model change relatively frequently, and the use of the second model to make decisions can improve the reliability of the driving behavior decision information.
  • parameters of the second model may be updated periodically according to the parameters of the first model.
• when a first preset condition is met, the parameters of the second model may be updated to the second parameter, where the first preset condition may be a preset time interval, or the first preset condition may be a preset number of adjustments to the parameters of the first model.
  • S520 Send the driving behavior decision information to the server.
  • S530 Receive the first parameter of the imitation learning model sent by the server.
  • the first parameter may be obtained by the server after training the imitation learning model based on the imitation learning method and using the driving behavior decision information.
  • the first parameter may be obtained by the server after training the imitation learning model based on an imitation learning method and using the driving behavior decision information that satisfies a second preset condition.
  • the second preset condition may include at least one of the following multiple conditions:
  • the second preset condition may include: the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • the reasonable driving behavior decision refers to a driving behavior decision that complies with preset rules.
  • the preset rules can be understood as the driving habits of experienced and experienced drivers.
  • the reasonable driving behavior decision may be obtained by an automated tagging learning method, or may also be obtained by a manual tagging method.
• for example, if the reasonable driving behavior decision corresponding to the state information of the vehicle is that the emergency braking control system works, and the driving behavior decision information obtained from the state information of the vehicle by using the driving behavior decision model is also that the emergency braking control system works, then the driving behavior decision information is the same as the reasonable driving behavior decision corresponding to the state information of the vehicle, that is, the driving behavior decision information is the reasonable driving behavior decision corresponding to the state information.
  • using the reasonable driving behavior decision corresponding to the state information can improve the learning efficiency of the driving behavior decision model.
  • the second preset condition may further include: the noise of the state information is within a first preset range.
  • the noise of the state information may include interference (for example, Gaussian noise) received by the signal of the state information or jitter of the signal of the state information.
  • the noise of the state information may also include data errors of the state information.
  • the status information of the vehicle includes the longitudinal speed of the vehicle.
• for example, if the first preset range is 5 km/h and the error of the longitudinal speed of the vehicle is less than (or less than or equal to) 5 km/h, the driving behavior decision information satisfies the second preset condition, that is, the driving behavior decision information is a correct driving behavior decision corresponding to the state information.
  • the value of the first preset range in the foregoing embodiment is only an example and not a limitation, and can be specifically determined according to actual conditions, which is not limited in the embodiment of the present application.
• when the noise of the state information is within the first preset range and the decision is made according to the state information, the obtained driving behavior decision information is more reasonable; adjusting the parameters of the driving behavior decision model according to such driving behavior decision information can improve the learning efficiency of the driving behavior decision model.
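• A minimal sketch of such a noise check, assuming the state information carries an estimated longitudinal-speed error and taking the 5 km/h value from the example above as the first preset range; the field names and the threshold are illustrative assumptions.

```python
FIRST_PRESET_RANGE_KMH = 5.0  # illustrative threshold, taken from the example above

def noise_within_first_preset_range(state_info: dict) -> bool:
    """Return True if the noise of the state information is acceptable,
    judged here only by the estimated error of the longitudinal speed."""
    speed_error_kmh = abs(state_info.get("longitudinal_speed_error_kmh", 0.0))
    return speed_error_kmh <= FIRST_PRESET_RANGE_KMH

# Only decisions whose underlying state information passes the check are treated
# as satisfying this part of the second preset condition.
samples = [
    {"longitudinal_speed_error_kmh": 2.3, "decision": 3},
    {"longitudinal_speed_error_kmh": 7.8, "decision": 1},
]
kept = [s for s in samples if noise_within_first_preset_range(s)]
```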
  • the state information may be one of a plurality of state information, and the second preset condition may further include: the plurality of state information is acquired in multiple scenarios.
  • the plurality of scenes may include one or more scenes in a highway, an urban area, a suburban area, and a mountainous area.
  • the multiple scenes may also include one or more scenes of an intersection, a T-junction, and a roundabout.
• acquiring the state information in the multiple scenarios described above can make the scenes covered by the training data of the driving behavior decision model (for example, the driving behavior decision information obtained after making decisions based on the state information) more abundant, which helps to further improve the learning efficiency of the driving behavior decision model.
• the second preset condition may further include: among the plurality of state information, the difference between the quantity of state information acquired in any one of the plurality of scenes and the quantity of state information acquired in any other one of the plurality of scenes is within a second preset range.
• for example, the plurality of state information is acquired in four scenes: highway, urban area, suburb, and mountainous area. If 1000 pieces (or 1000 groups) of state information are acquired in the highway scene and 100 pieces (or 100 groups) of state information are acquired in each of the other three scenes, then 100 pieces (or 100 groups) of state information can be selected, according to the methods in condition 1 and condition 2 above, from the 1000 pieces (or 1000 groups) acquired in the highway scene, so that the quantities of state information obtained in the four scenes are the same.
  • the multiple status information may also be acquired in other scenarios, which is not limited in the embodiment of the present application.
• for example, the plurality of state information may be acquired in multiple scenes such as intersections, T-junctions, and roundabouts, and the quantities of state information acquired in these scenes are the same, or the differences between the quantities of state information acquired in these scenes are within the second preset range.
• when the difference between the quantity of state information acquired in any one of the at least two scenes and the quantity of state information acquired in any other of the at least two scenes is within the second preset range, the amount of training data obtained in each scene (for example, the driving behavior decision information obtained after making decisions based on the state information) is more balanced, which avoids an overfitting problem of the driving behavior decision model in a certain scene.
  • the value of the second preset range in the foregoing embodiment may be determined according to actual conditions, which is not limited in the embodiment of the present application.
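• The following sketch illustrates one possible way to keep the per-scene quantities of state information within a second preset range by down-sampling; the scene labels, the range value, and the random down-sampling rule are assumptions made only for illustration.

```python
import random
from collections import defaultdict

def balance_by_scene(samples, second_preset_range=0, seed=0):
    """Down-sample per-scene state information so that the difference between the
    largest and smallest per-scene counts stays within second_preset_range."""
    random.seed(seed)
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene"]].append(sample)

    min_count = min(len(group) for group in by_scene.values())
    target = min_count + second_preset_range  # allowed upper bound per scene

    balanced = []
    for group in by_scene.values():
        if len(group) > target:
            group = random.sample(group, target)  # keep only `target` samples
        balanced.extend(group)
    return balanced

# e.g. 1000 highway samples and 100 samples in each other scene are reduced so
# that every scene contributes a comparable amount of training data.
```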
  • using high-quality driving behavior decision information can improve the learning efficiency of the driving behavior decision model.
• the evaluation of whether the driving behavior decision information satisfies the second preset condition may be executed by the vehicle or by the server, which is not limited in the embodiments of the present application.
• for example, the vehicle may send the driving behavior decision information obtained by the decision to the server, and the server then evaluates whether the driving behavior decision information satisfies the second preset condition, so as to filter out the driving behavior decision information that satisfies the second preset condition.
• alternatively, the vehicle may itself evaluate whether the driving behavior decision information satisfies the second preset condition, so as to filter out the driving behavior decision information that satisfies the second preset condition, and then send the driving behavior decision information that satisfies the second preset condition to the server.
  • S540 Adjust parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter.
  • the first parameter is obtained by the server after training the imitation learning model based on the imitation learning method and using the driving behavior decision information.
• training based on the imitation learning method can ensure the training effect of the imitation learning model.
  • adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter can improve the learning efficiency of the driving behavior decision model.
  • the adjusting the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter may include:
  • the parameters of the driving behavior decision model are adjusted according to the driving behavior decision information to obtain the second parameter; and the second parameter of the driving behavior decision model is adjusted according to the first parameter.
• the parameters of the driving behavior decision model can be adjusted based on the reinforcement learning method to obtain the second parameter, and the second parameter of the driving behavior decision model can then be adjusted according to the first parameter, so that the driving behavior decision model has both online learning capability and offline learning capability; that is, on the premise that the driving behavior decision model has online learning capability, the learning efficiency of the driving behavior decision model can be further improved.
  • the driving behavior decision model may include a first model and a second model.
• based on the reinforcement learning method, adjusting the parameters of the driving behavior decision model according to the driving behavior decision information to obtain the second parameter may include: adjusting the parameters of the first model according to the driving behavior decision information to obtain the second parameter; and, when the first preset condition is met, updating the parameters of the second model to the second parameter.
  • the first preset condition may be a preset time interval or a preset number of adjustments to the parameters of the first model.
• when the first preset condition is met, the parameters of the second model are updated to the second parameter, which avoids instability of the output of the second model caused by frequently adjusting its parameters; therefore, the reliability of the driving behavior decision information can be improved.
• updating the parameters of the second model to the second parameter may mean directly updating all the parameters of the second model to the second parameter; or it may mean updating some of the parameters of the second model to the second parameter (the remaining parameters of the second model may be determined according to other methods), which is not limited in the embodiments of the present application.
• satisfying the first preset condition may mean that a preset time interval has elapsed since the parameters of the second model were last updated; or it may mean that the number of decisions made using the driving behavior decision model reaches a preset number; or it may mean that another preset condition is met, which is not limited in the embodiments of the present application.
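• A small sketch of how the first preset condition might gate the update of the second model (target network), using either an elapsed-time check or a count of adjustments to the first model; the thresholds and the class and field names are illustrative assumptions.

```python
import copy
import time

class TargetModelUpdater:
    """Update the second model (target network) from the first model (current
    network) only when the first preset condition is met."""

    def __init__(self, interval_s=60.0, adjustments_per_update=100):
        self.interval_s = interval_s                            # preset time interval
        self.adjustments_per_update = adjustments_per_update    # preset number of adjustments
        self._last_update = time.monotonic()
        self._adjustments = 0

    def on_first_model_adjusted(self, first_model_params, second_model_params):
        """Call after each adjustment of the first model's parameters."""
        self._adjustments += 1
        elapsed = time.monotonic() - self._last_update
        if elapsed >= self.interval_s or self._adjustments >= self.adjustments_per_update:
            # First preset condition met: copy the second parameter into the second model.
            second_model_params.clear()
            second_model_params.update(copy.deepcopy(first_model_params))
            self._last_update = time.monotonic()
            self._adjustments = 0
        return second_model_params
```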
• adjusting the second parameter of the driving behavior decision model according to the first parameter may include: adjusting the parameters of the first model and/or the parameters of the second model according to the first parameter.
  • the parameter of at least one of the first model and the second model can be flexibly adjusted according to the first parameter.
• for example, the second parameter of the first model and the second parameter of the second model may be updated simultaneously according to the first parameter of the imitation learning model; or, the second parameter of the first model may first be updated according to the first parameter of the imitation learning model, and then, when the first preset condition is satisfied, the second parameter of the second model may be updated according to the parameters of the first model.
  • the method 500 may further include: controlling the vehicle according to the driving behavior decision information.
• while training the driving behavior decision model, the vehicle can be controlled according to the driving behavior decision information; therefore, the driving behavior decision model can be trained while it is used to control the vehicle, so as to continuously optimize the driving behavior decision model.
  • FIG. 6 is a schematic flowchart of a method 600 for training a driving behavior decision model provided by an embodiment of the present application.
• the method 600 shown in FIG. 6 may include step 610, step 620, and step 630. It should be understood that the method 600 shown in FIG. 6 is only an example and not a limitation; the method 600 may include more or fewer steps, which is not limited in the embodiments of the present application, and these steps are respectively described in detail below.
  • the method 600 shown in FIG. 6 may be executed by the processor 330 in the server 320 in FIG. 4.
  • S610 Receive driving behavior decision information sent by the vehicle.
  • the driving behavior decision information may be obtained after the vehicle uses a driving behavior decision model to make a decision according to the state information of the vehicle.
• S620: Based on the imitation learning method, train an imitation learning model according to the driving behavior decision information to obtain the first parameter of the imitation learning model.
  • the first parameter is used to adjust the parameter of the driving behavior decision model.
• based on the imitation learning method, training the imitation learning model according to the driving behavior decision information to obtain the first parameter of the imitation learning model may include: training the imitation learning model according to the driving behavior decision information that satisfies the second preset condition to obtain the first parameter of the imitation learning model.
  • the second preset condition may include that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • the second preset condition may further include that the noise of the state information is within a first preset range.
  • the state information may be one of a plurality of state information
  • the second preset condition may further include that the plurality of state information is acquired in multiple scenarios.
• the second preset condition may further include: among the plurality of state information, the difference between the quantity of state information acquired in any one of the plurality of scenes and the quantity of state information acquired in any other one of the plurality of scenes is within the second preset range.
  • the value of the second preset range in the foregoing embodiment may be determined according to actual conditions, which is not limited in the embodiment of the present application.
• the evaluation of whether the driving behavior decision information satisfies the second preset condition may be executed by the vehicle or by the server, which is not limited in the embodiments of the present application.
• for example, the vehicle may evaluate whether the driving behavior decision information satisfies the second preset condition to filter out the driving behavior decision information that satisfies the second preset condition, and then send the driving behavior decision information that satisfies the second preset condition to the server.
• based on the imitation learning method, the imitation learning model is trained according to the driving behavior decision information to obtain the first parameter of the imitation learning model, which ensures the training effect of the imitation learning model; adjusting the parameters of the driving behavior decision model according to the first parameter can therefore improve the learning efficiency of the driving behavior decision model.
  • the method 600 may further include:
• pre-training the imitation learning model based on the imitation learning method and the data output by a decision expert system to obtain the third parameter of the imitation learning model, where the decision expert system is designed according to the driving data of the driver and the dynamic characteristics of the vehicle; and sending the third parameter to the vehicle.
• determining the initial parameters of the driving behavior decision model according to the third parameter of the pre-trained imitation learning model can improve the stability of the driving behavior decision model and prevent the driving behavior decision model from outputting risky (or unreasonable) driving behavior decision information.
• FIG. 7 is a schematic flowchart of a method 700 for training a driving behavior decision model provided by an embodiment of the present application.
  • the method 700 shown in FIG. 7 may include step 710, step 720, step 730, and step 740. It should be understood that the method 700 shown in FIG. 7 is only an example and not a limitation, and the method 700 may include more or fewer steps. This is not limited in the embodiments of the present application, and these steps are respectively described in detail below.
• each step in the method 700 can be executed by the vehicle (for example, by the processor 113 in the vehicle 100 in FIG. 1 or the processor 103 in the automatic driving system in FIG. 2) or by the server (for example, by the processor 330 in the server 320 in FIG. 4), which is not limited in the embodiments of the present application.
• the following description takes, as an example, the case in which the server executes step 710, step 720, and step 730, and the vehicle executes step 740.
• the server may collect the driving data of the vehicle, where the driving data may include the driving data of the driver and the dynamics data of the vehicle (for example, the dynamic characteristics of the vehicle may be determined based on the dynamics data); based on the driving data of the vehicle, a rule-based expert system that can make driving behavior decisions is designed.
• the server can collect the decision information generated by the expert system designed in S710, and label the collected decision information (that is, annotate the data so that it can be used to perform imitation learning on the neural network model) to construct a training data set.
• the server can also collect the decision information generated by the reinforcement learning system designed in S740, filter out the high-quality decision information (generated by the reinforcement learning system), and label the high-quality decision information to construct the training data set.
  • the description of the high-quality decision information and the method of determining the high-quality decision information can be referred to the embodiment of the method 500 in FIG. 5, which will not be repeated here.
  • the imitation learning system may be designed according to a Softmax classifier scheme based on a radial basis function neural network (radial basis function neural network, RBFNN).
• the training data set constructed in S720 may be used to perform offline training on the imitation learning system based on a mini-batch stochastic gradient descent algorithm, so as to realize the cloning of the behavior of the expert system by the imitation learning system.
• the cloning here can be understood as follows: the imitation learning system is trained offline so that the performance (or effect) of the decision information generated by the imitation learning system is not inferior to, or is close to, the performance (or effect) of the decision information generated by the expert system.
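• A minimal sketch of this offline behavior-cloning step, assuming the imitation learning system reduces to a linear Softmax classifier over precomputed features (for example, RBF hidden-layer activations) trained with a cross-entropy cost and mini-batch stochastic gradient descent; the shapes and hyperparameters are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_imitation_offline(H, Y, num_classes, lr=0.05, batch=32, epochs=10, seed=0):
    """Mini-batch stochastic gradient descent with a cross-entropy cost.

    H: (N, d) feature matrix (e.g. RBF hidden-layer activations of the inputs).
    Y: (N,) integer expert decisions collected from the expert system.
    Returns the learned weight matrix W of shape (d, num_classes).
    """
    rng = np.random.default_rng(seed)
    n, d = H.shape
    W = 0.01 * rng.standard_normal((d, num_classes))
    one_hot = np.eye(num_classes)[np.asarray(Y)]
    for _ in range(epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch):
            idx = order[start:start + batch]
            P = softmax(H[idx] @ W)                               # class probabilities
            grad = H[idx].T @ (P - one_hot[idx]) / len(idx)       # cross-entropy gradient
            W -= lr * grad
    return W
```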
  • the reinforcement learning system may be designed according to a scheme based on a reinforcement learning neural network.
• the model strategy (that is, the model parameters) learned by the imitation learning system can be used as the initial strategy (that is, the initial model parameters) of the reinforcement learning system. By combining the dynamic model of the vehicle and the kinematics model of the vehicle, the state information of the vehicle at the next moment (or the next n moments, where n is a positive integer) is predicted based on the current state of the vehicle (and/or the current action of the vehicle), and the state information may include all possible driving behaviors at a given moment. The reinforcement learning system is then used to estimate the Q values corresponding to the multiple different driving behaviors at that moment, and the driving behavior corresponding to the largest Q value is used as the decision information at that moment (the driving behavior decision information output by the reinforcement learning system includes the decision information at that moment).
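• The decision step can be sketched as follows, where a placeholder vehicle model predicts candidate next states and a Q-function scores every candidate driving behavior; the vehicle model, the Q-function interface, and the candidate action set are assumptions made only for illustration.

```python
import numpy as np

CANDIDATE_BEHAVIORS = [0, 1, 2, 3, 4]  # e.g. the control-mode combinations described later

def predict_next_state(state, behavior, dt=0.1):
    """Very simplified stand-in for the combined dynamic/kinematic vehicle model.
    state = (preview deviation, yaw rate, reciprocal of driving speed)."""
    preview_dev, yaw_rate, inv_speed = state
    return np.array([preview_dev + yaw_rate * dt, yaw_rate, inv_speed])

def decide(state, q_function):
    """Evaluate all candidate driving behaviors and return the one with the largest Q value."""
    q_values = []
    for behavior in CANDIDATE_BEHAVIORS:
        next_state = predict_next_state(state, behavior)
        q_values.append(q_function(next_state, behavior))
    best_behavior = int(np.argmax(q_values))
    return best_behavior, q_values
```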
  • the reinforcement learning system may include two networks, a current network and a target network, respectively, and these two networks may adopt the same RBFNN structure as the imitation learning system.
  • the state information predicted by combining the dynamic model of the vehicle and the kinematics model of the vehicle may include state information of the vehicle at one or more subsequent moments.
• the reinforcement learning system can be used to estimate the decision information at each of the multiple moments, and the driving behavior decision information output by the reinforcement learning system may include the decision information at the multiple moments.
  • a reinforcement learning system can output the driving behavior decision information, and the vehicle can be controlled based on the driving behavior decision information.
  • each step in the method 700 can be continuously executed iteratively, thereby realizing continuous online learning of the reinforcement learning system.
  • each step in the method 700 may be iteratively executed as follows:
  • the vehicle may send the driving behavior decision information generated by the reinforcement learning system to the server;
• the server can determine the high-quality decision information in the driving behavior decision information, update the determined high-quality decision information into the training data set, and perform offline training on the imitation learning system based on the updated training data set;
  • the server may periodically send the model strategy (ie model parameters) of the imitation learning system to the vehicle;
• after the vehicle receives the model strategy (that is, the model parameters) of the imitation learning system, it can update the model strategy (that is, the model parameters) of the reinforcement learning system based on the received model strategy;
• the vehicle can continue to send the driving behavior decision information generated by the reinforcement learning system to the server; the server can continue to perform offline training on the imitation learning system based on the driving behavior decision information; and the server can continue to regularly send the model strategy (that is, the model parameters) of the imitation learning system to the vehicle to update the model strategy (that is, the model parameters) of the reinforcement learning system.
  • the steps in the method 700 may be repeatedly and iteratively executed in the above-mentioned manner.
• the vehicle updating the model strategy of the reinforcement learning system based on the received model strategy may mean directly replacing the model strategy of the reinforcement learning system with the received model strategy, or replacing the model strategy of the reinforcement learning system in proportion, for example, using 70% of the received model strategy and 30% of the current model strategy of the reinforcement learning system as the new model strategy of the reinforcement learning system.
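• A one-function sketch of the proportional replacement described above, blending the received imitation-learning strategy with the current reinforcement-learning strategy; the 0.7/0.3 split mirrors the example, and the dictionary-of-arrays representation of a model strategy is an assumption.

```python
def blend_strategies(imitation_params, rl_params, alpha=0.7):
    """Return a new strategy: alpha * imitation strategy + (1 - alpha) * RL strategy.

    Both arguments are dicts mapping parameter names to numpy arrays of matching shape.
    With alpha = 1.0, the received strategy simply replaces the RL strategy.
    """
    return {name: alpha * imitation_params[name] + (1.0 - alpha) * rl_params[name]
            for name in rl_params}
```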
• in this way, not only can the reinforcement learning system be continuously improved through reinforcement learning, but the vehicle can also be monitored through the server (or the cloud), and the imitation learning system trained offline can regularly adjust the reinforcement learning system, so that the performance of the autonomous vehicle is continuously improved from two dimensions (online and offline).
  • FIG. 8 is a schematic flowchart of a method 800 for training a driving behavior decision model provided by an embodiment of the present application.
  • the method 800 shown in FIG. 8 may include step 810, step 820, step 830, and step 840. It should be understood that the method 800 shown in FIG. 8 is only an example and not a limitation, and the method 800 may include more or fewer steps. This is not limited in the embodiments of the present application, and these steps are respectively described in detail below.
• each step in the method 800 may be executed by the vehicle (for example, by the processor 113 in the vehicle 100 in FIG. 1 or the processor 103 in the automatic driving system in FIG. 2) or by the server (for example, by the processor 330 in the server 320 in FIG. 4), which is not limited in the embodiments of the present application.
• the following description takes, as an example, the case in which the server executes step 810, step 820, and step 830, and the vehicle executes step 840.
  • the expert system may be used to coordinate (decide) the motion control system of the self-driving vehicle, and the motion control system may include an emergency braking control system, a direct yaw moment control system, and a steering control system.
  • the expert system can also be used to decide other systems or other states of the vehicle.
  • the expert system can also be used to coordinate (or decide) the speed, acceleration, or steering angle of the vehicle.
• this is not limited in the embodiments of the present application.
• the server may receive (or regularly receive) the driving data of the vehicle sent by the vehicle (the driving data may refer to the driving data of a professional driver, for example, an example of an excellent driver performing an emergency collision avoidance operation) and the dynamics data of the vehicle (for example, the dynamic characteristics of the vehicle can be determined based on the dynamics data).
• the following takes, as an example, the case in which the expert system coordinates (decides) the motion control systems of an autonomous vehicle for detailed description.
• by analyzing the driving data of the vehicle and the dynamics data of the vehicle, the rule-based expert system can be designed as follows:
  • the non-operation of the steering control system means that the vehicle is driving in a straight line.
  • the kinematics state of the vehicle can include preview deviation, path tracking deviation, heading angle, etc.
  • the dynamic state of the vehicle can include vehicle speed, yaw rate, lateral acceleration, longitudinal acceleration, side slip angle, etc., environmental sensing system information It can include the distance to the surrounding vehicles, the speed of the surrounding vehicles, etc., the heading angle of the surrounding vehicles, and so on.
  • At this point, by inputting the kinematic state of the vehicle, the dynamic state of the vehicle, and the environment sensing system information (perceived by the vehicle) into the expert system, the decision information for coordinating (deciding among) the motion control systems of the autonomous vehicle can be generated.
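  • As an illustration only, the following minimal sketch (in Python) shows how the rule-based coordination described above could be expressed; the function name, the driving-phase encoding, and the argument names are assumptions for illustration, and only the on/off logic of rules a to d and the 0.4 g threshold come from the text.

    # Minimal sketch of the rule-based expert system (rules a to d above).
    G = 9.81                     # gravitational acceleration [m/s^2]
    AY_THRESHOLD = 0.4 * G       # preset lateral-acceleration threshold

    def expert_decision(phase, lateral_acceleration):
        """Decide which motion control systems work in the current situation.

        phase: 'straight_line_braking', 'turning_avoidance' or 'done'
        Returns one boolean per control system.
        """
        braking = dyc = steering = False
        if phase == 'straight_line_braking':                  # rule a
            braking = True
        elif phase == 'turning_avoidance':
            if abs(lateral_acceleration) <= AY_THRESHOLD:     # rule b
                braking, steering = True, True
            else:                                             # rule c
                dyc, steering = True, True
        # phase == 'done': rule d, none of the systems works
        return {'emergency_braking': braking,
                'direct_yaw_moment': dyc,
                'steering': steering}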
  • S820: Construct the training data set. As shown in FIG. 8, the server can collect the decision information generated by the expert system and the high-quality decision information generated by the reinforcement learning system, and annotate the collected decision information (including the decision information generated by the expert system and the high-quality decision information generated by the reinforcement learning system) to construct a training data set.
  • For the description of the high-quality decision information and the method of determining the high-quality decision information, refer to the embodiment of the method 500 in FIG. 5; details are not repeated here.
  • S830: Design the imitation learning system. In this embodiment of the present application, the imitation learning system can be designed based on a Softmax classifier and the mini-batch stochastic gradient descent method, so as to realize behavior cloning of the expert system.
  • For example, the imitation learning system can be designed according to the following steps:
  • a: Design the output of the neural network. Optionally, the neural network may be a Softmax classifier, and the decision information output by the neural network may be consistent with the decision information generated by the rule-based expert system.
  • The decision information output by the neural network (similar to the decision information generated by the expert system) can be used to select the coordinated working mode of the motion control systems of the autonomous vehicle.
  • the combination of emergency collision avoidance actions of an autonomous vehicle can be divided into the following categories:
  • the serial number "1" can indicate that only the steering control system works
  • the serial number "2” can indicate that the steering control system and the direct yaw moment control system work together
  • the serial number "3" can indicate that the steering control system and the emergency brake control system work together.
  • "4" can indicate the joint work of the steering control system, direct yaw moment control system and emergency brake control system.
  • the serial number "0” can indicate the steering control system, direct yaw moment control system and emergency brake control system. None of them work.
  • the neural network can output any of the aforementioned serial numbers.
  • b: Design the cost function used during training. Optionally, the cost function can be defined by the cross-entropy method, for example, L_i = -y_i ln(P_i); a cost function defined by the cross-entropy method can improve the learning efficiency and effect.
  • c: Determine the structure and inputs of the neural network. The network structure of the neural network may refer to a radial basis function neural network (RBFNN).
  • For example, the RBFNN can be used to learn to approximate the Q values of the Softmax classifier.
  • As shown in FIG. 9, the RBFNN can include three inputs, namely the projection deviation (or preview deviation) e_p, the vehicle yaw rate γ, and the reciprocal of the driving speed 1/v_x.
  • The RBFNN can include a single hidden layer h_1 to h_11 composed of 11 Gaussian kernel functions, and can output a vector of 4 Q values.
  • The network structure of the RBFNN can be as shown in FIG. 9.
  • The expression of the RBFNN can be Q̂(x) = θ^T h(x), where Q̂(x) represents the output of the neural network; θ represents the weight matrix of the neural network; h(x) = [h_i]^T represents the basis function vector; i represents the number of hidden layer nodes of the neural network; h_i represents the Gaussian function, h_i(x) = exp(-||x - c_i||^2 / (2 b_i^2)); c_i represents the center vector of the i-th node; b_i represents the width of the Gaussian function of the i-th node; and x represents the input vector of the neural network, x = [e_p, γ, 1/v_x]^T, whose elements are the projection deviation e_p, the vehicle yaw rate γ, and the reciprocal of the longitudinal vehicle speed 1/v_x.
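  • To make the structure above concrete, a minimal sketch (in Python, using NumPy) of the RBFNN forward pass is given below; the centers, widths, and weights are arbitrary placeholder values, since the trained parameters are not reproduced in the text.

    import numpy as np

    # RBFNN sketch: input x = [e_p, gamma, 1/v_x], 11 Gaussian hidden units,
    # output Q_hat(x) = theta^T h(x) with 4 Q values.
    rng = np.random.default_rng(0)
    c = rng.uniform(-1.0, 1.0, size=(11, 3))     # center vectors c_i (placeholder)
    b = np.full(11, 0.5)                         # Gaussian widths b_i (placeholder)
    theta = rng.normal(scale=0.1, size=(11, 4))  # weight matrix theta (placeholder)

    def rbf_features(x):
        """h_i(x) = exp(-||x - c_i||^2 / (2 * b_i^2)), i = 1..11."""
        d2 = np.sum((c - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * b ** 2))

    def q_values(x):
        """Q_hat(x) = theta^T h(x): one Q value per action class."""
        return rbf_features(x) @ theta

    x = np.array([0.2, 0.05, 1.0 / 15.0])        # e_p, yaw rate, 1/v_x
    print(q_values(x))                           # four Q values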
  • d: Derive the gradient calculation formulas. For example, the gradient of the total cost function of the neural network may be ∂L/∂Q_i = P_i - y_i, and the gradient of the total cost function with respect to the weights W of the neural network is ∂L/∂W = (P_i - y_i)·h, where P_i is the probability value output by the Softmax classifier, P_i = exp(Q_i) / Σ_k exp(Q_k); y_i is the label value; Q_i and Q_k are the reinforcement-learning state-action value functions; N is the total number of sample categories; h is the Gaussian kernel function vector; and i and k are positive integers.
  • The mini-batch stochastic gradient descent algorithm can use the following gradient: ∇_W L = (1/M_0) Σ_{n=1}^{M_0} ∇_W L_n, where M_0 is the batch size of mini-batch stochastic gradient descent, and n is a positive integer greater than or equal to 1 and less than or equal to M_0.
  • Optionally, by training the neural network offline with the mini-batch stochastic gradient descent method, the behavior of the rule-based driving behavior decision system can be cloned.
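  • As an illustration of steps b to d above, a minimal sketch (in Python) of one mini-batch stochastic gradient descent step is given below; it reuses rbf_features and theta from the previous sketch, and the learning rate and label encoding are assumptions for illustration.

    import numpy as np

    def softmax(q):
        """P_i = exp(Q_i) / sum_k exp(Q_k)."""
        e = np.exp(q - q.max())
        return e / e.sum()

    def sgd_step(theta, batch_x, batch_y, lr=0.05):
        """One mini-batch step: cross-entropy loss L_i = -y_i * ln(P_i),
        per-sample gradient dL/dtheta = outer(h, P - y), averaged over M_0 samples.
        batch_x: (M_0, 3) inputs; batch_y: (M_0,) integer class labels."""
        grad = np.zeros_like(theta)
        for x, y in zip(batch_x, batch_y):
            h = rbf_features(x)                      # from the previous sketch
            p = softmax(h @ theta)
            onehot = np.eye(theta.shape[1])[y]
            grad += np.outer(h, p - onehot)
        theta -= lr * grad / len(batch_x)            # mini-batch average
        return theta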
  • S840: Design the reinforcement learning system.
  • For example, the reinforcement learning system can be designed according to the following steps:
  • a: Determine the initial policy. The model policy (that is, the model parameters) learned by the imitation learning system is used as the initial policy (that is, the initial parameters of the model) of the reinforcement learning system, to improve the efficiency and effect of driving behavior decision-making. For example, the designed Markov decision process (MDP) state can be S = [e_p, γ, v_x^{-1}]^T, and the action space can be A = [1, 2, 3, 4]^T.
  • b: Determine the immediate reward function. The designed immediate reward function can be r = -S^T K S, where K is the reward weight matrix.
  • c: Determine the network structure. The reinforcement learning system may include two networks, namely a current network and a target network, and these two networks may adopt the same RBFNN structure as the imitation learning system.
  • The difference is that the three inputs of the target network are the results predicted by the vehicle prediction model (for example, the dynamics model and the kinematics model of the vehicle).
  • d: Design the optimization index and gradient. The optimization index can be designed as L_t = (Q* - Q(x, a; θ_t))^2, and the gradient formula can be ∇_{θ_t} L_t = -2 (Q* - Q(x, a; θ_t)) ∇_{θ_t} Q(x, a; θ_t), where Q* is the optimal value function, Q* = r + γ_rl · max_{a'} Q(x', a'; θ_t'); γ_rl is the discount factor; a' is the action that maximizes the Q value at the t-th iteration; θ_t' is the target network parameter; x' is the input at the next moment; r is the reward function; and t is a positive integer.
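  • A minimal sketch (in Python) of one update following this optimization index is given below; it reuses rbf_features from the earlier sketch, treats theta and theta_target as the current-network and target-network weights, and uses an illustrative discount factor and learning rate.

    import numpy as np

    def td_target(r, x_next, theta_target, gamma_rl=0.95):
        """Q* = r + gamma_rl * max_a' Q(x', a'; theta_t')."""
        q_next = rbf_features(x_next) @ theta_target
        return r + gamma_rl * q_next.max()

    def q_learning_step(theta, theta_target, x, a, r, x_next, lr=0.01):
        """Gradient step on the squared error (Q* - Q(x, a; theta))^2."""
        h = rbf_features(x)
        delta = td_target(r, x_next, theta_target) - (h @ theta)[a]
        theta[:, a] += lr * delta * h               # move Q(x, a) toward Q*
        return theta, delta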
  • The vehicle prediction model can be expressed in the state-space form x' = A x + B u + w (with system output y), where x' is the predicted state at time t+1; y is the system output; A is the state matrix; B is the input matrix; x is the state vector, x = [β  γ  e_p  Δψ  e_v]^T; u is the input vector, u = [δ  M_c  F_xa]^T; w is the interference vector; x_p is the preview distance; β is the sideslip angle; γ is the yaw rate; e_p is the preview deviation; Δψ is the heading angle deviation; e_v is the velocity deviation; K is the curvature of the road; C_f is the front wheel cornering stiffness; C_r is the rear wheel cornering stiffness; and a is the distance from the center of gravity to the front axle of the vehicle.
  • In discrete form, the vehicle prediction model gives the state S_{t+1} at time t+1 as a function S_{t+1} = f(S_t, A_t) of the state S_t at time t and the action A_t taken at time t, where T_s is the prediction horizon, e_p is the preview deviation, γ is the yaw rate, and v_x is the longitudinal vehicle speed.
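  • As a structural illustration only, the prediction step can be sketched as follows (in Python); the state and input layout follow the definitions above, but the entries of A, B, and w are placeholders, since the actual matrices depend on C_f, C_r, the axle distance, the vehicle speed, and other quantities not reproduced here.

    import numpy as np

    A = np.zeros((5, 5))          # state matrix (placeholder entries)
    B = np.zeros((5, 3))          # input matrix (placeholder entries)
    w = np.zeros(5)               # interference vector (placeholder entries)

    def predict(x, u):
        """Predicted state at the next moment: x' = A x + B u + w,
        with x = [beta, gamma, e_p, d_psi, e_v] and u = [delta, M_c, F_xa]."""
        return A @ x + B @ u + w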
  • Optionally, the vehicle prediction model may be used to predict the state information of the vehicle at the next moment (or the next n moments, where n is a positive integer) based on the current state of the vehicle (and/or the current action of the vehicle).
  • The state information may include all possible driving behaviors at a certain moment; the reinforcement learning system is used to estimate the Q values corresponding to the multiple different driving behaviors included at that moment, and the driving behavior corresponding to the largest Q value is taken as the decision information at that moment (the driving behavior decision information output by the reinforcement learning system includes the decision information at that moment).
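  • A minimal sketch (in Python) of this selection step is given below; it reuses rbf_features and theta from the earlier sketches, and the way the predicted state is fed to the network is an assumption for illustration.

    import numpy as np

    def decide(predicted_state, theta):
        """Estimate one Q value per candidate driving behavior for the state
        predicted by the vehicle prediction model, and return the behavior
        (serial number) with the largest Q value."""
        q = rbf_features(predicted_state) @ theta
        return int(np.argmax(q))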
  • Optionally, the eligibility trace method and the gradient descent method can be combined, and the gradient of the network weight update can be determined as ET_t = γ_rl · λ_rl · ET_{t-1} + ∇_θ Q(x_t, a_t) and Δθ = α_rl · δ_t · ET_t, where δ_t is the temporal-difference error of the value function Q; λ_rl is the attenuation (decay) factor; γ_rl is the discount factor; ET_t is the eligibility trace at time t; ET_{t-1} is the eligibility trace at time t-1; r is the reward function; α_rl is a positive coefficient; and t is a positive integer.
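  • A minimal sketch (in Python) of this eligibility-trace update is given below; it reuses rbf_features from the earlier sketches, and the coefficient values are illustrative assumptions.

    import numpy as np

    def trace_update(theta, trace, x, a, delta_t,
                     alpha_rl=0.01, gamma_rl=0.95, lambda_rl=0.9):
        """ET_t = gamma_rl * lambda_rl * ET_{t-1} + dQ(x, a)/dtheta;
        theta <- theta + alpha_rl * delta_t * ET_t."""
        grad = np.zeros_like(theta)
        grad[:, a] = rbf_features(x)                 # dQ(x, a)/dtheta
        trace = gamma_rl * lambda_rl * trace + grad  # decayed eligibility trace
        theta = theta + alpha_rl * delta_t * trace   # weight update
        return theta, trace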
  • the high-quality data generated by the reinforcement learning system may be labeled and added to the training data set, and provided to the imitation learning system for offline training.
  • Steps S820, S830, and S840 can be executed iteratively and continuously, interacting with the self-driving vehicle through offline training and online learning, so as to realize continuous self-training of the reinforcement learning system and continuously improve the self-driving system.
  • FIG. 10 is a schematic block diagram of an apparatus 1000 for training a driving behavior decision model provided by an embodiment of the present application. It should be understood that the device 1000 for training a driving behavior decision model shown in FIG. 10 is only an example, and the device 1000 of the embodiment of the present application may further include other modules or units. It should be understood that the device 1000 for training a driving behavior decision model can execute each step in the method of FIG. 5, FIG. 7 or FIG. 8. In order to avoid repetition, it will not be described in detail here.
  • the decision-making unit 1010 is configured to use the driving behavior decision model to make decisions based on the state information of the vehicle to obtain driving behavior decision information;
  • the sending unit 1020 is configured to send the driving behavior decision information to the server;
  • the receiving unit 1030 is configured to receive the first parameter of the imitation learning model sent by the server, where the first parameter is obtained by the server after training the imitation learning model based on the imitation learning method and using the driving behavior decision information ;
  • the adjusting unit 1040 is configured to adjust the parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter.
  • the adjustment unit 1040 is specifically configured to:
  • the parameters of the driving behavior decision model are adjusted according to the driving behavior decision information to obtain the second parameter; and the second parameter of the driving behavior decision model is adjusted according to the first parameter.
  • the driving behavior decision model includes a first model and a second model; wherein, the adjustment unit 1040 is specifically configured to:
  • the parameters of the first model are adjusted according to the driving behavior decision information to obtain the second parameter; when the first preset condition is met, the parameters of the second model are updated to the second parameter, where the first preset condition is a preset time interval or a preset number of adjustments to the parameters of the first model.
  • the adjustment unit 1040 is specifically configured to adjust the parameters of the first model and/or the parameters of the second model according to the first parameters.
  • Optionally, the decision-making unit 1010 is specifically configured to: based on the dynamics model and the kinematics model of the vehicle, predict the driving behavior of the vehicle at one or more subsequent moments according to the state information, to obtain all possible driving behaviors at the one or more moments; and
  • use the driving behavior decision model to evaluate all the possible driving behaviors to obtain the driving behavior decision information.
  • Optionally, in a case where the driving behavior decision model includes a first model and a second model, the decision-making unit 1010 is specifically configured to: use the second model to evaluate all the possible driving behaviors to obtain the driving behavior decision information.
  • Optionally, the receiving unit 1030 is further configured to:
  • receive the third parameter of the imitation learning model sent by the server, where the third parameter is obtained after training the imitation learning model based on the imitation learning method and using the data output by the decision expert system, and the decision expert system is designed according to the driver's driving data and the dynamic characteristics of the vehicle;
  • the adjustment unit 1040 is also used for:
  • the initial parameters of the driving behavior decision model are determined according to the third parameter.
  • Optionally, the first parameter is obtained by the server after training the imitation learning model based on an imitation learning method and using the driving behavior decision information that satisfies a second preset condition, and the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • the second preset condition further includes that the noise of the state information is within a first preset range.
  • the state information is one of a plurality of state information
  • the second preset condition further includes that the plurality of state information is acquired in multiple scenarios.
  • Optionally, the second preset condition further includes: among the plurality of state information, the difference between the number of state information acquired in any one of the plurality of scenes and the number of state information acquired in any other one of the plurality of scenes is within a second preset range.
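  • As an illustration only, a minimal sketch (in Python) of filtering driving behavior decision information with such a second preset condition is given below; the record fields, the noise limit, and the per-scene balancing rule are assumptions for illustration.

    from collections import defaultdict

    NOISE_LIMIT = 0.1          # first preset range (illustrative value)
    COUNT_SPREAD_LIMIT = 50    # second preset range (illustrative value)

    def filter_high_quality(records, reasonable_decision):
        """records: dicts with 'state', 'decision', 'noise' and 'scene' keys;
        reasonable_decision(state) returns the reasonable decision for a state."""
        per_scene = defaultdict(list)
        for rec in records:
            if rec['decision'] != reasonable_decision(rec['state']):
                continue                              # not a reasonable decision
            if abs(rec['noise']) > NOISE_LIMIT:
                continue                              # state information too noisy
            per_scene[rec['scene']].append(rec)
        kept = []
        if per_scene:
            cap = min(len(v) for v in per_scene.values()) + COUNT_SPREAD_LIMIT
            for scene_records in per_scene.values():
                kept.extend(scene_records[:cap])      # keep scene counts balanced
        return kept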
  • FIG. 11 is a schematic block diagram of an apparatus 1100 for training a driving behavior decision model provided by an embodiment of the present application. It should be understood that the device 1100 for training a driving behavior decision model shown in FIG. 11 is only an example, and the device 1100 in the embodiment of the present application may further include other modules or units. It should be understood that the device 1100 for training a driving behavior decision model can execute each step in the method of FIG. 6, FIG. 7 or FIG. 8. In order to avoid repetition, it will not be detailed here.
  • the receiving unit 1110 is configured to receive driving behavior decision information sent by a vehicle, where the driving behavior decision information is obtained after the vehicle uses a driving behavior decision model to make a decision based on the state information of the vehicle;
  • the training unit 1120 is configured to train an imitation learning model according to the driving behavior decision information based on the imitation learning method, to obtain a first parameter of the imitation learning model, where the first parameter is used to adjust the parameters of the driving behavior decision model;
  • the sending unit 1130 is configured to send the first parameter to the vehicle.
  • Optionally, the training unit 1120 is further configured to: train the imitation learning model based on the imitation learning method and using the data output by a decision expert system, to obtain a third parameter of the imitation learning model, where the third parameter is used to determine the initial parameters of the driving behavior decision model, and
  • the decision expert system is designed according to the driver's driving data and the dynamic characteristics of the vehicle;
  • the sending unit 1130 is further configured to send the third parameter to the vehicle.
  • Optionally, the training unit 1120 is specifically configured to:
  • based on the imitation learning method, train the imitation learning model according to the driving behavior decision information meeting a second preset condition, to obtain the first parameter of the imitation learning model, where the second preset condition includes that the driving behavior decision information is a reasonable driving behavior decision corresponding to the state information.
  • the second preset condition further includes that the noise of the state information is within a first preset range.
  • the state information is one of a plurality of state information
  • the second preset condition further includes that the plurality of state information is acquired in multiple scenarios.
  • Optionally, the second preset condition further includes: among the plurality of state information, the difference between the number of state information acquired in any one of the plurality of scenes and the number of state information acquired in any other one of the plurality of scenes is within a second preset range.
  • Fig. 12 is a schematic diagram of the hardware structure of an apparatus for training a driving behavior decision model provided by an embodiment of the present application.
  • the device 3000 for training a driving behavior decision model shown in FIG. 12 includes a memory 3001, a processor 3002, a communication interface 3003, and a bus 3004.
  • the memory 3001, the processor 3002, and the communication interface 3003 implement communication connections between each other through the bus 3004.
  • the memory 3001 may be a read only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM).
  • the memory 3001 may store a program. When the program stored in the memory 3001 is executed by the processor 3002, the processor 3002 is configured to execute each step of the method for training a driving behavior decision model in the embodiment of the present application.
  • The processor 3002 may adopt a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs to implement the method for training a driving behavior decision model in the method embodiments of the present application.
  • the processor 3002 may also be an integrated circuit chip with signal processing capability. For example, it may be the chip shown in FIG. 3.
  • each step of the method for training a driving behavior decision model of the present application can be completed by hardware integrated logic circuits in the processor 3002 or instructions in the form of software.
  • The aforementioned processor 3002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present application may be directly embodied as being executed and completed by a hardware decoding processor, or executed and completed by a combination of hardware and software modules in the decoding processor.
  • The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory 3001, and the processor 3002 reads the information in the memory 3001 and, in combination with its hardware, completes the functions required by the units included in the apparatus for training a driving behavior decision model, or executes the method for training a driving behavior decision model in the method embodiments of the present application.
  • the communication interface 3003 uses a transceiver device such as but not limited to a transceiver to implement communication between the device 3000 and other devices or communication networks.
  • the state information of the vehicle, the driving data of the vehicle, and the training data required in the process of training the driving behavior decision model can be obtained through the communication interface 3003.
  • the bus 3004 may include a path for transferring information between various components of the device 3000 (for example, the memory 3001, the processor 3002, and the communication interface 3003).
  • the processor in the embodiments of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processors, DSPs), and application-specific integrated circuits. (application specific integrated circuit, ASIC), ready-made programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory in the embodiments of the present application may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory can be a read-only memory (ROM), a programmable read-only memory (programmable ROM, PROM), an erasable programmable read-only memory (erasable PROM, EPROM), an electrically erasable programmable read-only memory (electrically EPROM, EEPROM), or a flash memory.
  • the volatile memory may be random access memory (RAM), which is used as an external cache.
  • By way of example rather than limitation, many forms of RAM are available, for example, static random access memory (static RAM, SRAM), dynamic random access memory (dynamic RAM, DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchlink dynamic random access memory (synchlink DRAM, SLDRAM), and direct rambus random access memory (direct rambus RAM, DR RAM).
  • All or part of the foregoing embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • the above-mentioned embodiments may be implemented in the form of a computer program product in whole or in part.
  • the computer program product includes one or more computer instructions or computer programs.
  • When the computer instructions or computer programs are loaded or executed on a computer, the processes or functions described in the embodiments of the present application are generated in whole or in part.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium.
  • the semiconductor medium may be a solid state drive.
  • "At least one" refers to one or more, and "multiple" refers to two or more.
  • "At least one of the following items (a)" or a similar expression refers to any combination of these items, including a single item (a) or any combination of multiple items (a).
  • For example, at least one of a, b, or c can represent: a; b; c; a and b; a and c; b and c; or a, b, and c, where each of a, b, and c can be singular or plural.
  • The size of the sequence numbers of the above-mentioned processes does not mean the order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of the present application.
  • the disclosed system, device, and method can be implemented in other ways.
  • The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • If the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • The technical solution of this application essentially, or the part that contributes to the existing technology, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions used to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disc.

Abstract

A method and apparatus for training a driving behavior decision model. The method includes: using a driving behavior decision model to make a decision according to state information of a vehicle, to obtain driving behavior decision information; sending the driving behavior decision information to a server; receiving a first parameter of an imitation learning model sent by the server, where the first parameter is obtained by the server after training the imitation learning model based on an imitation learning method and using the driving behavior decision information; and adjusting parameters of the driving behavior decision model according to the driving behavior decision information and the first parameter. The method helps to improve the training efficiency of the driving behavior decision model, and the driving behavior decision model obtained after training can output reasonable and reliable driving behavior decision information.

Description

训练驾驶行为决策模型的方法及装置
本申请要求于2020年06月06日提交中国专利局、申请号为202010508722.3、申请名称为“训练驾驶行为决策模型的方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及自动驾驶领域,并且更具体地,涉及训练驾驶行为决策模型的方法及装置。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
自动驾驶是人工智能领域的一种主流应用,自动驾驶技术依靠计算机视觉、雷达、监控装置和全球定位系统等协同合作,让机动车辆可以在不需要人类主动操作下,实现自动驾驶。自动驾驶的车辆使用各种计算系统来帮助将乘客从一个位置运输到另一位置。一些自动驾驶车辆可能要求来自操作者(诸如,领航员、驾驶员、或者乘客)的一些初始输入或者连续输入。自动驾驶车辆准许操作者从手动模操作式切换到自动驾驶模式或者介于两者之间的模式。由于自动驾驶技术无需人类来驾驶机动车辆,所以理论上能够有效避免人类的驾驶失误,减少交通事故的发生,且能够提高公路的运输效率。因此,自动驾驶技术越来越受到重视。
驾驶行为决策是自动驾驶技术中的重要组成部分,具体包括根据车辆的状态信息为车辆选择待执行动作(例如,加速、减速或转向等),并根据选择的待执行动作对该车辆进行控制。驾驶行为决策通常是由驾驶行为决策模型推测得到的。常用的驾驶行为决策模型是通过强化学习训练得到的。但是,现有通过强化学习方法训练驾驶行为决策模型的训练效率较低。
发明内容
本申请提供一种训练驾驶行为决策模型的方法及装置,有助于提高驾驶行为决策模型的训练效率。
第一方面,提供了一种训练驾驶行为决策模型的方法,该方法包括:
使用驾驶行为决策模型,根据车辆的状态信息进行决策,得到驾驶行为决策信息;向服务器发送所述驾驶行为决策信息;接收所述服务器发送的模仿学习模型的第一参数,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学 习模型后得到的;根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数。
所述模仿学习方法属于常见的监督学习方法。通常,监督学习方法可以在训练的过程中利用真值(或称为标签)计算模型(例如,驾驶行为决策模型)的损失值,并使用计算得到的损失值去调整该模型的参数,因此,监督学习方法的学习效率较高,基于监督学习方法往往可以在较短的时间内得到满足用户需求的模型,同时,由于在训练的过程中有真值参与,基于监督学习方法训练得到的模型往往也比较可靠。
在本申请实施例中,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的,基于模仿学习方法可以保证所述模仿学习模型的训练效果,此时,根据所述驾驶行为决策信息与所述第一参数调整所述驾驶行为决策模型的参数,可以提高驾驶行为决策模型的学习效率。
其中,所述模仿学习方法可以包括监督学习(supervised learning)、生成对抗网络(generative adversarial network,GAN)及逆强化学习(inverse reinforcement learning,IRL)等。
结合第一方面,在第一方面的某些实现方式中,所述根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数,包括:基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数;根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数。
在本申请实施例中,可以基于强化学习方法对所述驾驶行为决策模型的参数进行调整得到第二参数,并根据所述第一参数调整所述驾驶行为决策模型的第二参数,可以使得所述驾驶行为决策模型具有在线学习能力及离线学习能力,即可以在所述驾驶行为决策模型具备在线学习能力的前提下、进一步提高驾驶行为决策模型的学习效率。
结合第一方面,在第一方面的某些实现方式中,所述驾驶行为决策模型包括第一模型和第二模型;其中,所述基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数,包括:基于强化学习方法,根据所述驾驶行为决策信息对所述第一模型的参数进行调整,得到所述第二参数;在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,所述第一预设条件为间隔预设的时间间隔或对所述第一模型的参数调整预设的次数。
在本申请实施例中,在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,可以避免因频繁调整所述第二模型的参数而导致所述第二模型的输出不稳定,因此,能够提高所述驾驶行为决策信息的可靠性。
结合第一方面,在第一方面的某些实现方式中,所述根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数,包括:根据所述第一参数,调整所述第一模型的参数和/或所述第二模型的参数。
在本申请实施例中,可以灵活地根据所述第一参数调整所述第一模型及所述第二模型中至少一个的参数。
结合第一方面,在第一方面的某些实现方式中,所述使用驾驶行为决策模型,根据所述状态信息进行决策,得到驾驶行为决策信息,包括:基于所述车辆的动力学模型及运动学模型,根据所述状态信息对所述车辆在之后一个或多个时刻的行驶行为进行预测,得到所述一个或多个时刻的所有可能的行驶行为;使用所述驾驶行为决策模型,对所述所有可 能的行驶行为进行评估,得到所述驾驶行为决策信息。
在本申请实施例中,结合所述车辆的动力学模型及运动学模型进行驾驶行为决策,可以提高所述驾驶行为决策信息的合理性。
结合第一方面,在第一方面的某些实现方式中,在所述驾驶行为决策模型包括第一模型和第二模型的情况下,所述使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息,包括:使用所述第二模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
在本申请实施例中,所述第一模型的参数变化比较频繁,使用所述第二模型进行决策,能够提高所述驾驶行为决策信息的可靠性。
结合第一方面,在第一方面的某些实现方式中,所述方法还包括:接收所述服务器发送的所述模仿学习模型的第三参数,所述第三参数是基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型后得到的,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;根据所述第三参数确定所述驾驶行为决策模型的初始参数。
在本申请实施例中,根据预先训练好的模仿学习模型的第三参数确定所述驾驶行为决策模型的初始参数,可以提高所述驾驶行为决策模型的稳定性,避免所述驾驶行为决策模型输出冒险的(或不合理的)驾驶行为决策信息。
结合第一方面,在第一方面的某些实现方式中,所述第一参数是所述服务器基于模仿学习方法、使用满足第二预设条件的所述驾驶行为决策信息训练所述模仿学习模型后得到的,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
在本申请实施例中,使用所述状态信息对应的合理驾驶行为决策训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第一方面,在第一方面的某些实现方式中,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
在本申请实施例中,所述状态信息的噪声在第一预设范围内,可以使得基于所述状态信息决策后得到的驾驶行为决策信息更加合理,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第一方面,在第一方面的某些实现方式中,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
在本申请实施例中,在上述多个场景中获取所述状态信息,可以使得驾驶行为决策模型的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的场景更加丰富,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而有助于进一步能够提高驾驶行为决策模型的学习效率。
结合第一方面,在第一方面的某些实现方式中,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
在本申请实施例中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内, 可以使得在各个场景中得到的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的数量更加均衡,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以保证所述模仿学习模型的训练效果,从而避免出现所述驾驶行为决策模型在某个场景存在过拟合的问题。
第二方面,提供了一种训练驾驶行为决策模型的方法,该方法包括:
接收车辆发送的驾驶行为决策信息,所述驾驶行为决策信息是所述车辆使用驾驶行为决策模型根据所述车辆的状态信息进行决策后得到的;基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数,所述第一参数用于调整所述驾驶行为决策模型的参数;向所述车辆发送所述第一参数。
所述模仿学习方法属于常见的监督学习方法。通常,监督学习方法可以在训练的过程中利用真值(或称为标签)计算模型(例如,驾驶行为决策模型)的损失值,并使用计算得到的损失值去调整该模型的参数,因此,监督学习方法的学习效率较高,基于监督学习方法往往可以在较短的时间内得到满足用户需求的模型,同时,由于在训练的过程中有真值参与,基于监督学习方法训练得到的模型往往也比较可靠。
在本申请实施例中,基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型得到所述模型学习模型的第一参数,基于模仿学习方法可以保证所述模仿学习模型的训练效果,此时,根据所述第一参数调整所述驾驶行为决策模型的参数,可以提高驾驶行为决策模型的学习效率。
其中,所述模仿学习方法可以包括监督学习(supervised learning)、生成对抗网络(generative adversarial network,GAN)及逆强化学习(inverse reinforcement learning,IRL)等。
结合第二方面,在第二方面的某些实现方式中,所述方法还包括:基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型,得到所述模仿学习模型的第三参数,其中,所述第三参数用于确定所述驾驶行为决策模型的初始参数,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;向所述车辆发送所述第三参数。
在本申请实施例中,根据预先训练好的模仿学习模型的第三参数确定所述驾驶行为决策模型的初始参数,可以提高所述驾驶行为决策模型的稳定性,避免所述驾驶行为决策模型输出冒险的(或不合理的)驾驶行为决策信息。
结合第二方面,在第二方面的某些实现方式中,所述基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数,包括:基于模仿学习方法,根据满足第二预设条件的所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
在本申请实施例中,使用所述状态信息对应的合理驾驶行为决策训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第二方面,在第二方面的某些实现方式中,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
在本申请实施例中,所述状态信息的噪声在第一预设范围内,可以使得基于所述状态信息决策后得到的驾驶行为决策信息更加合理,此时,根据该驾驶行为决策信息训练所述 模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第二方面,在第二方面的某些实现方式中,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
在本申请实施例中,在上述多个场景中获取所述状态信息,可以使得驾驶行为决策模型的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的场景更加丰富,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而有助于进一步能够提高驾驶行为决策模型的学习效率。
结合第二方面,在第二方面的某些实现方式中,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
在本申请实施例中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内,可以使得在各个场景中得到的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的数量更加均衡,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以保证所述模仿学习模型的训练效果,从而避免出现所述驾驶行为决策模型在某个场景存在过拟合的问题。
第三方面,提供了一种训练驾驶行为决策模型的装置,包括:
决策单元,用于使用驾驶行为决策模型,根据车辆的状态信息进行决策,得到驾驶行为决策信息;发送单元,用于向服务器发送所述驾驶行为决策信息;接收单元,用于接收所述服务器发送的模仿学习模型的第一参数,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的;调整单元,用于根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数。
所述模仿学习方法属于常见的监督学习方法。通常,监督学习方法可以在训练的过程中利用真值(或称为标签)计算模型(例如,驾驶行为决策模型)的损失值,并使用计算得到的损失值去调整该模型的参数,因此,监督学习方法的学习效率较高,基于监督学习方法往往可以在较短的时间内得到满足用户需求的模型,同时,由于在训练的过程中有真值参与,基于监督学习方法训练得到的模型往往也比较可靠。
在本申请实施例中,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的,基于模仿学习方法可以保证所述模仿学习模型的训练效果,此时,根据所述驾驶行为决策信息与所述第一参数调整所述驾驶行为决策模型的参数,可以提高驾驶行为决策模型的学习效率。
其中,所述模仿学习方法可以包括监督学习(supervised learning)、生成对抗网络(generative adversarial network,GAN)及逆强化学习(inverse reinforcement learning,IRL)等。
结合第三方面,在第三方面的某些实现方式中,所述调整单元具体用于:基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数;根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数。
在本申请实施例中,可以基于强化学习方法对所述驾驶行为决策模型的参数进行调整得到第二参数,并根据所述第一参数调整所述驾驶行为决策模型的第二参数,可以使得所 述驾驶行为决策模型具有在线学习能力及离线学习能力,即可以在所述驾驶行为决策模型具备在线学习能力的前提下、进一步提高驾驶行为决策模型的学习效率。
结合第三方面,在第三方面的某些实现方式中,所述驾驶行为决策模型包括第一模型和第二模型;其中,所述调整单元具体用于:基于强化学习方法,根据所述驾驶行为决策信息对所述第一模型的参数进行调整,得到所述第二参数;在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,所述第一预设条件为间隔预设的时间间隔或对所述第一模型的参数调整预设的次数。
在本申请实施例中,在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,可以避免因频繁调整所述第二模型的参数而导致所述第二模型的输出不稳定,因此,能够提高所述驾驶行为决策信息的可靠性。
结合第三方面,在第三方面的某些实现方式中,所述调整单元具体用于:根据所述第一参数,调整所述第一模型的参数和/或所述第二模型的参数。
在本申请实施例中,可以灵活地根据所述第一参数调整所述第一模型及所述第二模型中至少一个的参数。
结合第三方面,在第三方面的某些实现方式中,所述决策单元具体用于:基于所述车辆的动力学模型及运动学模型,根据所述状态信息对所述车辆在之后一个或多个时刻的行驶行为进行预测,得到所述一个或多个时刻的所有可能的行驶行为;使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
在本申请实施例中,结合所述车辆的动力学模型及运动学模型进行驾驶行为决策,可以提高所述驾驶行为决策信息的合理性。
结合第三方面,在第三方面的某些实现方式中,在所述驾驶行为决策模型包括第一模型和第二模型的情况下,所述决策单元具体用于:使用所述第二模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
在本申请实施例中,所述第一模型的参数变化比较频繁,使用所述第二模型进行决策,能够提高所述驾驶行为决策信息的可靠性。
结合第三方面,在第三方面的某些实现方式中,所述接收单元还用于:接收所述服务器发送的所述模仿学习模型的第三参数,所述第三参数是基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型后得到的,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;所述调整单元还用于:根据所述第三参数确定所述驾驶行为决策模型的初始参数。
在本申请实施例中,根据预先训练好的模仿学习模型的第三参数确定所述驾驶行为决策模型的初始参数,可以提高所述驾驶行为决策模型的稳定性,避免所述驾驶行为决策模型输出冒险的(或不合理的)驾驶行为决策信息。
结合第三方面,在第三方面的某些实现方式中,所述第一参数是所述服务器基于模仿学习方法、使用满足第二预设条件的所述驾驶行为决策信息训练所述模仿学习模型后得到的,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
在本申请实施例中,使用所述状态信息对应的合理驾驶行为决策训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第三方面,在第三方面的某些实现方式中,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
在本申请实施例中,所述状态信息的噪声在第一预设范围内,可以使得基于所述状态信息决策后得到的驾驶行为决策信息更加合理,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第三方面,在第三方面的某些实现方式中,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
在本申请实施例中,在上述多个场景中获取所述状态信息,可以使得驾驶行为决策模型的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的场景更加丰富,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而有助于进一步能够提高驾驶行为决策模型的学习效率。
结合第三方面,在第三方面的某些实现方式中,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
在本申请实施例中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内,可以使得在各个场景中得到的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的数量更加均衡,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以保证所述模仿学习模型的训练效果,从而避免出现所述驾驶行为决策模型在某个场景存在过拟合的问题。
第四方面,提供了一种训练驾驶行为决策模型的装置,包括:
接收单元,用于接收车辆发送的驾驶行为决策信息,所述驾驶行为决策信息是所述车辆使用驾驶行为决策模型根据所述车辆的状态信息进行决策后得到的;训练单元,用于基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数,所述第一参数用于调整所述驾驶行为决策模型的参数;发送单元,用于向所述车辆发送所述第一参数。
所述模仿学习方法属于常见的监督学习方法。通常,监督学习方法可以在训练的过程中利用真值(或称为标签)计算模型(例如,驾驶行为决策模型)的损失值,并使用计算得到的损失值去调整该模型的参数,因此,监督学习方法的学习效率较高,基于监督学习方法往往可以在较短的时间内得到满足用户需求的模型,同时,由于在训练的过程中有真值参与,基于监督学习方法训练得到的模型往往也比较可靠。
在本申请实施例中,基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型得到所述模型学习模型的第一参数,基于模仿学习方法可以保证所述模仿学习模型的训练效果,此时,根据所述第一参数调整所述驾驶行为决策模型的参数,可以提高驾驶行为决策模型的学习效率。
其中,所述模仿学习方法可以包括监督学习(supervised learning)、生成对抗网络(generative adversarial network,GAN)及逆强化学习(inverse reinforcement learning,IRL)等。
结合第四方面,在第四方面的某些实现方式中,所述训练单元还用于:基于模仿学习 方法、使用决策专家系统输出的数据训练所述模仿学习模型,得到所述模仿学习模型的第三参数,其中,所述第三参数用于确定所述驾驶行为决策模型的初始参数,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;所述发送单元还用于:向所述车辆发送所述第三参数。
在本申请实施例中,根据预先训练好的模仿学习模型的第三参数确定所述驾驶行为决策模型的初始参数,可以提高所述驾驶行为决策模型的稳定性,避免所述驾驶行为决策模型输出冒险的(或不合理的)驾驶行为决策信息。
结合第四方面,在第四方面的某些实现方式中,所述训练单元具体用于:基于模仿学习方法,根据满足第二预设条件的所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
在本申请实施例中,使用所述状态信息对应的合理驾驶行为决策训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第四方面,在第四方面的某些实现方式中,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
在本申请实施例中,所述状态信息的噪声在第一预设范围内,可以使得基于所述状态信息决策后得到的驾驶行为决策信息更加合理,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而能够进一步提高驾驶行为决策模型的学习效率。
结合第四方面,在第四方面的某些实现方式中,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
在本申请实施例中,在上述多个场景中获取所述状态信息,可以使得驾驶行为决策模型的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的场景更加丰富,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以进一步提高所述模仿学习模型的训练效果,从而有助于进一步能够提高驾驶行为决策模型的学习效率。
结合第四方面,在第四方面的某些实现方式中,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
在本申请实施例中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内,可以使得在各个场景中得到的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的数量更加均衡,此时,根据该驾驶行为决策信息训练所述模仿学习模型,可以保证所述模仿学习模型的训练效果,从而避免出现所述驾驶行为决策模型在某个场景存在过拟合的问题。
第五方面,提供了一种训练驾驶行为决策模型的装置,所述装置包括存储介质和中央处理器,所述存储介质可以是非易失性存储介质,所述存储介质中存储有计算机可执行程序,所述中央处理器与所述非易失性存储介质连接,并执行所述计算机可执行程序以实现第一方面的任一可能的实现方式中的方法。
第六方面,提供了一种训练驾驶行为决策模型的装置,所述装置包括存储介质和中央 处理器,所述存储介质可以是非易失性存储介质,所述存储介质中存储有计算机可执行程序,所述中央处理器与所述非易失性存储介质连接,并执行所述计算机可执行程序以实现第二方面的任一可能的实现方式中的方法。
第七方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行第一方面的任一可能的实现方式或者第二方面的任一可能的实现方式中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面的任一可能的实现方式或者第二方面的任一可能的实现方式中的方法。
第八方面,提供一种计算机可读存储介质,所述计算机可读介质存储用于设备执行的程序代码,所述程序代码包括用于执行第一方面的任一可能的实现方式或者第二方面的任一可能的实现方式中的方法的指令。
第九方面,提供一种汽车,所述汽车包括上述第三方面的任一可能的实现方式或第五方面所述的训练驾驶行为决策模型的装置。
第十方面,提供一种服务器,所述服务器包括上述第四方面的任一可能的实现方式或第六方面所述的训练驾驶行为决策模型的装置。
在本申请实施例中,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的,基于模仿学习方法可以保证所述模仿学习模型的训练效果,此时,根据所述驾驶行为决策信息与所述第一参数调整所述驾驶行为决策模型的参数,可以提高驾驶行为决策模型的学习效率。
附图说明
图1为本申请实施例提供的一种自动驾驶车辆的结构示意图;
图2为本申请实施例提供的一种自动驾驶系统的结构示意图;
图3为本申请实施例提供的一种神经网络处理器的结构示意图;
图4为本申请实施例提供的一种云侧指令自动驾驶车辆的应用示意图;
图5为本申请一个实施例提供的训练驾驶行为决策模型的方法的示意性框图;
图6为本申请另一个实施例提供的训练驾驶行为决策模型的方法的示意性框图;
图7为本申请另一个实施例提供的训练驾驶行为决策模型的方法的示意性框图;
图8为本申请一个实施例提供的训练驾驶行为决策模型的方法的示意性流程图;
图9为本申请实施例提供的RBFNN的示意性框图;
图10是本申请一个实施例提供的训练驾驶行为决策模型的装置的示意性框图;
图11是本申请另一个实施例提供的训练驾驶行为决策模型的装置的示意性框图;
图12是本申请再一个实施例提供的训练驾驶行为决策模型的装置的示意性框图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
本申请实施例的技术方案可以应用于各种车辆,该车辆具体可以为内燃机车、智能电动车或者混合动力车,或者,该车辆也可以为其他动力类型的车辆等,本申请实施例对此并不限定。
本申请实施例中的车辆可以为自动驾驶车辆,例如,所述自动驾驶车辆可以配置有自动驾驶模式,该自动驾驶模式可以为完全自动驾驶模式,或者,也可以为部分自动驾驶模式,本申请实施例对此并不限定。
本申请实施例中的车辆还可以配置有其他驾驶模式,所述其他驾驶模式可以包括运动模式、经济模式、标准模式、雪地模式及爬坡模式等多种驾驶模式中的一种或多种。所述自动驾驶车辆可以在自动驾驶模式和上述多种(驾驶员驾驶车辆的)驾驶模型之间进行切换,本申请实施例对此并不限定。
图1是本申请实施例提供的车辆100的功能框图。
在一个实施例中,将车辆100配置为完全或部分地自动驾驶模式。
例如,车辆100可以在处于自动驾驶模式中的同时控制自身,并且可通过人为操作来确定车辆及其周边环境的当前状态,确定周边环境中的至少一个其他车辆的可能行为,并确定该其他车辆执行可能行为的可能性相对应的置信水平,基于所确定的信息来控制车辆100。在车辆100处于自动驾驶模式中时,可以将车辆100置为在没有和人交互的情况下操作。
车辆100可包括各种子系统,例如行进系统102、传感器系统104、控制系统106、一个或多个外围设备108以及电源110、计算机系统112和用户接口116。
可选地,车辆100可包括更多或更少的子系统,并且每个子系统可包括多个元件。另外,车辆100的每个子系统和元件可以通过有线或者无线互连。
行进系统102可包括为车辆100提供动力运动的组件。在一个实施例中,推进系统102可包括引擎118、能量源119、传动装置120和车轮/轮胎121。引擎118可以是内燃引擎、电动机、空气压缩引擎或其他类型的引擎组合,例如,气油发动机和电动机组成的混动引擎,内燃引擎和空气压缩引擎组成的混动引擎。引擎118将能量源119转换成机械能量。
能量源119的示例包括汽油、柴油、其他基于石油的燃料、丙烷、其他基于压缩气体的燃料、乙醇、太阳能电池板、电池和其他电力来源。能量源119也可以为车辆100的其他系统提供能量。
传动装置120可以将来自引擎118的机械动力传送到车轮121。传动装置120可包括变速箱、差速器和驱动轴。
在一个实施例中,传动装置120还可以包括其他器件,比如离合器。其中,驱动轴可包括可耦合到一个或多个车轮121的一个或多个轴。
传感器系统104可包括感测关于车辆100周边的环境的信息的若干个传感器。
例如,传感器系统104可包括定位系统122(定位系统可以是GPS系统,也可以是北斗系统或者其他定位系统)、惯性测量单元(inertial measurement unit,IMU)124、雷达126、激光测距仪128以及相机130。传感器系统104还可包括被监视车辆100的内部系统的传感器(例如,车内空气质量监测器、燃油量表、机油温度表等)。来自这些传感器中的一个或多个的传感器数据可用于检测对象及其相应特性(位置、形状、方向、速度等)。这种检测和识别是自主车辆100的安全操作的关键功能。
定位系统122可用于估计车辆100的地理位置。IMU 124用于基于惯性加速度来感测车辆100的位置和朝向变化。在一个实施例中,IMU 124可以是加速度计和陀螺仪的组合。
雷达126可利用无线电信号来感测车辆100的周边环境内的物体。在一些实施例中, 除了感测物体以外,雷达126还可用于感测物体的速度和/或前进方向。
激光测距仪128可利用激光来感测车辆100所位于的环境中的物体。在一些实施例中,激光测距仪128可包括一个或多个激光源、激光扫描器以及一个或多个检测器,以及其他系统组件。
相机130可用于捕捉车辆100的周边环境的多个图像。相机130可以是静态相机或视频相机。
控制系统106为控制车辆100及其组件的操作。控制系统106可包括各种元件,其中包括转向系统132、油门134、制动单元136、传感器融合算法138、计算机视觉系统140、路线控制系统142以及障碍物避免系统144。
转向系统132可操作来调整车辆100的前进方向。例如在一个实施例中可以为方向盘系统。
油门134用于控制引擎118的操作速度并进而控制车辆100的速度。
制动单元136用于控制车辆100减速。制动单元136可使用摩擦力来减慢车轮121。在其他实施例中,制动单元136可将车轮121的动能转换为电流。制动单元136也可采取其他形式来减慢车轮121转速从而控制车辆100的速度。
计算机视觉系统140可以操作来处理和分析由相机130捕捉的图像以便识别车辆100周边环境中的物体和/或特征。所述物体和/或特征可包括交通信号、道路边界和障碍物。计算机视觉系统140可使用物体识别算法、运动中恢复结构(Structure from Motion,SFM)算法、视频跟踪和其他计算机视觉技术。在一些实施例中,计算机视觉系统140可以用于为环境绘制地图、跟踪物体、估计物体的速度等等。
路线控制系统142用于确定车辆100的行驶路线。在一些实施例中,路线控制系统142可结合来自传感器138、GPS 122和一个或多个预定地图的数据以为车辆100确定行驶路线。
障碍物避免系统144用于识别、评估和避免或者以其他方式越过车辆100的环境中的潜在障碍物。
当然,在一个实例中,控制系统106可以增加或替换地包括除了所示出和描述的那些以外的组件。或者也可以减少一部分上述示出的组件。
车辆100通过外围设备108与外部传感器、其他车辆、其他计算机系统或用户之间进行交互。外围设备108可包括无线通信系统146、车载电脑148、麦克风150和/或扬声器152。
在一些实施例中,外围设备108提供车辆100的用户与用户接口116交互的手段。例如,车载电脑148可向车辆100的用户提供信息。用户接口116还可操作车载电脑148来接收用户的输入。车载电脑148可以通过触摸屏进行操作。在其他情况中,外围设备108可提供用于车辆100与位于车内的其它设备通信的手段。例如,麦克风150可从车辆100的用户接收音频(例如,语音命令或其他音频输入)。类似地,扬声器152可向车辆100的用户输出音频。
无线通信系统146可以直接地或者经由通信网络来与一个或多个设备无线通信。例如,无线通信系统146可使用3G蜂窝通信,例如CDMA、EVD0、GSM/GPRS,或者4G蜂窝通信,例如LTE。或者5G蜂窝通信。无线通信系统146可利用WiFi与无线局域网(wireless local area network,WLAN)通信。在一些实施例中,无线通信系统146可利用 红外链路、蓝牙或ZigBee与设备直接通信。其他无线协议,例如各种车辆通信系统,例如,无线通信系统146可包括一个或多个专用短程通信(dedicated short range communications,DSRC)设备,这些设备可包括车辆和/或路边台站之间的公共和/或私有数据通信。
电源110可向车辆100的各种组件提供电力。在一个实施例中,电源110可以为可再充电锂离子或铅酸电池。这种电池的一个或多个电池组可被配置为电源为车辆100的各种组件提供电力。在一些实施例中,电源110和能量源119可一起实现,例如一些全电动车中那样。
车辆100的部分或所有功能受计算机系统112控制。计算机系统112可包括至少一个处理器113,处理器113执行存储在例如数据存储装置114这样的非暂态计算机可读介质中的指令115。计算机系统112还可以是采用分布式方式控制车辆100的个体组件或子系统的多个计算设备。
处理器113可以是任何常规的处理器,诸如商业可获得的CPU。替选地,该处理器可以是诸如ASIC或其它基于硬件的处理器的专用设备。尽管图1功能性地图示了处理器、存储器、和在相同块中的计算机110的其它元件,但是本领域的普通技术人员应该理解该处理器、计算机、或存储器实际上可以包括在相同的物理外壳内的多个处理器、计算机、或存储器。例如,存储器可以是硬盘驱动器或位于不同于计算机110的外壳内的其它存储介质。因此,对处理器或计算机的引用将被理解为包括对可以或者可以不并行操作的处理器或计算机或存储器的集合的引用。不同于使用单一的处理器来执行此处所描述的步骤,诸如转向组件和减速组件的一些组件每个都可以具有其自己的处理器,所述处理器只执行与特定于组件的功能相关的计算。
在此处所描述的各个方面中,处理器可以位于远离该车辆并且与该车辆进行无线通信。在其它方面中,此处所描述的过程中的一些在布置于车辆内的处理器上执行而其它则由远程处理器执行,包括采取执行单一操纵的必要步骤。
在一些实施例中,数据存储装置114可包含指令115(例如,程序逻辑),指令115可被处理器113执行来执行车辆100的各种功能,包括以上描述的那些功能。数据存储装置114也可包含额外的指令,包括向推进系统102、传感器系统104、控制系统106和外围设备108中的一个或多个发送数据、从其接收数据、与其交互和/或对其进行控制的指令。
除了指令115以外,数据存储装置114还可存储数据,例如道路地图、路线信息,车辆的位置、方向、速度以及其它这样的车辆数据,以及其他信息。这种信息可在车辆100在自主、半自主和/或手动模式中操作期间被车辆100和计算机系统112使用。
用户接口116,用于向车辆100的用户提供信息或从其接收信息。可选地,用户接口116可包括在外围设备108的集合内的一个或多个输入/输出设备,例如无线通信系统146、车车在电脑148、麦克风150和扬声器152。
计算机系统112可基于从各种子系统(例如,行进系统102、传感器系统104和控制系统106)以及从用户接口116接收的输入来控制车辆100的功能。例如,计算机系统112可利用来自控制系统106的输入以便控制转向单元132来避免由传感器系统104和障碍物避免系统144检测到的障碍物。在一些实施例中,计算机系统112可操作来对车辆100及其子系统的许多方面提供控制。
可选地,上述这些组件中的一个或多个可与车辆100分开安装或关联。例如,数据存储装置114可以部分或完全地与车辆100分开存在。上述组件可以按有线和/或无线方式来通信地耦合在一起。
可选地,上述组件只是一个示例,实际应用中,上述各个模块中的组件有可能根据实际需要增添或者删除,图1不应理解为对本申请实施例的限制。
在道路行进的自动驾驶车辆,如上面的车辆100,可以识别其周围环境内的物体以确定对当前速度的调整。所述物体可以是其它车辆、交通控制设备、或者其它类型的物体。在一些示例中,可以独立地考虑每个识别的物体,并且基于物体的各自的特性,诸如它的当前速度、加速度、与车辆的间距等,可以用来确定自动驾驶默默所要调整的速度。
可选地,车辆100或者与车辆100相关联的计算设备(如图1的计算机系统112、计算机视觉系统140、数据存储装置114)可以基于所识别的物体的特性和周围环境的状态(例如,交通、雨、道路上的冰、等等)来预测所述识别的物体的行为。可选地,每一个所识别的物体都依赖于彼此的行为,因此还可以将所识别的所有物体全部一起考虑来预测单个识别的物体的行为。车辆100能够基于预测的所述识别的物体的行为来调整它的速度。换句话说,自动驾驶车辆能够基于所预测的物体的行为来确定车辆将需要调整到(例如,加速、减速、或者停止)什么稳定状态。在这个过程中,也可以考虑其它因素来确定车辆100的速度,诸如,车辆100在行驶的道路中的横向位置、道路的曲率、静态和动态物体的接近度等等。
除了提供调整自动驾驶车辆的速度的指令之外,计算设备还可以提供修改车辆100的转向角的指令,以使得自动驾驶车辆遵循给定的轨迹和/或维持与自动驾驶车辆附近的物体(例如,道路上的相邻车道中的轿车)的安全横向和纵向距离。
上述车辆100可以为轿车、卡车、摩托车、公共汽车、船、飞机、直升飞机、割草机、娱乐车、游乐场车辆、施工设备、电车、高尔夫球车、火车、和手推车等,本申请实施例不做特别的限定。
图2是本申请实施例提供的自动驾驶系统的示意图。
如图2所示的自动驾驶系统包括计算机系统101,其中,计算机系统101包括处理器103,处理器103和系统总线105耦合。处理器103可以是一个或者多个处理器,其中每个处理器都可以包括一个或多个处理器核。显示适配器(video adapter)107,显示适配器可以驱动显示器109,显示器109和系统总线105耦合。系统总线105通过总线桥111和输入输出(I/O)总线113耦合。I/O接口115和I/O总线耦合。I/O接口115和多种I/O设备进行通信,比如输入设备117(如:键盘,鼠标,触摸屏等),多媒体盘(media tray)121,(例如,CD-ROM,多媒体接口等)。收发器123(可以发送和/或接受无线电通信信号),摄像头155(可以捕捉静态和动态数字视频图像)和外部USB接口125。其中,可选地,和I/O接口115相连接的接口可以是USB接口。
其中,处理器103可以是任何传统处理器,包括精简指令集计算(reduced instruction set computer,RISC)处理器、复杂指令集计算(complex instruction set computer,CISC)处理器或上述的组合。可选地,处理器可以是诸如专用集成电路(application specific integrated circuit,ASIC)的专用装置。可选地,处理器103可以是神经网络处理器或者是神经网络处理器和上述传统处理器的组合。
可选地,在本文所述的各种实施例中,计算机系统101可位于远离自动驾驶车辆的地 方(例如,计算机系统101可位于云端或服务器),并且可与自动驾驶车辆无线通信。在其它方面,本文所述的一些过程在设置在自动驾驶车辆内的处理器上执行,其它由远程处理器执行,包括采取执行单个操纵所需的动作。
计算机101可以通过网络接口129和软件部署服务器149通信。网络接口129是硬件网络接口,比如,网卡。网络127可以是外部网络,比如因特网,也可以是内部网络,比如以太网或者虚拟私人网络(virtual private network,VPN)。可选地,网络127还可以是无线网络,比如WiFi网络,蜂窝网络等。
硬盘驱动接口和系统总线105耦合。硬件驱动接口和硬盘驱动器相连接。系统内存135和系统总线105耦合。运行在系统内存135的数据可以包括计算机101的操作系统137和应用程序143。
操作系统包括解析器139(shell)和内核(kernel)141。shell是介于使用者和操作系统之内核(kernel)间的一个接口。shell是操作系统最外面的一层。shell管理使用者与操作系统之间的交互:等待使用者的输入,向操作系统解释使用者的输入,并且处理各种各样的操作系统的输出结果。
内核141由操作系统中用于管理存储器、文件、外设和系统资源的那些部分组成。直接与硬件交互,操作系统内核通常运行进程,并提供进程间的通信,提供CPU时间片管理、中断、内存管理、IO管理等等。
应用程序143包括驾驶行为决策相关的程序,比如,获取车辆的状态信息,根据车辆的状态信息进行决策,得到驾驶行为决策信息(即车辆的待执行动作,例如,加速、减速或转向等动作),并根据驾驶行为决策信息对该车辆进行控制。应用程序143也存在于软件部署服务器149(deploying server)的系统上。在一个实施例中,在需要执行应用程序143时,计算机系统101可以从软件部署服务器149(deploying server)下载应用程序143。
传感器153和计算机系统101关联。传感器153用于探测计算机101周围的环境。举例来说,传感器153可以探测动物,汽车,障碍物和人行横道等,进一步传感器还可以探测上述动物,汽车,障碍物和人行横道等物体周围的环境,比如:动物周围的环境,例如,动物周围出现的其他动物,天气条件,周围环境的光亮度等。传感器153还可以用于获取车辆的状态信息。例如,传感器153可以探测车辆的位置、车辆的速度、车辆的加速度和车辆的姿态等车辆的状态信息。可选地,如果计算机101位于自动驾驶车辆上,传感器可以是摄像头,红外线感应器,化学检测器,麦克风等。
例如,应用程序143可以基于传感器153探测到的周围的环境信息和/或车辆的状态信息进行决策,得到驾驶行为决策信息,并根据驾驶行为决策信息对该车辆进行控制。此时,根据驾驶行为决策信息就可以对该车辆进行控制,从而实现车辆的自动驾驶。
其中,驾驶行为决策信息可以是指车辆的待执行动作,例如,执行加速、减速或转向等动作中的一项或多项,或者,驾驶行为决策信息也可以是指车辆的待选择的控制模式或控制系统,例如,选择转向控制系统、直接横摆力矩控制系统或紧急制动控制系统等系统中的一项或多项。
图3是本申请实施例提供的一种芯片硬件结构图,该芯片包括神经网络处理器20。该芯片可以为在如图2所示的处理器103中,用以根据车辆的状态信息进行驾驶行为决策。在本申请实施例中,预训练的神经网络中各层的算法均可在如图3所示的芯片中得以实现。
本申请实施例中的训练驾驶行为决策模型的方法,以及确定驾驶行为的方法也可以在 如图3所示的芯片中得以实现,其中,该芯片可以与实现上述预训练的神经网络的芯片是同一个芯片,或者,该芯片也可以与实现上述预训练的神经网络的芯片是不同的芯片,本申请实施例对此并不限定。
神经网络处理器NPU 50NPU作为协处理器挂载到主CPU(host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路50,通过控制器504控制运算电路503提取存储器中的矩阵数据并进行乘法运算。
在一些实现中,运算电路203内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路203是二维脉动阵列。运算电路203还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路203是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器202中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器201中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)208中。
向量计算单元207可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元207可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现种,向量计算单元能207将经处理的输出的向量存储到统一缓存器206。例如,向量计算单元207可以将非线性函数应用到运算电路203的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元207生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路203的激活输入,例如,用于在神经网络中的后续层中的使用。
统一存储器206用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器205(direct memory access controller,DMAC)将外部存储器中的输入数据搬运到输入存储器201和/或统一存储器206、将外部存储器中的权重数据存入权重存储器202,以及将统一存储器206中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)210,用于通过总线实现主CPU、DMAC和取指存储器209之间进行交互。
与控制器204连接的取指存储器(instruction fetch buffer)209,用于存储控制器204使用的指令;
控制器204,用于调用取指存储器209中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器206,输入存储器201,权重存储器202以及取指存储器209均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
计算机系统112还可以从其它计算机系统接收信息或转移信息到其它计算机系统。或者,从车辆100的传感器系统104收集的传感器数据可以被转移到另一个计算机对此数据 进行处理。
例如,如图4所示,来自计算机系统312的数据可以经由网络被传送到云侧的服务器320用于进一步的处理。网络以及中间节点可以包括各种配置和协议,包括因特网、万维网、内联网、虚拟专用网络、广域网、局域网、使用一个或多个公司的专有通信协议的专用网络、以太网、WiFi和HTTP、以及前述的各种组合。这种通信可以由能够传送数据到其它计算机和从其它计算机传送数据的任何设备执行,诸如调制解调器和无线接口。
在一个示例中,服务器320可以包括具有多个计算机的服务器,例如,负载均衡服务器群,为了从计算机系统312接收、处理并传送数据的目的,其与网络的不同节点交换信息。该服务器可以被类似于计算机系统312配置,具有处理器330、存储器340、指令350、和数据360。
示例性地,服务器320的数据360可以包括离线学习的神经网络模型(例如,基于深度学习的神经网络模型)的参数及该神经网络模型的相关信息(例如,神经网络模型的训练数据或神经网络模型的其他参数等)。例如,服务器320可以接收、检测、存储、更新、以及传送离线学习的神经网络模型的参数及该神经网络模型的相关信息。
例如,离线学习的神经网络模型的参数可以包括该神经网络模型的超参数,以及其他模型参数(或模型策略)。
再例如,该神经网络模型的相关信息可以包括该神经网络模型的训练数据,以及该神经网络模型的其他参数等。
需要说明的是,服务器320还可以使用该神经网络模型的训练数据,基于模仿学习方法对该神经网络模型进行训练(即离线训练或离线学习),从而更新该神经网络模型的参数。
在现有技术中,基于强化学习方法可以使得驾驶行为决策模型具有在线学习能力,即,可以在使用所述驾驶行为决策模型的过程中,不断地对所述驾驶行为决策模型进行训练,从而可以不断地优化所述驾驶行为决策模型。
但是,强化学习方法是一种典型的无监督学习方法,在训练的过程中并没有像监督学习方法那样利用真值(或称为标签)计算模型(例如,驾驶行为决策模型)的损失值,使用计算得到的损失值加速该模型的收敛速度,也无法在较短的时间内得到满足用户需求的模型。因此,与监督学习方法相比,强化学习方法的学习效率较低。而且,由于在训练的过程中没有真值参与,强化学习方法也没法像监督学习方法那样,保证得到的模型是比较可靠的。
综上所述,在仅使用强化学习方法训练驾驶行为决策模型的情况下,虽然可以使驾驶行为决策模型具有在线学习能力,但是模型的训练效率往往并不理想。
基于上述问题,本申请提出一种训练驾驶行为决策模型的方法,能够提高驾驶行为决策模型的训练效率。进一步地,根据该方法还可以使得所述驾驶行为决策模型同时具有在线学习能力及离线学习能力,即可以在所述驾驶行为决策模型具备在线学习能力的前提下、提高驾驶行为决策模型的学习效率。
下面结合图5至图10对本申请实施例中的训练驾驶行为决策模型的方法,以及确定驾驶行为的方法进行详细说明。
图5是本申请实施例提供的训练驾驶行为决策模型的方法500的示意性流程图。
图5所示的方法500可以包括步骤510、步骤520、步骤530及步骤540,应理解, 图5所示的方法500仅为示例而非限定,方法500中可以包括更多或更少的步骤,本申请实施例中对此并不限定,下面分别对这几个步骤进行详细的介绍。
图5所示的方法500可以由图1中的车辆100中的处理器113执行、或者,也可以由图2中的自动驾驶系统中的处理器103执行,或者,还可以由图4中的服务器320中的处理器330执行。
S510,使用驾驶行为决策模型,根据车辆的状态信息进行决策,得到驾驶行为决策信息。
其中,所述车辆的状态信息可以包括车辆的位置、车辆的速度、车辆的加速度、车辆的姿态及车辆的其他状态信息。
例如,所述车辆的状态信息可以包括预瞄偏差(例如,横向预瞄偏差)、车辆的横摆角速度及车辆的纵向速度。
例如,所述车辆的状态信息可以为图6方法600或图7方法700中的所述车辆的当前状态(和/或所述车辆的当前动作)。
其中,驾驶行为决策信息可以用于指示所述车辆的待执行的动作(或操作),例如,执行加速、减速或转向等动作中的一项或多项。
或者,驾驶行为决策信息也可以是指所述车辆的待选择的控制模式(或控制系统),例如,选择转向控制系统、直接横摆力矩控制系统或紧急制动控制系统等系统中的一项或多项。
可选地,所述驾驶行为决策模型的初始参数可以是根据基于模仿学习方法预先训练的模仿学习模型的第三参数确定的。
例如,所述模仿学习模型可以为图7方法700或图8方法800中的所述模仿学习系统。
其中,所述模仿学习方法可以包括监督学习(supervised learning)、生成对抗网络(generative adversarial network,GAN)及逆强化学习(inverse reinforcement learning,IRL)等。
在本申请实施例中,根据预先训练好的模仿学习模型的第三参数确定所述驾驶行为决策模型的初始参数,可以提高所述驾驶行为决策模型的稳定性,避免所述驾驶行为决策模型输出冒险的(或不合理的)驾驶行为决策信息。
例如,所述第三参数可以是服务器(或者云端)基于模仿学习方法预先训练所述模仿学习模型后得到的,在训练完成后,服务器(或者云端)可以将所述模仿学习模型的第三参数发送至所述车辆(例如,所述车辆中的自动驾驶系统或所述车辆中的计算机系统),进而,所述车辆可以根据所述模仿学习模型的第三参数确定所述驾驶行为决策模型的初始参数。
再例如,所述模仿学习模型的第三参数也可以是所述车辆(例如,所述车辆中的处理器或计算机系统等)基于模仿学习方法预先训练后得到的。
需要说明的是,在根据所述第三参数确定所述驾驶行为决策模型的初始参数时,可以直接将所述第三参数作为所述驾驶行为决策模型的初始参数;或者,也可以将所述第三参数中的部分参数作为所述驾驶行为决策模型的初始参数中的部分参数(可以根据其他方法确定所述驾驶行为决策模型的初始参数中的其余参数),本申请实施例中对此并不限定。
可选地,所述第三参数可以是所述服务器(或者云端)基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型后得到的,所述决策专家系统可以是基于驾驶 员的驾驶数据(例如,所述驾驶数据可以包括优秀驾驶员或专业驾驶员的操作数据以及车辆的运行数据等)及车辆的动力学特性设计的。
例如,可以通过分析驾驶员的驾驶数据(例如,优秀驾驶员执行紧急避撞操作的范例)及车辆的动力学特性(例如,车辆轮胎的动力学特性),设计基于规则的决策专家系统;进一步地,可以收集该决策专家系统输出的数据,对该收集到的数据进行标注(即,为数据打上标签,以使得模仿学习模型使用该数据进行模仿学习),从而,可以基于模仿学习方法,使用标注后的数据对所述模仿学习模型进行训练,得到所述模仿学习模型的第三参数。
可选地,所述驾驶行为决策模型可以包括第一模型和第二模型。例如,所述第一模型可以为图7方法700或图8方法800中的所述当前网络,所述第二模型可以为图7方法700或图8方法800中的所述目标网络。
其中,所述第一模型和所述第二模型可以均为基于强化学习的(驾驶行为)决策模型,所述第一模型的初始参数和所述第二模型的初始参数可以是根据基于模仿学习方法预先训练的模仿学习模型的第三参数确定的。
可选地,所述使用驾驶行为决策模型,根据所述状态信息进行决策,得到驾驶行为决策信息,可以包括:
基于所述车辆的动力学模型及运动学模型,根据所述状态信息对所述车辆在之后一个或多个时刻的行驶行为进行预测,得到所述一个或多个时刻的所有可能的行驶行为;使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
在本申请实施例中,结合所述车辆的动力学模型及运动学模型进行驾驶行为决策,可以提高所述驾驶行为决策信息的合理性。
例如,可以基于所述车辆的动力学模型及运动学模型,根据所述车辆的当前的状态信息,预测出所述车辆(自当前时刻起)第i个时刻的所有可能的行驶行为,i为正整数。
需要说明的是,在本申请实施例中,可以同时对所述车辆在之后一个或多个时刻的行驶行为进行预测,本申请实施例对此并不限定。
可选地,在所述驾驶行为决策模型包括第一模型和第二模型的情况下,所述使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息,可以包括:使用所述第二模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
在本申请实施例中,所述第一模型的参数变化比较频繁,使用所述第二模型进行决策,能够提高所述驾驶行为决策信息的可靠性。
进一步地,可以根据所述第一模型的参数,定期地更新所述第二模型的参数。
例如,在满足第一预设条件的情况下,可以将所述第二模型的参数更新为所述第二参数,其中,所述第一预设条件可以为间隔预设的时间间隔,或者,所述第一预设条件也可以为对所述第一模型的参数调整预设的次数。
S520,向服务器发送所述驾驶行为决策信息。
S530,接收所述服务器发送的模仿学习模型的第一参数。
其中,所述第一参数可以是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的。
进一步地,所述第一参数可以是所述服务器基于模仿学习方法、使用满足第二预设条件的所述驾驶行为决策信息训练所述模仿学习模型后得到的。
其中,所述第二预设条件可以包括下述多种条件中的至少一种:
条件一:
所述第二预设条件可以包括:所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
其中,所述合理驾驶行为决策指符合预设规则的驾驶行为决策。例如,所述预设规则可以理解为经验丰富的老司机的驾驶习惯。
所述合理驾驶行为决策可以是通过自动化打标签学习方法得到的,或者,也可以是通过人工打标签方法得到的。
例如,在直线制动过程中,假设车辆的状态信息对应的合理驾驶行为决策为紧急制动控制系统工作,若使用驾驶行为决策模型,根据车辆的状态信息决策得到的驾驶行为决策信息为紧急制动控制系统工作,则所述驾驶行为决策信息与所述车辆的状态信息对应的合理驾驶行为决策相同,也就是说,所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
在本申请实施例中,使用所述状态信息对应的合理驾驶行为决策,能够提高驾驶行为决策模型的学习效率。
条件二:
所述第二预设条件还可以包括:所述状态信息的噪声在第一预设范围内。
其中,所述状态信息的噪声可以包括所述状态信息的信号受到的干扰(例如,高斯噪声)或所述状态信息的信号的抖动。
或者,所述状态信息的噪声也可以包括所述状态信息的数据误差。
例如,车辆的状态信息包括车辆的纵向速度,在行驶过程中,假设第一预设范围为5公里/每小时,若车辆的纵向速度的误差小于(或者,小于或等于)5公里/每小时,则所述驾驶行为决策信息满足所述第二预设条件,也就是说,所述驾驶行为决策信息为所述状态信息对应的正确驾驶行为决策。
上述实施例中的第一预设范围的取值仅为示例而非限定,具体可以根据实际情况确定,本申请实施例中对此并不限定。
在本申请实施例中,所述状态信息的噪声在第一预设范围内,根据所述状态信息进行决策,可以使得到的驾驶行为决策信息更加合理,此时,根据这些驾驶行为决策信息调整所述驾驶行为决策模型的参数,能够提高驾驶行为决策模型的学习效率。
条件三:
所述状态信息可以是多个状态信息中的一个,所述第二预设条件还可以包括:所述多个状态信息是在多个场景中获取的。
例如,所述多个场景可以包括高速、市区、郊区及山区中的一个或多个场景。
再例如,所述多个场景也可以包括十字路口、丁字路口及环岛中的一个或多个场景。
需要说明的是,上述多个场景的划分方式只是示例而非限定,在本申请实施例中也可以按照其他方式对场景进行划分,或者,本申请实施例也可以适用于其他的车辆能够行驶的场景,这里对此并不限定。
在本申请实施例中,在上述至少一个场景中获取所述状态信息,可以使得驾驶行为决 策模型的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的场景更加丰富,从而有助于进一步能够提高驾驶行为决策模型的学习效率。
条件四:
所述第二预设条件还可以包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
例如,所述多个状态信息是在高速、市区、郊区及山区这四个场景中获取的,在高速场景中获取了1000个(或1000组)状态信息,在市区、郊区及山区这三个场景中各获取了100个(或100组)状态信息,则可以按照上述条件一及条件二中的方法,从高速场景中获取的1000个(或1000组)状态信息中筛选出100个(或100组)状态信息,以使得这四个场景中获取的状态信息的数量一样。
或者,也可以使得高速场景中获取的状态信息的数量与其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
可选地,所述多个状态信息也可以是在其他场景中获取的,本申请实施例中对此并不限定。
例如,所述多个状态信息可以是在十字路口、丁字路口及环岛等多个场景中获取的,所述多个场景中获取的状态信息的数量一样,或者,所述多个场景中获取的状态信息的数量之间的差值在第二预设范围内。
在本申请实施例中,在所述至少两个场景中任意一个场景中获取的状态信息的数量与在所述至少两个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内,可以使得在各个场景中得到的训练数据(例如,根据所述状态信息进行决策后得到的驾驶行为决策信息)的数量更加均衡,从而避免出现所述驾驶行为决策模型在某个场景存在过拟合的问题。
上述实施例中的第二预设范围的取值可以根据实际情况确定,本申请实施例中对此并不限定。
在本申请实施例中,使用高质量的驾驶行为决策信息(例如,上述满足所述第二预设条件的驾驶行为决策信息),能够提高驾驶行为决策模型的学习效率。
需要说明的是,评估所述驾驶行为决策信息是否满足第二预设条件的所述驾驶行为决策信息,既可以由车辆执行,也可以由服务器执行,本申请实施例中对此并不限定。
例如,所述车辆可以将决策得到的所述驾驶行为决策信息发送给服务器,然后由服务器评估所述驾驶行为决策信息是否满足所述第二预设条件,以筛选出满足所述第二预设条件的驾驶行为决策信息。
或者,也可以由所述车辆评估所述驾驶行为决策信息是否满足所述第二预设条件,以筛选出满足所述第二预设条件的驾驶行为决策信息,然后将满足所述第二预设条件的驾驶行为决策信息发送给服务器。
S540,根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数。
在本申请实施例中,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的,基于模仿学习方法可以保证所述模仿学习模型的训练效果,此时,根据所述驾驶行为决策信息与所述第一参数调整所述驾驶行为决策 模型的参数,可以提高驾驶行为决策模型的学习效率。
可选地,所述根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数,可以包括:
基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数;根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数。
在本申请实施例中,可以基于强化学习方法对所述驾驶行为决策模型的参数进行调整得到第二参数,并根据所述第一参数调整所述驾驶行为决策模型的第二参数,可以使得所述驾驶行为决策模型具有在线学习能力及离线学习能力,即可以在所述驾驶行为决策模型具备在线学习能力的前提下、进一步提高驾驶行为决策模型的学习效率。
可选地,所述驾驶行为决策模型可以包括第一模型和第二模型。
相应地,所述基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数,可以包括:
基于强化学习方法,根据所述驾驶行为决策信息对所述第一模型的参数进行调整,得到所述第二参数;在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数。
其中,所述第一预设条件可以为间隔预设的时间间隔或对所述第一模型的参数调整预设的次数。
在本申请实施例中,在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,可以避免因频繁调整所述第二模型的参数而导致所述第二模型的输出不稳定,因此,能够提高所述驾驶行为决策信息的可靠性。
其中,将所述第二模型的参数更新为所述第二参数,可以是指:直接将所述第二模型的所有参数全部更新为所述第二参数;或者,也可以是指:将所述第二模型的部分参数(可以根据其他方法确定所述第二模型的其余参数)更新为所述第二参数,本申请实施例中对此并不限定。
需要说明的是,满足所述第一预设条件可以是指:与上一次更新所述第二模型的参数的时刻相距预设的时间间隔;或者,满足所述第一预设条件也可以是指:使用所述驾驶行为决策模型进行决策的次数达到预设的次数;或者,满足所述第一预设条件还可以是指满足其他预设条件,本申请实施例中对此并不限定。
进一步地,所述根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数,可以包括:根据所述第一参数,调整所述第一模型的参数和/或所述第二模型的参数。
在本申请实施例中,可以灵活地根据所述第一参数调整所述第一模型及所述第二模型中至少一个的参数。
例如,可以根据所述模仿学习模型的第一参数,同时更新所述第一模型的第二参数和所述第二模型的第二参数;或者,也可以根据所述模仿学习模型的第一参数,更新所述第一模型的第二参数,随后在满足所述第二预设条件的情况下,根据所述第一模型的参数更新所述第二模型的第二参数。
可选地,所述方法500还可以包括:根据所述驾驶行为决策信息控制所述车辆。
在本申请实施例中,在训练所述驾驶行为决策模型的同时,可以根据所述驾驶行为决策信息控制所述车辆,因此,可以在使用所述驾驶行为决策模型的过程中,对所述驾驶行为决策模型进行训练,不断地优化所述驾驶行为决策模型。
下面结合图6对本申请实施例中的训练驾驶行为决策模型的方法的实施流程进行详细说明。
图6是本申请实施例提供的训练驾驶行为决策模型的方法600的示意性流程图。
图6所示的方法600可以包括步骤610、步骤620及步骤630,应理解,图6所示的方法600仅为示例而非限定,方法600中可以包括更多或更少的步骤,本申请实施例中对此并不限定,下面分别对这几个步骤进行详细的介绍。
图6所示的方法600可以由图4中的服务器320中的处理器330执行。
S610,接收车辆发送的驾驶行为决策信息。
其中,所述驾驶行为决策信息可以是所述车辆使用驾驶行为决策模型根据所述车辆的状态信息进行决策后得到的。
关于所述驾驶行为决策信息、所述状态信息及所述驾驶行为决策模型的具体描述可以参照上述图5方法500中的实施例,这里不再赘述。
S620,基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数。
其中,所述第一参数用于调整所述驾驶行为决策模型的参数。
进一步地,所述基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数,可以包括:
基于模仿学习方法,根据满足第二预设条件的所述驾驶行为决策信息训练模仿学习模型,得到所述模型学习模型的第一参数。
可选地,所述第二预设条件可以包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
可选地,所述第二预设条件还可以包括所述状态信息的噪声在第一预设范围内。
可选地,所述状态信息可以是多个状态信息中的一个,所述第二预设条件还可以包括所述多个状态信息是在多个场景中获取的。
可选地,所述第二预设条件还可以包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
上述实施例中的第二预设范围的取值可以根据实际情况确定,本申请实施例中对此并不限定。
关于所述第二预设条件的具体描述可以参照上述图5方法500中的实施例,这里不再赘述。
需要说明的是,评估所述驾驶行为决策信息是否满足第二预设条件的所述驾驶行为决策信息,既可以由车辆执行,也可以由服务器执行,本申请实施例中对此并不限定。
例如,所述车辆可以评估所述驾驶行为决策信息是否满足所述第二预设条件,以筛选出满足所述第二预设条件的驾驶行为决策信息,然后将满足所述第二预设条件的驾驶行为决策信息发送给服务器。
S630,向所述车辆发送所述第一参数。
在本申请实施例中,基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型得到所述模型学习模型的第一参数,基于模仿学习方法可以保证所述模仿学习模型的训练效果,此时,根据所述第一参数调整所述驾驶行为决策模型的参数,可以提高驾驶行为 决策模型的学习效率。
可选地,所述方法600还可以包括:
基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型,得到所述模仿学习模型的第三参数,其中,所述第三参数用于确定所述驾驶行为决策模型的初始参数,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;向所述车辆发送所述第三参数。
在本申请实施例中,根据预先训练好的模仿学习模型的第三参数确定所述驾驶行为决策模型的初始参数,可以提高所述驾驶行为决策模型的稳定性,避免所述驾驶行为决策模型输出冒险的(或不合理的)驾驶行为决策信息。
图7是本申请实施例提供的训练驾驶行为决策模型的方法600的示意性流程图。
图7所示的方法700可以包括步骤710、步骤720、步骤730及步骤740,应理解,图7所示的方法700仅为示例而非限定,方法700中可以包括更多或更少的步骤,本申请实施例中对此并不限定,下面分别对这几个步骤进行详细的介绍。
方法700中的各个步骤可以分别由车辆(例如,图1中的车辆100中的处理器113或者图2中的自动驾驶系统中的处理器103)或服务器(例如,图4中的服务器320中的处理器330)执行,本申请实施例中对此并不限定。
作为示例而非限定,在方法700中的下述实施例中,以服务器执行步骤710、步骤720及步骤730,车辆执行步骤740为例进行说明。
S710,设计专家系统。
例如,服务器可以收集车辆的行驶数据,所述行驶数据可以包括驾驶员的驾驶数据及车辆的动力学数据(例如,基于该动力学数据可以确定车辆的动力学特性);基于所述车辆的行驶数据设计基于规则的专家系统,该专家系统可以进行驾驶行为决策。
S720,构建训练数据集。
例如,如图7所示,服务器可以收集S710设计的专家系统产生的决策信息,并对该收集的决策信息进行标注(即,为数据打上标签,以使用该数据对神经网络模型进行模仿学习),以构建训练数据集。
再例如,如图7所示,服务器还可以收集S740设计的强化学习系统产生的决策信息,筛选出(强化学习系统产生的)该决策信息中的优质决策信息,并对该优质决策信息进行标注,以构建训练数据集。
其中,优质决策信息的描述及确定优质决策信息的方法可以参见上述图5中方法500中的实施例,这里不再赘述。
S730,设计模仿学习系统。
所述模仿学习系统可以是根据基于径向基的神经网络(radial basis function neural network,RBFNN)的Softmax分类器的方案设计的,例如,可以利用S720中构建的所述训练数据集,基于小批量随机梯度下降算法对模仿学习系统进行离线训练,从而实现所述模仿学习系统对专家系统的行为的克隆。
这里的克隆可以理解为:对模仿学习系统进行离线训练,以使得所述模仿学习系统产生的决策信息的性能(或效果)不亚于所述专家系统产生的决策信息的性能(或效果),或所述模仿学习系统产生的决策信息的性能(或效果)接近所述专家系统产生的决策信息的性能(或效果)。
S740,设计强化学习系统。
所述强化学习系统可以根据基于强化学习神经网络的方案设计的。
例如,可以将模仿学习系统学习得到的模型策略(即模型参数)作为所述强化学习系统的初始策略(即模型的初始参数);结合所述车辆的动力学模型及所述车辆的运动学模型,基于所述车辆的当前状态(和/或所述车辆的当前动作)预测出所述车辆在下一个时刻(或下n个时刻,n为正整数)的状态信息,所述状态信息可以包括某一时刻的所有可能的行驶行为;使用所述强化学习系统估计出某一个时刻包括的多个不同行驶行为对应的Q值,将最大Q值对应的行驶行为作为该时刻的决策信息(所述强化学习系统输出的驾驶行为决策信息包括该时刻的决策信息)。
所述强化学习系统可以包括两个网络,分别为当前网络和目标网络,这两个网络可以采用与所述模仿学习系统相同的RBFNN结构。
需要说明的是,结合所述车辆的动力学模型及所述车辆的运动学模型预测出的所述状态信息可以包括所述车辆在之后一个或多个时刻的状态信息。
在所述状态信息包括多个时刻的状态信息的情况下,可以使用所述强化学习系统估计出该多个时刻中的每一个时刻的决策信息,此时,所述强化学习系统输出的驾驶行为决策信息可以包括该多个时刻的决策信息。
通过上述步骤就设计了强化学习系统,所述强化学习系统可以输出所述驾驶行为决策信息,基于所述驾驶行为决策信息可以控制所述车辆。
在本申请实施例中,如图7所示,所述方法700中的各个步骤可以不断地迭代执行,从而实现所述强化学习系统的持续在线学习。
例如,所述方法700中的各个步骤可以按照如下方式迭代执行:
车辆可以将所述强化学习系统产生的驾驶行为决策信息发送给服务器;
相应地,服务器可以确定出所述驾驶行为决策信息中的优质决策信息,将确定出的优质决策信息更新至所述训练数据集,并基于更新后的所述训练数据集对所述模仿学习系统进行离线训练;
服务器可以定期将所述模仿学习系统的模型策略(即模型参数)发送给车辆;
相应地,车辆接收所述模仿学习系统的模型策略(即模型参数)后,可以基于接收到的所述模型策略更新所述强化学习系统的模型策略(即模型参数);
接下来,车辆可以继续将所述强化学习系统产生的驾驶行为决策信息发送给服务器;服务器可以继续基于所述驾驶行为决策信息对所述模仿学习系统进行离线训练;服务器可以继续定期将所述模仿学习系统的模型策略(即模型参数)发送给车辆,以更新所述强化学习系统的模型策略(即模型参数)。
所述方法700中的各个步骤可以按照上述方式反复迭代执行。
需要说明的是,车辆基于接收到的所述模型策略更新所述强化学习系统的模型策略,可以是直接使用所述模型策略替换所述强化学习系统的模型策略,也可以是使用所述模型策略,按比例替换所述强化学习系统的模型策略,例如,将70%的所述模型策略及30%的所述强化学习系统的模型策略作为所述强化学习系统的模型策略。
在本申请实施例中的上述迭代过程中,不仅可以通过强化学习使所述强化学习系统不断进步,越来越好,还可以通过服务器(或者云端)对车辆进行监控,定期通过离线训练的模仿学习系统调整所述强化学习系统,从而能够从两个维度(在线和离线两个维度)不 断地提升自动驾驶车辆的性能。
图8是本申请实施例提供的训练驾驶行为决策模型的方法800的示意性流程图。
图8所示的方法800可以包括步骤810、步骤820、步骤830及步骤840,应理解,图8所示的方法800仅为示例而非限定,方法800中可以包括更多或更少的步骤,本申请实施例中对此并不限定,下面分别对这几个步骤进行详细的介绍。
方法800中的各个步骤可以分别由车辆(例如,图1中的车辆100中的处理器113或者图2中的自动驾驶系统中的处理器103)或服务器(例如,图4中的服务器320中的处理器330)执行,本申请实施例中对此并不限定。
作为示例而非限定,在方法800中的下述实施例中,以服务器执行步骤810、步骤820及步骤830,车辆执行步骤840为例进行说明。
S810,设计专家系统。
可选地,专家系统可以用于协调(决策)自动驾驶车辆的运动控制系统,所述运动控制系统可以包括紧急制动控制系统、直接横摆力矩控制系统及转向控制系统。
在本申请实施例中,所述专家系统也可以用于决策车辆的其他系统或其他状态,例如,所述专家系统也可以用于协调(或决策)车辆的速度、加速度或转向角度等,本申请实施例对此并不限定。
如图8所示,服务器可以接收(或定期接收)车辆发送的所述车辆的行驶数据(所述行驶数据可以是指专业驾驶员的驾驶数据,例如,优秀驾驶员执行紧急避撞操作的范例)及车辆的动力学数据(例如,基于该动力学数据可以确定车辆的动力学特性)。
下面以专家系统用于协调(决策)自动驾驶车辆的运动控制系统为例进行详细描述。
通过分析车辆的行驶数据及车辆的动力学数据,基于规则的专家系统可以按照如下方法设计:
a:在直线制动过程中,紧急制动控制系统工作,但直接横摆力矩控制系统与转向控制系统均不工作;
b:在转弯避让过程中,当汽车的侧向加速度小于或等于预设阈值时,紧急制动控制系统与转向控制系统工作,但直接横摆力矩控制系统不工作;
其中,所述预设阈值可以为0.4倍的重力加速度(gravitational acceleration,g),即,所述预设阈值=0.4g。
c:在转弯避让过程中,当车辆的侧向加速度大于所述预设阈值时,直接横摆力矩控制系统与转向控制系统工作,紧急制动控制系统不工作;
d:当避撞任务完成时,紧急制动控制系统、直接横摆力矩控制系统及转向控制系统均不工作。
本领域技术人员可知,在上述规则中,转向控制系统不工作是指车辆处于直线行驶状态。
上述基于规则的专家系统的伪代码可以如下述表1所示:
表1 基于规则的专家系统的伪代码
其中,车辆的运动学状态可以包括预瞄偏差、路径跟踪偏差、航向角等,车辆的动力学状态可以包括车速、横摆角速度、侧向加速度、纵向加速度、侧偏角等,环境感知系统信息可以包括与周围车辆的距离、周围车辆的速度、周围车辆的航向角等。
此时,将车辆的运动学状态、车辆的动力学状态及(车辆感知到的)环境感知系统信息输入所述专家系统,就可以产生用于协调(决策)自动驾驶车辆的运动控制系统的决策信息。
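作为示例而非限定,上述规则a~d的一种可能实现可以用如下Python代码草图表示,其中的输入变量名及判断方式为便于说明而作的假设,返回值表示紧急制动控制系统、直接横摆力矩控制系统及转向控制系统是否工作:

```python
def expert_decision(is_straight_braking, is_turning_avoidance, lateral_acc,
                    task_done, g=9.8):
    """基于规则的专家系统决策草图,返回(紧急制动, 直接横摆力矩控制, 转向控制)是否工作。

    规则与正文中的a~d对应;阈值取0.4g;输入均为示意性变量。
    """
    threshold = 0.4 * g
    if task_done:                      # 规则d:避撞任务完成,三个系统均不工作
        return (False, False, False)
    if is_straight_braking:            # 规则a:直线制动,仅紧急制动控制系统工作
        return (True, False, False)
    if is_turning_avoidance:
        if lateral_acc <= threshold:   # 规则b:侧向加速度不超过阈值,制动+转向
            return (True, False, True)
        else:                          # 规则c:侧向加速度超过阈值,横摆力矩+转向
            return (False, True, True)
    return (False, False, False)       # 其他情况:默认不触发避撞控制
```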
S820,构建训练数据集。
如图8所示,服务器可以收集专家系统产生的决策信息以及强化学习系统产生的优质决策信息,并对该收集到的决策信息(包括专家系统产生的决策信息以及强化学习系统产生的优质决策信息)进行标注,以构建训练数据集。
其中,优质决策信息的描述及确定优质决策信息的方法可以参见上述图5中方法500中的实施例,这里不再赘述。
S830,设计模仿学习系统。
在本申请实施例中,可以基于Softmax分类器和小批量随机梯度下降的方法设计模仿学习系统,以实现对专家系统的行为克隆。
例如,所述模仿学习系统可以按照下述步骤进行设计:
a:设计神经网络的输出。
可选地,所述神经网络可以为Softmax分类器,所述神经网络输出的决策信息可以与基于规则的专家系统产生的决策信息一致。
所述神经网络输出的决策信息(与所述专家系统产生的决策信息类似)可以用于协调自动驾驶车辆运动控制系统的工作模式。
例如,自动驾驶车辆紧急避撞动作的组合总共可以分为下面几个类别:
序号“1”可以表示只有转向控制系统工作,序号“2”可以表示转向控制系统与直接横摆力矩控制系统共同工作,序号“3”可以表示转向控制系统与紧急制动控制系统共同工作,序号“4”可以表示转向控制系统、直接横摆力矩控制系统及紧急制动控制系统三者联合工作,序号“0”可以表示转向控制系统、直接横摆力矩控制系统及紧急制动控制系统三者均不工作。
相应地,所述神经网络可以输出上述序号中的任一个。
b:设计训练时使用的代价函数。
可选地,可以利用交叉熵的方法定义代价函数,例如,代价函数可以为L_i = −y_i·ln(P_i)。
利用交叉熵的方法定义的代价函数,能够提高学习效率和效果。
c:确定神经网络的结构及输入。
其中,所述神经网络的网络结构可以参考基于径向基的神经网络(radial basis function neural network,RBFNN)。
例如,可以利用RBFNN学习逼近Softmax分类器的Q值。
如图9所示,RBFNN可以包括三个输入,分别为投影偏差(或预瞄偏差)e_p、车辆横摆角速度γ及行驶车速的倒数v_x^(-1);RBFNN可以包括由11个高斯核函数组成的单隐含层h_1~h_11,RBFNN可以输出由4个Q值组成的向量。RBFNN的网络结构可以如图9所示。
RBFNN的表达式可以为:
Q̂(x) = θ^T·h(x)
其中,Q̂(x)代表神经网络的输出;θ代表神经网络权矩阵;h(x) = [h_i]^T代表基函数向量,i代表神经网络的隐含层节点个数,h_i代表高斯函数,可以为
h_i = exp(−‖x − c_i‖^2 / (2·b_i^2))
c_i代表神经节点的中心向量;b_i代表神经节点的高斯函数的宽度;x代表神经网络的输入向量,x = [e_p, γ, v_x^(-1)]^T,其元素分别为投影偏差e_p、车辆横摆角速度γ和纵向车速的倒数v_x^(-1)。
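作为示例而非限定,上述RBFNN的前向计算可以用如下Python代码草图表示,其中隐含层节点数(11个)与输出维度(4个Q值)与上文一致,中心向量、宽度及权矩阵的具体取值仅为随机示例:

```python
import numpy as np

class RBFNN:
    """径向基神经网络草图:输入x=[e_p, gamma, 1/v_x],输出4个动作的Q值。"""

    def __init__(self, n_hidden=11, n_out=4, rng=None):
        rng = rng or np.random.default_rng(0)
        self.centers = rng.normal(size=(n_hidden, 3))   # c_i:各隐含层节点的中心向量
        self.widths = np.ones(n_hidden)                 # b_i:高斯函数宽度
        self.theta = rng.normal(scale=0.1, size=(n_hidden, n_out))  # 权矩阵θ

    def hidden(self, x):
        # h_i = exp(-||x - c_i||^2 / (2 b_i^2))
        d2 = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-d2 / (2.0 * self.widths ** 2))

    def q_values(self, x):
        # Q_hat(x) = theta^T h(x)
        return self.theta.T @ self.hidden(x)

# 用法示例:输入预瞄偏差0.2m、横摆角速度0.05rad/s、纵向车速20m/s(取倒数)
net = RBFNN()
q = net.q_values(np.array([0.2, 0.05, 1.0 / 20.0]))  # 得到4维Q值向量
```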
d:推导梯度计算公式。
例如,所述神经网络的总代价函数可以为各类别代价之和,即L = Σ_{i=1}^{N} L_i = −Σ_{i=1}^{N} y_i·ln(P_i),其梯度可以按链式法则展开;总代价函数相对于所述神经网络的权值W(与类别i对应的部分)的梯度可以为
∂L/∂W_i = (P_i − y_i)·h
其中,P_i为softmax分类器输出的概率值,P_i = exp(Q_i) / Σ_{k=1}^{N} exp(Q_k),y_i为标签值,Q_i与Q_k均为强化学习的状态-动作值函数,N为样本的类别总数,h为高斯核函数,i和k为正整数。
小批量随机梯度下降算法可以采用如下梯度:
∇_W ≈ (1/M_0)·Σ_{n=1}^{M_0} ∂L^(n)/∂W
其中,L^(n)为第n个样本的代价,M_0为小批量随机梯度下降的批量数,n为大于或等于1、小于或等于M_0的正整数。
可选地,采用小批量随机梯度下降方法离线训练所述神经网络,就可以实现对基于规则的驾驶行为决策系统的行为的克隆。
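作为示例而非限定,基于Softmax分类器与小批量随机梯度下降的离线训练过程可以用如下Python代码草图表示,其中沿用前一代码草图中的RBFNN类,标签为专家系统给出的工作模式类别,批量大小与学习率均为示例取值:

```python
import numpy as np

def softmax(q):
    e = np.exp(q - np.max(q))
    return e / np.sum(e)

def train_step(net, batch_x, batch_labels, lr=0.01, n_classes=4):
    """一次小批量更新:交叉熵代价 L_i = -y_i ln(P_i),梯度 (P - y) 与基函数向量h的外积。"""
    grad = np.zeros_like(net.theta)
    for x, label in zip(batch_x, batch_labels):
        h = net.hidden(x)                 # 基函数向量 h(x)
        p = softmax(net.q_values(x))      # 分类器输出的概率 P
        y = np.zeros(n_classes)
        y[label] = 1.0                    # 标签的one-hot编码
        grad += np.outer(h, p - y)        # 单样本梯度
    net.theta -= lr * grad / len(batch_x) # 小批量平均后更新权矩阵θ
```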
S840,设计强化学习系统。
例如,所述强化学习系统可以按照下述步骤进行设计:
a:确定初始策略。
将所述模仿学习系统学习得到的模型策略(即模型参数)作为所述强化学习系统的初始策略(即模型的初始参数),以改善驾驶行为决策的效率及效果。
例如,设计的马尔可夫决策过程(Markov decision process,MDP)的状态可以为S = [e_p, γ, v_x^(-1)]^T,动作空间可以为A = [1, 2, 3, 4]^T。
b:确定立即奖赏函数。
设计的立即奖赏函数可以为r = −S^T·K·S,其中,K为奖赏权值矩阵。
c:确定网络结构。
所述强化学习系统可以包括两个网络,分别为当前网络和目标网络,这两个网络可以采用与所述模仿学习系统相同的RBFNN结构。
不同的是,目标网络的三个输入是通过车辆预测模型(例如,所述车辆的动力学模型及运动学模型)预测的结果。
d:设计优化指标及梯度。
设计优化指标可以为
J(θ_t) = [Q* − Q̂(x, a; θ_t)]^2
梯度的公式可以为:
∂J/∂θ_t = −2·[r + γ_rl·max_{a'} Q̂(x̂', a'; θ_t') − Q̂(x, a; θ_t)]·∂Q̂(x, a; θ_t)/∂θ_t
其中,Q*为最优值函数,Q̂为近似值函数,γ_rl为折扣因子,a'为使得第t次迭代下的Q值最大所执行的动作,x̂'为下一时刻估计的状态,θ_t'为目标网络参数,x'为下一时刻的输入,r为奖赏函数,t为正整数。
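作为示例而非限定,上述立即奖赏、优化目标及梯度更新的计算可以用如下Python代码草图表示,其中当前网络与目标网络均可以采用前文RBFNN结构,折扣因子、学习率等取值及函数名均为便于说明而作的假设:

```python
import numpy as np

def reward(S, K):
    """立即奖赏 r = -S^T K S,S为MDP状态向量,K为奖赏权值矩阵。"""
    return -S @ K @ S

def td_target(r, next_q_target, gamma_rl=0.95):
    """目标值 Q* ≈ r + γ_rl · max_a' Q_target(x̂', a'),next_q_target为目标网络输出的Q值向量。"""
    return r + gamma_rl * np.max(next_q_target)

def q_update(theta, h, action, q_current, target, lr=0.01):
    """按梯度 (Q* - Q̂)·∂Q̂/∂θ 更新当前网络权矩阵;对线性于基函数的Q,∂Q̂/∂θ即h(x)。"""
    td_error = target - q_current[action]
    theta[:, action] += lr * td_error * h   # 只更新所选动作对应的权值列
    return theta
```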
e:确定车辆预测模型。
所述车辆预测模型可以按下述方式表示:
x' = A·x + B·u + w
其中,x'为预测t+1时刻的状态,y为系统输出,A为状态矩阵,B为输入矩阵,A、B及干扰向量w中的各元素可以由下述车辆参数及道路曲率确定;x为状态向量,x = [β, γ, e_p, Δψ, e_v]^T,u为输入向量,u = [δ_t, M_c, F_xa]^T,w为干扰向量。
v_x为纵向车速,x_p为预瞄距离,β为质心侧偏角,γ为横摆角速度,e_p为预瞄偏差,Δψ为航向角偏差,e_v为速度偏差,δ_t为前轮转角,M_c为横摆角速度控制力矩,F_xa为车辆纵向力,K为道路曲率,C_f为前轮侧偏刚度,C_r为后轮侧偏刚度,a为车辆前轴距离,b为车辆后轴距,J_z为车辆的转动惯量,m为车辆的质量;此外,模型中还涉及车辆的行驶距离。
因此,所述车辆预测模型为:
S_{t+1} = f(S_t, A_t)
其中,S_{t+1}为t+1时刻的状态,S_t为t时刻的状态,A_t为t时刻的动作,T_s为预测时域;f可以理解为在预测时域T_s内,根据预瞄偏差的微分ė_p、横摆角速度的微分γ̇及纵向车速的微分v̇_x对状态进行递推、得到下一时刻状态的函数;e_p为预瞄偏差,γ为横摆角速度,v_x为纵向车速。
f:预测每个时刻对应的动作。
例如,可以结合所述车辆预测模型,基于所述车辆的当前状态(和/或所述车辆的当前动作)预测出所述车辆在下一个时刻(或下n个时刻,n为正整数)的状态信息,所述状态信息可以包括某一时刻的所有可能的行驶行为;使用所述强化学习系统估计出某一个时刻包括的多个不同行驶行为对应的Q值,将最大Q值对应的行驶行为作为该时刻的决策信息(所述强化学习系统输出的驾驶行为决策信息包括该时刻的决策信息)。
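作为示例而非限定,"预测下一时刻状态并选取最大Q值对应行驶行为"的一种可能实现可以用如下Python代码草图表示,其中predict_next_state与q_values均为假设的接口,分别对应车辆预测模型S_{t+1} = f(S_t, A_t)及强化学习系统的Q值估计:

```python
import numpy as np

def choose_action(state, actions, predict_next_state, q_values):
    """对每个候选行驶行为预测下一时刻状态,并选取Q值最大的行为作为该时刻的决策信息。

    predict_next_state(state, action):车辆预测模型(假设接口)
    q_values(state):强化学习系统对某一状态输出各动作Q值的函数(假设接口)
    """
    best_action, best_q = None, -np.inf
    for a in actions:
        next_state = predict_next_state(state, a)  # 结合动力学/运动学模型预测下一时刻状态
        q = q_values(next_state)[a]                # 评估该行驶行为对应的Q值
        if q > best_q:
            best_action, best_q = a, q
    return best_action
```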
g:计算所述强化学习系统的权值更新的梯度。
例如,可以结合资格迹和梯度下降的方法,确定网络权值更新的梯度为:
δ_t = r + γ_rl·Q̂(x', a'; θ_t) − Q̂(x, a; θ_t)
ET_t = γ_rl·λ_rl·ET_{t−1} + h(x)
Δθ_t = ρ_rl·δ_t·ET_t
其中,δ_t为值函数Q的时序差分量,λ_rl为衰减因子,γ_rl为折扣因子,ET_t为t时刻的资格迹,ET_{t−1}为t−1时刻的资格迹,r为奖赏函数,ρ_rl为正系数,t为正整数。
h:更新所述强化学习系统的参数。
例如,可以确定神经网络的权矩阵的更新公式为θ_{t+1} = θ_t + Δθ_t + ζ_rl·[θ_t − θ_{t−1}],其中,θ_{t+1}为t+1时刻的网络系数,θ_t为t时刻的网络系数,θ_{t−1}为t−1时刻的网络系数,ζ_rl为比例系数。
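作为示例而非限定,上述资格迹更新及权矩阵更新可以用如下Python代码草图表示,其中各系数取值均为示例,且假设θ、ET及h(x)为同维的numpy数组(例如所选动作对应的权值列):

```python
import numpy as np

def rl_weight_update(theta, theta_prev, et_prev, h, td_error,
                     gamma_rl=0.95, lambda_rl=0.9, rho_rl=0.01, zeta_rl=0.1):
    """资格迹:ET_t = γ_rl·λ_rl·ET_{t-1} + h(x);
    增量:Δθ_t = ρ_rl·δ_t·ET_t;
    权矩阵:θ_{t+1} = θ_t + Δθ_t + ζ_rl·[θ_t − θ_{t-1}]。"""
    et = gamma_rl * lambda_rl * et_prev + h            # 更新资格迹
    delta_theta = rho_rl * td_error * et               # 权值更新增量
    theta_next = theta + delta_theta + zeta_rl * (theta - theta_prev)
    return theta_next, et
```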
在本申请实施例中,可以将所述强化学习系统产生的优质数据贴上标签后添加到训练数据集中,提供给所述模仿学习系统进行离线训练。
在所述方法800中,可以不断地迭代执行S820、S830及S840,通过离线训练和在线学习的方式与自动驾驶车辆不断交互,从而实现所述强化学习系统不断地自我训练,改进自动驾驶系统。
图10是本申请一个实施例提供的训练驾驶行为决策模型的装置1000的示意性框图。应理解,图10示出的训练驾驶行为决策模型的装置1000仅是示例,本申请实施例的装置1000还可包括其他模块或单元。应理解,训练驾驶行为决策模型的装置1000能够执行图5、图7或图8的方法中的各个步骤,为了避免重复,此处不再详述。
决策单元1010,用于使用驾驶行为决策模型,根据车辆的状态信息进行决策,得到驾驶行为决策信息;
发送单元1020,用于向服务器发送所述驾驶行为决策信息;
接收单元1030,用于接收所述服务器发送的模仿学习模型的第一参数,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的;
调整单元1040,用于根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数。
可选地,所述调整单元1040具体用于:
基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数;根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数。
可选地,所述驾驶行为决策模型包括第一模型和第二模型;其中,所述调整单元1040具体用于:
基于强化学习方法,根据所述驾驶行为决策信息对所述第一模型的参数进行调整,得到所述第二参数;在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,所述第一预设条件为间隔预设的时间间隔或对所述第一模型的参数调整预设的次数。
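作为示例而非限定,"在满足第一预设条件的情况下将第二模型的参数更新为第二参数"可以用如下Python代码草图表示(以"对第一模型的参数每调整预设次数后同步一次"为例,类名与属性名均为便于说明而作的假设):

```python
import copy

class DecisionModel:
    """包含第一模型(持续被强化学习调整)与第二模型(用于决策)的驾驶行为决策模型草图。"""

    def __init__(self, init_params, sync_every=100):
        self.first_params = copy.deepcopy(init_params)    # 第一模型参数
        self.second_params = copy.deepcopy(init_params)   # 第二模型参数
        self.sync_every = sync_every                      # 第一预设条件:调整预设次数
        self.update_count = 0

    def adjust_first(self, second_param_update):
        """基于强化学习得到第二参数,先更新第一模型;满足第一预设条件时同步至第二模型。"""
        self.first_params = second_param_update
        self.update_count += 1
        if self.update_count % self.sync_every == 0:
            self.second_params = copy.deepcopy(self.first_params)
```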
可选地,所述调整单元1040具体用于:根据所述第一参数,调整所述第一模型的参数和/或所述第二模型的参数。
可选地,所述决策单元1010具体用于:
基于所述车辆的动力学模型及运动学模型,根据所述状态信息对所述车辆在之后一个或多个时刻的行驶行为进行预测,得到所述一个或多个时刻的所有可能的行驶行为;使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
可选地,在所述驾驶行为决策模型包括第一模型和第二模型的情况下,所述决策单元1010具体用于:
使用所述第二模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
可选地,所述接收单元1030还用于:
接收所述服务器发送的所述模仿学习模型的第三参数,所述第三参数是基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型后得到的,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;
所述调整单元1040还用于:
根据所述第三参数确定所述驾驶行为决策模型的初始参数。
可选地,所述第一参数是所述服务器基于模仿学习方法、使用满足第二预设条件的所述驾驶行为决策信息训练所述模仿学习模型后得到的,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
可选地,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
可选地,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
可选地,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
图11是本申请一个实施例提供的训练驾驶行为决策模型的装置1100的示意性框图。应理解,图11示出的训练驾驶行为决策模型的装置1100仅是示例,本申请实施例的装置1100还可包括其他模块或单元。应理解,训练驾驶行为决策模型的装置1100能够执行图6、图7或图8的方法中的各个步骤,为了避免重复,此处不再详述。
接收单元1110,用于接收车辆发送的驾驶行为决策信息,所述驾驶行为决策信息是所述车辆使用驾驶行为决策模型根据所述车辆的状态信息进行决策后得到的;
训练单元1120,用于基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模仿学习模型的第一参数,所述第一参数用于调整所述驾驶行为决策模型的参数;
发送单元1130,用于向所述车辆发送所述第一参数。
可选地,所述训练单元1120还用于:
基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型,得到所述模仿学习模型的第三参数,其中,所述第三参数用于确定所述驾驶行为决策模型的初始参数,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;
所述发送单元1130还用于:向所述车辆发送所述第三参数。
可选地,所述训练单元1120具体用于:
基于模仿学习方法,根据满足第二预设条件的所述驾驶行为决策信息训练模仿学习模型,得到所述模仿学习模型的第一参数,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
可选地,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
可选地,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
可选地,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
图12是本申请实施例提供的训练驾驶行为决策模型的装置的硬件结构示意图。图12所示的训练驾驶行为决策模型的装置3000(该装置3000具体可以是一种计算机设备)包括存储器3001、处理器3002、通信接口3003以及总线3004。其中,存储器3001、处理器3002、通信接口3003通过总线3004实现彼此之间的通信连接。
存储器3001可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器3001可以存储程序,当存储器3001中存储的程序被处理器3002执行时,处理器3002用于执行本申请实施例的训练驾驶行为决策模型的方法的各个步骤。
处理器3002可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请方法实施例的训练驾驶行为决策模型的方法。
处理器3002还可以是一种集成电路芯片,具有信号的处理能力,例如,可以是图3所示的芯片。在实现过程中,本申请的训练驾驶行为决策模型的方法的各个步骤可以通过处理器3002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器3002还可以是通用处理器、数字信号处理器(digital signal processing, DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器3001,处理器3002读取存储器3001中的信息,结合其硬件完成本训练驾驶行为决策模型的装置中包括的单元所需执行的功能,或者执行本申请方法实施例的训练驾驶行为决策模型的方法。
通信接口3003使用例如但不限于收发器一类的收发装置,来实现装置3000与其他设备或通信网络之间的通信。例如,可以通过通信接口3003获取车辆的状态信息、车辆的行驶数据以及训练驾驶行为决策模型的过程中需要的训练数据。
总线3004可包括在装置3000各个部件(例如,存储器3001、处理器3002、通信接口3003)之间传送信息的通路。
应理解,本申请实施例中的处理器可以为中央处理单元(central processing unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的随机存取存储器(random access memory,RAM)可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令或计算机程序。在计算机上加载或执行所述计算机指令或计算机程序时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线或无线(例如红外、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系,但也可能表示的是一种“和/或”的关系,具体可参考前后文进行理解。
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (40)

  1. 一种训练驾驶行为决策模型的方法,其特征在于,包括:
    使用驾驶行为决策模型,根据车辆的状态信息进行决策,得到驾驶行为决策信息;
    向服务器发送所述驾驶行为决策信息;
    接收所述服务器发送的模仿学习模型的第一参数,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的;
    根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数,包括:
    基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数;
    根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数。
  3. 根据权利要求2所述的方法,其特征在于,所述驾驶行为决策模型包括第一模型和第二模型;
    其中,所述基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数,包括:
    基于强化学习方法,根据所述驾驶行为决策信息对所述第一模型的参数进行调整,得到所述第二参数;
    在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,所述第一预设条件为间隔预设的时间间隔或对所述第一模型的参数调整预设的次数。
  4. 根据权利要求3所述的方法,其特征在于,所述根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数,包括:
    根据所述第一参数,调整所述第一模型的参数和/或所述第二模型的参数。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述使用驾驶行为决策模型,根据所述状态信息进行决策,得到驾驶行为决策信息,包括:
    基于所述车辆的动力学模型及运动学模型,根据所述状态信息对所述车辆在之后一个或多个时刻的行驶行为进行预测,得到所述一个或多个时刻的所有可能的行驶行为;
    使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
  6. 根据权利要求5所述的方法,其特征在于,在所述驾驶行为决策模型包括第一模型和第二模型的情况下,所述使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息,包括:
    使用所述第二模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述方法还包括:
    接收所述服务器发送的所述模仿学习模型的第三参数,所述第三参数是基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型后得到的,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;
    根据所述第三参数确定所述驾驶行为决策模型的初始参数。
  8. 根据权利要求1至7中任一项所述的方法,其特征在于,所述第一参数是所述服务器基于模仿学习方法、使用满足第二预设条件的所述驾驶行为决策信息训练所述模仿学习模型后得到的,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
  9. 根据权利要求8所述的方法,其特征在于,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
  10. 根据权利要求8或9所述的方法,其特征在于,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
  11. 根据权利要求10所述的方法,其特征在于,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
  12. 一种训练驾驶行为决策模型的方法,其特征在于,包括:
    接收车辆发送的驾驶行为决策信息,所述驾驶行为决策信息是所述车辆使用驾驶行为决策模型根据所述车辆的状态信息进行决策后得到的;
    基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模仿学习模型的第一参数,所述第一参数用于调整所述驾驶行为决策模型的参数;
    向所述车辆发送所述第一参数。
  13. 根据权利要求12所述的方法,其特征在于,所述方法还包括:
    基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型,得到所述模仿学习模型的第三参数,其中,所述第三参数用于确定所述驾驶行为决策模型的初始参数,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;
    向所述车辆发送所述第三参数。
  14. 根据权利要求12或13所述的方法,其特征在于,所述基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模仿学习模型的第一参数,包括:
    基于模仿学习方法,根据满足第二预设条件的所述驾驶行为决策信息训练模仿学习模型,得到所述模仿学习模型的第一参数,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
  15. 根据权利要求14所述的方法,其特征在于,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
  16. 根据权利要求14或15所述的方法,其特征在于,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
  17. 根据权利要求16所述的方法,其特征在于,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
  18. 一种训练驾驶行为决策模型的装置,其特征在于,包括:
    决策单元,用于使用驾驶行为决策模型,根据车辆的状态信息进行决策,得到驾驶行为决策信息;
    发送单元,用于向服务器发送所述驾驶行为决策信息;
    接收单元,用于接收所述服务器发送的模仿学习模型的第一参数,所述第一参数是所述服务器基于模仿学习方法、使用所述驾驶行为决策信息训练所述模仿学习模型后得到的;
    调整单元,用于根据所述驾驶行为决策信息与所述第一参数,调整所述驾驶行为决策模型的参数。
  19. 根据权利要求18所述的装置,其特征在于,所述调整单元具体用于:
    基于强化学习方法,根据所述驾驶行为决策信息对所述驾驶行为决策模型的参数进行调整,得到第二参数;
    根据所述第一参数,调整所述驾驶行为决策模型的所述第二参数。
  20. 根据权利要求19所述的装置,其特征在于,所述驾驶行为决策模型包括第一模型和第二模型;
    其中,所述调整单元具体用于:
    基于强化学习方法,根据所述驾驶行为决策信息对所述第一模型的参数进行调整,得到所述第二参数;
    在满足第一预设条件的情况下,将所述第二模型的参数更新为所述第二参数,所述第一预设条件为间隔预设的时间间隔或对所述第一模型的参数调整预设的次数。
  21. 根据权利要求20所述的装置,其特征在于,所述调整单元具体用于:
    根据所述第一参数,调整所述第一模型的参数和/或所述第二模型的参数。
  22. 根据权利要求18至21中任一项所述的装置,其特征在于,所述决策单元具体用于:
    基于所述车辆的动力学模型及运动学模型,根据所述状态信息对所述车辆在之后一个或多个时刻的行驶行为进行预测,得到所述一个或多个时刻的所有可能的行驶行为;
    使用所述驾驶行为决策模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
  23. 根据权利要求22所述的装置,其特征在于,在所述驾驶行为决策模型包括第一模型和第二模型的情况下,所述决策单元具体用于:
    使用所述第二模型,对所述所有可能的行驶行为进行评估,得到所述驾驶行为决策信息。
  24. 根据权利要求18至23中任一项所述的装置,其特征在于,所述接收单元还用于:
    接收所述服务器发送的所述模仿学习模型的第三参数,所述第三参数是基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型后得到的,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;
    所述调整单元还用于:
    根据所述第三参数确定所述驾驶行为决策模型的初始参数。
  25. 根据权利要求18至24中任一项所述的装置,其特征在于,所述第一参数是所述服务器基于模仿学习方法、使用满足第二预设条件的所述驾驶行为决策信息训练所述模仿学习模型后得到的,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
  26. 根据权利要求25所述的装置,其特征在于,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
  27. 根据权利要求25或26所述的装置,其特征在于,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
  28. 根据权利要求27所述的装置,其特征在于,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
  29. 一种训练驾驶行为决策模型的装置,其特征在于,包括:
    接收单元,用于接收车辆发送的驾驶行为决策信息,所述驾驶行为决策信息是所述车辆使用驾驶行为决策模型根据所述车辆的状态信息进行决策后得到的;
    训练单元,用于基于模仿学习方法,根据所述驾驶行为决策信息训练模仿学习模型,得到所述模仿学习模型的第一参数,所述第一参数用于调整所述驾驶行为决策模型的参数;
    发送单元,用于向所述车辆发送所述第一参数。
  30. 根据权利要求29所述的装置,其特征在于,所述训练单元还用于:
    基于模仿学习方法、使用决策专家系统输出的数据训练所述模仿学习模型,得到所述模仿学习模型的第三参数,其中,所述第三参数用于确定所述驾驶行为决策模型的初始参数,所述决策专家系统是根据驾驶员的驾驶数据及车辆的动力学特性设计的;
    所述发送单元还用于:
    向所述车辆发送所述第三参数。
  31. 根据权利要求29或30所述的装置,其特征在于,所述训练单元具体用于:
    基于模仿学习方法,根据满足第二预设条件的所述驾驶行为决策信息训练模仿学习模型,得到所述模仿学习模型的第一参数,所述第二预设条件包括所述驾驶行为决策信息为所述状态信息对应的合理驾驶行为决策。
  32. 根据权利要求31所述的装置,其特征在于,所述第二预设条件还包括所述状态信息的噪声在第一预设范围内。
  33. 根据权利要求31或32所述的装置,其特征在于,所述状态信息是多个状态信息中的一个,所述第二预设条件还包括所述多个状态信息是在多个场景中获取的。
  34. 根据权利要求33所述的装置,其特征在于,所述第二预设条件还包括:所述多个状态信息中,在所述多个场景中任意一个场景中获取的状态信息的数量与在所述多个场景中任意一个其他场景中获取的状态信息的数量之间的差值在第二预设范围内。
  35. 一种训练驾驶行为决策模型的装置,其特征在于,包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令来执行权利要求1至11中任一项所述的方法。
  36. 一种训练驾驶行为决策模型的装置,其特征在于,包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令来执行权利要求12至17中任一项所述的方法。
  37. 一种汽车,其特征在于,包括权利要求18至28中任一项或权利要求35所述的装置。
  38. 一种服务器,其特征在于,包括权利要求29至34中任一项或权利要求36所述的装置。
  39. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有程序指令,当所述程序指令由处理器运行时,实现权利要求1至17中任一项所述的方法。
  40. 一种芯片,其特征在于,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,以执行如权利要求1至17中任一项所述的方法。
PCT/CN2021/091964 2020-06-06 2021-05-06 训练驾驶行为决策模型的方法及装置 WO2021244207A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010508722.3 2020-06-06
CN202010508722.3A CN113835421B (zh) 2020-06-06 2020-06-06 训练驾驶行为决策模型的方法及装置

Publications (1)

Publication Number Publication Date
WO2021244207A1 true WO2021244207A1 (zh) 2021-12-09

Family

ID=78830645

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/091964 WO2021244207A1 (zh) 2020-06-06 2021-05-06 训练驾驶行为决策模型的方法及装置

Country Status (2)

Country Link
CN (1) CN113835421B (zh)
WO (1) WO2021244207A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11742901B2 (en) * 2020-07-27 2023-08-29 Electronics And Telecommunications Research Institute Deep learning based beamforming method and apparatus
CN114407931B (zh) * 2022-02-21 2024-05-03 东南大学 一种高度类人的自动驾驶营运车辆安全驾驶决策方法
CN116070783B (zh) * 2023-03-07 2023-05-30 北京航空航天大学 一种混动传动系统在通勤路段下的学习型能量管理方法

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102295004B (zh) * 2011-06-09 2013-07-03 中国人民解放军国防科学技术大学 一种车道偏离预警方法
US11106211B2 (en) * 2018-04-02 2021-08-31 Sony Group Corporation Vision-based sample-efficient reinforcement learning framework for autonomous driving
CN108550279B (zh) * 2018-04-03 2019-10-18 同济大学 基于机器学习的车辆驾驶行为预测方法
US11327156B2 (en) * 2018-04-26 2022-05-10 Metawave Corporation Reinforcement learning engine for a radar system
US20200033869A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers that implement autonomous driver agents and a policy server for serving policies to autonomous driver agents for controlling an autonomous vehicle
CN110187639B (zh) * 2019-06-27 2021-05-11 吉林大学 一种基于参数决策框架的轨迹规划控制方法
CN110758403B (zh) * 2019-10-30 2022-03-01 北京百度网讯科技有限公司 自动驾驶车辆的控制方法、装置、设备及存储介质

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110238289A1 (en) * 2010-03-24 2011-09-29 Sap Ag Navigation device and method for predicting the destination of a trip
US20180032864A1 (en) * 2016-07-27 2018-02-01 Google Inc. Selecting actions to be performed by a reinforcement learning agent using tree search
CN106874597A (zh) * 2017-02-16 2017-06-20 北理慧动(常熟)车辆科技有限公司 一种应用于自动驾驶车辆的高速公路超车行为决策方法
WO2018175441A1 (en) * 2017-03-20 2018-09-27 Mobileye Vision Technologies Ltd. Navigation by augmented path prediction
JP2019098949A (ja) * 2017-12-04 2019-06-24 アセントロボティクス株式会社 学習方法、学習装置及び学習プログラム
CN110858328A (zh) * 2018-08-06 2020-03-03 纳恩博(北京)科技有限公司 用于模仿学习的数据采集方法、装置及存储介质
CN110060475A (zh) * 2019-04-17 2019-07-26 清华大学 一种基于深度强化学习的多交叉口信号灯协同控制方法
CN110322017A (zh) * 2019-08-13 2019-10-11 吉林大学 基于深度强化学习的自动驾驶智能车轨迹跟踪控制策略

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114078242A (zh) * 2022-01-19 2022-02-22 浙江吉利控股集团有限公司 基于自动驾驶的感知决策模型升级方法及系统
CN116302010A (zh) * 2023-05-22 2023-06-23 安徽中科星驰自动驾驶技术有限公司 自动驾驶系统升级包生成方法、装置、计算机设备及介质
CN116700012A (zh) * 2023-07-19 2023-09-05 合肥工业大学 一种多智能体的避撞编队合围控制器的设计方法
CN116700012B (zh) * 2023-07-19 2024-03-01 合肥工业大学 一种多智能体的避撞编队合围控制器的设计方法

Also Published As

Publication number Publication date
CN113835421B (zh) 2023-12-15
CN113835421A (zh) 2021-12-24

Similar Documents

Publication Publication Date Title
CN109901574B (zh) 自动驾驶方法及装置
EP3835908B1 (en) Automatic driving method, training method and related apparatuses
WO2021244207A1 (zh) 训练驾驶行为决策模型的方法及装置
US20210262808A1 (en) Obstacle avoidance method and apparatus
CN110379193B (zh) 自动驾驶车辆的行为规划方法及行为规划装置
US20220379920A1 (en) Trajectory planning method and apparatus
US20220332348A1 (en) Autonomous driving method, related device, and computer-readable storage medium
WO2021102955A1 (zh) 车辆的路径规划方法以及车辆的路径规划装置
WO2021000800A1 (zh) 道路可行驶区域推理方法及装置
WO2022001773A1 (zh) 轨迹预测方法及装置
US20220080972A1 (en) Autonomous lane change method and apparatus, and storage medium
CN110371132B (zh) 驾驶员接管评估方法及装置
WO2021212379A1 (zh) 车道线检测方法及装置
WO2021168669A1 (zh) 车辆控制方法及装置
CN111950726A (zh) 基于多任务学习的决策方法、决策模型训练方法及装置
US20230048680A1 (en) Method and apparatus for passing through barrier gate crossbar by vehicle
CN113954858A (zh) 一种规划车辆行驶路线的方法以及智能汽车
WO2022017307A1 (zh) 自动驾驶场景生成方法、装置及系统
US20230399023A1 (en) Vehicle Driving Intention Prediction Method, Apparatus, and Terminal, and Storage Medium
US20230107033A1 (en) Method for optimizing decision-making regulation and control, method for controlling traveling of vehicle, and related apparatus
WO2021254000A1 (zh) 车辆纵向运动参数的规划方法和装置
CN114261404A (zh) 一种自动驾驶方法及相关装置
US20230256970A1 (en) Lane Change Track Planning Method and Apparatus
WO2022001432A1 (zh) 推理车道的方法、训练车道推理模型的方法及装置
WO2021097823A1 (zh) 用于确定车辆可通行空间的方法和装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21816838

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21816838

Country of ref document: EP

Kind code of ref document: A1