CN107862346B

CN107862346B - Method and equipment for training driving strategy model

Info

Publication number: CN107862346B
Application number: CN201711257831.7A
Authority: CN
Inventors: 许稼轩; 周小成
Original assignee: Uisee Technologies Beijing Co Ltd
Current assignee: Uisee Technologies Beijing Co Ltd
Priority date: 2017-12-01
Filing date: 2017-12-01
Publication date: 2020-06-30
Anticipated expiration: 2037-12-01
Also published as: CN107862346A

Abstract

The application aims to provide a method or equipment for training a driving strategy model; obtaining model parameter information corresponding to a driving strategy model of driving equipment, wherein the model parameter information is determined by pre-training the driving strategy model based on preset driving rule information, and the driving strategy model is established based on a reinforcement learning algorithm; and acquiring driving parameter information of driving equipment in driving, and training the driving strategy model based on the model parameter information. Compared with the prior art, the driving strategy model does not need to be explored from zero in the training process, the driving equipment already learns to drive like the driving rule before the training process is started, the training process of the driving strategy model on the basis is greatly shortened, and the damage to the vehicle caused by the times of unreasonable driving strategies and the training process is also greatly reduced.

Description

Method and equipment for training driving strategy model

Technical Field

The application relates to the field of automatic driving, in particular to a technology for training a driving strategy model.

Background

With the development and application of machine learning technology, for example, the development of reinforcement learning technology, in the existing automatic driving technology, the driving control for vehicles, especially for automatic driving vehicles, can be realized by a reinforcement learning neural network trained by a reinforcement learning algorithm, i.e. real-time status information of the vehicle is input into the reinforcement learning neural network, so as to output corresponding driving strategy information, however, in the existing training for the reinforcement learning neural network, for each vehicle to be trained, the corresponding neural network parameters need to be continuously trained from zero, however, in practical application, for different vehicles, because the vehicle parameters (vehicle length, weight, axle distance, parts, etc.) are different, if reinforcement learning training from zero is performed for each vehicle, the long training and trial and error process is required, which brings huge training cost. In addition, if a large amount of training and trial and error is applied to an actual vehicle, a long time is consumed, and a vehicle body is greatly damaged.

Disclosure of Invention

The application aims to provide a method and equipment for training a driving strategy model.

According to one aspect of the present application, there is provided a method of performing driving strategy model training, comprising:

obtaining model parameter information corresponding to a driving strategy model of driving equipment, wherein the model parameter information is determined by pre-training the driving strategy model based on preset driving rule information, and the driving strategy model is established based on a reinforcement learning algorithm;

and acquiring driving parameter information of driving equipment in driving, and training the driving strategy model based on the model parameter information.

According to still another aspect of the present application, there is also provided a driving apparatus for performing driving maneuver model training, including:

the device comprises an acquisition device and a control device, wherein the acquisition device is used for acquiring model parameter information corresponding to a driving strategy model of driving equipment, the model parameter information is determined by pre-training the driving strategy model based on preset driving rule information, and the driving strategy model is established based on a reinforcement learning algorithm;

and the training device is used for acquiring driving parameter information of driving equipment in running and training the driving strategy model based on the model parameter information.

According to another aspect of the present application, there is also provided a driving apparatus for performing driving strategy model training, including:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for:

According to another aspect of the present application, there is also provided a computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to:

Compared with the prior art, the driving strategy model is trained on the driving equipment based on the model parameter information corresponding to the driving strategy model of the vehicle, and the driving strategy model is pre-trained based on the preset driving rule information. Based on the characteristics that the stability and the simplicity of the vehicle are controlled through the driving rule information, and the driving rule information can be commonly used on different vehicles in most cases, the pre-trained driving strategy model can simulate the driving rule information to drive, and the pre-training can be suitable for different types of vehicles; moreover, the rules are used for constraint, so that the times of unreasonable driving strategies in the pre-training of the driving strategy model can be reduced, and the reasonability and stability of the final driving strategy information are improved; meanwhile, the efficiency of the training process of reinforcement learning can be improved, and the training time and the trial and error times are reduced. Therefore, the training of the driving strategy model does not need to be explored from zero, but before the training is started, the driving device already learns to drive like a driving rule, the model parameter information determined by the pre-training is relatively close to the final convergence value of the parameters of the driving strategy model, so that the process of training the driving strategy model on the basis of the driving strategy model is greatly shortened, and the damage to the vehicle caused by the unreasonable driving strategy times and the training process is greatly reduced, so that the application of the reinforcement learning technology to the real driving field, particularly the automatic driving field or the intelligent driving field is more feasible.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 illustrates a flow diagram of a method of performing driving strategy model training in accordance with an aspect of the subject application;

FIG. 2 illustrates an apparatus diagram of a driving apparatus for driving maneuver model training in accordance with another aspect of the present application;

FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described herein;

FIG. 4 illustrates an example diagram of pre-training a driving strategy model according to one embodiment of an aspect of the present application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present application is described in further detail below with reference to the attached figures.

In a typical configuration of the present application, the terminal, the device serving the network, and the computing device include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both non-transitory and non-transitory, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include non-transitory computer readable media (transient media), such as modulated data signals and carrier waves.

FIG. 1 illustrates a flow diagram of a method of performing driving strategy model training in accordance with an aspect of the subject application; wherein the method comprises step S11 and step S12. In one implementation of the present application, the method is performed on a driving device that performs driving strategy model training.

In step S11, model parameter information corresponding to a driving strategy model of a driving device may be obtained, where the model parameter information is determined by pre-training the driving strategy model based on predetermined driving rule information, and the driving strategy model is established based on a reinforcement learning algorithm; next, in step S12, driving parameter information during driving of the driving device may be acquired, and the driving strategy model may be trained based on the model parameter information.

In one implementation of the present application, the pre-training of the driving strategy model may be performed in a virtual environment built by the respective computing device. The computing device may include, but is not limited to, a simulator, a computer simulator, such as a Torcs simulator, and the like. The computing device may be separate from the steering device or may be included in the steering device.

In the present application, the driving device may include various driving devices that may be driven on the road, in the air, or in the water, such as an aircraft or a vehicle. The vehicle may include, but is not limited to, a vehicle traveling in any mode, such as a fully human driving mode, an assisted driving mode, a partially autonomous driving mode, a conditional autonomous driving mode, a highly autonomous driving mode, or a fully autonomous driving mode. In a preferred embodiment, the vehicle may comprise an unmanned vehicle or a smart drive vehicle, wherein, in one implementation, the unmanned vehicle may comprise the vehicle traveling in a fully autonomous mode; the smart driving vehicle may include a vehicle that travels in an assisted driving mode, a partial autonomous driving mode, a conditional autonomous driving mode, a highly autonomous driving mode, and the like. In a preferred embodiment, the driving device comprises a smart driving vehicle.

Specifically, in step S11, model parameter information corresponding to a driving strategy model of the driving device may be acquired, where the model parameter information is determined by pre-training the driving strategy model based on predetermined driving rule information, and the driving strategy model is established based on a reinforcement learning algorithm.

In the present application, the driving strategy model solves the problem of what driving strategy is executed by a driving device, such as a vehicle, in the current environment and state, that is, when the real-time environment state information of the vehicle, such as real-time driving parameter information, is known, the corresponding driving strategy information can be determined by inputting the real-time driving parameter information into the driving strategy model for the corresponding driving device to execute.

In the present application, the pre-training of the driving strategy model comprises a reinforcement learning training process. Specifically, it may refer to a learning process that achieves a goal through multiple steps of appropriate decision-making under a series of scenarios, which is a sequential multi-step decision-making problem. The goal of reinforcement learning is to find a strategy for achieving the maximum jackpot. In an application scenario of vehicle control according to the present application, in a possible implementation manner, a method for training a driving strategy model corresponding to vehicle control through a reinforcement learning algorithm includes: the method comprises the steps that a vehicle executes corresponding driving operation based on driving strategy information under the current environment and state, so that the environment and state of the vehicle are changed, a reward is obtained, namely a feedback function value is determined, the feedback function value reflects the change of the state of the vehicle after the vehicle adopts the driving strategy information, in one implementation mode, the state can be set to be good, the feedback function value is a positive number, and the larger the feedback function value is, the better the state is; otherwise, if the state is bad, the feedback function value is negative. Through the setting of the feedback function, the cycle process of interaction between the vehicle and the environment where the vehicle is located is controlled, and the driving strategy information of the vehicle is adjusted, so that the driving strategy model corresponding to the vehicle control is trained step by step and perfected. In the application, the reinforcement learning algorithm may further include a deep reinforcement learning algorithm that is integrated with deep learning on the basis of reinforcement learning, and the driving strategy model trained by the reinforcement learning algorithm and corresponding to vehicle control may further include a reinforcement learning neural network model. Here, the depth reinforcement Learning algorithm may include, but is not limited to, Deep Q Learning, Double Q-Network, Deep decision Gradient (DEP) method, and the like.

Here, model parameter information corresponding to a driving strategy model of a driving device may be acquired, where the model parameter information is determined by pre-training the driving strategy model based on predetermined driving rule information. Here, the model parameter information is a parameter training result obtained by initializing parameters of a driving strategy model, such as a reinforcement learning neural network, based on the pre-training.

Here, the driving strategy model is pre-trained based on predetermined driving rule information, i.e. the driving strategy model is constrained by the driving rule information. The feedback function corresponding to reinforcement learning training in the application is set as the similarity between the output value based on reinforcement learning and the output value based on the driving rule information. In one implementation, the reinforcement learning output value and the output value based on the driving rule information may respectively correspond to specific driving strategy information, and the similarity between the two may correspond to a distance between the strategy information.

In the present application, the driving rule information includes a process of deriving a certain driving strategy information through a predetermined logic formula based on the input driving parameter information or the historical driving parameter information. Here, the driving rule information may include various rules of an existing driving scenario, a known driving experience, and a set output control strategy. In one implementation, the driving rule information may include, but is not limited to, one or more of various types of rules such as an obstacle avoidance rule, a path planning rule, a pre-aiming rule, and the like. In one implementation, the driving rule information may correspond to a specific rule algorithm formula, the input information is driving parameter information or historical driving parameter information, and corresponding driving strategy information is output through calculation, for example, if the driving rule information includes a pre-aiming rule, and the input historical driving parameter information includes obstacle sensing information, such as a relative position and a size of a front obstacle, speed information of a current vehicle, and off-track direction information of the vehicle, corresponding driving strategy information such as steering wheel angle control, brake control, or throttle control of the vehicle is calculated through the rule algorithm formula, such as a current off-track direction Θ, and based on the rule, the steering wheel should be turned in the opposite direction by 2 Θ.

Next, in step S12, driving parameter information during driving of the driving device may be acquired, and the driving strategy model may be trained based on the model parameter information. Here, the driving parameter information may include various types of vehicle driving information reflecting the vehicle driving environment and the vehicle driving state. In one implementation, the driving parameter information includes, but is not limited to, at least any one of: speed information of the vehicle; off-track direction information of the vehicle; distance information between the vehicle and the track center line; distance information between the vehicle and the track edge; obstacle sensing information, such as relative position and size of the obstacle ahead; traffic sign awareness information, such as traffic light indicators, direction indicators, turn indicators, and the like. In one implementation, the driving parameter information may be vehicle driving information collected from various sensors, such as vehicle driving information collected in real-time. In one implementation, the initialized model parameter information may be used as a training initial value of reinforcement learning in a real scene, so as to further optimize a driving strategy model on a real driving device, such as optimizing parameters of a reinforcement learning neural network. The basic idea of training is as follows: and enabling the driving equipment, such as an intelligent driving vehicle, to drive based on the model parameter information trained by the pre-training through the driving rule information, judging the quality of the action by using a feedback function for the action output of each time of vehicle control, and then returning the judged result to a driving strategy model, such as an enhanced learning neural network, wherein the driving strategy model optimizes the model parameters based on the returned value until convergence.

The driving strategy model is trained on the driving equipment based on model parameter information corresponding to the driving strategy model of the vehicle, and the driving strategy model is pre-trained based on preset driving rule information. Based on the characteristics that the stability and the simplicity of the vehicle are controlled through the driving rule information, and the driving rule information can be commonly used on different vehicles in most cases, the pre-trained driving strategy model can simulate the driving rule information to drive, and the pre-training can be suitable for different types of vehicles; moreover, the rules are used for constraint, so that the times of unreasonable driving strategies in the pre-training of the driving strategy model can be reduced, and the reasonability and stability of the final driving strategy information are improved; meanwhile, the efficiency of the training process of reinforcement learning can be improved, and the training time and the trial and error times are reduced. Therefore, the training of the driving strategy model does not need to be explored from zero, but before the training is started, the driving device already learns to drive like a driving rule, the model parameter information determined by the pre-training is relatively close to the final convergence value of the parameters of the driving strategy model, so that the process of training the driving strategy model on the basis of the driving strategy model is greatly shortened, and the damage to the vehicle caused by the unreasonable driving strategy times and the training process is greatly reduced, so that the application of the reinforcement learning technology to the real driving field, particularly the automatic driving field or the intelligent driving field is more feasible.

In one embodiment of the present application, the pre-training comprises: determining first driving strategy information of the driving equipment based on historical driving parameter information of the driving equipment and corresponding driving rule information; determining second driving strategy information of the driving equipment through a reinforcement learning algorithm based on historical driving parameter information of the driving equipment; and training a driving strategy model based on the first driving strategy information and the second driving strategy information.

Specifically, here, the historical driving parameter information may be known driving parameter information obtained by various computing devices, such as simulators, e.g., Torcs simulators. Here, the historical driving parameter information may be input data of the driving strategy model, that is, may be sample data of the pre-training.

Here, the driving strategy information in the present application, for example, the first driving strategy information and the second driving strategy information of the vehicle, may include control information on various driving behaviors of the vehicle, for example, steering angle control of the vehicle, brake control of the vehicle, accelerator control of the vehicle, and the like. It should be understood by those skilled in the art that the above-mentioned driving strategy information is only an example, and other driving strategy information, which may be present or later come, should be included in the scope of the present application if applicable to the present application, and is included herein by reference.

In one implementation, the driving rule information may correspond to a specific rule algorithm formula, the input information is driving parameter information, and the first driving strategy information is output through calculation, for example, the input driving parameter information includes obstacle sensing information, such as a relative position and a size of a front obstacle, speed information of a current vehicle, and off-track direction information of the vehicle, and then the second driving strategy information, such as a current off-track direction Θ, of the vehicle is calculated through the rule algorithm formula, and based on a rule, the steering wheel should turn in the opposite direction by 2 Θ.

Then, second driving strategy information of the driving device is determined through a reinforcement learning algorithm based on the historical driving parameter information of the driving device. In one implementation, when the first driving strategy information corresponding to the driving parameter information is determined based on the driving rule information, the second driving strategy information obtained through a reinforcement learning algorithm is determined based on the corresponding historical driving parameter information based on the consistency of time and tasks.

In order to achieve this, a corresponding feedback function may be set as a predetermined threshold value for the second driving strategy information and the first driving strategy information, for example, if the second driving strategy information outputted based on the reinforcement learning neural network includes a steering wheel angle Θ, an accelerator degree η, and a braking degree γ ', the first driving strategy information calculated and outputted based on the driving rule information includes a steering wheel angle Θ', an accelerator degree η ', and a braking degree γ', the corresponding feedback function is R ═ Θ - [ (Θ ') (η - η') (γ - γ) feedback ″ + ], the larger the feedback of both is, the smaller the feedback of both is applied, and the more the driving strategy model is trained to achieve the simulation of the actual driving rule.

Further, in an embodiment, the driving strategy model subjected to the pre-training satisfies an evaluation index of a corresponding first feedback function, where the evaluation index of the first feedback function includes that a distance between the second driving strategy information and the first driving strategy information is smaller than a predetermined threshold. In this case, the feedback function value can be flexibly adjusted by setting the predetermined threshold value, so that the accuracy of the driving strategy model is controlled, and the requirements of different actual scenes are reasonably met.

In one embodiment of the present application, the step S12 includes a step S121 (not shown), a step S122 (not shown), a step S123 (not shown), and a step S124 (not shown). Specifically, in step S121, driving parameter information of the driving device during driving may be acquired, and third driving strategy information of the driving device may be determined based on the model parameter information; next, in step S122, the third driving strategy information may be executed; next, in step S123, the execution result of executing the third driving strategy information may be determined by using the evaluation index of the second feedback function corresponding to the driving strategy model; next, in step S124, the driving strategy model may be adjusted based on the determination result. Here, the driving device, such as an intelligent driving vehicle, is caused to drive based on the model parameter information trained by the pre-training through the driving rule information, and the training on the basis of having learned to drive according to the driving rule is: and executing the third driving strategy information, judging whether the action is good or not by using a second feedback function for an execution result, and then returning the judgment result to a driving strategy model, such as an enhanced learning neural network, wherein the driving strategy model optimizes the model parameters based on the return value until convergence.

Further, in one embodiment, the second feedback function may be determined according to actual requirements. For example, if it is desired that the vehicle remain on the track and not collide, the faster the vehicle is, the better. The vehicle is excited with a back-off function that should be relatively high when the vehicle meets the expectations. In one embodiment, the evaluation indicator of the second feedback function includes at least one of: the distance between the vehicle and the central line of the track is smaller than a preset distance threshold value; the vehicle running direction coincides with the track line. For example, in practical applications, the second feedback function may be set as a criterion that the vehicle is driven as fast and safe as possible, that is, the vehicle is kept as close to the track center line as possible, and the driving direction is consistent with the track line, where R is a component of the velocity of the vehicle in the track direction, a component of the velocity of the vehicle in the non-track direction, and a component of the distance from the track center line.

Here, fig. 4 illustrates an example diagram of pre-training a driving strategy model according to an embodiment of an aspect of the present application. The driving device may be an unmanned vehicle, the driving parameter information may include unmanned vehicle state information, such as speed, driving angle, obstacle information, path planning information, and the like of the unmanned vehicle, and based on the driving parameter information, on one hand, the first driving strategy information, that is, the unmanned vehicle control 2, may be determined in combination with driving strategy information, such as a driving strategy based on a rule, and may include information such as steering wheel control, throttle control, and brake control; on the other hand, second driving strategy information, namely the unmanned vehicle control 1, can be output through the reinforcement learning neural network; then, based on the first driving strategy information and the second driving strategy information, the feedback function value corresponding to the reinforcement learning neural network is updated, so that the reinforcement learning neural network, namely the driving strategy model, can be trained. The unmanned vehicle can be used as the unmanned vehicle state information at the next moment based on the driving strategy information, such as the state after the unmanned vehicle control 1 or the unmanned vehicle control 2 executes the driving operation or the entering environment information.

Fig. 2 shows a device schematic of a driving device 1 for driving strategy model training according to another aspect of the present application. Wherein the driving device 1 comprises acquisition means 21 and training means 22.

The obtaining device 21 may obtain model parameter information corresponding to a driving strategy model of the driving apparatus 1, where the model parameter information is determined by pre-training the driving strategy model based on predetermined driving rule information, and the driving strategy model is established based on a reinforcement learning algorithm; the training device 22 may acquire driving parameter information during driving of the driving apparatus 1, and train the driving strategy model based on the model parameter information.

In one implementation of the present application, the pre-training of the driving strategy model may be performed in a virtual environment built by the respective computing device. The computing device may include, but is not limited to, a simulator, a computer simulator, such as a Torcs simulator, and the like. The computing device may be independent from the driving device 1 or may be included in the driving device 1.

In the present application, the driving device 1 may include various driving devices that can be driven on the road, in the air, or in the water, such as an aircraft and a vehicle. The vehicle may include, but is not limited to, a vehicle traveling in any mode, such as a fully human driving mode, an assisted driving mode, a partially autonomous driving mode, a conditional autonomous driving mode, a highly autonomous driving mode, or a fully autonomous driving mode. In a preferred embodiment, the vehicle may comprise an unmanned vehicle or a smart drive vehicle, wherein, in one implementation, the unmanned vehicle may comprise the vehicle traveling in a fully autonomous mode; the smart driving vehicle may include a vehicle that travels in an assisted driving mode, a partial autonomous driving mode, a conditional autonomous driving mode, a highly autonomous driving mode, and the like. In a preferred embodiment, the driving device 1 comprises a smart driving vehicle.

Specifically, the obtaining device 21 may obtain model parameter information corresponding to a driving strategy model of the driving apparatus 1, where the model parameter information is determined by pre-training the driving strategy model based on predetermined driving rule information, and the driving strategy model is established based on a reinforcement learning algorithm.

In the present application, the driving strategy model solves the problem of what driving strategy is implemented by the driving device 1, for example, in the current environment and state of the vehicle, that is, when the real-time environment individual state information of the vehicle, such as the real-time driving parameter information, is known, the corresponding driving strategy information can be determined by inputting the real-time driving parameter information into the driving strategy model for the corresponding driving device 1 to implement.

Here, model parameter information corresponding to a driving strategy model of the driving apparatus 1 may be acquired, wherein the model parameter information is determined by pre-training the driving strategy model based on predetermined driving rule information. Here, the model parameter information is a parameter training result obtained by initializing parameters of a driving strategy model, such as a reinforcement learning neural network, based on the pre-training.

The training device 22 may acquire driving parameter information during driving of the driving apparatus 1, and train the driving strategy model based on the model parameter information. Here, the driving parameter information may include various types of vehicle driving information reflecting the vehicle driving environment and the vehicle driving state. In one implementation, the driving parameter information includes, but is not limited to, at least any one of: speed information of the vehicle; off-track direction information of the vehicle; distance information between the vehicle and the track center line; distance information between the vehicle and the track edge; obstacle sensing information, such as relative position and size of the obstacle ahead; traffic sign awareness information, such as traffic light indicators, direction indicators, turn indicators, and the like. In one implementation, the driving parameter information may be vehicle driving information collected from various sensors, such as vehicle driving information collected in real-time. In one implementation, the initialized model parameter information may be used as a training initial value of reinforcement learning in a real scene, so as to further optimize a driving strategy model, such as optimizing parameters of a reinforcement learning neural network, on the real driving device 1. The basic idea of training is as follows: and enabling the driving equipment 1, such as an intelligent driving vehicle, to drive based on the model parameter information trained by the pre-training through the driving rule information, judging the quality of the action by using a feedback function for the action output of each time of vehicle control, and then returning the judged result to a driving strategy model, such as an enhanced learning neural network, wherein the driving strategy model optimizes the model parameters based on the returned value until convergence.

Here, the present application trains a driving strategy model of a vehicle on a driving device 1 based on obtaining model parameter information corresponding to the driving strategy model, where the driving strategy model has been pre-trained based on predetermined driving rule information. Based on the characteristics that the stability and the simplicity of the vehicle are controlled through the driving rule information, and the driving rule information can be commonly used on different vehicles in most cases, the pre-trained driving strategy model can simulate the driving rule information to drive, and the pre-training can be suitable for different types of vehicles; moreover, the rules are used for constraint, so that the times of unreasonable driving strategies in the pre-training of the driving strategy model can be reduced, and the reasonability and stability of the final driving strategy information are improved; meanwhile, the efficiency of the training process of reinforcement learning can be improved, and the training time and the trial and error times are reduced. Therefore, the training of the driving strategy model does not need to be explored from zero, but before the training is started, the driving device 1 already learns to drive like a driving rule, the model parameter information determined by the pre-training is relatively close to the final convergence value of the parameters of the driving strategy model, so that the process of training the driving strategy model on the basis of the driving strategy model is greatly shortened, and the damage to the vehicle caused by the unreasonable driving strategy times and the training process is greatly reduced, thereby the application of the reinforcement learning technology to the real driving field, in particular to the automatic driving or intelligent driving field is more feasible.

In one embodiment of the present application, the pre-training comprises: determining first driving strategy information of the driving device 1 based on historical driving parameter information of the driving device 1 and corresponding driving rule information; determining second driving strategy information of the driving device 1 through a reinforcement learning algorithm based on the historical driving parameter information of the driving device 1; and training a driving strategy model based on the first driving strategy information and the second driving strategy information.

Next, second driving strategy information of the driving apparatus 1 is determined by a reinforcement learning algorithm based on the historical driving parameter information of the driving apparatus 1. In one implementation, when the first driving strategy information corresponding to the driving parameter information is determined based on the driving rule information, the second driving strategy information obtained through a reinforcement learning algorithm is determined based on the corresponding historical driving parameter information based on the consistency of time and tasks.

In one embodiment of the present application, the training device 22 comprises a first unit 221 (not shown), a second unit 222 (not shown), a third unit 223 (not shown), and a fourth unit 224 (not shown). Specifically, the first unit 221 may acquire driving parameter information while the driving apparatus 1 is running, and determine third driving strategy information of the driving apparatus 1 based on the model parameter information; the second unit 222 may execute the third driving strategy information; the third unit 223 may determine the execution result of executing the third driving strategy information by using the evaluation index of the second feedback function corresponding to the driving strategy model; the fourth unit 224 may adjust the driving strategy model based on the determination result. Here, letting the driving apparatus 1, such as an intelligent driving vehicle, drive based on the model parameter information trained by the pre-training through the driving rule information, and further, the training on the basis of having learned to drive according to the driving rule is: and executing the third driving strategy information, judging whether the action is good or not by using a second feedback function for an execution result, and then returning the judgment result to a driving strategy model, such as an enhanced learning neural network, wherein the driving strategy model optimizes the model parameters based on the return value until convergence.

Further, in one embodiment, the second feedback function may be determined according to actual requirements. For example, if it is desired that the vehicle remain on the track and not collide, the faster the vehicle is, the better. The vehicle is excited with a back-off function that should be relatively high when the vehicle meets the expectations. In one embodiment, the evaluation indicator of the second feedback function includes at least one of: the distance between the vehicle and the central line of the track is smaller than a preset distance threshold value; the vehicle running direction coincides with the track line. For example, in practical applications, the second feedback function R, i.e. the component of the speed of the vehicle in the track direction, the component of the speed of the vehicle in the non-track direction, and the component of the distance of the vehicle from the track centerline, may be set as the criterion for making the vehicle drive as fast and safe as possible, i.e. keeping the vehicle as close to the track centerline as possible and making the driving direction and the track centerline coincide with each other

The present application further provides a driving device for performing driving strategy model training, including:

one or more processors;

a memory; and

Further, the program of the apparatus may also be used to perform corresponding operations in other related embodiments based on the above operations.

The present application further provides a computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to:

in some embodiments, as shown in fig. 3, the system 300 can be implemented as any of the steering devices shown in fig. 1, 2, 4, or other described embodiments. In some embodiments, system 300 may include one or more computer-readable media (e.g., system memory or NVM/storage 320) having instructions and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules to perform the actions described herein.

For one embodiment, system control module 310 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 305 and/or any suitable device or component in communication with system control module 310.

The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.

System memory 315 may be used, for example, to load and store data and/or instructions for system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 315 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).

For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.

For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).

NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed or may be accessed by the device and not necessarily part of the device. For example, NVM/storage 320 may be accessible over a network via communication interface(s) 325.

Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. System 300 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.

For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310 to form a system on a chip (SoC).

In various embodiments, system 300 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

It should be noted that the present invention may be implemented in software and/or in a combination of software and hardware, for example, as an Application Specific Integrated Circuit (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Also, the software programs (including associated data structures) of the present invention can be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Further, some of the steps or functions of the present invention may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

In addition, some of the present invention can be applied as a computer program product, such as computer program instructions, which when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of the computer. Program instructions which invoke the methods of the present invention may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. An embodiment according to the invention herein comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform a method and/or solution according to embodiments of the invention as described above.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Various aspects of various embodiments are defined in the claims. These and other aspects of the various embodiments are specified in the following numbered clauses:

1. a method of performing driving strategy model training, wherein the method comprises:

2. The method of clause 1, wherein the pre-training comprises:

determining first driving strategy information of the driving equipment based on historical driving parameter information of the driving equipment and corresponding driving rule information;

determining second driving strategy information of the driving equipment through a reinforcement learning algorithm based on historical driving parameter information of the driving equipment;

and training a driving strategy model based on the first driving strategy information and the second driving strategy information.

3. The method according to clause 2, wherein the pre-trained driving strategy model satisfies an evaluation index of a corresponding first feedback function, and the evaluation index of the first feedback function includes that a distance between the second driving strategy information and the first driving strategy information is smaller than a predetermined threshold.

4. The method of clause 1, wherein the obtaining driving parameter information while driving equipment is running, and training the driving strategy model based on the model parameter information comprises:

acquiring driving parameter information of the driving equipment in driving, and determining third driving strategy information of the driving equipment based on the model parameter information;

executing the third driving strategy information;

judging the execution result of executing the third driving strategy information by utilizing the evaluation index of a second feedback function corresponding to the driving strategy model;

adjusting the driving strategy model based on the judgment result;

5. the method of clause 4, wherein the evaluation indicator of the second feedback function comprises at least one of:

the distance between the driving equipment and the central line of the track is smaller than a preset distance threshold value;

the driving direction of the driving device is consistent with the track line.

6. The method of any of clauses 1-5, wherein the driving parameter information includes at least any one of:

speed information of the vehicle;

off-track direction information of the vehicle;

distance information between the vehicle and the track center line;

distance information between the vehicle and the track edge;

obstacle perception information;

traffic sign awareness information.

7. The method of any of clauses 1-5, wherein the driving device comprises a smart driving vehicle.

8. A driving apparatus for performing driving strategy model training, wherein the driving apparatus comprises:

9. The driving apparatus of clause 8, wherein the pre-training comprises:

10. The driving apparatus according to clause 9, wherein the pre-trained driving strategy model satisfies an evaluation index of a corresponding first feedback function, and the evaluation index of the first feedback function includes that a distance between the second driving strategy information and the first driving strategy information is smaller than a predetermined threshold.

11. The driving apparatus according to clause 8, wherein the training device includes:

the first unit is used for acquiring driving parameter information of the driving equipment in running and determining third driving strategy information of the driving equipment based on the model parameter information;

a second unit for executing the third driving strategy information;

a third unit, configured to determine an execution result of executing the third driving strategy information by using an evaluation index of a second feedback function corresponding to the driving strategy model;

a fourth unit configured to adjust the driving strategy model based on the determination result;

12. the driving apparatus according to clause 11, wherein the evaluation indicator of the second feedback function includes at least one of:

the driving direction of the driving device is consistent with the track line.

13. The driving apparatus according to any one of clauses 8 to 12, wherein the driving parameter information includes at least any one of:

speed information of the vehicle;

off-track direction information of the vehicle;

distance information between the vehicle and the track center line;

distance information between the vehicle and the track edge;

obstacle perception information;

traffic sign awareness information.

14. The driving device of any of clauses 8-12, wherein the driving device comprises a smart driving vehicle.

15. A driving apparatus for performing driving strategy model training, comprising:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the method of any of clauses 1-7.

16. A computer-readable storage medium having stored thereon a computer program executable by a processor to perform the method of any of clauses 1-7.

Claims

acquiring driving parameter information of driving equipment in driving, and training the driving strategy model based on the model parameter information;

the pre-training comprises:

and training a driving strategy model based on the first driving strategy information and the second driving strategy information, and determining the model parameter information.

2. The method of claim 1, wherein the training of a driving strategy model based on the first and second driving strategy information, the determining the model parameter information comprises:

and constraining the second driving strategy information by the first driving strategy information corresponding to the driving rule information.

3. The method of claim 2, wherein the pre-trained driving strategy model satisfies an evaluation indicator of a corresponding first feedback function, the evaluation indicator of the first feedback function comprising a distance of the second driving strategy information from the first driving strategy information being less than a predetermined threshold.

4. The method of claim 1, wherein the obtaining driving parameter information while a driving device is driving, and the training of the driving strategy model based on the model parameter information comprises:

executing the third driving strategy information;

and adjusting the driving strategy model based on the judgment result.

5. The method of claim 4, wherein the evaluation indicator of the second feedback function comprises at least one of:

the driving direction of the driving device is consistent with the track line.

6. The method according to any one of claims 1 to 5, wherein the driving parameter information comprises at least any one of:

speed information of the vehicle;

off-track direction information of the vehicle;

distance information between the vehicle and the track center line;

distance information between the vehicle and the track edge;

obstacle perception information;

traffic sign awareness information.

7. The method of any of claims 1-5, wherein the driving device comprises a smart driving vehicle.

the training device is used for acquiring driving parameter information of driving equipment in running and training the driving strategy model based on the model parameter information;

the pre-training comprises:

9. The driving apparatus of claim 8, wherein the training of a driving strategy model based on the first and second driving strategy information, the determining of the model parameter information comprises:

10. The driving apparatus according to claim 9, wherein the pre-trained driving strategy model satisfies an evaluation index of a corresponding first feedback function, the evaluation index of the first feedback function including that a distance between the second driving strategy information and the first driving strategy information is smaller than a predetermined threshold.

11. The driving apparatus according to claim 8, wherein the training device includes:

a second unit for executing the third driving strategy information;

a fourth unit for adjusting the driving strategy model based on the determination result.

12. The driving apparatus of claim 11, wherein the evaluation indicator of the second feedback function comprises at least one of:

the driving direction of the driving device is consistent with the track line.

13. The driving apparatus according to any one of claims 8 to 12, wherein the driving parameter information includes at least any one of:

speed information of the vehicle;

off-track direction information of the vehicle;

distance information between the vehicle and the track center line;

distance information between the vehicle and the track edge;

obstacle perception information;

traffic sign awareness information.

14. The driving device according to any one of claims 8 to 12, wherein the driving device comprises a smart driving vehicle.

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the method of any of claims 1-7.

16. A computer-readable storage medium, on which a computer program is stored, which computer program can be executed by a processor to perform the method according to any of claims 1-7.