CN108009587B - Method and equipment for determining driving strategy based on reinforcement learning and rules


Info

Publication number
CN108009587B
Authority
CN
China
Prior art keywords
information
driving
vehicle
driving strategy
strategy information
Prior art date
Legal status
Active
Application number
CN201711257834.0A
Other languages
Chinese (zh)
Other versions
CN108009587A
Inventor
许稼轩
周小成
Current Assignee
Uisee Technologies Beijing Co Ltd
Original Assignee
Uisee Technologies Beijing Co Ltd
Priority date
Filing date
Publication date
Application filed by Uisee Technologies Beijing Co Ltd
Priority to CN201711257834.0A
Publication of CN108009587A
Application granted
Publication of CN108009587B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Control Of Driving Devices And Active Controlling Of Vehicle (AREA)
  • Feedback Control In General (AREA)
  • Traffic Control Systems (AREA)

Abstract

The application aims to provide a method and equipment for determining a driving strategy based on reinforcement learning and rules: determining first driving strategy information of a vehicle through a reinforcement learning algorithm based on driving parameter information of the vehicle; performing rationality detection on the first driving strategy information based on the driving parameter information and driving rule information of the vehicle; and determining target driving strategy information of the vehicle based on the detection result of the rationality detection. Compared with the prior art, the method applies rule-based constraints to the first driving strategy information computed by the reinforcement learning algorithm, so that it is more intelligent than existing vehicle control realized by a rule algorithm or by a reinforcement learning algorithm alone, and the rationality and stability of the finally determined driving strategy are improved.

Description

Method and equipment for determining driving strategy based on reinforcement learning and rules
Technical Field
The application relates to the field of automatic driving, in particular to a technology for determining a driving strategy based on reinforcement learning and rules.
Background
In the existing vehicle driving process, vehicle control, especially for automated driving vehicles, is mainly realized in the following ways. First, vehicle control may be realized by a rule algorithm: such an algorithm is simple to implement, needs no training, and its output is predictable and stable; however, it is not intelligent, its right of way is easily taken by other vehicles, and it therefore cannot cope effectively with the complex scenes of real driving. Second, vehicle control may be realized by a reinforcement learning algorithm, which yields a more intelligent driving strategy; however, the time cost of training a reinforcement learning model is high, the output of the algorithm is unpredictable, and the approach cannot be applied to actual automated driving. Third, existing algorithms that fuse rules with reinforcement learning can only linearly combine the results determined by the rule algorithm and the reinforcement learning algorithm; the time cost of model training remains high and continuous trial and error is needed, so these algorithms likewise cannot be applied to actual automated driving.
Disclosure of Invention
The application aims to provide a method and equipment for determining a driving strategy based on reinforcement learning and rules.
According to one aspect of the present application, there is provided a method of determining a driving strategy based on reinforcement learning and rules, comprising:
determining first driving strategy information of a vehicle through a reinforcement learning algorithm based on driving parameter information of the vehicle;
performing rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle;
determining target driving strategy information of the vehicle based on a detection result of the rationality detection.
According to another aspect of the present application, there is provided an apparatus for determining a driving strategy based on reinforcement learning and rules, including:
first driving strategy information determination means for determining first driving strategy information of a vehicle by a reinforcement learning algorithm based on driving parameter information of the vehicle;
the detection device is used for carrying out rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle;
target driving strategy information determination means for determining target driving strategy information of the vehicle based on a detection result of the rationality detection.
According to another aspect of the present application, there is also provided an apparatus for determining a driving strategy based on reinforcement learning and rules, including:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for:
determining first driving strategy information of a vehicle through a reinforcement learning algorithm based on driving parameter information of the vehicle;
performing rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle;
determining target driving strategy information of the vehicle based on a detection result of the rationality detection.
According to another aspect of the present application, there is also provided a computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to:
determining first driving strategy information of a vehicle through a reinforcement learning algorithm based on driving parameter information of the vehicle;
performing rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle;
determining target driving strategy information of the vehicle based on a detection result of the rationality detection.
Compared with the prior art, the present application performs rationality detection on the first driving strategy information of the vehicle, determined through a reinforcement learning algorithm, based on the driving parameter information and the driving rule information of the vehicle, and determines the target driving strategy information of the vehicle based on the detection result of the rationality detection, so as to realize control of the vehicle, especially an unmanned vehicle or a smart driving vehicle. Vehicle control by a rule algorithm and vehicle control by a reinforcement learning algorithm are thereby fused more deeply: the first driving strategy information calculated by the reinforcement learning algorithm is constrained by the rules. Through this fusion, the method for determining a driving strategy is more intelligent than existing vehicle control realized by a rule algorithm or by a reinforcement learning algorithm alone, and the rationality and stability of the finally determined driving strategy are improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 illustrates a flow diagram of a method for determining a driving strategy based on reinforcement learning and rules, according to one aspect of the present application;
FIG. 2 illustrates a schematic diagram of an apparatus for determining a driving strategy based on reinforcement learning and rules, according to an aspect of the present application;
FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described in this application.
The same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present application is described in further detail below with reference to the attached figures.
In a typical configuration of the present application, the terminal, the device serving the network, and the computing device include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
The device referred to in this application includes, but is not limited to, a user device, a network device, or a device formed by integrating a user device and a network device through a network. The user device includes, but is not limited to, any mobile electronic product capable of human-computer interaction with a user (e.g., through a touch panel), such as a smart phone or a tablet computer, and the mobile electronic product may employ any operating system, such as an Android operating system or an iOS operating system. The network device includes an electronic device capable of automatically performing numerical calculation and information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like. The network device includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud of multiple servers; here, the cloud is composed of a large number of computers or network servers based on cloud computing, a kind of distributed computing in which a virtual supercomputer consists of a collection of loosely coupled computers. The network includes, but is not limited to, the internet, a wide area network, a metropolitan area network, a local area network, a VPN network, a wireless ad hoc network, and the like.
FIG. 1 illustrates a flow diagram of a method for determining a driving strategy based on reinforcement learning and rules, according to one aspect of the present application. Wherein the method comprises step S11, step S12 and step S13. In one implementation of the present application, the method is performed on a device that determines a driving strategy based on reinforcement learning and rules.
In step S11, first driving strategy information of the vehicle may be determined through a reinforcement learning algorithm based on the driving parameter information of the vehicle; next, in step S12, based on the driving parameter information and the driving rule information of the vehicle, the first driving strategy information may be subjected to rationality detection; next, in step S13, target driving strategy information of the vehicle may be determined based on the detection result of the rationality detection.
In the present application, the vehicle may include, but is not limited to, a vehicle traveling in any mode, such as a fully human driving mode, an assisted driving mode, a partially automated driving mode, a conditional automated driving mode, a highly automated driving mode, or a fully automated driving mode. In a preferred embodiment, the vehicle may comprise an unmanned vehicle or a smart drive vehicle, wherein, in one implementation, the unmanned vehicle may comprise the vehicle traveling in a fully autonomous mode; the smart driving vehicle may include a vehicle that travels in an assisted driving mode, a partial autonomous driving mode, a conditional autonomous driving mode, a highly autonomous driving mode, and the like.
Specifically, in step S11, the first driving strategy information of the vehicle may be determined by a reinforcement learning algorithm based on the driving parameter information of the vehicle. In an implementation manner of the present application, a driving strategy model corresponding to vehicle control may be trained through a reinforcement learning algorithm, and then driving parameter information of the vehicle is input into the driving strategy model, and the first driving strategy information is output.
Here, the driving parameter information may include various types of vehicle driving information reflecting the vehicle driving environment and the vehicle driving state. In one implementation, the driving parameter information includes, but is not limited to, at least any one of: speed information of the vehicle; off-track direction information of the vehicle; distance information between the vehicle and the track center line; distance information between the vehicle and the track edge; obstacle sensing information, such as the relative position and size of an obstacle ahead; and traffic sign sensing information, such as traffic light indicators, direction indicators, turn indicators, and the like. In one implementation, the driving parameter information may be vehicle driving information collected from various sensors, such as real-time vehicle driving information; in another implementation, the driving parameter information may also be obtained from other computing devices, such as simulators, e.g., the TORCS simulator.
Here, the driving strategy information in the present application, for example the first driving strategy information of the vehicle, may include control information for various driving behaviors of the vehicle, for example, steering wheel angle control, brake control, and throttle control of the vehicle. It should be understood by those skilled in the art that the above driving strategy information is only an example; other driving strategy information, whether existing now or developed in the future, should be included in the scope of the present application if applicable thereto, and is included herein by reference.
Here, reinforcement learning may refer to a learning process of achieving a goal through multiple steps of appropriate decisions in a series of scenarios, i.e., a sequential multi-step decision problem. The goal of reinforcement learning is to find a strategy that obtains the maximum cumulative reward. In the vehicle control application scenario of the present application, in one possible implementation, a method for training a driving strategy model corresponding to vehicle control through a reinforcement learning algorithm is as follows: the vehicle executes a driving operation based on the driving strategy information under the current environment and state, so that the environment and state of the vehicle change and a reward is obtained, i.e., a feedback function value is determined. The feedback function value reflects the change of the state of the vehicle after it adopts the driving strategy information. In one implementation, when the state is good, the feedback function value is set to a positive number, and the larger the feedback function value, the better the state; conversely, when the state is bad, the feedback function value is negative. Through the setting of the feedback function, the cyclic interaction between the vehicle and its environment is controlled and the driving strategy information of the vehicle is adjusted, so that the driving strategy model corresponding to vehicle control is gradually trained and perfected. In the present application, the reinforcement learning algorithm may further include a deep reinforcement learning algorithm that combines deep learning with reinforcement learning, and the driving strategy model trained by the reinforcement learning algorithm may correspondingly be a reinforcement learning neural network model. Here, the deep reinforcement learning algorithm may include, but is not limited to, Deep Q-Network (DQN), Double DQN, Deep Deterministic Policy Gradient (DDPG), and the like.
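As a concrete illustration of the feedback function described above, the following minimal Python sketch shows one way such a reward could be shaped; the state fields, the weights, and the -100 collision penalty are assumptions chosen for illustration, not the patent's actual formula.

```python
def feedback_value(speed, dist_to_centerline, collided):
    """Feedback (reward) for the state reached after executing a driving
    strategy: positive and larger when the state is good, negative when bad."""
    if collided:
        return -100.0  # bad state: strongly negative feedback
    # good state: reward forward progress, penalize drifting off the center line
    return speed - 2.0 * abs(dist_to_centerline)

# Example: a fast, well-centered state scores high; a collision scores -100.
print(feedback_value(speed=12.0, dist_to_centerline=0.5, collided=False))  # 11.0
print(feedback_value(speed=12.0, dist_to_centerline=0.5, collided=True))   # -100.0
```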
Further, the first driving strategy information of the vehicle may be determined by the reinforcement learning algorithm based on the driving parameter information of the vehicle. For example, if the driving parameter information includes the speed information of the vehicle, the off-track direction information of the vehicle, the distance information between the vehicle and the track center line, and the distance information between the vehicle and the track edge, then by inputting this driving parameter information into the reinforcement learning neural network, the first driving strategy information, such as steering wheel angle control, brake control, and throttle control of the vehicle, can be output.
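A minimal sketch of this inference step follows, with a fixed linear map standing in for the trained reinforcement learning network; the weights and output ranges are purely illustrative assumptions.

```python
import math

# Hypothetical weights standing in for a trained policy network: one row per
# control output (steer, throttle, brake), one column per driving parameter.
W = [[0.00, -0.80, -0.30, 0.05],   # steering
     [0.04,  0.00,  0.00, 0.00],   # throttle
     [-0.01, 0.20,  0.00, 0.00]]   # brake

def clip01(v):
    return max(0.0, min(1.0, v))

def first_driving_strategy(speed, off_track_angle, dist_center, dist_edge):
    """Map driving parameter information to (steer, throttle, brake)."""
    x = [speed, off_track_angle, dist_center, dist_edge]
    steer, throttle, brake = (sum(w * xi for w, xi in zip(row, x)) for row in W)
    # squash to control ranges: steer in [-1, 1], throttle/brake in [0, 1]
    return math.tanh(steer), clip01(throttle), clip01(brake)

print(first_driving_strategy(speed=10.0, off_track_angle=0.1,
                             dist_center=0.5, dist_edge=2.0))
```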
Next, in step S12, the rationality of the first driving strategy information may be detected based on the driving parameter information and the driving rule information of the vehicle. Further, in step S13, target driving strategy information of the vehicle is determined based on the detection result of the rationality detection.
In the present application, the driving rule information describes a process of deriving certain driving strategy information through a predetermined logic formula based on the input driving parameter information or historical driving parameter information. Here, the driving rule information may encode various rules drawn from existing driving scenarios, known driving experience, and preset output control strategies. In one implementation, the driving rule information may include, but is not limited to, one or more of various types of rules such as an obstacle avoidance rule, a path planning rule, a pre-aiming rule, and the like. For example, suppose the driving rule information includes a pre-aiming rule and the input historical driving parameter information includes obstacle sensing information (such as the relative position and size of a front obstacle), the speed information of the current vehicle, and the off-track direction information of the vehicle; the corresponding driving strategy information, such as steering wheel angle control, brake control, or throttle control of the vehicle, is then calculated through a rule algorithm formula. For instance, if the vehicle currently deviates from the track by an angle Θ, the rule may dictate that the steering wheel be turned by 2Θ in the opposite direction.
In one embodiment of the present application, in step S12, second driving strategy information of the vehicle may be determined based on the driving parameter information and the driving rule information of the vehicle. In one implementation, the driving rule information may correspond to a specific rule algorithm formula whose input is the driving parameter information and whose output, obtained through calculation, is the second driving strategy information such as steering wheel angle control, brake control, or throttle control of the vehicle. For example, if the input driving parameter information includes obstacle sensing information (such as the relative position and size of a front obstacle), the speed information of the current vehicle, and the off-track direction information of the vehicle, and the current off-track direction of the vehicle is Θ, then based on the rule it is calculated that the steering wheel should be turned by 2Θ in the opposite direction.
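The following Python sketch illustrates a rule formula of this kind; the 2Θ steering term comes from the example above, while the obstacle braking distance and throttle value are added assumptions.

```python
def second_driving_strategy(off_track_angle, obstacle_distance, speed):
    """Rule-derived (steer, throttle, brake): steer back against the current
    off-track angle; brake rather than accelerate near a front obstacle."""
    steer = -2.0 * off_track_angle        # turn 2*theta in the opposite direction
    if obstacle_distance < 10.0:          # hypothetical safety distance in meters
        throttle, brake = 0.0, 1.0
    else:
        throttle, brake = 0.5, 0.0
    return steer, throttle, brake

print(second_driving_strategy(off_track_angle=0.1, obstacle_distance=50.0, speed=8.0))
```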
Then, rationality detection is performed on the first driving strategy information based on the second driving strategy information. In one implementation, this may include performing similarity detection between the second driving strategy information and the first driving strategy information. For example, assume the first driving strategy information includes a steering wheel angle Θ, a throttle degree η, and a brake degree γ, while the second driving strategy information determined based on the driving rule information includes a steering wheel angle Θ′, a throttle degree η′, and a brake degree γ′. The similarity between the two may be determined by comparing the specific strategy parameters, for example by computing the distance R = |Θ − Θ′| + |η − η′| + |γ − γ′|; the smaller the value of R, the greater the similarity between the second driving strategy information and the first driving strategy information. In one implementation, a predetermined strategy threshold may be set, and the judgment rule of the rationality detection is flexibly configured by comparing the R value against this threshold. For example, if the distance between the second driving strategy information and the first driving strategy information is greater than or equal to the predetermined threshold, the first driving strategy information is judged unreasonable; conversely, if the distance is less than the predetermined threshold, the first driving strategy information is judged reasonable.
In one embodiment, in step S13, if the detection result of the rationality detection includes that the distance between the second driving strategy information and the first driving strategy information is greater than or equal to the predetermined threshold, the second driving strategy information is determined as the target driving strategy information of the vehicle; if the detection result includes that the distance is less than the predetermined threshold, the first driving strategy information is determined as the target driving strategy information of the vehicle. Since the output of reinforcement learning is difficult to predict, a small error can be fatal in a real unmanned or smart driving scenario. Therefore, in the present application, the output of reinforcement learning, i.e., the first driving strategy information, undergoes rationality detection using the driving rule information. If the first driving strategy information is clearly unreasonable, for example, the vehicle has already crossed the track line yet the strategy still steers further in the departing direction, or an obstacle is clearly present a short distance ahead yet the strategy still accelerates, then the constraint of the driving rule information catches the error and stops the first driving strategy information from actually being executed, and the automatic driving operation is realized by adopting the second driving strategy information determined based on the driving rule information instead. Here, the driving rule information describes a process of deriving certain driving strategy information through a predetermined logic formula based on the input driving parameter information or historical driving parameter information, and may include, but is not limited to, one or more specific rules such as an obstacle avoidance rule, a path planning rule, a pre-aiming rule, and the like. In one implementation, the rationality detection may check the various specific rules contained in the driving rule information in turn, and it may be configured that if any specific rule is not satisfied, the first driving strategy information is judged unreasonable and the second driving strategy information is determined as the target driving strategy information of the vehicle.
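Putting the two preceding steps together, a minimal sketch of the distance-based guard might look as follows; the threshold value is an illustrative assumption, and the strategies are the (steer, throttle, brake) tuples from the earlier sketches.

```python
def strategy_distance(first, second):
    """R: sum of absolute parameter differences; smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(first, second))

def target_strategy(first, second, threshold=0.8):
    """Keep the reinforcement learning output only if it stays close to the
    rule-derived strategy; otherwise fall back to the rule-derived strategy."""
    if strategy_distance(first, second) >= threshold:
        return second  # unreasonable: the rule strategy becomes the target
    return first       # reasonable: the RL strategy becomes the target

# Here R = 0.7 + 0.8 + 1.0 = 2.5 >= 0.8, so the rule strategy is returned.
print(target_strategy(first=(0.5, 0.8, 0.0), second=(-0.2, 0.0, 1.0)))
```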
In one embodiment of the present application, in step S12, vehicle state information of the vehicle after virtually executing the first driving strategy information may be determined based on the driving parameter information. In the present application, another possible way to perform the rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle is as follows: in a virtual environment, the first driving strategy information determined from the driving parameter information is executed under the current environment and state of the vehicle, so as to obtain the vehicle state information at the next moment; the content of this vehicle state information may overlap with the driving parameter information at the next moment, i.e., it may include various types of vehicle driving information reflecting the vehicle driving environment and the vehicle driving state. Further, rationality detection is performed on the vehicle state information based on the driving rule information of the vehicle. Here, since the vehicle state information is the direct result of the vehicle executing the first driving strategy information, whether the vehicle state information is reasonable directly reflects whether the first driving strategy information is reasonable, and executing the first driving strategy information in the virtual environment to generate the vehicle state information avoids the risk of unnecessary vehicle damage and the like in an actual driving scene. Here, the virtual environment may be constructed by an emulator or simulator, such as the TORCS simulator.
Next, in one embodiment, in step S13, if the detection result of the rationality detection includes that the vehicle state information falls within the vehicle safety range, the first driving strategy information is taken as the target driving strategy information of the vehicle; otherwise, the target driving strategy information of the vehicle is generated based on the driving parameter information and the driving rule information of the vehicle. For example, if the vehicle state information obtained by executing the first driving strategy information is that the vehicle collides with an obstacle, the vehicle state information exceeds the vehicle safety range and violates the obstacle avoidance rule, so the result of the rationality detection is unreasonable, and the target driving strategy information of the vehicle is therefore generated based on the driving parameter information and the driving rule information of the vehicle.
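A sketch of this simulation-based variant follows. The one-step vehicle model stands in for a real simulator such as TORCS, and the safety range (staying between the track edges) is a simplified assumption.

```python
def simulate_step(state, action, dt=0.1):
    """Hypothetical one-step vehicle model: state is (lateral_offset, heading,
    speed), action is (steer, throttle, brake). A real system would roll the
    action through a simulator instead of this toy kinematic update."""
    offset, heading, speed = state
    steer, throttle, brake = action
    new_speed = max(0.0, speed + (throttle - 2.0 * brake) * dt * 10.0)
    new_heading = heading + steer * dt
    new_offset = offset + new_speed * new_heading * dt
    return (new_offset, new_heading, new_speed)

def within_safety_range(state, track_half_width=4.0):
    """Safety check: the vehicle must stay between the track edges."""
    return abs(state[0]) <= track_half_width

def choose_target(first, rule_fallback, state):
    """Virtually execute the first strategy; keep it only if the resulting
    vehicle state information stays inside the vehicle safety range."""
    return first if within_safety_range(simulate_step(state, first)) else rule_fallback

# Accelerating while steering away near the track edge leaves the safety
# range in simulation, so the rule-derived fallback is returned.
print(choose_target(first=(0.9, 1.0, 0.0), rule_fallback=(-0.2, 0.0, 1.0),
                    state=(3.8, 0.5, 20.0)))
```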
In one embodiment of the present application, the method further includes step S14 (not shown); in step S14, an automatic driving operation may be executed based on the target driving strategy information. Here, based on the target driving strategy information determined through the rationality detection, a corresponding driving operation may be performed on a real vehicle, for example, an automatic driving operation may be performed on an unmanned vehicle or a smart driving vehicle.
In an embodiment of the present application, the method further includes step S15 (not shown), and in step S15, the feedback function value corresponding to the reinforcement learning algorithm may be updated based on the detection result of the rationality detection.
In one implementation, when the state is good, the feedback function value is set to a positive number, and the larger the feedback function value, the better the state; conversely, when the state is bad, the feedback function value is negative. Through the setting of the feedback function, the cyclic interaction between the vehicle and its environment is controlled and the driving strategy information of the vehicle is adjusted, so that the driving strategy model corresponding to vehicle control is gradually trained and perfected.
Therefore, if, based on the detection result of the rationality detection, the determined target driving strategy information of the vehicle does not include the first driving strategy information, i.e., the detection result is unreasonable, the feedback function value corresponding to the reinforcement learning algorithm is set to a negative number, for example, -100. The reinforcement learning neural network updates its parameters based on the feedback function value after each decision: the smaller the feedback function value, the smaller the probability of making a similar decision next time, so a similar, unreasonable situation can be avoided in the future. Conversely, if the rationality detection result is reasonable, i.e., the target driving strategy corresponds to the first driving strategy information, the feedback function value is set to a positive number.
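A minimal sketch of this update rule, with illustrative magnitudes assumed for the two cases:

```python
def feedback_from_detection(reasonable):
    """Feedback value driven by the rationality detection: strongly negative
    when the RL output was rejected, positive when it was kept (the -100 value
    follows the example above; the positive magnitude is an assumption)."""
    return 10.0 if reasonable else -100.0

# After each decision the network parameters would be updated with this value,
# lowering the probability of repeating a rejected (unreasonable) decision.
print(feedback_from_detection(reasonable=False))  # -100.0
```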
The present application performs rationality detection on the first driving strategy information of the vehicle, determined through a reinforcement learning algorithm, based on the driving parameter information and the driving rule information of the vehicle, and determines the target driving strategy information of the vehicle based on the detection result of the rationality detection, so as to realize control of the vehicle, especially an unmanned vehicle or a smart driving vehicle. Vehicle control by a rule algorithm and vehicle control by a reinforcement learning algorithm are thereby fused more deeply: the first driving strategy information calculated by the reinforcement learning algorithm is constrained by the rules. Through this fusion, the method for determining a driving strategy is more intelligent than existing vehicle control realized by a rule algorithm or by a reinforcement learning algorithm alone, and the rationality and stability of the finally determined driving strategy are improved.
Fig. 2 shows a schematic view of an arrangement 1 for determining a driving strategy based on reinforcement learning and rules according to an aspect of the present application, wherein the arrangement 1 comprises first driving strategy information determining means 21, detecting means 22 and target driving strategy information determining means 23.
Wherein the first driving strategy information determining means 21 may determine the first driving strategy information of the vehicle by a reinforcement learning algorithm based on the driving parameter information of the vehicle; the detection means 22 may perform the rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle; the target driving strategy information determination means 23 may determine the target driving strategy information of the vehicle based on the detection result of the rationality detection.
In the present application, the vehicle may include, but is not limited to, a vehicle traveling in any mode, such as a fully human driving mode, an assisted driving mode, a partially automated driving mode, a conditional automated driving mode, a highly automated driving mode, or a fully automated driving mode. In a preferred embodiment, the vehicle may comprise an unmanned vehicle or a smart drive vehicle, wherein, in one implementation, the unmanned vehicle may comprise the vehicle traveling in a fully autonomous mode; the smart driving vehicle may include a vehicle that travels in an assisted driving mode, a partial autonomous driving mode, a conditional autonomous driving mode, a highly autonomous driving mode, and the like.
Specifically, the first driving strategy information determining means 21 may determine the first driving strategy information of the vehicle by a reinforcement learning algorithm based on the driving parameter information of the vehicle. In an implementation manner of the present application, a driving strategy model corresponding to vehicle control may be trained through a reinforcement learning algorithm, and then driving parameter information of the vehicle is input into the driving strategy model, and the first driving strategy information is output.
Here, the driving parameter information may include various types of vehicle driving information reflecting the vehicle driving environment and the vehicle driving state. In one implementation, the driving parameter information includes, but is not limited to, at least any one of: speed information of the vehicle; off-track direction information of the vehicle; distance information between the vehicle and the track center line; distance information between the vehicle and the track edge; obstacle sensing information, such as the relative position and size of an obstacle ahead; and traffic sign sensing information, such as traffic light indicators, direction indicators, turn indicators, and the like. In one implementation, the driving parameter information may be vehicle driving information collected from various sensors, such as real-time vehicle driving information; in another implementation, the driving parameter information may also be obtained from other computing devices, such as simulators, e.g., the TORCS simulator.
Here, the driving strategy information in the present application, for example the first driving strategy information of the vehicle, may include control information for various driving behaviors of the vehicle, for example, steering wheel angle control, brake control, and throttle control of the vehicle. It should be understood by those skilled in the art that the above driving strategy information is only an example; other driving strategy information, whether existing now or developed in the future, should be included in the scope of the present application if applicable thereto, and is included herein by reference.
Here, reinforcement learning may refer to a learning process of achieving a goal through multiple steps of appropriate decisions in a series of scenarios, i.e., a sequential multi-step decision problem. The goal of reinforcement learning is to find a strategy that obtains the maximum cumulative reward. In the vehicle control application scenario of the present application, in one possible implementation, a method for training a driving strategy model corresponding to vehicle control through a reinforcement learning algorithm is as follows: the vehicle executes a driving operation based on the driving strategy information under the current environment and state, so that the environment and state of the vehicle change and a reward is obtained, i.e., a feedback function value is determined. The feedback function value reflects the change of the state of the vehicle after it adopts the driving strategy information. In one implementation, when the state is good, the feedback function value is set to a positive number, and the larger the feedback function value, the better the state; conversely, when the state is bad, the feedback function value is negative. Through the setting of the feedback function, the cyclic interaction between the vehicle and its environment is controlled and the driving strategy information of the vehicle is adjusted, so that the driving strategy model corresponding to vehicle control is gradually trained and perfected. In the present application, the reinforcement learning algorithm may further include a deep reinforcement learning algorithm that combines deep learning with reinforcement learning, and the driving strategy model trained by the reinforcement learning algorithm may correspondingly be a reinforcement learning neural network model. Here, the deep reinforcement learning algorithm may include, but is not limited to, Deep Q-Network (DQN), Double DQN, Deep Deterministic Policy Gradient (DDPG), and the like.
Further, the first driving strategy information of the vehicle may be determined by the reinforcement learning algorithm based on the driving parameter information of the vehicle. For example, if the driving parameter information includes the speed information of the vehicle, the off-track direction information of the vehicle, the distance information between the vehicle and the track center line, and the distance information between the vehicle and the track edge, then by inputting this driving parameter information into the reinforcement learning neural network, the first driving strategy information, such as steering wheel angle control, brake control, and throttle control of the vehicle, can be output.
Here, the detection device 22 may detect the rationality of the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle. Further, the target driving strategy information determining means 23 may determine the target driving strategy information of the vehicle based on a detection result of the rationality detection.
In the present application, the driving rule information describes a process of deriving certain driving strategy information through a predetermined logic formula based on the input driving parameter information or historical driving parameter information. Here, the driving rule information may encode various rules drawn from existing driving scenarios, known driving experience, and preset output control strategies. In one implementation, the driving rule information may include, but is not limited to, one or more of various types of rules such as an obstacle avoidance rule, a path planning rule, a pre-aiming rule, and the like. For example, suppose the driving rule information includes a pre-aiming rule and the input historical driving parameter information includes obstacle sensing information (such as the relative position and size of a front obstacle), the speed information of the current vehicle, and the off-track direction information of the vehicle; the corresponding driving strategy information, such as steering wheel angle control, brake control, or throttle control of the vehicle, is then calculated through a rule algorithm formula. For instance, if the vehicle currently deviates from the track by an angle Θ, the rule may dictate that the steering wheel be turned by 2Θ in the opposite direction.
In one embodiment of the present application, the detection device 22 may include a first unit (not shown) and a second unit (not shown); the first unit may determine second driving strategy information of the vehicle based on the driving parameter information and the driving rule information of the vehicle. In one implementation, the driving rule information may correspond to a specific rule algorithm formula whose input is the driving parameter information and whose output, obtained through calculation, is the second driving strategy information such as steering wheel angle control, brake control, or throttle control of the vehicle. For example, if the input driving parameter information includes obstacle sensing information (such as the relative position and size of a front obstacle), the speed information of the current vehicle, and the off-track direction information of the vehicle, and the current off-track direction of the vehicle is Θ, then based on the rule it is calculated that the steering wheel should be turned by 2Θ in the opposite direction.
Here, the second unit may perform rationality detection on the first driving strategy information based on the second driving strategy information. In one implementation, this may include performing similarity detection between the second driving strategy information and the first driving strategy information. For example, assume the first driving strategy information includes a steering wheel angle Θ, a throttle degree η, and a brake degree γ, while the second driving strategy information determined based on the driving rule information includes a steering wheel angle Θ′, a throttle degree η′, and a brake degree γ′. The similarity between the two may be determined by comparing the specific strategy parameters, for example by computing the distance R = |Θ − Θ′| + |η − η′| + |γ − γ′|; the smaller the value of R, the greater the similarity between the second driving strategy information and the first driving strategy information. In one implementation, a predetermined strategy threshold may be set, and the judgment rule of the rationality detection is flexibly configured by comparing the R value against this threshold. For example, if the distance between the second driving strategy information and the first driving strategy information is greater than or equal to the predetermined threshold, the first driving strategy information is judged unreasonable; conversely, if the distance is less than the predetermined threshold, the first driving strategy information is judged reasonable.
Next, the target driving strategy information determining means 23 may determine the second driving strategy information as the target driving strategy information of the vehicle if the detection result of the rationality detection includes that the distance between the second driving strategy information and the first driving strategy information is greater than or equal to the predetermined threshold; if the detection result includes that the distance is less than the predetermined threshold, the apparatus 1 may determine the first driving strategy information as the target driving strategy information of the vehicle. Since the output of reinforcement learning is difficult to predict, a small error can be fatal in a real unmanned or smart driving scenario. Therefore, in the present application, the output of reinforcement learning, i.e., the first driving strategy information, undergoes rationality detection using the driving rule information. If the first driving strategy information is clearly unreasonable, for example, the vehicle has already crossed the track line yet the strategy still steers further in the departing direction, or an obstacle is clearly present a short distance ahead yet the strategy still accelerates, then the constraint of the driving rule information catches the error and stops the first driving strategy information from actually being executed, and the automatic driving operation is realized by adopting the second driving strategy information determined based on the driving rule information instead. Here, the driving rule information describes a process of deriving certain driving strategy information through a predetermined logic formula based on the input driving parameter information or historical driving parameter information, and may include, but is not limited to, one or more specific rules such as an obstacle avoidance rule, a path planning rule, a pre-aiming rule, and the like. In one implementation, the rationality detection may check the various specific rules contained in the driving rule information in turn, and it may be configured that if any specific rule is not satisfied, the first driving strategy information is judged unreasonable and the second driving strategy information is determined as the target driving strategy information of the vehicle.
In one implementation, if the detection result of the rationality detection is unreasonable, the second driving strategy information needs to be determined as the target driving strategy information of the vehicle; at this time, second driving strategy information based on a plurality of different driving rules may exist, for example, when an obstacle avoidance rule, a path planning rule, and a pre-aiming rule apply at the same time.
In one embodiment of the present application, the detection device 22 further includes a third unit (not shown) and a fourth unit (not shown); the third unit may determine vehicle state information of the vehicle after virtually executing the first driving strategy information based on the driving parameter information. In the present application, another possible way to perform the rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle is as follows: in a virtual environment, the first driving strategy information determined from the driving parameter information is executed under the current environment and state of the vehicle, so as to obtain the vehicle state information at the next moment; the content of this vehicle state information may overlap with the driving parameter information at the next moment, i.e., it may include various types of vehicle driving information reflecting the vehicle driving environment and the vehicle driving state. Further, the fourth unit may perform rationality detection on the vehicle state information based on the driving rule information of the vehicle. Here, since the vehicle state information is the direct result of the vehicle executing the first driving strategy information, whether the vehicle state information is reasonable directly reflects whether the first driving strategy information is reasonable, and executing the first driving strategy information in the virtual environment to generate the vehicle state information avoids the risk of unnecessary vehicle damage and the like in an actual driving scene. Here, the virtual environment may be constructed by an emulator or simulator, such as the TORCS simulator.
Next, in one embodiment, if the detection result of the rationality detection includes that the vehicle state information falls within the vehicle safety range, the target driving strategy information determining means 23 may take the first driving strategy information as the target driving strategy information of the vehicle; otherwise, the target driving strategy information of the vehicle is generated based on the driving parameter information and the driving rule information of the vehicle. For example, if the vehicle state information obtained by executing the first driving strategy information is that the vehicle collides with an obstacle, the vehicle state information exceeds the vehicle safety range and violates the obstacle avoidance rule, so the result of the rationality detection is unreasonable, and the target driving strategy information of the vehicle is therefore generated based on the driving parameter information and the driving rule information of the vehicle.
In one embodiment of the present application, the apparatus 1 further comprises an execution device (not shown), which may execute an automatic driving operation based on the target driving strategy information. Here, based on the target driving strategy information determined through the rationality detection, a corresponding driving operation may be performed on a real vehicle, for example, an automatic driving operation may be performed on an unmanned vehicle or a smart driving vehicle.
In an embodiment of the application, the apparatus 1 further includes an updating device (not shown), which may update the feedback function value corresponding to the reinforcement learning algorithm based on the detection result of the rationality detection.
In one implementation, when the state is good, the feedback function value is set to a positive number, and the larger the feedback function value, the better the state; conversely, when the state is bad, the feedback function value is negative. Through the setting of the feedback function, the cyclic interaction between the vehicle and its environment is controlled and the driving strategy information of the vehicle is adjusted, so that the driving strategy model corresponding to vehicle control is gradually trained and perfected.
Therefore, if, based on the detection result of the rationality detection, the determined target driving strategy information of the vehicle does not include the first driving strategy information, i.e., the detection result is unreasonable, the feedback function value corresponding to the reinforcement learning algorithm is set to a negative number, for example, -100. The reinforcement learning neural network updates its parameters based on the feedback function value after each decision: the smaller the feedback function value, the smaller the probability of making a similar decision next time, so a similar, unreasonable situation can be avoided in the future. Conversely, if the rationality detection result is reasonable, i.e., the target driving strategy corresponds to the first driving strategy information, the feedback function value is set to a positive number.
The present application performs rationality detection on the first driving strategy information of the vehicle, determined through a reinforcement learning algorithm, based on the driving parameter information and the driving rule information of the vehicle, and determines the target driving strategy information of the vehicle based on the detection result of the rationality detection, so as to realize control of the vehicle, especially an unmanned vehicle or a smart driving vehicle. Vehicle control by a rule algorithm and vehicle control by a reinforcement learning algorithm are thereby fused more deeply: the first driving strategy information calculated by the reinforcement learning algorithm is constrained by the rules. Through this fusion, the method for determining a driving strategy is more intelligent than existing vehicle control realized by a rule algorithm or by a reinforcement learning algorithm alone, and the rationality and stability of the finally determined driving strategy are improved.
The present application further provides an apparatus for determining a driving strategy based on reinforcement learning and rules, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for:
determining first driving strategy information of a vehicle through a reinforcement learning algorithm based on driving parameter information of the vehicle;
performing rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle;
determining target driving strategy information of the vehicle based on a detection result of the rationality detection.
Further, the program of the apparatus may also be used to perform corresponding operations in other related embodiments based on the above operations.
The present application further provides a computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to:
determining first driving strategy information of a vehicle through a reinforcement learning algorithm based on driving parameter information of the vehicle;
performing rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle;
determining target driving strategy information of the vehicle based on a detection result of the rationality detection.
Further, based on the above operations, the computer program may also be executed by the processor to perform the corresponding operations in the other related embodiments.
FIG. 3 illustrates an exemplary system that can be used to implement the various embodiments described herein.
In some embodiments, as shown in FIG. 3, the system 300 can be used as the device 1 for determining a driving strategy based on reinforcement learning and rules in any of the embodiments shown in FIG. 1, FIG. 2, or the other embodiments described herein. In some embodiments, system 300 may include one or more computer-readable media (e.g., system memory 315 or NVM/storage 320) having instructions, and one or more processors (e.g., processor(s) 305) coupled with the one or more computer-readable media and configured to execute the instructions to implement modules that perform the actions described herein.
For one embodiment, system control module 310 may include any suitable interface controllers to provide any suitable interface to at least one of processor(s) 305 and/or any suitable device or component in communication with system control module 310.
The system control module 310 may include a memory controller module 330 to provide an interface to the system memory 315. Memory controller module 330 may be a hardware module, a software module, and/or a firmware module.
System memory 315 may be used, for example, to load and store data and/or instructions for system 300. For one embodiment, system memory 315 may include any suitable volatile memory, such as suitable DRAM. In some embodiments, the system memory 315 may include a double data rate type four synchronous dynamic random access memory (DDR4 SDRAM).
For one embodiment, system control module 310 may include one or more input/output (I/O) controllers to provide an interface to NVM/storage 320 and communication interface(s) 325.
For example, NVM/storage 320 may be used to store data and/or instructions. NVM/storage 320 may include any suitable non-volatile memory (e.g., flash memory) and/or may include any suitable non-volatile storage device(s) (e.g., one or more Hard Disk Drives (HDDs), one or more Compact Disc (CD) drives, and/or one or more Digital Versatile Disc (DVD) drives).
NVM/storage 320 may include storage resources that are physically part of the device on which system 300 is installed, or storage resources that are accessible by the device without necessarily being part of it. For example, NVM/storage 320 may be accessed over a network via communication interface(s) 325.
Communication interface(s) 325 may provide an interface for system 300 to communicate over one or more networks and/or with any other suitable device. System 300 may wirelessly communicate with one or more components of a wireless network according to any of one or more wireless network standards and/or protocols.
For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) (e.g., memory controller module 330) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be packaged together with logic for one or more controller(s) of the system control module 310 to form a System In Package (SiP). For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310. For one embodiment, at least one of the processor(s) 305 may be integrated on the same die with logic for one or more controller(s) of the system control module 310 to form a system on a chip (SoC).
In various embodiments, system 300 may be, but is not limited to being: a server, a workstation, a desktop computing device, or a mobile computing device (e.g., a laptop computing device, a handheld computing device, a tablet, a netbook, etc.). In various embodiments, system 300 may have more or fewer components and/or different architectures. For example, in some embodiments, system 300 includes one or more cameras, a keyboard, a Liquid Crystal Display (LCD) screen (including a touch screen display), a non-volatile memory port, multiple antennas, a graphics chip, an Application Specific Integrated Circuit (ASIC), and speakers.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
It should be noted that the present invention may be implemented in software and/or in a combination of software and hardware, for example, as an Application Specific Integrated Circuit (ASIC), a general purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention may be executed by a processor to implement the steps or functions described above. Also, the software programs (including associated data structures) of the present invention can be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Further, some of the steps or functions of the present invention may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.
In addition, a portion of the present invention may be provided as a computer program product, for example computer program instructions which, when executed by a computer, may invoke or provide the method and/or technical solution according to the present invention through the operation of the computer. The program instructions that invoke the methods of the present invention may be stored on a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in a working memory of a computer device that operates according to the program instructions. An embodiment according to the present invention comprises an apparatus comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when executed by the processor, the computer program instructions trigger the apparatus to perform the methods and/or technical solutions according to the embodiments of the invention described above.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims (12)

1. A method of determining a driving strategy based on reinforcement learning and rules, wherein the method comprises:
determining first driving strategy information of a vehicle through a reinforcement learning algorithm based on driving parameter information of the vehicle; the driving parameter information includes information reflecting a vehicle running environment and a vehicle running state;
determining second driving strategy information of the vehicle based on the driving parameter information and the driving rule information of the vehicle; the driving rule information is a rule for outputting a driving strategy based on input driving parameter information or historical driving parameter information;
performing rationality detection on the first driving strategy information based on the second driving strategy information;
the rationality detection comprises: carrying out similarity detection on the second driving strategy information and the first driving strategy information;
if the distance between the second driving strategy information and the first driving strategy information is smaller than a preset threshold value, determining the first driving strategy information as target driving strategy information of the vehicle; otherwise, determining the second driving strategy information as the target driving strategy information of the vehicle; wherein the distance is determined based on a difference of the same strategy parameter in the first and second driving strategy information.
2. The method of claim 1, wherein the driving parameter information includes at least any one of:
speed information of the vehicle;
off-track direction information of the vehicle;
distance information between the vehicle and the track center line;
distance information between the vehicle and the track edge;
obstacle perception information;
traffic sign awareness information.
3. The method of claim 1, wherein the method further comprises:
and executing automatic driving operation based on the target driving strategy information.
4. The method of claim 1, wherein the method further comprises:
and updating the feedback function value corresponding to the reinforcement learning algorithm based on the detection result of the rationality detection.
5. The method of claim 4, wherein updating the feedback function value corresponding to the reinforcement learning algorithm based on the detection result of the rationality check comprises:
and if the determined target driving strategy information of the vehicle does not comprise the first driving strategy information based on the detection result of the rationality detection, setting a feedback function value corresponding to the reinforcement learning algorithm as a negative number.
6. An apparatus for determining a driving strategy based on reinforcement learning and rules, wherein the apparatus comprises:
first driving strategy information determination means for determining first driving strategy information of a vehicle by a reinforcement learning algorithm based on driving parameter information of the vehicle; the driving parameter information includes information reflecting a vehicle running environment and a vehicle running state;
the detection device is used for carrying out rationality detection on the first driving strategy information based on the driving parameter information and the driving rule information of the vehicle; the driving rule information is a rule for outputting a driving strategy based on input driving parameter information or historical driving parameter information; the rationality detection comprises: carrying out similarity detection on the second driving strategy information and the first driving strategy information;
target driving strategy information determining means for determining the second driving strategy information as target driving strategy information of the vehicle if the detection result of the rationality detection includes that the distance between the second driving strategy information and the first driving strategy information is greater than or equal to a predetermined threshold; wherein the distance is determined based on a difference of the same strategy parameter in the first and second driving strategy information;
wherein the detection device comprises:
a first unit configured to determine second driving strategy information of the vehicle based on the driving parameter information and driving rule information of the vehicle;
and the second unit is used for carrying out rationality detection on the first driving strategy information based on the second driving strategy information.
7. The apparatus of claim 6, wherein the driving parameter information comprises at least any one of:
speed information of the vehicle;
off-track direction information of the vehicle;
distance information between the vehicle and the track center line;
distance information between the vehicle and the track edge;
obstacle perception information;
traffic sign awareness information.
8. The apparatus of claim 6, wherein the apparatus further comprises:
and executing means for executing an automatic driving operation based on the target driving strategy information.
9. The apparatus of claim 6, wherein the apparatus further comprises:
and the updating device is used for updating the feedback function value corresponding to the reinforcement learning algorithm based on the detection result of the rationality detection.
10. The apparatus of claim 9, wherein the updating means is to:
and if the determined target driving strategy information of the vehicle does not comprise the first driving strategy information based on the detection result of the rationality detection, setting a feedback function value corresponding to the reinforcement learning algorithm as a negative number.
11. An apparatus for determining a driving strategy based on reinforcement learning and rules, comprising:
one or more processors;
a memory; and
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs comprising instructions for performing the method of any of claims 1-5.
12. A computer-readable storage medium, on which a computer program is stored, which computer program can be executed by a processor to perform the method according to any of claims 1-5.
CN201711257834.0A 2017-12-01 2017-12-01 Method and equipment for determining driving strategy based on reinforcement learning and rules Active CN108009587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711257834.0A CN108009587B (en) 2017-12-01 2017-12-01 Method and equipment for determining driving strategy based on reinforcement learning and rules

Publications (2)

Publication Number Publication Date
CN108009587A CN108009587A (en) 2018-05-08
CN108009587B true CN108009587B (en) 2021-04-16

Family

ID=62056248

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711257834.0A Active CN108009587B (en) 2017-12-01 2017-12-01 Method and equipment for determining driving strategy based on reinforcement learning and rules

Country Status (1)

Country Link
CN (1) CN108009587B (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108803604A (en) * 2018-06-06 2018-11-13 深圳市易成自动驾驶技术有限公司 Vehicular automatic driving method, apparatus and computer readable storage medium
US20200033869A1 (en) * 2018-07-27 2020-01-30 GM Global Technology Operations LLC Systems, methods and controllers that implement autonomous driver agents and a policy server for serving policies to autonomous driver agents for controlling an autonomous vehicle
CN109407662B (en) * 2018-08-31 2022-10-14 百度在线网络技术(北京)有限公司 Unmanned vehicle control method and device
CN109255442B (en) * 2018-09-27 2022-08-23 北京百度网讯科技有限公司 Training method, device and readable medium for control decision module based on artificial intelligence
CN113015981A (en) * 2018-11-16 2021-06-22 华为技术有限公司 System and method for efficient, continuous and safe learning using first principles and constraints
CN111413957B (en) 2018-12-18 2021-11-02 北京航迹科技有限公司 System and method for determining driving actions in autonomous driving
CN109492835B (en) * 2018-12-28 2021-02-19 东软睿驰汽车技术(沈阳)有限公司 Method for determining vehicle control information, method for training model and related device
CN109727470B (en) * 2019-01-08 2020-09-11 北京超星未来科技有限公司 Complex scene passing decision method for distributed intelligent network-connected automobile intersection
CN109808704B (en) * 2019-01-15 2023-03-14 北京百度网讯科技有限公司 Driving strategy management method, device and equipment
CN109839937B (en) * 2019-03-12 2023-04-07 百度在线网络技术(北京)有限公司 Method, device and computer equipment for determining automatic driving planning strategy of vehicle
CN109991987B (en) * 2019-04-29 2023-08-04 北京智行者科技股份有限公司 Automatic driving decision-making method and device
CN110304045B (en) * 2019-06-25 2020-12-15 中国科学院自动化研究所 Intelligent driving transverse lane change decision-making method, system and device
CN110673602B (en) * 2019-10-24 2022-11-25 驭势科技(北京)有限公司 Reinforced learning model, vehicle automatic driving decision method and vehicle-mounted equipment
CN110703766B (en) * 2019-11-07 2022-01-11 南京航空航天大学 Unmanned aerial vehicle path planning method based on transfer learning strategy deep Q network
CN111026127B (en) * 2019-12-27 2021-09-28 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning
JP2021116783A (en) * 2020-01-29 2021-08-10 トヨタ自動車株式会社 Vehicular control device and vehicular control system
CN111338227B (en) * 2020-05-18 2020-12-01 南京三满互联网络科技有限公司 Electronic appliance control method and control device based on reinforcement learning and storage medium
CN111679660B (en) * 2020-06-16 2022-08-05 中国科学院深圳先进技术研究院 Unmanned deep reinforcement learning method integrating human-like driving behaviors
CN112198794A (en) * 2020-09-18 2021-01-08 哈尔滨理工大学 Unmanned driving method based on human-like driving rule and improved depth certainty strategy gradient
CN112295237A (en) * 2020-10-19 2021-02-02 深圳大学 Deep reinforcement learning-based decision-making method
CN114527737A (en) * 2020-11-06 2022-05-24 百度在线网络技术(北京)有限公司 Speed planning method, device, equipment, medium and vehicle for automatic driving
CN113044064B (en) * 2021-04-01 2022-07-29 南京大学 Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning
CN113449823B (en) * 2021-08-31 2021-11-19 成都深蓝思维信息技术有限公司 Automatic driving model training method and data processing equipment
CN114261400A (en) * 2022-01-07 2022-04-01 京东鲲鹏(江苏)科技有限公司 Automatic driving decision-making method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008127465A1 (en) * 2007-04-11 2008-10-23 Nec Laboratories America, Inc. Real-time driving danger level prediction
CN105059287A (en) * 2015-07-31 2015-11-18 奇瑞汽车股份有限公司 Lane keeping method and device
CN106289797A (en) * 2016-07-19 2017-01-04 百度在线网络技术(北京)有限公司 For the method and apparatus testing automatic driving vehicle
CN106347359A (en) * 2016-09-14 2017-01-25 北京百度网讯科技有限公司 Method and device for operating autonomous vehicle
CN107168303A (en) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 A kind of automatic Pilot method and device of automobile
CN107229973A (en) * 2017-05-12 2017-10-03 中国科学院深圳先进技术研究院 The generation method and device of a kind of tactful network model for Vehicular automatic driving

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Autonomous driving policy learning method based on deep reinforcement learning; Xia Wei et al.; Integration Technology (集成技术); 2017-05-31; Vol. 6, No. 3; pp. 29-40 *

Also Published As

Publication number Publication date
CN108009587A (en) 2018-05-08

Similar Documents

Publication Publication Date Title
CN108009587B (en) Method and equipment for determining driving strategy based on reinforcement learning and rules
Rajabli et al. Software verification and validation of safe autonomous cars: A systematic literature review
US20200327238A1 (en) Techniques to detect perturbation attacks with an actor-critic framework
US10831188B2 (en) Redundant pose generation system
JP2022507995A (en) Obstacle avoidance methods and equipment for unmanned vehicles
US20230350399A1 (en) Operational testing of autonomous vehicles
EP3647140A1 (en) Vehicle control method, device, and apparatus
US11628865B2 (en) Method and system for behavioral cloning of autonomous driving policies for safe autonomous agents
Driggs-Campbell et al. Identifying modes of intent from driver behaviors in dynamic environments
Klück et al. Performance comparison of two search-based testing strategies for ADAS system validation
US11577756B2 (en) Detecting out-of-model scenarios for an autonomous vehicle
Johnson et al. Experimental Evaluation and Formal Analysis of High‐Level Tasks with Dynamic Obstacle Anticipation on a Full‐Sized Autonomous Vehicle
Rojas et al. Performance evaluation of an autonomous vehicle using resilience engineering
CN112671487A (en) Vehicle testing method, server and testing vehicle
CN113264064A (en) Automatic driving method for intersection scene and related equipment
US11226350B2 (en) Method and device for detecting obstacle speed, computer device, and storage medium
CN108944921B (en) Method and device for longitudinal control of vehicle
WO2023066080A1 (en) Forward target determination method and apparatus, electronic device and storage medium
Zhao et al. Suraksha: A framework to analyze the safety implications of perception design choices in avs
Eriksson et al. A toolbox for automated driving on the STISIM driving simulator
CN116311144A (en) Method and device for predicting vehicle steering and computer readable storage medium
Guo et al. Adaptive Lane-Departure Prediction Method with Support Vector Machine and Gated Recurrent Unit Models
CN114987478A (en) Vehicle lane changing method, device, storage medium and processor
CN114202224A (en) Method, apparatus, medium, and program product for detecting weld quality in a production environment
Luo et al. Dynamic simplex: Balancing safety and performance in autonomous cyber physical systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant