CN110390398A - Online learning method - Google Patents
Online learning method
- Publication number
- CN110390398A (application CN201810330517.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- original
- critic
- action
- actor
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- User Interface Of Digital Computer (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The present invention provides an online learning method, comprising: calculating a first evaluation index of a first action; calculating a second evaluation index of a second action; when the first evaluation index is greater than the second evaluation index, storing the scene state information and the first action as first cache data; when the first evaluation index is less than the second evaluation index, storing the scene state information, the first action and the second action as second cache data; the first cache data and the second cache data constituting cache data; when the amount of cache data is greater than a preset threshold, acquiring sampled data from the cache data; when the sampled data comes from the first cache data, training the first system using a reinforcement learning algorithm; when the sampled data comes from the second cache data, training the first system using a supervised reinforcement learning algorithm. The method improves the decision-making capability and robustness of the decision system.
Description
Technical Field
The invention relates to the field of artificial intelligence, in particular to an online learning method based on rule supervision.
Background
With the rise of artificial intelligence, machine learning has been applied in many fields, and its application to automatic driving underpins the reliability and safety of autonomous vehicles. One of the cores of automatic driving technology is a complete decision-making system. This decision system must guarantee the safety of the unmanned vehicle while also meeting the driving habits and comfort requirements of human drivers.
Common machine learning methods usually acquire a large amount of training data, train a deep neural network offline, and no longer update the network during actual use. Such methods rely entirely on the generalization ability of the trained network, which poses significant safety hazards when complex application environments must be handled.
Current applications of machine learning to automatic driving mainly rely on Deep Reinforcement Learning (DRL). Under ideal conditions, a fully trained deep neural network can cope with different road conditions and make relatively reasonable driving decisions. As in the traditional machine learning process, a decision system based on deep reinforcement learning needs a large amount of training data to train the neural network. However, limited simulation and actual road training cannot cover all unknown real road conditions, so, considering the limited generalization of the neural network, the driving system is likely to make unsafe decisions when the vehicle encounters unknown scenes in actual use.
Existing deep neural networks can only be optimized under the constraint of the reward function during training, yet the reward function cannot fully capture all of a human driver's expectations for vehicle operation. Therefore, unreasonable actions need to be supervised during the actual running of the vehicle, but at present there is no training method that combines supervision with reinforcement learning.
Disclosure of Invention
The embodiments of the invention aim to provide an online learning method to solve the problem in the prior art that not all conditions of vehicle operation can be completely covered.
In order to solve the above problems, the present invention provides an online learning method, including:
the first system generates a first action according to the acquired scene state information and calculates a first evaluation index of the first action;
the second system generates a second action according to the acquired scene state information and calculates a second evaluation index of the second action;
comparing the first evaluation index with the second evaluation index, and storing the scene state information and the first action as first cache data when the first evaluation index is larger than the second evaluation index; when the first evaluation index is smaller than the second evaluation index, storing the scene state information, the first action and the second action as second cache data; the first cache data and the second cache data form cache data;
when the data quantity of the cache data is larger than a preset threshold value, acquiring sampling data from the cache data;
judging the source of the sampling data, and training the first system by using a reinforcement learning algorithm when the sampling data comes from first cache data; when the sampled data is derived from second cached data, training the first system using a supervised reinforcement learning algorithm.
Preferably, the first evaluation index of the first action is calculated using the formula Q(s, g) = E[ Σ_{t≥0} γ^t · r_t | s, g ], where s is the scene state information, g is the first action, r_t is the reward value obtained for executing the current action in the t-th iteration, and γ is the discount rate.
Preferably, when the sample data is derived from first buffered data, training the first system using a reinforcement learning algorithm includes:
constructing an original actor-critic network when the sampled data is derived from the first cached data; the original actor-critic network comprises an original actor network and an original critic network, wherein the input of the original actor network is scene state information s, the output of the original actor network is a first action a, the input of the original critic network is the scene state information and the first action (s, a), and the output of the original critic network is a first evaluation index;
determining a loss function gradient of an original actor network;
determining a loss function and a gradient of an original critic network;
and updating the network parameters of the original actor network and the network parameters of the original critic network according to the gradient of the loss function of the original actor network, the loss function of the original critic network and the gradient of the original critic network, and generating a target actor-critic network.
Preferably, determining the gradient of the loss function of the original actor network includes:
determining the loss function gradient of the original actor network using the formula ∇_{θ^μ} J ≈ (1/N) Σ_i [ ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i} ]; where the output of the original actor network is μ(s), the network parameter of the original actor network is θ^μ, and N is the size of the sampled data batch.
Preferably, determining the loss function and gradient of the original critic network comprises:
calculating the loss function of the original critic network using the formula L(θ^Q) = (1/N) Σ_i ( y_i - Q(s_i, a_i|θ^Q) )^2; where the output of the original critic network is Q(s, a) and the network parameter of the original critic network is θ^Q;
training the original critic network using the Bellman equation y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'});
calculating the gradient of the original critic network using the formula ∇_{θ^Q} L = (1/N) Σ_i [ δ_i · ∇_{θ^Q} Q(s_i, a_i|θ^Q) ]; where i denotes the training round and δ_i is defined as the temporal-difference error, of the form:
δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'}) - Q(s_i, a_i|θ^Q).
Preferably, updating the network parameters of the original actor network and the network parameters of the original critic network according to the gradient of the loss function of the original actor network, the loss function of the original critic network and the gradient of the original critic network, and generating the target actor-critic network, includes:
updating the network parameters and generating the target critic network using the formula θ^{Q'} ← τ·θ^Q + (1-τ)·θ^{Q'}, where θ^{Q'} is the network parameter of the target critic network;
updating the network parameters and generating the target actor network using the formula θ^{μ'} ← τ·θ^μ + (1-τ)·θ^{μ'}, where θ^{μ'} is the network parameter of the target actor network.
Preferably, training the first system using a supervised reinforcement learning algorithm when the sampled data is derived from the second cache data comprises:
using the formula |μ(s) - μ_E(s)| < ε to judge the difference between the second action a and the standard supervision action a_E corresponding to the current scene state information s, where μ denotes the current actor network output strategy, μ_E denotes the current actor network rule supervision strategy, and ε is a preset threshold value;
using formulas
Calculating a loss function of the current critic network; wherein, thetaμFor the network parameter of the current actor network, thetaQNetwork parameters of the current critic network; dRuleCaching the collected second cache data; (s)E,aE) A set of state-action pairs in the second cache data; n is the number of a batch of data in processing operation; h (mu)E(sE),μ(sE) Is a function of motion error, defined as
Wherein eta is a normal value, and the function of the action error can ensure that the loss generated by the irregular supervision action is at least one boundary value eta greater than that of the regular supervision action;
updating the critic network using the composite loss function J_com = J_Q + λ·J_sup, where λ is a manually set quantity used to adjust the weight ratio between the two loss functions;
defining the supervision error using the formula δ_S = H(a_E, μ(s_E|θ^μ)) + Q(s_E, μ(s_E|θ^μ)|θ^Q) - Q(s_E, a_E|θ^Q);
calculating the updated network parameters of the critic network using the formula θ^Q_{i+1} = θ^Q_i + α_{θ^Q} · (δ_i + λ·δ_S) · ∇_{θ^Q} Q(s, a|θ^Q);
calculating the updated network parameters of the actor network using the formula θ^μ_{i+1} = θ^μ_i + α_{θ^μ} · ∇_a Q(s, a|θ^Q) · ∇_{θ^μ} μ(s|θ^μ); where θ^Q_i is the network parameter of the critic network at the i-th update, θ^Q_{i+1} is the network parameter of the critic network at the (i+1)-th update, α_{θ^Q} is the learning rate of the critic network, θ^μ_i is the network parameter of the actor network at the i-th update, θ^μ_{i+1} is the network parameter of the actor network at the (i+1)-th update, and α_{θ^μ} is the learning rate of the actor network.
Therefore, the decision-making capability and robustness of the system are improved by applying the online learning method provided by the embodiment of the invention.
Drawings
Fig. 1 is a schematic flow chart of an online learning method according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
When the online learning method provided by the embodiment of the present invention is applied, a first system (hereinafter, the system may also be referred to as a network) needs to be trained, and how to train the first system is described below.
First, first original scene state information is acquired, and a first original action set is generated from it; the first original action set comprises at least one original action. Then, a first original evaluation index corresponding to each original action in the first original action set is calculated from the first original scene state information and the first original action set, giving a first original evaluation index set. The largest element of the first original evaluation index set is taken as the target first original evaluation index, and the original action corresponding to it is the target first original action. Next, second original scene state information is acquired by executing the target first original action. A second original evaluation index corresponding to each original action in the first original action set is then calculated from the second original scene state information and the first original action set, giving a second original evaluation index set; its largest element is taken as the target second original evaluation index, and the original action corresponding to it is the target second original action. Finally, third original scene state information is obtained by executing the target second original action. This process is iterated and optimized until the obtained evaluation index is maximized, at which point the first system has been trained.
Here, "first" and "second" serve only to distinguish the items and carry no other meaning.
How to establish the first system will be described in detail below with reference to specific examples.
During the training of the first system, for each piece of scene state information s, suppose there are four selectable actions a1, a2, a3 and a4 (for example, up, down, left and right). The deep Q-learning algorithm calculates an evaluation index Q for each of the four actions, namely Q(s, a1), Q(s, a2), Q(s, a3) and Q(s, a4). The action with the best evaluation index (i.e., the action with the largest Q value) is then selected as the final output action. This action is used to interact with the environment to obtain new scene state information s'; as before, the new evaluation indexes Q(s', a1), ..., Q(s', a4) corresponding to the four actions under s' are obtained, the action corresponding to the best evaluation index is again selected to interact with the environment, and this procedure is repeated and iteratively optimized until a reasonable network, i.e., the first system, is finally obtained.
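By way of illustration only, the selection loop described above can be sketched as follows; here q_value and env_step are hypothetical stand-ins for the trained evaluation network and the environment interaction, not components defined by the invention:

```python
# A minimal sketch of greedy action selection over a discrete action set.
ACTIONS = ["up", "down", "left", "right"]

def select_action(q_value, s):
    """Return the action with the largest evaluation index Q(s, a)."""
    return max(ACTIONS, key=lambda a: q_value(s, a))

def rollout(q_value, env_step, s, num_steps=100):
    """Repeatedly pick the best-rated action and interact with the environment."""
    for _ in range(num_steps):
        a = select_action(q_value, s)   # action with the best evaluation index
        s = env_step(s, a)              # obtain new scene state information s'
    return s
```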
Fig. 1 is a schematic flow chart of an online learning method according to an embodiment of the present invention. The application scenario of the method is an unmanned vehicle. As shown in Fig. 1, the method comprises the following steps:
Step 110, the first system generates a first action according to the acquired scene state information and calculates a first evaluation index of the first action.
In an unmanned vehicle, obstacle information (such as other vehicles and pedestrians) is recognized through sensing modules such as cameras and lidar, and the predicted trajectories of dynamic obstacles and the road information (such as lane lines and traffic lights) are obtained through a prediction module. These complex traffic environments are constructed into a simplified traffic simulation environment, and one or more of these elements are taken to form the scene state information s.
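Purely as an illustrative assumption of how such scene state information s might be packaged in software (all field names below are hypothetical and not prescribed by the invention):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneState:
    # Hypothetical container for the scene state information s described above.
    ego_speed: float                                     # ego vehicle speed (m/s)
    obstacles: List[Tuple[float, float]]                 # relative (x, y) of perceived obstacles
    predicted_tracks: List[List[Tuple[float, float]]]    # predicted obstacle trajectories
    lane_lines: List[float]                              # lateral offsets of detected lane lines (m)
    traffic_light: str                                   # e.g. "red", "green", "none"

    def to_vector(self) -> List[float]:
        """Flatten selected elements into a network input vector for s."""
        flat = [self.ego_speed]
        for x, y in self.obstacles:
            flat += [x, y]
        flat += self.lane_lines
        return flat
```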
The first system can be a hierarchical reinforcement learning decision system comprising an upper-layer decision framework and a lower-layer decision framework. The input of the upper-layer decision framework is the scene state information and its output is the first action, which can be lane changing, following, overtaking, and the like.
The first action is used as an input to the lower-layer decision framework, which may calculate, by way of example and not limitation, the first evaluation index corresponding to the first action using the following formula:
Q(s, g) = E[ Σ_{t≥0} γ^t · r_t | s, g ]   (1)
where s is the scene state information, g is the first action, r_t is the reward value obtained for performing the current action in the t-th iteration, and γ is the discount rate, which may also be referred to as the discount factor. The reward r_t is generally set as a function of s, or of s and g, and the present application is not limited in this respect.
Subsequently, the first action may be denoted as a_DRL.
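For illustration, the evaluation index above (an expected discounted return) could be estimated from a recorded reward sequence roughly as follows; this is only a sketch and assumes the rewards r_t of a rollout starting from s with first action g are already available:

```python
def evaluation_index(rewards, gamma=0.99):
    """Discounted return Q(s, g) = sum_t gamma^t * r_t for one rollout
    started from scene state s with first action g."""
    q = 0.0
    for t, r in enumerate(rewards):
        q += (gamma ** t) * r
    return q

# e.g. evaluation_index([0.1, 0.1, -1.0], gamma=0.9) -> 0.1 + 0.09 - 0.81 = -0.62
```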
Step 120, the second system generates a second action according to the acquired scene state information and calculates a second evaluation index of the second action.
The second system may be a rule-constrained decision system, which is pre-trained and can make certain decisions, for example: "if the scene state information is that the front vehicle is 10 m away and there is no vehicle within 50 m in the left lane, then the second action a_Rule is [accelerator 0.9, steering -0.5, brake 0.0]"; "if the front vehicle is 10 m away and there are vehicles within 50 m in both the left and right lanes, then a_Rule is [accelerator 0.0, steering 0.0, brake 0.5]".
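A minimal sketch of such a rule-constrained second system is given below, built only from the two example rules above; the distance thresholds and the [accelerator, steering, brake] encoding follow the example, while the fallback rule is an added assumption:

```python
def rule_action(front_gap_m, left_lane_clear_50m, right_lane_clear_50m):
    """Return a_Rule = [accelerator, steering, brake] from hand-written rules."""
    if front_gap_m <= 10.0 and left_lane_clear_50m:
        return [0.9, -0.5, 0.0]   # accelerate and move toward the clear left lane
    if front_gap_m <= 10.0 and not left_lane_clear_50m and not right_lane_clear_50m:
        return [0.0, 0.0, 0.5]    # both sides occupied: brake and keep the lane
    return [0.3, 0.0, 0.0]        # default: gentle cruise (assumed fallback)
```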
After the second action is obtained, the evaluation index of the second action can be calculated using the same formula as that used in the calculation of the first evaluation index.
It is understood that when the above formula is applied to calculate the second evaluation index, only the first action needs to be replaced by the second action.
Step 130, comparing the first evaluation index with the second evaluation index, and storing the scene state information and the first action as first cache data when the first evaluation index is greater than the second evaluation index; when the first evaluation index is smaller than the second evaluation index, storing the scene state information, the first action and the second action as second cache data; the first cache data and the second cache data constitute cache data.
Specifically, during the actual operation of the unmanned vehicle, the scene state information s_t at time t is input into the decision framework designed by the invention; the first action a_DRL is obtained through the decision system based on hierarchical reinforcement learning and the second action a_Rule through the decision system based on rule constraints; the first evaluation index of the first action a_DRL and the second evaluation index of the second action a_Rule are then obtained and compared.
The data buffer area is used for storing data to be trained, and is generally composed of "state-action" data.
At time t, if the first evaluation index is larger than the second evaluation index, a_DRL is preferred and the final output a_Final is a_DRL; at the same time, s_t and a_DRL form a "state-action" pair (s_t, a_DRL), which is stored as first cache data in the data cache area. On the contrary, if the first evaluation index is smaller than the second evaluation index, a_Rule is preferred and the final output a_Final is a_Rule; (s_t, a_DRL, a_Rule) is then stored as second cache data in another data cache area.
In one example, the first cache data is stored in a first cache region and the second cache data is stored in a second cache region. The first cache region and the second cache region may be distinguished based on a pointer and an address.
In another example, the first cache data and the second cache data may be placed in the same region, distinguished by a header of the data.
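The second storage variant (a shared region where the data source is distinguished by a header) might be sketched as follows; the tag strings and capacity are illustrative assumptions:

```python
from collections import deque

class DecisionCache:
    """Cache for 'state-action' training data, tagged by its source."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store_first(self, s, a_drl):
        # first cache data: (s, a_DRL), produced when the DRL action won
        self.buffer.append(("first", (s, a_drl)))

    def store_second(self, s, a_drl, a_rule):
        # second cache data: (s, a_DRL, a_Rule), produced when the rule action won
        self.buffer.append(("second", (s, a_drl, a_rule)))

    def __len__(self):
        return len(self.buffer)
```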
Step 140, when the data amount of the cache data is larger than a preset threshold value, acquiring sampled data from the cache data.
The preset threshold may be set according to actual needs and is generally an integer power of 2, consistent with the mini-batch size used in batch processing. Typical values are 32 or 64; the application is not limited to specific values.
Step 150, judging the source of the sampling data, and when the sampling data comes from first cache data, training the first system by using a reinforcement learning algorithm; when the sampled data is derived from second cached data, training the first system using a supervised reinforcement learning algorithm.
In this way, the decision is obtained by comparing the first evaluation index with the second evaluation index, which avoids the poor human-likeness, poor flexibility and difficult maintenance that result from adding new logic to conventional decision methods; the present method remains human-like, flexible and easy to maintain when new logic is added. During real-time operation of the vehicle, the system records the data (states) of the real-time interaction between the vehicle and the environment together with the control actions output by the decision framework, stores these state-action pairs in the data cache, samples training data in an online mini-batch manner, performs optimization training of the network, and updates the weights of the learning network, so that the decision network becomes more intelligent and more human-like with use.
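Putting the steps of Fig. 1 together, one plausible form of the online loop is sketched below; first_system, second_system, train_rl and train_supervised_rl are placeholders for the components described in this document, the cache object follows the DecisionCache sketch above, and the threshold and batch size follow the typical values mentioned earlier:

```python
import random

def online_learning_step(s, first_system, second_system, cache,
                         train_rl, train_supervised_rl,
                         threshold=64, batch_size=64):
    # Steps 110-120: both systems propose an action and its evaluation index.
    a_drl, q_drl = first_system(s)
    a_rule, q_rule = second_system(s)

    # Step 130: compare the indexes and store the corresponding cache data.
    if q_drl > q_rule:
        a_final = a_drl
        cache.store_first(s, a_drl)
    else:
        a_final = a_rule
        cache.store_second(s, a_drl, a_rule)

    # Steps 140-150: once enough data is cached, sample a mini-batch and
    # train with the algorithm matching the data source.
    if len(cache) > threshold:
        batch = random.sample(list(cache.buffer), batch_size)
        first_batch = [d for tag, d in batch if tag == "first"]
        second_batch = [d for tag, d in batch if tag == "second"]
        if first_batch:
            train_rl(first_batch)                 # reinforcement learning
        if second_batch:
            train_supervised_rl(second_batch)     # supervised reinforcement learning
    return a_final
```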
Wherein training the first system using a reinforcement learning algorithm when the sample data is derived from first cached data comprises:
constructing an original actor-critic network when the sampled data is derived from the first cached data; the original actor-critic network comprises an original actor network and an original critic network, wherein the input of the original actor network is scene state information s, the output of the original actor network is a first action a, the input of the original critic network is the scene state information and the first action (s, a), and the output of the original critic network is a first evaluation index;
determining a loss function gradient of an original actor network;
determining a loss function and a gradient of an original critic network;
and updating the network parameters of the original actor network and the network parameters of the original critic network according to the gradient of the loss function of the original actor network, the loss function of the original critic network and the gradient of the original critic network, and generating a target actor-critic network.
Next, the training of the first system using the reinforcement learning algorithm when the sample data is derived from the first cache data will be specifically described.
When the first system undergoes online learning, it may be divided into an original network and a target network. The number of online learning rounds is not limited, and "original" and "target" are relative: in the first training round the target network is trained from the original network; in the second round, that target network serves as the original network and a new target network is trained from it, and so on, until the number of online learning rounds meets the requirement.
In the following, online learning is specifically described by taking an actor-critic network as an example.
The original network and the target network each comprise an actor network and a critic network. The structure of the original actor network is identical to that of the target actor network, and the structure of the original critic network is identical to that of the target critic network; the weight update of the target network lags behind that of the original network by a ratio τ to ensure convergence. For distinction, the actor network of the original network is referred to as the original actor network and its critic network as the original critic network; the actor network of the target network is called the target actor network and its critic network the target critic network.
The input of the original actor network is the state s and its output is the action a; the original critic network takes the state and action (s, a) as input and outputs the discounted cumulative reward value Q. s' and a' respectively denote the state and action at the next moment obtained through interaction with the environment during network training, i.e., the input and output of the target actor network. Usually, all four networks work simultaneously during training and their weights are updated alternately, but only the actor network itself matters in actual use.
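As a concrete but purely illustrative sketch (not the patented implementation), the four networks could be built in PyTorch as follows; the layer sizes, activation choices and action range are assumptions, and the target copies are produced with deepcopy so that their structures are identical to the originals:

```python
import copy
import torch
import torch.nn as nn

class Actor(nn.Module):
    """mu(s | theta_mu): maps scene state s to action a."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh())  # actions scaled to [-1, 1]

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a | theta_Q): maps (state, action) to the discounted cumulative reward."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

# original networks and their structurally identical target copies
actor, critic = Actor(32, 3), Critic(32, 3)
target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
```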
Suppose the output of the original critic network is Q(s, a) and its network parameter is θ^Q; the output of the original actor network is μ(s) and its network parameter is θ^μ; the output of the target critic network is Q'(s, a) and its network parameter is θ^{Q'}; and the output of the target actor network is μ'(s) and its network parameter is θ^{μ'}. The policy gradient of the actor network under the behavior policy distribution ρ^β is then defined as:
∇_{θ^μ} J ≈ E_{s∼ρ^β}[ ∇_{θ^μ} Q(s, a|θ^Q)|_{s=s_t, a=μ(s_t|θ^μ)} ]   (2)
Applying the chain rule to this gradient gives:
∇_{θ^μ} J ≈ E_{s∼ρ^β}[ ∇_a Q(s, a|θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_t} ]   (3)
Therefore, the loss function gradient of the original actor network can be obtained in sampled form:
∇_{θ^μ} J ≈ (1/N) Σ_i [ ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i} ]   (4)
where N is the size of the sampled data batch. Similarly, the loss function of the original critic network is defined as:
L(θ^Q) = (1/N) Σ_i ( y_i - Q(s_i, a_i|θ^Q) )^2   (5)
The critic network can therefore be trained through the Bellman equation:
y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'})   (6)
where r_i is the reward value of the i-th group of data.
The gradient of the original critic network can then be obtained:
∇_{θ^Q} L = (1/N) Σ_i [ δ_i · ∇_{θ^Q} Q(s_i, a_i|θ^Q) ]   (7)
where i denotes the training round and δ_i is defined as the temporal-difference error (TD-error), of the form:
δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'}) - Q(s_i, a_i|θ^Q)   (8)
Thus, the network parameters of the original critic network and the original actor network are updated along the gradient direction, and the target networks are obtained through the soft updates:
θ^{Q'} ← τ·θ^Q + (1-τ)·θ^{Q'}   (9)
θ^{μ'} ← τ·θ^μ + (1-τ)·θ^{μ'}   (10)
where θ^{Q'} is the network parameter of the target critic network and θ^{μ'} is the network parameter of the target actor network. The degree of updating generally has no explicit stopping criterion, but may be limited by a number of training rounds, for example, stopping after 2000 rounds of training.
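A condensed sketch of one such update round, reusing the networks from the previous sketch, is shown below; the optimizers, γ and τ values are assumptions, and the code follows the standard DDPG recipe rather than claiming to reproduce the exact patented procedure:

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    """One reinforcement-learning update on a mini-batch of first cache data."""
    s, a, r, s_next = batch  # tensors: states, actions, rewards, next states

    # Bellman target computed with the (delayed) target networks.
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Critic loss: mean squared error between Q(s, a) and the Bellman target;
    # its gradient corresponds to the sampled critic gradient driven by the TD error.
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor loss whose gradient is the sampled deterministic policy gradient.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks with ratio tau.
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
        for p, tp in zip(actor.parameters(), target_actor.parameters()):
            tp.mul_(1.0 - tau).add_(tau * p)
```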
Next, the training of the first system using a supervised reinforcement learning algorithm when the sample data is derived from the second buffered data will be described.
Hereinafter, in order to distinguish them from the original network and the target network above, the actor network and the critic network are described in terms of the current and the next update step.
First, the following formula is used to judge the difference between the action output a of the network in the current state s and the rule supervision action a_E:
|μ(s) - μ_E(s)| < ε   (11)
where μ and μ_E respectively represent the current actor network output strategy and the rule supervision strategy, and ε is a given threshold, meaning that if the action error is within the threshold the two actions are considered sufficiently similar. This allows the agent to learn a policy better than the rule supervision policy, while still being subject to the safety supervision of the rules, even if the rule supervision action is not optimal. After the action error is judged by the above formula, it is reflected in the temporal-difference error used to update the critic network:
J_sup = (1/N) Σ_{(s_E, a_E) ∈ D_Rule} [ Q(s_E, μ(s_E|θ^μ)|θ^Q) + H(μ_E(s_E), μ(s_E)) - Q(s_E, a_E|θ^Q) ]   (12)
where θ^μ and θ^Q respectively represent the network parameters of the current actor network and the current critic network; D_Rule represents the cache of collected rule supervision data; (s_E, a_E) represents a state-action pair collected from the cache; N represents the number of samples in a batch operation; and H(μ_E(s_E), μ(s_E)) is the action error function, defined as follows:
H(μ_E(s), μ(s)) = 0 if |μ(s) - μ_E(s)| < ε;  H(μ_E(s), μ(s)) = η otherwise   (13)
where η is a positive constant; this function ensures that the loss produced by an action that violates the rule supervision is at least a margin η greater than that of an action that satisfies it. The critic network loss function of the original Deep Deterministic Policy Gradient (DDPG) is also considered:
J_Q = (1/N) Σ_i ( y_i - Q(s_i, a_i|θ^Q) )^2   (14)
The critic network is updated using a composite loss function:
J_com = J_Q + λ·J_sup   (15)
where λ is a manually set quantity used to adjust the weight ratio between the two losses.
The supervision error is defined as:
δ_S = H(a_E, μ(s_E|θ^μ)) + Q(s_E, μ(s_E|θ^μ)|θ^Q) - Q(s_E, a_E|θ^Q)   (16)
For the final result, the composite error at the i-th update is defined as follows:
δ_i^com = δ_i + λ·δ_S   (17)
where λ is a manually selected proportional weight and (s_E, a_E) are the state-action pairs in the rule supervision data. Thus, from the sampled data the invention obtains the gradients of the critic network and the actor network, and the per-step parameter updates of the critic network and the actor network are respectively:
θ^Q_{i+1} = θ^Q_i + α_{θ^Q} · δ_i^com · ∇_{θ^Q} Q(s, a|θ^Q)   (18)
θ^μ_{i+1} = θ^μ_i + α_{θ^μ} · ∇_a Q(s, a|θ^Q) · ∇_{θ^μ} μ(s|θ^μ)   (19)
where θ^Q_i and θ^μ_i respectively represent the parameters of the critic network and the actor network at the i-th update, θ^Q_{i+1} and θ^μ_{i+1} their parameters at the (i+1)-th update, and α_{θ^Q} and α_{θ^μ} the learning rates of the critic network and the actor network. The degree of updating generally has no explicit stopping criterion, but may be limited by a number of training rounds, for example, stopping after 2000 rounds of training.
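The supervised update described above might look roughly as follows in code, reusing the networks from the earlier sketch; the margin η, threshold ε, weight λ and the handling of the ordinary DDPG critic term are illustrative assumptions, so this is a sketch of the idea rather than the exact patented update:

```python
import torch

def supervised_rl_update(batch, actor, critic, actor_opt, critic_opt,
                         eps=0.05, eta=0.2, lam=0.5):
    """One supervised reinforcement-learning update on second cache data."""
    s_e, a_e = batch  # states and rule-supervision actions (s_E, a_E)

    with torch.no_grad():
        a_pred = actor(s_e)  # current actor output mu(s_E)
        # Action error function H: zero when mu(s_E) is within eps of the rule
        # action, otherwise the margin eta.
        similar = (torch.abs(a_pred - a_e).max(dim=-1).values < eps).float()
        h = eta * (1.0 - similar)

    # Supervision loss: the rule action a_E should be rated at least a margin
    # better than the actor's own action unless the two are sufficiently similar.
    j_sup = (critic(s_e, a_pred).squeeze(-1) + h
             - critic(s_e, a_e).squeeze(-1)).mean()

    # Composite loss J_com = J_Q + lambda * J_sup; the ordinary DDPG critic term
    # J_Q is omitted here because this sketch stores no next-state targets.
    critic_loss = lam * j_sup
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow the (now supervised) critic's gradient so that
    # mu(s_E) moves toward actions the critic rates highly.
    actor_loss = -critic(s_e, actor(s_e)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```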
This network updating scheme ensures the online learning ability of the decision framework: when the current network output is poor, it can be supervised by the rule-constrained decision, so that, within the limits of the safety constraints, the decision-making ability of the whole system keeps improving with use and the decision system becomes more robust.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, a software module executed by a processor, or a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above embodiments are provided to further explain the objects, technical solutions and advantages of the present invention in detail, it should be understood that the above embodiments are merely exemplary embodiments of the present invention and are not intended to limit the scope of the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (7)
1. An online learning method, the method comprising:
the first system generates a first action according to the acquired scene state information and calculates a first evaluation index of the first action;
the second system generates a second action according to the acquired scene state information and calculates a second evaluation index of the second action;
comparing the first evaluation index with the second evaluation index, and storing the scene state information and the first action as first cache data when the first evaluation index is larger than the second evaluation index; when the first evaluation index is smaller than the second evaluation index, storing the scene state information, the first action and the second action as second cache data; the first cache data and the second cache data form cache data;
when the data quantity of the cache data is larger than a preset threshold value, acquiring sampling data from the cache data;
judging the source of the sampling data, and training the first system by using a reinforcement learning algorithm when the sampling data comes from first cache data; when the sampled data is derived from second cached data, training the first system using a supervised reinforcement learning algorithm.
2. The online learning method of claim 1,
calculating the first evaluation index of the first action using the formula Q(s, g) = E[ Σ_{t≥0} γ^t · r_t | s, g ]; where s is the scene state information, g is the first action, r_t is the reward value obtained for executing the current action in the t-th iteration, and γ is the discount rate.
3. The method of online learning of claim 1, wherein training the first system with a reinforcement learning algorithm when the sampled data is derived from first buffered data comprises:
constructing an original actor-critic network when the sampled data is derived from the first cached data; the original actor-critic network comprises an original actor network and an original critic network, wherein the input of the original actor network is scene state information s, the output of the original actor network is a first action a, the input of the original critic network is the scene state information and the first action (s, a), and the output of the original critic network is a first evaluation index;
determining a loss function gradient of an original actor network;
determining a loss function and a gradient of an original critic network;
and updating the network parameters of the original actor network and the network parameters of the original critic network according to the gradient of the loss function of the original actor network, the loss function of the original critic network and the gradient of the original critic network, and generating a target actor-critic network.
4. The online learning method of claim 3, wherein determining the gradient of the loss function of the original actor network comprises:
determining the loss function gradient of the original actor network using the formula ∇_{θ^μ} J ≈ (1/N) Σ_i [ ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i} ]; where the output of the original actor network is μ(s), the network parameter of the original actor network is θ^μ, and N is the size of the sampled data batch.
5. The online learning method of claim 3, wherein determining the loss function and gradient of the original critic network comprises:
calculating the loss function of the original critic network using the formula L(θ^Q) = (1/N) Σ_i ( y_i - Q(s_i, a_i|θ^Q) )^2; where the output of the original critic network is Q(s, a) and the network parameter of the original critic network is θ^Q;
training the original critic network using the Bellman equation y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'});
calculating the gradient of the original critic network using the formula ∇_{θ^Q} L = (1/N) Σ_i [ δ_i · ∇_{θ^Q} Q(s_i, a_i|θ^Q) ]; where i denotes the training round and δ_i is defined as the temporal-difference error (TD-error), of the form:
δ_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1}|θ^{μ'})|θ^{Q'}) - Q(s_i, a_i|θ^Q).
6. The online learning method of any one of claims 3-5, wherein updating the network parameters of the original actor network and the network parameters of the original critic network according to the gradient of the loss function of the original actor network, the loss function of the original critic network and the gradient of the original critic network, and generating the target actor-critic network, comprises:
updating the network parameters and generating the target critic network using the formula θ^{Q'} ← τ·θ^Q + (1-τ)·θ^{Q'}, where θ^{Q'} is the network parameter of the target critic network;
updating the network parameters and generating the target actor network using the formula θ^{μ'} ← τ·θ^μ + (1-τ)·θ^{μ'}, where θ^{μ'} is the network parameter of the target actor network.
7. The online learning method of claim 1, wherein training the first system using a supervised reinforcement learning algorithm when the sampled data is derived from the second cached data comprises:
using the formula |μ(s) - μ_E(s)| < ε to judge the difference between the second action a and the standard supervision action a_E corresponding to the current scene state information s, where μ denotes the current actor network output strategy, μ_E denotes the current actor network rule supervision strategy, and ε is a preset threshold value;
calculating the loss function of the current critic network using the formula
J_sup = (1/N) Σ_{(s_E, a_E) ∈ D_Rule} [ Q(s_E, μ(s_E|θ^μ)|θ^Q) + H(μ_E(s_E), μ(s_E)) - Q(s_E, a_E|θ^Q) ]
where θ^μ is the network parameter of the current actor network, θ^Q is the network parameter of the current critic network, D_Rule is the cache of the collected second cache data, (s_E, a_E) is a set of state-action pairs in the second cache data, N is the number of samples in a batch processing operation, and H(μ_E(s_E), μ(s_E)) is the action error function, defined as
H(μ_E(s), μ(s)) = 0 if |μ(s) - μ_E(s)| < ε, and H(μ_E(s), μ(s)) = η otherwise,
where η is a positive constant; the action error function ensures that the loss generated by an action that violates the rule supervision is at least a margin η greater than that of an action that satisfies the rule supervision;
updating the critic network using the composite loss function J_com = J_Q + λ·J_sup, where λ is a manually set quantity used to adjust the weight ratio between the two loss functions;
defining the supervision error using the formula δ_S = H(a_E, μ(s_E|θ^μ)) + Q(s_E, μ(s_E|θ^μ)|θ^Q) - Q(s_E, a_E|θ^Q);
calculating the updated network parameters of the critic network using the formula θ^Q_{i+1} = θ^Q_i + α_{θ^Q} · (δ_i + λ·δ_S) · ∇_{θ^Q} Q(s, a|θ^Q);
calculating the updated network parameters of the actor network using the formula θ^μ_{i+1} = θ^μ_i + α_{θ^μ} · ∇_a Q(s, a|θ^Q) · ∇_{θ^μ} μ(s|θ^μ); where θ^Q_i is the network parameter of the critic network at the i-th update, θ^Q_{i+1} is the network parameter of the critic network at the (i+1)-th update, α_{θ^Q} is the learning rate of the critic network, θ^μ_i is the network parameter of the actor network at the i-th update, θ^μ_{i+1} is the network parameter of the actor network at the (i+1)-th update, and α_{θ^μ} is the learning rate of the actor network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810330517.5A CN110390398B (en) | 2018-04-13 | 2018-04-13 | Online learning method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810330517.5A CN110390398B (en) | 2018-04-13 | 2018-04-13 | Online learning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110390398A true CN110390398A (en) | 2019-10-29 |
CN110390398B CN110390398B (en) | 2021-09-10 |
Family
ID=68283714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810330517.5A Active CN110390398B (en) | 2018-04-13 | 2018-04-13 | Online learning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110390398B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112264995A (en) * | 2020-10-16 | 2021-01-26 | 清华大学 | Robot double-shaft hole assembling method based on hierarchical reinforcement learning |
CN112580801A (en) * | 2020-12-09 | 2021-03-30 | 广州优策科技有限公司 | Reinforced learning training method and decision-making method based on reinforced learning |
CN113239634A (en) * | 2021-06-11 | 2021-08-10 | 上海交通大学 | Simulator modeling method based on robust simulation learning |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105260628A (en) * | 2014-06-03 | 2016-01-20 | 腾讯科技(深圳)有限公司 | Classifier training method and device and identity verification method and system |
CN105912814A (en) * | 2016-05-05 | 2016-08-31 | 苏州京坤达汽车电子科技有限公司 | Lane change decision model of intelligent drive vehicle |
CN106154834A (en) * | 2016-07-20 | 2016-11-23 | 百度在线网络技术(北京)有限公司 | For the method and apparatus controlling automatic driving vehicle |
CN106842925A (en) * | 2017-01-20 | 2017-06-13 | 清华大学 | A kind of locomotive smart steering method and system based on deeply study |
WO2017120336A2 (en) * | 2016-01-05 | 2017-07-13 | Mobileye Vision Technologies Ltd. | Trained navigational system with imposed constraints |
CN107342078A (en) * | 2017-06-23 | 2017-11-10 | 上海交通大学 | The cold starting system and method for dialog strategy optimization |
CN107577231A (en) * | 2017-08-28 | 2018-01-12 | 驭势科技(北京)有限公司 | Formulating method, device and the automatic driving vehicle of the control decision of vehicle |
CN107862346A (en) * | 2017-12-01 | 2018-03-30 | 驭势科技(北京)有限公司 | A kind of method and apparatus for carrying out driving strategy model training |
CN107895501A (en) * | 2017-09-29 | 2018-04-10 | 大圣科技股份有限公司 | Unmanned car steering decision-making technique based on the training of magnanimity driving video data |
-
2018
- 2018-04-13 CN CN201810330517.5A patent/CN110390398B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105260628A (en) * | 2014-06-03 | 2016-01-20 | 腾讯科技(深圳)有限公司 | Classifier training method and device and identity verification method and system |
WO2017120336A2 (en) * | 2016-01-05 | 2017-07-13 | Mobileye Vision Technologies Ltd. | Trained navigational system with imposed constraints |
CN105912814A (en) * | 2016-05-05 | 2016-08-31 | 苏州京坤达汽车电子科技有限公司 | Lane change decision model of intelligent drive vehicle |
CN106154834A (en) * | 2016-07-20 | 2016-11-23 | 百度在线网络技术(北京)有限公司 | For the method and apparatus controlling automatic driving vehicle |
CN106842925A (en) * | 2017-01-20 | 2017-06-13 | 清华大学 | A kind of locomotive smart steering method and system based on deeply study |
CN107342078A (en) * | 2017-06-23 | 2017-11-10 | 上海交通大学 | The cold starting system and method for dialog strategy optimization |
CN107577231A (en) * | 2017-08-28 | 2018-01-12 | 驭势科技(北京)有限公司 | Formulating method, device and the automatic driving vehicle of the control decision of vehicle |
CN107895501A (en) * | 2017-09-29 | 2018-04-10 | 大圣科技股份有限公司 | Unmanned car steering decision-making technique based on the training of magnanimity driving video data |
CN107862346A (en) * | 2017-12-01 | 2018-03-30 | 驭势科技(北京)有限公司 | A kind of method and apparatus for carrying out driving strategy model training |
Non-Patent Citations (2)
Title |
---|
XIN LI et al.: "Reinforcement learning based overtaking decision-making for highway autonomous driving", 2015 SIXTH INTERNATIONAL CONFERENCE ON INTELLIGENT CONTROL AND INFORMATION PROCESSING (ICICIP) * |
田赓 (TIAN Geng): "Research on bionic lane-changing decision model of unmanned vehicles in complex dynamic urban environment", China Master's Theses Full-text Database, Engineering Science and Technology II * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112264995A (en) * | 2020-10-16 | 2021-01-26 | 清华大学 | Robot double-shaft hole assembling method based on hierarchical reinforcement learning |
CN112264995B (en) * | 2020-10-16 | 2021-11-16 | 清华大学 | Robot double-shaft hole assembling method based on hierarchical reinforcement learning |
CN112580801A (en) * | 2020-12-09 | 2021-03-30 | 广州优策科技有限公司 | Reinforced learning training method and decision-making method based on reinforced learning |
CN113239634A (en) * | 2021-06-11 | 2021-08-10 | 上海交通大学 | Simulator modeling method based on robust simulation learning |
CN113239634B (en) * | 2021-06-11 | 2022-11-04 | 上海交通大学 | Simulator modeling method based on robust simulation learning |
Also Published As
Publication number | Publication date |
---|---|
CN110390398B (en) | 2021-09-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liang et al. | Cirl: Controllable imitative reinforcement learning for vision-based self-driving | |
CN112099496B (en) | Automatic driving training method, device, equipment and medium | |
CN111696370B (en) | Traffic light control method based on heuristic deep Q network | |
CN111142522B (en) | Method for controlling agent of hierarchical reinforcement learning | |
CN114358128B (en) | Method for training end-to-end automatic driving strategy | |
CN113044064B (en) | Vehicle self-adaptive automatic driving decision method and system based on meta reinforcement learning | |
CN110750877A (en) | Method for predicting car following behavior under Apollo platform | |
CN107229973A (en) | The generation method and device of a kind of tactful network model for Vehicular automatic driving | |
CN110390398B (en) | Online learning method | |
CN114162146B (en) | Driving strategy model training method and automatic driving control method | |
Shi et al. | Offline reinforcement learning for autonomous driving with safety and exploration enhancement | |
CN111695737A (en) | Group target advancing trend prediction method based on LSTM neural network | |
CN117032203A (en) | Svo-based intelligent control method for automatic driving | |
WO2021258847A1 (en) | Driving decision-making method, device, and chip | |
Rais et al. | Decision making for autonomous vehicles in highway scenarios using Harmonic SK Deep SARSA | |
CN117523821B (en) | System and method for predicting vehicle multi-mode driving behavior track based on GAT-CS-LSTM | |
Chen et al. | MetaFollower: Adaptable personalized autonomous car following | |
CN110378460B (en) | Decision making method | |
Yang et al. | Decision-making in autonomous driving by reinforcement learning combined with planning & control | |
CN116224996A (en) | Automatic driving optimization control method based on countermeasure reinforcement learning | |
Ma et al. | Evolving testing scenario generation method and intelligence evaluation framework for automated vehicles | |
WO2018205245A1 (en) | Strategy network model generation method and apparatus for automatic vehicle driving | |
Wang et al. | An end-to-end deep reinforcement learning model based on proximal policy optimization algorithm for autonomous driving of off-road vehicle | |
CN114701517A (en) | Multi-target complex traffic scene automatic driving solution based on reinforcement learning | |
Yang et al. | Deep Reinforcement Learning Lane-Changing Decision Algorithm for Intelligent Vehicles Combining LSTM Trajectory Prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CP01 | Change in the name or title of a patent holder |
Address after: B4-006, maker Plaza, 338 East Street, Huilongguan town, Changping District, Beijing 100096 Patentee after: Beijing Idriverplus Technology Co.,Ltd. Address before: B4-006, maker Plaza, 338 East Street, Huilongguan town, Changping District, Beijing 100096 Patentee before: Beijing Idriverplus Technology Co.,Ltd. |
|
CP01 | Change in the name or title of a patent holder |