CN108791308A - The system for building driving strategy based on driving environment - Google Patents
- Publication number
- CN108791308A (application number CN201810662039.8A)
- Authority
- CN
- China
- Prior art keywords
- driving
- state
- strategy
- reward function
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0019—Control system elements or transfer functions
Abstract
The invention discloses a system for building a driving strategy based on the driving environment, comprising: a feature extractor, which extracts the features used to build the reward function; a reward function generator; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which checks whether the optimal driving strategy built by the getter meets the judgment criterion. If it does not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met, finally yielding a driving strategy that describes the true driving demonstrations. The reward function generator comprises a module that obtains the expert's demonstration data, a module that computes the feature expectation of the demonstrations, a module that computes the state-action set under a greedy strategy, and a module that solves for the weights of the reward function. The system can handle previously unseen state scenes and produce their corresponding actions, which greatly improves the generalization ability of the resulting driver behavior model; it applies to a wider range of scenes and is more robust.
Description
Technical field
The present invention relates to a system for building a driving strategy based on the driving environment.
Background art
Traditional driver strategy models built with reinforcement learning analyze, describe, and reason about driving behavior from known driving data. The collected driving data, however, can never cover the inexhaustible variety of driving behavior, so it is impossible to obtain the corresponding action for every state. In real driving scenes, weather, scenery, objects, and driving conditions vary in countless ways, and traversing all states is infeasible. Traditional driver behavior models therefore generalize poorly, rest on many modeling assumptions, and lack robustness.
Moreover, when the reward function is hand-designed by researchers, too many competing feature demands must be balanced. The process depends entirely on the researcher's experience, requires repeated manual tuning, is time-consuming, and, most fatally, is overly subjective. Across different scenes and environments a researcher faces far too many scene states; even for one fixed scene state, different demands lead to different driving behavior. Describing the driving task accurately requires assigning a precise weight to each of these factors. Existing probabilistic inverse reinforcement learning methods start from the available demonstration data, estimate the distribution of that data, and choose the action for each state on that basis. But the distribution of the known data cannot represent the distribution of all data; obtaining the correct distribution would again require the corresponding actions of all states.
Summary of the invention
Given the problem in the prior art that, for driving scenes without demonstration data, no corresponding reward function can be established for modeling driving behavior, the present application provides a system for building a driving strategy based on the driving environment. It can handle new state scenes and produce their corresponding actions; it applies to a wider range of scenes and is more robust.
To achieve the above goal, the technical scheme of the present invention is a system for building a driving strategy based on the driving environment, comprising: a feature extractor, which extracts the features used to build the reward function; a reward function generator; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which checks whether the optimal driving strategy built by the getter meets the judgment criterion. If not, the reward function is rebuilt and the optimal strategy constructed again, iterating until the criterion is met, finally yielding a driving strategy that describes the true driving demonstrations.
The reward function generator comprises a module that obtains the expert's demonstration data, a module that computes the feature expectation of the demonstrations, a module that computes the state-action set under a greedy strategy, and a module that solves for the weights of the reward function.
Further, the module that obtains the expert's demonstration data works as follows. The demonstration data come from sampling a demonstration driving video: one continuous driving video is sampled at a fixed frequency, yielding one trajectory demonstration; one expert demonstration contains several trajectories. The whole data set is denoted
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}
where D_E is the complete demonstration data set, (s_j, a_j) is the pair formed by state j and the decision instruction taken in that state, M is the total number of demonstration pairs, N_T is the number of demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th demonstration trajectory, so that M = L_1 + ... + L_{N_T}.
Further, the module that computes the feature expectation of the demonstrations works as follows. Each state s_t in the demonstration data D_E describing the driving environment is fed into the state feature extractor, which outputs the feature vector f(s_t, a_t), one group of feature values of the driving environment scene for s_t that influence the driving decision. The feature expectation of the demonstrations is then computed as
μ_E = (1/N_T) Σ_{i=1}^{N_T} Σ_{t=0}^{L_i − 1} γ^t f(s_t, a_t)
where γ is the discount factor, set according to the problem at hand.
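The feature-expectation computation above can be sketched in a few lines (a minimal illustration, assuming each trajectory is given as a time-ordered list of feature vectors f(s_t, a_t); the function name is ours, not the patent's):

```python
import numpy as np

def feature_expectation(trajectories, gamma=0.9):
    """Discounted feature expectation mu_E of the expert demonstrations.

    trajectories: list of trajectories, each a time-ordered list of
    feature vectors f(s_t, a_t). Returns the average discounted
    feature sum over the N_T trajectories.
    """
    mu = np.zeros(len(trajectories[0][0]))
    for traj in trajectories:
        for t, f in enumerate(traj):
            mu += (gamma ** t) * np.asarray(f, dtype=float)
    return mu / len(trajectories)
```
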
Further, the module that computes the state-action set under the greedy strategy works as follows. The reward function generator and the driving strategy getter are the two parts of one cycle.
First, the neural network inside the driving strategy getter is obtained. The state features f(s_t) describing the environment, extracted from the demonstration data D_E, are fed into the network, which outputs g_w(s_t), a set of Q-values for the state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T. Each Q(s_t, a_i) is a state-action value describing how good it is to choose the decision driving action a_i in the current driving scene state s_t; it can be obtained from Q(s, a) = θ^T μ(s, a), where θ denotes the weights in the current reward function and μ(s, a) denotes the feature expectation.
An ε-greedy strategy is then applied to choose the driving decision action a_t for the scene state s_t: under the ε-greedy rule, the action with the largest Q-value in the set for s_t is chosen, and otherwise an action is chosen at random. Once a_t is chosen, the value Q(s_t, a_t) is recorded.
Thus, feeding the state feature f(s_t, a_t) of every state in D_E into the network yields M state-action pairs (s_t, a_t), each describing the driving decision action a_t chosen in the scene state s_t at time t. Based on these choices, the Q-values of the M state-action pairs are obtained and recorded as Q.
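The ε-greedy choice described above can be sketched as follows (a minimal illustration; in the embodiment ε = 0.5, and we take "greedy with probability 1 − ε" as the standard reading):

```python
import random

def epsilon_greedy(q_values, epsilon=0.5, rng=random):
    """Pick an action index from a list of Q-values.

    With probability 1 - epsilon take the greedy (max-Q) action;
    otherwise take a uniformly random one, as in the patent's
    state-action-set module (epsilon = 0.5 in the embodiment).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```
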
Further, the module that solves for the weights of the reward function works as follows. First an objective function is built. Its terms are: a loss function l(s_t, a_t) that is 0 if the current state-action pair occurs among the driving demonstrations and 1 otherwise; the recorded state-action values Q^π(s_t, a_t); the product θ^T μ_E of the reward weights θ and the demonstration feature expectation computed in the module above; and a regularization term. A form consistent with these terms is
J(θ) = Σ_{t=1}^{M} [ l(s_t, a_t) + Q^π(s_t, a_t) ] − θ^T μ_E + λ‖θ‖
The objective is minimized by gradient descent, t = min_θ J(θ); the variable θ that minimizes it is the weight vector of the sought reward function.
Further, based on the obtained reward weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
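Given the solved weights θ, the generator is just the linear map r(s, a) = θ^T f(s, a); a minimal sketch:

```python
import numpy as np

def make_reward(theta):
    """Return the linear reward function r(s, a) = theta^T f(s, a)."""
    theta = np.asarray(theta, dtype=float)
    def reward(f_sa):
        # f_sa is the feature vector the state feature extractor
        # produced for the state-action pair (s, a).
        return float(theta @ np.asarray(f_sa, dtype=float))
    return reward
```
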
Further, the driving strategy getter is implemented as follows:
S31. Build the training data of the driving strategy getter. Each training datum has two parts: one is the driving decision feature f(s_t) obtained by feeding the driving scene state at time t into the driving state extractor; the other is a target value built from the reward and the recorded Q-values, in the one-step form
y_t = r_θ(s_t, a_t) + γ Q^π(s_{t+1}, a_{t+1})
where r_θ(s_t, a_t) is given by the reward function generator built from the demonstration data, and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) come from the Q-values recorded in the module that computes the state-action set under the greedy strategy: the Q-value describing the driving scene s_t at time t and the Q-value describing the driving scene s_{t+1} at time t+1 are selected.
S32. Establish the neural network.
S33. Optimize the neural network.
Further, the neural network in step S32 has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it receives the driving scene feature f(s_t, a_t). The second, hidden layer has 10 neurons. The number of neurons in the third layer equals the number n of decision driving actions in the action space. The activation function of the input layer and the hidden layer is the sigmoid function, sigmoid(x) = 1/(1 + e^{−x}), so that:
z = w^{(1)} x = w^{(1)} [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^{(2)} [1, h]^T)
where w^{(1)} are the weights of the hidden layer; f_t is the feature of the driving scene state s_t at time t, i.e. the network input; z is the hidden layer's output before the sigmoid activation; h is the hidden-layer output after the sigmoid activation; and w^{(2)} are the weights of the output layer.
The network output g_w(s_t) is the set of Q-values of the driving scene state s_t at time t, [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the value Q^π(s_t, a_t) in S31 is obtained by feeding s_t into the network and selecting the entry for a_t in the output.
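The three equations above amount to the following forward pass (a sketch under the stated shapes, with the bias handled by prepending a constant 1 as in [1, f_t]^T):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w1, w2, f_t):
    """Forward pass of the three-layer Q-network of step S32.

    w1: hidden-layer weights, shape (10, k + 1); the extra column
        multiplies the constant bias input 1.
    w2: output-layer weights, shape (n, 11).
    f_t: k-dimensional state feature of the driving scene at time t.
    Returns g_w(s_t) = [Q(s_t, a_1), ..., Q(s_t, a_n)].
    """
    x = np.concatenate(([1.0], np.asarray(f_t, dtype=float)))  # [1, f_t]
    z = w1 @ x                       # hidden layer before the sigmoid
    h = sigmoid(z)                   # hidden-layer output
    return sigmoid(w2 @ np.concatenate(([1.0], h)))
```
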
As a further feature, the loss function established for optimizing the neural network is the cross-entropy cost function
C = −(1/N) Σ_t [ y_t ln Q^π(s_t, a_t) + (1 − y_t) ln(1 − Q^π(s_t, a_t)) ] + (λ/2N) ‖W‖²
where N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by feeding the driving scene state s_t at time t into the network and selecting the output entry for the corresponding driving decision action a_t; y_t is the value computed in S31; and the last term is the regularizer, in which W = {w^{(1)}, w^{(2)}} denotes the weights of the network above.
The training data obtained in S31 are fed into the network and the cost function is minimized by gradient descent; the optimized network then yields the driving strategy getter.
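Under one plausible reading (targets squashed into (0, 1) so that the cross-entropy is well defined), the cost of step S33 can be sketched as:

```python
import numpy as np

def cost(q_pred, y, weights, lam=0.9):
    """Regularized cross-entropy cost of step S33 (one plausible reading).

    q_pred: network outputs Q^pi(s_t, a_t) for the chosen actions, in (0, 1).
    y:      target values built in S31, assumed squashed into (0, 1).
    weights: list of weight matrices W = {w1, w2}; lam is the
             regularization coefficient (0.9 in the embodiment).
    """
    q = np.asarray(q_pred, dtype=float)
    t = np.asarray(y, dtype=float)
    n = len(q)
    ce = -np.mean(t * np.log(q) + (1 - t) * np.log(1 - q))
    reg = lam / (2 * n) * sum(float(np.sum(w ** 2)) for w in weights)
    return ce + reg
```
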
As a further feature, the judging device works as follows. The current reward function generator and driving strategy getter are treated as one whole, and the current value t obtained above (t = min_θ J(θ)) is checked against the condition t < ε, where ε is a threshold that judges whether the objective function meets the demand, i.e., whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set according to the specific needs.
When the value of t does not satisfy the condition, the reward function generator must be rebuilt: the neural network needed in the module that computes the state-action set under the greedy strategy is replaced with the new network obtained after the optimization of S33, i.e., the network that generates the values Q(s_t, a_i) describing how good the decision driving action a_i is in driving scene state s_t is replaced with the new network structure optimized by gradient descent in S33. The reward function generator is then rebuilt, the driving strategy getter obtained, and the value of t judged again.
When the condition is satisfied, the current θ is the weight vector of the sought reward function; the reward function generator and the driving strategy getter both meet the requirements. The driving data of the driver whose model is to be established, i.e., the environment scene images during driving and the corresponding operation data, are then collected and fed into the driving environment feature extractor to obtain the decision features of the current scene. The extracted features are fed into the reward function generator to obtain the reward function of the corresponding scene state. Finally, the decision features and the computed reward function are fed into the driving strategy getter, which outputs the driver's corresponding driving strategy.
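The alternation the judging device drives can be sketched structurally as follows (the four callables stand in for the patent's modules; their names are hypothetical, not the patent's API):

```python
def build_driving_strategy(init_network, greedy_rollout, fit_reward_weights,
                           optimize_network, threshold, max_iters=100):
    """Outer loop of the system: alternate between fitting the reward
    weights theta and re-optimizing the Q-network, stopping once the
    objective value t falls below the judging device's threshold."""
    net = init_network()                       # fresh Q-network (S32)
    theta = None
    for _ in range(max_iters):
        records = greedy_rollout(net)          # state-action set module
        theta, t = fit_reward_weights(records) # t = min_theta J(theta)
        if t < threshold:                      # judging device check
            break
        net = optimize_network(theta, net)     # S31-S33 on the new reward
    return theta, net
```
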
Compared with the prior art, the advantageous effect of the present invention is as follows. In real driving, weather, scenery, and similar factors give the various driving scenes a correspondingly large state space. Thanks to the neural network's outstanding ability to approximate arbitrary functions, the strategy can be treated approximately as a black box: the feature values of a state are input, the corresponding state-action values are output, and the action is chosen according to those outputs, yielding the action for that state. This greatly broadens the applicability of modeling driving behavior by inverse reinforcement learning. Conventional methods fit the demonstration trajectories with some probability distribution, so the optimal strategies they obtain remain limited to the states already present in the demonstration trajectories. The present invention, by contrast, also applies to new state scenes and obtains their corresponding actions, greatly improving the generalization ability of the resulting driver behavior model; it applies to a wider range of scenes and is more robust.
Description of the drawings
Fig. 1 shows the new deep convolutional neural network;
Fig. 2 shows a sample frame of the driving video;
Fig. 3 is a flow diagram of the working method of the system in embodiment 1;
Fig. 4 shows the structure of the neural network established in step S32.
Specific embodiments
The invention is further described below with reference to the accompanying drawings. The following embodiments are only intended to illustrate the technical scheme of the present invention clearly and do not limit its scope of protection.
This embodiment provides the system for building a driving strategy based on the driving environment, with the following steps.
The feature extractor extracts the features used to build the reward function:
S11. While the vehicle is travelling, the driving video obtained by a camera placed behind the vehicle's windshield is sampled; a sample frame is shown in Fig. 2. Pictures of N groups of different road environments and road conditions, together with the corresponding steering angles, are obtained, comprising N1 straight-road and N2 curve samples, where the values can be N1 ≥ 300 and N2 ≥ 3000; combined with the corresponding driving operation data, these form the training data.
S12. The collected images are translated, cropped, brightness-adjusted, and otherwise transformed to simulate scenes under different illumination and weather.
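The transformations of step S12 can be sketched with plain NumPy (illustrative parameter values; a real pipeline would randomize them per image):

```python
import numpy as np

def augment(img, shift=5, crop=4, brightness=1.2):
    """Simulate different lighting/weather as in step S12: shift the
    image horizontally, crop a border, and scale brightness.
    img is an H x W x 3 float array with values in [0, 1]."""
    out = np.roll(img, shift, axis=1)           # horizontal translation
    out = out[crop:-crop, crop:-crop]           # central crop
    out = np.clip(out * brightness, 0.0, 1.0)   # brightness change
    return out
```
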
S13. A convolutional neural network is built and trained with the processed pictures as input and the operation data of each picture as its label value; an optimization method based on the Nadam optimizer minimizes the mean-squared-error loss to find the optimal weight parameters of the network.
The convolutional neural network comprises 1 input layer, 3 convolutional layers, 3 pooling layers, and 4 fully connected layers. The input layer connects in order to the first convolutional layer and first pooling layer, then the second convolutional layer and second pooling layer, then the third convolutional layer and third pooling layer, followed in order by the first, second, third, and fourth fully connected layers.
S14. The network structure and weights of the trained convolutional neural network, with the final output layer removed, are saved to establish a new convolutional neural network; this completes the state feature extractor.
The reward function generator obtains the driving strategy:
In the acquisition of a driving strategy, the reward function is the criterion by which actions are selected in the reinforcement learning method. Its quality is decisive: it directly determines the quality of the obtained driving strategy, and whether that strategy matches the strategy of the true driving demonstrations. The formula of the reward function is reward = θ^T f(s_t, a_t), where f(s_t, a_t) denotes one group of feature values describing the state s_t of the driving environment scene (the vehicle's surroundings) at time t that influence the driving decision, and θ denotes the corresponding group of weights of those features; the magnitude of a weight expresses the proportion of the corresponding environmental feature in the reward function and thus its importance. On the basis of the state feature extractor, the weights θ must be solved in order to build the reward function that shapes the driving strategy.
The module that obtains the expert's demonstration data: the demonstration data come from sampling a demonstration driving video (different from the data used for the driving environment feature extractor above). A continuous driving video can be sampled at a frequency of 10 Hz, yielding one trajectory demonstration; one expert demonstration should contain several trajectories. The whole set is denoted
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}
where D_E is the complete demonstration data set; (s_j, a_j) is the pair formed by state j (the video picture of the driving environment at sampling time j) and the decision instruction taken in that state (e.g. the steering angle of a steering instruction); M is the total number of demonstration pairs; N_T is the number of demonstration trajectories; and L_i is the number of state-decision pairs (s_j, a_j) in the i-th demonstration trajectory.
The module that computes the feature expectation of the demonstrations: each state s_t in D_E describing the driving environment is fed into the state feature extractor, yielding the feature vector f(s_t, a_t), one group of feature values for s_t that influence the driving decision. The feature expectation of the demonstrations is then computed as
μ_E = (1/N_T) Σ_{i=1}^{N_T} Σ_{t=0}^{L_i − 1} γ^t f(s_t, a_t)
where γ is the discount factor, set according to the problem; a reference value is 0.65.
The module that computes the state-action set under the greedy strategy: first, the neural network in the driving strategy getter of S32 is obtained. (The reward function generator and the driving strategy getter are the two parts of one cycle. At the very beginning, the network is the freshly initialized network of S32. As the cycle proceeds, each step consists of completing the construction of the reward function that shapes the driving decisions, obtaining the corresponding optimal driving strategy from the current reward function, and judging whether the criterion for ending the cycle is met; if not, the network optimized by the process of S33 is plugged in and the reward function rebuilt.)
The state features f(s_t, a_t) describing the environment, extracted from the demonstration data D_E, are fed into the network, producing the output g_w(s_t), a set of Q-values for the state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T. Each Q(s_t, a_i) is a state-action value describing how good it is to choose the decision driving action a_i in the current driving scene state s_t; it can be obtained from Q(s, a) = θ^T μ(s, a), where θ denotes the weights in the current reward function and μ(s, a) denotes the feature expectation.
An ε-greedy strategy is then applied, with ε set to 0.5, to choose the driving decision action a_t for the scene state s_t: with a probability of 50 percent, the action with the largest Q-value in the set for s_t is chosen; otherwise an action is chosen at random. Once a_t is chosen, the value Q(s_t, a_t) is recorded.
Thus, feeding the state feature f(s_t, a_t) of every state in D_E into the network yields M state-action pairs (s_t, a_t), each describing the driving decision action a_t chosen in the scene state s_t at time t. Based on these choices, the Q-values of the M state-action pairs are obtained and recorded as Q.
The module that solves for the weights of the reward function: first an objective function is built from the following terms: a loss function l(s_t, a_t) that is 0 if the current state-action pair occurs among the driving demonstrations and 1 otherwise; the recorded state-action values Q^π(s_t, a_t); the product θ^T μ_E of the weights and the demonstration feature expectation; and a regularization term that prevents overfitting, whose coefficient can be 0.9. A form consistent with these terms is
J(θ) = Σ_{t=1}^{M} [ l(s_t, a_t) + Q^π(s_t, a_t) ] − θ^T μ_E + λ‖θ‖
The objective is minimized by gradient descent, t = min_θ J(θ); the variable θ that minimizes it is the weight vector of the sought reward function.
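The minimization step t = min_θ J(θ) can be sketched with a generic finite-difference gradient descent (J here is any supplied scalar objective, not the patent's specific one):

```python
import numpy as np

def gradient_descent(j, theta0, lr=0.1, steps=500, eps=1e-6):
    """Minimize a scalar objective J(theta) by plain gradient descent,
    estimating the gradient with central finite differences.
    Returns the minimizing theta and the objective value t there."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d[i] = eps
            grad[i] = (j(theta + d) - j(theta - d)) / (2 * eps)
        theta -= lr * grad
    return theta, j(theta)
```
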
Based on the obtained reward weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
The driving strategy getter completes the construction of the driving strategy, as follows.
S31. Build the training data of the driving strategy getter. The data come from the sampling of the demonstration data above, processed into a new type of datum, N in total. Each datum has two parts: one is the driving decision feature f(s_t) obtained by feeding the driving scene state at time t into the driving state extractor; the other is the target value
y_t = r_θ(s_t, a_t) + γ Q^π(s_{t+1}, a_{t+1})
where r_θ(s_t, a_t) is given by the reward function generator built from the demonstration data, and the Q-value describing the driving scene s_t at time t and the Q-value describing the driving scene s_{t+1} at time t+1 are selected from the recorded values.
S32. Establish the neural network. The network has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it receives the driving scene feature f(s_t, a_t). The second, hidden layer has 10 neurons; the number of neurons in the third layer equals the number n of decision driving actions in the action space. The activation function of the input layer and the hidden layer is the sigmoid function, sigmoid(x) = 1/(1 + e^{−x}), so that:
z = w^{(1)} x = w^{(1)} [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^{(2)} [1, h]^T)
where w^{(1)} denotes the weights of the hidden layer; f_t denotes the feature of the driving scene state s_t at time t, i.e. the network input; z denotes the hidden layer's output before the sigmoid activation; h denotes the hidden-layer output after the sigmoid activation; and w^{(2)} denotes the weights of the output layer. The network structure is shown in Fig. 4.
The network output g_w(s_t) is the set of Q-values of the driving scene state s_t at time t, [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the value Q^π(s_t, a_t) in S31 is obtained by feeding s_t into the network and selecting the entry for a_t in the output.
S33. Optimize the neural network. The loss function established for the optimization is the cross-entropy cost function
C = −(1/N) Σ_t [ y_t ln Q^π(s_t, a_t) + (1 − y_t) ln(1 − Q^π(s_t, a_t)) ] + (λ/2N) ‖W‖²
where N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by feeding the driving scene state s_t at time t into the network and selecting the output entry for the corresponding driving decision action a_t; y_t is the value computed in S31; and the last term is the regularizer, set to prevent overfitting, whose coefficient may likewise be 0.9, with W = {w^{(1)}, w^{(2)}} denoting the weights of the network above.
The training data obtained in S31 are fed into the network, and the cost function is minimized by gradient descent; the optimized network yields the driving strategy getter.
Judging device regards current Reward Program generator and driving strategy getter as an entirety, checks current t
Value, if meet t < ε, ε be judge object function whether the threshold value of meet demand, that is, judge to be currently used in acquisition driving
Whether the Reward Program of strategy meets the requirements.Its numerical value carries out different settings according to specific needs.
When the value of t does not satisfy this inequality, the reward function generator needs to be rebuilt. The current neural network is replaced with the new neural network that has been optimized, i.e., the network generating the values Q(s_t, a_i), which describe the quality of selecting the decision driving action a_i under driving scene state s_t, is replaced with the new network structure optimized by the gradient descent method in S33. The reward function generator is then rebuilt, the driving strategy getter is obtained, and whether the value of t meets the demand is judged again.
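The judging device's outer loop described above can be sketched generically. Here `rebuild_step` is a hypothetical stand-in for rebuilding the reward function generator and driving strategy getter and recomputing t; it is not named in the patent:

```python
def iterate_until_threshold(initial_t, epsilon, rebuild_step, max_iters=1000):
    """Sketch of the judging loop: rebuild the reward-function generator
    and driving strategy getter until the objective value t falls below
    the threshold epsilon (or an iteration cap is reached)."""
    t = initial_t
    iters = 0
    while t >= epsilon and iters < max_iters:
        t = rebuild_step(t)  # rebuild generator/getter, recompute t
        iters += 1
    return t, iters
```

The `max_iters` cap is an implementation convenience, not part of the patent's description.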
When the inequality is satisfied, the current θ is the weight vector of the required reward function. The reward function generator then meets the requirements, and so does the driving strategy getter. One can then collect the driving data of the driver for whom a driving model is to be established, i.e., the environment scene images during driving and the corresponding operation data, such as the steering angle; input them into the driving environment feature extractor to obtain the decision features of the current scene; input the extracted features into the reward function generator to obtain the reward function of the corresponding scene state; and finally input the acquired decision features and the computed reward function into the driving strategy getter to obtain the driving strategy corresponding to that driver.
In a Markov decision process, a policy needs to link states to their corresponding actions. However, for a large-scale state space it is difficult to depict a definite policy for regions that have not been traversed. Traditional methods also ignore this part: based only on the demonstration trajectories, they build a probabilistic model of the whole trajectory distribution and do not provide a concrete policy representation for new states, i.e., they give no concrete method for deciding which action to take in a new state. In the present invention the policy is described by a neural network, which can approximate an arbitrary function to any accuracy and has outstanding generalization ability. Through the representation of state features, on the one hand those states not included in the demonstration trajectories can be represented; on the other hand, by inputting the corresponding state features into the neural network, the corresponding action values can be obtained, so that the appropriate action can be derived from the policy. Thus the problem that traditional methods cannot generalize driving demonstration data to driving scene states that were not traversed is addressed.
The above are only preferred specific embodiments of the invention, but the protection scope of the invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the invention and according to the technical solution of the invention and its inventive concept, shall be covered by the protection scope of the invention.
Claims (10)
1. A system for building a driving strategy based on the driving environment, characterized by specifically comprising: a feature extractor, which extracts the features for building the reward function; a reward function generator, which generates the reward function; a driving strategy getter, which completes the building of the driving strategy; and a judging device, which judges whether the optimal driving strategy built by the getter meets the judgment criterion; if not, the reward function is rebuilt and the optimal driving strategy is built again, iterating until the judgment criterion is met, finally obtaining the driving strategy describing the true driving demonstration model;
the reward function generator comprises a module for obtaining expert driving example data, a module for seeking the feature expectation value of the driving demonstrations, a module for seeking the state-action set under the greedy strategy, and a module for seeking the weights of the reward function.
2. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for obtaining expert driving example data is specifically: driving example data are extracted by sampling the driving demonstration video data; a continuous section of driving video is sampled at a certain frequency to obtain one group of trajectory demonstrations; one set of expert example data includes multiple trajectories and is denoted in total as:
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}
where D_E denotes the whole driving example data, (s_j, a_j) denotes the data pair composed of state j and the decision instruction corresponding to that state, M represents the total number of driving example data, N_T represents the number of driving demonstration trajectories, and L_i represents the number of state-decision instruction pairs (s_j, a_j) contained in the i-th driving demonstration trajectory.
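The sampling in claim 2 can be sketched as follows; `video_frames` and `actions` are hypothetical parallel sequences of per-frame states and decision instructions, and `step` encodes the sampling frequency:

```python
def sample_demonstrations(video_frames, actions, step):
    """Sample (state, action) pairs from a driving video at a fixed
    frequency to form one demonstration trajectory, as in claim 2.
    video_frames and actions are assumed to be aligned sequences."""
    return [(video_frames[i], actions[i])
            for i in range(0, len(video_frames), step)]
```

Running this over N_T videos and concatenating the trajectories yields the full set D_E of M state-decision pairs.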
3. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for seeking the feature expectation value of the driving demonstrations is specifically: first, each state s_t describing the driving environment situation in the driving example data D_E is input into the state feature extractor to obtain the feature situation f(s_t, a_t) under the corresponding state s_t, where f(s_t, a_t) denotes a group of driving environment scene feature values corresponding to s_t that influence the driving decision result; the feature expectation value of the driving demonstrations is then calculated based on the following formula:
where γ is the discount factor, set correspondingly according to the problem.
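Although the formula itself appears only as an image in the source, the γ-discounted sum of per-step features it describes (the standard discounted feature expectation) can be sketched as:

```python
import numpy as np

def feature_expectation(trajectory_features, gamma=0.9):
    """Discounted feature expectation of one demonstration trajectory:
    mu = sum_t gamma^t * f(s_t, a_t). The exact formula in the patent
    is not reproduced in the source; this standard form is assumed."""
    mu = np.zeros_like(np.asarray(trajectory_features[0], dtype=float))
    for t, f in enumerate(trajectory_features):
        mu += (gamma ** t) * np.asarray(f, dtype=float)
    return mu
```

Averaging this quantity over all N_T demonstration trajectories would give the feature expectation of the whole set D_E.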
4. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for seeking the state-action set under the greedy strategy is specifically as follows, the reward function generator and the driving strategy getter being two parts of a cycle:
first, from the neural network in the driving strategy getter: the state feature f(s_t) describing the environment situation, extracted from the driving example data D_E, is input into the neural network to obtain the output g_w(s_t); g_w(s_t) is a group of Q values for the described state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, and Q(s_t, a_i) represents a state-action value describing the quality of choosing the decision driving action a_i under the current driving scene state s_t, acquired based on the formula Q(s, a) = θ μ(s, a), where θ denotes the weights of the current reward function and μ(s, a) denotes the feature expectation value;
then, based on the ε-greedy strategy, the driving decision action corresponding to the described driving scene state s_t is chosen: the decision action allowing the maximum Q value in the Q value set for the current driving scene s_t is chosen; otherwise, an action is randomly selected; after the choice is complete, the chosen pair is recorded;
thus, for each state feature f(s_t, a_t) in the driving demonstrations D_E input into the neural network, M state-action pairs (s_t, a_t) are acquired in total, depicting the driving decision action a_t selected under the driving scene state s_t at time t; meanwhile, based on the chosen actions, the Q values of the M corresponding state-action pairs are obtained, denoted Q.
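The ε-greedy choice in claim 4 can be sketched as below: with probability ε a random action index is drawn, otherwise the action with maximal Q value is taken (the `rng` parameter is an implementation convenience, not part of the patent):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Pick the action with the maximum Q value with probability
    1 - epsilon; otherwise pick a uniformly random action index."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

Applying this to the Q set g_w(s_t) for each demonstrated state yields the M recorded state-action pairs and their Q values.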
5. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for seeking the weights of the reward function is specifically:
first, the objective function is built based on the following formula:
with a loss function representing whether the current state-action pair exists among the driving demonstrations, being 0 if it exists and 1 otherwise; the corresponding state-action values recorded above; the product of the driving exemplary feature expectation, sought in the module for seeking the feature expectation value of the driving demonstrations, and the weights θ of the reward function; and a regularization term;
the objective function is minimized by the gradient descent method, i.e. t = min_θ J(θ), obtaining the variable θ that minimizes the objective function; this θ is the sought weight vector of the required reward function.
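The minimization t = min_θ J(θ) in claim 5 uses plain gradient descent. Since J(θ) itself appears only as an image in the source, the sketch below shows only the generic update with a caller-supplied gradient function:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient of
    the objective J(theta). grad is assumed to return dJ/dtheta; the
    learning rate and step count are illustrative choices."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta
```

For example, minimizing the toy objective (θ - 3)² with gradient 2(θ - 3) converges to θ = 3.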
6. The system for building a driving strategy based on the driving environment according to claim 5, characterized in that, based on the obtained corresponding reward function weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
7. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the specific realization process of the driving strategy getter is:
S31. build the training data of the driving strategy getter:
training data are obtained, each datum including two parts: one is the driving decision feature f(s_t, a_t) obtained by inputting the driving scene state at time t into the driving state extractor; the other is obtained based on the following formula:
where r_θ(s_t, a_t) is the reward function generated by the reward function generator based on the driving example data; Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) come from the Q values recorded in the module for seeking the state-action set under the greedy strategy, being respectively the Q value for the driving scene s_t described at time t and the Q value for the driving scene s_{t+1} described at time t+1;
S32. establish the neural network;
S33. optimize the neural network.
8. The system for building a driving strategy based on the driving environment according to claim 7, characterized in that the neural network in step S32 includes three layers: the first layer serves as the input layer, whose number of neurons equals the number k of output feature types of the feature extractor, and which takes as input the feature f(s_t, a_t) of the driving scene; the number of hidden units in the second layer is 10; the number of neurons in the third layer equals the number n of driving actions available for decision in the action space; the activation function of the input layer and the hidden layer is the sigmoid function, i.e. sigmoid(x) = 1/(1 + e^(-x)), so that:
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) is the weight matrix of the hidden layer; f_t is the feature of the driving scene state s_t at time t, i.e. the input of the neural network; z is the network layer output before the sigmoid activation of the hidden layer; h is the hidden layer output after the sigmoid activation; w^(2) is the weight matrix of the output layer;
the network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, and Q^π(s_t, a_t) in S31 is obtained by inputting state s_t into the neural network and selecting the output entry corresponding to a_t.
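The three-layer forward pass of claim 8 follows directly from the equations z = w^(1)[1, f_t]^T, h = sigmoid(z), g_w(s_t) = sigmoid(w^(2)[1, h]^T); the weight shapes below (hidden size 10, k inputs, n outputs) mirror the claim:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w1, w2, f_t):
    """Forward pass of the three-layer network in claim 8.
    w1: hidden-layer weights of shape (10, k + 1) including a bias column;
    w2: output-layer weights of shape (n, 11) including a bias column;
    f_t: state feature vector of length k."""
    x = np.concatenate(([1.0], f_t))        # prepend bias: [1, f_t]^T
    z = w1 @ x                              # pre-activation of hidden layer
    h = sigmoid(z)                          # hidden-layer output
    g = sigmoid(w2 @ np.concatenate(([1.0], h)))
    return g                                # [Q(s_t, a_1), ..., Q(s_t, a_n)]
```

With all-zero weights every output is sigmoid(0) = 0.5, which is a quick sanity check of the shapes.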
9. The system for building a driving strategy based on the driving environment according to claim 7, characterized in that, for the optimization of the neural network, the established loss function is the cross-entropy cost function, given by the following formula:
where N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by inputting the state s_t describing the driving scene at time t into the neural network and selecting the output entry corresponding to the driving decision action a_t; the target is the value acquired in S31; a regularization term is included, with W = {w^(1), w^(2)} denoting the weights of the neural network above;
the training data obtained in S31 are input into the neural network to optimize this cost function; minimization of the cross-entropy cost function is completed by the gradient descent method, and the optimized neural network yields the driving strategy getter.
10. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the specific realization process of the judging device includes:
regarding the current reward function generator and driving strategy getter as a whole, and checking the current value of t in the module for seeking the feature expectation value of the driving demonstrations, i.e., whether t < ε, where ε is the threshold for judging whether the objective function meets the demand, that is, whether the reward function currently used for acquiring the driving strategy meets the requirements; its value is set differently according to specific needs;
when the value of t does not satisfy the inequality, the reward function generator needs to be rebuilt: the neural network needed in the module for seeking the state-action set under the greedy strategy is replaced with the new neural network optimized in S33, i.e., the network generating the values Q(s_t, a_i), which describe the quality of selecting the decision driving action a_i under driving scene state s_t, is replaced with the new network structure optimized by the gradient descent method in S33; the reward function generator is then rebuilt, the driving strategy getter is obtained, and whether the value of t meets the demand is judged again;
when the inequality is satisfied, the current θ is the weight vector of the required reward function; the reward function generator then meets the requirements, and so does the driving strategy getter; the driving data of the driver for whom a driving model is to be established are then collected, i.e., the environment scene images during driving and the corresponding operation data, and input into the driving environment feature extractor to obtain the decision features of the current scene; the extracted features are then input into the reward function generator to obtain the reward function of the corresponding scene state; finally, the acquired decision features and the computed reward function are input into the driving strategy getter to obtain the driving strategy corresponding to that driver.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810662039.8A CN108791308B (en) | 2018-06-25 | 2018-06-25 | System for constructing driving strategy based on driving environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108791308A true CN108791308A (en) | 2018-11-13 |
CN108791308B CN108791308B (en) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared | ||