CN108791308A - System for constructing a driving strategy based on the driving environment - Google Patents

System for constructing a driving strategy based on the driving environment

Info

Publication number
CN108791308A
CN108791308A (application CN201810662039.8A)
Authority
CN
China
Prior art keywords
driving
state
strategy
reward function
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810662039.8A
Other languages
Chinese (zh)
Other versions
CN108791308B (en)
Inventor
邹启杰
李昊宇
裴腾达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University filed Critical Dalian University
Priority to CN201810662039.8A priority Critical patent/CN108791308B/en
Publication of CN108791308A publication Critical patent/CN108791308A/en
Application granted granted Critical
Publication of CN108791308B publication Critical patent/CN108791308B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B60VEHICLES IN GENERAL
    • B60WCONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W50/00Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
    • B60W2050/0001Details of the control system
    • B60W2050/0019Control system elements or transfer functions

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mechanical Engineering (AREA)
  • Human Computer Interaction (AREA)
  • Transportation (AREA)
  • Automation & Control Theory (AREA)
  • Air Conditioning Control Device (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a system for constructing a driving strategy based on the driving environment, comprising: a feature extractor, which extracts the features used to construct the reward function; a reward function generator, which obtains the driving strategy; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which checks whether the optimal driving strategy constructed by the getter meets a judgment criterion. If the criterion is not met, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is satisfied, finally yielding a driving strategy that describes the true driving demonstrations. The reward function generator includes a module for obtaining the expert driving demonstration data, a module for computing the feature expectation of the demonstrations, a module for computing the state-action set under the greedy policy, and a module for solving the weights of the reward function. The system can be applied to new, previously unseen states and obtain the corresponding actions, which greatly improves the generalization ability of the resulting driver behavior model; it is applicable to a wider range of scenes and is more robust.

Description

System for constructing a driving strategy based on the driving environment
Technical field
The present invention relates to a system for constructing a driving strategy based on the driving environment.
Background technology
Traditional driver driving-strategy models built with reinforcement learning describe and infer driving behavior by analyzing known driving data. However, the collected driving data can never completely cover the essentially inexhaustible range of driving behaviors, and it is impossible to obtain the corresponding action for every state. In real driving scenes, differences in weather, scenery, objects and driving conditions create countless possible states, and traversing all of them is infeasible. As a result, traditional driver behavior models generalize poorly, rely on many modeling assumptions, and are not robust.
Secondly, in practical driving problems, when the reward function is set only by the researcher, too many demands on the various features have to be balanced. The setting depends entirely on the researcher's experience and must be tuned repeatedly, which is time-consuming and laborious and, more critically, overly subjective. Under different scenes and environments the researcher faces too many scene states; moreover, even for a fixed scene state, different demands lead to different driving behaviors. To describe the driving task accurately, a set of weights must be assigned to these factors. Among existing methods, inverse reinforcement learning based on probabilistic models starts mainly from the existing demonstration data, treats it as the available data, and then estimates the distribution of the current data, from which the action choice in the corresponding state is derived. But the distribution of the known data cannot represent the distribution of all data; obtaining the correct distribution would require the corresponding action for every state.
Invention content
Given the technical problem in the prior art that, for driving scenes without demonstration data, the corresponding reward function cannot be established for driving-behavior modeling, this application provides a system for constructing a driving strategy based on the driving environment. The system can be applied to new, unseen states and obtain the corresponding actions; it is applicable to a wider range of scenes and is more robust.
To achieve the above goal, the technical solution of the present invention is a system for constructing a driving strategy based on the driving environment, which specifically includes: a feature extractor, which extracts the features used to construct the reward function; a reward function generator, which obtains the driving strategy; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which judges whether the optimal driving strategy constructed by the getter meets the judgment criterion. If not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met, finally obtaining a driving strategy that describes the true driving demonstrations.
The reward function generator includes a module for obtaining the expert driving demonstration data, a module for computing the feature expectation of the driving demonstrations, a module for computing the state-action set under the greedy policy, and a module for solving the weights of the reward function.
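The interaction of these components can be summarized as an alternating loop. The following Python sketch is only an illustration of that loop under assumed interfaces; fit_reward, fit_strategy and their return values are hypothetical names, not part of the disclosure:

    def build_driving_strategy(demonstrations, fit_reward, fit_strategy,
                               epsilon=1e-3, max_iters=100):
        """Alternation of the reward function generator and the driving strategy
        getter, stopped by the judging device. Assumed callables:
        fit_reward(demonstrations, strategy) -> (theta, t_value)
        fit_strategy(demonstrations, theta)  -> new strategy (e.g. a Q network)."""
        strategy = None
        for _ in range(max_iters):
            theta, t_value = fit_reward(demonstrations, strategy)  # rebuild the reward function
            strategy = fit_strategy(demonstrations, theta)         # optimal strategy for this reward
            if t_value < epsilon:                                  # judgment criterion t < epsilon
                break
        return theta, strategy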
Further, the module for obtaining the expert driving demonstration data is specifically: the driving demonstration data come from sampling the demonstration driving video; a continuous driving video is sampled at a fixed frequency, giving one group of trajectory demonstrations. One set of expert demonstration data contains multiple trajectories and is written as a whole as:
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}, with M = Σ_{i=1}^{N_T} L_i, where D_E denotes all the driving demonstration data, (s_j, a_j) is the data pair formed by state j and the decision instruction corresponding to that state, M is the total number of demonstration data pairs, N_T is the number of demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th demonstration trajectory.
Further, the module for computing the feature expectation of the driving demonstrations is specifically: first, each state s_t describing the driving-environment situation in the demonstration data D_E is fed into the state feature extractor, giving the feature vector f(s_t, a_t) of state s_t; f(s_t, a_t) denotes the group of driving-environment scene feature values, corresponding to s_t, that influence the driving decision. The feature expectation of the driving demonstrations is then computed as the discounted, trajectory-averaged sum of these features,
μ_E = (1/N_T) Σ_{i=1}^{N_T} Σ_t γ^t f(s_t, a_t)
where γ is the discount factor, set according to the problem at hand.
Further, the module for computing the state-action set under the greedy policy is specifically as follows. The reward function generator and the driving strategy getter are the two parts of one loop.
First, the neural network in the driving strategy getter is obtained: the state features f(s_t) describing the environment, extracted from the driving demonstration data D_E, are fed into the neural network, which produces the output g_w(s_t); g_w(s_t) is the group of Q values for the state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where Q(s_t, a_i) is a state-action value describing how good it is to choose the decision driving action a_i in the current driving-scene state s_t; it is obtained from Q(s, a) = θ·μ(s, a), where θ denotes the weights of the current reward function and μ(s, a) denotes the feature expectation.
Then, based on the ε-greedy policy, the driving decision action a_t corresponding to the driving-scene state s_t is chosen: either the action with the largest Q value in the Q-value set of the current driving scene s_t is chosen, or an action is chosen at random. After a_t is chosen, the corresponding Q(s_t, a_t) is recorded.
Thus, for the state features f(s_t, a_t) of every state in the demonstrations D_E, the neural network is applied, yielding in total M state-action pairs (s_t, a_t), which describe the driving decision action a_t chosen in the driving-scene state s_t at time t; at the same time, based on the chosen actions, the Q values of the M corresponding state-action pairs are obtained and recorded as Q.
Further, the module for solving the weights of the reward function is specifically:
First, the objective function J(θ) is built from the following components: a loss term that, for the current state-action pair, is 0 if the pair appears in the driving demonstrations and 1 otherwise; the corresponding state-action values Q recorded above; the product θ·μ_E of the feature expectation of the driving demonstrations, computed in the feature-expectation module, and the weights θ of the reward function; and a regularization term.
The objective function is then minimized by gradient descent, t = min_θ J(θ), obtaining the variable θ that minimizes it; this θ is the required weight vector of the reward function.
Further, based on the obtained reward-function weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
Further, the specific implementation process of the driving strategy getter is:
S31. Build the training data of the driving strategy getter.
Training data are obtained; each data item contains two parts: one is the driving-decision feature f(s_t) obtained by feeding the driving-scene state at time t into the driving-state extractor; the other is a target value computed from r_θ(s_t, a_t), Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}), where r_θ(s_t, a_t) is the reward function generated by the reward function generator from the driving demonstration data, and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) come from the Q values recorded in the module for computing the state-action set under the greedy policy, namely the Q value of the driving scene s_t at time t and the Q value of the driving scene s_{t+1} at time t+1.
S32. Establish the neural network.
S33. Optimize the neural network.
Further, the neural network in step S32 has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it takes the features f(s_t, a_t) of the driving scene as input. The hidden (second) layer has 10 neurons, and the number of neurons in the third layer equals the number n of decision driving actions in the action space. The activation function of the input layer and the hidden layer is the sigmoid function, sigmoid(x) = 1/(1 + e^(-x)), so that:
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) are the weights of the hidden layer; f_t is the feature of the driving-scene state s_t at time t, i.e. the input of the neural network; z is the output of the network layer before the hidden-layer sigmoid activation; h is the hidden-layer output after the sigmoid activation; and w^(2) are the weights of the output layer.
The network output g_w(s_t) is the set of Q values of the driving-scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the Q^π(s_t, a_t) in S31 is obtained by feeding the state s_t into the neural network and selecting the entry for a_t in the output.
As a further refinement, the loss function used to optimize the neural network is the cross-entropy cost function, built from the following components: N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by feeding the driving-scene state s_t at time t into the neural network and selecting the entry of the corresponding driving decision action a_t in the output; the target value computed in S31; and a regularization term over W = {w^(1), w^(2)}, the weights of the neural network above.
The training data obtained in S31 are fed into the network to evaluate this cost function; the minimization of the cross-entropy cost is completed by gradient descent, yielding the optimized neural network and thus the driving strategy getter.
As a further refinement, the implementation process of the judging device includes:
The current reward function generator and driving strategy getter are regarded as one whole, and the current value of t in the module for computing the feature expectation of the driving demonstrations is checked against t < ε, where ε is the threshold that decides whether the objective function meets the demand, i.e. whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set differently according to specific needs.
If the value of t does not satisfy the inequality, the reward function generator must be rebuilt: the neural network needed in the module for computing the state-action set under the greedy policy is replaced by the new neural network optimized in S33, i.e. the network that produces the values Q(s_t, a_i) describing how good the chosen decision driving action a_i is in the driving-scene state s_t is replaced by the new network structure optimized by gradient descent in S33. The reward function generator is then rebuilt, the driving strategy getter is obtained again, and the value of t is checked again.
If the inequality is satisfied, the current θ is the required weight vector of the reward function; the reward function generator meets the requirement and so does the driving strategy getter. Then the driving data of the driver for whom a driving model is to be established are collected, i.e. the environment scene images during driving and the corresponding operation data, and fed into the driving-environment feature extractor to obtain the decision features of the current scene. The extracted features are then fed into the reward function generator to obtain the reward function of the corresponding scene state. Finally the obtained decision features and the computed reward function are fed into the driving strategy getter, obtaining the driving strategy corresponding to that driver.
Compared with the prior art, the present invention has the following advantageous effects. In real driving, weather, scenery and similar factors produce a very large state space of driving scenes. By exploiting the outstanding ability of neural networks to approximate arbitrary functions, this policy expression can be treated approximately as a black box: the feature values of a state are fed in, the corresponding state-action values are output, and the action is then chosen according to these outputs, so that the action for a given state is obtained. This greatly enhances the applicability of modeling driving behavior by inverse reinforcement learning. Conventional methods try to fit the demonstration trajectories with some probability distribution, so the optimal policy they obtain is still limited to the states already present in the demonstration trajectories, whereas the present invention can be applied to new state scenes and obtain their corresponding actions, which greatly improves the generalization ability of the established driver behavior model; the applicable scenes are wider and the robustness is stronger.
Description of the drawings
Fig. 1 shows the new deep convolutional neural network;
Fig. 2 is a sample frame of the driving video;
Fig. 3 is the working-flow diagram of the system in embodiment 1;
Fig. 4 is the structure of the neural network established in step S32.
Specific implementation mode
The invention is further described below in conjunction with the accompanying drawings. The following embodiments are only used to clearly illustrate the technical solution of the present invention and are not intended to limit its protection scope.
This embodiment provides a system for constructing a driving strategy based on the driving environment, which specifically includes the following parts.
A feature extractor, which extracts the features used to construct the reward function; the specific steps are:
S11. While the vehicle is driving, the driving video obtained by a camera placed behind the windshield is sampled; a sample frame is shown in Fig. 2.
Pictures of N groups of different road environments and road conditions, with the corresponding steering angles, are obtained, including N1 straight-road samples and N2 curve samples; the values may be N1 >= 300 and N2 >= 3000. Together with the corresponding driving operation data, the training data are constructed.
S12. The collected images are translated, cropped, brightness-adjusted and otherwise transformed to simulate different illumination and weather conditions.
S13. A convolutional neural network is built; the processed pictures are the input and the operation data of the corresponding pictures are the label values for training. An optimization method based on the Nadam optimizer is used to minimize the mean-squared-error loss and solve for the optimal weight parameters of the neural network.
The convolutional neural network contains 1 input layer, 3 convolutional layers, 3 pooling layers and 4 fully connected layers. The input layer is followed by the first convolutional layer and the first pooling layer, then the second convolutional layer and the second pooling layer, then the third convolutional layer and the third pooling layer, and finally the first, second, third and fourth fully connected layers in sequence.
S14. The network structure and weights of the trained convolutional neural network, except for the last output layer, are preserved to establish a new convolutional neural network, completing the state feature extractor.
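A minimal PyTorch sketch of steps S13-S14 follows. Only the layer types, their order, the MSE loss and the Nadam optimizer come from the text; the channel counts, kernel sizes, feature width and the single steering-angle output are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class StateFeatureCNN(nn.Module):
        """1 input layer, 3 conv + 3 pooling layers and 4 fully connected layers."""
        def __init__(self, n_features=64):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 24, 5), nn.ReLU(), nn.MaxPool2d(2),    # conv1 + pool1
                nn.Conv2d(24, 36, 5), nn.ReLU(), nn.MaxPool2d(2),   # conv2 + pool2
                nn.Conv2d(36, 48, 3), nn.ReLU(), nn.MaxPool2d(2),   # conv3 + pool3
                nn.AdaptiveAvgPool2d((4, 8)),                        # fixed-size feature map
            )
            self.fc = nn.Sequential(
                nn.Flatten(),
                nn.Linear(48 * 4 * 8, 256), nn.ReLU(),               # fc1
                nn.Linear(256, 128), nn.ReLU(),                      # fc2
                nn.Linear(128, n_features), nn.ReLU(),               # fc3: kept as feature output
            )
            self.out = nn.Linear(n_features, 1)                      # fc4: output layer dropped in S14

        def forward(self, x, return_features=False):
            f = self.fc(self.conv(x))
            return f if return_features else self.out(f)

    # S13: train on (image, steering angle) pairs with MSE loss and the Nadam optimizer.
    model = StateFeatureCNN()
    optimizer = torch.optim.NAdam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

After training, S14 keeps everything except the final output layer, so calling model(x, return_features=True) plays the role of the state feature extractor in this sketch.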
A reward function generator, which obtains the driving strategy:
The reward function serves as the standard for action selection in the reinforcement-learning acquisition of the driving strategy. Its quality is decisive: it directly determines the quality of the obtained driving strategy and whether that strategy matches the strategy corresponding to the true driving demonstration data. The reward function has the form reward = θ^T f(s_t, a_t), where f(s_t, a_t) denotes the group of feature values that influence the driving decision for the state s_t of the driving-environment scene (the vehicle's surroundings) at time t, used to describe the situation of the vehicle's surroundings, and θ denotes the group of weights of the features that influence the driving decision; the magnitude of a weight indicates the proportion of the corresponding environment feature in the reward function, i.e. its importance. On the basis of the state feature extractor, this weight vector θ must be solved in order to build the reward function that shapes the driving strategy.
Module for obtaining the expert driving demonstration data: the driving demonstration data come from sampling the demonstration driving video (different data from those used by the driving-environment feature extractor above); a continuous driving video may be sampled at a frequency of 10 Hz, giving one group of trajectory demonstrations. One expert demonstration contains multiple trajectories, written as a whole as: D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}, with M = Σ_{i=1}^{N_T} L_i, where D_E denotes all the driving demonstration data, (s_j, a_j) is the data pair formed by state j (the driving-environment video frame at sampling time j) and the decision instruction corresponding to that state (e.g. the steering angle in the steering instruction), M is the total number of demonstration data pairs, N_T is the number of demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th demonstration trajectory.
Module for computing the feature expectation of the driving demonstrations: first, each state s_t describing the driving-environment situation in the demonstration data D_E is fed into the state feature extractor, giving the feature vector f(s_t, a_t) of state s_t, which denotes the group of driving-environment scene feature values, corresponding to s_t, that influence the driving decision. The feature expectation of the driving demonstrations is then computed as the discounted, trajectory-averaged sum
μ_E = (1/N_T) Σ_{i=1}^{N_T} Σ_t γ^t f(s_t, a_t)
where γ is the discount factor, set according to the problem; a reference value of 0.65 may be used.
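A minimal sketch of this computation, assuming the demonstrations are stored as lists of (state, action) pairs and feature_fn wraps the state feature extractor:

    import numpy as np

    def feature_expectation(trajectories, feature_fn, gamma=0.65):
        """Discounted feature expectation of the driving demonstrations.

        trajectories: list of trajectories, each a list of (state, action) pairs.
        feature_fn:   maps (state, action) to the feature vector f(s_t, a_t).
        """
        mu = None
        for traj in trajectories:
            for t, (s, a) in enumerate(traj):
                f = np.asarray(feature_fn(s, a), dtype=float)
                mu = (gamma ** t) * f if mu is None else mu + (gamma ** t) * f
        return mu / len(trajectories)   # average over the N_T demonstration trajectories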
Module for computing the state-action set under the greedy policy: first, the neural network in the driving strategy getter of S32 is obtained. (Because the reward function generator and the driving strategy getter are the two parts of one loop, the very first network is simply the network initialized in S32. As the loop proceeds, each iteration completes the construction of the reward function that influences the driving decision, obtains the corresponding optimal driving strategy from the current reward function, and checks whether the stopping criterion is met; if not, the neural network optimized in S33 is used to rebuild the reward function.)
The state features f(s_t, a_t) describing the environment, extracted from the driving demonstration data D_E, are fed into the neural network, which produces the output g_w(s_t). g_w(s_t) is the group of Q values for the state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where Q(s_t, a_i) is a state-action value describing how good it is to choose the decision driving action a_i in the current driving-scene state s_t; it can be obtained from Q(s, a) = θ·μ(s, a), where θ denotes the weights of the current reward function and μ(s, a) denotes the feature expectation.
Then, based on the ε-greedy policy with ε set to, say, 0.5, the driving decision action a_t corresponding to the driving-scene state s_t is chosen: with probability 50 percent the action with the largest Q value in the Q-value set of the current driving scene s_t is chosen; otherwise an action is chosen at random. After a_t is chosen, the corresponding Q(s_t, a_t) is recorded.
Thus, for the state features f(s_t, a_t) of every state in the demonstrations D_E, the neural network is applied, yielding in total M state-action pairs (s_t, a_t) that describe the driving decision action a_t chosen in the driving-scene state s_t at time t; at the same time, based on the chosen actions, the Q values of the M corresponding state-action pairs are obtained and recorded as Q.
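A minimal sketch of this collection step, assuming q_network maps a feature vector to the array [Q(s, a_1), ..., Q(s, a_n)] as described above:

    import numpy as np

    def collect_state_action_set(demo_features, q_network, epsilon=0.5, rng=None):
        """For the features of each demonstration state, pick an action by the
        eps-greedy rule (eps = 0.5 here) and record its Q value."""
        rng = rng or np.random.default_rng()
        pairs, q_values = [], []
        for f_t in demo_features:
            q = np.asarray(q_network(f_t))
            if rng.random() < epsilon:                 # 50 percent: action with the largest Q value
                a_t = int(np.argmax(q))
            else:                                      # otherwise: a random action
                a_t = int(rng.integers(len(q)))
            pairs.append((f_t, a_t))
            q_values.append(q[a_t])
        return pairs, q_values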
Module for solving the weights of the reward function: first, the objective function J(θ) is built from the following components: a loss term that, for the current state-action pair, is 0 if the pair appears in the driving demonstrations and 1 otherwise; the corresponding state-action values Q recorded above; the product θ·μ_E of the demonstration feature expectation and the reward weights; and a regularization term that prevents over-fitting, whose coefficient γ may be set to 0.9.
The objective function is minimized by gradient descent, t = min_θ J(θ), obtaining the variable θ that minimizes it; this θ is the required weight vector of the reward function.
Based on the obtained reward-function weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
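A minimal sketch of the reward function generator once θ has been solved; the closure form is an illustration choice, not part of the disclosure:

    import numpy as np

    def make_reward_function(theta, feature_fn):
        """Reward function generator: r(s, a) = theta^T f(s, a)."""
        theta = np.asarray(theta, dtype=float)
        def reward(s, a):
            return float(theta @ np.asarray(feature_fn(s, a), dtype=float))
        return reward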
A driving strategy getter, which completes the construction of the driving strategy, specifically:
S31. Construction of the training data of the driving strategy getter.
Training data are obtained. The data come from the sampling of the demonstration data above, but are processed into a new type of data, N items in total. Each item contains two parts: one is the driving-decision feature f(s_t) obtained by feeding the driving-scene state at time t into the driving-state extractor; the other is a target value computed from the reward r_θ(s_t, a_t) generated by the reward function generator from the driving demonstration data, together with the Q value of the driving scene s_t at time t and the Q value of the driving scene s_{t+1} at time t+1 recorded in the greedy-policy module.
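The exact combination of r_θ and the two Q values is given only in the patent drawings; one plausible reading is a TD-style target, as in the following sketch, which should be treated as an assumption rather than the disclosed formula:

    def build_training_targets(samples, reward_fn, gamma=0.65):
        """Assumed reading of the S31 target: y_t = r_theta(s_t, a_t) + gamma * Q_pi(s_{t+1}, a_{t+1}).

        samples: list of tuples (f_t, s_t, a_t, q_t, q_next) collected beforehand;
        q_t = Q_pi(s_t, a_t) is kept because it reappears in the S33 cost rather
        than in the target under this reading.
        """
        data = []
        for f_t, s_t, a_t, q_t, q_next in samples:
            y_t = reward_fn(s_t, a_t) + gamma * q_next
            data.append((f_t, y_t))              # (decision feature, target) pair
        return data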
S32. Establish the neural network.
The neural network has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it takes the features f(s_t, a_t) of the driving scene as input. The hidden (second) layer has 10 neurons, and the number of neurons in the third layer equals the number n of decision driving actions in the action space. The activation function of the input layer and the hidden layer is the sigmoid function, sigmoid(x) = 1/(1 + e^(-x)), so that:
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) denotes the weights of the hidden layer; f_t denotes the feature of the driving-scene state s_t at time t, i.e. the input of the neural network; z denotes the output of the network layer before the hidden-layer sigmoid activation; h denotes the hidden-layer output after the sigmoid activation; and w^(2) denotes the weights of the output layer. The network structure is shown in Fig. 4.
The network output g_w(s_t) is the set of Q values of the driving-scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the Q^π(s_t, a_t) in S31 is obtained by feeding the state s_t into the neural network and selecting the entry for a_t in the output.
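A minimal PyTorch sketch of this three-layer network; using nn.Linear bias terms in place of the explicit leading 1 in [1, f_t] and [1, h] is an implementation choice of the sketch:

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Three-layer network of S32: k inputs, a 10-unit sigmoid hidden layer and
        n sigmoid outputs, one per decision driving action."""
        def __init__(self, k, n, n_hidden=10):
            super().__init__()
            self.hidden = nn.Linear(k, n_hidden)   # z = w1 [1, f_t]^T
            self.out = nn.Linear(n_hidden, n)      # w2 [1, h]^T

        def forward(self, f_t):
            h = torch.sigmoid(self.hidden(f_t))    # h = sigmoid(z)
            return torch.sigmoid(self.out(h))      # g_w(s_t): the n Q values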
S33. Optimize the neural network.
The loss function used to optimize the neural network is the cross-entropy cost function, built from the following components: N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by feeding the driving-scene state s_t at time t into the neural network and selecting the entry of the corresponding driving decision action a_t in the output; the target value computed in S31; and a regularization term, again set to prevent over-fitting, whose coefficient γ may be 0.9 and where W = {w^(1), w^(2)} denotes the weights of the neural network above.
The training data obtained in S31 are fed into the network to evaluate this cost function; the minimization of the cross-entropy cost is completed by gradient descent, and the resulting optimized neural network yields the driving strategy getter.
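A minimal training sketch for S33. The exact cost formula appears only in the patent drawings; treating the S31 targets as values in (0, 1) compatible with the sigmoid outputs, and expressing the regularization through weight_decay, are assumptions of the sketch:

    import torch
    import torch.nn as nn

    def train_strategy_getter(model, features, actions, targets,
                              lr=0.05, weight_decay=1e-4, epochs=200):
        """Gradient-descent minimization of a cross-entropy cost between the selected
        output Q_pi(s_t, a_t) and the S31 target, with L2 regularization.

        model:    torch.nn.Module mapping a feature vector to n Q values (e.g. QNetwork).
        features: float tensor (N, k); actions: long tensor (N,); targets: float tensor (N,).
        """
        optimizer = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=weight_decay)
        bce = nn.BCELoss()
        for _ in range(epochs):
            q_all = model(features)                                   # g_w(s_t) for every sample
            q_sel = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)  # Q_pi(s_t, a_t)
            loss = bce(q_sel, targets)                                # cross-entropy cost
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return model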
A judging device, which regards the current reward function generator and driving strategy getter as one whole and checks the current value of t against t < ε, where ε is the threshold that decides whether the objective function meets the demand, i.e. whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set differently according to specific needs.
If the value of t does not satisfy the inequality, the reward function generator must be rebuilt: the current neural network is replaced by the new neural network that has been optimized, i.e. the network used to generate the values Q(s_t, a_i) describing how good the chosen decision driving action a_i is in the driving-scene state s_t is replaced by the new network structure optimized by gradient descent in S33. The reward function generator is then rebuilt, the driving strategy getter is obtained again, and the value of t is checked again.
If the inequality is satisfied, the current θ is the required weight vector of the reward function; the reward function generator meets the requirement and so does the driving strategy getter. One can then collect the driving data of the driver for whom a driving model is to be established, i.e. the environment scene images during driving and the corresponding operation data, such as the steering angle. They are fed into the driving-environment feature extractor to obtain the decision features of the current scene; the extracted features are then fed into the reward function generator to obtain the reward function of the corresponding scene state; finally the obtained decision features and the computed reward function are fed into the driving strategy getter, obtaining the driving strategy corresponding to that driver.
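A minimal sketch of this deployment step for a specific driver; the module interfaces (feature_extractor, reward_generator, strategy_getter) are assumed names for illustration:

    def build_driver_strategy(driving_frames, operations, feature_extractor,
                              reward_generator, strategy_getter):
        """Images + operations -> decision features -> scene reward -> driver-specific strategy."""
        features = [feature_extractor(img, op) for img, op in zip(driving_frames, operations)]
        rewards = [reward_generator(f) for f in features]   # reward of each scene state
        return strategy_getter(features, rewards)           # driving strategy for this driver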
In a Markov decision process, a policy must link states to their corresponding actions. When the state space is large, however, it is difficult to describe and express a definite policy for the regions that have not been traversed, and traditional methods also ignore this part: they only build, from the demonstration trajectories, a probability model of the whole trajectory distribution, without giving a concrete policy expression for new states, i.e. without giving a concrete way to determine the action to take in a new state. In the present invention the policy is described by a neural network, which can approximate an arbitrary function to any accuracy and has outstanding generalization ability. Through the state-feature representation, states that are not contained in the demonstration trajectories can also be represented; moreover, by feeding the corresponding state features into the neural network, the corresponding action values can be computed and the appropriate action chosen according to the policy. In this way the problem that conventional methods cannot generalize beyond the driving demonstration data to driving-scene states not traversed is resolved.
The above is only a preferred embodiment of the invention, and the protection scope of the invention is not limited to it. Any equivalent substitution or change made, within the technical scope disclosed by the invention, by a person skilled in the art according to the technical solution of the invention and its inventive concept shall be covered by the protection scope of the invention.

Claims (10)

1. A system for constructing a driving strategy based on the driving environment, characterized by specifically including: a feature extractor, which extracts the features used to construct the reward function; a reward function generator, which obtains the driving strategy; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which judges whether the optimal driving strategy constructed by the getter meets the judgment criterion; if not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met, finally obtaining a driving strategy that describes the true driving demonstrations;
the reward function generator includes a module for obtaining the expert driving demonstration data, a module for computing the feature expectation of the driving demonstrations, a module for computing the state-action set under the greedy policy, and a module for solving the weights of the reward function.
2. The system for constructing a driving strategy based on the driving environment according to claim 1, characterized in that the module for obtaining the expert driving demonstration data is specifically: the driving demonstration data come from sampling the demonstration driving video; a continuous driving video is sampled at a fixed frequency, giving one group of trajectory demonstrations; one set of expert demonstration data contains multiple trajectories and is written as a whole as:
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}, with M = Σ_{i=1}^{N_T} L_i, where D_E denotes all the driving demonstration data, (s_j, a_j) is the data pair formed by state j and the decision instruction corresponding to that state, M is the total number of demonstration data pairs, N_T is the number of demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th demonstration trajectory.
3. The system for constructing a driving strategy based on the driving environment according to claim 1, characterized in that the module for computing the feature expectation of the driving demonstrations is specifically: first, each state s_t describing the driving-environment situation in the demonstration data D_E is fed into the state feature extractor, giving the feature vector f(s_t, a_t) of state s_t, which denotes the group of driving-environment scene feature values, corresponding to s_t, that influence the driving decision; the feature expectation of the driving demonstrations is then computed as
μ_E = (1/N_T) Σ_{i=1}^{N_T} Σ_t γ^t f(s_t, a_t)
where γ is the discount factor, set according to the problem at hand.
4. The system for constructing a driving strategy based on the driving environment according to claim 1, characterized in that the module for computing the state-action set under the greedy policy is specifically: the reward function generator and the driving strategy getter are the two parts of one loop;
first, the neural network in the driving strategy getter is obtained: the state features f(s_t) describing the environment, extracted from the driving demonstration data D_E, are fed into the neural network, which produces the output g_w(s_t); g_w(s_t) is the group of Q values for the state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, where Q(s_t, a_i) is a state-action value describing how good it is to choose the decision driving action a_i in the current driving-scene state s_t, obtained from Q(s, a) = θ·μ(s, a), where θ denotes the weights of the current reward function and μ(s, a) denotes the feature expectation;
then, based on the ε-greedy policy, the driving decision action a_t corresponding to the driving-scene state s_t is chosen: either the action with the largest Q value in the Q-value set of the current driving scene s_t is chosen, or an action is chosen at random; after a_t is chosen, the corresponding Q(s_t, a_t) is recorded;
thus, for the state features f(s_t, a_t) of every state in the demonstrations D_E, the neural network is applied, yielding in total M state-action pairs (s_t, a_t), which describe the driving decision action a_t chosen in the driving-scene state s_t at time t; at the same time, based on the chosen actions, the Q values of the M corresponding state-action pairs are obtained and recorded as Q.
5. The system for constructing a driving strategy based on the driving environment according to claim 1, characterized in that the module for solving the weights of the reward function is specifically:
first, the objective function J(θ) is built from the following components: a loss term that, for the current state-action pair, is 0 if the pair appears in the driving demonstrations and 1 otherwise; the corresponding state-action values Q recorded above; the product θ·μ_E of the demonstration feature expectation, computed in the feature-expectation module, and the weights θ of the reward function; and a regularization term;
the objective function is minimized by gradient descent, t = min_θ J(θ), obtaining the variable θ that minimizes it; this θ is the required weight vector of the reward function.
6. The system for constructing a driving strategy based on the driving environment according to claim 5, characterized in that, based on the obtained reward-function weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
7. The system for constructing a driving strategy based on the driving environment according to claim 1, characterized in that the specific implementation process of the driving strategy getter is:
S31. Build the training data of the driving strategy getter:
training data are obtained, each item containing two parts: one is the driving-decision feature f(s_t, a_t) obtained by feeding the driving-scene state at time t into the driving-state extractor; the other is a target value computed from r_θ(s_t, a_t), Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}), where r_θ(s_t, a_t) is the reward function generated by the reward function generator from the driving demonstration data, and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) come from the Q values recorded in the module for computing the state-action set under the greedy policy, namely the Q value of the driving scene s_t at time t and the Q value of the driving scene s_{t+1} at time t+1;
S32. establish the neural network;
S33. optimize the neural network.
8. The system for constructing a driving strategy based on the driving environment according to claim 7, characterized in that the neural network in step S32 has three layers; the first layer is the input layer, whose number of neurons equals the number k of feature types output by the feature extractor and which takes the features f(s_t, a_t) of the driving scene as input; the hidden (second) layer has 10 neurons, and the number of neurons in the third layer equals the number n of decision driving actions in the action space; the activation function of the input layer and the hidden layer is the sigmoid function, sigmoid(x) = 1/(1 + e^(-x)), so that:
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) are the weights of the hidden layer; f_t is the feature of the driving-scene state s_t at time t, i.e. the input of the neural network; z is the output of the network layer before the hidden-layer sigmoid activation; h is the hidden-layer output after the sigmoid activation; and w^(2) are the weights of the output layer;
the network output g_w(s_t) is the set of Q values of the driving-scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, and the Q^π(s_t, a_t) in S31 is obtained by feeding the state s_t into the neural network and selecting the entry for a_t in the output.
9. The system for constructing a driving strategy based on the driving environment according to claim 7, characterized in that the loss function used to optimize the neural network is the cross-entropy cost function, built from the following components: N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by feeding the driving-scene state s_t at time t into the neural network and selecting the entry of the corresponding driving decision action a_t in the output; the target value computed in S31; and a regularization term over W = {w^(1), w^(2)}, the weights of the neural network above;
the training data obtained in S31 are fed into the network to evaluate the cost function; the minimization of the cross-entropy cost is completed by gradient descent, yielding the optimized neural network and thus the driving strategy getter.
10. The system for constructing a driving strategy based on the driving environment according to claim 1, characterized in that the implementation process of the judging device includes:
the current reward function generator and driving strategy getter are regarded as one whole, and the current value of t in the module for computing the feature expectation of the driving demonstrations is checked against t < ε, where ε is the threshold that decides whether the objective function meets the demand, i.e. whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set differently according to specific needs;
if the value of t does not satisfy the inequality, the reward function generator must be rebuilt: the neural network needed in the module for computing the state-action set under the greedy policy is replaced by the new neural network optimized in S33, i.e. the network used to generate the values Q(s_t, a_i) describing how good the chosen decision driving action a_i is in the driving-scene state s_t is replaced by the new network structure optimized by gradient descent in S33; the reward function generator is then rebuilt, the driving strategy getter is obtained again, and the value of t is checked again;
if the inequality is satisfied, the current θ is the required weight vector of the reward function; the reward function generator meets the requirement and so does the driving strategy getter; then the driving data of the driver for whom a driving model is to be established are collected, i.e. the environment scene images during driving and the corresponding operation data, and fed into the driving-environment feature extractor to obtain the decision features of the current scene; the extracted features are then fed into the reward function generator to obtain the reward function of the corresponding scene state; finally the obtained decision features and the computed reward function are fed into the driving strategy getter, obtaining the driving strategy corresponding to that driver.
CN201810662039.8A 2018-06-25 2018-06-25 System for constructing driving strategy based on driving environment Active CN108791308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810662039.8A CN108791308B (en) 2018-06-25 2018-06-25 System for constructing driving strategy based on driving environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810662039.8A CN108791308B (en) 2018-06-25 2018-06-25 System for constructing driving strategy based on driving environment

Publications (2)

Publication Number Publication Date
CN108791308A true CN108791308A (en) 2018-11-13
CN108791308B CN108791308B (en) 2020-05-19

Family

ID=64070762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810662039.8A Active CN108791308B (en) 2018-06-25 2018-06-25 System for constructing driving strategy based on driving environment

Country Status (1)

Country Link
CN (1) CN108791308B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449589A (en) * 2021-05-16 2021-09-28 桂林电子科技大学 Method for calculating driving strategy of unmanned automobile in urban traffic scene

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013225192A (en) * 2012-04-20 2013-10-31 Nippon Telegr & Teleph Corp <Ntt> Reward function estimation apparatus, reward function estimation method and program
EP2990997A2 (en) * 2014-08-27 2016-03-02 Chemtronics Co., Ltd Method and apparatus for controlling vehicle using motion recognition with face recognition
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN107168303A (en) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 A kind of automatic Pilot method and device of automobile
CN107229973A (en) * 2017-05-12 2017-10-03 中国科学院深圳先进技术研究院 The generation method and device of a kind of tactful network model for Vehicular automatic driving
CN107679557A (en) * 2017-09-19 2018-02-09 平安科技(深圳)有限公司 Driving model training method, driver's recognition methods, device, equipment and medium
CN108108657A (en) * 2017-11-16 2018-06-01 浙江工业大学 A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2013225192A (en) * 2012-04-20 2013-10-31 Nippon Telegr & Teleph Corp <Ntt> Reward function estimation apparatus, reward function estimation method and program
EP2990997A2 (en) * 2014-08-27 2016-03-02 Chemtronics Co., Ltd Method and apparatus for controlling vehicle using motion recognition with face recognition
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN107168303A (en) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 A kind of automatic Pilot method and device of automobile
CN107229973A (en) * 2017-05-12 2017-10-03 中国科学院深圳先进技术研究院 The generation method and device of a kind of tactful network model for Vehicular automatic driving
CN107679557A (en) * 2017-09-19 2018-02-09 平安科技(深圳)有限公司 Driving model training method, driver's recognition methods, device, equipment and medium
CN108108657A (en) * 2017-11-16 2018-06-01 浙江工业大学 A kind of amendment local sensitivity Hash vehicle retrieval method based on multitask deep learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王勇鑫 et al.: "基于轨迹分析的自主导航性能评估方法" [Performance evaluation method for autonomous navigation based on trajectory analysis], 《计算机工程》 [Computer Engineering] *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113449589A (en) * 2021-05-16 2021-09-28 桂林电子科技大学 Method for calculating driving strategy of unmanned automobile in urban traffic scene
CN113449589B (en) * 2021-05-16 2022-11-15 桂林电子科技大学 Method for calculating driving strategy of unmanned vehicle in urban traffic scene

Also Published As

Publication number Publication date
CN108791308B (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN108819948A (en) Driving behavior modeling method based on reverse intensified learning
CN108791302A (en) Driving behavior modeling
CN107833183B (en) Method for simultaneously super-resolving and coloring satellite image based on multitask deep neural network
CN108920805A (en) Driving behavior modeling with state feature extraction functions
CN106966298B (en) Assembled architecture intelligence hanging method based on machine vision and system
CN107610123A (en) A kind of image aesthetic quality evaluation method based on depth convolutional neural networks
CN109682392A (en) Vision navigation method and system based on deeply study
CN107808132A (en) A kind of scene image classification method for merging topic model
CN108288035A (en) The human motion recognition method of multichannel image Fusion Features based on deep learning
CN108891421A (en) A method of building driving strategy
CN107729819A (en) A kind of face mask method based on sparse full convolutional neural networks
CN107909008A (en) Video target tracking method based on multichannel convolutive neutral net and particle filter
CN109464803A (en) Virtual objects controlled, model training method, device, storage medium and equipment
CN111461325B (en) Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem
CN107240085A (en) A kind of image interfusion method and system based on convolutional neural networks model
CN107351080A (en) A kind of hybrid intelligent research system and control method based on array of camera units
CN108944940A (en) Driving behavior modeling method neural network based
CN110097110A (en) A kind of semantic image restorative procedure based on objective optimization
DiPaola et al. Using artificial intelligence techniques to emulate the creativity of a portrait painter
CN111282272B (en) Information processing method, computer readable medium and electronic device
CN108875555A (en) Video interest neural network based region and well-marked target extraction and positioning system
CN111259950A (en) Method for training YOLO neural network based on 3D model
CN108791308A (en) The system for building driving strategy based on driving environment
CN112121419B (en) Virtual object control method, device, electronic equipment and storage medium
CN111445024B (en) Medical image recognition training method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
OL01 Intention to license declared
OL01 Intention to license declared