CN108791308A - The system for building driving strategy based on driving environment - Google Patents
- Publication number
- CN108791308A (application number CN201810662039.8A)
- Authority
- CN
- China
- Prior art keywords
- driving
- state
- strategy
- reward function
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/048—Activation functions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W50/00—Details of control systems for road vehicle drive control not related to the control of a particular sub-unit, e.g. process diagnostic or vehicle driver interfaces
- B60W2050/0001—Details of the control system
- B60W2050/0019—Control system elements or transfer functions
Abstract
The invention discloses a system for building a driving strategy based on the driving environment, comprising: a feature extractor, which extracts the features used to build the reward function; a reward function generator; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which checks whether the optimal driving strategy built by the getter meets the judgment criterion. If it does not, the reward function is rebuilt and the optimal driving strategy is constructed again, iterating until the criterion is met, finally yielding a driving strategy that describes the true driving demonstrations. The reward function generator comprises a module that obtains the expert's demonstration data, a module that computes the feature expectation of the demonstrations, a module that computes the state-action set under a greedy strategy, and a module that solves for the weights of the reward function. The system can handle previously unseen state scenes and produce their corresponding actions, which greatly improves the generalization ability of the resulting driver behavior model; it applies to a wider range of scenes and is more robust.
Description
Technical field
The present invention relates to a system for building a driving strategy based on the driving environment.
Background art
Traditional driver strategy models built with reinforcement learning analyze, describe, and reason about driving behavior from known driving data. The collected driving data, however, can never cover the inexhaustible variety of driving behavior, so it is impossible to obtain the corresponding action for every state. In real driving scenes, weather, scenery, objects, and driving conditions vary in countless ways, and traversing all states is infeasible. Traditional driver behavior models therefore generalize poorly, rest on many modeling assumptions, and lack robustness.
Moreover, when the reward function is hand-designed by researchers, too many competing feature demands must be balanced. The process depends entirely on the researcher's experience, requires repeated manual tuning, is time-consuming, and, most fatally, is overly subjective. Across different scenes and environments a researcher faces far too many scene states; even for one fixed scene state, different demands lead to different driving behavior. Describing the driving task accurately requires assigning a precise weight to each of these factors. Existing probabilistic inverse reinforcement learning methods start from the available demonstration data, estimate the distribution of that data, and choose the action for each state on that basis. But the distribution of the known data cannot represent the distribution of all data; obtaining the correct distribution would again require the corresponding actions of all states.
Summary of the invention
Given the problem in the prior art that, for driving scenes without demonstration data, no corresponding reward function can be established for modeling driving behavior, the present application provides a system for building a driving strategy based on the driving environment. It can handle new state scenes and produce their corresponding actions; it applies to a wider range of scenes and is more robust.
To achieve the above goal, the technical scheme of the present invention is a system for building a driving strategy based on the driving environment, comprising: a feature extractor, which extracts the features used to build the reward function; a reward function generator; a driving strategy getter, which completes the construction of the driving strategy; and a judging device, which checks whether the optimal driving strategy built by the getter meets the judgment criterion. If not, the reward function is rebuilt and the optimal strategy constructed again, iterating until the criterion is met, finally yielding a driving strategy that describes the true driving demonstrations.
The reward function generator comprises a module that obtains the expert's demonstration data, a module that computes the feature expectation of the demonstrations, a module that computes the state-action set under a greedy strategy, and a module that solves for the weights of the reward function.
Further, the module that obtains the expert's demonstration data works as follows. The demonstration data come from sampling a demonstration driving video: one continuous driving video is sampled at a fixed frequency, yielding one trajectory demonstration; one expert demonstration contains several trajectories. The whole data set is denoted
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}
where D_E is the complete demonstration data set, (s_j, a_j) is the pair formed by state j and the decision instruction taken in that state, M is the total number of demonstration pairs, N_T is the number of demonstration trajectories, and L_i is the number of state-decision pairs (s_j, a_j) contained in the i-th demonstration trajectory, so that M = L_1 + ... + L_{N_T}.
Further, the module that computes the feature expectation of the demonstrations works as follows. Each state s_t in the demonstration data D_E describing the driving environment is fed into the state feature extractor, which outputs the feature vector f(s_t, a_t), one group of feature values of the driving environment scene for s_t that influence the driving decision. The feature expectation of the demonstrations is then computed as
μ_E = (1/N_T) Σ_{i=1}^{N_T} Σ_{t=0}^{L_i − 1} γ^t f(s_t, a_t)
where γ is the discount factor, set according to the problem at hand.
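The feature-expectation computation above can be sketched in a few lines (a minimal illustration, assuming each trajectory is given as a time-ordered list of feature vectors f(s_t, a_t); the function name is ours, not the patent's):

```python
import numpy as np

def feature_expectation(trajectories, gamma=0.9):
    """Discounted feature expectation mu_E of the expert demonstrations.

    trajectories: list of trajectories, each a time-ordered list of
    feature vectors f(s_t, a_t). Returns the average discounted
    feature sum over the N_T trajectories.
    """
    mu = np.zeros(len(trajectories[0][0]))
    for traj in trajectories:
        for t, f in enumerate(traj):
            mu += (gamma ** t) * np.asarray(f, dtype=float)
    return mu / len(trajectories)
```
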
Further, the module that computes the state-action set under the greedy strategy works as follows. The reward function generator and the driving strategy getter are the two parts of one cycle.
First, the neural network inside the driving strategy getter is obtained. The state features f(s_t) describing the environment, extracted from the demonstration data D_E, are fed into the network, which outputs g_w(s_t), a set of Q-values for the state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T. Each Q(s_t, a_i) is a state-action value describing how good it is to choose the decision driving action a_i in the current driving scene state s_t; it can be obtained from Q(s, a) = θ^T μ(s, a), where θ denotes the weights in the current reward function and μ(s, a) denotes the feature expectation.
An ε-greedy strategy is then applied to choose the driving decision action a_t for the scene state s_t: under the ε-greedy rule, the action with the largest Q-value in the set for s_t is chosen, and otherwise an action is chosen at random. Once a_t is chosen, the value Q(s_t, a_t) is recorded.
Thus, feeding the state feature f(s_t, a_t) of every state in D_E into the network yields M state-action pairs (s_t, a_t), each describing the driving decision action a_t chosen in the scene state s_t at time t. Based on these choices, the Q-values of the M state-action pairs are obtained and recorded as Q.
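The ε-greedy choice described above can be sketched as follows (a minimal illustration; in the embodiment ε = 0.5, and we take "greedy with probability 1 − ε" as the standard reading):

```python
import random

def epsilon_greedy(q_values, epsilon=0.5, rng=random):
    """Pick an action index from a list of Q-values.

    With probability 1 - epsilon take the greedy (max-Q) action;
    otherwise take a uniformly random one, as in the patent's
    state-action-set module (epsilon = 0.5 in the embodiment).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])
```
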
Further, the module that solves for the weights of the reward function works as follows. First an objective function is built. Its terms are: a loss function l(s_t, a_t) that is 0 if the current state-action pair occurs among the driving demonstrations and 1 otherwise; the recorded state-action values Q^π(s_t, a_t); the product θ^T μ_E of the reward weights θ and the demonstration feature expectation computed in the module above; and a regularization term. A form consistent with these terms is
J(θ) = Σ_{t=1}^{M} [ l(s_t, a_t) + Q^π(s_t, a_t) ] − θ^T μ_E + λ‖θ‖
The objective is minimized by gradient descent, t = min_θ J(θ); the variable θ that minimizes it is the weight vector of the sought reward function.
Further, based on the obtained reward weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
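Given the solved weights θ, the generator is just the linear map r(s, a) = θ^T f(s, a); a minimal sketch:

```python
import numpy as np

def make_reward(theta):
    """Return the linear reward function r(s, a) = theta^T f(s, a)."""
    theta = np.asarray(theta, dtype=float)
    def reward(f_sa):
        # f_sa is the feature vector the state feature extractor
        # produced for the state-action pair (s, a).
        return float(theta @ np.asarray(f_sa, dtype=float))
    return reward
```
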
Further, the driving strategy getter is implemented as follows:
S31. Build the training data of the driving strategy getter. Each training datum has two parts: one is the driving decision feature f(s_t) obtained by feeding the driving scene state at time t into the driving state extractor; the other is a target value built from the reward and the recorded Q-values, in the one-step form
y_t = r_θ(s_t, a_t) + γ Q^π(s_{t+1}, a_{t+1})
where r_θ(s_t, a_t) is given by the reward function generator built from the demonstration data, and Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) come from the Q-values recorded in the module that computes the state-action set under the greedy strategy: the Q-value describing the driving scene s_t at time t and the Q-value describing the driving scene s_{t+1} at time t+1 are selected.
S32. Establish the neural network.
S33. Optimize the neural network.
Further, the neural network in step S32 has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it receives the driving scene feature f(s_t, a_t). The second, hidden layer has 10 neurons. The number of neurons in the third layer equals the number n of decision driving actions in the action space. The activation function of the input layer and the hidden layer is the sigmoid function, sigmoid(x) = 1/(1 + e^{−x}), so that:
z = w^{(1)} x = w^{(1)} [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^{(2)} [1, h]^T)
where w^{(1)} are the weights of the hidden layer; f_t is the feature of the driving scene state s_t at time t, i.e. the network input; z is the hidden layer's output before the sigmoid activation; h is the hidden-layer output after the sigmoid activation; and w^{(2)} are the weights of the output layer.
The network output g_w(s_t) is the set of Q-values of the driving scene state s_t at time t, [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the value Q^π(s_t, a_t) in S31 is obtained by feeding s_t into the network and selecting the entry for a_t in the output.
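The three equations above amount to the following forward pass (a sketch under the stated shapes, with the bias handled by prepending a constant 1 as in [1, f_t]^T):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w1, w2, f_t):
    """Forward pass of the three-layer Q-network of step S32.

    w1: hidden-layer weights, shape (10, k + 1); the extra column
        multiplies the constant bias input 1.
    w2: output-layer weights, shape (n, 11).
    f_t: k-dimensional state feature of the driving scene at time t.
    Returns g_w(s_t) = [Q(s_t, a_1), ..., Q(s_t, a_n)].
    """
    x = np.concatenate(([1.0], np.asarray(f_t, dtype=float)))  # [1, f_t]
    z = w1 @ x                       # hidden layer before the sigmoid
    h = sigmoid(z)                   # hidden-layer output
    return sigmoid(w2 @ np.concatenate(([1.0], h)))
```
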
As a further feature, the loss function established for optimizing the neural network is the cross-entropy cost function
C = −(1/N) Σ_t [ y_t ln Q^π(s_t, a_t) + (1 − y_t) ln(1 − Q^π(s_t, a_t)) ] + (λ/2N) ‖W‖²
where N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by feeding the driving scene state s_t at time t into the network and selecting the output entry for the corresponding driving decision action a_t; y_t is the value computed in S31; and the last term is the regularizer, in which W = {w^{(1)}, w^{(2)}} denotes the weights of the network above.
The training data obtained in S31 are fed into the network and the cost function is minimized by gradient descent; the optimized network then yields the driving strategy getter.
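Under one plausible reading (targets squashed into (0, 1) so that the cross-entropy is well defined), the cost of step S33 can be sketched as:

```python
import numpy as np

def cost(q_pred, y, weights, lam=0.9):
    """Regularized cross-entropy cost of step S33 (one plausible reading).

    q_pred: network outputs Q^pi(s_t, a_t) for the chosen actions, in (0, 1).
    y:      target values built in S31, assumed squashed into (0, 1).
    weights: list of weight matrices W = {w1, w2}; lam is the
             regularization coefficient (0.9 in the embodiment).
    """
    q = np.asarray(q_pred, dtype=float)
    t = np.asarray(y, dtype=float)
    n = len(q)
    ce = -np.mean(t * np.log(q) + (1 - t) * np.log(1 - q))
    reg = lam / (2 * n) * sum(float(np.sum(w ** 2)) for w in weights)
    return ce + reg
```
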
As a further feature, the judging device works as follows. The current reward function generator and driving strategy getter are treated as one whole, and the current value t obtained above (t = min_θ J(θ)) is checked against the condition t < ε, where ε is a threshold that judges whether the objective function meets the demand, i.e., whether the reward function currently used to obtain the driving strategy is satisfactory; its value is set according to the specific needs.
When the value of t does not satisfy the condition, the reward function generator must be rebuilt: the neural network needed in the module that computes the state-action set under the greedy strategy is replaced with the new network obtained after the optimization of S33, i.e., the network that generates the values Q(s_t, a_i) describing how good the decision driving action a_i is in driving scene state s_t is replaced with the new network structure optimized by gradient descent in S33. The reward function generator is then rebuilt, the driving strategy getter obtained, and the value of t judged again.
When the condition is satisfied, the current θ is the weight vector of the sought reward function; the reward function generator and the driving strategy getter both meet the requirements. The driving data of the driver whose model is to be established, i.e., the environment scene images during driving and the corresponding operation data, are then collected and fed into the driving environment feature extractor to obtain the decision features of the current scene. The extracted features are fed into the reward function generator to obtain the reward function of the corresponding scene state. Finally, the decision features and the computed reward function are fed into the driving strategy getter, which outputs the driver's corresponding driving strategy.
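The alternation the judging device drives can be sketched structurally as follows (the four callables stand in for the patent's modules; their names are hypothetical, not the patent's API):

```python
def build_driving_strategy(init_network, greedy_rollout, fit_reward_weights,
                           optimize_network, threshold, max_iters=100):
    """Outer loop of the system: alternate between fitting the reward
    weights theta and re-optimizing the Q-network, stopping once the
    objective value t falls below the judging device's threshold."""
    net = init_network()                       # fresh Q-network (S32)
    theta = None
    for _ in range(max_iters):
        records = greedy_rollout(net)          # state-action set module
        theta, t = fit_reward_weights(records) # t = min_theta J(theta)
        if t < threshold:                      # judging device check
            break
        net = optimize_network(theta, net)     # S31-S33 on the new reward
    return theta, net
```
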
Compared with the prior art, the advantageous effect of the present invention is as follows. In real driving, weather, scenery, and similar factors give the various driving scenes a correspondingly large state space. Thanks to the neural network's outstanding ability to approximate arbitrary functions, the strategy can be treated approximately as a black box: the feature values of a state are input, the corresponding state-action values are output, and the action is chosen according to those outputs, yielding the action for that state. This greatly broadens the applicability of modeling driving behavior by inverse reinforcement learning. Conventional methods fit the demonstration trajectories with some probability distribution, so the optimal strategies they obtain remain limited to the states already present in the demonstration trajectories. The present invention, by contrast, also applies to new state scenes and obtains their corresponding actions, greatly improving the generalization ability of the resulting driver behavior model; it applies to a wider range of scenes and is more robust.
Description of the drawings
Fig. 1 shows the new deep convolutional neural network;
Fig. 2 shows a sample frame of the driving video;
Fig. 3 is a flow diagram of the working method of the system in embodiment 1;
Fig. 4 shows the structure of the neural network established in step S32.
Specific embodiments
The invention is further described below with reference to the accompanying drawings. The following embodiments are only intended to illustrate the technical scheme of the present invention clearly and do not limit its scope of protection.
This embodiment provides the system for building a driving strategy based on the driving environment, with the following steps.
The feature extractor extracts the features used to build the reward function:
S11. While the vehicle is travelling, the driving video obtained by a camera placed behind the vehicle's windshield is sampled; a sample frame is shown in Fig. 2. Pictures of N groups of different road environments and road conditions, together with the corresponding steering angles, are obtained, comprising N1 straight-road and N2 curve samples, where the values can be N1 ≥ 300 and N2 ≥ 3000; combined with the corresponding driving operation data, these form the training data.
S12. The collected images are translated, cropped, brightness-adjusted, and otherwise transformed to simulate scenes under different illumination and weather.
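The transformations of step S12 can be sketched with plain NumPy (illustrative parameter values; a real pipeline would randomize them per image):

```python
import numpy as np

def augment(img, shift=5, crop=4, brightness=1.2):
    """Simulate different lighting/weather as in step S12: shift the
    image horizontally, crop a border, and scale brightness.
    img is an H x W x 3 float array with values in [0, 1]."""
    out = np.roll(img, shift, axis=1)           # horizontal translation
    out = out[crop:-crop, crop:-crop]           # central crop
    out = np.clip(out * brightness, 0.0, 1.0)   # brightness change
    return out
```
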
S13. A convolutional neural network is built and trained with the processed pictures as input and the operation data of each picture as its label value; an optimization method based on the Nadam optimizer minimizes the mean-squared-error loss to find the optimal weight parameters of the network.
The convolutional neural network comprises 1 input layer, 3 convolutional layers, 3 pooling layers, and 4 fully connected layers. The input layer connects in order to the first convolutional layer and first pooling layer, then the second convolutional layer and second pooling layer, then the third convolutional layer and third pooling layer, followed in order by the first, second, third, and fourth fully connected layers.
S14. The network structure and weights of the trained convolutional neural network, with the final output layer removed, are saved to establish a new convolutional neural network; this completes the state feature extractor.
The reward function generator obtains the driving strategy:
In the acquisition of a driving strategy, the reward function is the criterion by which actions are selected in the reinforcement learning method. Its quality is decisive: it directly determines the quality of the obtained driving strategy, and whether that strategy matches the strategy of the true driving demonstrations. The formula of the reward function is reward = θ^T f(s_t, a_t), where f(s_t, a_t) denotes one group of feature values describing the state s_t of the driving environment scene (the vehicle's surroundings) at time t that influence the driving decision, and θ denotes the corresponding group of weights of those features; the magnitude of a weight expresses the proportion of the corresponding environmental feature in the reward function and thus its importance. On the basis of the state feature extractor, the weights θ must be solved in order to build the reward function that shapes the driving strategy.
The module that obtains the expert's demonstration data: the demonstration data come from sampling a demonstration driving video (different from the data used for the driving environment feature extractor above). A continuous driving video can be sampled at a frequency of 10 Hz, yielding one trajectory demonstration; one expert demonstration should contain several trajectories. The whole set is denoted
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}
where D_E is the complete demonstration data set; (s_j, a_j) is the pair formed by state j (the video picture of the driving environment at sampling time j) and the decision instruction taken in that state (e.g. the steering angle of a steering instruction); M is the total number of demonstration pairs; N_T is the number of demonstration trajectories; and L_i is the number of state-decision pairs (s_j, a_j) in the i-th demonstration trajectory.
The module that computes the feature expectation of the demonstrations: each state s_t in D_E describing the driving environment is fed into the state feature extractor, yielding the feature vector f(s_t, a_t), one group of feature values for s_t that influence the driving decision. The feature expectation of the demonstrations is then computed as
μ_E = (1/N_T) Σ_{i=1}^{N_T} Σ_{t=0}^{L_i − 1} γ^t f(s_t, a_t)
where γ is the discount factor, set according to the problem; a reference value is 0.65.
The module that computes the state-action set under the greedy strategy: first, the neural network in the driving strategy getter of S32 is obtained. (The reward function generator and the driving strategy getter are the two parts of one cycle. At the very beginning, the network is the freshly initialized network of S32. As the cycle proceeds, each step consists of completing the construction of the reward function that shapes the driving decisions, obtaining the corresponding optimal driving strategy from the current reward function, and judging whether the criterion for ending the cycle is met; if not, the network optimized by the process of S33 is plugged in and the reward function rebuilt.)
The state features f(s_t, a_t) describing the environment, extracted from the demonstration data D_E, are fed into the network, producing the output g_w(s_t), a set of Q-values for the state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T. Each Q(s_t, a_i) is a state-action value describing how good it is to choose the decision driving action a_i in the current driving scene state s_t; it can be obtained from Q(s, a) = θ^T μ(s, a), where θ denotes the weights in the current reward function and μ(s, a) denotes the feature expectation.
An ε-greedy strategy is then applied, with ε set to 0.5, to choose the driving decision action a_t for the scene state s_t: with a probability of 50 percent, the action with the largest Q-value in the set for s_t is chosen; otherwise an action is chosen at random. Once a_t is chosen, the value Q(s_t, a_t) is recorded.
Thus, feeding the state feature f(s_t, a_t) of every state in D_E into the network yields M state-action pairs (s_t, a_t), each describing the driving decision action a_t chosen in the scene state s_t at time t. Based on these choices, the Q-values of the M state-action pairs are obtained and recorded as Q.
The module that solves for the weights of the reward function: first an objective function is built from the following terms: a loss function l(s_t, a_t) that is 0 if the current state-action pair occurs among the driving demonstrations and 1 otherwise; the recorded state-action values Q^π(s_t, a_t); the product θ^T μ_E of the weights and the demonstration feature expectation; and a regularization term that prevents overfitting, whose coefficient can be 0.9. A form consistent with these terms is
J(θ) = Σ_{t=1}^{M} [ l(s_t, a_t) + Q^π(s_t, a_t) ] − θ^T μ_E + λ‖θ‖
The objective is minimized by gradient descent, t = min_θ J(θ); the variable θ that minimizes it is the weight vector of the sought reward function.
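The minimization step t = min_θ J(θ) can be sketched with a generic finite-difference gradient descent (J here is any supplied scalar objective, not the patent's specific one):

```python
import numpy as np

def gradient_descent(j, theta0, lr=0.1, steps=500, eps=1e-6):
    """Minimize a scalar objective J(theta) by plain gradient descent,
    estimating the gradient with central finite differences.
    Returns the minimizing theta and the objective value t there."""
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            d = np.zeros_like(theta)
            d[i] = eps
            grad[i] = (j(theta + d) - j(theta - d)) / (2 * eps)
        theta -= lr * grad
    return theta, j(theta)
```
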
Based on the obtained reward weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
The driving strategy getter completes the construction of the driving strategy, as follows.
S31. Build the training data of the driving strategy getter. The data come from the sampling of the demonstration data above, processed into a new type of datum, N in total. Each datum has two parts: one is the driving decision feature f(s_t) obtained by feeding the driving scene state at time t into the driving state extractor; the other is the target value
y_t = r_θ(s_t, a_t) + γ Q^π(s_{t+1}, a_{t+1})
where r_θ(s_t, a_t) is given by the reward function generator built from the demonstration data, and the Q-value describing the driving scene s_t at time t and the Q-value describing the driving scene s_{t+1} at time t+1 are selected from the recorded values.
S32. Establish the neural network. The network has three layers. The first layer is the input layer; its number of neurons equals the number k of feature types output by the feature extractor, and it receives the driving scene feature f(s_t, a_t). The second, hidden layer has 10 neurons; the number of neurons in the third layer equals the number n of decision driving actions in the action space. The activation function of the input layer and the hidden layer is the sigmoid function, sigmoid(x) = 1/(1 + e^{−x}), so that:
z = w^{(1)} x = w^{(1)} [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^{(2)} [1, h]^T)
where w^{(1)} denotes the weights of the hidden layer; f_t denotes the feature of the driving scene state s_t at time t, i.e. the network input; z denotes the hidden layer's output before the sigmoid activation; h denotes the hidden-layer output after the sigmoid activation; and w^{(2)} denotes the weights of the output layer. The network structure is shown in Fig. 4.
The network output g_w(s_t) is the set of Q-values of the driving scene state s_t at time t, [Q(s_t, a_1), ..., Q(s_t, a_n)]^T; the value Q^π(s_t, a_t) in S31 is obtained by feeding s_t into the network and selecting the entry for a_t in the output.
S33. Optimize the neural network. The loss function established for the optimization is the cross-entropy cost function
C = −(1/N) Σ_t [ y_t ln Q^π(s_t, a_t) + (1 − y_t) ln(1 − Q^π(s_t, a_t)) ] + (λ/2N) ‖W‖²
where N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by feeding the driving scene state s_t at time t into the network and selecting the output entry for the corresponding driving decision action a_t; y_t is the value computed in S31; and the last term is the regularizer, set to prevent overfitting, whose coefficient may likewise be 0.9, with W = {w^{(1)}, w^{(2)}} denoting the weights of the network above.
The training data obtained in S31 are fed into the network, and the cost function is minimized by gradient descent; the optimized network yields the driving strategy getter.
Judging device regards current Reward Program generator and driving strategy getter as an entirety, checks current t
Value, if meet t < ε, ε be judge object function whether the threshold value of meet demand, that is, judge to be currently used in acquisition driving
Whether the Reward Program of strategy meets the requirements.Its numerical value carries out different settings according to specific needs.
When the value of t does not satisfy this inequality, the reward function generator needs to be rebuilt. The current neural network is replaced with the new neural network that has been optimized, i.e., the network generating the values Q(s_t, a_i), which describe the quality of selecting the decision driving action a_i under driving scene state s_t, is replaced with the new network structure optimized by the gradient descent method in S33. The reward function generator is then rebuilt, the driving strategy getter is obtained, and whether the value of t meets the demand is judged again.
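The judging device's outer loop described above can be sketched generically. Here `rebuild_step` is a hypothetical stand-in for rebuilding the reward function generator and driving strategy getter and recomputing t; it is not named in the patent:

```python
def iterate_until_threshold(initial_t, epsilon, rebuild_step, max_iters=1000):
    """Sketch of the judging loop: rebuild the reward-function generator
    and driving strategy getter until the objective value t falls below
    the threshold epsilon (or an iteration cap is reached)."""
    t = initial_t
    iters = 0
    while t >= epsilon and iters < max_iters:
        t = rebuild_step(t)  # rebuild generator/getter, recompute t
        iters += 1
    return t, iters
```

The `max_iters` cap is an implementation convenience, not part of the patent's description.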
When the inequality is satisfied, the current θ is the weight vector of the required reward function. The reward function generator then meets the requirements, and so does the driving strategy getter. One can then collect the driving data of the driver for whom a driving model is to be established, i.e., the environment scene images during driving and the corresponding operation data, such as the steering angle; input them into the driving environment feature extractor to obtain the decision features of the current scene; input the extracted features into the reward function generator to obtain the reward function of the corresponding scene state; and finally input the acquired decision features and the computed reward function into the driving strategy getter to obtain the driving strategy corresponding to that driver.
In a Markov decision process, a policy needs to link states to their corresponding actions. However, for a large-scale state space it is difficult to depict a definite policy for regions that have not been traversed. Traditional methods also ignore this part: based only on the demonstration trajectories, they build a probabilistic model of the whole trajectory distribution and do not provide a concrete policy representation for new states, i.e., they give no concrete method for deciding which action to take in a new state. In the present invention the policy is described by a neural network, which can approximate an arbitrary function to any accuracy and has outstanding generalization ability. Through the representation of state features, on the one hand those states not included in the demonstration trajectories can be represented; on the other hand, by inputting the corresponding state features into the neural network, the corresponding action values can be obtained, so that the appropriate action can be derived from the policy. Thus the problem that traditional methods cannot generalize driving demonstration data to driving scene states that were not traversed is addressed.
The above are only preferred specific embodiments of the invention, but the protection scope of the invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the invention and according to the technical solution of the invention and its inventive concept, shall be covered by the protection scope of the invention.
Claims (10)
1. A system for building a driving strategy based on the driving environment, characterized by specifically comprising: a feature extractor, which extracts the features for building the reward function; a reward function generator, which generates the reward function; a driving strategy getter, which completes the building of the driving strategy; and a judging device, which judges whether the optimal driving strategy built by the getter meets the judgment criterion; if not, the reward function is rebuilt and the optimal driving strategy is built again, iterating until the judgment criterion is met, finally obtaining the driving strategy describing the true driving demonstration model;
the reward function generator comprises a module for obtaining expert driving example data, a module for seeking the feature expectation value of the driving demonstrations, a module for seeking the state-action set under the greedy strategy, and a module for seeking the weights of the reward function.
2. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for obtaining expert driving example data is specifically: driving example data are extracted by sampling the driving demonstration video data; a continuous section of driving video is sampled at a certain frequency to obtain one group of trajectory demonstrations; one set of expert example data includes multiple trajectories and is denoted in total as:
D_E = {(s_1, a_1), (s_2, a_2), ..., (s_M, a_M)}
where D_E denotes the whole driving example data, (s_j, a_j) denotes the data pair composed of state j and the decision instruction corresponding to that state, M represents the total number of driving example data, N_T represents the number of driving demonstration trajectories, and L_i represents the number of state-decision instruction pairs (s_j, a_j) contained in the i-th driving demonstration trajectory.
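The sampling in claim 2 can be sketched as follows; `video_frames` and `actions` are hypothetical parallel sequences of per-frame states and decision instructions, and `step` encodes the sampling frequency:

```python
def sample_demonstrations(video_frames, actions, step):
    """Sample (state, action) pairs from a driving video at a fixed
    frequency to form one demonstration trajectory, as in claim 2.
    video_frames and actions are assumed to be aligned sequences."""
    return [(video_frames[i], actions[i])
            for i in range(0, len(video_frames), step)]
```

Running this over N_T videos and concatenating the trajectories yields the full set D_E of M state-decision pairs.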
3. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for seeking the feature expectation value of the driving demonstrations is specifically: first, each state s_t describing the driving environment situation in the driving example data D_E is input into the state feature extractor to obtain the feature situation f(s_t, a_t) under the corresponding state s_t, where f(s_t, a_t) denotes a group of driving environment scene feature values corresponding to s_t that influence the driving decision result; the feature expectation value of the driving demonstrations is then calculated based on the following formula:
where γ is the discount factor, set correspondingly according to the problem.
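Although the formula itself appears only as an image in the source, the γ-discounted sum of per-step features it describes (the standard discounted feature expectation) can be sketched as:

```python
import numpy as np

def feature_expectation(trajectory_features, gamma=0.9):
    """Discounted feature expectation of one demonstration trajectory:
    mu = sum_t gamma^t * f(s_t, a_t). The exact formula in the patent
    is not reproduced in the source; this standard form is assumed."""
    mu = np.zeros_like(np.asarray(trajectory_features[0], dtype=float))
    for t, f in enumerate(trajectory_features):
        mu += (gamma ** t) * np.asarray(f, dtype=float)
    return mu
```

Averaging this quantity over all N_T demonstration trajectories would give the feature expectation of the whole set D_E.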
4. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for seeking the state-action set under the greedy strategy is specifically as follows, the reward function generator and the driving strategy getter being two parts of a cycle:
first, from the neural network in the driving strategy getter: the state feature f(s_t) describing the environment situation, extracted from the driving example data D_E, is input into the neural network to obtain the output g_w(s_t); g_w(s_t) is a group of Q values for the described state s_t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, and Q(s_t, a_i) represents a state-action value describing the quality of choosing the decision driving action a_i under the current driving scene state s_t, acquired based on the formula Q(s, a) = θ μ(s, a), where θ denotes the weights of the current reward function and μ(s, a) denotes the feature expectation value;
then, based on the ε-greedy strategy, the driving decision action corresponding to the described driving scene state s_t is chosen: the decision action allowing the maximum Q value in the Q value set for the current driving scene s_t is chosen; otherwise, an action is randomly selected; after the choice is complete, the chosen pair is recorded;
thus, for each state feature f(s_t, a_t) in the driving demonstrations D_E input into the neural network, M state-action pairs (s_t, a_t) are acquired in total, depicting the driving decision action a_t selected under the driving scene state s_t at time t; meanwhile, based on the chosen actions, the Q values of the M corresponding state-action pairs are obtained, denoted Q.
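The ε-greedy choice in claim 4 can be sketched as below: with probability ε a random action index is drawn, otherwise the action with maximal Q value is taken (the `rng` parameter is an implementation convenience, not part of the patent):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng=None):
    """Pick the action with the maximum Q value with probability
    1 - epsilon; otherwise pick a uniformly random action index."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

Applying this to the Q set g_w(s_t) for each demonstrated state yields the M recorded state-action pairs and their Q values.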
5. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the module for seeking the weights of the reward function is specifically:
first, the objective function is built based on the following formula:
with a loss function representing whether the current state-action pair exists among the driving demonstrations, being 0 if it exists and 1 otherwise; the corresponding state-action values recorded above; the product of the driving exemplary feature expectation, sought in the module for seeking the feature expectation value of the driving demonstrations, and the weights θ of the reward function; and a regularization term;
the objective function is minimized by the gradient descent method, i.e. t = min_θ J(θ), obtaining the variable θ that minimizes the objective function; this θ is the sought weight vector of the required reward function.
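The minimization t = min_θ J(θ) in claim 5 uses plain gradient descent. Since J(θ) itself appears only as an image in the source, the sketch below shows only the generic update with a caller-supplied gradient function:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Plain gradient descent: repeatedly step against the gradient of
    the objective J(theta). grad is assumed to return dJ/dtheta; the
    learning rate and step count are illustrative choices."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta
```

For example, minimizing the toy objective (θ - 3)² with gradient 2(θ - 3) converges to θ = 3.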
6. The system for building a driving strategy based on the driving environment according to claim 5, characterized in that, based on the obtained corresponding reward function weights θ, the reward function generator is built according to the formula r(s, a) = θ^T f(s, a).
7. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the specific realization process of the driving strategy getter is:
S31. build the training data of the driving strategy getter:
training data are obtained, each datum including two parts: one is the driving decision feature f(s_t, a_t) obtained by inputting the driving scene state at time t into the driving state extractor; the other is obtained based on the following formula:
where r_θ(s_t, a_t) is the reward function generated by the reward function generator based on the driving example data; Q^π(s_t, a_t) and Q^π(s_{t+1}, a_{t+1}) come from the Q values recorded in the module for seeking the state-action set under the greedy strategy, being respectively the Q value for the driving scene s_t described at time t and the Q value for the driving scene s_{t+1} described at time t+1;
S32. establish the neural network;
S33. optimize the neural network.
8. The system for building a driving strategy based on the driving environment according to claim 7, characterized in that the neural network in step S32 includes three layers: the first layer serves as the input layer, whose number of neurons equals the number k of output feature types of the feature extractor, and which takes as input the feature f(s_t, a_t) of the driving scene; the number of hidden units in the second layer is 10; the number of neurons in the third layer equals the number n of driving actions available for decision in the action space; the activation function of the input layer and the hidden layer is the sigmoid function, i.e. sigmoid(x) = 1/(1 + e^(-x)), so that:
z = w^(1) x = w^(1) [1, f_t]^T
h = sigmoid(z)
g_w(s_t) = sigmoid(w^(2) [1, h]^T)
where w^(1) is the weight matrix of the hidden layer; f_t is the feature of the driving scene state s_t at time t, i.e. the input of the neural network; z is the network layer output before the sigmoid activation of the hidden layer; h is the hidden layer output after the sigmoid activation; w^(2) is the weight matrix of the output layer;
the network output g_w(s_t) is the Q set of the driving scene state s_t at time t, i.e. [Q(s_t, a_1), ..., Q(s_t, a_n)]^T, and Q^π(s_t, a_t) in S31 is obtained by inputting state s_t into the neural network and selecting the output entry corresponding to a_t.
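The three-layer forward pass of claim 8 follows directly from the equations z = w^(1)[1, f_t]^T, h = sigmoid(z), g_w(s_t) = sigmoid(w^(2)[1, h]^T); the weight shapes below (hidden size 10, k inputs, n outputs) mirror the claim:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(w1, w2, f_t):
    """Forward pass of the three-layer network in claim 8.
    w1: hidden-layer weights of shape (10, k + 1) including a bias column;
    w2: output-layer weights of shape (n, 11) including a bias column;
    f_t: state feature vector of length k."""
    x = np.concatenate(([1.0], f_t))        # prepend bias: [1, f_t]^T
    z = w1 @ x                              # pre-activation of hidden layer
    h = sigmoid(z)                          # hidden-layer output
    g = sigmoid(w2 @ np.concatenate(([1.0], h)))
    return g                                # [Q(s_t, a_1), ..., Q(s_t, a_n)]
```

With all-zero weights every output is sigmoid(0) = 0.5, which is a quick sanity check of the shapes.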
9. The system for building a driving strategy based on the driving environment according to claim 7, characterized in that, for the optimization of the neural network, the established loss function is the cross-entropy cost function, given by the following formula:
where N denotes the number of training data; Q^π(s_t, a_t) is the value obtained by inputting the state s_t describing the driving scene at time t into the neural network and selecting the output entry corresponding to the driving decision action a_t; the target is the value acquired in S31; a regularization term is included, with W = {w^(1), w^(2)} denoting the weights of the neural network above;
the training data obtained in S31 are input into the neural network to optimize this cost function; minimization of the cross-entropy cost function is completed by the gradient descent method, and the optimized neural network yields the driving strategy getter.
10. The system for building a driving strategy based on the driving environment according to claim 1, characterized in that the specific realization process of the judging device includes:
regarding the current reward function generator and driving strategy getter as a whole, and checking the current value of t in the module for seeking the feature expectation value of the driving demonstrations, i.e., whether t < ε, where ε is the threshold for judging whether the objective function meets the demand, that is, whether the reward function currently used for acquiring the driving strategy meets the requirements; its value is set differently according to specific needs;
when the value of t does not satisfy the inequality, the reward function generator needs to be rebuilt: the neural network needed in the module for seeking the state-action set under the greedy strategy is replaced with the new neural network optimized in S33, i.e., the network generating the values Q(s_t, a_i), which describe the quality of selecting the decision driving action a_i under driving scene state s_t, is replaced with the new network structure optimized by the gradient descent method in S33; the reward function generator is then rebuilt, the driving strategy getter is obtained, and whether the value of t meets the demand is judged again;
when the inequality is satisfied, the current θ is the weight vector of the required reward function; the reward function generator then meets the requirements, and so does the driving strategy getter; the driving data of the driver for whom a driving model is to be established are then collected, i.e., the environment scene images during driving and the corresponding operation data, and input into the driving environment feature extractor to obtain the decision features of the current scene; the extracted features are then input into the reward function generator to obtain the reward function of the corresponding scene state; finally, the acquired decision features and the computed reward function are input into the driving strategy getter to obtain the driving strategy corresponding to that driver.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810662039.8A CN108791308B (en) | 2018-06-25 | 2018-06-25 | System for constructing driving strategy based on driving environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108791308A true CN108791308A (en) | 2018-11-13 |
CN108791308B CN108791308B (en) | 2020-05-19 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared | ||