CN107544516A - Automatic driving system and method based on relative-entropy deep inverse reinforcement learning - Google Patents
- Publication number
- CN107544516A (application CN201710940590.XA)
- Authority
- CN
- China
- Prior art keywords
- driving
- road information
- strategy
- relative entropy
- client
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
- G05D1/02—Control of position or course in two dimensions
Abstract
The present invention relates to an automatic driving system based on relative-entropy deep inverse reinforcement learning, comprising: (1) a client, which displays driving strategies; (2) a driving basic-data acquisition subsystem, which collects road information; and (3) a storage module, which is connected to the client and the driving basic-data acquisition subsystem and stores the road information collected by the acquisition subsystem. The acquisition subsystem collects road information and transmits it to the client and the storage module; the storage module receives the road information, stores a continuous segment of it as a historical trajectory, analyzes the historical trajectory to compute and simulate driving strategies, and transmits these strategies to the client for the user to choose from; the client receives the road information and implements automatic driving according to the user's selection. The system realizes model-free automatic driving using a relative-entropy deep inverse reinforcement learning algorithm.
Description
Technical field
The present invention relates to an automatic driving system and method based on relative-entropy deep inverse reinforcement learning, and belongs to the technical field of automatic driving.
Background art
With the growth of car ownership in China, road traffic congestion has become increasingly severe and the annual number of traffic accidents continues to rise. To better address this problem, the development of automatic vehicle driving systems is necessary. Moreover, as people pursue a higher quality of life, they wish to be freed from tiring driving activity, and automatic driving technology has emerged in response.
One existing automatic vehicle driving system identifies the driving environment with a cabin-mounted camera and an image-recognition system, then navigates the vehicle using an on-board main control computer, a GPS positioning system, and path-planning software based on pre-stored information such as road maps: a reasonable driving path is planned between the vehicle's current position and the destination, and vehicle guidance leads the vehicle to the destination.
In such a system, because the road map is pre-stored in the vehicle, updating its data depends on manual operation by the driver, so the update frequency cannot be guaranteed; even if the driver updates promptly, the latest road information obtainable from existing resources may be limited, so the data may not reflect current road conditions. The result is unreasonable routing and low navigation accuracy, which inconveniences driving. Furthermore, most automatic vehicle driving systems in the field today still require manual intervention and cannot achieve fully automatic driving.
Summary of the invention
An object of the present invention is to provide an automatic driving system and method based on relative-entropy deep inverse reinforcement learning that, using a deep neural network structure and the historical driving-trajectory information of users as input, obtains a variety of driving strategies representing individual driving habits, and uses these strategies to carry out personalized, intelligent automatic driving.
To achieve the above object, the present invention provides the following technical scheme: an automatic driving system based on relative-entropy deep inverse reinforcement learning, the system comprising:

A client: displays driving strategies;

A driving basic-data acquisition subsystem: collects road information;

A storage module: connected to the client and the driving basic-data acquisition subsystem, and storing the road information collected by the driving basic-data acquisition subsystem;

wherein the driving basic-data acquisition subsystem collects road information and transmits it to the client and the storage module; the storage module receives the road information, stores a continuous segment of it as a historical trajectory, analyzes the historical trajectory to compute and simulate driving strategies, and transmits the driving strategies to the client for the user to select; the client receives the road information and implements automatic driving according to the driving strategy selected by the user.
Further, the storage module includes a driving-trajectory library for storing historical driving trajectories, a trajectory-information processing subsystem that computes and simulates driving strategies from driving trajectories and driving habits, and a driving-strategy library that stores the driving strategies. The driving-trajectory library transmits trajectory data to the trajectory-information processing subsystem; the subsystem analyzes the data to compute and simulate driving strategies and transmits them to the driving-strategy library, which receives and stores them.
Further, the trajectory-information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
Further, the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning within an EM (expectation-maximization) algorithm framework to compute the parameters of multiple reward functions.
Further, the driving basic-data acquisition subsystem includes sensors for collecting road information.
The present invention also provides a method of automatic driving based on relative-entropy deep inverse reinforcement learning, the method comprising the following steps:

S1: collecting road information and transmitting it to a client and a storage module;

S2: the storage module receives the road information, stores a continuous segment of it as a historical trajectory, analyzes the historical trajectory to compute and simulate a variety of driving strategies, and passes the driving strategies to the client;

S3: the client receives the road information and driving strategies, and implements automatic driving according to the road information and the personalized driving strategy selected by the user.
Further, the storage module includes a driving-trajectory library for storing historical driving trajectories, a trajectory-information processing subsystem that computes and simulates driving strategies from driving plans and driving habits, and a driving-strategy library that stores the driving strategies. The driving-trajectory library transmits trajectory data to the trajectory-information processing subsystem; the subsystem analyzes the data to compute and simulate driving strategies and transmits them to the driving-strategy library, which receives and stores them.

Further, the trajectory-information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.

Further, the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning within an EM algorithm framework to compute the parameters of multiple reward functions.
The beneficial effects of the present invention are: by providing a driving basic-data acquisition subsystem in the system, road information is collected in real time and passed to the storage module; the storage module receives the road information, stores a continuous segment of it as a historical trajectory, and simulates driving strategies from the historical driving trajectories, thereby realizing personalized, intelligent automatic driving.
The above is only an overview of the technical scheme of the present invention. In order that the technical means of the present invention may be better understood and practiced according to the contents of the specification, preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
Fig. 1 is a flow chart of the automatic driving system and method based on relative-entropy deep inverse reinforcement learning of the present invention.
Fig. 2 is a schematic diagram of a Markov decision process (MDP).
Detailed description of embodiments
The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples are intended to illustrate the present invention, not to limit its scope.
Referring to Fig. 1, the automatic driving system based on relative-entropy deep inverse reinforcement learning of a preferred embodiment of the present invention includes:

Client 1: displays driving strategies;

Driving basic-data acquisition subsystem 2: collects road information;

Storage module 3: connected to the client 1 and the driving basic-data acquisition subsystem 2, and storing the road information collected by the driving basic-data acquisition subsystem 2;

wherein the driving basic-data acquisition subsystem 2 collects road information and transmits it to the client 1 and the storage module 3; the storage module 3 receives the road information, stores a continuous segment of it as a historical trajectory, analyzes the historical trajectory to compute and simulate driving strategies, and transmits the driving strategies to the client 1 for the user to select; the client 1 receives the road information and implements automatic driving according to the personalized driving strategy selected by the user. In the present embodiment, the storage module 3 is in the cloud.
The most important function of the client 1 is to complete the human-machine interaction with the user, providing selection of, and access to, a variety of personalized, intelligent driving strategies. According to the user's choice of driving strategy, the client 1 downloads the corresponding strategy from the driving-strategy library 33 in the cloud 3, then makes real-time driving decisions based on the strategy and the basic data, realizing real-time unmanned driving control.
The driving basic-data acquisition subsystem 2 collects road information through sensors (not shown). The collected information serves two purposes: it is passed to the client 1 to provide basic data for current driving decisions, and it is sent to the driving-trajectory library 31 in the cloud 3, where it is stored as the user's historical driving-trajectory data.
The cloud 3 includes a driving-trajectory library 31 that stores historical driving trajectories, a trajectory-information processing subsystem 32 that computes and simulates driving strategies from driving plans and driving habits, and a driving-strategy library 33 that stores the driving strategies. The driving-trajectory library 31 transmits trajectory data to the trajectory-information processing subsystem 32; the subsystem 32 analyzes the data to compute and simulate driving strategies and transmits them to the driving-strategy library 33, which receives and stores them. The trajectory-information processing subsystem 32 computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm. In the present embodiment, the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning within an EM algorithm framework to compute the parameters of multiple reward functions. The historical driving trajectories include expert historical trajectories and the user's own historical trajectories.
Inverse reinforcement learning (IRL) refers to the problem in which the environment of a Markov decision process (MDP) is known but the reward function R is unknown. In an ordinary reinforcement learning (RL) problem, the known environment, a given reward function R, and the Markov property are used to estimate the value Q of a state-action pair (s, a) (also called the accumulated action reward); the converged values Q(s, a) are then used to derive a policy π with which the agent makes decisions. In practice, the reward function R is often very difficult to know, while some expert trajectories T_N are easier to obtain. In a Markov decision process with unknown reward, MDP/R, the problem of recovering the reward function R from expert trajectories T_N is called the inverse reinforcement learning problem (IRL).
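As a concrete illustration of the forward problem that IRL inverts, the following sketch runs value iteration on a toy two-state MDP with a known reward; the MDP, rewards, and discount factor are illustrative assumptions, not data from the patent.

```python
import numpy as np

# Forward RL step that IRL inverts: given a known reward R and
# transition model T, value iteration recovers Q(s, a) and a greedy
# policy pi. IRL starts instead from expert trajectories and recovers R.

def value_iteration(T, R, gamma=0.9, iters=200):
    """T: (S, A, S) transition probabilities; R: (S, A) rewards."""
    S, A, _ = T.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = Q.max(axis=1)           # greedy state value
        Q = R + gamma * T @ V       # Bellman backup
    return Q

# Toy 2-state, 2-action MDP: action a moves deterministically to state a
T = np.zeros((2, 2, 2))
T[0, 0, 0] = T[0, 1, 1] = 1.0
T[1, 0, 0] = T[1, 1, 1] = 1.0
R = np.array([[0.0, 1.0], [0.0, 1.0]])   # action 1 is always rewarded
Q = value_iteration(T, R)
pi = Q.argmax(axis=1)                     # greedy policy
print(pi)                                 # action 1 in both states
```

An IRL algorithm would be handed only trajectories generated by such a policy π and asked to reconstruct a reward consistent with them.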
In the present embodiment, the known historical driving-trajectory data of users in the driving-trajectory library 31 are used to perform relative-entropy deep inverse reinforcement learning, recovering reward functions R for a variety of user personalities and then simulating the corresponding driving strategies π. The relative-entropy deep inverse reinforcement learning algorithm is model-free: it does not require the state-transition function T(s, a, s′) of the environment model to be known, since it uses importance sampling to avoid T(s, a, s′) in its computation.
In the present embodiment, the automatic driving decision process of a car is a Markov decision process without a reward function, MDP/R, expressed as the set {state space S, action space A, state-transition probability T} defined by the environment (the requirement of a known transition probability T is dropped). The value function (accumulated reward) of the car agent can be expressed as V(s) = E[Σ_t γ^t R_θ(s_t, a_t)], and its state-action value function as Q(s, a) = R_θ(s, a) + γ E_{T(s,a,s′)}[V(s′)]. To handle more complex real driving problems, the reward function is no longer assumed to be a simple linear combination, but a deep neural network R(s, a, θ) = g_1(g_2(…(g_n(f(s, a), θ_n), …), θ_2), θ_1), where f(s, a) represents the road-segment feature information of driving at (s, a), and θ_i represents the parameters of the i-th layer of the deep neural network.
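A minimal sketch of such a nested reward network follows; the layer sizes, the tanh nonlinearity, and the placeholder feature extractor f(s, a) are all illustrative assumptions, since the patent does not fix them.

```python
import numpy as np

# Sketch of the nested reward R(s, a, theta) = g1(g2(...gn(f(s,a), th_n)...)):
# each layer g_i applies its weights theta_i to the previous layer's output,
# with f(s, a) standing in for the road-feature vector at (s, a).

rng = np.random.default_rng(0)

def f(s, a):
    # placeholder feature extractor; real features come from road sensors
    return np.array([s, a, s * a, 1.0], dtype=float)

theta = [rng.normal(size=(4, 8)), rng.normal(size=(8, 1))]  # two layers

def reward(s, a, theta):
    x = f(s, a)
    for th in theta[:-1]:
        x = np.tanh(x @ th)        # hidden layers g_n ... g_2
    return float(x @ theta[-1])    # linear output layer g_1

r = reward(0.5, 1.0, theta)        # scalar reward for one (s, a) pair
```

Because every g_i is differentiable, the gradient ∂R/∂θ_i needed by the later back-propagation update is well defined.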
Meanwhile, to accommodate more personalized, more intelligent real driving scenes, it is assumed that multiple reward functions R (targets) exist simultaneously, representing the different driving habits of users. Suppose there are G reward functions; let their prior probability distribution be ρ_1, …, ρ_G and their reward weights be θ_1, …, θ_G, and let Θ = (ρ_1, …, ρ_G, θ_1, …, θ_G) denote the parameter set of these G reward functions.
Referring to Fig. 2, under the condition that an assumed reward function is known (obtained by initialization or by iteration), the problem can now be described as a complete Markov decision process (MDP). Under this complete MDP, according to reinforcement learning, the reward function R(s, a, θ) = g_1(g_2(…(g_n(f, θ_n), …), θ_2), θ_1) can be used to evaluate the V values and Q values. For the evaluation step of reinforcement learning, a soft maximization operator, MellowMax, is used to estimate the expected V value. The MellowMax operator is defined as mm_ω(x) = log((1/n) Σ_{i=1}^n e^{ω x_i}) / ω. MellowMax is a better-behaved operator: it guarantees that the estimate of the V value converges to a unique point. At the same time, MellowMax provides a principled probability-assignment mechanism and expectation-estimation method. In the present embodiment, a reinforcement learning algorithm combined with MellowMax is more reasonable in its exploration and exploitation of the environment during automatic driving: it ensures that, when the reinforcement learning process converges, the automatic driving system has learned the various scenes sufficiently and can produce a relatively sound assessment of the current state.
In the present embodiment, reinforcement learning combined with the soft maximization operator MellowMax yields a more scientific evaluation of the expected feature values of states. Using MellowMax, the probability distribution for action selection can be taken as π(a|s) = e^{ω Q(s,a)} / Σ_{a′} e^{ω Q(s,a′)}. Under this soft-maximized action-selection rule, the iterative process of reinforcement learning yields the expected feature value μ of the reward function formed by the current deep neural network parameters θ; μ can be understood as the accumulated expectation of the features.
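The MellowMax operator and one soft action-selection rule can be sketched as follows; the Boltzmann form of the rule and the temperature ω = 5 are assumptions for illustration, since the exact rule in the original figures is not reproduced here.

```python
import numpy as np

# MellowMax: mm_w(x) = log(mean(exp(w * x))) / w. Unlike a hard max it is
# differentiable, and its V-value estimate has a unique fixed point, as
# the text notes. The Boltzmann distribution below is one simple
# probability-assignment rule consistent with the text (an assumption).

def mellowmax(q, w=5.0):
    q = np.asarray(q, dtype=float)
    m = q.max()                                    # stabilise the exponent
    return m + np.log(np.mean(np.exp(w * (q - m)))) / w

def action_probs(q, w=5.0):
    z = np.exp(w * (q - mellowmax(q, w)))
    return z / z.sum()

q = np.array([1.0, 2.0, 3.0])
v = mellowmax(q)        # lies between mean(q) = 2.0 and max(q) = 3.0
p = action_probs(q)     # soft preference for the highest-Q action
```

As ω → ∞ the operator approaches a hard max, and as ω → 0 it approaches the mean, which is what makes it useful for balancing exploration and exploitation.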
In the present embodiment, the above multi-objective inverse reinforcement learning problem with hidden variables is solved using the EM algorithm. The EM algorithm consists of an E step and an M step; by iterating the E and M steps, it approaches the maximum of the likelihood estimate.
E step: first compute z_ij = ρ_j Pr(τ_i | θ_j) / Z, where Z is the normalizing term. z_ij represents the probability that the i-th driving trajectory belongs to driving habit (reward function) j.

Let y_i = j denote that the i-th driving trajectory belongs to driving habit j, and let y = (y_1, …, y_N) denote the membership set of the N driving trajectories.
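The E-step responsibilities z_ij can be sketched as follows, computed in log space for numerical stability; the trajectory log-likelihoods are illustrative placeholders, where in the system they would come from the relative-entropy IRL trajectory model.

```python
import numpy as np

# E step: z_ij = rho_j * Pr(tau_i | theta_j) / Z — the posterior
# probability that trajectory i was generated under driving habit
# (reward function) j.

def e_step(loglik, rho):
    """loglik: (N, G) log Pr(tau_i | theta_j); rho: (G,) priors."""
    logpost = loglik + np.log(rho)                    # unnormalised posterior
    logpost -= logpost.max(axis=1, keepdims=True)     # stabilise the exp
    z = np.exp(logpost)
    return z / z.sum(axis=1, keepdims=True)           # rows sum to 1

loglik = np.array([[-1.0, -5.0],                      # 2 trajectories,
                   [-4.0, -1.5]])                     # 2 driving habits
rho = np.array([0.5, 0.5])
z = e_step(loglik, rho)
```

The matching M-step update for the priors is then simply ρ_j = mean over i of z_ij, which feeds into the maximization described below.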
Compute the likelihood estimate Q(Θ, Θ_t) = Σ_y L(Θ | D, y) Pr(y | D, Θ_t) (the Q function Q(Θ, Θ_t) here is the update objective of the EM algorithm, to be distinguished from the state-action value function Q in reinforcement learning); after derivation, the likelihood estimate Q(Θ, Θ_t) = Σ_i Σ_j z_ij (log ρ_j + log Pr(τ_i | θ_j)) is obtained.
M step: choose a suitable multi-driving-habit parameter set Θ (ρ_l and θ_l) that maximizes the likelihood estimate Q(Θ, Θ_t) from the E step. Since ρ_l and θ_l are mutually independent, their maximizations can be carried out separately; maximizing over the priors gives ρ_l = (1/N) Σ_i z_il.

The update target for the latter half of Q(Θ, Θ_t) is max_{θ_l} Σ_i z_il log Pr(τ_i | θ_l), which can be understood as a maximum-likelihood equation for the observed trajectory set under the condition that the parameter of the l-th cluster target is θ_l. This maximum-likelihood equation can be solved using relative-entropy deep inverse reinforcement learning: its solution formula, while satisfying the maximum-likelihood update target, naturally applies to the back-propagation update of the deep neural network parameters. Let the maximization objective of the deep neural network be L(θ) = log P(D, θ | r); according to the decomposition formula of the joint likelihood function, L(θ) = log P(D, θ | r) = log P(D | r) + log P(θ).
Taking the partial derivative of the joint likelihood objective gives ∂L/∂θ = ∂log P(D | r)/∂θ + ∂log P(θ)/∂θ. The first half of this derivative can be further decomposed as ∂log P(D | r)/∂θ = (∂log P(D | r)/∂r)(∂r/∂θ). According to relative-entropy inverse reinforcement learning, ∂log P(D | r)/∂r is the difference between the expert feature expectation and the feature expectation under the current reward function, μ_D − μ. Using importance sampling, μ is estimated from trajectories sampled according to a given policy π, where each trajectory τ = s_1 a_1, …, s_H a_H is weighted in proportion to e^{R_θ(τ)} relative to its sampling probability, which avoids the unknown transition function T(s, a, s′). Further, ∂r/∂θ is the gradient computed by the back-propagation algorithm when updating the hidden-layer parameters of the deep neural network.
The completion of this gradient update marks the completion of one iteration of relative-entropy deep inverse reinforcement learning. The deep-network reward function with the newly updated parameters produces a new policy π, and a new iteration begins.
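The following sketch performs this update for the simplest, linear case R_θ(τ) = θ·f(τ): the gradient is the expert feature expectation minus an importance-weighted feature expectation over trajectories sampled from a base policy. All trajectories here are random placeholders, and the deep version would backpropagate the same difference μ_D − μ through the reward network.

```python
import numpy as np

# One relative-entropy IRL gradient ascent loop for a linear reward.
# Importance weights are proportional to exp(R_theta(tau)); with a fixed
# base-policy sample set, the sampler's probability cancels out here
# (a simplifying assumption for illustration).

rng = np.random.default_rng(1)

def traj_features(traj):
    # cumulative feature counts along one trajectory (rows = steps)
    return traj.sum(axis=0)

def reirl_gradient(theta, expert_trajs, sampled_trajs):
    mu_expert = np.mean([traj_features(t) for t in expert_trajs], axis=0)
    feats = np.array([traj_features(t) for t in sampled_trajs])
    logw = feats @ theta                 # log importance weights ~ R_theta
    w = np.exp(logw - logw.max())        # stabilised
    w /= w.sum()
    mu_sampled = w @ feats               # importance-weighted expectation
    return mu_expert - mu_sampled        # ascent direction: mu_D - mu

expert = [rng.normal(size=(5, 3)) for _ in range(4)]    # placeholder data
sampled = [rng.normal(size=(5, 3)) for _ in range(20)]  # base-policy samples
theta = np.zeros(3)
for _ in range(30):
    theta += 0.05 * reirl_gradient(theta, expert, sampled)
```

Each pass of this loop corresponds to one "gradient update" iteration in the text; in the nested EM scheme, the per-trajectory terms would additionally be weighted by the responsibilities z_il of cluster l.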
The E step and M step are iterated continuously until the likelihood estimate Q(Θ, Θ_t) converges to its maximum. The parameter set Θ = (ρ_1, …, ρ_G, θ_1, …, θ_G) obtained at this point is exactly the prior distribution and the weights of the reward functions representing multiple driving habits that we want to solve for.
In the present embodiment, from this parameter set Θ, the driving strategy π of each driving habit R is obtained through reinforcement learning (RL) computation. The multiple driving strategies are exported and saved in the driving-strategy library in the cloud, and the user can select a personalized, intelligent driving strategy in the client.
The present invention also provides a method of automatic driving based on relative-entropy deep inverse reinforcement learning, the method comprising the following steps:

S1: collecting road information and transmitting it to a client and a storage module;

S2: the storage module receives the road information, analyzes it, computes and simulates a variety of driving strategies, and passes the driving strategies to the client;

S3: the client receives the road information and driving strategies, and implements automatic driving according to the road information and the personalized driving strategy selected by the user.
In summary: by providing a driving basic-data acquisition subsystem 2 in the system, road information is collected in real time and passed to the storage module 3 and the client 1; the storage module 3 receives the road information and simulates driving strategies from historical driving trajectories, realizing personalized, intelligent automatic driving.

In automatic driving based on this method, all driving-strategy computation is carried out in the cloud 3 rather than in the client 1. By the time the user needs automatic driving, all driving strategies have already been computed in the cloud 3. The user only needs to select and download the desired driving strategy, and the vehicle can drive automatically in real time according to the selected strategy and the real-time road information. Meanwhile, after any completed drive, large amounts of road information are uploaded to the cloud 3 and stored as historical driving trajectories. The stored historical-trajectory big data are then used to update the driving-strategy library; using this trajectory big data, the system realizes automatic driving that is ever closer to the user's needs.
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features of the above embodiments are described; however, as long as the combinations of these technical features are not contradictory, they should be considered within the scope of this specification.

The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that, for those of ordinary skill in the art, various modifications and improvements can be made without departing from the concept of the invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.
Claims (9)
1. An automatic driving system based on relative-entropy deep inverse reinforcement learning, characterized in that the system comprises:
a client, which displays driving strategies;
a driving basic-data acquisition subsystem, which collects road information; and
a storage module, which is connected to the client and the driving basic-data acquisition subsystem and stores the road information collected by the driving basic-data acquisition subsystem;
wherein the driving basic-data acquisition subsystem collects road information and transmits it to the client and the storage module; the storage module receives the road information, stores a continuous segment of it as a historical trajectory, analyzes the historical trajectory to compute and simulate driving strategies, and transmits the driving strategies to the client for the user to select; and the client receives the road information and implements automatic driving according to the driving strategy selected by the user.
2. The automatic driving system based on relative-entropy deep inverse reinforcement learning of claim 1, characterized in that the storage module comprises a driving-trajectory library for storing historical driving trajectories, a trajectory-information processing subsystem that computes and simulates driving strategies from driving trajectories and driving habits, and a driving-strategy library that stores the driving strategies; the driving-trajectory library transmits trajectory data to the trajectory-information processing subsystem, the trajectory-information processing subsystem analyzes the trajectory data to compute and simulate driving strategies and transmits them to the driving-strategy library, and the driving-strategy library receives and stores the driving strategies.
3. The automatic driving system based on relative-entropy deep inverse reinforcement learning of claim 2, characterized in that the trajectory-information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
4. The automatic driving system based on relative-entropy deep inverse reinforcement learning of claim 3, characterized in that the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning within an EM algorithm framework to compute the parameters of multiple reward functions.
5. The automatic driving system based on relative-entropy deep inverse reinforcement learning of claim 1, characterized in that the driving basic-data acquisition subsystem comprises sensors for collecting road information.
6. A method of automatic driving based on relative-entropy deep inverse reinforcement learning, characterized in that the method comprises the following steps:
S1: collecting road information and transmitting it to a client and a storage module;
S2: the storage module receives the road information, stores a continuous segment of it as a historical trajectory, analyzes the historical trajectory to compute and simulate a variety of driving strategies, and passes the driving strategies to the client;
S3: the client receives the road information and driving strategies, and implements automatic driving according to the road information and the personalized driving strategy selected by the user.
7. The method of automatic driving based on relative-entropy deep inverse reinforcement learning of claim 6, characterized in that the storage module comprises a driving-trajectory library for storing historical driving trajectories, a trajectory-information processing subsystem that computes and simulates driving strategies from driving plans and driving habits, and a driving-strategy library that stores the driving strategies; the driving-trajectory library transmits trajectory data to the trajectory-information processing subsystem, the trajectory-information processing subsystem analyzes the trajectory data to compute and simulate driving strategies and transmits them to the driving-strategy library, and the driving-strategy library receives and stores the driving strategies.
8. The method of automatic driving based on relative-entropy deep inverse reinforcement learning of claim 7, characterized in that the trajectory-information processing subsystem computes and simulates driving strategies using a multi-objective relative-entropy deep inverse reinforcement learning algorithm.
9. The method of automatic driving based on relative-entropy deep inverse reinforcement learning of claim 8, characterized in that the multi-objective inverse reinforcement learning algorithm nests relative-entropy deep inverse reinforcement learning within an EM algorithm framework to compute the parameters of multiple reward functions.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710940590.XA CN107544516A (en) | 2017-10-11 | 2017-10-11 | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning |
PCT/CN2018/078740 WO2019071909A1 (en) | 2017-10-11 | 2018-03-12 | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710940590.XA CN107544516A (en) | 2017-10-11 | 2017-10-11 | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107544516A true CN107544516A (en) | 2018-01-05 |
Family
ID=60967749
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710940590.XA Pending CN107544516A (en) | 2017-10-11 | 2017-10-11 | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN107544516A (en) |
WO (1) | WO2019071909A1 (en) |
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460015A (en) * | 2017-09-06 | 2019-03-12 | 通用汽车环球科技运作有限责任公司 | Unsupervised learning agents for autonomous driving applications |
CN109636432A (en) * | 2018-09-28 | 2019-04-16 | 阿里巴巴集团控股有限公司 | The project selection method and device that computer executes |
WO2019071909A1 (en) * | 2017-10-11 | 2019-04-18 | 苏州大学张家港工业技术研究院 | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning |
CN110238855A (en) * | 2019-06-24 | 2019-09-17 | 浙江大学 | Robot random-order workpiece grasping method based on deep inverse reinforcement learning |
CN110321811A (en) * | 2019-06-17 | 2019-10-11 | 中国工程物理研究院电子工程研究所 | Object detection method for UAV video based on deep inverse reinforcement learning |
WO2019237474A1 (en) * | 2018-06-11 | 2019-12-19 | 苏州大学 | Partially-observable automatic driving decision-making method and system based on constraint online planning |
WO2020000192A1 (en) * | 2018-06-26 | 2020-01-02 | Psa Automobiles Sa | Method for providing vehicle trajectory prediction |
CN110654372A (en) * | 2018-06-29 | 2020-01-07 | 比亚迪股份有限公司 | Vehicle driving control method and device, vehicle and storage medium |
CN110837258A (en) * | 2019-11-29 | 2020-02-25 | 商汤集团有限公司 | Automatic driving control method, device, system, electronic device and storage medium |
CN110850861A (en) * | 2018-07-27 | 2020-02-28 | 通用汽车环球科技运作有限责任公司 | Attention-based hierarchical lane change depth reinforcement learning |
CN110955239A (en) * | 2019-11-12 | 2020-04-03 | 中国地质大学(武汉) | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
CN111026127A (en) * | 2019-12-27 | 2020-04-17 | 南京大学 | Automatic driving decision method and system based on partially observable transfer reinforcement learning |
CN111159832A (en) * | 2018-10-19 | 2020-05-15 | 百度在线网络技术(北京)有限公司 | Construction method and device of traffic information flow |
CN114194211A (en) * | 2021-11-30 | 2022-03-18 | 浪潮(北京)电子信息产业有限公司 | Automatic driving method and device, electronic equipment and storage medium |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110673602B (en) * | 2019-10-24 | 2022-11-25 | 驭势科技(北京)有限公司 | Reinforcement learning model, vehicle automatic driving decision-making method, and on-board equipment |
TWI737437B (en) * | 2020-08-07 | 2021-08-21 | 財團法人車輛研究測試中心 | Trajectory determination method |
US20230143937A1 (en) * | 2021-11-10 | 2023-05-11 | International Business Machines Corporation | Reinforcement learning with inductive logic programming |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140278052A1 (en) * | 2013-03-15 | 2014-09-18 | Caliper Corporation | Lane-level vehicle navigation for vehicle routing and traffic management |
CN106842925A (en) * | 2017-01-20 | 2017-06-13 | 清华大学 | Intelligent locomotive driving method and system based on deep reinforcement learning |
CN107074178A (en) * | 2014-09-16 | 2017-08-18 | 本田技研工业株式会社 | Drive assistance device |
CN107084735A (en) * | 2017-04-26 | 2017-08-22 | 电子科技大学 | Navigation path architecture for reducing redundant navigation |
CN107169567A (en) * | 2017-03-30 | 2017-09-15 | 深圳先进技术研究院 | Method and device for generating a decision network model for vehicle automatic driving |
CN107200017A (en) * | 2017-05-22 | 2017-09-26 | 北京联合大学 | Autonomous vehicle control system based on deep learning |
CN107229973A (en) * | 2017-05-12 | 2017-10-03 | 中国科学院深圳先进技术研究院 | Method and device for generating a policy network model for vehicle automatic driving |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103699717A (en) * | 2013-12-03 | 2014-04-02 | 重庆交通大学 | Vehicle travel trajectory prediction method for complex roads based on preview cross-section point selection |
CN105718750B (en) * | 2016-01-29 | 2018-08-17 | 长沙理工大学 | Vehicle driving trajectory prediction method and system |
CN107544516A (en) * | 2017-10-11 | 2018-01-05 | 苏州大学 | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning |
- 2017
  - 2017-10-11: CN CN201710940590.XA patent/CN107544516A/en active Pending
- 2018
  - 2018-03-12: WO PCT/CN2018/078740 patent/WO2019071909A1/en active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140278052A1 (en) * | 2013-03-15 | 2014-09-18 | Caliper Corporation | Lane-level vehicle navigation for vehicle routing and traffic management |
CN107074178A (en) * | 2014-09-16 | 2017-08-18 | 本田技研工业株式会社 | Drive assistance device |
CN106842925A (en) * | 2017-01-20 | 2017-06-13 | 清华大学 | Intelligent locomotive driving method and system based on deep reinforcement learning |
CN107169567A (en) * | 2017-03-30 | 2017-09-15 | 深圳先进技术研究院 | Method and device for generating a decision network model for vehicle automatic driving |
CN107084735A (en) * | 2017-04-26 | 2017-08-22 | 电子科技大学 | Navigation path architecture for reducing redundant navigation |
CN107229973A (en) * | 2017-05-12 | 2017-10-03 | 中国科学院深圳先进技术研究院 | Method and device for generating a policy network model for vehicle automatic driving |
CN107200017A (en) * | 2017-05-22 | 2017-09-26 | 北京联合大学 | Autonomous vehicle control system based on deep learning |
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109460015B (en) * | 2017-09-06 | 2022-04-15 | 通用汽车环球科技运作有限责任公司 | Unsupervised learning agent for autonomous driving applications |
CN109460015A (en) * | 2017-09-06 | 2019-03-12 | 通用汽车环球科技运作有限责任公司 | Unsupervised learning agent for autonomous driving applications |
WO2019071909A1 (en) * | 2017-10-11 | 2019-04-18 | 苏州大学张家港工业技术研究院 | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning |
WO2019237474A1 (en) * | 2018-06-11 | 2019-12-19 | 苏州大学 | Partially-observable automatic driving decision-making method and system based on constraint online planning |
WO2020000192A1 (en) * | 2018-06-26 | 2020-01-02 | Psa Automobiles Sa | Method for providing vehicle trajectory prediction |
CN110654372B (en) * | 2018-06-29 | 2021-09-03 | 比亚迪股份有限公司 | Vehicle driving control method and device, vehicle and storage medium |
CN110654372A (en) * | 2018-06-29 | 2020-01-07 | 比亚迪股份有限公司 | Vehicle driving control method and device, vehicle and storage medium |
CN110850861B (en) * | 2018-07-27 | 2023-05-23 | 通用汽车环球科技运作有限责任公司 | Attention-based hierarchical lane-changing depth reinforcement learning |
CN110850861A (en) * | 2018-07-27 | 2020-02-28 | 通用汽车环球科技运作有限责任公司 | Attention-based hierarchical lane change depth reinforcement learning |
CN109636432B (en) * | 2018-09-28 | 2023-05-30 | 创新先进技术有限公司 | Computer-implemented item selection method and apparatus |
CN109636432A (en) * | 2018-09-28 | 2019-04-16 | 阿里巴巴集团控股有限公司 | Computer-implemented item selection method and apparatus |
CN111159832A (en) * | 2018-10-19 | 2020-05-15 | 百度在线网络技术(北京)有限公司 | Construction method and device of traffic information flow |
CN111159832B (en) * | 2018-10-19 | 2024-04-02 | 百度在线网络技术(北京)有限公司 | Traffic information stream construction method and device |
CN110321811A (en) * | 2019-06-17 | 2019-10-11 | 中国工程物理研究院电子工程研究所 | Target detection method in unmanned aerial vehicle video based on deep inverse reinforcement learning |
CN110321811B (en) * | 2019-06-17 | 2023-05-02 | 中国工程物理研究院电子工程研究所 | Target detection method in unmanned aerial vehicle aerial video based on deep inverse reinforcement learning |
CN110238855A (en) * | 2019-06-24 | 2019-09-17 | 浙江大学 | Robot grasping method for randomly stacked workpieces based on deep inverse reinforcement learning |
CN110955239A (en) * | 2019-11-12 | 2020-04-03 | 中国地质大学(武汉) | Unmanned ship multi-target trajectory planning method and system based on inverse reinforcement learning |
CN110837258B (en) * | 2019-11-29 | 2024-03-08 | 商汤集团有限公司 | Automatic driving control method, device, system, electronic equipment and storage medium |
CN110837258A (en) * | 2019-11-29 | 2020-02-25 | 商汤集团有限公司 | Automatic driving control method, device, system, electronic device and storage medium |
CN111026127B (en) * | 2019-12-27 | 2021-09-28 | 南京大学 | Automatic driving decision method and system based on partially observable transfer reinforcement learning |
CN111026127A (en) * | 2019-12-27 | 2020-04-17 | 南京大学 | Automatic driving decision method and system based on partially observable transfer reinforcement learning |
CN114194211B (en) * | 2021-11-30 | 2023-04-25 | 浪潮(北京)电子信息产业有限公司 | Automatic driving method and device, electronic equipment and storage medium |
CN114194211A (en) * | 2021-11-30 | 2022-03-18 | 浪潮(北京)电子信息产业有限公司 | Automatic driving method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2019071909A1 (en) | 2019-04-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107544516A (en) | Automatic driving system and method based on relative-entropy deep inverse reinforcement learning | |
Zhu et al. | Human-like autonomous car-following model with deep reinforcement learning | |
JP7287707B2 (en) | Driverless vehicle lane change decision method and system based on adversarial imitation learning | |
CN112162555B (en) | Vehicle control method based on reinforcement learning control strategy in hybrid vehicle fleet | |
EP3719603B1 (en) | Action control method and apparatus | |
US11734828B2 (en) | High quality instance segmentation | |
DE102019113856A1 (en) | SYSTEMS, METHODS AND CONTROLS FOR AN AUTONOMOUS VEHICLE THAT IMPLEMENT AUTONOMOUS DRIVING AGENTS AND GUIDANCE LEARNERS TO CREATE AND IMPROVE GUIDELINES BASED ON THE COLLECTIVE DRIVING EXPERIENCES OF THE AUTONOMOUS DRIVING AGENTS | |
US20210004006A1 (en) | Method and system for predictive control of vehicle using digital images | |
US11580851B2 (en) | Systems and methods for simulating traffic scenes | |
US11891087B2 (en) | Systems and methods for generating behavioral predictions in reaction to autonomous vehicle movement | |
CN112580801B (en) | Reinforced learning training method and decision-making method based on reinforced learning | |
CN103336863A (en) | Radar flight path observation data-based flight intention recognition method | |
Zhao et al. | Personalized car following for autonomous driving with inverse reinforcement learning | |
CN112230675B (en) | Unmanned aerial vehicle task allocation method considering operation environment and performance in collaborative search and rescue | |
US20220153298A1 (en) | Generating Motion Scenarios for Self-Driving Vehicles | |
CN112071062B (en) | Driving time estimation method based on graph convolution network and graph attention network | |
CN109727490A (en) | Adaptive correction prediction method for nearby vehicle behavior based on a driving prediction field | |
CN115494879B (en) | Rotor unmanned aerial vehicle obstacle avoidance method, device and equipment based on reinforcement learning SAC | |
CN110320932A (en) | Flight formation reconfiguration method based on differential evolution algorithm | |
DE102021114724A1 (en) | IMPROVED VEHICLE OPERATION | |
CN108985488A (en) | Method for predicting individual trip purpose | |
CN111580526A (en) | Cooperative driving method for fixed vehicle formation scene | |
CN111310919A (en) | Driving control strategy training method based on scene segmentation and local path planning | |
CN114167898B (en) | Global path planning method and system for collecting data of unmanned aerial vehicle | |
CN107767036A (en) | Real-time traffic state estimation method based on conditional random fields |
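The relative-entropy inverse reinforcement learning named in this application's title (and in several of the similar documents above) can be illustrated with a minimal sketch. This is not code from the patent: it assumes a reward that is linear in handcrafted trajectory features and a uniform baseline sampling policy, following the general model-free relative-entropy IRL scheme, where reward weights are fit by matching expert feature expectations against importance-weighted expectations over sampled trajectories. The function name and parameters are illustrative only.

```python
import numpy as np

def relative_entropy_irl(expert_feats, sampled_feats, n_iters=200, lr=0.1):
    """Toy model-free relative-entropy IRL sketch (assumptions noted above).

    expert_feats : (n_expert, d) feature counts of demonstrated trajectories
    sampled_feats: (n_samples, d) feature counts of trajectories drawn from a
                   baseline sampling policy (assumed uniform here)
    Returns learned reward weights theta, with reward(traj) = theta @ features.
    """
    d = expert_feats.shape[1]
    theta = np.zeros(d)
    f_expert = expert_feats.mean(axis=0)  # expert feature expectation
    for _ in range(n_iters):
        # Importance weights: proportional to exp(theta . f) under a uniform baseline.
        w = np.exp(sampled_feats @ theta)
        w /= w.sum()
        f_model = w @ sampled_feats       # importance-weighted model expectation
        # Subgradient ascent on the dual: push model features toward expert features.
        theta += lr * (f_expert - f_model)
    return theta
```

On toy data where demonstrations concentrate on one feature, the learned weight for that feature ends up larger than the others, which is the qualitative behavior an IRL-based driving-strategy module relies on when recovering a reward from stored historical trajectories.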
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 2020-12-28
Address after: Building C4, Hongfeng Science Park, Nanjing Economic and Technological Development Zone, Jiangsu Province, 210034
Applicant after: NANQI XIANCE (NANJING) TECHNOLOGY Co.,Ltd.
Address before: No. 8, Jixue Road, Xiangcheng District, Suzhou City, Jiangsu Province, 215006
Applicant before: Suzhou University
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20180105 |