CN109948781A - Continuous action online learning control method and system for automatic driving vehicle - Google Patents
- Publication number: CN109948781A
- Application number: CN201910217492.2A
- Authority
- CN
- China
- Prior art keywords
- network
- moment
- evaluator
- movement
- learning
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Landscapes
- Manipulator (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses a continuous-action online learning control method and system for an autonomous vehicle. The method obtains the current perceptual image It and encodes it to obtain a coded state feature st; the coded state feature st is then input into the evaluator (critic) network and the actor network of a cerebellar model neural network respectively, and an action at is output through the actor network. By combining deep-neural-network feature coding with reinforcement learning principles, the invention solves the learning control problem of a continuous action space under high-dimensional state input, and realizes online learning and control over a continuous action space under large-scale continuous state input. The learning effect is guaranteed, the learning period is shortened, the learning process converges quickly to a well-performing control strategy, and data utilization is good.
Description
Technical field
The present invention relates to the field of environment perception for autonomous vehicles, and in particular to a continuous-action online learning control method and system for autonomous vehicles, which combines deep-neural-network feature coding with reinforcement learning principles to solve the learning control problem of a continuous action space under high-dimensional state input.
Background technique
With the development and innovation of artificial intelligence technology and the growth of the automobile industry at home and abroad, intelligent driving vehicles, the product of organically combining intelligent driving technology with the automobile, have increasingly become the focus of major automobile enterprises, high-tech companies, universities, and scientific research institutions. Under the coordination of the environment perception system, behavior decision system, path planning system, and motion control system, an intelligent driving vehicle can effectively monitor its own state and the driver's state, perceive changes and abnormal conditions in the surrounding environment, and assist, prompt, or replace the driver in completing part or all of the driving behavior in a timely manner. Compared with ordinary vehicles, intelligent driving vehicles have the advantages of fast response, accurate perception, predictable behavior, and careful and precise control, and are indispensable units in the future intelligent transportation system. It can be expected that promoting the application of intelligent driving vehicles will effectively alleviate traffic congestion, reduce traffic accidents caused by human error, save energy consumption and driving time, reduce pollutant emissions, and improve driving comfort and freight transportation efficiency, which has far-reaching significance and important value for the progress of human society.
Intelligent driving vehicles are closely coupled with industries such as robotics, transportation, and artificial intelligence; the field is a crossing domain covering multiple disciplines including pattern recognition and intelligent systems, control theory, computer science, and cognitive psychology. The related key technologies are mainly environment perception, path planning, behavior decision, and motion control. Among these four key technologies, environment perception is the basic premise and motion control is the final, bottom-level foothold.
In scenarios where the road structure is relatively fixed and the traffic driving scene is relatively simple, environment perception and motion planning and control technologies have made impressive progress. However, as the complexity of traffic scenes increases and the demand for intelligent driving under adverse weather and terrain conditions grows, harsher and more challenging requirements are placed on the environment perception and behavior and motion control of intelligent driving vehicles, namely human-like perception and human-like driving intelligence. To reach this goal, one of the keys is that the intelligent driving vehicle can efficiently perceive and fuse driving-environment information, and learn autonomously through interaction with the environment and with human experience data. For this reason, driven by the rapid development of computer science, researchers at home and abroad continue to attempt to use machine learning methods to solve the perception and control problems of intelligent driving vehicles in complex environments.
At present, with the gradual growth of available data and the increasing maturity of parallel computing hardware, the "big data + deep learning (DL) model" paradigm is gradually replacing the original "feature engineering + traditional learning model" paradigm in academia and industry and has become a current research hotspot. Deep learning is a family of learning algorithms in machine learning that attempt multilayer abstraction of data through a series of nonlinear transformations. By training on large amounts of sample data, deep learning can learn feature representations with superior performance that reflect the internal structure of the data and reveal the relationships among variables; it has gradually been applied to environment perception tasks such as pedestrian detection, vehicle detection, and traffic light recognition with remarkable results.
In learning control of intelligent driving vehicles in complex environments, the strong nonlinearity and variability of the vehicle's own dynamics and of the driving environment bring difficulties to conventional decision and control methods in target modeling. In this regard, researchers have gradually introduced intelligent methods including neural networks, genetic algorithms, and reinforcement learning. Among them, reinforcement learning (RL) needs no supervision signal, and its distinctive "agent-environment" interactive reward learning mechanism allows the target object to learn by itself with minimal human participation. However, the states under intelligent vehicle driving environments are complex and changeable, and in exploring these state spaces reinforcement learning inevitably faces the curse of dimensionality of large-scale state spaces and the problem of continuous control. Therefore, research on control learning algorithms with large-scale high-dimensional input and continuous output is very necessary. Deep features obtained by deep neural networks from image learning are an effective dimensionality reduction of complex environment states, and deep reinforcement learning (DRL), which combines deep neural networks with reinforcement learning principles, makes it possible for reinforcement learning methods to handle the learning control problem of continuous action spaces under high-dimensional state input. However, existing DL and DRL methods typically perform parameter optimization based on gradient descent, which often suffers from hard-to-avoid local minima, hard-to-guarantee generalization ability, and high training cost caused by the large amount of search and optimization computation. As a result, DL methods for complex-environment perception of intelligent driving vehicles, and DRL motion control methods for continuous action spaces under high-dimensional state input, have insufficient adaptability and efficiency, which limits the further improvement of their performance; accelerating learning speed and improving learning efficiency are therefore urgent problems. Hence, for the learning control problem of continuous action spaces under high-dimensional state input, how to combine deep-neural-network feature coding with reinforcement learning principles to realize more efficient and rapid online learning control has become a key technical problem to be solved.
The learning control problem of intelligent driving vehicles under complex environmental conditions can be roughly abstracted as learning an optimal control policy with continuous or discrete actions under large-scale continuous state input. Reinforcement learning is a powerful technical means for solving this problem. However, as task complexity grows, dimensionality increasingly becomes the main obstacle preventing reinforcement learning from reaching satisfactory results. Generally, using visual images as the state representation for reinforcement learning is the most direct and promising approach. Visual images capture the dynamic environment and system behavior relevant to the learning task, and using them as the state input avoids tedious state feature design and additional sensing hardware. However, when raw images are used as the state input, traditional reinforcement learning algorithms usually diverge during learning because of the excessively high dimensionality.
Although deep reinforcement learning methods use deep neural network models such as convolutional neural networks and recurrent neural networks to realize end-to-end representation and policy learning from raw images, the design of the deep network model and the selection and optimization of its large number of parameters are often intractable; in addition, large amounts of training data and huge computational cost are needed to obtain a well-performing model. Therefore, deep reinforcement learning algorithms usually depend strongly on parallel computing hardware. Moreover, current deep reinforcement learning methods are mainly based on strongly nonlinear deep networks trained by back-propagation. Local minima and limited generalization ability are unavoidable in back-propagation learning, the utilization of the information contained in training samples is not high, and learning convergence often goes through a lengthy process. On the one hand, traditional reinforcement learning methods generally use linear function approximators to learn and approximate the control policy; mature theory proves that they have good learning stability and data validity, but they cannot handle the high-dimensional state input problem. On the other hand, deep neural networks have powerful representation learning ability and can mine and encode, from raw high-dimensional input, the rich features needed for function approximation in the control policy learning process. Weighing and combining the advantages of the two, using a deep neural network for state feature coding and then performing learning approximation of the control policy with a traditional reinforcement learning algorithm may be an effective way to solve the learning control problem under large-scale state input spaces.
As a balance between reinforcement learning methods based on value function estimation and those based on policy gradient and policy search, actor-critic methods can perform single-step updates and also guarantee good learning performance in continuous action spaces. Among them, the adaptive heuristic critic (AHC) is a representative method. Fig. 1 shows the structure of a reinforcement learning system based on the AHC algorithm. The system consists of an evaluator (critic) network and an actor network, which appear as two interconnected but independent neural networks. The input of the evaluator network includes the external immediate reward and the state feedback from the environment; its output, called the temporal-difference signal, serves as the internal immediate reward. The input of the actor network includes the state feedback from the environment and the internal immediate reward output by the evaluator network; its output is the action applied to the environment. The actor network generates control actions according to the policy; the evaluator network evaluates the quality of the policy represented by the actor and provides the actor network with an internal reward, so that the actor need not wait for the delayed external reward. The evaluator network aims to learn to predict: temporal-difference learning usually serves as the evaluator's learning algorithm, while the learning of the actor network depends on an estimate of the policy gradient.
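The actor-critic interaction described above can be sketched in a few lines. This is an illustrative outline, not the patent's implementation: class and variable names are ours, the critic is a linear value estimator, and its temporal-difference error plays the role of the internal reward passed to the actor.

```python
import numpy as np

class AHCAgent:
    """Minimal actor-critic (adaptive heuristic critic) sketch.

    The critic holds weights theta for a linear value estimate
    V(s) = phi(s)^T theta; the actor holds a separate weight vector w.
    The TD error r + gamma*V(s') - V(s) is the critic's output signal
    that trains the actor. All sizes and constants are illustrative.
    """

    def __init__(self, n_features, gamma=0.95):
        self.theta = np.zeros(n_features)  # critic (evaluator) weights
        self.w = np.zeros(n_features)      # actor weights
        self.gamma = gamma

    def value(self, phi):
        # Linear value estimate V(s) = phi(s)^T theta
        return phi @ self.theta

    def td_error(self, r, phi_t, phi_next):
        # Internal reward: delta = r + gamma * V(s') - V(s)
        return r + self.gamma * self.value(phi_next) - self.value(phi_t)
```

Because the TD error is available after every transition, both networks can be updated in single steps rather than waiting for the end of an episode, which is the property the text attributes to actor-critic methods.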
The final goal of the adaptive heuristic critic algorithm is to approach the optimal policy π* that maximizes the accumulated return shown in formula (1). The whole learning system is modeled as a Markov decision process denoted by the four-tuple {S, A, P, R}, where S is the state space, i.e., the set of states the reinforcement learning agent may be in; A is the action space, i.e., the set of all actions the agent can take while interacting with the environment; P is the state-action transition probability, i.e., the probability distribution over next states after taking a certain action in the current state; and R is the reward function, measuring the quality of the action currently taken.
In formula (1), π is a policy mapping states to probability distributions over the actions in the action space, Jπ is the expected overall accumulated return obtained by executing actions according to policy π, γ is the discount factor, and rt is the immediate reward obtained after executing the action at time t. Formula (2) defines the optimal value function V*(s) corresponding to the optimal policy; by the dynamic programming principle, the optimal value function V*(s) also satisfies the Bellman equation shown in formula (3). In formulas (2) and (3), V*(s) is the optimal value function corresponding to the optimal policy, Vπ(s) is the expected accumulated return under policy π, γ is the discount factor, rt is the immediate reward obtained after executing the action at time t, r(s, a) is the expected immediate reward obtained by taking action a in state s, and E[V*(s')] is the expected optimal value of the successor state s'.
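The formula images for (1)-(3) are not reproduced in this text. A reconstruction from the variable definitions above gives the standard forms (the original notation may differ in detail):

```latex
% Formula (1): expected discounted accumulated return and optimal policy
J_\pi = E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],
\qquad \pi^{*} = \arg\max_{\pi} J_\pi
% Formula (2): optimal value function
V^{*}(s) = \max_{\pi} E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right]
% Formula (3): Bellman optimality equation
V^{*}(s) = \max_{a \in A} \left[\, r(s,a) + \gamma\, E\!\left[V^{*}(s')\right] \right]
```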
In the adaptive heuristic critic algorithm, the value function represented by the evaluator network is approximately estimated by temporal-difference (TD) learning, using a linear function approximator over a fixed basis, i.e. V(st) = φ(st)Tθt, where φ(s) = (φ1(s), φ2(s), ..., φn(s))T is the linear function approximator comprising n basis functions and θt = (θ1, θ2, ..., θn)T is the weight vector at time t.
According to the temporal-difference learning principle, the weight update formula can be derived as:
θt+1 = θt + αt[rt + γV(st+1) - V(st)]et (4)
In formula (4), θt+1 is the weight vector at time t+1, θt is the weight vector at time t, rt is the immediate reward obtained after executing the action at time t, γ is the discount factor, V(st+1) is the value estimate output by the evaluator network at time t+1, V(st) is the value estimate output by the evaluator network at time t, and αt is the learning rate. et = [e1t, e2t, ..., ent]T is the eligibility trace vector, updated as et = γλet-1 + φ(st), where λ is the decay factor with value range (0,1) and φ(st) is the value of the n basis functions of the linear function approximator in state st.
In the actor network, the executed action output is jointly determined by the current state and the value estimate output by the evaluator network at that time, as shown in formula (5). In formula (5), p(at) is the probability of selecting the action at of time t, at is the action taken at time t, μt is the mean of at, and σt is the variance. The mean is μt = wTφ(st), where w = (w1, w2, ..., wM)T is the actor network weight vector and w1, w2, ..., wM are the weight values of the actor network's actual mapping layer. The variance σt is given by:
σt = b1/{1+exp[b2V(st)]} (6)
In formula (6), b1 and b2 are positive constants and V(st) is the value estimate output by the evaluator network at time t. Meanwhile, the policy gradient of the actor's policy network is approximately estimated by formula (7). In formula (7), Jπ is the expected overall accumulated return obtainable by executing actions according to policy π, θ is the weight, μt is the mean of at, at is the action taken at time t, and Δrt is the temporal-difference signal provided by the evaluator network, with Δrt = rt + γV(st+1) - V(st), where rt is the immediate reward, γ is the discount factor, V(st+1) and V(st) are the value estimates output by the evaluator network at times t+1 and t, and σt is the variance.
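The Gaussian action selection of formulas (5) and (6) can be sketched as follows. This is an illustrative implementation under our own naming: the mean is the actor's linear output over the feature vector, and the exploration variance shrinks as the critic's value estimate grows; the b1 and b2 values are arbitrary examples.

```python
import numpy as np

def select_action(phi, w, V_s, b1=1.0, b2=0.5, rng=None):
    """Sample a continuous action from the Gaussian policy of formulas
    (5)-(6): mean mu = w^T phi(s), variance sigma = b1/(1+exp(b2*V(s))).
    As the value estimate V(s) increases, sigma decreases, so
    exploration automatically narrows around learned good actions."""
    if rng is None:
        rng = np.random.default_rng()
    mu = w @ phi                              # action mean, formula (5)
    sigma = b1 / (1.0 + np.exp(b2 * V_s))     # variance schedule, formula (6)
    return rng.normal(mu, sigma), mu, sigma
```

The design choice here is the coupling of exploration to the critic: early in learning V(s) is small, sigma is near b1/2 or larger, and actions are spread widely; near convergence the policy becomes nearly deterministic.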
The adaptive heuristic critic algorithm is based on the actor-critic structure. The evaluator network evaluates the quality of policy actions, estimates the value function corresponding to the current policy, and finally approaches the optimal value function. The actor continuously approaches the optimal policy; its training and optimization depend on the immediate evaluation signal, i.e. the temporal-difference error, provided by the evaluator. Clearly, the evaluator plays an important role in the whole learning system: the precision and convergence speed of its value function approximation directly affect and determine the overall system performance. In the adaptive heuristic critic algorithm, the evaluator's learning uses the traditional linear TD(λ) algorithm, which suffers from insufficient data efficiency; moreover, the learning step size must be carefully designed for the particular problem, since too large or too small a step value adversely affects algorithm performance, easily causing oscillation, a lengthy convergence process, or even divergence.
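The traditional linear TD(λ) critic step of formula (4), whose step-size sensitivity is criticized here, can be sketched in a few lines. This is an illustrative implementation with assumed constants (alpha, gamma, lambda); it shows exactly where the hand-tuned step size alpha enters.

```python
import numpy as np

def td_lambda_update(theta, e, phi_t, phi_next, r,
                     alpha=0.1, gamma=0.95, lam=0.7):
    """One linear TD(lambda) step, matching formula (4):
    theta <- theta + alpha * [r + gamma*V(s') - V(s)] * e,
    with accumulating eligibility trace e <- gamma*lam*e + phi(s)."""
    e = gamma * lam * e + phi_t                                # trace update
    delta = r + gamma * (phi_next @ theta) - (phi_t @ theta)   # TD error
    theta = theta + alpha * delta * e                          # weight update
    return theta, e
```

Every sample is used once and discarded, and the fixed alpha must be tuned per problem; these are the two weaknesses that motivate the recursive least-squares variant adopted later in this document.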
The cerebellum continuously receives and stores, over time, the relevant information generated for controlling muscle coordination, providing precise coordinated control for the movements of living organisms' eyes, arms, fingers, legs, wings, and other parts. The cerebellar model neural network (also known as the cerebellar model articulation controller, CMAC) is a neural network that imitates the structure and working mechanism of the cerebellum, based on neurophysiology and memory mechanisms. Fig. 2 illustrates the basic CMAC structure, which mainly comprises four parts: the input layer S, the concept mapping layer A, the actual mapping layer W, and the output layer Y. The input layer S receives, as the model input, the input vector from high-level command arguments or sensor-perceived state information; through a layered, tiled quantization mapping rule, the input vector is transformed into the concept mapping space of layer A, which manifests as the activation of a specific group of internal memory blocks in layer A. Then, according to the address positions of the activated memory blocks in layer A, the corresponding actual response units stored in the actual mapping layer W and their respective weights are further activated, and finally the output layer Y outputs the weighted sum of the activated units. Compared with other kinds of neural networks (such as the globally approximating BP neural network), the advantages of the CMAC are mainly reflected in three aspects: first, CMAC learning is local, so each learning iteration updates only a fraction of the weights, learning is fast, and the amount of computation is small; second, the CMAC has good nonlinear function approximation ability and generalization ability and is insensitive to the order of the learning samples; third, the response weights activated by different inputs are somewhat sparse, so the CMAC can handle high-dimensional input problems.
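The tiled quantization mapping and local update described above can be sketched for a one-dimensional input. This is an illustrative CMAC, not the patent's: the number of tilings, tile counts, offsets, and the learning rate are all assumptions.

```python
import numpy as np

class CMAC:
    """Minimal 1-D CMAC sketch: n_tilings overlapping, offset tilings
    quantize the input (the concept mapping layer A); each tiling
    activates one weight cell (the actual mapping layer W), and the
    output layer sums the activated weights. Updates touch only the
    activated cells, which is the source of CMAC's fast local learning."""

    def __init__(self, n_tilings=4, n_tiles=16, lo=0.0, hi=1.0):
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.lo, self.width = lo, (hi - lo) / n_tiles
        self.w = np.zeros((n_tilings, n_tiles + 1))

    def active(self, x):
        # Each tiling is shifted by a fraction of one tile width.
        return [
            min(int((x - self.lo + t * self.width / self.n_tilings)
                    / self.width), self.n_tiles)
            for t in range(self.n_tilings)
        ]

    def predict(self, x):
        # Output layer Y: weighted sum of the activated units.
        return sum(self.w[t, i] for t, i in enumerate(self.active(x)))

    def update(self, x, target, beta=0.2):
        err = target - self.predict(x)
        for t, i in enumerate(self.active(x)):
            self.w[t, i] += beta * err / self.n_tilings  # local update only
```

Because only n_tilings weights change per step, an update costs O(n_tilings) regardless of the total table size, illustrating the "fast learning, small computation" advantage claimed in the text.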
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above problems in the prior art, to provide a continuous-action online learning control method and system for autonomous vehicles. The present invention proposes an adaptive heuristic critic method with deep cerebellar coding features: deep feature coding is used to solve the dimensionality-reduction coding problem of large-scale continuous state input, and on the basis of the deep coding features a heuristic critic learning method based on the cerebellar model neural network is adopted. This realizes efficient, rapid online learning control of continuous action spaces under large-scale continuous state input, shortens the learning period while guaranteeing the learning effect, allows the learning process to converge quickly to a well-performing control strategy, and has good data utilization.
In order to solve the above technical problem, the technical solution adopted by the present invention is as follows:
A continuous-action online learning control method for autonomous vehicles, whose implementation steps include:
1) obtaining the current perceptual image It;
2) encoding the perceptual image It through a deep coding network to obtain the coded state feature st;
3) inputting the coded state feature st into the evaluator network and the actor network of an actor-critic model respectively, where both the evaluator network and the actor network of the actor-critic model adopt the cerebellar model neural network;
4) outputting the action at through the actor network and updating the actor-critic model parameters through the evaluator network.
Preferably, the deep coding network used in step 2) is the HELM (hierarchical extreme learning machine) network model.
Preferably, the detailed steps of step 4) include:
4.1) inputting the coded state feature st into the actor network of the actor-critic model to obtain the output yt, where the output yt is the probability, computed by the actor network, of executing each action in the state st at time t;
4.2) selecting and outputting the action at of time t according to the distribution of the action probabilities;
4.3) inputting the action at of time t into the Markov decision environment model, observing and recording the state transition (st, st+1) from the state st at time t to the state st+1 at time t+1, and computing the reward rt = r(st, st+1) from time t to time t+1;
4.4) based on the reward rt = r(st, st+1) from time t to time t+1, updating the activated weight values Wc of the value function of the evaluator network using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
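The critic update of step 4.4 can be sketched as follows. This is an illustrative recursive least-squares TD(λ) implementation under our own naming, not the patent's exact algorithm; initialization constants and the episode handling (e.g. trace resets) are assumptions.

```python
import numpy as np

class RLSTD:
    """Recursive least-squares TD(lambda) critic sketch (step 4.4).
    Instead of a gradient step with a hand-tuned learning rate, an
    inverse correlation matrix P is propagated recursively; this removes
    the step-size choice and reuses each sample more efficiently."""

    def __init__(self, n, gamma=0.95, lam=0.7, delta=10.0):
        self.theta = np.zeros(n)     # value function weights
        self.P = delta * np.eye(n)   # inverse correlation estimate
        self.e = np.zeros(n)         # eligibility trace
        self.gamma, self.lam = gamma, lam

    def update(self, phi_t, phi_next, r):
        self.e = self.gamma * self.lam * self.e + phi_t
        d = phi_t - self.gamma * phi_next              # TD feature difference
        k = self.P @ self.e / (1.0 + d @ self.P @ self.e)  # RLS gain
        self.theta = self.theta + k * (r - d @ self.theta) # weight update
        self.P = self.P - np.outer(k, d @ self.P)          # propagate P
```

Compared with the plain TD(λ) rule of formula (4), the gain vector k here adapts per feature and per step, which is the data-efficiency property the invention relies on.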
Preferably, the function expression by which step 4.2) selects the action at of time t according to the distribution of the action probabilities is shown in formula (5). In formula (5), p(at) is the probability of selecting the action at of time t, at is the action taken at time t, μt is the action mean, σt is the variance, μt = wTφ(st), and w = (w1, w2, ..., wM)T is the weight vector of the actor network, where w1, w2, ..., wM are the weight values of the actor network's actual mapping layer and st is the state at time t.
Preferably, the function expression of the variance σt is shown in formula (6):
σt = b1/{1+exp[b2V(st)]} (6)
In formula (6), b1 and b2 are positive constants, V(st) is the value estimate output by the evaluator network at time t, and st is the state at time t.
Preferably, the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10). In formula (10), wt+1 is the activated weight value with which the actor network outputs the action at+1 at time t+1, wt is the activated weight value with which the actor network outputs the action at at time t, β is the learning rate, and Jπ is the expected overall accumulated return obtainable by executing actions according to policy π.
The present invention also provides a continuous-action online learning control system for autonomous vehicles, comprising computer equipment programmed to perform the steps of the aforementioned continuous-action online learning control method for autonomous vehicles, or whose storage medium stores a computer program programmed to perform the aforementioned continuous-action online learning control method for autonomous vehicles.
Compared with the prior art, the present invention has the following advantages: the present invention uses deep feature coding to solve the dimensionality-reduction coding problem of large-scale continuous state input. The perceptual image It is encoded through the deep coding network to obtain the coded state feature st; the coded state feature st is input into the evaluator network and the actor network of the actor-critic model respectively, both of which adopt the cerebellar model neural network; the action at is output through the actor network and the actor-critic model parameters are updated through the evaluator network. Therefore, on the basis of the deep coding features, the heuristic critic learning method based on the cerebellar model articulation controller (CMAC) realizes online learning control of continuous action spaces under large-scale continuous state input, shortens the learning period while guaranteeing the learning effect, allows the learning process to converge quickly to a well-performing control strategy, and has good data utilization.
Brief description of the drawings
Fig. 1 is the structural schematic diagram of the existing actor-critic model.
Fig. 2 is the structural schematic diagram of the existing cerebellar model neural network.
Fig. 3 is the schematic diagram of the method of the embodiment of the present invention.
Fig. 4 is the schematic diagram of the HELM network model used in the embodiment of the present invention.
Fig. 5 is the schematic diagram of the Acrobot learning control simulation environment in the embodiment of the present invention.
Fig. 6 is the training root-mean-square error curve of the hierarchical extreme learning coding network under different network parameters in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 7 is the Acrobot image reconstruction result of the stacked sparse-coding random neural network in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 8 is the Acrobot learning control performance comparison curve in the embodiment of the present invention.
Fig. 9 shows the Acrobot pendulum angle and torque curves under the final stable strategy control in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 10 is the schematic diagram of the Mountain Car learning control simulation environment in the embodiment of the present invention.
Fig. 11 is the training root-mean-square error curve of the hierarchical extreme learning coding network under different network parameters in the Mountain Car learning control simulation of the embodiment of the present invention.
Fig. 12 is the Mountain Car image reconstruction result of the stacked sparse-coding random neural network in the Mountain Car learning control simulation of the embodiment of the present invention.
Fig. 13 is the Mountain Car learning control performance comparison curve in the embodiment of the present invention.
Fig. 14 is the control effect curve of the stable strategy obtained in the Mountain Car learning control simulation of the embodiment of the present invention.
Specific embodiments
As shown in Fig. 3, the implementation steps of the continuous-action online learning control method for autonomous vehicles of this embodiment include:
1) obtaining the current perceptual image It;
2) encoding the perceptual image It through a deep coding network to obtain the coded state feature st;
3) inputting the coded state feature st into the evaluator network (cerebellar model neural network value function network) and the actor network (cerebellar model neural network policy network) of the actor-critic model respectively, where both the evaluator network and the actor network adopt the cerebellar model neural network;
4) outputting the action at through the actor network and updating the actor-critic model parameters through the evaluator network.
As shown in Fig. 4, the depth coding network used in step 2) is a HELM (hierarchical extreme learning machine) network model.
As shown in Fig. 4, the HELM coding network consists of a stacked random autoencoder feature encoder and a least-squares random neural network classifier/regressor, and converts the high-dimensional original visual image into a low-dimensional coding feature vector.
The stacked random autoencoder feature encoder is formed by stacking multiple sparse random autoencoders. The training process of the HELM coding network includes two relatively independent stages: unsupervised layer-wise feature learning of the stacked random autoencoder network, and supervised feature regression learning of the random neural network regressor based on the extreme learning machine. As the number of autoencoders increases, the encoded features become more compact, more abstract, and more semantically high-level. The learning optimization of a single sparse random autoencoder in the HELM coding network is based on the l1 norm, as shown in formula (8);
In formula (8), H is the hidden-layer random projection output; Y is the input data, which may be the original visual image I_input or the coding output of the previous random autoencoder; β is the output weight (which can be solved by the existing FISTA algorithm); and the l1 norm denotes the sum of the absolute values of the elements of a vector. The output of the i-th sparse random autoencoder is usually expressed as shown in formula (9);
In formula (9), T_i is the output of the i-th sparse random autoencoder (with T_1 computed from the original visual image I_input), T_{i-1} is the output of the (i-1)-th autoencoder, β_i is the output weight of the i-th autoencoder, f(·) is the activation function of the hidden nodes, which may be the sigmoid function, the hyperbolic tangent function, etc., and k is the total number of sparse random autoencoders in the stacked random autoencoder network.
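As a sketch, one sparse random autoencoder of formula (8) can be trained by drawing a fixed random hidden projection H and solving the l1-regularized output weight β iteratively. The plain ISTA loop below stands in for the FISTA solver mentioned in the text; the layer sizes, penalty, and iteration count are illustrative.

```python
import numpy as np

def sparse_random_ae_layer(Y, n_hidden, l1=1e-3, n_iter=200, seed=0):
    """One sparse random autoencoder: min_b ||H b - Y||^2 + l1 * |b|_1,
    where H is a fixed random hidden projection of the input Y.
    The encoded output follows the shape of formula (9): T = f(Y b^T)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (Y.shape[1], n_hidden))
    bias = rng.uniform(-1.0, 1.0, n_hidden)
    H = np.tanh(Y @ W + bias)                    # hidden random projection
    step = 1.0 / (np.linalg.norm(H, 2) ** 2)     # 1 / Lipschitz constant
    b = np.zeros((n_hidden, Y.shape[1]))
    for _ in range(n_iter):                      # ISTA: gradient + soft-threshold
        z = b - step * (H.T @ (H @ b - Y))
        b = np.sign(z) * np.maximum(np.abs(z) - step * l1, 0.0)
    return b, np.tanh(Y @ b.T)                   # output weight and encoding
```

Stacking k such layers, with each layer's encoding fed to the next, reproduces the unsupervised layer-wise stage described above.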
After the perceptual image I_t is encoded through the depth coding network to obtain the encoded state feature s_t, the encoded state feature s_t is input into the evaluator network and the actor network of the actor-evaluator model respectively; the action a_t can then be output through the actor network and the actor-evaluator model parameters updated through the evaluator network, performing the corresponding policy evaluation and policy improvement.
In this embodiment, the evaluator network of the actor-evaluator model adopts the cerebellar model articulation controller. When used for policy evaluation, it mainly involves three processes: network initialization, state mapping and coding, and weight activation and learning.
S1, network initialization.
In the network initialization process, the number of tiling layers N_tiling, the number of partitions N_p, the number of physical memory cells K, and the network input (i.e. input vector) dimension D_s need to be set. Then, according to the value range of each component of the input vector and the number of partitions N_p, each of the N_tiling tiling layers is divided at equal intervals with a per-layer offset, so that each tiling layer is divided into N_p tiles. The partition interval Δ_interval and the offset Δ_offset are determined by formulas (10) and (11) respectively.
Δ_interval = (R_max - R_min)/N_p (10)
Δ_offset = Δ_interval/N_tiling (11)
In formulas (10) and (11), R_max is the maximum value and R_min the minimum value of the value range of each component of the input vector.
The weights of the evaluator cerebellar model neural network are initialized to a K × 1 zero vector.
S2, state mapping and coding.
During state mapping and coding, each component value of an input vector falls into a certain tile in each tiling layer; the input vector is then said to activate that tile in each tiling layer, and the index a of the activated tile is determined by formula (12):
a = ceil((s - Δ_offset)/Δ_interval) (12)
where s is the input vector and ceil(·) is the round-up operation.
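Formulas (10)-(12) can be illustrated with a small sketch. The per-layer shift `layer * d_offset` is an assumption about how the one-by-one offset of the tiling layers is applied, since the text gives only the base offset Δ_offset.

```python
import math

def tile_layout(r_min, r_max, n_p, n_tiling):
    """Partition interval and offset, formulas (10) and (11)."""
    d_interval = (r_max - r_min) / n_p       # formula (10)
    d_offset = d_interval / n_tiling         # formula (11)
    return d_interval, d_offset

def activated_tile(s, d_interval, d_offset, layer):
    """Index of the tile activated in one tiling layer, formula (12),
    with each layer assumed to be shifted by layer * d_offset."""
    return math.ceil((s - layer * d_offset) / d_interval)
```

With N_p = 4 and N_tiling = 4 on the range [0, 1], for example, the interval is 0.25 and the offset 0.0625, and the same input activates a different tile index in each shifted layer.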
Further, the tiles activated by the input vector are coding-mapped according to the hashing principle.
F(s) = A(s) mod K (14)
where A(s) is the conceptual memory address of the activated cerebellar network weights, a(i) is the index of the tile activated in each tiling layer by the i-th component of the input vector, 0 ≤ a(i) ≤ N_p, and F(s) is the actual physical memory address of the weights activated by the input vector.
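A minimal sketch of the hash mapping F(s) = A(s) mod K of formula (14). The text specifies only the final modulo step; composing the conceptual address A(s) from the per-dimension tile indices a(i) as a base-(N_p + 1) number is an assumption made for illustration.

```python
def hash_address(active_indices, n_p, K):
    """Fold the activated tile indices a(i), 0 <= a(i) <= N_p, into one of
    K physical memory cells: F(s) = A(s) mod K (formula (14))."""
    a_s = 0
    for a_i in active_indices:          # conceptual address A(s), assumed
        a_s = a_s * (n_p + 1) + a_i     # base-(N_p + 1) composition
    return a_s % K                      # physical address F(s)
```

The modulo step is what keeps the physical memory at K cells even though the conceptual tile space grows exponentially with the input dimension.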
S3, weight activation and learning.
From the physical memory address F(s) of the activated weights, the activated components of the weight vector of the evaluator cerebellar model neural network can be determined, and the activated weight components are updated according to the recursive least-squares TD(λ) algorithm (corresponding to step 4.4; details are given below). This completes the initialization, state mapping and coding, and weight activation and learning of the evaluator network based on the cerebellar model neural network.
In this embodiment, the actor network of the actor-evaluator model also adopts the cerebellar model articulation controller. Similar to the evaluator network, when used for policy improvement it mainly involves the three processes of network initialization, state mapping and coding, and weight activation and learning. Its difference from the evaluator cerebellar model neural network lies mainly in the weight learning process: once the activated weights are determined, the corresponding weights are updated based on the gradient descent principle.
In this embodiment, the detailed steps of step 4) include:
4.1) inputting the encoded state feature s_t into the actor network of the actor-evaluator model to obtain the output y_t, where the output y_t is the probability, computed by the actor network, of executing each action in the state s_t at time t;
4.2) selecting and outputting the action a_t at time t according to the distribution of the probabilities of the actions;
4.3) inputting the action a_t at time t into the Markov decision environment model, observing and recording the state transition (s_t, s_{t+1}) from the state s_t at time t to the state s_{t+1} at time t+1, and computing the return value r_t = r(s_t, s_{t+1}) from time t to time t+1;
4.4) based on the return value r_t = r(s_t, s_{t+1}) from time t to time t+1, updating the activated weight values W_c of the evaluator network value function using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
In this embodiment, the function expression by which step 4.2) selects the action a_t at time t according to the distribution of the action probabilities is shown in formula (5);
In formula (5), p(a_t) is the probability of selecting the action a_t at time t, a_t is the action taken at time t, the action mean is as defined in formula (5), σ_t denotes the variance, and W = (w_1, w_2, ..., w_M)^T is the weight of the actor network, where w_1, w_2, ..., w_M are the actual mapping-layer weight values of the actor network and s_t is the state at time t.
In this embodiment, the function expression of the variance σ_t is shown in formula (6);
σ_t = b_1/{1 + exp[b_2 V(s_t)]} (6)
In formula (6), b_1 and b_2 are positive constants, V(s_t) is the value function estimate output by the evaluator network at time t, and s_t is the state at time t.
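Formulas (5)-(6) amount to drawing the action from a Gaussian whose exploration variance shrinks as the evaluator's value estimate V(s_t) grows. A minimal sketch, with b1 and b2 as illustrative constants:

```python
import math
import random

def select_action(action_mean, v_est, b1=3.0, b2=1.0, rng=random):
    """Sample a_t ~ N(mean, sigma_t^2) with state-dependent variance
    sigma_t = b1 / (1 + exp(b2 * V(s_t))), as in formula (6)."""
    sigma_t = b1 / (1.0 + math.exp(b2 * v_est))
    return rng.gauss(action_mean, sigma_t), sigma_t
```

The higher the estimated value of the current state, the smaller sigma_t, so exploration decays automatically as the evaluator becomes confident in its policy evaluation.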
In this embodiment, the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10);
In formula (10), the left-hand side is the activated weight value with which the actor network outputs the action a_{t+1} at time t+1, the first right-hand term is the activated weight value with which the actor network outputs the action a_t at time t, β is the learning rate, and J_π is the expected overall cumulative return obtained by executing actions according to the policy π.
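The actor update of formula (10) is a plain gradient step on the activated weights with learning rate β. A sketch (the ascent sign on the expected return J_π is an assumption; the text states only that the update follows the gradient principle):

```python
def actor_weight_update(w_active, beta, grad_j):
    """w_{t+1} = w_t + beta * dJ_pi/dw for each activated weight component;
    grad_j is the gradient of the expected return J_pi w.r.t. each weight."""
    return [w + beta * g for w, g in zip(w_active, grad_j)]
```

Only the components activated by the current input (as determined in S2/S3 above) are touched; the rest of the K-cell weight memory is left unchanged.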
In this embodiment, for a cerebellar model neural network with N_tiling tiling layers and N_p partition intervals per input, the required physical memory is given by the expression shown (with D_s the dimension of the input vector), and the physical memory addresses of the input and its corresponding activated weights are hash (Hashing) coded according to formulas (11) and (12).
F(s) = A(s) mod K (12)
In formulas (11) and (12), A(s) is the conceptual memory address of the activated cerebellar network weights, a(i) is the index of the tile activated by the i-th component of the input vector, 0 ≤ a(i) ≤ N_p; N_p is the number of state input partitions and D_s is the dimension of the input vector; F(s) is the actual physical memory address corresponding to the input vector s, K is the total number of physical memory cells, and s is the input vector.
In this embodiment, step 4.4) updates the activated weight values W_c of the evaluator network value function using the recursive least-squares TD(λ) algorithm, based on the return value r_t = r(s_t, s_{t+1}) from time t to time t+1. Recursive least-squares TD(λ) is an existing weight update algorithm for evaluator networks; it has high data utilization, good function approximation, and faster convergence than the traditional linear TD(λ) algorithm. Using recursive least-squares TD(λ) to approximate the optimal value function for the evaluator therefore helps improve the policy evaluation performance of the evaluator network and, in turn, the overall learning control effect of the algorithm.
The function formulas of the recursive least-squares TD(λ) algorithm are as follows:
In formulas (13)-(15), P_0 = δI, where δ is a positive real number and I is the identity matrix; θ_t = (θ_1, θ_2, ..., θ_n)^T is the weight vector of the linear function approximator with fixed basis functions, and θ_{t+1} is the value of θ_t after one iterative update; e_t is the eligibility trace vector; P_t and P_{t+1} are the iteration matrices at times t and t+1, and K_{t+1} is the gain matrix at time t+1; μ is the forgetting factor, with value range (0, 1]; φ(s_t) and φ(s_{t+1}) are the fixed-basis function approximator values under the states s_t and s_{t+1}; γ is the discount factor, and r_t is the immediate reward at time t.
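The equation images for formulas (13)-(15) are not reproduced in the source text, so the sketch below follows the standard recursive least-squares TD(λ) recursion (eligibility trace, gain, weight update, and iteration matrix with forgetting factor μ) matching the symbol list given above; treat it as an assumed reconstruction rather than the exact equations of the embodiment.

```python
import numpy as np

def rls_td_step(theta, P, e, phi_t, phi_next, r_t,
                gamma=0.9, lam=0.6, mu=1.0):
    """One assumed RLS-TD(lambda) step: returns updated (theta, P, e)."""
    e = gamma * lam * e + phi_t              # eligibility trace e_t
    d = phi_t - gamma * phi_next             # TD feature difference
    K = P @ e / (mu + d @ P @ e)             # gain K_{t+1}
    theta = theta + K * (r_t - d @ theta)    # weight update theta_{t+1}
    P = (P - np.outer(K, d @ P)) / mu        # iteration matrix P_{t+1}
    return theta, P, e
```

With P_0 = δI and δ large, a single step already moves the predicted value d·θ a large fraction of the way toward the observed reward, which is the source of the high data utilization noted above.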
To verify the continuous action online learning control method for an autonomous vehicle of this embodiment, simulation experiments and analysis were carried out on a computer configured with an 8-core Intel(R) Xeon(R) E5-2643 CPU (3.30 GHz), 12 GB DDR4 memory, Ubuntu 14.04, and MATLAB 2014a.
One, Acrobot learning control simulation verification experiment.
The goal of Acrobot learning control is to swing the Acrobot up to a specified height in the shortest time. As a representative adaptive optimal control problem, research on the Acrobot problem dates back to 1996, and learning control under both discrete and continuous action spaces has been studied accordingly. The dynamic model of the Acrobot learning control system is given by formulas (16)-(22):
φ_1 = φ_11 - φ_12 (19)
φ_11 = (m_1 l_c1 + m_2 l_1) g cos(θ_1 - π/2) + φ_2 (20)
φ_2 = m_2 l_c2 g cos(θ_1 + θ_2 - π/2) (22)
In formulas (16)-(22), θ_i and its first and second time derivatives correspond respectively to the angle, angular velocity, and angular acceleration, with θ_i ∈ [-π, π]. The variables m_i, l_i, l_ci, I_i are respectively the mass, length, center-of-mass length, and moment of inertia of the i-th (i = 1, 2) link; the remaining variables are intermediate quantities. In the simulation experiment of this embodiment, m_1 = m_2 = 1 kg, l_1 = l_2 = 1 m, l_c1 = l_c2 = 0.5 m, I_1 = I_2 = 1 kg·m², and the simulation time step is 0.05 s. Referring to Fig. 5, in the Acrobot learning control simulation environment, OP_1 is the first link and P_1P_2 is the second link. Torque is applied at point P_1. When the end P_2 reaches a height of one link length above point O, the Acrobot is considered to be controlled successfully.
For the Acrobot learning control simulation verification experiment, the parameter settings of the continuous action online learning control method for an autonomous vehicle of this embodiment are λ = 0.6, γ = 0.9, β = 0.2, k_1 = 0.4, k_2 = 0.5. In the actor network and evaluator network based on the cerebellar model articulation controller, the number of tiling layers N_tiling = 4, the number of interval partitions N_p = 7, and the physical memory capacities are 100 and 80 respectively. The state is a four-tuple of the link angles and angular velocities, and the control torque τ is a continuous quantity in [-3 N, 3 N]. The reward function is given as follows:
In formula (23), s_G denotes the goal state. Each trial starts from the state [0, 0, 0, 0] and ends when the link reaches the preset height or the maximum number of control steps is exceeded, at which point the state is reinitialized and the corresponding control trajectory is cleared.
The stacked sparse coding random neural network used consists of two extreme learning autoencoders and an extreme learning regressor. The initial weights of the nodes are generated by a random uniform distribution on the interval [-1, 1]. The training dataset contains 12701 grayscale snapshot images of the Acrobot learning control simulation environment. Each image is cropped and scaled to a size of 48 × 48. In order to learn a feature coding network oriented to the control task, the label values for training the hierarchical extreme learning coding network are the angle information [θ_1, θ_2] provided by the Acrobot learning control simulation system. The activation function of the network nodes is the hyperbolic tangent function. Since the output of the hierarchical extreme learning coding network is derived from a single-frame snapshot image of the learning control simulation environment and thus lacks motion information, the difference between the coding of the current frame image and that of the previous frame is used as an estimate of the velocity information to supplement the coding output of the hierarchical extreme learning coding network. To determine suitable parameters of the hierarchical extreme learning coding network, i.e. the number of hidden nodes and the regularization coefficient of each extreme learning autoencoder and of the extreme learning regressor, different candidate values of each variable are tried while keeping the other variables fixed, and the group of parameter values yielding the smallest training error is finally chosen. Since the input weights, biases, etc. of the hierarchical extreme learning coding network are randomly generated, the final result is the average of 5 trials.
The training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters are shown in Fig. 6, where Fig. 6(a) shows the training root-mean-square error of the extreme learning regressor under different numbers of hidden nodes; Fig. 6(b) shows that of the first extreme learning autoencoder under different numbers of hidden nodes; Fig. 6(c) shows that of the second extreme learning autoencoder under different numbers of hidden nodes; and Fig. 6(d) shows the training root-mean-square error under different regularization coefficient values.
Fig. 6 shows how the training root-mean-square error changes with the numbers of hidden nodes N_1 and N_2 of the extreme learning autoencoders, the number of hidden nodes N_3 of the extreme learning regressor, and the regularization coefficient C. Clearly, compared with N_1 and N_2, the training effect is more sensitive to the values of N_3 and C. As N_3 increases, the training root-mean-square error first decreases substantially, then flattens when N_3 reaches about 3000, converging at about 0.03. Since no iterative gradient updates are needed, the entire training process takes about 23.6 seconds, significantly shorter than the training duration of a deep neural network.
Fig. 7 shows the reconstruction results of the coding network for different original input image data, where columns (a) and (c) are the original input images and columns (b) and (d) are the corresponding reconstructed images. It can be seen that the reconstructed images are largely consistent with the original images, and the coding network can encode the significant feature information in the original images.
Then, the coding output of the trained hierarchical extreme learning coding network is fed as input to the actor network and evaluator network based on the cerebellar model articulation controller for reinforcement learning. Learning performance is measured by the number of trials required to acquire a stable strategy and the number of steps the learned strategy takes to swing the Acrobot up to the success-state height.
Referring to Fig. 8 and Fig. 9, Fig. 8 gives the learning performance comparison between the proposed algorithm and related typical algorithms (Fast-AHC and SARSA-Q learning). In the test phase, each method was run independently 10 times, and the curve of the number of steps required for successful control versus the number of trials was recorded; the curves in the figure show the results averaged over the 10 runs. When the number of control steps required to successfully control the Acrobot no longer changes significantly, the learning algorithm is considered to have basically converged and learned a stable control strategy. The comparison shows that the proposed algorithm and the Fast-AHC method both learn a stable strategy in about 70 trials on average, better than the SARSA-Q algorithm, which needs an average of 80 trials; this should benefit from the efficient use of data information by the recursive least-squares TD(λ) algorithm applied in the evaluator network based on the cerebellar model articulation controller. At the same time, the proposed algorithm acquires a better strategy than the Fast-AHC method: as can be seen from the figure, the stable control strategy learned by the proposed algorithm can swing the Acrobot to the success-state height in 47 steps, while the strategy learned by the Fast-AHC method needs 62 steps and the SARSA-Q learning algorithm needs 110 steps. Fig. 9 shows the corresponding variation of the pendulum angle and torque under the stable control strategy learned by the proposed algorithm.
Two, Mountain Car learning control simulation verification experiment.
Mountain Car learning control is also a typical benchmark problem for evaluating reinforcement learning algorithms. As shown in Fig. 10, the problem aims to drive the car in the valley to a specified goal position outside the valley with the fewest steps. The driving force of the car is limited; it can only travel back and forth in the valley until it accumulates enough momentum to be able to drive out and reach the destination.
The kinetic model of the Mountain Car system is given by:
In formula (24), Δt = 0.01 s is the time interval, F is the driving force of the car engine (a continuous value) with value range [-0.2 N, 0.2 N], m_c = 0.02 kg is the car mass, and g = 9.8 m/s².
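Formula (24) itself appears only as an image in the source; the Euler step below uses the standard continuous mountain-car hill term -g·cos(3p) as an assumed reconstruction consistent with the constants stated above (Δt = 0.01 s, m_c = 0.02 kg, g = 9.8 m/s², F ∈ [-0.2 N, 0.2 N]).

```python
import math

DT, M_C, G = 0.01, 0.02, 9.8          # constants stated in the text

def mountain_car_step(p, v, force):
    """One Euler step of an assumed Mountain Car model (cf. formula (24)):
    position p, velocity v, bounded engine force, -g*cos(3p) hill term."""
    force = max(-0.2, min(0.2, force))            # bounded driving force
    v_next = v + DT * (force / M_C - G * math.cos(3.0 * p))
    p_next = p + DT * v_next
    return p_next, v_next
```

Under these constants, full forward force at the valley bottom barely exceeds the gravity term, which illustrates why the car must rock back and forth to build up momentum.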
For the Mountain Car learning control simulation verification experiment, in the actor network and evaluator network based on the cerebellar model articulation controller of this embodiment, the parameter settings are λ = 0.8, γ = 0.98, β = 0.02, k_1 = 0.4, k_2 = 0.6; the number of tiling layers N_tiling = 3, the number of interval partitions N_p = 6, and the physical memory capacities are 50 and 30 respectively. The state is a two-tuple of car position and velocity. Since the control goal is to drive the car out of the valley as soon as possible, the reward function is defined as follows:
In formula (25), s_g denotes the goal state.
The hierarchical extreme learning coding network model used is consistent with that in the Acrobot learning control experiment.
The training dataset contains 1000 grayscale snapshot images of the Mountain Car learning control simulation environment. Each image is cropped and scaled to a size of 20 × 60. The car position p_t provided by the Mountain Car simulation environment is used as the label value for training the hierarchical extreme learning coding network model. The initial weights of the network nodes are generated by a random uniform distribution on the interval [-1, 1], and the activation function is the hyperbolic tangent function. As in the Acrobot learning control experiment, the network coding output is supplemented with the frame-to-frame coding difference to add a dimension. The overall average training time is 2.89 s.
The training effect of the hierarchical extreme learning coding network for the Mountain Car learning control problem (see Fig. 11) is broadly similar to the Acrobot problem; since the state space of the Mountain Car problem is relatively simple, a small root-mean-square training error can be obtained after choosing suitable network parameters. Fig. 11 shows the training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters, where: Fig. 11(a) is the training root-mean-square error of the extreme learning regressor under different numbers of hidden nodes; Fig. 11(b) is that of the first extreme learning autoencoder under different numbers of hidden nodes; Fig. 11(c) is that of the second extreme learning autoencoder under different numbers of hidden nodes; and Fig. 11(d) is the training root-mean-square error under different regularization coefficient values. Fig. 12 shows the reconstruction results of the coding network for different original input image data, where Fig. 12(a) shows the original input images and Fig. 12(b) the corresponding reconstructed images. Similar to the Acrobot problem, the reconstructed images are largely consistent with the original images, and the coding network can encode the significant feature information in the original images.
As shown in Fig. 13, the learning performance of the continuous action online learning control method for an autonomous vehicle of this embodiment is compared with related typical algorithms (Fast-AHC and SARSA-Q learning). Similarly, in the test phase each method was run independently 10 times, and the curve of the number of steps required for successful control versus the number of trials was recorded; the curves show the results averaged over the 10 runs. Although the strategy learned by the Fast-AHC method can drive the car out of the valley with the fewest steps (216 steps; the proposed algorithm needs 226 steps and SARSA-Q learning needs 255 steps), the continuous action online learning control method for an autonomous vehicle of this embodiment obtains a stable control strategy more quickly. The proposed algorithm converges to a stable strategy of good performance in about 80 trials, while the Fast-AHC method and SARSA-Q learning require 150 and 240 trials respectively. This again demonstrates the high data utilization and excellent learning ability of the continuous action online learning control method for an autonomous vehicle of this embodiment. Fig. 14 shows the corresponding variation of the car's position, speed, and engine driving force under the stable control strategy learned by the continuous action online learning control method for an autonomous vehicle of this embodiment.
In summary, the continuous action online learning control method for an autonomous vehicle of this embodiment uses depth coding features and a fast heuristic actor-evaluator reinforcement learning method based on the cerebellar model articulation controller, applied to learning control problems. By introducing depth coding features, the curse of dimensionality in reinforcement learning is effectively avoided, and optimal control strategy learning based on image (high-dimensional state) input is realized. Simulation results on classic learning control problems show that the continuous action online learning control method for an autonomous vehicle of this embodiment can successfully complete vision-based learning control tasks and, compared with related methods at home and abroad, converges faster while learning a comparable or better control strategy. In addition, this embodiment also provides a continuous action online learning control system for an autonomous vehicle, including computer equipment programmed to perform the steps of the aforementioned continuous action online learning control method for an autonomous vehicle, or having stored on its storage medium a computer program programmed to perform the aforementioned continuous action online learning control method for an autonomous vehicle.
The above is only a preferred embodiment of the present invention; the protection scope of the present invention is not limited to the above embodiment, and all technical solutions falling under the concept of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as within the protection scope of the present invention.
Claims (7)
1. A continuous action online learning control method for an autonomous vehicle, characterized in that the implementation steps include:
1) obtaining the current perceptual image I_t;
2) encoding the perceptual image I_t through a depth coding network to obtain the encoded state feature s_t;
3) inputting the encoded state feature s_t into the evaluator network and the actor network of an actor-evaluator model respectively, the evaluator network and the actor network of the actor-evaluator model both adopting the cerebellar model articulation controller;
4) outputting the action a_t through the actor network and updating the actor-evaluator model parameters through the evaluator network.
2. The continuous action online learning control method for an autonomous vehicle according to claim 1, characterized in that the depth coding network used in step 2) is a HELM network model.
3. The continuous action online learning control method for an autonomous vehicle according to claim 1, characterized in that the detailed steps of step 4) include:
4.1) inputting the encoded state feature s_t into the actor network of the actor-evaluator model to obtain the output y_t, where the output y_t is the probability, computed by the actor network, of executing each action in the state s_t at time t;
4.2) selecting and outputting the action a_t at time t according to the distribution of the probabilities of the actions;
4.3) inputting the action a_t at time t into the Markov decision environment model, observing and recording the state transition (s_t, s_{t+1}) from the state s_t at time t to the state s_{t+1} at time t+1, and computing the return value r_t = r(s_t, s_{t+1}) from time t to time t+1;
4.4) based on the return value r_t = r(s_t, s_{t+1}) from time t to time t+1, updating the activated weight values W_c of the evaluator network value function using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
4. The continuous action online learning control method for an autonomous vehicle according to claim 3, characterized in that the function expression by which step 4.2) selects the action a_t at time t according to the distribution of the action probabilities is shown in formula (5);
In formula (5), p(a_t) is the probability of selecting the action a_t at time t, a_t is the action taken at time t, the action mean is as defined in formula (5), σ_t denotes the variance, and W = (w_1, w_2, ..., w_M)^T is the weight of the actor network, where w_1, w_2, ..., w_M are the actual mapping-layer weight values of the actor network and s_t is the state at time t.
5. The continuous action online learning control method for an autonomous vehicle according to claim 4, characterized in that the function expression of the variance σ_t is shown in formula (6);
σ_t = b_1/{1 + exp[b_2 V(s_t)]} (6)
In formula (6), b_1 and b_2 are positive constants, V(s_t) is the value function estimate output by the evaluator network at time t, and s_t is the state at time t.
6. The continuous action online learning control method for an autonomous vehicle according to claim 3, characterized in that the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10);
In formula (10), the left-hand side is the activated weight value with which the actor network outputs the action a_{t+1} at time t+1, the first right-hand term is the activated weight value with which the actor network outputs the action a_t at time t, β is the learning rate, and J_π is the expected overall cumulative return obtained by executing actions according to the policy π.
7. A continuous action online learning control system for an autonomous vehicle, including computer equipment, characterized in that: the computer equipment is programmed to perform the steps of the continuous action online learning control method for an autonomous vehicle according to any one of claims 1 to 6; or a computer program programmed to perform the continuous action online learning control method for an autonomous vehicle according to any one of claims 1 to 6 is stored in the storage medium of the computer equipment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910217492.2A CN109948781A (en) | 2019-03-21 | 2019-03-21 | Continuous action online learning control method and system for automatic driving vehicle |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109948781A true CN109948781A (en) | 2019-06-28 |
Family
ID=67010531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910217492.2A Pending CN109948781A (en) | 2019-03-21 | 2019-03-21 | Continuous action online learning control method and system for automatic driving vehicle |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948781A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004068399A1 (en) * | 2003-01-31 | 2004-08-12 | Matsushita Electric Industrial Co. Ltd. | Predictive action decision device and action decision method |
CN105690392A (en) * | 2016-04-14 | 2016-06-22 | 苏州大学 | Robot motion control method and device based on actor-critic method |
CN107346138A (en) * | 2017-06-16 | 2017-11-14 | 武汉理工大学 | Lateral control method for an unmanned surface vehicle based on a reinforcement learning algorithm |
CN107610692A (en) * | 2017-09-22 | 2018-01-19 | 杭州电子科技大学 | Speech recognition method based on neural-network stacked autoencoder multi-feature fusion |
CN108038545A (en) * | 2017-12-06 | 2018-05-15 | 湖北工业大学 | Fast learning algorithm for continuous control based on Actor-Critic neural networks |
US20190035275A1 (en) * | 2017-07-28 | 2019-01-31 | Toyota Motor Engineering & Manufacturing North America, Inc. | Autonomous operation capability configuration for a vehicle |
Non-Patent Citations (2)
Title |
---|
YUJUN ZENG ET AL.: "Evolutionary Hierarchical Sparse Extreme Learning Autoencoder Network for Object Recognition", 《SYMMETRY》 * |
XU XIN: "Reinforcement Learning and Its Application in Mobile Robot Navigation and Control", 《China Excellent Doctoral and Master's Theses Full-text Database (Doctoral), Information Science and Technology》 *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826145A (en) * | 2019-09-09 | 2020-02-21 | 西安工业大学 | Automobile multi-parameter operation condition design method based on heuristic Markov chain evolution |
CN110598309A (en) * | 2019-09-09 | 2019-12-20 | 电子科技大学 | Hardware design verification system and method based on reinforcement learning |
CN110716562A (en) * | 2019-09-25 | 2020-01-21 | 南京航空航天大学 | Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning |
CN110956148A (en) * | 2019-12-05 | 2020-04-03 | 上海舵敏智能科技有限公司 | Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium |
CN110956148B (en) * | 2019-12-05 | 2024-01-23 | 上海舵敏智能科技有限公司 | Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium |
CN111136659B (en) * | 2020-01-15 | 2022-06-21 | 南京大学 | Mechanical arm action learning method and system based on third person scale imitation learning |
CN111136659A (en) * | 2020-01-15 | 2020-05-12 | 南京大学 | Mechanical arm action learning method and system based on third person scale imitation learning |
CN111222630A (en) * | 2020-01-17 | 2020-06-02 | 北京工业大学 | Autonomous driving rule learning method based on deep reinforcement learning |
CN112052956A (en) * | 2020-07-16 | 2020-12-08 | 山东派蒙机电技术有限公司 | Training method for strengthening best action of vehicle execution |
CN112052956B (en) * | 2020-07-16 | 2021-12-17 | 山东派蒙机电技术有限公司 | Training method for strengthening best action of vehicle execution |
CN113281999A (en) * | 2021-04-23 | 2021-08-20 | 南京大学 | Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning |
CN113406955A (en) * | 2021-05-10 | 2021-09-17 | 江苏大学 | Complex network-based automatic driving automobile complex environment model, cognitive system and cognitive method |
CN113406955B (en) * | 2021-05-10 | 2022-06-21 | 江苏大学 | Complex network-based automatic driving automobile complex environment model, cognitive system and cognitive method |
CN113734167A (en) * | 2021-09-10 | 2021-12-03 | 苏州智加科技有限公司 | Vehicle control method, device, terminal and storage medium |
WO2023083113A1 (en) * | 2021-11-10 | 2023-05-19 | International Business Machines Corporation | Reinforcement learning with inductive logic programming |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109948781A (en) | Continuous action online learning control method and system for automatic driving vehicle | |
Kuefler et al. | Imitating driver behavior with generative adversarial networks | |
CN110148296A (en) | A cross-city unified traffic flow prediction method based on deep transfer learning | |
Vanegas et al. | Inverse design of urban procedural models | |
CN102402712B (en) | Robot reinforced learning initialization method based on neural network | |
Shahabi et al. | Application of artificial neural network in prediction of municipal solid waste generation (Case study: Saqqez City in Kurdistan Province) | |
Nasir et al. | A genetic fuzzy system to model pedestrian walking path in a built environment | |
Melo et al. | Learning humanoid robot running skills through proximal policy optimization | |
CN108399744A (en) | Short-term traffic flow forecasting method based on a grey wavelet neural network | |
CN113031528B (en) | Multi-legged robot non-structural ground motion control method based on depth certainty strategy gradient | |
Mao et al. | A comparison of deep reinforcement learning models for isolated traffic signal control | |
Huang et al. | Computational modeling of emotion-motivated decisions for continuous control of mobile robots | |
Chen et al. | NN model-based evolved control by DGM model for practical nonlinear systems | |
CN111645673A (en) | Automatic parking method based on deep reinforcement learning | |
CN116848532A (en) | Attention neural network with short term memory cells | |
CN109656236A (en) | An industrial data fault prediction method based on a recurrent prediction neural network | |
Hu et al. | Heterogeneous crowd simulation using parametric reinforcement learning | |
Deng et al. | Advanced self-improving ramp metering algorithm based on multi-agent deep reinforcement learning | |
CN114239974B (en) | Multi-agent position prediction method and device, electronic equipment and storage medium | |
Na et al. | A novel heuristic artificial neural network model for urban computing | |
AbuZekry et al. | Comparative study of neuro-evolution algorithms in reinforcement learning for self-driving cars | |
CN116382267B (en) | Robot dynamic obstacle avoidance method based on multi-mode pulse neural network | |
CN108470212A (en) | An efficient LSTM design method that can utilize event duration | |
Hart et al. | Towards robust car-following based on deep reinforcement learning | |
Wang et al. | Hybrid neural network modeling for multiple intersections along signalized arterials-current situation and some new results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190628 |