CN109948781A - Continuous action online learning control method and system for automatic driving vehicle - Google Patents
- Publication number: CN109948781A
- Application number: CN201910217492.2A
- Authority
- CN
- China
- Prior art keywords
- network
- moment
- evaluator
- movement
- learning
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Landscapes
- Manipulator (AREA)
- Feedback Control In General (AREA)
Abstract
The invention discloses a continuous-action online learning control method and system for an autonomous vehicle. The method obtains the current perceptual image It and encodes it to obtain a coded state feature st; the coded state feature st is then input into the evaluator (critic) network and the actor network of a cerebellar model neural network respectively, and an action at is output through the actor network. By combining deep-neural-network feature coding with reinforcement learning principles, the invention solves the learning control problem of a continuous action space under high-dimensional state input, and realizes online learning and control over a continuous action space under large-scale continuous state input. The learning effect is guaranteed, the learning period is shortened, the learning process converges quickly to a well-performing control strategy, and data utilization is good.
Description
Technical field
The present invention relates to the field of environment perception for autonomous vehicles, and in particular to a continuous-action online learning control method and system for autonomous vehicles, which combines deep-neural-network feature coding with reinforcement learning principles to solve the learning control problem of a continuous action space under high-dimensional state input.
Background technique
With the development and innovation of artificial intelligence technology and the growth of the automobile industry at home and abroad, intelligent driving vehicles, the product of organically combining intelligent driving technology with the automobile, have increasingly become the focus of major automobile enterprises, high-tech companies, universities, and scientific research institutions. Under the coordination of the environment perception system, behavior decision system, path planning system, and motion control system, an intelligent driving vehicle can effectively monitor its own state and the driver's state, perceive changes and abnormal conditions in the surrounding environment, and assist, prompt, or replace the driver in completing part or all of the driving behavior in a timely manner. Compared with ordinary vehicles, intelligent driving vehicles have the advantages of fast response, accurate perception, predictable behavior, and careful and precise control, and are indispensable units in the future intelligent transportation system. It can be expected that promoting the application of intelligent driving vehicles will effectively alleviate traffic congestion, reduce traffic accidents caused by human error, save energy consumption and driving time, reduce pollutant emissions, and improve driving comfort and freight transportation efficiency, which has far-reaching significance and important value for the progress of human society.
Intelligent driving vehicles are closely coupled with industries such as robotics, transportation, and artificial intelligence; the field is a crossing domain covering multiple disciplines including pattern recognition and intelligent systems, control theory, computer science, and cognitive psychology. The related key technologies are mainly environment perception, path planning, behavior decision, and motion control. Among these four key technologies, environment perception is the basic premise and motion control is the final, bottom-level foothold.
In scenarios where the road structure is relatively fixed and the traffic driving scene is relatively simple, environment perception and motion planning and control technologies have made impressive progress. However, as the complexity of traffic scenes increases and the demand for intelligent driving under adverse weather and terrain conditions grows, harsher and more challenging requirements are placed on the environment perception and behavior and motion control of intelligent driving vehicles, namely human-like perception and human-like driving intelligence. To reach this goal, one of the keys is that the intelligent driving vehicle can efficiently perceive and fuse driving-environment information, and learn autonomously through interaction with the environment and with human experience data. For this reason, driven by the rapid development of computer science, researchers at home and abroad continue to attempt to use machine learning methods to solve the perception and control problems of intelligent driving vehicles in complex environments.
At present, with the gradual growth of available data and the increasing maturity of parallel computing hardware, the "big data + deep learning (DL) model" paradigm is gradually replacing the original "feature engineering + traditional learning model" paradigm in academia and industry and has become a current research hotspot. Deep learning is a family of learning algorithms in machine learning that attempt multilayer abstraction of data through a series of nonlinear transformations. By training on large amounts of sample data, deep learning can learn feature representations with superior performance that reflect the internal structure of the data and reveal the relationships among variables; it has gradually been applied to environment perception tasks such as pedestrian detection, vehicle detection, and traffic light recognition with remarkable results.
In learning control of intelligent driving vehicles in complex environments, the strong nonlinearity and variability of the vehicle's own dynamics and of the driving environment bring difficulties to conventional decision and control methods in target modeling. In this regard, researchers have gradually introduced intelligent methods including neural networks, genetic algorithms, and reinforcement learning. Among them, reinforcement learning (RL) needs no supervision signal, and its distinctive "agent-environment" interactive reward learning mechanism allows the target object to learn by itself with minimal human participation. However, the states under intelligent vehicle driving environments are complex and changeable, and in exploring these state spaces reinforcement learning inevitably faces the curse of dimensionality of large-scale state spaces and the problem of continuous control. Therefore, research on control learning algorithms with large-scale high-dimensional input and continuous output is very necessary. Deep features obtained by deep neural networks from image learning are an effective dimensionality reduction of complex environment states, and deep reinforcement learning (DRL), which combines deep neural networks with reinforcement learning principles, makes it possible for reinforcement learning methods to handle the learning control problem of continuous action spaces under high-dimensional state input. However, existing DL and DRL methods typically perform parameter optimization based on gradient descent, which often suffers from hard-to-avoid local minima, hard-to-guarantee generalization ability, and high training cost caused by the large amount of search and optimization computation. As a result, DL methods for complex-environment perception of intelligent driving vehicles, and DRL motion control methods for continuous action spaces under high-dimensional state input, have insufficient adaptability and efficiency, which limits the further improvement of their performance; accelerating learning speed and improving learning efficiency are therefore urgent problems. Hence, for the learning control problem of continuous action spaces under high-dimensional state input, how to combine deep-neural-network feature coding with reinforcement learning principles to realize more efficient and rapid online learning control has become a key technical problem to be solved.
The learning control problem of intelligent driving vehicles under complex environmental conditions can be roughly abstracted as learning an optimal control policy with continuous or discrete actions under large-scale continuous state input. Reinforcement learning is a powerful technical means for solving this problem. However, as task complexity grows, dimensionality increasingly becomes the main obstacle preventing reinforcement learning from reaching satisfactory results. Generally, using visual images as the state representation for reinforcement learning is the most direct and promising approach. Visual images capture the dynamic environment and system behavior relevant to the learning task, and using them as the state input avoids tedious state feature design and additional sensing hardware. However, when raw images are used as the state input, traditional reinforcement learning algorithms usually diverge during learning because of the excessively high dimensionality.
Although deep reinforcement learning methods use deep neural network models such as convolutional neural networks and recurrent neural networks to realize end-to-end representation and policy learning from raw images, the design of the deep network model and the selection and optimization of its large number of parameters are often intractable; in addition, large amounts of training data and huge computational cost are needed to obtain a well-performing model. Therefore, deep reinforcement learning algorithms usually depend strongly on parallel computing hardware. Moreover, current deep reinforcement learning methods are mainly based on strongly nonlinear deep networks trained by back-propagation. Local minima and limited generalization ability are unavoidable in back-propagation learning, the utilization of the information contained in training samples is not high, and learning convergence often goes through a lengthy process. On the one hand, traditional reinforcement learning methods generally use linear function approximators to learn and approximate the control policy; mature theory proves that they have good learning stability and data validity, but they cannot handle the high-dimensional state input problem. On the other hand, deep neural networks have powerful representation learning ability and can mine and encode, from raw high-dimensional input, the rich features needed for function approximation in the control policy learning process. Weighing and combining the advantages of the two, using a deep neural network for state feature coding and then performing learning approximation of the control policy with a traditional reinforcement learning algorithm may be an effective way to solve the learning control problem under large-scale state input spaces.
As a balance between reinforcement learning methods based on value function estimation and those based on policy gradient and policy search, actor-critic methods can perform single-step updates and also guarantee good learning performance in continuous action spaces. Among them, the adaptive heuristic critic (AHC) is a representative method. Fig. 1 shows the structure of a reinforcement learning system based on the AHC algorithm. The system consists of an evaluator (critic) network and an actor network, which appear as two interconnected but independent neural networks. The input of the evaluator network includes the external immediate reward and the state feedback from the environment; its output, called the temporal-difference signal, serves as the internal immediate reward. The input of the actor network includes the state feedback from the environment and the internal immediate reward output by the evaluator network; its output is the action applied to the environment. The actor network generates control actions according to the policy; the evaluator network evaluates the quality of the policy represented by the actor and provides the actor network with an internal reward, so that the actor need not wait for the delayed external reward. The evaluator network aims to learn to predict: temporal-difference learning usually serves as the evaluator's learning algorithm, while the learning of the actor network depends on an estimate of the policy gradient.
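The actor-critic interaction described above can be sketched in a few lines. This is an illustrative outline, not the patent's implementation: class and variable names are ours, the critic is a linear value estimator, and its temporal-difference error plays the role of the internal reward passed to the actor.

```python
import numpy as np

class AHCAgent:
    """Minimal actor-critic (adaptive heuristic critic) sketch.

    The critic holds weights theta for a linear value estimate
    V(s) = phi(s)^T theta; the actor holds a separate weight vector w.
    The TD error r + gamma*V(s') - V(s) is the critic's output signal
    that trains the actor. All sizes and constants are illustrative.
    """

    def __init__(self, n_features, gamma=0.95):
        self.theta = np.zeros(n_features)  # critic (evaluator) weights
        self.w = np.zeros(n_features)      # actor weights
        self.gamma = gamma

    def value(self, phi):
        # Linear value estimate V(s) = phi(s)^T theta
        return phi @ self.theta

    def td_error(self, r, phi_t, phi_next):
        # Internal reward: delta = r + gamma * V(s') - V(s)
        return r + self.gamma * self.value(phi_next) - self.value(phi_t)
```

Because the TD error is available after every transition, both networks can be updated in single steps rather than waiting for the end of an episode, which is the property the text attributes to actor-critic methods.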
The final goal of the adaptive heuristic critic algorithm is to approach the optimal policy π* that maximizes the accumulated return shown in formula (1). The whole learning system is modeled as a Markov decision process denoted by the four-tuple {S, A, P, R}, where S is the state space, i.e., the set of states the reinforcement learning agent may be in; A is the action space, i.e., the set of all actions the agent can take while interacting with the environment; P is the state-action transition probability, i.e., the probability distribution over next states after taking a certain action in the current state; and R is the reward function, measuring the quality of the action currently taken.
In formula (1), π is a policy mapping states to probability distributions over the actions in the action space, Jπ is the expected overall accumulated return obtained by executing actions according to policy π, γ is the discount factor, and rt is the immediate reward obtained after executing the action at time t. Formula (2) defines the optimal value function V*(s) corresponding to the optimal policy; by the dynamic programming principle, the optimal value function V*(s) also satisfies the Bellman equation shown in formula (3). In formulas (2) and (3), V*(s) is the optimal value function corresponding to the optimal policy, Vπ(s) is the expected accumulated return under policy π, γ is the discount factor, rt is the immediate reward obtained after executing the action at time t, r(s, a) is the expected immediate reward obtained by taking action a in state s, and E[V*(s')] is the expected optimal value of the successor state s'.
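The formula images for (1)-(3) are not reproduced in this text. A reconstruction from the variable definitions above gives the standard forms (the original notation may differ in detail):

```latex
% Formula (1): expected discounted accumulated return and optimal policy
J_\pi = E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right],
\qquad \pi^{*} = \arg\max_{\pi} J_\pi
% Formula (2): optimal value function
V^{*}(s) = \max_{\pi} E\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\middle|\, s_0 = s\right]
% Formula (3): Bellman optimality equation
V^{*}(s) = \max_{a \in A} \left[\, r(s,a) + \gamma\, E\!\left[V^{*}(s')\right] \right]
```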
In the adaptive heuristic critic algorithm, the value function represented by the evaluator network is approximately estimated by temporal-difference (TD) learning, using a linear function approximator over a fixed basis, i.e. V(st) = φ(st)Tθt, where φ(s) = (φ1(s), φ2(s), ..., φn(s))T is the linear function approximator comprising n basis functions and θt = (θ1, θ2, ..., θn)T is the weight vector at time t.
According to the temporal-difference learning principle, the weight update formula can be derived as:
θt+1 = θt + αt[rt + γV(st+1) - V(st)]et (4)
In formula (4), θt+1 is the weight vector at time t+1, θt is the weight vector at time t, rt is the immediate reward obtained after executing the action at time t, γ is the discount factor, V(st+1) is the value estimate output by the evaluator network at time t+1, V(st) is the value estimate output by the evaluator network at time t, and αt is the learning rate. et = [e1t, e2t, ..., ent]T is the eligibility trace vector, updated as et = γλet-1 + φ(st), where λ is the decay factor with value range (0,1) and φ(st) is the value of the n basis functions of the linear function approximator in state st.
In the actor network, the executed action output is jointly determined by the current state and the value estimate output by the evaluator network at that time, as shown in formula (5). In formula (5), p(at) is the probability of selecting the action at of time t, at is the action taken at time t, μt is the mean of at, and σt is the variance. The mean is μt = wTφ(st), where w = (w1, w2, ..., wM)T is the actor network weight vector and w1, w2, ..., wM are the weight values of the actor network's actual mapping layer. The variance σt is given by:
σt = b1/{1+exp[b2V(st)]} (6)
In formula (6), b1 and b2 are positive constants and V(st) is the value estimate output by the evaluator network at time t. Meanwhile, the policy gradient of the actor's policy network is approximately estimated by formula (7). In formula (7), Jπ is the expected overall accumulated return obtainable by executing actions according to policy π, θ is the weight, μt is the mean of at, at is the action taken at time t, and Δrt is the temporal-difference signal provided by the evaluator network, with Δrt = rt + γV(st+1) - V(st), where rt is the immediate reward, γ is the discount factor, V(st+1) and V(st) are the value estimates output by the evaluator network at times t+1 and t, and σt is the variance.
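The Gaussian action selection of formulas (5) and (6) can be sketched as follows. This is an illustrative implementation under our own naming: the mean is the actor's linear output over the feature vector, and the exploration variance shrinks as the critic's value estimate grows; the b1 and b2 values are arbitrary examples.

```python
import numpy as np

def select_action(phi, w, V_s, b1=1.0, b2=0.5, rng=None):
    """Sample a continuous action from the Gaussian policy of formulas
    (5)-(6): mean mu = w^T phi(s), variance sigma = b1/(1+exp(b2*V(s))).
    As the value estimate V(s) increases, sigma decreases, so
    exploration automatically narrows around learned good actions."""
    if rng is None:
        rng = np.random.default_rng()
    mu = w @ phi                              # action mean, formula (5)
    sigma = b1 / (1.0 + np.exp(b2 * V_s))     # variance schedule, formula (6)
    return rng.normal(mu, sigma), mu, sigma
```

The design choice here is the coupling of exploration to the critic: early in learning V(s) is small, sigma is near b1/2 or larger, and actions are spread widely; near convergence the policy becomes nearly deterministic.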
The adaptive heuristic critic algorithm is based on the actor-critic structure. The evaluator network evaluates the quality of policy actions, estimates the value function corresponding to the current policy, and finally approaches the optimal value function. The actor continuously approaches the optimal policy; its training and optimization depend on the immediate evaluation signal, i.e. the temporal-difference error, provided by the evaluator. Clearly, the evaluator plays an important role in the whole learning system: the precision and convergence speed of its value function approximation directly affect and determine the overall system performance. In the adaptive heuristic critic algorithm, the evaluator's learning uses the traditional linear TD(λ) algorithm, which suffers from insufficient data efficiency; moreover, the learning step size must be carefully designed for the particular problem, since too large or too small a step value adversely affects algorithm performance, easily causing oscillation, a lengthy convergence process, or even divergence.
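The traditional linear TD(λ) critic step of formula (4), whose step-size sensitivity is criticized here, can be sketched in a few lines. This is an illustrative implementation with assumed constants (alpha, gamma, lambda); it shows exactly where the hand-tuned step size alpha enters.

```python
import numpy as np

def td_lambda_update(theta, e, phi_t, phi_next, r,
                     alpha=0.1, gamma=0.95, lam=0.7):
    """One linear TD(lambda) step, matching formula (4):
    theta <- theta + alpha * [r + gamma*V(s') - V(s)] * e,
    with accumulating eligibility trace e <- gamma*lam*e + phi(s)."""
    e = gamma * lam * e + phi_t                                # trace update
    delta = r + gamma * (phi_next @ theta) - (phi_t @ theta)   # TD error
    theta = theta + alpha * delta * e                          # weight update
    return theta, e
```

Every sample is used once and discarded, and the fixed alpha must be tuned per problem; these are the two weaknesses that motivate the recursive least-squares variant adopted later in this document.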
The cerebellum continuously receives and stores, over time, the relevant information generated for controlling muscle coordination, providing precise coordinated control for the movements of living organisms' eyes, arms, fingers, legs, wings, and other parts. The cerebellar model neural network (also known as the cerebellar model articulation controller, CMAC) is a neural network that imitates the structure and working mechanism of the cerebellum, based on neurophysiology and memory mechanisms. Fig. 2 illustrates the basic CMAC structure, which mainly comprises four parts: the input layer S, the concept mapping layer A, the actual mapping layer W, and the output layer Y. The input layer S receives, as the model input, the input vector from high-level command arguments or sensor-perceived state information; through a layered, tiled quantization mapping rule, the input vector is transformed into the concept mapping space of layer A, which manifests as the activation of a specific group of internal memory blocks in layer A. Then, according to the address positions of the activated memory blocks in layer A, the corresponding actual response units stored in the actual mapping layer W and their respective weights are further activated, and finally the output layer Y outputs the weighted sum of the activated units. Compared with other kinds of neural networks (such as the globally approximating BP neural network), the advantages of the CMAC are mainly reflected in three aspects: first, CMAC learning is local, so each learning iteration updates only a fraction of the weights, learning is fast, and the amount of computation is small; second, the CMAC has good nonlinear function approximation ability and generalization ability and is insensitive to the order of the learning samples; third, the response weights activated by different inputs are somewhat sparse, so the CMAC can handle high-dimensional input problems.
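The tiled quantization mapping and local update described above can be sketched for a one-dimensional input. This is an illustrative CMAC, not the patent's: the number of tilings, tile counts, offsets, and the learning rate are all assumptions.

```python
import numpy as np

class CMAC:
    """Minimal 1-D CMAC sketch: n_tilings overlapping, offset tilings
    quantize the input (the concept mapping layer A); each tiling
    activates one weight cell (the actual mapping layer W), and the
    output layer sums the activated weights. Updates touch only the
    activated cells, which is the source of CMAC's fast local learning."""

    def __init__(self, n_tilings=4, n_tiles=16, lo=0.0, hi=1.0):
        self.n_tilings, self.n_tiles = n_tilings, n_tiles
        self.lo, self.width = lo, (hi - lo) / n_tiles
        self.w = np.zeros((n_tilings, n_tiles + 1))

    def active(self, x):
        # Each tiling is shifted by a fraction of one tile width.
        return [
            min(int((x - self.lo + t * self.width / self.n_tilings)
                    / self.width), self.n_tiles)
            for t in range(self.n_tilings)
        ]

    def predict(self, x):
        # Output layer Y: weighted sum of the activated units.
        return sum(self.w[t, i] for t, i in enumerate(self.active(x)))

    def update(self, x, target, beta=0.2):
        err = target - self.predict(x)
        for t, i in enumerate(self.active(x)):
            self.w[t, i] += beta * err / self.n_tilings  # local update only
```

Because only n_tilings weights change per step, an update costs O(n_tilings) regardless of the total table size, illustrating the "fast learning, small computation" advantage claimed in the text.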
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above problems in the prior art, to provide a continuous-action online learning control method and system for autonomous vehicles. The present invention proposes an adaptive heuristic critic method with deep cerebellar coding features: deep feature coding is used to solve the dimensionality-reduction coding problem of large-scale continuous state input, and on the basis of the deep coding features a heuristic critic learning method based on the cerebellar model neural network is adopted. This realizes efficient, rapid online learning control of continuous action spaces under large-scale continuous state input, shortens the learning period while guaranteeing the learning effect, allows the learning process to converge quickly to a well-performing control strategy, and has good data utilization.
In order to solve the above technical problem, the technical solution adopted by the present invention is as follows:
A continuous-action online learning control method for autonomous vehicles, whose implementation steps include:
1) obtaining the current perceptual image It;
2) encoding the perceptual image It through a deep coding network to obtain the coded state feature st;
3) inputting the coded state feature st into the evaluator network and the actor network of an actor-critic model respectively, where both the evaluator network and the actor network of the actor-critic model adopt the cerebellar model neural network;
4) outputting the action at through the actor network and updating the actor-critic model parameters through the evaluator network.
Preferably, the deep coding network used in step 2) is the HELM (hierarchical extreme learning machine) network model.
Preferably, the detailed steps of step 4) include:
4.1) inputting the coded state feature st into the actor network of the actor-critic model to obtain the output yt, where the output yt is the probability, computed by the actor network, of executing each action in the state st at time t;
4.2) selecting and outputting the action at of time t according to the distribution of the action probabilities;
4.3) inputting the action at of time t into the Markov decision environment model, observing and recording the state transition (st, st+1) from the state st at time t to the state st+1 at time t+1, and computing the reward rt = r(st, st+1) from time t to time t+1;
4.4) based on the reward rt = r(st, st+1) from time t to time t+1, updating the activated weight values Wc of the value function of the evaluator network using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
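The critic update of step 4.4 can be sketched as follows. This is an illustrative recursive least-squares TD(λ) implementation under our own naming, not the patent's exact algorithm; initialization constants and the episode handling (e.g. trace resets) are assumptions.

```python
import numpy as np

class RLSTD:
    """Recursive least-squares TD(lambda) critic sketch (step 4.4).
    Instead of a gradient step with a hand-tuned learning rate, an
    inverse correlation matrix P is propagated recursively; this removes
    the step-size choice and reuses each sample more efficiently."""

    def __init__(self, n, gamma=0.95, lam=0.7, delta=10.0):
        self.theta = np.zeros(n)     # value function weights
        self.P = delta * np.eye(n)   # inverse correlation estimate
        self.e = np.zeros(n)         # eligibility trace
        self.gamma, self.lam = gamma, lam

    def update(self, phi_t, phi_next, r):
        self.e = self.gamma * self.lam * self.e + phi_t
        d = phi_t - self.gamma * phi_next              # TD feature difference
        k = self.P @ self.e / (1.0 + d @ self.P @ self.e)  # RLS gain
        self.theta = self.theta + k * (r - d @ self.theta) # weight update
        self.P = self.P - np.outer(k, d @ self.P)          # propagate P
```

Compared with the plain TD(λ) rule of formula (4), the gain vector k here adapts per feature and per step, which is the data-efficiency property the invention relies on.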
Preferably, the function expression by which step 4.2) selects the action at of time t according to the distribution of the action probabilities is shown in formula (5). In formula (5), p(at) is the probability of selecting the action at of time t, at is the action taken at time t, μt is the action mean, σt is the variance, μt = wTφ(st), and w = (w1, w2, ..., wM)T is the weight vector of the actor network, where w1, w2, ..., wM are the weight values of the actor network's actual mapping layer and st is the state at time t.
Preferably, the function expression of the variance σt is shown in formula (6):
σt = b1/{1+exp[b2V(st)]} (6)
In formula (6), b1 and b2 are positive constants, V(st) is the value estimate output by the evaluator network at time t, and st is the state at time t.
Preferably, the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10). In formula (10), wt+1 is the activated weight value with which the actor network outputs the action at+1 at time t+1, wt is the activated weight value with which the actor network outputs the action at at time t, β is the learning rate, and Jπ is the expected overall accumulated return obtainable by executing actions according to policy π.
The present invention also provides a continuous-action online learning control system for autonomous vehicles, comprising computer equipment programmed to perform the steps of the aforementioned continuous-action online learning control method for autonomous vehicles, or whose storage medium stores a computer program programmed to perform the aforementioned continuous-action online learning control method for autonomous vehicles.
Compared with the prior art, the present invention has the following advantages: the present invention uses deep feature coding to solve the dimensionality-reduction coding problem of large-scale continuous state input. The perceptual image It is encoded through the deep coding network to obtain the coded state feature st; the coded state feature st is input into the evaluator network and the actor network of the actor-critic model respectively, both of which adopt the cerebellar model neural network; the action at is output through the actor network and the actor-critic model parameters are updated through the evaluator network. Therefore, on the basis of the deep coding features, the heuristic critic learning method based on the cerebellar model articulation controller (CMAC) realizes online learning control of continuous action spaces under large-scale continuous state input, shortens the learning period while guaranteeing the learning effect, allows the learning process to converge quickly to a well-performing control strategy, and has good data utilization.
Brief description of the drawings
Fig. 1 is the structural schematic diagram of the existing actor-critic model.
Fig. 2 is the structural schematic diagram of the existing cerebellar model neural network.
Fig. 3 is the schematic diagram of the method of the embodiment of the present invention.
Fig. 4 is the schematic diagram of the HELM network model used in the embodiment of the present invention.
Fig. 5 is the schematic diagram of the Acrobot learning control simulation environment in the embodiment of the present invention.
Fig. 6 is the training root-mean-square error curve of the hierarchical extreme learning coding network under different network parameters in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 7 is the Acrobot image reconstruction result of the stacked sparse-coding random neural network in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 8 is the Acrobot learning control performance comparison curve in the embodiment of the present invention.
Fig. 9 shows the Acrobot pendulum angle and torque curves under the final stable strategy control in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 10 is the schematic diagram of the Mountain Car learning control simulation environment in the embodiment of the present invention.
Fig. 11 is the training root-mean-square error curve of the hierarchical extreme learning coding network under different network parameters in the Mountain Car learning control simulation of the embodiment of the present invention.
Fig. 12 is the Mountain Car image reconstruction result of the stacked sparse-coding random neural network in the Mountain Car learning control simulation of the embodiment of the present invention.
Fig. 13 is the Mountain Car learning control performance comparison curve in the embodiment of the present invention.
Fig. 14 is the control effect curve of the stable strategy obtained in the Mountain Car learning control simulation of the embodiment of the present invention.
Specific embodiments
As shown in Fig. 3, the implementation steps of the continuous-action online learning control method for autonomous vehicles of this embodiment include:
1) obtaining the current perceptual image It;
2) encoding the perceptual image It through a deep coding network to obtain the coded state feature st;
3) inputting the coded state feature st into the evaluator network (cerebellar model neural network value function network) and the actor network (cerebellar model neural network policy network) of the actor-critic model respectively, where both the evaluator network and the actor network adopt the cerebellar model neural network;
4) outputting the action at through the actor network and updating the actor-critic model parameters through the evaluator network.
As shown in Fig. 4, the depth coding network used in step 2) is a HELM (hierarchical extreme learning machine) network model.
As shown in Fig. 4, the HELM coding network consists of a stacked random autoencoder feature encoder and a least-squares random neural network classifier/regressor, and converts the high-dimensional original visual image into a low-dimensional coding feature vector.
The stacked random autoencoder feature encoder is formed by stacking multiple sparse random autoencoders. The training process of the HELM coding network includes two relatively independent stages: unsupervised layer-wise feature learning of the stacked random autoencoder network, and supervised feature regression learning of the random neural network regressor based on the extreme learning machine. As the number of autoencoders increases, the encoded features become more compact, more abstract, and more semantically high-level. The learning optimization of a single sparse random autoencoder in the HELM coding network is based on the l1 norm, as shown in formula (8);
In formula (8), H is the hidden-layer random projection output; Y is the input data, which may be the original visual image I_input or the coding output of the previous random autoencoder; β is the output weight (which can be solved by the existing FISTA algorithm); and the l1 norm denotes the sum of the absolute values of the elements of a vector. The output of the i-th sparse random autoencoder is usually expressed as shown in formula (9);
In formula (9), T_i is the output of the i-th sparse random autoencoder (with T_1 computed from the original visual image I_input), T_{i-1} is the output of the (i-1)-th autoencoder, β_i is the output weight of the i-th autoencoder, f(·) is the activation function of the hidden nodes, which may be the sigmoid function, the hyperbolic tangent function, etc., and k is the total number of sparse random autoencoders in the stacked random autoencoder network.
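As a sketch, one sparse random autoencoder of formula (8) can be trained by drawing a fixed random hidden projection H and solving the l1-regularized output weight β iteratively. The plain ISTA loop below stands in for the FISTA solver mentioned in the text; the layer sizes, penalty, and iteration count are illustrative.

```python
import numpy as np

def sparse_random_ae_layer(Y, n_hidden, l1=1e-3, n_iter=200, seed=0):
    """One sparse random autoencoder: min_b ||H b - Y||^2 + l1 * |b|_1,
    where H is a fixed random hidden projection of the input Y.
    The encoded output follows the shape of formula (9): T = f(Y b^T)."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, (Y.shape[1], n_hidden))
    bias = rng.uniform(-1.0, 1.0, n_hidden)
    H = np.tanh(Y @ W + bias)                    # hidden random projection
    step = 1.0 / (np.linalg.norm(H, 2) ** 2)     # 1 / Lipschitz constant
    b = np.zeros((n_hidden, Y.shape[1]))
    for _ in range(n_iter):                      # ISTA: gradient + soft-threshold
        z = b - step * (H.T @ (H @ b - Y))
        b = np.sign(z) * np.maximum(np.abs(z) - step * l1, 0.0)
    return b, np.tanh(Y @ b.T)                   # output weight and encoding
```

Stacking k such layers, with each layer's encoding fed to the next, reproduces the unsupervised layer-wise stage described above.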
After the perceptual image I_t is encoded through the depth coding network to obtain the encoded state feature s_t, the encoded state feature s_t is input into the evaluator network and the actor network of the actor-evaluator model respectively; the action a_t can then be output through the actor network and the actor-evaluator model parameters updated through the evaluator network, performing the corresponding policy evaluation and policy improvement.
In this embodiment, the evaluator network of the actor-evaluator model adopts the cerebellar model articulation controller. When used for policy evaluation, it mainly involves three processes: network initialization, state mapping and coding, and weight activation and learning.
S1, network initialization.
In the network initialization process, the number of tiling layers N_tiling, the number of partitions N_p, the number of physical memory cells K, and the network input (i.e. input vector) dimension D_s need to be set. Then, according to the value range of each component of the input vector and the number of partitions N_p, each of the N_tiling tiling layers is divided at equal intervals with a per-layer offset, so that each tiling layer is divided into N_p tiles. The partition interval Δ_interval and the offset Δ_offset are determined by formulas (10) and (11) respectively.
Δ_interval = (R_max - R_min)/N_p (10)
Δ_offset = Δ_interval/N_tiling (11)
In formulas (10) and (11), R_max is the maximum value and R_min the minimum value of the value range of each component of the input vector.
The weights of the evaluator cerebellar model neural network are initialized to a K × 1 zero vector.
S2, state mapping and coding.
During state mapping and coding, each component value of an input vector falls into a certain tile in each tiling layer; the input vector is then said to activate that tile in each tiling layer, and the index a of the activated tile is determined by formula (12):
a = ceil((s - Δ_offset)/Δ_interval) (12)
where s is the input vector and ceil(·) is the round-up operation.
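Formulas (10)-(12) can be illustrated with a small sketch. The per-layer shift `layer * d_offset` is an assumption about how the one-by-one offset of the tiling layers is applied, since the text gives only the base offset Δ_offset.

```python
import math

def tile_layout(r_min, r_max, n_p, n_tiling):
    """Partition interval and offset, formulas (10) and (11)."""
    d_interval = (r_max - r_min) / n_p       # formula (10)
    d_offset = d_interval / n_tiling         # formula (11)
    return d_interval, d_offset

def activated_tile(s, d_interval, d_offset, layer):
    """Index of the tile activated in one tiling layer, formula (12),
    with each layer assumed to be shifted by layer * d_offset."""
    return math.ceil((s - layer * d_offset) / d_interval)
```

With N_p = 4 and N_tiling = 4 on the range [0, 1], for example, the interval is 0.25 and the offset 0.0625, and the same input activates a different tile index in each shifted layer.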
Further, the tiles activated by the input vector are coding-mapped according to the hashing principle.
F(s) = A(s) mod K (14)
where A(s) is the conceptual memory address of the activated cerebellar network weights, a(i) is the index of the tile activated in each tiling layer by the i-th component of the input vector, 0 ≤ a(i) ≤ N_p, and F(s) is the actual physical memory address of the weights activated by the input vector.
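A minimal sketch of the hash mapping F(s) = A(s) mod K of formula (14). The text specifies only the final modulo step; composing the conceptual address A(s) from the per-dimension tile indices a(i) as a base-(N_p + 1) number is an assumption made for illustration.

```python
def hash_address(active_indices, n_p, K):
    """Fold the activated tile indices a(i), 0 <= a(i) <= N_p, into one of
    K physical memory cells: F(s) = A(s) mod K (formula (14))."""
    a_s = 0
    for a_i in active_indices:          # conceptual address A(s), assumed
        a_s = a_s * (n_p + 1) + a_i     # base-(N_p + 1) composition
    return a_s % K                      # physical address F(s)
```

The modulo step is what keeps the physical memory at K cells even though the conceptual tile space grows exponentially with the input dimension.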
S3, weight activation and learning.
From the physical memory address F(s) of the activated weights, the activated components of the weight vector of the evaluator cerebellar model neural network can be determined, and the activated weight components are updated according to the recursive least-squares TD(λ) algorithm (corresponding to step 4.4; details are given below). This completes the initialization, state mapping and coding, and weight activation and learning of the evaluator network based on the cerebellar model neural network.
In this embodiment, the actor network of the actor-evaluator model also adopts the cerebellar model articulation controller. Similar to the evaluator network, when used for policy improvement it mainly involves the three processes of network initialization, state mapping and coding, and weight activation and learning. Its difference from the evaluator cerebellar model neural network lies mainly in the weight learning process: once the activated weights are determined, the corresponding weights are updated based on the gradient descent principle.
In this embodiment, the detailed steps of step 4) include:
4.1) inputting the encoded state feature s_t into the actor network of the actor-evaluator model to obtain the output y_t, where the output y_t is the probability, computed by the actor network, of executing each action in the state s_t at time t;
4.2) selecting and outputting the action a_t at time t according to the distribution of the probabilities of the actions;
4.3) inputting the action a_t at time t into the Markov decision environment model, observing and recording the state transition (s_t, s_{t+1}) from the state s_t at time t to the state s_{t+1} at time t+1, and computing the return value r_t = r(s_t, s_{t+1}) from time t to time t+1;
4.4) based on the return value r_t = r(s_t, s_{t+1}) from time t to time t+1, updating the activated weight values W_c of the evaluator network value function using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
In this embodiment, the function expression by which step 4.2) selects the action a_t at time t according to the distribution of the action probabilities is shown in formula (5);
In formula (5), p(a_t) is the probability of selecting the action a_t at time t, a_t is the action taken at time t, the action mean is as defined in formula (5), σ_t denotes the variance, and W = (w_1, w_2, ..., w_M)^T is the weight of the actor network, where w_1, w_2, ..., w_M are the actual mapping-layer weight values of the actor network and s_t is the state at time t.
In this embodiment, the function expression of the variance σ_t is shown in formula (6);
σ_t = b_1/{1 + exp[b_2 V(s_t)]} (6)
In formula (6), b_1 and b_2 are positive constants, V(s_t) is the value function estimate output by the evaluator network at time t, and s_t is the state at time t.
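Formulas (5)-(6) amount to drawing the action from a Gaussian whose exploration variance shrinks as the evaluator's value estimate V(s_t) grows. A minimal sketch, with b1 and b2 as illustrative constants:

```python
import math
import random

def select_action(action_mean, v_est, b1=3.0, b2=1.0, rng=random):
    """Sample a_t ~ N(mean, sigma_t^2) with state-dependent variance
    sigma_t = b1 / (1 + exp(b2 * V(s_t))), as in formula (6)."""
    sigma_t = b1 / (1.0 + math.exp(b2 * v_est))
    return rng.gauss(action_mean, sigma_t), sigma_t
```

The higher the estimated value of the current state, the smaller sigma_t, so exploration decays automatically as the evaluator becomes confident in its policy evaluation.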
In this embodiment, the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10);
In formula (10), the left-hand side is the activated weight value with which the actor network outputs the action a_{t+1} at time t+1, the first right-hand term is the activated weight value with which the actor network outputs the action a_t at time t, β is the learning rate, and J_π is the expected overall cumulative return obtained by executing actions according to the policy π.
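The actor update of formula (10) is a plain gradient step on the activated weights with learning rate β. A sketch (the ascent sign on the expected return J_π is an assumption; the text states only that the update follows the gradient principle):

```python
def actor_weight_update(w_active, beta, grad_j):
    """w_{t+1} = w_t + beta * dJ_pi/dw for each activated weight component;
    grad_j is the gradient of the expected return J_pi w.r.t. each weight."""
    return [w + beta * g for w, g in zip(w_active, grad_j)]
```

Only the components activated by the current input (as determined in S2/S3 above) are touched; the rest of the K-cell weight memory is left unchanged.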
In this embodiment, for a cerebellar model neural network with N_tiling tiling layers and N_p partition intervals per input, the required physical memory is given by the expression shown (with D_s the dimension of the input vector), and the physical memory addresses of the input and its corresponding activated weights are hash (Hashing) coded according to formulas (11) and (12).
F(s) = A(s) mod K (12)
In formulas (11) and (12), A(s) is the conceptual memory address of the activated cerebellar network weights, a(i) is the index of the tile activated by the i-th component of the input vector, 0 ≤ a(i) ≤ N_p; N_p is the number of state input partitions and D_s is the dimension of the input vector; F(s) is the actual physical memory address corresponding to the input vector s, K is the total number of physical memory cells, and s is the input vector.
In this embodiment, step 4.4) updates the activated weight values W_c of the evaluator network value function using the recursive least-squares TD(λ) algorithm, based on the return value r_t = r(s_t, s_{t+1}) from time t to time t+1. Recursive least-squares TD(λ) is an existing weight update algorithm for evaluator networks; it has high data utilization, good function approximation, and faster convergence than the traditional linear TD(λ) algorithm. Using recursive least-squares TD(λ) to approximate the optimal value function for the evaluator therefore helps improve the policy evaluation performance of the evaluator network and, in turn, the overall learning control effect of the algorithm.
The function formulas of the recursive least-squares TD(λ) algorithm are as follows:
In formulas (13)-(15), P_0 = δI, where δ is a positive real number and I is the identity matrix; θ_t = (θ_1, θ_2, ..., θ_n)^T is the weight vector of the linear function approximator with fixed basis functions, and θ_{t+1} is the value of θ_t after one iterative update; e_t is the eligibility trace vector; P_t and P_{t+1} are the iteration matrices at times t and t+1, and K_{t+1} is the gain matrix at time t+1; μ is the forgetting factor, with value range (0, 1]; φ(s_t) and φ(s_{t+1}) are the fixed-basis function approximator values under the states s_t and s_{t+1}; γ is the discount factor, and r_t is the immediate reward at time t.
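The equation images for formulas (13)-(15) are not reproduced in the source text, so the sketch below follows the standard recursive least-squares TD(λ) recursion (eligibility trace, gain, weight update, and iteration matrix with forgetting factor μ) matching the symbol list given above; treat it as an assumed reconstruction rather than the exact equations of the embodiment.

```python
import numpy as np

def rls_td_step(theta, P, e, phi_t, phi_next, r_t,
                gamma=0.9, lam=0.6, mu=1.0):
    """One assumed RLS-TD(lambda) step: returns updated (theta, P, e)."""
    e = gamma * lam * e + phi_t              # eligibility trace e_t
    d = phi_t - gamma * phi_next             # TD feature difference
    K = P @ e / (mu + d @ P @ e)             # gain K_{t+1}
    theta = theta + K * (r_t - d @ theta)    # weight update theta_{t+1}
    P = (P - np.outer(K, d @ P)) / mu        # iteration matrix P_{t+1}
    return theta, P, e
```

With P_0 = δI and δ large, a single step already moves the predicted value d·θ a large fraction of the way toward the observed reward, which is the source of the high data utilization noted above.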
To verify the continuous action online learning control method for an autonomous vehicle of this embodiment, simulation experiments and analysis were carried out on a computer configured with an 8-core Intel(R) Xeon(R) E5-2643 CPU (3.30 GHz), 12 GB DDR4 memory, Ubuntu 14.04, and MATLAB 2014a.
One, Acrobot learning control simulation verification experiment.
The goal of Acrobot learning control is to swing the Acrobot up to a specified height in the shortest time. As a representative adaptive optimal control problem, research on the Acrobot problem dates back to 1996, and learning control under both discrete and continuous action spaces has been studied accordingly. The dynamic model of the Acrobot learning control system is given by formulas (16)-(22):
φ_1 = φ_11 - φ_12 (19)
φ_11 = (m_1 l_c1 + m_2 l_1) g cos(θ_1 - π/2) + φ_2 (20)
φ_2 = m_2 l_c2 g cos(θ_1 + θ_2 - π/2) (22)
In formulas (16)-(22), θ_i and its first and second time derivatives correspond respectively to the angle, angular velocity, and angular acceleration, with θ_i ∈ [-π, π]. The variables m_i, l_i, l_ci, I_i are respectively the mass, length, center-of-mass length, and moment of inertia of the i-th (i = 1, 2) link; the remaining variables are intermediate quantities. In the simulation experiment of this embodiment, m_1 = m_2 = 1 kg, l_1 = l_2 = 1 m, l_c1 = l_c2 = 0.5 m, I_1 = I_2 = 1 kg·m², and the simulation time step is 0.05 s. Referring to Fig. 5, in the Acrobot learning control simulation environment, OP_1 is the first link and P_1P_2 is the second link. Torque is applied at point P_1. When the end P_2 reaches a height of one link length above point O, the Acrobot is considered to be controlled successfully.
For the Acrobot learning control simulation verification experiment, the parameter settings of the continuous action online learning control method for an autonomous vehicle of this embodiment are λ = 0.6, γ = 0.9, β = 0.2, k_1 = 0.4, k_2 = 0.5. In the actor network and evaluator network based on the cerebellar model articulation controller, the number of tiling layers N_tiling = 4, the number of interval partitions N_p = 7, and the physical memory capacities are 100 and 80 respectively. The state is a four-tuple of the link angles and angular velocities, and the control torque τ is a continuous quantity in [-3 N, 3 N]. The reward function is given as follows:
In formula (23), s_G denotes the goal state. Each trial starts from the state [0, 0, 0, 0] and ends when the link reaches the preset height or the maximum number of control steps is exceeded, at which point the state is reinitialized and the corresponding control trajectory is cleared.
The stacked sparse coding random neural network used consists of two extreme learning autoencoders and an extreme learning regressor. The initial weights of the nodes are generated by a random uniform distribution on the interval [-1, 1]. The training dataset contains 12701 grayscale snapshot images of the Acrobot learning control simulation environment. Each image is cropped and scaled to a size of 48 × 48. In order to learn a feature coding network oriented to the control task, the label values for training the hierarchical extreme learning coding network are the angle information [θ_1, θ_2] provided by the Acrobot learning control simulation system. The activation function of the network nodes is the hyperbolic tangent function. Since the output of the hierarchical extreme learning coding network is derived from a single-frame snapshot image of the learning control simulation environment and thus lacks motion information, the difference between the coding of the current frame image and that of the previous frame is used as an estimate of the velocity information to supplement the coding output of the hierarchical extreme learning coding network. To determine suitable parameters of the hierarchical extreme learning coding network, i.e. the number of hidden nodes and the regularization coefficient of each extreme learning autoencoder and of the extreme learning regressor, different candidate values of each variable are tried while keeping the other variables fixed, and the group of parameter values yielding the smallest training error is finally chosen. Since the input weights, biases, etc. of the hierarchical extreme learning coding network are randomly generated, the final result is the average of 5 trials.
The training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters are shown in Fig. 6, where Fig. 6(a) shows the training root-mean-square error of the extreme learning regressor under different numbers of hidden nodes; Fig. 6(b) shows that of the first extreme learning autoencoder under different numbers of hidden nodes; Fig. 6(c) shows that of the second extreme learning autoencoder under different numbers of hidden nodes; and Fig. 6(d) shows the training root-mean-square error under different regularization coefficient values.
Fig. 6 shows how the training root-mean-square error changes with the numbers of hidden nodes N_1 and N_2 of the extreme learning autoencoders, the number of hidden nodes N_3 of the extreme learning regressor, and the regularization coefficient C. Clearly, compared with N_1 and N_2, the training effect is more sensitive to the values of N_3 and C. As N_3 increases, the training root-mean-square error first decreases substantially, then flattens when N_3 reaches about 3000, converging at about 0.03. Since no iterative gradient updates are needed, the entire training process takes about 23.6 seconds, significantly shorter than the training duration of a deep neural network.
Fig. 7 shows the reconstruction results of the coding network for different original input image data, where columns (a) and (c) are the original input images and columns (b) and (d) are the corresponding reconstructed images. It can be seen that the reconstructed images are largely consistent with the original images, and the coding network can encode the significant feature information in the original images.
Then, the coding output of the trained hierarchical extreme learning coding network is fed as input to the actor network and evaluator network based on the cerebellar model articulation controller for reinforcement learning. Learning performance is measured by the number of trials required to acquire a stable strategy and the number of steps the learned strategy takes to swing the Acrobot up to the success-state height.
Referring to Fig. 8 and Fig. 9, Fig. 8 gives the learning performance comparison between the proposed algorithm and related typical algorithms (Fast-AHC and SARSA-Q learning). In the test phase, each method was run independently 10 times, and the curve of the number of steps required for successful control versus the number of trials was recorded; the curves in the figure show the results averaged over the 10 runs. When the number of control steps required to successfully control the Acrobot no longer changes significantly, the learning algorithm is considered to have basically converged and learned a stable control strategy. The comparison shows that the proposed algorithm and the Fast-AHC method both learn a stable strategy in about 70 trials on average, better than the SARSA-Q algorithm, which needs an average of 80 trials; this should benefit from the efficient use of data information by the recursive least-squares TD(λ) algorithm applied in the evaluator network based on the cerebellar model articulation controller. At the same time, the proposed algorithm acquires a better strategy than the Fast-AHC method: as can be seen from the figure, the stable control strategy learned by the proposed algorithm can swing the Acrobot to the success-state height in 47 steps, while the strategy learned by the Fast-AHC method needs 62 steps and the SARSA-Q learning algorithm needs 110 steps. Fig. 9 shows the corresponding variation of the pendulum angle and torque under the stable control strategy learned by the proposed algorithm.
Two, Mountain Car learning control simulation verification experiment.
Mountain Car learning control is also a typical benchmark problem for evaluating reinforcement learning algorithms. As shown in Fig. 10, the problem aims to drive the car in the valley to a specified goal position outside the valley with the fewest steps. The driving force of the car is limited; it can only travel back and forth in the valley until it accumulates enough momentum to be able to drive out and reach the destination.
The kinetic model of the Mountain Car system is given by:
In formula (24), Δt = 0.01 s is the time interval, F is the driving force of the car engine (a continuous value) with value range [-0.2 N, 0.2 N], m_c = 0.02 kg is the car mass, and g = 9.8 m/s².
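Formula (24) itself appears only as an image in the source; the Euler step below uses the standard continuous mountain-car hill term -g·cos(3p) as an assumed reconstruction consistent with the constants stated above (Δt = 0.01 s, m_c = 0.02 kg, g = 9.8 m/s², F ∈ [-0.2 N, 0.2 N]).

```python
import math

DT, M_C, G = 0.01, 0.02, 9.8          # constants stated in the text

def mountain_car_step(p, v, force):
    """One Euler step of an assumed Mountain Car model (cf. formula (24)):
    position p, velocity v, bounded engine force, -g*cos(3p) hill term."""
    force = max(-0.2, min(0.2, force))            # bounded driving force
    v_next = v + DT * (force / M_C - G * math.cos(3.0 * p))
    p_next = p + DT * v_next
    return p_next, v_next
```

Under these constants, full forward force at the valley bottom barely exceeds the gravity term, which illustrates why the car must rock back and forth to build up momentum.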
For the Mountain Car learning control simulation verification experiment, in the actor network and evaluator network based on the cerebellar model articulation controller of this embodiment, the parameter settings are λ = 0.8, γ = 0.98, β = 0.02, k_1 = 0.4, k_2 = 0.6; the number of tiling layers N_tiling = 3, the number of interval partitions N_p = 6, and the physical memory capacities are 50 and 30 respectively. The state is a two-tuple of car position and velocity. Since the control goal is to drive the car out of the valley as soon as possible, the reward function is defined as follows:
In formula (25), s_g denotes the goal state.
The hierarchical extreme learning coding network model used is consistent with that in the Acrobot learning control experiment.
The training dataset contains 1000 grayscale snapshot images of the Mountain Car learning control simulation environment. Each image is cropped and scaled to a size of 20 × 60. The car position p_t provided by the Mountain Car simulation environment is used as the label value for training the hierarchical extreme learning coding network model. The initial weights of the network nodes are generated by a random uniform distribution on the interval [-1, 1], and the activation function is the hyperbolic tangent function. As in the Acrobot learning control experiment, the network coding output is supplemented with the frame-to-frame coding difference to add a dimension. The overall average training time is 2.89 s.
The training effect of the hierarchical extreme learning coding network for the Mountain Car learning control problem (see Fig. 11) is broadly similar to the Acrobot problem; since the state space of the Mountain Car problem is relatively simple, a small root-mean-square training error can be obtained after choosing suitable network parameters. Fig. 11 shows the training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters, where: Fig. 11(a) is the training root-mean-square error of the extreme learning regressor under different numbers of hidden nodes; Fig. 11(b) is that of the first extreme learning autoencoder under different numbers of hidden nodes; Fig. 11(c) is that of the second extreme learning autoencoder under different numbers of hidden nodes; and Fig. 11(d) is the training root-mean-square error under different regularization coefficient values. Fig. 12 shows the reconstruction results of the coding network for different original input image data, where Fig. 12(a) shows the original input images and Fig. 12(b) the corresponding reconstructed images. Similar to the Acrobot problem, the reconstructed images are largely consistent with the original images, and the coding network can encode the significant feature information in the original images.
As shown in Fig. 13, the learning performance of the continuous action online learning control method for an autonomous vehicle of this embodiment is compared with related typical algorithms (Fast-AHC and SARSA-Q learning). Similarly, in the test phase each method was run independently 10 times, and the curve of the number of steps required for successful control versus the number of trials was recorded; the curves show the results averaged over the 10 runs. Although the strategy learned by the Fast-AHC method can drive the car out of the valley with the fewest steps (216 steps; the proposed algorithm needs 226 steps and SARSA-Q learning needs 255 steps), the continuous action online learning control method for an autonomous vehicle of this embodiment obtains a stable control strategy more quickly. The proposed algorithm converges to a stable strategy of good performance in about 80 trials, while the Fast-AHC method and SARSA-Q learning require 150 and 240 trials respectively. This again demonstrates the high data utilization and excellent learning ability of the continuous action online learning control method for an autonomous vehicle of this embodiment. Fig. 14 shows the corresponding variation of the car's position, speed, and engine driving force under the stable control strategy learned by the continuous action online learning control method for an autonomous vehicle of this embodiment.
In summary, the continuous action online learning control method for an autonomous vehicle of this embodiment uses depth coding features and a fast heuristic actor-evaluator reinforcement learning method based on the cerebellar model articulation controller, applied to learning control problems. By introducing depth coding features, the curse of dimensionality in reinforcement learning is effectively avoided, and optimal control strategy learning based on image (high-dimensional state) input is realized. Simulation results on classic learning control problems show that the continuous action online learning control method for an autonomous vehicle of this embodiment can successfully complete vision-based learning control tasks and, compared with related methods at home and abroad, converges faster while learning a comparable or better control strategy. In addition, this embodiment also provides a continuous action online learning control system for an autonomous vehicle, including computer equipment programmed to perform the steps of the aforementioned continuous action online learning control method for an autonomous vehicle, or having stored on its storage medium a computer program programmed to perform the aforementioned continuous action online learning control method for an autonomous vehicle.
The above is only a preferred embodiment of the present invention; the protection scope of the present invention is not limited to the above embodiment, and all technical solutions falling under the concept of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as within the protection scope of the present invention.
Claims (7)
1. A continuous action online learning control method for an autonomous vehicle, characterized in that the implementation steps include:
1) obtaining the current perceptual image I_t;
2) encoding the perceptual image I_t through a depth coding network to obtain the encoded state feature s_t;
3) inputting the encoded state feature s_t into the evaluator network and the actor network of an actor-evaluator model respectively, the evaluator network and the actor network of the actor-evaluator model both adopting the cerebellar model articulation controller;
4) outputting the action a_t through the actor network and updating the actor-evaluator model parameters through the evaluator network.
2. The continuous action online learning control method for an autonomous vehicle according to claim 1, characterized in that the depth coding network used in step 2) is a HELM network model.
3. The continuous action online learning control method for an autonomous vehicle according to claim 1, characterized in that the detailed steps of step 4) include:
4.1) inputting the encoded state feature s_t into the actor network of the actor-evaluator model to obtain the output y_t, where the output y_t is the probability, computed by the actor network, of executing each action in the state s_t at time t;
4.2) selecting and outputting the action a_t at time t according to the distribution of the probabilities of the actions;
4.3) inputting the action a_t at time t into the Markov decision environment model, observing and recording the state transition (s_t, s_{t+1}) from the state s_t at time t to the state s_{t+1} at time t+1, and computing the return value r_t = r(s_t, s_{t+1}) from time t to time t+1;
4.4) based on the return value r_t = r(s_t, s_{t+1}) from time t to time t+1, updating the activated weight values W_c of the evaluator network value function using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
4. The continuous action online learning control method for an autonomous vehicle according to claim 3, characterized in that the function expression by which step 4.2) selects the action a_t at time t according to the distribution of the action probabilities is shown in formula (5);
In formula (5), p(a_t) is the probability of selecting the action a_t at time t, a_t is the action taken at time t, the action mean is as defined in formula (5), σ_t denotes the variance, and W = (w_1, w_2, ..., w_M)^T is the weight of the actor network, where w_1, w_2, ..., w_M are the actual mapping-layer weight values of the actor network and s_t is the state at time t.
5. The continuous action online learning control method for an autonomous vehicle according to claim 4, characterized in that the function expression of the variance σ_t is shown in formula (6);
σ_t = b_1/{1 + exp[b_2 V(s_t)]} (6)
In formula (6), b_1 and b_2 are positive constants, V(s_t) is the value function estimate output by the evaluator network at time t, and s_t is the state at time t.
6. The continuous action online learning control method for an autonomous vehicle according to claim 3, characterized in that the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10);
In formula (10), the left-hand side is the activated weight value with which the actor network outputs the action a_{t+1} at time t+1, the first right-hand term is the activated weight value with which the actor network outputs the action a_t at time t, β is the learning rate, and J_π is the expected overall cumulative return obtained by executing actions according to the policy π.
7. A continuous action online learning control system for an autonomous vehicle, including computer equipment, characterized in that: the computer equipment is programmed to perform the steps of the continuous action online learning control method for an autonomous vehicle according to any one of claims 1 to 6; or a computer program programmed to perform the continuous action online learning control method for an autonomous vehicle according to any one of claims 1 to 6 is stored in the storage medium of the computer equipment.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910217492.2A CN109948781A (en) | 2019-03-21 | 2019-03-21 | Continuous action online learning control method and system for automatic driving vehicle |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109948781A true CN109948781A (en) | 2019-06-28 |
Family
ID=67010531
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910217492.2A Pending CN109948781A (en) | 2019-03-21 | 2019-03-21 | Continuous action online learning control method and system for automatic driving vehicle |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948781A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004068399A1 (en) * | 2003-01-31 | 2004-08-12 | Matsushita Electric Industrial Co. Ltd. | Predictive action decision device and action decision method |
CN105690392A (en) * | 2016-04-14 | 2016-06-22 | 苏州大学 | Robot motion control method and device based on actor-critic method |
CN107346138A (en) * | 2017-06-16 | 2017-11-14 | 武汉理工大学 | Lateral control method for an unmanned surface vehicle based on a reinforcement learning algorithm |
CN107610692A (en) * | 2017-09-22 | 2018-01-19 | 杭州电子科技大学 | Speech recognition method based on neural-network stacked autoencoder multi-feature fusion |
CN108038545A (en) * | 2017-12-06 | 2018-05-15 | 湖北工业大学 | Fast learning algorithm for continuous control based on Actor-Critic neural networks |
US20190035275A1 (en) * | 2017-07-28 | 2019-01-31 | Toyota Motor Engineering & Manufacturing North America, Inc. | Autonomous operation capability configuration for a vehicle |
Non-Patent Citations (2)
Title |
---|
YUJUN ZENG ET AL.: "Evolutionary Hierarchical Sparse Extreme Learning Autoencoder Network for Object Recognition", 《SYMMETRY》 * |
XU XIN: "Reinforcement Learning and Its Application in Mobile Robot Navigation and Control", 《China Excellent Doctoral and Master's Theses Full-text Database (Doctoral), Information Science and Technology》 *
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110826145A (en) * | 2019-09-09 | 2020-02-21 | 西安工业大学 | Automobile multi-parameter operation condition design method based on heuristic Markov chain evolution |
CN110598309A (en) * | 2019-09-09 | 2019-12-20 | 电子科技大学 | Hardware design verification system and method based on reinforcement learning |
CN110716562A (en) * | 2019-09-25 | 2020-01-21 | 南京航空航天大学 | Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning |
CN110956148A (en) * | 2019-12-05 | 2020-04-03 | 上海舵敏智能科技有限公司 | Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium |
CN110956148B (en) * | 2019-12-05 | 2024-01-23 | 上海舵敏智能科技有限公司 | Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium |
CN111136659B (en) * | 2020-01-15 | 2022-06-21 | 南京大学 | Mechanical arm action learning method and system based on third person scale imitation learning |
CN111136659A (en) * | 2020-01-15 | 2020-05-12 | 南京大学 | Mechanical arm action learning method and system based on third person scale imitation learning |
CN111222630A (en) * | 2020-01-17 | 2020-06-02 | 北京工业大学 | Autonomous driving rule learning method based on deep reinforcement learning |
CN112052956A (en) * | 2020-07-16 | 2020-12-08 | 山东派蒙机电技术有限公司 | Training method for strengthening best action of vehicle execution |
CN112052956B (en) * | 2020-07-16 | 2021-12-17 | 山东派蒙机电技术有限公司 | Training method for strengthening best action of vehicle execution |
CN113281999A (en) * | 2021-04-23 | 2021-08-20 | 南京大学 | Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning |
CN113406955A (en) * | 2021-05-10 | 2021-09-17 | 江苏大学 | Complex network-based automatic driving automobile complex environment model, cognitive system and cognitive method |
CN113406955B (en) * | 2021-05-10 | 2022-06-21 | 江苏大学 | Complex network-based automatic driving automobile complex environment model, cognitive system and cognitive method |
CN113734167A (en) * | 2021-09-10 | 2021-12-03 | 苏州智加科技有限公司 | Vehicle control method, device, terminal and storage medium |
WO2023083113A1 (en) * | 2021-11-10 | 2023-05-19 | International Business Machines Corporation | Reinforcement learning with inductive logic programming |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109948781A (en) | Continuous action online learning control method and system for automatic driving vehicle | |
Kuefler et al. | Imitating driver behavior with generative adversarial networks | |
CN110148296A (en) | A cross-city unified traffic flow prediction method based on deep transfer learning | |
Vanegas et al. | Inverse design of urban procedural models | |
CN102402712B (en) | Robot reinforced learning initialization method based on neural network | |
Shahabi et al. | Application of artificial neural network in prediction of municipal solid waste generation (Case study: Saqqez City in Kurdistan Province) | |
Nasir et al. | A genetic fuzzy system to model pedestrian walking path in a built environment | |
Melo et al. | Learning humanoid robot running skills through proximal policy optimization | |
CN108399744A (en) | Short-term traffic flow forecasting method based on a grey wavelet neural network | |
CN113031528B (en) | Multi-legged robot non-structural ground motion control method based on depth certainty strategy gradient | |
Mao et al. | A comparison of deep reinforcement learning models for isolated traffic signal control | |
Huang et al. | Computational modeling of emotion-motivated decisions for continuous control of mobile robots | |
Chen et al. | NN model-based evolved control by DGM model for practical nonlinear systems | |
CN111645673A (en) | Automatic parking method based on deep reinforcement learning | |
CN116848532A (en) | Attention neural network with short term memory cells | |
CN109656236A (en) | An industrial data fault prediction method based on a recurrent prediction neural network | |
Hu et al. | Heterogeneous crowd simulation using parametric reinforcement learning | |
Deng et al. | Advanced self-improving ramp metering algorithm based on multi-agent deep reinforcement learning | |
CN114239974B (en) | Multi-agent position prediction method and device, electronic equipment and storage medium | |
Na et al. | A novel heuristic artificial neural network model for urban computing | |
AbuZekry et al. | Comparative study of neuro-evolution algorithms in reinforcement learning for self-driving cars | |
CN116382267B (en) | Robot dynamic obstacle avoidance method based on multi-mode pulse neural network | |
CN108470212A (en) | An efficient LSTM design method that can utilize event duration | |
Hart et al. | Towards robust car-following based on deep reinforcement learning | |
Wang et al. | Hybrid neural network modeling for multiple intersections along signalized arterials-current situation and some new results |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190628 |