CN109948781A - Continuous action online learning control method and system for automatic driving vehicle - Google Patents

Continuous action online learning control method and system for automatic driving vehicle

Info

Publication number
CN109948781A
CN109948781A (Application CN201910217492.2A)
Authority
CN
China
Prior art keywords
network
moment
evaluator
movement
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910217492.2A
Other languages
Chinese (zh)
Inventor
徐昕
曾宇骏
姚亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201910217492.2A priority Critical patent/CN109948781A/en
Publication of CN109948781A publication Critical patent/CN109948781A/en
Pending legal-status Critical Current

Landscapes

  • Manipulator (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses a continuous action online learning control method and system for an automatic driving vehicle. A perceptual image I_t is encoded to obtain a coded state feature s_t; the coded state feature s_t is input into the critic network and the actor network, both built on cerebellar model neural networks, and an action a_t is output through the actor network. By combining deep neural network feature coding technology with the reinforcement learning principle, the invention solves the learning control problem of continuous action spaces under high-dimensional state input and realizes online learning and control of continuous action spaces under large-scale continuous state input; it guarantees the learning effect while shortening the learning cycle, converges quickly to a control strategy with good performance, and achieves good data utilization.

Description

Continuous action online learning control method and system for automatic driving vehicle
Technical field
The present invention relates to the field of environment perception for automatic driving vehicles, and in particular to a continuous action online learning control method and system for automatic driving vehicles, which combines deep neural network feature coding technology with the reinforcement learning principle to solve the learning control problem of continuous action spaces under high-dimensional state input.
Background art
With the continued development and innovation of artificial intelligence technology and the growth of the automobile industry at home and abroad, intelligent driving vehicles — the product of organically combining intelligent driving technology with the automobile — are increasingly becoming a focus of attention for major automobile enterprises, high-tech companies, universities and research institutions. Under the coordinated operation of the environment perception system, behavior decision system, path planning system and motion control system, an intelligent driving vehicle can effectively monitor its own state and the driver's state, perceive changes and abnormal conditions in the surrounding environment, and assist, prompt or replace the driver in completing part or all of the driving behavior in time. Compared with ordinary vehicles, intelligent driving vehicles have the advantages of fast response, accurate perception, predictable behavior and precise control, and are indispensable components of future intelligent transportation systems. It can be expected that promoting the application of intelligent driving vehicles will effectively alleviate traffic congestion, reduce traffic accidents caused by human error, save energy consumption and driving time, reduce pollutant emissions, and improve driving comfort and freight transportation efficiency, which is of far-reaching significance and important value for the progress of human society.
Intelligent driving vehicles are closely coupled with industries such as robotics, transportation and artificial intelligence, forming a cross-disciplinary field that covers pattern recognition and intelligent systems, control theory, computer science, cognitive psychology and other subject areas. The related key technologies mainly include environment perception, path planning, behavior decision-making and motion control; among these four key technologies, environment perception is the basic premise and motion control is the final, bottom-level foothold.
When the road structure is regular and the traffic driving scene is relatively simple, the related environment perception and motion planning/control technologies have made impressive progress. However, as the complexity of traffic scenes keeps increasing and the demand for intelligent driving under adverse weather and terrain conditions grows, harsher and more challenging requirements — namely human-like perception and human-like driving behavior intelligence — are imposed on the environment perception and behavior/motion control of intelligent driving vehicles. To reach this goal, one of the keys is that the intelligent driving vehicle can efficiently perceive and fuse driving-environment information and carry out autonomous learning through interaction with the environment and with human experience data. For this reason, driven by the rapid development of computer science, researchers at home and abroad have been continuously attempting to use machine learning principles and methods to solve the perception and control problems of intelligent driving vehicles in complex environments.
At present, with the gradual increase in the scale of available data and the increasing maturity of parallel computing hardware, the "big data + deep learning (DL) model" paradigm is gradually replacing the original "feature engineering + traditional learning model" paradigm in academia and industry and has become a current research hotspot. Deep learning is a family of learning algorithms in the machine learning field that attempt to perform multi-layer abstraction of data through a series of nonlinear transformations. Through training on large amounts of sample data, deep learning can obtain feature representations with superior performance that reflect the internal structure of the data and reveal the relationships between variables, and has gradually been applied to environment perception tasks such as pedestrian detection, vehicle detection and traffic-light recognition with remarkable results.
In terms of learning control of intelligent driving vehicles in complex environments, the strong nonlinearity and variability of the vehicle's own dynamics and of the driving environment bring difficulties to conventional decision and control methods in target modeling and parameter tuning. In this regard, researchers have gradually introduced intelligent methods including neural networks, genetic algorithms and reinforcement learning. Among them, reinforcement learning (RL) requires no supervision information, and its characteristic "agent-environment" interactive reward-based learning mechanism enables the target object to learn by itself with minimal manual participation. However, the states encountered in intelligent vehicle driving environments are complex and changeable, and reinforcement learning methods exploring these state spaces inevitably face the curse of dimensionality of large-scale state spaces and the problem of continuous control. It is therefore very necessary to research and explore control learning algorithms oriented to large-scale high-dimensional inputs and continuous outputs. The deep features that deep neural networks learn from images are an effective dimensionality reduction of the complex environment state representation, and deep reinforcement learning (DRL) methods, which combine deep neural networks with the reinforcement learning principle, make it possible for reinforcement learning to handle the learning control problem of continuous action spaces under high-dimensional state input. However, existing DL and DRL methods usually perform parameter optimization based on the gradient descent principle, and often suffer from unavoidable local minima, generalization ability that is difficult to guarantee, and high search and training costs caused by the large amount of optimization computation required. As a result, DL methods for complex-environment perception of intelligent driving vehicles and DRL motion control methods for continuous action spaces under high-dimensional state input have insufficient adaptability and efficiency, which limits further performance improvement; accelerating the learning speed and improving the learning efficiency of these algorithms are therefore urgent problems. Consequently, for the learning control problem of continuous action spaces under high-dimensional state input, how to combine deep neural network feature coding technology with the reinforcement learning principle to realize more efficient and faster online learning control has become a key technical problem to be solved.
The learning control problem of an intelligent driving vehicle under complex environmental conditions can roughly be abstracted as the process of learning an optimal control policy with continuous or discrete actions under large-scale continuous state input. Reinforcement learning is a powerful technical means for solving this problem. However, as task complexity gradually increases, dimensionality increasingly becomes the main obstacle preventing reinforcement learning from obtaining satisfactory results. Usually, using visual images as the state representation for reinforcement learning is the most direct and promising approach: visual images can capture the dynamic environment and system behavior relevant to the learning task, and using them as the state input avoids tedious state feature design and additional sensing hardware. However, when raw images are used as the state input, traditional reinforcement learning algorithms usually diverge because of the excessively high dimensionality.
Although deep reinforcement learning methods use deep neural network models such as convolutional neural networks and recurrent neural networks to realize end-to-end representation and policy learning from raw images, the design of the deep network model and the selection and optimization of the large number of network parameters are often intractable; in addition, a large amount of training data and a huge computing cost are needed to obtain a model with good performance. Deep reinforcement learning algorithms therefore usually depend strongly on parallel computing hardware. Moreover, current deep reinforcement learning methods are mainly based on strongly nonlinear deep networks trained by the back-propagation algorithm: local minima and limited generalization ability are unavoidable in the back-propagation learning process, the information contained in the training samples is not used efficiently, and learning convergence often requires a rather long process. On the one hand, traditional reinforcement learning methods usually use linear function approximators to approximate the control policy in the learning control problem; mature theory proves that they have good learning stability and data validity, but they cannot handle high-dimensional state input. On the other hand, deep neural networks have powerful representation learning ability and can mine and encode, from the raw high-dimensional input, the rich features needed for function approximation in the control policy learning process. Weighing and combining the advantages of the two — using a deep neural network for state feature coding and then using a traditional reinforcement learning algorithm for policy learning and approximation — may be an effective way to solve the learning control problem under large-scale state input spaces.
As a balanced combination of value-function-estimation-based reinforcement learning methods and policy-gradient / policy-search-based methods, the actor-critic method can perform single-step updates and also guarantees good learning performance in continuous action spaces. Among actor-critic methods, the adaptive heuristic critic (AHC) is a representative one. Fig. 1 shows the structure of a reinforcement learning system based on the AHC algorithm: it consists of a critic network and an actor network, which behave as two interconnected but independent neural networks. The input of the critic network includes the external immediate reward and the state feedback from the environment, and its output is the temporal-difference signal, which serves as an internal immediate reward. The input of the actor network includes the state feedback from the environment and the internal immediate reward output by the critic network, and its output is the action applied to the environment. The actor network generates control actions according to the policy; the critic network evaluates the quality of the policy represented by the actor and provides the actor network with an internal reward, without having to wait for the delayed external reward. The critic network aims at learning prediction, with the temporal-difference method usually serving as its learning algorithm, while the learning of the actor network relies on an estimate of the policy gradient.
The final goal of the adaptive heuristic critic algorithm is to approximate the optimal policy π* that maximizes the accumulated return shown in formula (1). The whole learning system is modeled as a Markov decision process represented by the four-tuple {S, A, P, R}, where S denotes the state space, i.e. the set of states the reinforcement learning agent may be in; A denotes the action space, i.e. the set of all actions the agent can take while interacting with the environment; P denotes the state-action transition probability, i.e. the probability distribution over the next states that can be reached after taking a certain action in the current state; and R is the reward function, which measures how good the currently taken action is.
J_π = E_π[ Σ_{t=0}^∞ γ^t·r_t ]   (1)
In formula (1), π is a policy that maps states to probability distributions over the actions of the action space, J_π is the expected total accumulated return that can be obtained by executing actions according to policy π, γ is the discount factor, and r_t is the immediate reward obtained after executing the action at time t. Formula (2) defines the optimal value function V*(s) corresponding to the optimal policy; by the principle of dynamic programming, the optimal value function V*(s) also satisfies the Bellman equation shown in formula (3).
V*(s) = max_π E_π[ Σ_{t=0}^∞ γ^t·r_t | s_0 = s ]   (2)
V*(s) = max_a { r(s, a) + γ·E[V*(s′)] }   (3)
In formulas (2) and (3), V*(s) is the optimal value function corresponding to the optimal policy, E_π[·] denotes the expected accumulated return under policy π, γ is the discount factor, r_t is the immediate reward obtained after executing the action at time t, r(s, a) is the expected immediate reward obtained when taking action a in state s, E[V*(s′)] is the expectation of the optimal value corresponding to the optimal policy in state s′, and V*(s′) is the optimal value corresponding to the optimal policy in state s′.
In the adaptive heuristic critic algorithm, the value function represented by the critic network is approximated using temporal-difference (TD) learning. A linear function approximator with fixed basis functions is used, i.e.
V̂(s) = φ(s)^T·θ_t, φ(s) = (φ_1(s), φ_2(s), ..., φ_n(s))^T,
where V̂(s) is the value function approximated by the linear function approximator with fixed basis functions, φ(s) is the linear function approximator containing n basis functions, and θ_t = (θ_1, θ_2, ..., θ_n)^T is the weight vector at time t.
According to the temporal-difference learning principle, the weight update formula can be derived as follows:
θ_{t+1} = θ_t + α_t·[r_t + γ·V(s_{t+1}) − V(s_t)]·e_t   (4)
In formula (4), θ_{t+1} is the weight vector at time t+1, θ_t is the weight vector at time t, r_t is the immediate reward obtained after executing the action at time t, γ is the discount factor, V(s_{t+1}) is the value function estimate output by the critic network at time t+1, V(s_t) is the value function estimate output by the critic network at time t, and α_t is the learning rate. e_t = [e_{1t}, e_{2t}, ..., e_{nt}]^T is the eligibility trace vector, updated as e_t = γ·λ·e_{t−1} + φ(s_t), where λ is the trace decay (delay) factor with value range (0, 1) and φ(s_t) is the value of the n basis functions of the linear function approximator at state s_t.
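As an illustration of the critic update in formula (4), the following minimal sketch implements one linear TD(λ) step with an accumulating eligibility trace; the learning rate, discount factor and trace decay values are placeholder assumptions, not parameters prescribed by the invention.

```python
import numpy as np

def td_lambda_step(theta, e, phi_t, phi_t1, r_t, alpha=0.1, gamma=0.9, lam=0.6):
    """One linear TD(lambda) critic update in the spirit of formula (4).
    theta : weight vector of the linear value function approximator
    e     : eligibility trace vector, updated as e <- gamma*lam*e + phi(s_t)
    phi_t, phi_t1 : basis feature vectors of states s_t and s_{t+1}
    r_t   : immediate reward of the transition
    """
    e = gamma * lam * e + phi_t                              # accumulate the trace
    delta = r_t + gamma * phi_t1 @ theta - phi_t @ theta     # TD error
    theta = theta + alpha * delta * e                        # formula (4)
    return theta, e
```

Called once per transition, this reproduces the single-step, incremental nature of the TD(λ) critic described above.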
In the actor network, the executed output action is jointly determined by the current state and by the value function estimate output by the critic network at that time, as shown in formula (5).
p(a_t) = N(μ_t, σ_t)   (5)
In formula (5), p(a_t) denotes the probability of selecting the action a_t at time t, i.e. the action is drawn from a normal distribution with mean μ_t and variance σ_t; a_t denotes the action taken at time t and μ_t is the mean of a_t. The mean μ_t is computed by the actor network from the current state using the actor network weights W = (w_1, w_2, ..., w_M)^T, where w_1, w_2, ..., w_M are the actual mapping layer weight values of the actor network. The variance σ_t is given by:
σ_t = b_1 / (1 + exp[b_2·V(s_t)])   (6)
In formula (6), b_1 and b_2 are positive constants and V(s_t) is the value function estimate output by the critic network at time t. At the same time, the policy gradient of the actor (policy) network is estimated approximately by formula (7):
∂J_π/∂w ≈ Δr_t·(a_t − μ_t)/σ_t²·∂μ_t/∂w   (7)
In formula (7), J_π is the expected total accumulated return that can be obtained by executing actions according to policy π, w denotes the actor weights, μ_t is the mean of a_t, a_t is the action taken at time t, and Δr_t is the temporal-difference signal provided by the critic network, Δr_t = r_t + γ·V(s_{t+1}) − V(s_t), where r_t is the immediate reward, γ is the discount factor, V(s_{t+1}) is the value function estimate output by the critic network at time t+1, V(s_t) is the value function estimate output by the critic network at time t, and σ_t is the variance.
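A minimal sketch of the Gaussian exploration policy of formulas (5)–(6) and of the gradient estimate of formula (7), assuming for illustration that the actor mean is a linear function of the basis features and treating σ_t as the standard deviation of the exploration noise; the constants b_1, b_2 and the learning rate are illustrative values.

```python
import numpy as np

def gaussian_actor_step(w, phi_t, v_t, td_error, beta=0.05, b1=1.0, b2=0.5,
                        rng=np.random.default_rng()):
    """Sample an exploratory action from the Gaussian policy and update the
    actor weights along the estimated policy gradient of formula (7).
    w        : actor weight vector (assumed linear actor: mu_t = w . phi_t)
    phi_t    : basis feature vector of state s_t
    v_t      : critic value estimate V(s_t), used by formula (6)
    td_error : temporal-difference signal delta_r_t provided by the critic
    """
    mu_t = w @ phi_t                                     # policy mean
    sigma_t = b1 / (1.0 + np.exp(b2 * v_t))              # formula (6): exploration shrinks as V grows
    a_t = rng.normal(mu_t, sigma_t)                      # stochastic action of formula (5)
    grad = td_error * (a_t - mu_t) / sigma_t**2 * phi_t  # formula (7) estimate
    w = w + beta * grad                                  # reinforce better-than-expected actions
    return w, a_t
```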
The adaptive heuristic critic algorithm is based on the actor-critic structure: the critic network evaluates how good the policy's actions are, estimates the value function corresponding to the current policy, and finally approaches the optimal value function, while the actor continuously approaches the optimal policy, relying for its training and optimization on the internal evaluation signal provided by the critic, i.e. the temporal-difference error. The critic thus plays an important role in the whole learning system: it directly affects the accuracy and convergence speed of the value function approximation and determines the performance of the system as a whole. In the adaptive heuristic critic algorithm, the learning of the critic uses the traditional linear TD(λ) algorithm, which suffers from insufficient data efficiency; moreover, the learning step size must be carefully designed for the particular problem, since a step value that is too large or too small adversely affects algorithm performance, easily causing oscillation, a very long convergence process, or even divergence.
The cerebellum can continuously receive and store over time the relevant information used to control and coordinate muscle behavior, providing accurate coordinated control for the movement of eyes, arms, fingers, legs, wings and other body parts of living organisms. The cerebellar model neural network (also known as the cerebellar model articulation controller, CMAC) is a neural network that imitates the structure and working mechanism of the cerebellum, based on neurophysiology and memory mechanisms. Fig. 2 illustrates the basic CMAC structure, which mainly comprises four parts: the input layer S, the concept mapping layer A, the actual mapping layer W and the output layer Y. The input layer S receives as the model input a vector coming from high-level command arguments or from sensor-perceived state information; the input vector is transformed into the concept mapping space of the concept mapping layer A through a layered, tiled quantization mapping rule, which manifests itself as activating a specific group of memory blocks in the concept mapping layer A. Then, according to the address positions of the activated memory blocks in the concept mapping layer A, the corresponding actual response units stored in the actual mapping layer W and their respective weights are further activated, and finally the output layer Y outputs the weighted sum of the activated units. Compared with other kinds of neural networks (such as the globally approximating BP neural network), the advantages of the cerebellar model neural network are mainly reflected in three aspects: first, the CMAC is based on local learning — in each iteration only a fraction of the weights need to be updated, so learning is fast and the amount of computation is small; second, the CMAC has good nonlinear function approximation and generalization ability and is insensitive to the order of the learning samples; third, the response weights activated by different inputs have a certain sparsity, so the CMAC can handle high-dimensional input problems.
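The following minimal sketch illustrates the tile-coding principle of the CMAC described above: the input is quantized by several offset tilings, the activated cells are hashed into a small physical memory, and the output is the sum of the activated weights. The tiling offsets, flattening rule, hashing rule and memory size here are illustrative assumptions rather than the exact construction of Fig. 2.

```python
import numpy as np

class SimpleCMAC:
    """Minimal CMAC: n_tiling offset tilings, n_p partitions per dimension,
    and a hashed physical memory of k weights; only the activated cells are
    read or updated (local learning)."""

    def __init__(self, n_tiling=4, n_p=7, k=100, low=-1.0, high=1.0):
        self.n_tiling, self.n_p, self.k, self.low = n_tiling, n_p, k, low
        self.interval = (high - low) / n_p          # partition width per dimension
        self.offset = self.interval / n_tiling      # per-tiling offset
        self.w = np.zeros(k)                        # physical memory weights

    def active_cells(self, s):
        """Indices of the physical memory cells activated by input vector s."""
        cells = []
        for t in range(self.n_tiling):
            shift = self.low - t * self.offset
            idx = np.ceil((np.asarray(s) - shift) / self.interval).astype(int)
            code = 0
            for a in idx:                           # flatten the per-dimension tile indices
                code = code * (self.n_p + 1) + int(a)
            cells.append((code + t) % self.k)       # hash into the physical memory
        return cells

    def value(self, s):
        """CMAC output: weighted sum of the activated units."""
        return sum(self.w[c] for c in self.active_cells(s))
```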
Summary of the invention
The technical problem to be solved by the present invention is: in view of the above problems in the prior art, to provide a continuous action online learning control method and system for automatic driving vehicles. The present invention proposes a deep-cerebellar-coding-feature adaptive heuristic critic method: deep feature coding technology is used to solve the dimensionality-reduction coding problem of large-scale continuous state input, and on the basis of the deep coding features a heuristic critic learning method based on the cerebellar model neural network is used. The method realizes efficient and fast online learning control of continuous action spaces under large-scale continuous state input, shortens the learning cycle while guaranteeing the learning effect, converges quickly to a control strategy with good performance, and has good data utilization.
In order to solve the above technical problem, the technical solution adopted by the present invention is as follows.
A continuous action online learning control method for automatic driving vehicles, the implementation steps of which include:
1) obtaining a current perceptual image I_t;
2) encoding the perceptual image I_t through a deep coding network to obtain a coded state feature s_t;
3) inputting the coded state feature s_t into the critic network and the actor network of an actor-critic model respectively, wherein both the critic network and the actor network of the actor-critic model are cerebellar model neural networks;
4) outputting an action a_t through the actor network and updating the parameters of the actor-critic model through the critic network.
Preferably, the deep coding network used in step 2) is an HELM (hierarchical extreme learning machine) network model.
Preferably, the detailed steps of step 4) include:
4.1) inputting the coded state feature s_t into the actor network of the actor-critic model to obtain an output y_t, where the output y_t is the probability, computed by the actor network, of executing each action in the state s_t at time t;
4.2) selecting and outputting the action a_t at time t according to the distribution of the probabilities corresponding to the actions;
4.3) inputting the action a_t at time t into the Markov decision environment model, observing and recording the state s_t at time t and the state s_{t+1} at time t+1, storing the state transition (s_t, s_{t+1}), and computing the reward r_t = r(s_t, s_{t+1}) from time t to time t+1;
4.4) based on the reward r_t = r(s_t, s_{t+1}) from time t to time t+1, updating the activated weight values W_c of the value function of the critic network using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
Preferably, the function expression by which step 4.2) selects the action a_t at time t according to the distribution of the probabilities corresponding to the actions is shown in formula (5):
p(a_t) = N(μ_t, σ_t)   (5)
In formula (5), p(a_t) denotes the probability of selecting the action a_t at time t, a_t denotes the action taken at time t, μ_t denotes the action mean, σ_t denotes the variance, W = (w_1, w_2, ..., w_M)^T is the weight vector of the actor network, where w_1, w_2, ..., w_M are the actual mapping layer weight values of the actor network, and s_t is the state at time t.
Preferably, the function expression of the variance σ_t is shown in formula (6):
σ_t = b_1 / (1 + exp[b_2·V(s_t)])   (6)
In formula (6), b_1 and b_2 are positive constants, V(s_t) is the value function estimate output by the critic network at time t, and s_t is the state at time t.
Preferably, the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10):
w_{t+1} = w_t + β·∂J_π/∂w_t   (10)
In formula (10), w_{t+1} is the activated weight value with which the actor network outputs the action a_{t+1} at time t+1, w_t is the activated weight value with which the actor network outputs the action a_t at time t, β is the learning rate, and J_π is the expected total accumulated return that can be obtained by executing actions according to policy π.
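As a sketch of the local actor update of formula (10), and under the assumption (made here only for illustration) that the action mean equals the sum of the weights activated by the current coded state, the gradient with respect to each activated weight reduces to the gradient with respect to the mean:

```python
def update_actor_weights(w, active_cells, dJ_dmu, beta=0.2):
    """Formula (10) applied CMAC-style: only the actor weights activated by the
    current coded state are moved along the estimated policy gradient.
    dJ_dmu : estimated gradient of J_pi with respect to the action mean,
             e.g. td_error * (a_t - mu_t) / sigma_t**2 as in formula (7)."""
    for c in active_cells:        # local learning: untouched weights stay unchanged
        w[c] += beta * dJ_dmu
    return w
```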
The present invention also provides a continuous action online learning control system for automatic driving vehicles, comprising a computer device, wherein the computer device is programmed to execute the steps of the aforementioned continuous action online learning control method for automatic driving vehicles of the present invention; or a computer program programmed to execute the aforementioned continuous action online learning control method for automatic driving vehicles of the present invention is stored in a storage medium of the computer device.
Compared with the prior art, the present invention has the following advantages. The present invention uses deep feature coding technology to solve the dimensionality-reduction coding problem of large-scale continuous state input: the perceptual image I_t is encoded by the deep coding network to obtain the coded state feature s_t; the coded state feature s_t is input into the critic network and the actor network of the actor-critic model respectively, both of which are cerebellar model neural networks; the action a_t is output through the actor network and the parameters of the actor-critic model are updated through the critic network. Therefore, on the basis of the deep coding features, a heuristic critic learning method based on the cerebellar model articulation controller (CMAC) is used to realize online learning control of continuous action spaces under large-scale continuous state input, which shortens the learning cycle while guaranteeing the learning effect; the learning process converges quickly to a control strategy with good performance and has good data utilization.
Brief description of the drawings
Fig. 1 is a structural schematic diagram of an existing actor-critic model.
Fig. 2 is a structural schematic diagram of an existing cerebellar model neural network.
Fig. 3 is a schematic diagram of the principle of the method of the embodiment of the present invention.
Fig. 4 is a schematic diagram of the HELM network model used in the embodiment of the present invention.
Fig. 5 is a schematic diagram of the Acrobot learning control simulation environment in the embodiment of the present invention.
Fig. 6 shows the training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 7 shows the Acrobot image reconstruction results of the stacked sparse-coding random neural network in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 8 is a comparison curve of Acrobot learning control performance in the embodiment of the present invention.
Fig. 9 shows the Acrobot pendulum angle and torque curves under the final stable policy control in the Acrobot learning control simulation of the embodiment of the present invention.
Fig. 10 is a schematic diagram of the Mountain Car learning control simulation environment in the embodiment of the present invention.
Fig. 11 shows the training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters in the Mountain Car learning control simulation of the embodiment of the present invention.
Fig. 12 shows the Mountain Car image reconstruction results of the stacked sparse-coding random neural network in the Mountain Car learning control simulation of the embodiment of the present invention.
Fig. 13 is a comparison curve of Mountain Car learning control performance in the embodiment of the present invention.
Fig. 14 shows the control effect curves of the stable policy obtained in the Mountain Car learning control simulation of the embodiment of the present invention.
Detailed description of the embodiments
As shown in Fig. 3, the implementation steps of the continuous action online learning control method for automatic driving vehicles of this embodiment include:
1) obtaining the current perceptual image I_t;
2) encoding the perceptual image I_t through a deep coding network to obtain the coded state feature s_t;
3) inputting the coded state feature s_t into the critic network (cerebellar-model-neural-network value function network) and the actor network (cerebellar-model-neural-network policy network) of the actor-critic model respectively, wherein both the critic network and the actor network of the actor-critic model are cerebellar model neural networks;
4) outputting the action a_t through the actor network and updating the parameters of the actor-critic model through the critic network.
As shown in Fig. 4, the deep coding network used in step 2) is an HELM network model.
As shown in Fig. 4, the HELM coding network is composed of a random stacked auto-encoder network feature encoder and a least-squares random neural network classifier/regressor, and is used to convert the high-dimensional raw visual image into a low-dimensional coded feature vector.
The random stacked auto-encoder network feature encoder is composed of multiple sparse random auto-encoders stacked together. The training process of the HELM coding network includes two relatively independent stages: the unsupervised layer-wise feature learning of the random stacked auto-encoder network, and the supervised feature regression learning of the random neural network regressor based on the extreme learning machine. As the number of auto-encoders increases, the coded features become more compressed, more abstract and of higher-level meaning. The learning/optimization process of a single sparse random auto-encoder in the HELM coding network is based on the l1 norm, as shown in formula (8):
β = argmin_β ( ‖H·β − Y‖² + ‖β‖_{l1} )   (8)
In formula (8), H represents the random-projection output of the hidden layer, Y is the input data, which can be the raw visual image I_input or the coded output of the preceding random auto-encoder network, β is the output weight (which can be solved by the existing FISTA algorithm), and l1 denotes the l1 norm, i.e. the sum of the absolute values of the elements of a vector. The output of the i-th sparse random auto-encoder is usually expressed as shown in formula (9):
T_i = f(T_{i−1}·β_i^T)   (9)
In formula (9), T_i is the output of the i-th sparse random auto-encoder, T_0 is the raw visual image I_input, T_{i−1} is the output of the (i−1)-th sparse random auto-encoder, β_i is the output weight of the i-th sparse random auto-encoder, f(·) is the activation function of the hidden nodes, which can be the sigmoid function, the hyperbolic tangent function, etc., and k is the total number of sparse random auto-encoders contained in the random stacked auto-encoder network.
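The following minimal sketch illustrates the stacked sparse random auto-encoder encoding of formulas (8)–(9), with hidden-layer sizes and the regularization weight chosen arbitrarily for illustration; plain ISTA is used in place of the FISTA solver mentioned above to keep the sketch short.

```python
import numpy as np

def ista(H, Y, lam=1e-3, n_iter=50):
    """Approximately solve min_B 0.5*||H B - Y||^2 + lam*||B||_1 (cf. formula (8))."""
    L = np.linalg.norm(H, 2) ** 2                    # Lipschitz constant of the gradient
    B = np.zeros((H.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        G = B - (H.T @ (H @ B - Y)) / L              # gradient step
        B = np.sign(G) * np.maximum(np.abs(G) - lam / L, 0.0)   # soft thresholding
    return B

def helm_encode(X, hidden_sizes=(200, 100), rng=np.random.default_rng(0)):
    """Stacked sparse random auto-encoders of an HELM-style encoder:
    T_0 = X and T_i = f(T_{i-1} @ B_i^T), with each B_i learned per layer (formula (9))."""
    T = X
    for n_hidden in hidden_sizes:
        W = rng.uniform(-1.0, 1.0, (T.shape[1], n_hidden))   # random hidden projection
        H = np.tanh(T @ W)                                    # hidden-layer output H
        B = ista(H, T)                                        # sparse output weights
        T = np.tanh(T @ B.T)                                  # coded features for the next layer
    return T
```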
After the perceptual image I_t has been encoded by the deep coding network to obtain the coded state feature s_t, the coded state feature s_t is input into the critic network and the actor network of the actor-critic model respectively; the action a_t can then be output through the actor network and the parameters of the actor-critic model updated through the critic network, carrying out the corresponding policy evaluation and policy improvement.
In this embodiment, the critic network of the actor-critic model is a cerebellar model neural network. When it is used for policy evaluation, it mainly involves three processes: network initialization, state mapping and coding, and weight activation and learning.
S1. Network initialization.
In the network initialization process, the number of tiling layers N_tiling, the number of partitions N_p, the number of physical memory cells K and the dimension D_s of the network input (i.e. the input vector) need to be set. Then, according to the value range of each component of the input vector and the number of partitions N_p, the N_tiling tiling layers are partitioned at equal intervals with a layer-by-layer offset, so that each tiling layer is divided into N_p tile blocks. The partition interval Δ_interval and the offset Δ_offset are determined by formula (10) and formula (11) respectively.
Δ_interval = (R_max − R_min) / N_p   (10)
Δ_offset = Δ_interval / N_tiling   (11)
In formula (10) and formula (11), R_max is the maximum value in the value range of each component of the input vector, and R_min is the minimum value in the value range of each component of the input vector.
The weights of the critic CMAC neural network are initialized to the K × 1 zero vector.
S2. State mapping and coding.
When state mapping and coding are carried out, each component value of the input vector falls into a certain tile block in each of the different tiling layers; the input vector is then said to activate that tile block in each tiling layer, and the index value a of the activated tile block is determined by formula (12):
a = ceil((s − Δ_offset) / Δ_interval)   (12)
where s is the input vector and ceil(·) is the round-up operation.
Further, the tile blocks activated by the input vector are coding-mapped according to the hashing principle: the concept-mapping memory address A(s) is formed from the indices of the activated tile blocks and hashed into the physical memory as shown in formula (14):
F(s) = A(s) mod K   (14)
where A(s) denotes the concept-mapping memory address of the activated weights of the cerebellar network, a(i) denotes the index value, in each tiling layer, of the tile block activated by the i-th dimensional component of the input vector, with 0 ≤ a(i) ≤ N_p, and F(s) is the actual physical memory address of the weights activated by the input vector.
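The following minimal sketch translates formula (12) and formula (14) for a single input vector; since the patent's expression for composing the conceptual address A(s) from the per-dimension indices a(i) is not reproduced above, the flattening rule used here is an assumption made only for illustration.

```python
import numpy as np

def activated_memory_address(s, delta_offset, delta_interval, n_p, k):
    """Map an input vector s to the physical memory address of its activated weight."""
    a = np.ceil((np.asarray(s) - delta_offset) / delta_interval).astype(int)  # formula (12)
    A = 0
    for ai in a:                       # assumed flattening of the per-dimension indices into A(s)
        A = A * (n_p + 1) + int(ai)
    return A % k                       # formula (14): F(s) = A(s) mod K
```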
S3. Weight activation and learning.
According to the actual physical memory address F(s) of the activated weights, the components activated in the weight vector of the critic CMAC neural network can be determined, and the activated weight components are learned and updated according to the recursive least-squares TD(λ) algorithm (corresponding to step 4.4; details are introduced below). At this point, the initialization, state mapping and coding, and weight activation and learning processes of the CMAC-based critic network are complete.
In this embodiment, the actor network of the actor-critic model is also a cerebellar model neural network. Similar to the critic network, its use for policy improvement mainly involves three processes: network initialization, state mapping and coding, and weight activation and learning. The difference from the critic CMAC neural network lies mainly in the weight learning process: once the activated weights have been determined, the corresponding weight update is based on the gradient descent principle.
In this embodiment, the detailed steps of step 4) include:
4.1) inputting the coded state feature s_t into the actor network of the actor-critic model to obtain an output y_t, where the output y_t is the probability, computed by the actor network, of executing each action in the state s_t at time t;
4.2) selecting and outputting the action a_t at time t according to the distribution of the probabilities corresponding to the actions;
4.3) inputting the action a_t at time t into the Markov decision environment model, observing and recording the state s_t at time t and the state s_{t+1} at time t+1, storing the state transition (s_t, s_{t+1}), and computing the reward r_t = r(s_t, s_{t+1}) from time t to time t+1;
4.4) based on the reward r_t = r(s_t, s_{t+1}) from time t to time t+1, updating the activated weight values W_c of the value function of the critic network using the recursive least-squares TD(λ) algorithm;
4.5) updating the activated weight values of the actor network based on gradient descent.
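The following skeleton, assembled from placeholder callables, sketches how steps 4.1)–4.5) fit together in one online-learning episode; the argument names and interfaces are illustrative assumptions, not interfaces defined by the invention.

```python
def online_learning_episode(reset_env, step_env, encode, sample_action,
                            critic_update, actor_update, max_steps=500):
    """One episode of the online learning loop of steps 4.1)-4.5).
    encode        : deep coding network of step 2), image -> coded state feature
    sample_action : CMAC actor of steps 4.1)-4.2), returns (a_t, mu_t, sigma_t)
    critic_update : RLS-TD(lambda) critic of step 4.4), returns the TD error
    actor_update  : gradient-based actor update of step 4.5)
    """
    s_t = encode(reset_env())                            # steps 1)-2)
    for _ in range(max_steps):
        a_t, mu_t, sigma_t = sample_action(s_t)          # steps 4.1)-4.2)
        image, r_t, done = step_env(a_t)                 # step 4.3): apply a_t, observe reward
        s_t1 = encode(image)
        td_error = critic_update(s_t, s_t1, r_t)         # step 4.4)
        actor_update(s_t, a_t, mu_t, sigma_t, td_error)  # step 4.5)
        s_t = s_t1
        if done:
            break
```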
In this embodiment, the function expression by which step 4.2) selects the action a_t at time t according to the distribution of the probabilities corresponding to the actions is shown in formula (5):
p(a_t) = N(μ_t, σ_t)   (5)
In formula (5), p(a_t) denotes the probability of selecting the action a_t at time t, a_t denotes the action taken at time t, μ_t denotes the action mean, σ_t denotes the variance, W = (w_1, w_2, ..., w_M)^T is the weight vector of the actor network, where w_1, w_2, ..., w_M are the actual mapping layer weight values of the actor network, and s_t is the state at time t.
In this embodiment, the function expression of the variance σ_t is shown in formula (6):
σ_t = b_1 / (1 + exp[b_2·V(s_t)])   (6)
In formula (6), b_1 and b_2 are positive constants, V(s_t) is the value function estimate output by the critic network at time t, and s_t is the state at time t.
In this embodiment, the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is shown in formula (10):
w_{t+1} = w_t + β·∂J_π/∂w_t   (10)
In formula (10), w_{t+1} is the activated weight value with which the actor network outputs the action a_{t+1} at time t+1, w_t is the activated weight value with which the actor network outputs the action a_t at time t, β is the learning rate, and J_π is the expected total accumulated return that can be obtained by executing actions according to policy π.
In this embodiment, for a cerebellar model neural network with N_tiling tiling layers and N_p partition intervals for each input, the memory nominally required is of the order of N_tiling·(N_p)^{D_s} cells (D_s being the dimension of the input vector); the physical memory addresses of the input and its corresponding activated weights are therefore coding-mapped by hashing according to formulas (11) and (12):
F(s) = A(s) mod K   (12)
In formulas (11) and (12), A(s) denotes the concept-mapping memory address of the activated weights of the cerebellar network, a(i) denotes the tile block activated in each tiling layer by the i-th dimensional component of the input vector, 0 ≤ a(i) ≤ N_p, N_p denotes the number of partitions of the state input, D_s is the dimension of the input vector, F(s) is the actual physical memory address corresponding to the input vector s, K is the total number of physical memory cells, and s is the input vector.
In this embodiment, based on the reward r_t = r(s_t, s_{t+1}) from time t to time t+1, step 4.4) updates the activated weight values W_c of the value function of the critic network using the recursive least-squares TD(λ) algorithm. Recursive least-squares TD(λ) is an existing weight update algorithm for critic networks; it has high data utilization, good function approximation and a convergence speed superior to the traditional linear TD(λ) algorithm. Using recursive least-squares TD(λ) for the critic's optimal value function approximation is therefore beneficial for improving the policy evaluation performance of the critic network, which in turn enhances the overall learning control effect of the algorithm.
The recursive least-squares TD(λ) algorithm is given by the following formulas:
K_{t+1} = P_t·e_t / (μ + (φ(s_t) − γ·φ(s_{t+1}))^T·P_t·e_t)   (13)
θ_{t+1} = θ_t + K_{t+1}·(r_t − (φ(s_t) − γ·φ(s_{t+1}))^T·θ_t)   (14)
P_{t+1} = (1/μ)·[P_t − K_{t+1}·(φ(s_t) − γ·φ(s_{t+1}))^T·P_t]   (15)
In formulas (13)–(15), P_0 = δ·I, where δ is a positive real number and I is the identity matrix; θ_t = (θ_1, θ_2, ..., θ_n)^T is the weight vector of the linear function approximator with fixed basis functions, and θ_{t+1} is the value of θ_t after one iteration; e_t is the eligibility trace vector; P_t and P_{t+1} are the iteration matrices at times t and t+1, and K_{t+1} is the gain (iteration) matrix at time t+1; μ is the forgetting factor with value range (0, 1]; φ(s_t) and φ(s_{t+1}) are the values of the linear function approximator basis at states s_t and s_{t+1}; γ is the discount factor; and r_t is the immediate reward at time t.
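The sketch below implements one recursive least-squares TD(λ) step as written above; it follows the standard published form of the algorithm, so the ordering of the operations (rather than the patent's exact symbols) should be read as an assumption.

```python
import numpy as np

def rls_td_lambda_step(theta, P, e, phi_t, phi_t1, r_t, gamma=0.9, lam=0.6, mu=1.0):
    """One recursive least-squares TD(lambda) critic update (cf. formulas (13)-(15)).
    theta : critic weight vector          P : iteration matrix, initialized as delta*I
    e     : eligibility trace vector      mu : forgetting factor in (0, 1]
    """
    e = gamma * lam * e + phi_t                       # eligibility trace
    d = phi_t - gamma * phi_t1                        # feature difference vector
    K = P @ e / (mu + d @ (P @ e))                    # gain vector K_{t+1}
    theta = theta + K * (r_t - d @ theta)             # least-squares weight update
    P = (P - np.outer(K, d @ P)) / mu                 # iteration matrix update
    return theta, P, e
```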
In order to verify the continuous action online learning control method for automatic driving vehicles of this embodiment, simulation experiments and verification analyses were carried out on a computer with an 8-core Intel(R) Xeon(R) E5-2643 CPU (3.30 GHz), 12 GB DDR4 memory, Ubuntu 14.04 and MATLAB 2014a.
1. Acrobot learning control simulation and verification experiment.
The purpose of Acrobot learning control is to swing the Acrobot up to a specified height in the shortest possible time. As a representative adaptive optimal control problem, research on the Acrobot problem began in 1996, and learning control under both discrete and continuous action spaces has been studied accordingly. The dynamic model of the Acrobot learning control system is given by formulas (16)–(22), among which:
φ_1 = φ_11 + φ_12   (19)
φ_11 = (m_1·l_c1 + m_2·l_1)·g·cos(θ_1 − π/2) + φ_2   (20)
φ_2 = m_2·l_c2·g·cos(θ_1 + θ_2 − π/2)   (22)
In formulas (16)–(22), θ_i, θ̇_i and θ̈_i correspond respectively to the angle, angular velocity and angular acceleration of the i-th link, with θ_i ∈ [−π, π]; the variables m_i, l_i, l_ci and I_i are respectively the mass, length, center-of-mass distance and moment of inertia of the i-th link (i = 1, 2); the remaining variables are intermediate variables. In the simulation experiment of this embodiment, m_1 = m_2 = 1 kg, l_1 = l_2 = 1 m, l_c1 = l_c2 = 0.5 m, I_1 = I_2 = 1 kg·m², and the simulation time step is 0.05 s. Referring to Fig. 5, in the Acrobot learning control simulation environment, OP_1 is the first link and P_1P_2 is the second link; the torque is applied at point P_1. When the end P_2 reaches a height exceeding one link length above point O, the Acrobot is considered to be successfully controlled.
For the Acrobot learning control simulation and verification experiment, the relevant parameters of the continuous action online learning control method for automatic driving vehicles of this embodiment are set to λ = 0.6, γ = 0.9, β = 0.2, k_1 = 0.4, k_2 = 0.5. In the actor network and critic network based on the cerebellar model neural network, the number of tiling layers is N_tiling = 4, the number of partitions is N_p = 7, and the physical memory capacities are 100 and 80 respectively. The state is the four-tuple consisting of the two link angles and their angular velocities, and the control torque τ is a continuous quantity in [−3 N, 3 N]. The reward function is given by formula (23), in which s_G denotes the target state. Each trial starts from the state [0, 0, 0, 0] and ends when the link reaches the preset height or the maximum number of control steps is exceeded, at which point the state is reinitialized and the corresponding control trajectory is cleared.
The stacked sparse-coding random neural network used consists of two extreme learning auto-encoders and one extreme learning regressor. The initial weights of the nodes are generated from a random uniform distribution over the interval [−1, 1]. The training dataset contains 12701 grayscale snapshot images of the Acrobot learning control simulation environment, each cropped and scaled to a size of 48 × 48. In order to learn a feature coding network oriented to the control task, the label values used to train the hierarchical extreme learning coding network are the angle information [θ_1, θ_2] provided by the Acrobot learning control simulation system. The activation function of the network nodes is the hyperbolic tangent function. Since the output of the hierarchical extreme learning coding network is derived from a single-frame snapshot image of the learning control simulation environment and thus lacks motion information, the difference between the coding of the current frame image and that of the previous frame image is used as an estimate of the velocity information to supplement the coding output of the hierarchical extreme learning coding network. To determine suitable parameters of the hierarchical extreme learning coding network — i.e. the numbers of hidden nodes and the regularization coefficients of each extreme learning auto-encoder and of the extreme learning regressor — the value of each variable to be determined is varied while the other variable values are kept constant, and the group of parameter values that minimizes the training error is finally chosen. Since the input weights, biases, etc. of the hierarchical extreme learning coding network are generated randomly, the final results are the average of 5 repeated trials.
Fig. 6 shows the training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters, where Fig. 6(a) shows the training root-mean-square error of the extreme learning regressor under different numbers of hidden nodes; Fig. 6(b) shows the training root-mean-square error of the first extreme learning auto-encoder under different numbers of hidden nodes; Fig. 6(c) shows the training root-mean-square error of the second extreme learning auto-encoder under different numbers of hidden nodes; and Fig. 6(d) shows the training root-mean-square error under different regularization coefficient values. Fig. 6 thus shows how the training root-mean-square error varies with the numbers of hidden nodes N_1 and N_2 of the extreme learning auto-encoders, the number of hidden nodes N_3 of the extreme learning regressor, and the regularization coefficient C. Obviously, compared with N_1 and N_2, the training effect is more sensitive to the values of N_3 and C. As N_3 increases, the training root-mean-square error first decreases substantially; when N_3 reaches about 3000, the change flattens out and the error converges at about 0.03. Since no iterative gradient-based updates are needed, the whole training process takes only about 23.6 seconds, which is significantly shorter than the training duration of a deep neural network.
Fig. 7 shows the reconstruction results of the coding network for different raw input image data, where columns (a) and (c) are the raw input images and columns (b) and (d) are the corresponding reconstructed images. It can be seen that the reconstructed images are basically consistent with the original images, and the coding network can encode the meaningful feature information in the original images.
Then, the coding output of the trained hierarchical extreme learning coding network is fed as input to the actor network and critic network based on the cerebellar model neural network for reinforcement learning. The quality of the learning performance is measured by the number of trials required to acquire a stable policy and by the number of steps needed by the learned policy to swing the Acrobot up to the success-state height.
Referring to Fig. 8 and Fig. 9: Fig. 8 gives the learning performance comparison between the proposed algorithm and related typical algorithms (Fast-AHC, SARSA-Q learning). In the test phase, each method was run independently 10 times, the curve of the number of steps required for successful control versus the number of trials was recorded, and the curves in the figure show the averaged results of the 10 runs. When the number of steps required to successfully control the Acrobot no longer changes significantly, the learning algorithm is considered to have basically converged and to have learned a stable control policy. It can be seen from the comparison results that the proposed algorithm and the Fast-AHC method learn a stable policy after about 70 trials on average, better than the SARSA-Q algorithm, which requires 80 trials on average; this benefits from the efficient use of data information by the recursive least-squares TD(λ) algorithm applied to the critic network based on the cerebellar model neural network. At the same time, the proposed algorithm acquires a better policy than the Fast-AHC method: as can be seen from the figure, the stable control policy learned by the proposed algorithm can swing the Acrobot up to the success-state height in 47 steps, while the policy learned by the Fast-AHC method needs 62 steps and the SARSA-Q learning algorithm needs 110 steps. Fig. 9 shows the corresponding variation of the pendulum angle and torque under the control of the stable policy learned by the proposed algorithm.
2. Mountain Car learning control simulation and verification experiment.
Mountain Car learning control is also a typical standard problem for evaluating reinforcement learning algorithms. As shown in Fig. 10, the problem aims to drive a car located in a valley to a specified final position outside the valley with the fewest steps. The driving force of the car is limited; it can only travel back and forth in the valley until it accumulates enough momentum to be driven out and reach the destination.
The dynamic model of the Mountain Car system is given by formula (24), in which Δt = 0.01 s is the time interval, F is the driving force of the car engine (a continuous value) with value range [−0.2 N, 0.2 N], m_c = 0.02 kg is the mass of the car, and g = 9.8 m/s².
For the Mountain Car learning control simulation and verification experiment, in the actor network and critic network of this embodiment based on the cerebellar model neural network, the relevant parameters are set to λ = 0.8, γ = 0.98, β = 0.02, k_1 = 0.4, k_2 = 0.6. In the CMAC-based actor network and critic network, the number of tilings is N_tiling = 3, the number of partitions is N_p = 6, and the physical memory capacities are 50 and 30 respectively. The state is the two-tuple consisting of the car position and velocity. Since the control objective is to make the car drive out of the valley as soon as possible, the reward function is defined by formula (25), in which s_g denotes the target state.
The hierarchical extreme learning coding network model used is the same as in the Acrobot learning control experiment.
The training dataset contains 1000 grayscale snapshot images of the Mountain Car learning control simulation environment, each cropped and scaled to a size of 20 × 60. The car position p_t provided by the Mountain Car simulation environment is used as the label value for training the hierarchical extreme learning coding network model. The initial weights of the network nodes are generated from a random uniform distribution over the interval [−1, 1], and the activation function is the hyperbolic tangent function. As in the Acrobot learning control experiment, the network coding output is supplemented with the coding difference between consecutive frames to add velocity dimensions. The overall average training time is 2.89 s.
The training effect of the hierarchical extreme learning coding network for the Mountain Car learning control problem (see Fig. 11) is basically similar to that for the Acrobot problem; since the state space of the Mountain Car problem is relatively simple, a smaller root-mean-square training error can be obtained after suitable network parameters are chosen. Fig. 11 shows the training root-mean-square error curves of the hierarchical extreme learning coding network under different network parameters, where: Fig. 11(a) shows the training root-mean-square error of the extreme learning regressor under different numbers of hidden nodes; Fig. 11(b) shows the training root-mean-square error of the first extreme learning auto-encoder under different numbers of hidden nodes; Fig. 11(c) shows the training root-mean-square error of the second extreme learning auto-encoder under different numbers of hidden nodes; and Fig. 11(d) shows the training root-mean-square error under different regularization coefficient values. Fig. 12 shows the reconstruction results of the coding network for different raw input image data, where column (a) of Fig. 12 shows the raw input images and column (b) shows the corresponding reconstructed images. Similar to the Acrobot problem, the reconstructed images are basically consistent with the original images, and the coding network can encode the meaningful feature information in the original images.
As shown in Fig. 13, the learning performance of the continuous action online learning control method for automatic driving vehicles of this embodiment is compared with related typical algorithms (Fast-AHC, SARSA-Q learning). Similarly, in the test phase each method was run independently 10 times, the curve of the number of steps required for successful control versus the number of trials was recorded, and the curves in the figure show the averaged results of the 10 runs. Although the policy learned by the Fast-AHC method drives the car out of the valley with the fewest steps (216 steps, versus 226 steps for the proposed algorithm and 255 steps for SARSA-Q learning), the continuous action online learning control method for automatic driving vehicles of this embodiment obtains a stable control strategy more quickly: the proposed algorithm converges to a stable policy with good performance after about 80 trials, while the Fast-AHC method and SARSA-Q learning require about 150 and 240 trials respectively. This again demonstrates the high data utilization and excellent learning ability of the continuous action online learning control method for automatic driving vehicles of this embodiment. Fig. 14 shows the corresponding variation of the car's position, speed and engine driving force under the control of the stable policy learned by the continuous action online learning control method for automatic driving vehicles of this embodiment.
In summary, the continuous-action online learning control method for autonomous vehicles of this embodiment employs a fast heuristic actor-critic reinforcement learning method based on deep coding features and a cerebellar model articulation controller (CMAC) neural network, and applies it to learning control problems. By introducing deep coding features, the curse-of-dimensionality problem in reinforcement learning methods is effectively avoided, and optimal control policy learning based on image (high-dimensional state) input is realized. Simulation results on classical learning control problems show that the continuous-action online learning control method for autonomous vehicles of this embodiment can successfully complete vision-based driving learning control tasks and, compared with related methods at home and abroad, converges faster while learning a comparable or better control policy. In addition, this embodiment also provides a continuous-action online learning control system for autonomous vehicles, comprising computer equipment, wherein the computer equipment is programmed to perform the steps of the aforementioned continuous-action online learning control method for autonomous vehicles of this embodiment; or a computer program programmed to perform the aforementioned continuous-action online learning control method for autonomous vehicles of this embodiment is stored in a storage medium of the computer equipment.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment; all technical solutions falling under the concept of the present invention belong to the protection scope of the present invention. It should be pointed out that, for those of ordinary skill in the art, several improvements and modifications made without departing from the principles of the present invention should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A continuous-action online learning control method for an autonomous vehicle, characterized in that the implementation steps comprise:
1) acquiring a current perceptual image I_t;
2) encoding the perceptual image I_t through a deep coding network to obtain an encoded state feature s_t;
3) inputting the encoded state feature s_t into the critic (evaluator) network and the actor network of an actor-critic model respectively, wherein the critic network and the actor network of the actor-critic model both adopt cerebellar model articulation controller (CMAC) neural networks;
4) outputting an action a_t through the actor network and updating the parameters of the actor-critic model through the critic network.
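For illustration only (not part of the claims), a minimal Python sketch of the perceive–encode–act–update loop of steps 1)–4); the encoder, actor, critic and environment interfaces used here are hypothetical placeholders rather than the claimed implementation:

```python
def control_loop(env, encoder, actor, critic, n_steps=1000):
    """Sketch of the claim-1 loop: encode image, select action, update actor-critic."""
    image = env.reset()                                # step 1): current perceptual image I_t
    for t in range(n_steps):
        s_t = encoder.encode(image)                    # step 2): encoded state feature s_t
        a_t = actor.select_action(s_t)                 # steps 3)-4): actor outputs action a_t
        image, reward, done = env.step(a_t)            # apply the action, observe reward and next image
        s_next = encoder.encode(image)
        td_error = critic.update(s_t, s_next, reward)  # critic updates the value estimate
        actor.update(s_t, a_t, td_error)               # actor parameters adjusted via the critic's signal
        if done:
            image = env.reset()
    return actor, critic
```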
2. The continuous-action online learning control method for an autonomous vehicle according to claim 1, characterized in that the deep coding network used in step 2) is a hierarchical extreme learning machine (HELM) network model.
3. The continuous-action online learning control method for an autonomous vehicle according to claim 1, characterized in that the detailed steps of step 4) comprise:
4.1) inputting the encoded state feature s_t into the actor network of the actor-critic model to obtain an output y_t, wherein the output y_t is the probability, computed by the actor network, of executing each action in the state s_t at time t;
4.2) selecting and outputting the action a_t at time t according to the distribution of the action probabilities;
4.3) inputting the action a_t at time t into the Markov decision environment model, observing and recording the state s_t at time t and the state s_{t+1} at time t+1, storing the state transition (s_t, s_{t+1}), and calculating the return value r_t = r(s_t, s_{t+1}) from time t to time t+1;
4.4) based on the return value r_t = r(s_t, s_{t+1}) from time t to time t+1, updating the activated weight values W_c of the critic network by using the recursive least-squares TD(λ) algorithm for the value function;
4.5) updating the activated weight values of the actor network based on gradient descent.
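As an illustration of step 4.4), a NumPy sketch of a recursive least-squares TD(λ) critic update is given below; the recursion follows the commonly published RLS-TD(λ) form, and the feature function, class interface and hyper-parameter values are assumptions rather than the verbatim claimed procedure:

```python
import numpy as np

class RLSTDCritic:
    """Recursive least-squares TD(lambda) critic over CMAC features phi(s) (illustrative sketch)."""

    def __init__(self, n_features, gamma=0.95, lam=0.6, mu=1.0, delta=1.0):
        self.gamma, self.lam, self.mu = gamma, lam, mu   # discount, trace decay, forgetting factor
        self.W = np.zeros(n_features)                    # activated weight values W_c
        self.P = np.eye(n_features) / delta              # inverse correlation matrix
        self.z = np.zeros(n_features)                    # eligibility trace

    def value(self, phi):
        return float(self.W @ phi)                       # value-function estimate V(s)

    def update(self, phi_t, phi_next, r_t):
        self.z = self.lam * self.z + phi_t                        # trace update
        d = phi_t - self.gamma * phi_next                         # TD feature difference
        k = self.P @ self.z / (self.mu + d @ self.P @ self.z)     # gain vector
        td_error = r_t - d @ self.W                               # temporal-difference residual
        self.W = self.W + k * td_error                            # weight update
        self.P = (self.P - np.outer(k, d @ self.P)) / self.mu     # correlation matrix update
        return td_error
```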
4. The continuous-action online learning control method for an autonomous vehicle according to claim 3, characterized in that the function expression for selecting the action a_t at time t according to the distribution of the action probabilities in step 4.2) is as shown in formula (5);
In formula (5), p(a_t) denotes the probability of selecting the action a_t at time t, a_t denotes the action taken at time t, μ_t denotes the action mean, σ_t denotes the variance, W = (w_1, w_2, ..., w_M)^T is the weight vector of the actor network, wherein w_1, w_2, ..., w_M are the actual mapping-layer weight values of the actor network, and s_t is the state at time t.
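The image of formula (5) is not reproduced in this text, but the variable definitions above are consistent with a Gaussian action-selection distribution centred on the actor output; a sketch under that assumption (function and parameter names are illustrative):

```python
import numpy as np

def select_action(phi_s, W, sigma_t, rng=None):
    """Sample a continuous action from a Gaussian policy (assumed form of formula (5)).

    phi_s   : CMAC feature vector of the state s_t
    W       : actor mapping-layer weights (w_1, ..., w_M)
    sigma_t : exploration width at time t (the text calls sigma_t the variance)
    """
    if rng is None:
        rng = np.random.default_rng()
    mu_t = float(W @ phi_s)                 # action mean produced by the actor network
    a_t = rng.normal(mu_t, sigma_t)         # stochastic exploration around the mean
    return a_t, mu_t
```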
5. The continuous-action online learning control method for an autonomous vehicle according to claim 4, characterized in that the function expression of the variance σ_t is as shown in formula (6);
σ_t = b_1 / {1 + exp[b_2 V(s_t)]}   (6)
In formula (6), b_1 and b_2 are positive constants, V(s_t) is the value-function estimate output by the critic network at time t, and s_t is the state at time t.
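Formula (6) makes the exploration variance shrink as the critic's value estimate V(s_t) grows; a one-function sketch (the values of b_1 and b_2 are illustrative assumptions):

```python
import numpy as np

def exploration_sigma(v_estimate, b1=1.0, b2=1.0):
    """Formula (6): sigma_t = b1 / (1 + exp(b2 * V(s_t))).

    b1 and b2 are positive constants (illustrative values); as the value estimate
    V(s_t) increases, sigma_t decreases, so exploration is reduced as the policy improves."""
    return b1 / (1.0 + np.exp(b2 * v_estimate))
```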
6. The continuous-action online learning control method for an autonomous vehicle according to claim 3, characterized in that the function expression for updating the activated weight values of the actor network based on gradient descent in step 4.5) is as shown in formula (10);
In formula (10), W_{t+1} denotes the activated weight values with which the actor network outputs the action a_{t+1} at time t+1, W_t denotes the activated weight values with which the actor network outputs the action a_t at time t, β is the learning rate, and J_π is the expected overall accumulated return that can be obtained by executing actions according to the policy π.
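For illustration, a sketch of a gradient-style actor weight update in the spirit of formula (10). Since the image of formula (10) is not reproduced here, only the form W_{t+1} = W_t + β·∂J_π/∂W is taken from the stated definitions; the Gaussian policy-gradient estimate used below to make the sketch concrete is an assumption:

```python
import numpy as np

def update_actor_weights(W, phi_s, a_t, mu_t, sigma_t, td_error, beta=0.01):
    """Actor weight update W_{t+1} = W_t + beta * dJ_pi/dW (illustrative sketch).

    The gradient estimate td_error * (a_t - mu_t) / sigma_t**2 * phi_s is the usual
    policy-gradient form for a Gaussian policy and is assumed here, not quoted
    from the claimed formula."""
    grad_J = td_error * (a_t - mu_t) / (sigma_t ** 2) * phi_s
    return W + beta * grad_J
```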
7. A continuous-action online learning control system for an autonomous vehicle, comprising computer equipment, characterized in that: the computer equipment is programmed to perform the steps of the continuous-action online learning control method for an autonomous vehicle according to any one of claims 1 to 6; or a computer program programmed to perform the continuous-action online learning control method for an autonomous vehicle according to any one of claims 1 to 6 is stored in a storage medium of the computer equipment.
CN201910217492.2A 2019-03-21 2019-03-21 Continuous action online learning control method and system for automatic driving vehicle Pending CN109948781A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910217492.2A CN109948781A (en) 2019-03-21 2019-03-21 Continuous action online learning control method and system for automatic driving vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910217492.2A CN109948781A (en) 2019-03-21 2019-03-21 Continuous action online learning control method and system for automatic driving vehicle

Publications (1)

Publication Number Publication Date
CN109948781A true CN109948781A (en) 2019-06-28

Family

ID=67010531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910217492.2A Pending CN109948781A (en) 2019-03-21 2019-03-21 Continuous action online learning control method and system for automatic driving vehicle

Country Status (1)

Country Link
CN (1) CN109948781A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598309A (en) * 2019-09-09 2019-12-20 电子科技大学 Hardware design verification system and method based on reinforcement learning
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110826145A (en) * 2019-09-09 2020-02-21 西安工业大学 Automobile multi-parameter operation condition design method based on heuristic Markov chain evolution
CN110956148A (en) * 2019-12-05 2020-04-03 上海舵敏智能科技有限公司 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium
CN111136659A (en) * 2020-01-15 2020-05-12 南京大学 Mechanical arm action learning method and system based on third person scale imitation learning
CN111222630A (en) * 2020-01-17 2020-06-02 北京工业大学 Autonomous driving rule learning method based on deep reinforcement learning
CN112052956A (en) * 2020-07-16 2020-12-08 山东派蒙机电技术有限公司 Training method for strengthening best action of vehicle execution
CN113281999A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning
CN113406955A (en) * 2021-05-10 2021-09-17 江苏大学 Complex network-based automatic driving automobile complex environment model, cognitive system and cognitive method
CN113734167A (en) * 2021-09-10 2021-12-03 苏州智加科技有限公司 Vehicle control method, device, terminal and storage medium
WO2023083113A1 (en) * 2021-11-10 2023-05-19 International Business Machines Corporation Reinforcement learning with inductive logic programming

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004068399A1 (en) * 2003-01-31 2004-08-12 Matsushita Electric Industrial Co. Ltd. Predictive action decision device and action decision method
CN105690392A (en) * 2016-04-14 2016-06-22 苏州大学 Robot motion control method and device based on actor-critic method
CN107346138A (en) * 2017-06-16 2017-11-14 武汉理工大学 A kind of unmanned boat method for lateral control based on enhancing learning algorithm
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108038545A (en) * 2017-12-06 2018-05-15 湖北工业大学 Fast learning algorithm based on Actor-Critic neutral net continuous controls
US20190035275A1 (en) * 2017-07-28 2019-01-31 Toyota Motor Engineering & Manufacturing North America, Inc. Autonomous operation capability configuration for a vehicle

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2004068399A1 (en) * 2003-01-31 2004-08-12 Matsushita Electric Industrial Co. Ltd. Predictive action decision device and action decision method
CN105690392A (en) * 2016-04-14 2016-06-22 苏州大学 Robot motion control method and device based on actor-critic method
CN107346138A (en) * 2017-06-16 2017-11-14 武汉理工大学 A kind of unmanned boat method for lateral control based on enhancing learning algorithm
US20190035275A1 (en) * 2017-07-28 2019-01-31 Toyota Motor Engineering & Manufacturing North America, Inc. Autonomous operation capability configuration for a vehicle
CN107610692A (en) * 2017-09-22 2018-01-19 杭州电子科技大学 The sound identification method of self-encoding encoder multiple features fusion is stacked based on neutral net
CN108038545A (en) * 2017-12-06 2018-05-15 湖北工业大学 Fast learning algorithm based on Actor-Critic neutral net continuous controls

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YUJUN ZENG ET AL.: "Evolutionary Hierarchical Sparse Extreme Learning Autoencoder Network for Object Recognition", 《SYMMETRY》 *
徐昕: "增强学习及其在移动机器人导航与控制中的应用研究", 《中国优秀博硕士学位论文全文数据库 (博士) 信息科技辑》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110826145A (en) * 2019-09-09 2020-02-21 西安工业大学 Automobile multi-parameter operation condition design method based on heuristic Markov chain evolution
CN110598309A (en) * 2019-09-09 2019-12-20 电子科技大学 Hardware design verification system and method based on reinforcement learning
CN110716562A (en) * 2019-09-25 2020-01-21 南京航空航天大学 Decision-making method for multi-lane driving of unmanned vehicle based on reinforcement learning
CN110956148A (en) * 2019-12-05 2020-04-03 上海舵敏智能科技有限公司 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic device and readable storage medium
CN110956148B (en) * 2019-12-05 2024-01-23 上海舵敏智能科技有限公司 Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium
CN111136659B (en) * 2020-01-15 2022-06-21 南京大学 Mechanical arm action learning method and system based on third person scale imitation learning
CN111136659A (en) * 2020-01-15 2020-05-12 南京大学 Mechanical arm action learning method and system based on third person scale imitation learning
CN111222630A (en) * 2020-01-17 2020-06-02 北京工业大学 Autonomous driving rule learning method based on deep reinforcement learning
CN112052956A (en) * 2020-07-16 2020-12-08 山东派蒙机电技术有限公司 Training method for strengthening best action of vehicle execution
CN112052956B (en) * 2020-07-16 2021-12-17 山东派蒙机电技术有限公司 Training method for strengthening best action of vehicle execution
CN113281999A (en) * 2021-04-23 2021-08-20 南京大学 Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning
CN113406955A (en) * 2021-05-10 2021-09-17 江苏大学 Complex network-based automatic driving automobile complex environment model, cognitive system and cognitive method
CN113406955B (en) * 2021-05-10 2022-06-21 江苏大学 Complex network-based automatic driving automobile complex environment model, cognitive system and cognitive method
CN113734167A (en) * 2021-09-10 2021-12-03 苏州智加科技有限公司 Vehicle control method, device, terminal and storage medium
WO2023083113A1 (en) * 2021-11-10 2023-05-19 International Business Machines Corporation Reinforcement learning with inductive logic programming

Similar Documents

Publication Publication Date Title
CN109948781A (en) Continuous action online learning control method and system for automatic driving vehicle
Kuefler et al. Imitating driver behavior with generative adversarial networks
CN110148296A (en) A kind of trans-city magnitude of traffic flow unified prediction based on depth migration study
Vanegas et al. Inverse design of urban procedural models
CN102402712B (en) Robot reinforced learning initialization method based on neural network
Shahabi et al. Application of artificial neural network in prediction of municipal solid waste generation (Case study: Saqqez City in Kurdistan Province)
Nasir et al. A genetic fuzzy system to model pedestrian walking path in a built environment
Melo et al. Learning humanoid robot running skills through proximal policy optimization
CN108399744A (en) Short-time Traffic Flow Forecasting Methods based on grey wavelet neural network
CN113031528B (en) Multi-legged robot non-structural ground motion control method based on depth certainty strategy gradient
Mao et al. A comparison of deep reinforcement learning models for isolated traffic signal control
Huang et al. Computational modeling of emotion-motivated decisions for continuous control of mobile robots
Chen et al. NN model-based evolved control by DGM model for practical nonlinear systems
CN111645673A (en) Automatic parking method based on deep reinforcement learning
CN116848532A (en) Attention neural network with short term memory cells
CN109656236A (en) A kind of industrial data failure prediction method based on cyclic forecast neural network
Hu et al. Heterogeneous crowd simulation using parametric reinforcement learning
Deng et al. Advanced self-improving ramp metering algorithm based on multi-agent deep reinforcement learning
CN114239974B (en) Multi-agent position prediction method and device, electronic equipment and storage medium
Na et al. A novel heuristic artificial neural network model for urban computing
AbuZekry et al. Comparative study of neuro-evolution algorithms in reinforcement learning for self-driving cars
CN116382267B (en) Robot dynamic obstacle avoidance method based on multi-mode pulse neural network
CN108470212A (en) A kind of efficient LSTM design methods that can utilize incident duration
Hart et al. Towards robust car-following based on deep reinforcement learning
Wang et al. Hybrid neural network modeling for multiple intersections along signalized arterials-current situation and some new results

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190628)