CN109978133A - Reinforcement learning transfer method based on action patterns - Google Patents

Reinforcement learning transfer method based on action patterns Download PDF

Info

Publication number
CN109978133A
CN109978133A (application CN201811646218.9A)
Authority
CN
China
Prior art keywords
transfer
action
task
action pattern
transfer method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811646218.9A
Other languages
Chinese (zh)
Inventor
丁晓静
吴章凯
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Nanjing University
Original Assignee
JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd, Nanjing University filed Critical JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Priority to CN201811646218.9A priority Critical patent/CN109978133A/en
Publication of CN109978133A publication Critical patent/CN109978133A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a new reinforcement learning transfer method, namely transfer based on action patterns, which uses an existing learned model to accelerate the solution of a new, unknown task. The transfer method can be applied to transfer between tasks with different state spaces, so that knowledge from a task with a simple state space helps solve a task with a complex state space. The invention defines action patterns and proposes an action-sequence prediction model to extract this knowledge from the source task. For transferring the action pattern to the target task, two methods are proposed: transfer based on an intrinsic reward mechanism and transfer via a heuristic exploration strategy.

Description

Reinforcement learning transfer method based on action patterns
Technical field
The present invention relates to a new reinforcement learning transfer method: transfer based on action patterns.
Background technique
Artificial intelligence applications in fields such as autonomous driving and service robotics require the ability to make decisions in uncertain, dynamic environments. This is precisely the domain of reinforcement learning, a machine learning method that works through interaction with, and feedback from, the environment. Reinforcement learning is a trial-and-error method and needs a large number of learning samples, which is one reason it is difficult to apply in practice.
Transfer learning is a learning framework that exploits the similarity between tasks to extract and transfer useful knowledge from models that have already been learned, thereby accelerating the solution of new, unknown tasks. There are many types of transfer in reinforcement learning, such as transfer based on model parameters, transfer based on sub-policies, and transfer based on representations, but most existing methods are only applicable to transfer between tasks with identical state-action spaces. The discovery and extraction of higher-level abstract knowledge, together with more flexible transfer methods, still remain to be invented.
In reinforcement learning transfer, the features of the state space are often considered, while the features and structure of the action space are rarely involved. The states of a reinforcement learning task come in many different types, for example small-scale state spaces in which grid positions are encoded numerically, and large-scale state spaces in which each frame of a video game is represented by its pixels. Representing and understanding the state space is an important problem, whereas the action space is usually simpler and the meaning of actions is more fixed; a navigation task, for instance, involves only the four actions east, west, south and north. The knowledge learned from the environment is contained in the combinations of different action sequences. In hierarchical reinforcement learning, once the action space is divided into high-level and low-level actions, the complexity of the problem is greatly reduced. Understanding the structure of the action space, the different combinations of action sequences and the relationships between actions constitutes knowledge at a more abstract, semantic level.
Summary of the invention
Object of the invention: aiming at the action space of reinforcement learning tasks, the present invention proposes a new kind of transferable knowledge, the action pattern, which is applicable not only to transfer between tasks with the same state space but also to transfer between tasks with different state spaces.
Technical solution: the transfer method used by the present invention consists of two parts, extracting action patterns from the source task and transferring the action patterns to the target task. The extraction part includes: (1) training the source task; (2) sampling multiple action sequences from the trained source task; (3) feeding the collected action sequences into a recurrent neural network and training an action-sequence prediction model. There are two strategies for transferring the action pattern to the new, unknown task: (1) the probability that the action-sequence model predicts for the action executed at the current time step is added to the original reward as an intrinsic reward, encouraging the agent to follow the action pattern of the source task when selecting actions; (2) at each time step, an action is sampled from the probability distribution over the next action output by the prediction model and used for exploration in the target task.
Beneficial effects: the notable advantage of the invention is that it exploits the characteristics of the source task's action space to speed up learning on the new, unknown task and to reduce the number of learning samples. Compared with existing transfer methods, it can be used flexibly for transfer between tasks with different state spaces: if the source task is a simple, small-scale state space task and the target task is a complex, large-scale state space task, the simple task can help solve the complex one.
Detailed description of the invention
Fig. 1 is the overall structure diagram of the invention.
Fig. 2 is the transfer flow chart of the invention.
Fig. 3 shows the transfer setting between tasks with the same state space.
Fig. 4 shows the transfer setting between tasks with different state spaces.
Fig. 5 shows the transfer setting from a small-scale to a large-scale state space.
Fig. 6 shows the transfer results between tasks with the same state space.
Fig. 7 shows the transfer results between tasks with different state spaces.
Fig. 8 shows the transfer results from a small-scale to a large-scale state space.
Specific embodiment
The formal definition of an action pattern is given first:
Definition 1. Given an action sequence a_1, a_2, ..., a_T sampled from a policy π of a task T = <S, A, P, r> (a Markov decision process, MDP), the action pattern is defined as the probability distribution over the next action conditioned on the action history, Pr(a_{t+1} | a_t, a_{t-1}, ..., a_1).
Consider the transfer from multiple source tasks T_s^1, T_s^2, ..., T_s^N to a target task T_t, where T_s^i = <S_s^i, A, P_s^i, r_s^i> (i = 1, 2, ..., N) and T_t = <S_t, A, P_t, r_t>. Let π_s^{i*} denote the optimal policy of source task T_s^i and π_t^* the optimal policy of the target task T_t. We assume that the source tasks and the target task share the same action space A, and that their optimal policies π_s^{i*} and π_t^* follow similar action patterns.
As shown in Fig. 1, the method of the invention mainly consists of two parts, the extraction and the transfer of action patterns, which are explained in detail below:
Step 1: select and train multiple source tasks T_s^1, T_s^2, ..., T_s^N, obtaining the optimal policy π_s^{i*} of each source task.
Step 2: on each source task T_s^i, run multiple episodes with the optimal policy π_s^{i*}, generating a set of action sequences {a_1, a_2, ..., a_T}.
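What follows is a minimal sketch of this sampling step, assuming a Gymnasium-style environment and a `policy(state) -> action` callable for each trained source task; the function name, episode count and step limit are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch: sample action sequences from a trained source-task policy.
def sample_action_sequences(env, policy, num_episodes=100, max_steps=500):
    sequences = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        actions = []
        for _ in range(max_steps):
            action = policy(state)          # action chosen by the learned source policy
            state, reward, terminated, truncated, _ = env.step(action)
            actions.append(action)
            if terminated or truncated:
                break
        sequences.append(actions)           # one sequence a_1, ..., a_T per episode
    return sequences
```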
Step 3: train the action-sequence prediction model. A recurrent neural network (RNN) with long short-term memory (LSTM) units is used as the prediction model. The input of the model is the action a_t at the current time step and the output is the probability distribution Pr(a_{t+1}) over the next action. Because of the memory of the hidden layer, the distribution Pr(a_{t+1}) depends not only on the current input a_t but also on the earlier inputs a_{t-1}, ..., a_1. The training objective is to minimize the negative log-likelihood (NLL) over the sampled sequences, as follows:
L(θ) = − Σ_t log Pr(a_{t+1} | a_t, ..., a_1; θ)
The optimization is solved with gradient descent, i.e. backpropagation through time (BPTT).
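A compact sketch of such a prediction model, assuming PyTorch as the framework; the hidden size and other hyper-parameters are illustrative assumptions, not values specified by the patent.

```python
# Illustrative sketch: LSTM action-sequence prediction model trained with the
# negative log-likelihood (cross-entropy) of the next action, optimized by BPTT.
import torch
import torch.nn as nn

class ActionPatternModel(nn.Module):
    def __init__(self, num_actions, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(num_actions, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_actions)

    def forward(self, actions, state=None):
        # actions: LongTensor of shape (batch, seq_len) holding a_1, ..., a_T
        h, state = self.lstm(self.embed(actions), state)
        return self.head(h), state          # logits for Pr(a_{t+1} | a_1, ..., a_t)

def train_step(model, optimizer, actions):
    # predict a_2, ..., a_T from a_1, ..., a_{T-1}; NLL equals cross-entropy on the logits
    logits, _ = model(actions[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), actions[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()                          # gradients flow back through time (BPTT)
    optimizer.step()
    return loss.item()
```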
Step 4-1: action pattern transfer based on an intrinsic reward mechanism. As shown in Fig. 2, at each time step t the action prediction model updates its state with the current action a_t and outputs the probability distribution Pr(a_{t+1}) over the next action (a_{t+1} ∈ A). This transfer method adds an intrinsic reward to the original reward at each time step, forming a new reward function that encourages the action selection to follow the action pattern, as follows:
r'_t = r_t + α (P − 1/N)
where r_t is the original reward at the current time step; r'_t is the new reward; α is a constant that balances the influence of the action pattern on the learning process of the target task and is usually decreased over time; P denotes the probability Pr(a_t) that the action prediction model assigns at the current time step to the executed action a_t; and N is the number of elements of the action set A, so that 1/N is the average probability of each action being selected.
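A sketch of this reward shaping, reusing the ActionPatternModel sketched above; the exact shaping formula r'_t = r_t + α(P − 1/N) is reconstructed from the symbol definitions and should be read as an assumption rather than the patent's verbatim equation.

```python
# Illustrative sketch: add an intrinsic reward proportional to how strongly the
# executed action follows the transferred action pattern.
import torch

def shaped_reward(model, lstm_state, prev_action, action, env_reward, alpha, num_actions):
    # advance the pattern model with the previous action and read Pr(a_t)
    logits, lstm_state = model(torch.tensor([[prev_action]]), lstm_state)
    probs = torch.softmax(logits[0, -1], dim=-1)
    p = probs[action].item()                        # P = Pr(a_t) of the executed action
    intrinsic = alpha * (p - 1.0 / num_actions)     # positive when the action fits the pattern
    return env_reward + intrinsic, lstm_state
```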
Step 4-2: action pattern transfer via a heuristic exploration strategy. As shown in Fig. 2, at each time step t the action prediction model updates its state with the current action a_t and outputs the probability distribution Pr(a_{t+1}) over the next action (a_{t+1} ∈ A). This transfer method samples the next action a_{t+1} directly from that distribution and uses it as the exploration strategy of the learning algorithm, replacing random exploration. The heuristic exploration strategy following the action pattern is as follows:
a_{t+1} = sampled from Pr(a_{t+1} | a_t, ..., a_1) if p < ε, and a*_{t+1} otherwise
where p is a random number drawn from (0, 1); 0 ≤ ε ≤ 1; and a*_{t+1} is the optimal next action. That is, with probability ε the next action is chosen from the prediction distribution Pr(a_{t+1}), and with probability 1 − ε the optimal action is selected. As learning proceeds, the value of ε is gradually decreased until it reaches 0.
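A minimal sketch of this exploration rule, again reusing the ActionPatternModel above; `q_values` (the value estimates that define the greedy action) and the way ε is decayed are assumptions made for illustration.

```python
# Illustrative sketch: with probability epsilon explore along the action pattern,
# otherwise take the greedy (optimal) action of the current value estimates.
import random
import torch

def pattern_guided_action(model, lstm_state, prev_action, q_values, epsilon):
    logits, lstm_state = model(torch.tensor([[prev_action]]), lstm_state)
    probs = torch.softmax(logits[0, -1], dim=-1)
    if random.random() < epsilon:
        action = torch.multinomial(probs, 1).item()          # sample from Pr(a_{t+1})
    else:
        action = max(range(len(q_values)), key=lambda a: q_values[a])  # greedy action
    return action, lstm_state
```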
To verify the validity and flexibility of the above transfer method, the invention designs transfer settings for three different situations and tests them; the results are as follows:
(1) Transfer between tasks with the same state space: this transfer uses the 4-room grid world as the environment. Grid cells represent the states of the environment. In any state the agent can choose one of four actions (up, down, left, right), but the outcome is stochastic: with probability 2/3 it moves in the chosen direction, and with probability 1/3 it moves in one of the other three directions chosen at random (probability 1/9 each). If the move would take the agent into a wall, it stays in place. The goal is placed at a random cell in one room, and the initial state is drawn at random from all states. Reaching the goal gives a reward of 1.0 and ends the episode; otherwise the reward of each step is 0. As shown in Fig. 3, the source task and the target task differ in the position of the goal state; they have the same state space but different reward functions. The results of the transfer method are shown in Fig. 6, with Sarsa(0) as the base algorithm: Base is the result without transfer, HE is the action pattern transfer with heuristic exploration, and IR is the intrinsic-reward transfer (with the constant α set to 0.01, 0.0001 and 0.00001). The left plot shows episode length over time and the right plot shows reward over time. Compared with the algorithm without action pattern transfer, both transfer methods accelerate learning. The heuristic exploration method, however, is clearly better and more stable, and does not require manually tuning parameters to the scale of the task and its rewards, whereas for the intrinsic-reward method the choice of the constant α has a large effect on the transfer.
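For concreteness, a small sketch of the grid-world dynamics described above; the room layout and wall representation are assumptions, and only the transition probabilities and rewards follow the text.

```python
# Illustrative sketch of the 4-room grid-world step: the chosen action succeeds
# with probability 2/3, otherwise one of the other three actions is taken (1/9
# each); bumping into a wall keeps the state; reaching the goal gives reward 1.0.
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def step(state, action_idx, walls, goal):
    if random.random() < 2.0 / 3.0:
        chosen = action_idx                       # intended direction
    else:
        chosen = random.choice([a for a in range(4) if a != action_idx])
    nxt = (state[0] + ACTIONS[chosen][0], state[1] + ACTIONS[chosen][1])
    if nxt in walls:
        nxt = state                               # blocked move: stay in place
    reward = 1.0 if nxt == goal else 0.0
    return nxt, reward, nxt == goal
```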
(2) Transfer between tasks with different state spaces: as shown in Fig. 4, the 4-room task is extended to an 8-room task by duplicating it row-wise or column-wise. The source task is the 4-room state space used in the experiment above, and the target task is the extended 8-room state space task. The results, shown in Fig. 7 with Sarsa(0) as the base algorithm, plot episode length over time for the two different target tasks (row duplication and column duplication). Both 8-room results show that action pattern transfer greatly increases the learning speed on the target task and reduces the number of samples needed for learning.
(3) Transfer from a small-scale state space to a large-scale state space task: in deep reinforcement learning tasks the input states are raw pixels, the state space is very large, and a convolutional neural network is needed to estimate the value function, but learning the key features from images usually requires many samples and a long training time. The transfer method provides a strategy for accelerating learning by exploiting similar action patterns. As shown in Fig. 5, the source task is a one-dimensional maze task and the target task is a shooting scenario in the first-person shooter platform ViZDoom. With DQN as the base algorithm, the results after transferring the action pattern are shown in Fig. 8: the algorithm with action pattern transfer converges faster and needs fewer learning samples.

Claims (5)

1. A reinforcement learning transfer method based on action patterns, characterized in that it defines a new kind of transferable knowledge, the action pattern, expressed as the predicted probability distribution over the next action conditioned on the action history.
2. A reinforcement learning transfer method based on action patterns, characterized in that it proposes an action-sequence prediction model, i.e. a recurrent neural network is used to model the action pattern of claim 1.
3. A reinforcement learning transfer method based on action patterns, characterized in that it proposes a more flexible transfer framework based on the action pattern of claim 1, which can be used for transfer between tasks with different state spaces, i.e. knowledge of a task with a simple state space helps solve a task with a complex state space. The main steps are as follows:
(1) extract knowledge from the source task using the action-sequence prediction model of claim 2;
(2) transfer the extracted knowledge to the target task, the transfer methods being action pattern transfer based on an intrinsic reward mechanism and action pattern transfer via a heuristic exploration strategy.
4. The action pattern transfer method based on an intrinsic reward mechanism according to claim 3, characterized in that the probability that the action-sequence prediction model of claim 2 predicts for the current action is added to the original reward as an intrinsic reward, forming a new reward function.
5. The action pattern transfer method via a heuristic exploration strategy according to claim 3, characterized in that the probability distribution over the next action output by the action-sequence prediction model of claim 2 is used in the exploration strategy, i.e. in the exploration phase of the policy an action is not chosen uniformly at random but is sampled according to the action prediction distribution.
CN201811646218.9A 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns Pending CN109978133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811646218.9A CN109978133A (en) 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811646218.9A CN109978133A (en) 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns

Publications (1)

Publication Number Publication Date
CN109978133A true CN109978133A (en) 2019-07-05

Family

ID=67076455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811646218.9A Pending CN109978133A (en) 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns

Country Status (1)

Country Link
CN (1) CN109978133A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260040A (en) * 2020-05-06 2020-06-09 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Video game decision method based on intrinsic rewards
CN113938397A (en) * 2021-10-13 2022-01-14 苏州龙卷风云科技有限公司 Method and device for predicting SR type flow delay in vehicle-mounted time-sensitive network
CN113938397B (en) * 2021-10-13 2024-02-02 苏州龙卷风云科技有限公司 SR class traffic delay prediction method and device in vehicle-mounted time-sensitive network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20190705