CN109978133A - Reinforcement learning transfer method based on action patterns - Google Patents
Reinforcement learning transfer method based on action patterns
- Publication number
- CN109978133A CN109978133A CN201811646218.9A CN201811646218A CN109978133A CN 109978133 A CN109978133 A CN 109978133A CN 201811646218 A CN201811646218 A CN 201811646218A CN 109978133 A CN109978133 A CN 109978133A
- Authority
- CN
- China
- Prior art keywords
- transfer
- action
- task
- action pattern
- transfer method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 230000009471 action Effects 0.000 title claims abstract description 53
- 238000000034 method Methods 0.000 title claims abstract description 36
- 238000013508 migration Methods 0.000 claims abstract description 42
- 230000005012 migration Effects 0.000 claims abstract description 41
- 230000007246 mechanism Effects 0.000 claims abstract description 5
- 230000033001 locomotion Effects 0.000 claims description 36
- 230000006870 function Effects 0.000 claims description 5
- 238000000605 extraction Methods 0.000 claims description 4
- 230000000306 recurrent effect Effects 0.000 claims description 3
- 238000003062 neural network model Methods 0.000 claims 1
- 239000000284 extract Substances 0.000 abstract description 2
- 238000012549 training Methods 0.000 description 5
- 230000008569 process Effects 0.000 description 4
- 230000008859 change Effects 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 230000005284 excitation Effects 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000009412 basement excavation Methods 0.000 description 1
- 238000012512 characterization method Methods 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 238000013527 convolutional neural network Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000011478 gradient descent method Methods 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 230000015654 memory Effects 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 230000002787 reinforcement Effects 0.000 description 1
- 230000006403 short-term memory Effects 0.000 description 1
- 230000006641 stabilisation Effects 0.000 description 1
- 238000011105 stabilization Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a new reinforcement learning transfer method, namely transfer based on action patterns, which uses existing models to accelerate the solution of new, unknown tasks. The transfer method can be used between tasks with different state spaces, so that knowledge from a task with a simple state space helps solve a task with a complex state space. The invention defines action patterns and proposes an action sequence prediction model that extracts this knowledge from the source task. For transferring the action pattern to the target task, two methods are proposed: transfer based on an intrinsic reward mechanism, and transfer by a heuristic exploration strategy.
Description
Technical field
The present invention relates to a new reinforcement learning transfer method: transfer based on action patterns.
Background art
The application of artificial intelligence in fields such as autonomous driving and service robots requires the ability to make decisions in uncertain, dynamic environments. This is the domain of reinforcement learning, a machine learning method based on interaction with, and feedback from, the environment. Reinforcement learning is a trial-and-error method and needs a large number of learning samples, which is one of the reasons it is difficult to apply in practice.
Transfer learning is a learning framework that uses the similarity between tasks to extract effective knowledge from already-learned models and transfer it, accelerating the solution of new, unknown tasks. There are many types of transfer in reinforcement learning, such as transfer based on model parameters, transfer based on sub-policies, and transfer based on representations, but most existing methods apply only to transfer between tasks with identical state-action spaces. The discovery and extraction of higher-level abstract knowledge, and more flexible transfer methods, remain to be invented.
In reinforcement learning transfer, it is usually features of the state space that are considered, while the features and structure of the action space are rarely involved. The states of reinforcement learning tasks come in many different types, such as small state spaces where grid positions are encoded numerically, and large state spaces where every frame of a video game is represented by its pixels. The representation and understanding of the state space is an important problem, whereas the action space is often simpler and the meaning of actions more fixed; a navigation task, for example, involves only the four actions east, west, south, and north. The knowledge learned from the environment is contained in the combinations of different action sequences. In hierarchical reinforcement learning, once the action space is divided into high-level and low-level actions, the complexity of the problem is greatly reduced. Understanding the structure of the action space, the different combinations of action sequences, and the relationships between actions constitutes a more abstract, semantic level of knowledge.
Summary of the invention
Purpose of the invention: aiming at the action space of reinforcement learning tasks, the present invention proposes a new kind of transferable knowledge, the action pattern, which applies not only to transfer between tasks with the same state space but also to transfer between tasks with different state spaces.
Technical solution: the transfer method used by the present invention consists of two parts, namely extracting the action pattern from the source task and transferring the action pattern to the target task. The extraction of the action pattern includes: (1) training the source task; (2) sampling multiple action sequences from the trained source task; (3) feeding the collected action sequences into a recurrent neural network to train an action sequence prediction model. There are two strategies for transferring the action pattern to a new, unknown task: (1) the first method adds the action sequence model's prediction for the action at the current time step to the original reward as an intrinsic reward, encouraging the agent to follow the action pattern of the source task when executing actions; (2) the second method samples, at each time step, an action from the probability distribution over the next action output by the prediction model, and uses it for exploration in the target task.
Beneficial effects: a notable advantage of the invention is that it uses the features of the source task's action space to accelerate learning on new, unknown tasks and to reduce the number of learning samples. Compared with existing transfer methods, it can be flexibly used for transfer between tasks with different state spaces; for example, when the source task has a simple, small state space and the target task has a complex, large state space, the simple task can help solve the complex one.
Description of the drawings
Fig. 1 is the overall structure diagram of the invention.
Fig. 2 is the transfer flowchart of the invention.
Fig. 3 shows the transfer setting between tasks with the same state space.
Fig. 4 shows the transfer setting between tasks with different state spaces.
Fig. 5 shows the transfer setting from a small to a large state space.
Fig. 6 shows the transfer results between tasks with the same state space.
Fig. 7 shows the transfer results between tasks with different state spaces.
Fig. 8 shows the transfer results from a small to a large state space.
Specific embodiment
First, the formal definition of the action pattern is given:
Definition 1. Given an action sequence α1, α2, ..., αT sampled from a policy π on a task T = <S, A, P, r> (a Markov decision process, MDP), the action pattern is defined as the probability distribution over the next action conditioned on the historical action sequence, Pr(αt+1 | αt, αt−1, ..., α1).
Consider the transfer setting from multiple source tasks T_s^1, T_s^2, ..., T_s^N to a target task T_t, where T_s^i = <S_s^i, A, P_s^i, r_s^i> (i = 1, 2, ..., N) and T_t = <S_t, A, P_t, r_t>; π_s^i* is the optimal policy of source task T_s^i, and π_t* is the optimal policy of the target task T_t. We assume that the source tasks and the target task share the same action space A, and that their optimal policies π_s^i* and π_t* follow similar action patterns.
As shown in Fig. 1, the method of the invention mainly consists of two parts, the extraction and the transfer of the action pattern, which are explained in detail below:
Step 1: select and train multiple source tasks T_s^1, T_s^2, ..., T_s^N, obtaining the optimal policy π_s^i* of each source task.
Step 2: on each source task T_s^i, run multiple episodes with the optimal policy π_s^i* to generate multiple action sequences {a1, a2, ..., aT}.
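A minimal sketch of steps 1 and 2 is given below, assuming a generic episodic environment interface; the names env.reset, env.step, and policy are hypothetical and not prescribed by the invention:

```python
def sample_action_sequences(env, policy, num_episodes=100, max_steps=500):
    """Roll out the trained source policy and record each episode's action sequence."""
    sequences = []
    for _ in range(num_episodes):
        state = env.reset()                 # hypothetical environment interface
        actions = []
        for _ in range(max_steps):
            action = policy(state)          # the optimal source policy pi_s^i*
            state, _, done = env.step(action)
            actions.append(action)
            if done:
                break
        sequences.append(actions)
    return sequences
```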
Step 3: train the action sequence prediction model. A recurrent neural network (RNN) with long short-term memory (LSTM) units is used as the prediction model. The input of the model is the action αt at the current time step, and its output is the probability distribution Pr(αt+1) over the next action. Because of the memory in the hidden layer, Pr(αt+1) depends not only on the current input αt but also on the earlier inputs αt−1, ..., α1. The training objective is to minimize the negative log-likelihood (NLL), as follows:

L = − Σt log Pr(αt+1 | αt, ..., α1)

The objective is optimized by gradient descent, i.e., backpropagation through time (BPTT).
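A minimal sketch of such a prediction model follows, written in PyTorch as an assumption; the invention specifies only an LSTM-based recurrent network trained by minimizing the NLL with BPTT, not a particular framework or layer sizes:

```python
import torch
import torch.nn as nn

class ActionPatternModel(nn.Module):
    """LSTM predicting Pr(αt+1 | αt, ..., α1) over a discrete action set."""
    def __init__(self, num_actions, embed_dim=16, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(num_actions, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_actions)

    def forward(self, actions, hidden=None):
        # actions: (batch, seq_len) tensor of integer action indices
        out, hidden = self.lstm(self.embed(actions), hidden)
        return self.head(out), hidden       # logits over the next action

def train_step(model, optimizer, batch):
    # batch: (batch, T) action indices; inputs are a1..aT-1, targets a2..aT
    inputs, targets = batch[:, :-1], batch[:, 1:]
    logits, _ = model(inputs)
    # cross-entropy over next actions equals the NLL objective above;
    # loss.backward() performs backpropagation through time (BPTT)
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```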
Step 4-1: action pattern transfer based on an intrinsic reward mechanism. As shown in Fig. 2, at each time step t the action prediction model updates its state with the current action αt and outputs the probability distribution Pr(αt+1) (αt+1 ∈ A) over the next action. This transfer method adds an intrinsic reward to the original reward at each time step, forming a new reward function that encourages the action selection to follow the action pattern, as follows:

r′t = rt + α (P − 1/N)

where rt is the original reward at the current time step; r′t is the new reward; α is a constant that balances the influence of the action pattern on learning the target task and is usually decayed over time; P denotes the probability Pr(αt) that the action prediction model assigns to the action αt executed at the current time step; and N is the number of elements of the action set A, so that 1/N is the average probability of each action being selected.
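A minimal sketch of this shaping, directly implementing the formula above (function and argument names are illustrative):

```python
def shaped_reward(reward, pattern_prob, num_actions, alpha):
    """r't = rt + α (P − 1/N), where P = Pr(αt) is the probability the
    prediction model assigned to the executed action and N = |A|."""
    return reward + alpha * (pattern_prob - 1.0 / num_actions)
```

The shaped reward r′t then simply replaces rt in the underlying learner's update, for example in the Sarsa(0) temporal-difference target.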
Step 4-2: action pattern transfer by a heuristic exploration strategy. As shown in Fig. 2, at each time step t the action prediction model updates its state with the current action αt and outputs the probability distribution Pr(αt+1) (αt+1 ∈ A) over the next action. This transfer method samples the next action αt+1 directly from that distribution and uses it as the exploration strategy of the learning algorithm, replacing random exploration. The heuristic exploration strategy based on the action pattern is as follows:

αt+1 = an action sampled from Pr(αt+1), if p < ε;  αt+1 = a*, otherwise

where p is a random number drawn from (0, 1); 0 ≤ ε ≤ 1; and a* is the optimal next action. That is, with probability ε the next action is drawn from the prediction distribution Pr(αt+1), and with probability 1 − ε the optimal action is selected. As learning proceeds, the value of ε is gradually decreased until it reaches 0.
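A minimal sketch of this exploration rule for a value-based learner (the Q-value interface is an assumption; the experiments below use Sarsa(0) and DQN as base algorithms):

```python
import numpy as np

def select_action(q_values, next_action_probs, epsilon, rng=np.random):
    """With probability ε sample the next action from the prediction model's
    distribution Pr(αt+1); otherwise act greedily. ε is decayed toward 0."""
    if rng.random() < epsilon:                      # p < ε: follow the pattern
        return int(rng.choice(len(next_action_probs), p=next_action_probs))
    return int(np.argmax(q_values))                 # otherwise: greedy action
```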
To verify the effectiveness and flexibility of the above transfer method, the present invention designs and tests transfer settings for three different situations, with the following results:
(1) Transfer between tasks with the same state space: this experiment uses the four-room grid world as the environment, where each grid cell represents a state. In any state, the agent can choose one of four actions (up, down, left, right), but the outcome is stochastic: with probability 2/3 the agent moves in the direction of the chosen action, and with probability 1/3 one of the other three actions is chosen at random (each with probability 1/9). If the move would take the agent into a wall, the agent keeps its current state. The goal is placed at a random cell in one room, and the initial state is drawn at random from all states. On reaching the goal the reward is 1.0 and the episode ends; otherwise, the reward of each step is 0. As shown in Fig. 3, the source and target tasks differ in the position of the goal state; they have the same state space but different reward functions. The results of the transfer method are shown in Fig. 6, with Sarsa(0) as the base algorithm: Base is the result without transfer, HE is the action pattern transfer method with heuristic exploration, and IR is the intrinsic reward transfer method (with the constant α set to 0.01, 0.0001, and 0.00001). The left plot shows episode length over time, and the right plot shows reward over time. Compared with the algorithm without action pattern transfer, both action pattern transfer methods accelerate learning. The heuristic exploration method is clearly better and more stable, and does not require manually tuning a parameter to the size of the task and the reward; for the intrinsic reward method, it can be seen that the choice of the constant α has a large influence on the effect of transfer.
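For concreteness, a minimal sketch of the stochastic grid-world dynamics described above (the coordinate convention and the representation of walls as a set of cells are assumptions):

```python
import numpy as np

MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

def grid_step(pos, action, walls, rng=np.random):
    """Execute the chosen action with probability 2/3; otherwise take one of
    the other three actions (probability 1/9 each). Moving into a wall
    leaves the agent where it is."""
    if rng.random() >= 2.0 / 3.0:
        action = int(rng.choice([a for a in MOVES if a != action]))
    dr, dc = MOVES[action]
    new_pos = (pos[0] + dr, pos[1] + dc)
    return pos if new_pos in walls else new_pos
```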
(2) Transfer between tasks with different state spaces: as shown in Fig. 4, the four-room task is extended to an eight-room task by duplicating it row-wise or column-wise. The source task is the four-room state space from the experiment above, and the target task is the extended eight-room state space task. The results are shown in Fig. 7, with Sarsa(0) as the base algorithm; the figure plots episode length over time for the two different target tasks (row duplication and column duplication). The results on both eight-room tasks show that action pattern transfer greatly increases the learning speed on the target task and reduces the number of samples needed for learning.
(3) Transfer from a small state space to a large state space task: in deep reinforcement learning tasks the input states are raw pixels, the state space is large, and a convolutional neural network is needed to estimate the value function, but learning the key features of the images usually requires many learning samples and a long training time. The transfer method provides a strategy that uses a similar action pattern to accelerate learning. As shown in Fig. 5, the source task is a one-dimensional maze task, and the target task is a shooting scenario in the first-person shooter platform ViZDoom. With DQN as the base algorithm, the results after transferring the action pattern are compared in Fig. 8: the algorithm with action pattern transfer converges faster and needs fewer learning samples.
Claims (5)
1. A reinforcement learning transfer method based on action patterns, characterized in that a new kind of transferable knowledge, the action pattern, is defined and expressed as the predicted probability distribution over the next action conditioned on the historical action sequence.
2. A reinforcement learning transfer method based on action patterns, characterized in that an action sequence prediction model is proposed, namely a recurrent neural network that models the action pattern according to claim 1.
3. A reinforcement learning transfer method based on action patterns, characterized in that a more flexible transfer framework is proposed which, based on the action pattern according to claim 1, can be used for transfer between tasks with different state spaces, that is, knowledge from a task with a simple state space helps solve a task with a complex state space. The main steps are as follows:
(1) extract knowledge from the source task using the action sequence prediction model according to claim 2;
(2) transfer the extracted knowledge to the target task, the transfer methods being action pattern transfer based on an intrinsic reward mechanism and action pattern transfer by a heuristic exploration strategy.
4. The action pattern transfer method based on an intrinsic reward mechanism according to claim 3, characterized in that the current action prediction probability output by the action sequence prediction model according to claim 2 is added to the original reward as an intrinsic reward, forming a new reward function.
5. The action pattern transfer method by a heuristic exploration strategy according to claim 3, characterized in that the predicted probability distribution over the next action output by the action sequence prediction model according to claim 2 is used in the exploration strategy, that is, in the exploration phase of the policy an action is not chosen uniformly at random but is sampled from the action prediction distribution.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811646218.9A CN109978133A (en) | 2018-12-29 | 2018-12-29 | Reinforcement learning transfer method based on action patterns |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811646218.9A CN109978133A (en) | 2018-12-29 | 2018-12-29 | Reinforcement learning transfer method based on action patterns |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109978133A true CN109978133A (en) | 2019-07-05 |
Family
ID=67076455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811646218.9A Pending CN109978133A (en) | Reinforcement learning transfer method based on action patterns | 2018-12-29 | 2018-12-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109978133A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111260040A (en) * | 2020-05-06 | 2020-06-09 | 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) | Video game decision method based on intrinsic rewards |
CN113938397A (en) * | 2021-10-13 | 2022-01-14 | 苏州龙卷风云科技有限公司 | Method and device for predicting SR type flow delay in vehicle-mounted time-sensitive network |
CN113938397B (en) * | 2021-10-13 | 2024-02-02 | 苏州龙卷风云科技有限公司 | SR class traffic delay prediction method and device in vehicle-mounted time-sensitive network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107833183B (en) | Method for simultaneously super-resolving and coloring satellite image based on multitask deep neural network | |
JP7510637B2 (en) | How to generate a general-purpose trained model | |
CN111259738B (en) | Face recognition model construction method, face recognition method and related device | |
CN107253195B (en) | A kind of carrying machine human arm manipulation ADAPTIVE MIXED study mapping intelligent control method and system | |
CN110515303A (en) | A kind of adaptive dynamic path planning method based on DDQN | |
CN111062491A (en) | Intelligent agent unknown environment exploration method based on reinforcement learning | |
CN109523029A (en) | For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body | |
CN107179077B (en) | Self-adaptive visual navigation method based on ELM-LRF | |
CN111008449A (en) | Acceleration method for deep reinforcement learning deduction decision training in battlefield simulation environment | |
CN109978133A (en) | Reinforcement learning transfer method based on action patterns | |
CN114510012A (en) | Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning | |
CN111159489A (en) | Searching method | |
Zhang et al. | Birds foraging search: a novel population-based algorithm for global optimization | |
CN111282272B (en) | Information processing method, computer readable medium and electronic device | |
CN108891421A (en) | A method of building driving strategy | |
CN114116995B (en) | Session recommendation method, system and medium based on enhanced graph neural network | |
CN110222817A (en) | Convolutional neural networks compression method, system and medium based on learning automaton | |
CN110866866B (en) | Image color imitation processing method and device, electronic equipment and storage medium | |
Källström et al. | Reinforcement learning for computer generated forces using open-source software | |
CN110047088A (en) | A kind of HT-29 image partition method based on improvement learning aid optimization algorithm | |
CN115204249A (en) | Group intelligent meta-learning method based on competition mechanism | |
Shanthi et al. | The Blue Brain Technology using Machine Learning | |
CN113360669A (en) | Knowledge tracking method based on gated graph convolution time sequence neural network | |
CN115202339B (en) | DQN-based multi-moon vehicle sampling fixed target self-adaptive planning method | |
CN111950691A (en) | Reinforced learning strategy learning method based on potential action representation space |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | | Application publication date: 20190705 |