CN109978133A - Reinforcement learning transfer method based on action patterns - Google Patents

Reinforcement learning transfer method based on action patterns Download PDF

Info

Publication number
CN109978133A
CN109978133A (application CN201811646218.9A)
Authority
CN
China
Prior art keywords
transfer
action
task
action pattern
transfer method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811646218.9A
Other languages
Chinese (zh)
Inventor
丁晓静
吴章凯
高阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Nanjing University
Original Assignee
JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd, Nanjing University filed Critical JIANGSU WANWEI AISI NETWORK INTELLIGENT INDUSTRY INNOVATION CENTER Co Ltd
Priority to CN201811646218.9A priority Critical patent/CN109978133A/en
Publication of CN109978133A publication Critical patent/CN109978133A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a new reinforcement learning transfer method, namely transfer based on action patterns, which uses an existing learned model to accelerate the solution of a new, unknown task. The transfer method can be applied to transfer between tasks with different state spaces, so that knowledge from a task with a simple state space helps solve a task with a complex state space. The invention defines action patterns and proposes an action-sequence prediction model to extract this knowledge from the source task. For transferring the action pattern to the target task, two methods are proposed: transfer based on an intrinsic reward mechanism and transfer via a heuristic exploration strategy.

Description

Reinforcement learning transfer method based on action patterns
Technical field
The present invention relates to a new reinforcement learning transfer method: transfer based on action patterns.
Background technique
Artificial intelligence applications in fields such as autonomous driving and service robotics require the ability to make decisions in uncertain, dynamic environments. This is precisely the domain of reinforcement learning, a machine learning method that works through interaction with, and feedback from, the environment. Reinforcement learning is a trial-and-error method and needs a large number of learning samples, which is one reason it is difficult to apply in practice.
Transfer learning is a learning framework that exploits the similarity between tasks to extract and transfer useful knowledge from models that have already been learned, thereby accelerating the solution of new, unknown tasks. There are many types of transfer in reinforcement learning, such as transfer based on model parameters, transfer based on sub-policies, and transfer based on representations, but most existing methods are only applicable to transfer between tasks with identical state-action spaces. The discovery and extraction of higher-level abstract knowledge, together with more flexible transfer methods, still remain to be invented.
In reinforcement learning transfer, the features of the state space are often considered, while the features and structure of the action space are rarely involved. The states of a reinforcement learning task come in many different types, for example small-scale state spaces in which grid positions are encoded numerically, and large-scale state spaces in which each frame of a video game is represented by its pixels. Representing and understanding the state space is an important problem, whereas the action space is usually simpler and the meaning of actions is more fixed; a navigation task, for instance, involves only the four actions east, west, south and north. The knowledge learned from the environment is contained in the combinations of different action sequences. In hierarchical reinforcement learning, once the action space is divided into high-level and low-level actions, the complexity of the problem is greatly reduced. Understanding the structure of the action space, the different combinations of action sequences and the relationships between actions constitutes knowledge at a more abstract, semantic level.
Summary of the invention
Object of the invention: aiming at the action space of reinforcement learning tasks, the present invention proposes a new kind of transferable knowledge, the action pattern, which is applicable not only to transfer between tasks with the same state space but also to transfer between tasks with different state spaces.
Technical solution: the transfer method used by the present invention consists of two parts, extracting action patterns from the source task and transferring the action patterns to the target task. The extraction part includes: (1) training the source task; (2) sampling multiple action sequences from the trained source task; (3) feeding the collected action sequences into a recurrent neural network and training an action-sequence prediction model. There are two strategies for transferring the action pattern to the new, unknown task: (1) the probability that the action-sequence model predicts for the action executed at the current time step is added to the original reward as an intrinsic reward, encouraging the agent to follow the action pattern of the source task when selecting actions; (2) at each time step, an action is sampled from the probability distribution over the next action output by the prediction model and used for exploration in the target task.
Beneficial effects: the notable advantage of the invention is that it exploits the characteristics of the source task's action space to speed up learning on the new, unknown task and to reduce the number of learning samples. Compared with existing transfer methods, it can be used flexibly for transfer between tasks with different state spaces: if the source task is a simple, small-scale state space task and the target task is a complex, large-scale state space task, the simple task can help solve the complex one.
Detailed description of the invention
Fig. 1 is the overall structure diagram of the invention.
Fig. 2 is the transfer flow chart of the invention.
Fig. 3 shows the transfer setting between tasks with the same state space.
Fig. 4 shows the transfer setting between tasks with different state spaces.
Fig. 5 shows the transfer setting from a small-scale to a large-scale state space.
Fig. 6 shows the transfer results between tasks with the same state space.
Fig. 7 shows the transfer results between tasks with different state spaces.
Fig. 8 shows the transfer results from a small-scale to a large-scale state space.
Specific embodiment
The formal definition of an action pattern is given first:
Definition 1. Given an action sequence a_1, a_2, ..., a_T sampled from a policy π of a task T = <S, A, P, r> (a Markov decision process, MDP), the action pattern is defined as the probability distribution over the next action conditioned on the action history, Pr(a_{t+1} | a_t, a_{t-1}, ..., a_1).
Consider the transfer from multiple source tasks T_s^1, T_s^2, ..., T_s^N to a target task T_t, where T_s^i = <S_s^i, A, P_s^i, r_s^i> (i = 1, 2, ..., N) and T_t = <S_t, A, P_t, r_t>. Let π_s^{i*} denote the optimal policy of source task T_s^i and π_t^* the optimal policy of the target task T_t. We assume that the source tasks and the target task share the same action space A, and that their optimal policies π_s^{i*} and π_t^* follow similar action patterns.
As shown in Fig. 1, the method of the invention mainly consists of two parts, the extraction and the transfer of action patterns, which are explained in detail below:
Step 1: select and train multiple source tasks T_s^1, T_s^2, ..., T_s^N, obtaining the optimal policy π_s^{i*} of each source task.
Step 2: on each source task T_s^i, run multiple episodes with the optimal policy π_s^{i*}, generating a set of action sequences {a_1, a_2, ..., a_T}.
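What follows is a minimal sketch of this sampling step, assuming a Gymnasium-style environment and a `policy(state) -> action` callable for each trained source task; the function name, episode count and step limit are illustrative assumptions, not part of the patent.

```python
# Illustrative sketch: sample action sequences from a trained source-task policy.
def sample_action_sequences(env, policy, num_episodes=100, max_steps=500):
    sequences = []
    for _ in range(num_episodes):
        state, _ = env.reset()
        actions = []
        for _ in range(max_steps):
            action = policy(state)          # action chosen by the learned source policy
            state, reward, terminated, truncated, _ = env.step(action)
            actions.append(action)
            if terminated or truncated:
                break
        sequences.append(actions)           # one sequence a_1, ..., a_T per episode
    return sequences
```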
Step 3: train the action-sequence prediction model. A recurrent neural network (RNN) with long short-term memory (LSTM) units is used as the prediction model. The input of the model is the action a_t at the current time step and the output is the probability distribution Pr(a_{t+1}) over the next action. Because of the memory of the hidden layer, the distribution Pr(a_{t+1}) depends not only on the current input a_t but also on the earlier inputs a_{t-1}, ..., a_1. The training objective is to minimize the negative log-likelihood (NLL) over the sampled sequences, as follows:
L(θ) = − Σ_t log Pr(a_{t+1} | a_t, ..., a_1; θ)
The optimization is solved with gradient descent, i.e. backpropagation through time (BPTT).
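A compact sketch of such a prediction model, assuming PyTorch as the framework; the hidden size and other hyper-parameters are illustrative assumptions, not values specified by the patent.

```python
# Illustrative sketch: LSTM action-sequence prediction model trained with the
# negative log-likelihood (cross-entropy) of the next action, optimized by BPTT.
import torch
import torch.nn as nn

class ActionPatternModel(nn.Module):
    def __init__(self, num_actions, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(num_actions, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_actions)

    def forward(self, actions, state=None):
        # actions: LongTensor of shape (batch, seq_len) holding a_1, ..., a_T
        h, state = self.lstm(self.embed(actions), state)
        return self.head(h), state          # logits for Pr(a_{t+1} | a_1, ..., a_t)

def train_step(model, optimizer, actions):
    # predict a_2, ..., a_T from a_1, ..., a_{T-1}; NLL equals cross-entropy on the logits
    logits, _ = model(actions[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), actions[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()                          # gradients flow back through time (BPTT)
    optimizer.step()
    return loss.item()
```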
Step 4-1: action pattern transfer based on an intrinsic reward mechanism. As shown in Fig. 2, at each time step t the action prediction model updates its state with the current action a_t and outputs the probability distribution Pr(a_{t+1}) over the next action (a_{t+1} ∈ A). This transfer method adds an intrinsic reward to the original reward at each time step, forming a new reward function that encourages the action selection to follow the action pattern, as follows:
r'_t = r_t + α (P − 1/N)
where r_t is the original reward at the current time step; r'_t is the new reward; α is a constant that balances the influence of the action pattern on the learning process of the target task and is usually decreased over time; P denotes the probability Pr(a_t) that the action prediction model assigns at the current time step to the executed action a_t; and N is the number of elements of the action set A, so that 1/N is the average probability of each action being selected.
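A sketch of this reward shaping, reusing the ActionPatternModel sketched above; the exact shaping formula r'_t = r_t + α(P − 1/N) is reconstructed from the symbol definitions and should be read as an assumption rather than the patent's verbatim equation.

```python
# Illustrative sketch: add an intrinsic reward proportional to how strongly the
# executed action follows the transferred action pattern.
import torch

def shaped_reward(model, lstm_state, prev_action, action, env_reward, alpha, num_actions):
    # advance the pattern model with the previous action and read Pr(a_t)
    logits, lstm_state = model(torch.tensor([[prev_action]]), lstm_state)
    probs = torch.softmax(logits[0, -1], dim=-1)
    p = probs[action].item()                        # P = Pr(a_t) of the executed action
    intrinsic = alpha * (p - 1.0 / num_actions)     # positive when the action fits the pattern
    return env_reward + intrinsic, lstm_state
```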
Step 4-2: action pattern transfer via a heuristic exploration strategy. As shown in Fig. 2, at each time step t the action prediction model updates its state with the current action a_t and outputs the probability distribution Pr(a_{t+1}) over the next action (a_{t+1} ∈ A). This transfer method samples the next action a_{t+1} directly from that distribution and uses it as the exploration strategy of the learning algorithm, replacing random exploration. The heuristic exploration strategy following the action pattern is as follows:
a_{t+1} = sampled from Pr(a_{t+1} | a_t, ..., a_1) if p < ε, and a*_{t+1} otherwise
where p is a random number drawn from (0, 1); 0 ≤ ε ≤ 1; and a*_{t+1} is the optimal next action. That is, with probability ε the next action is chosen from the prediction distribution Pr(a_{t+1}), and with probability 1 − ε the optimal action is selected. As learning proceeds, the value of ε is gradually decreased until it reaches 0.
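A minimal sketch of this exploration rule, again reusing the ActionPatternModel above; `q_values` (the value estimates that define the greedy action) and the way ε is decayed are assumptions made for illustration.

```python
# Illustrative sketch: with probability epsilon explore along the action pattern,
# otherwise take the greedy (optimal) action of the current value estimates.
import random
import torch

def pattern_guided_action(model, lstm_state, prev_action, q_values, epsilon):
    logits, lstm_state = model(torch.tensor([[prev_action]]), lstm_state)
    probs = torch.softmax(logits[0, -1], dim=-1)
    if random.random() < epsilon:
        action = torch.multinomial(probs, 1).item()          # sample from Pr(a_{t+1})
    else:
        action = max(range(len(q_values)), key=lambda a: q_values[a])  # greedy action
    return action, lstm_state
```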
To verify the validity and flexibility of the above transfer method, the invention designs transfer settings for three different situations and tests them; the results are as follows:
(1) Transfer between tasks with the same state space: this transfer uses the 4-room grid world as the environment. Grid cells represent the states of the environment. In any state the agent can choose one of four actions (up, down, left, right), but the outcome is stochastic: with probability 2/3 it moves in the chosen direction, and with probability 1/3 it moves in one of the other three directions chosen at random (probability 1/9 each). If the move would take the agent into a wall, it stays in place. The goal is placed at a random cell in one room, and the initial state is drawn at random from all states. Reaching the goal gives a reward of 1.0 and ends the episode; otherwise the reward of each step is 0. As shown in Fig. 3, the source task and the target task differ in the position of the goal state; they have the same state space but different reward functions. The results of the transfer method are shown in Fig. 6, with Sarsa(0) as the base algorithm: Base is the result without transfer, HE is the action pattern transfer with heuristic exploration, and IR is the intrinsic-reward transfer (with the constant α set to 0.01, 0.0001 and 0.00001). The left plot shows episode length over time and the right plot shows reward over time. Compared with the algorithm without action pattern transfer, both transfer methods accelerate learning. The heuristic exploration method, however, is clearly better and more stable, and does not require manually tuning parameters to the scale of the task and its rewards, whereas for the intrinsic-reward method the choice of the constant α has a large effect on the transfer.
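For concreteness, a small sketch of the grid-world dynamics described above; the room layout and wall representation are assumptions, and only the transition probabilities and rewards follow the text.

```python
# Illustrative sketch of the 4-room grid-world step: the chosen action succeeds
# with probability 2/3, otherwise one of the other three actions is taken (1/9
# each); bumping into a wall keeps the state; reaching the goal gives reward 1.0.
import random

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right

def step(state, action_idx, walls, goal):
    if random.random() < 2.0 / 3.0:
        chosen = action_idx                       # intended direction
    else:
        chosen = random.choice([a for a in range(4) if a != action_idx])
    nxt = (state[0] + ACTIONS[chosen][0], state[1] + ACTIONS[chosen][1])
    if nxt in walls:
        nxt = state                               # blocked move: stay in place
    reward = 1.0 if nxt == goal else 0.0
    return nxt, reward, nxt == goal
```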
(2) Transfer between tasks with different state spaces: as shown in Fig. 4, the 4-room task is extended to an 8-room task by duplicating it row-wise or column-wise. The source task is the 4-room state space used in the experiment above, and the target task is the extended 8-room state space task. The results, shown in Fig. 7 with Sarsa(0) as the base algorithm, plot episode length over time for the two different target tasks (row duplication and column duplication). Both 8-room results show that action pattern transfer greatly increases the learning speed on the target task and reduces the number of samples needed for learning.
(3) Transfer from a small-scale state space to a large-scale state space task: in deep reinforcement learning tasks the input states are raw pixels, the state space is very large, and a convolutional neural network is needed to estimate the value function, but learning the key features from images usually requires many samples and a long training time. The transfer method provides a strategy for accelerating learning by exploiting similar action patterns. As shown in Fig. 5, the source task is a one-dimensional maze task and the target task is a shooting scenario in the first-person shooter platform ViZDoom. With DQN as the base algorithm, the results after transferring the action pattern are shown in Fig. 8: the algorithm with action pattern transfer converges faster and needs fewer learning samples.

Claims (5)

1. A reinforcement learning transfer method based on action patterns, characterized in that it defines a new kind of transferable knowledge, the action pattern, expressed as the predicted probability distribution over the next action conditioned on the action history.
2. A reinforcement learning transfer method based on action patterns, characterized in that it proposes an action-sequence prediction model, i.e. a recurrent neural network is used to model the action pattern of claim 1.
3. A reinforcement learning transfer method based on action patterns, characterized in that it proposes a more flexible transfer framework based on the action pattern of claim 1, which can be used for transfer between tasks with different state spaces, i.e. knowledge of a task with a simple state space helps solve a task with a complex state space. The main steps are as follows:
(1) extract knowledge from the source task using the action-sequence prediction model of claim 2;
(2) transfer the extracted knowledge to the target task, the transfer methods being action pattern transfer based on an intrinsic reward mechanism and action pattern transfer via a heuristic exploration strategy.
4. The action pattern transfer method based on an intrinsic reward mechanism according to claim 3, characterized in that the probability that the action-sequence prediction model of claim 2 predicts for the current action is added to the original reward as an intrinsic reward, forming a new reward function.
5. The action pattern transfer method via a heuristic exploration strategy according to claim 3, characterized in that the probability distribution over the next action output by the action-sequence prediction model of claim 2 is used in the exploration strategy, i.e. in the exploration phase of the policy an action is not chosen uniformly at random but is sampled according to the action prediction distribution.
CN201811646218.9A 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns Pending CN109978133A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811646218.9A CN109978133A (en) 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811646218.9A CN109978133A (en) 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns

Publications (1)

Publication Number Publication Date
CN109978133A true CN109978133A (en) 2019-07-05

Family

ID=67076455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811646218.9A Pending CN109978133A (en) 2018-12-29 2018-12-29 Reinforcement learning transfer method based on action patterns

Country Status (1)

Country Link
CN (1) CN109978133A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111260040A (en) * 2020-05-06 2020-06-09 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Video game decision method based on intrinsic rewards
CN113938397A (en) * 2021-10-13 2022-01-14 苏州龙卷风云科技有限公司 Method and device for predicting SR type flow delay in vehicle-mounted time-sensitive network
CN113938397B (en) * 2021-10-13 2024-02-02 苏州龙卷风云科技有限公司 SR class traffic delay prediction method and device in vehicle-mounted time-sensitive network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
Application publication date: 20190705