CN108898221A - Joint learning method of features and policies based on state features and successor features - Google Patents
- Publication number
- CN108898221A (application CN201810601576.1A)
- Authority
- CN
- China
- Prior art keywords
- feature
- state
- learning
- subsequent
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention discloses a joint policy learning method based on state features and successor features, including: obtaining state features that characterize the input state by learning the mapping from input states to immediate rewards; obtaining successor features by learning the mapping from state features to the value function; and, since the obtained state features and successor features are at different temporal resolutions, fusing the state features with the successor features and then learning the fusion result with a policy learning network of a selectable form. Compared with a conventional Agent network, the invention uses sample information more efficiently; compared with other algorithms, learning is markedly faster, and the network also converges sooner and achieves better learning performance.
Description
Technical field
The present invention relates to the field of deep reinforcement learning, and in particular to a joint learning method of features and policies based on state features and successor features.
Background technique
Deep Reinforcement Learning is a sequential decision-making method based on deep networks. It merges deep learning and reinforcement learning, realizes end-to-end learning from states to actions, and continuously strengthens and refines the policy while interacting with the environment. In high-dimensional, challenging problems, a deep neural network automatically extracts effective features from perceptual information and performs policy learning and direct action output on that basis, i.e., there is no hard-coded step in policy learning. Deep reinforcement learning can effectively solve the perception and decision problems of an intelligent agent (Agent) in high-dimensional, challenging tasks; it is a frontier research direction of general artificial intelligence and has broad application prospects.
In deep reinforcement learning, the Agent obtains samples through continuous interaction with the environment and extracts effective information from those samples to train a deep policy network. Obviously, the core of such an algorithm is how to extract features effectively from the samples. However, feature extraction is usually not specially designed when the Agent network is constructed; the extraction quality depends solely on how well the network trains. Since reinforcement learning is a highly dynamic, non-stationary process, massive amounts of training data are usually needed for network training. On the one hand, low-quality samples impair training and convergence; on the other hand, the training process uses samples inefficiently. These problems greatly increase the training cost of the network.
Summary of the invention
The object of the present invention is to provide a joint learning method of features and policies based on state features and successor features that can improve sample utilization efficiency.
The object of the present invention is achieved through the following technical solutions:
A joint learning method of features and policies based on state features and successor features, including:
obtaining state features that characterize the input state by learning the mapping from input states to immediate rewards;
obtaining successor features by learning the mapping from state features to the value function;
the obtained state features and successor features being at different temporal resolutions, fusing the state features with the successor features, and then learning the fusion result with a policy learning network of a selectable form.
As can be seen from the technical solution above, compared with a conventional Agent network, the present invention improves the sample utilization efficiency of the algorithm and can use sample information more efficiently for policy learning. Under the same GPU/CPU hardware configuration, learning is markedly faster than the baseline algorithm, and an effective policy can be learned from fewer samples. With limited samples, the policy network converges faster and achieves better learning performance.
Detailed description of the invention
In order to describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a joint learning method of features and policies based on state features and successor features provided by an embodiment of the present invention;
Fig. 2 is a framework diagram of a joint learning method of features and policies based on state features and successor features provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram comparing the performance of the scheme of the present invention with that of existing schemes.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The embodiment of the present invention provides a joint learning method of features and policies based on state features and successor features, aiming to solve the problem of low sample utilization efficiency in feature learning in conventional deep reinforcement learning.
This scheme analyzes the reinforcement learning formulation and designs a policy learning scheme with a deep network. In a reinforcement learning problem, the Agent, while interacting with the environment, receives the state s in real time, produces an action a, and receives the immediate reward r generated by the environment. The goal of policy learning is to learn a robust policy π that maximizes the accumulated discounted reward

$$v_\pi(s_0)=\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty}\gamma^{t} r_t \,\Big|\, s_0\Big],$$

where γ is the discount factor, s₀ is the initial state, and the expected accumulated discounted reward under policy π is computed starting from the current time t = 0. In particular, the learning objective of Q-learning is

$$V(s)=\max_a Q(s,a)=\max_a\{r(s,a)+\gamma\max_{a'}Q(\delta(s,a),a')\}=\max_a\{r+\gamma\max_{a'}Q(s',a')\},$$

where δ(s, a) = s' denotes the state transition. It can be seen that the Q value is bootstrapped, and function approximation can be used to learn it iteratively.
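As a concrete illustration of this bootstrapped target, the backup $Q(s,a)=r+\gamma\max_{a'}Q(s',a')$ can be iterated on a toy MDP. The following is a minimal sketch on an illustrative deterministic chain (the MDP, state names, and function name are assumptions for illustration, not part of the patented scheme):

```python
# Sketch of the bootstrapped Q target on a toy deterministic chain MDP
# (states 0 -> 1 -> 2, terminal at 2; reward 1.0 on entering the terminal).
def q_iteration(gamma=0.9, sweeps=50):
    Q = {0: {"right": 0.0}, 1: {"right": 0.0}}  # Q[s][a], one action "right"
    nxt = {0: 1, 1: 2}
    rew = {0: 0.0, 1: 1.0}
    for _ in range(sweeps):
        for s in Q:
            s2 = nxt[s]
            # Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')  (0 at the terminal state)
            boot = max(Q[s2].values()) if s2 in Q else 0.0
            Q[s]["right"] = rew[s] + gamma * boot
    return Q

Q = q_iteration()
print(Q[1]["right"], Q[0]["right"])  # 1.0 0.9
```

Each sweep propagates the reward one step backward through the bootstrap term, which is exactly the iterative function-estimation process mentioned above.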
This scheme obtains the immediate reward function $r(s)=\phi^{T}(s)\,w$ with a deep network, where $\phi_s=\phi(s)$ is the state feature extracted from the current state and w is the prediction weight. Substituting this immediate reward function into the formula for $v_\pi(s)$ and defining the successor feature

$$\psi^{\pi}(s)=\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t)\,\Big|\,s_0=s\Big],$$

the analysis yields the value function $v_\pi(s)=\psi^{\pi}(s)^{T}w$.
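The decomposition $v_\pi(s)=\psi^{\pi}(s)^{T}w$ can be checked numerically on a single sampled trajectory; a minimal sketch, where the trajectory features and the weight vector are illustrative assumptions:

```python
import numpy as np

# If r(s) = phi(s)^T w, then the discounted reward sum along a trajectory
# equals psi^T w with psi = sum_t gamma^t phi(s_t).
gamma = 0.9
w = np.array([0.2, -0.1, 0.7])                      # prediction weight w
phis = [np.array([1.0, 0.0, 0.5]),                  # phi(s_0), phi(s_1), ...
        np.array([0.0, 1.0, 0.2]),
        np.array([0.3, 0.3, 0.0])]

psi = sum((gamma ** t) * phi for t, phi in enumerate(phis))          # successor feature
ret = sum((gamma ** t) * float(phi @ w) for t, phi in enumerate(phis))  # discounted rewards

print(abs(float(psi @ w) - ret) < 1e-12)  # True: v = psi^T w by linearity
```

The equality holds by linearity of the discounted sum, which is why sharing the single weight w between the reward branch and the value branch is consistent.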
The embodiment of the present invention mainly performs joint learning of features and policies; as shown in Fig. 1, it mainly includes:
Step 1: learn the mapping from the input state (the input is a state generated by simple combination of raw game frames, i.e., raw-pixel-level input of high dimensionality) to the immediate reward, to obtain state features that characterize the input state.
The above mapping is learned by a convolutional neural network: the input of the network is the high-dimensional state, the output is the immediate reward, and what the network learns is the mapping from the high-dimensional input to the immediate reward. The network consists mainly of convolutional layers and a final fully connected layer; the convolutional layers produce the state features of the high-dimensional state input, and the parameter w of the fully connected layer is the prediction weight. It can be seen that the state features are extracted automatically by the network rather than hand-designed.
In this step, the prediction weight w from state features to immediate rewards represents the spatial distribution of the immediate reward function. On the one hand, extracting the effective information of the state is the foundation of the whole network architecture; on the other hand, learning the spatial reward distribution helps model the environment. Whether effective state features are learned is crucial to policy learning.
In addition, to improve sample utilization efficiency, an auxiliary task, a state reconstruction task, is added while the state features are learned. State reconstruction mirrors the state-feature extraction network as a fully convolutional structure built from deconvolution layers. The state reconstruction task and the immediate reward learning task (i.e., the mapping from input states to immediate rewards; that is, one input state is used by two network branches) share the bottom convolutional network, so training either task updates the parameters of the shared convolutional trunk and thus affects the learning of the state feature $\phi_s$.
Step 2: obtain successor features by learning the mapping from state features to the value function.
In the embodiment of the present invention, the successor feature is determined by the preceding formula $\psi^{\pi}(s)=\mathbb{E}_\pi[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t)\mid s_0=s]$; here it is obtained by network learning and extracted automatically by the network rather than hand-designed.
In the embodiment of the present invention, the prediction weight w from successor features to the value function is kept identical to the prediction weight w from state features to immediate rewards, which is realized directly by weight sharing during training. On the one hand, this guarantees the validity of the successor features obtained at the level of the formula analysis, i.e., $r(s)=\phi^{T}(s)\,w$ and $v_\pi(s)=\psi^{\pi}(s)^{T}w$; on the other hand, it realizes the joint training of state features and successor features.
Step 3: the obtained state features and successor features being at different temporal resolutions, fuse the state features with the successor features and then learn the fusion result with a policy learning network of a selectable form.
After the state features and successor features are learned in Steps 1 and 2, the fusion of the two kinds of features is learned. According to the preceding analysis, the two kinds of features are at different temporal resolutions but intrinsically related; here, the effect of two different fusion methods on policy learning is compared to determine a suitable way of using the features. Accordingly, feature learning and policy learning proceed synchronously, and feature extraction and policy learning reinforce each other.
In the embodiment of the present invention, the state features and successor features can be combined in two ways; both feature-combination modes are tested and their performance compared, and the better-performing method is selected to combine the state features and successor features. One way simply concatenates the two kinds of features; the other treats the successor feature as a goal direction and uses it to modulate or weight the state feature, with policy learning again performed after fusion. The two combination modes are mainly tested here to determine which is more effective in actual use, and the same applies to the two policy learning schemes below. In fact, unlike a traditional network, the proposed method modularizes feature learning and policy learning, so joint learning can be realized while performance can also be tested module by module.
The policy learning network can take diverse forms, for example policy learning based on linear regression or policy learning based on an LSTM. The fused feature is fed into the chosen policy learning network, which directly outputs the action. After the Agent executes the corresponding action, it obtains the new state and related information, and a new round of training is carried out.
In the above scheme of the embodiment, the state features and successor features are tied to the Agent's learning: they are designed for the reward function and the value function. On the one hand, this guarantees that there is no traditional hand design in the feature extraction process, so end-to-end learning is realized; on the other hand, feature extraction is more targeted, which improves sample utilization efficiency. Samples can thus be learned quickly and accurately into an effective policy, the Agent is motivated to explore the space, and learning efficiency is further enhanced.
The overall framework of the scheme is shown in Fig. 2; it mainly includes three steps: feature learning, policy learning, and auxiliary task learning, with the main innovation lying in feature learning and its use. In Fig. 2, $s_t$ is the high-dimensional perceptual input of the network; $\phi_s$ is the extracted state feature; $\psi$ is the extracted successor feature; r is the immediate reward; v(s) is the value estimate of the state.
As shown in Fig. 2, after the state features and successor features are learned, the two features are combined and policy learning is carried out. Any policy learning method may be selected, such as linear regression or LSTM-based policy learning. Policy learning and feature learning are based on the same input and proceed simultaneously, hence the name joint learning of features and policies based on state features and successor features. Feature learning extracts features automatically through end-to-end network learning (state features: state to immediate reward; successor features: state to value function), and the shared prediction weight w guarantees the validity of feature learning and realizes the joint training of the two kinds of features.
For training, the immediate reward is an extrinsic reward; it may be sparse and hard to learn. The state feature $\phi_s$ can be fed into the auxiliary-task network, which outputs $\hat s$; the gap between the output $\hat s$ and the actual input state $s_t$ then provides the Agent with an internal reward, which greatly improves the learning efficiency and generality of the features.
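As a sketch, the reconstruction gap can be turned into such an internal signal, e.g. a negative mean squared error; the exact shaping is not specified by the text, so this form is an assumption:

```python
import numpy as np

# Internal reward from the reconstruction gap (assumed: negative MSE).
def internal_reward(s_true, s_recon):
    return -float(np.mean((s_true - s_recon) ** 2))

s_t = np.array([0.0, 1.0, 1.0, 0.0])     # illustrative input state
s_hat = np.array([0.1, 0.9, 1.0, 0.0])   # illustrative reconstruction
print(round(internal_reward(s_t, s_hat), 6))  # -0.005
```

A perfect reconstruction yields 0, and worse reconstructions yield more negative values, giving the Agent a dense training signal even when the extrinsic reward is sparse.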
The basic training algorithm of the network completes parameter updates via stochastic gradient descent, and the update strategy of the algorithm is algorithm enhancement under the A2C reinforcement learning framework. Both are standard algorithms that are fully and comprehensively specified.
For ease of understanding, an example is described below.
This example uses OpenAI's open-source A2C deep reinforcement learning algorithm as the baseline; the interactive environment platform is OpenAI's open-source Gym, and the computing resources are 1 GPU with 8 CPU threads. During training, different Agents face different game environments, and the network initialization and the algorithm's initialization parameters are kept consistent.
First, the A2C algorithm is built on the interactive interface of the Gym platform, and the sample acquisition and network training processes are set up; this part is a general setup. Once configured, 8 different game environments are instantiated simultaneously; in each environment one Agent interacts with the environment and obtains states, rewards, and other information, and the A2C algorithm obtains data according to the algorithm flow, completes training, and synchronizes the Agents.
The A2C algorithm grabs information from the interface, constructs the corresponding high-dimensional perceptual input $s_t$ according to the algorithm, and extracts the corresponding immediate reward r. Note that the value of a state is itself an estimate: it is used as supervision during training, and it is obtained over the course of training.
A training step size k is set, and a training segment of length k is intercepted for one round of training. For the deep network, a corresponding loss function must be designed to guide training. Combined with the network architecture of the invention, the designed loss function is as follows:

$$Loss=\alpha(\hat r-r_t)^2+\beta\|\hat s_t-s_t\|^2+\lambda_1|W|_p+\lambda_2\,Loss(critic)+\lambda_3\,Loss(actor),$$

i.e., the overall loss function is a weighted sum of the loss functions of the branches, where α, β, λ₁, λ₂, λ₃ are the weighting coefficients of the loss terms, manually set hyperparameters.
The $(\hat r-r_t)^2$ term represents the error between the predicted immediate reward and the actual immediate reward, $\|\hat s_t-s_t\|^2$ represents the error between the reconstructed state and the true input state, and $|W|_p$ is a regularization term, a norm of the network parameters added to prevent the network from overfitting.
Loss(critic) refers to the loss function of the value estimation branch. Defining $v_\pi$ as the true value function under the policy and $\hat v_\pi$ as its estimate, $Loss(critic)=(\hat v_\pi(s_t)-v_\pi(s_t))^2$.
Loss(actor) refers to the loss function of the policy learning branch, whose output is a probability distribution P over the action space. Define

$$Loss(actor)=-\log P(a_t)-c\,H(P),$$

composed of the cross-entropy between the predicted action probability distribution and the actually selected, executed action, and the entropy H(P) of the distribution itself (with weight c). This is the standard loss function of policy learning.
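Putting the branches together, the weighted overall loss can be sketched as follows; the coefficient values, the p = 1 norm choice, and the entropy weight are illustrative assumptions, not the patent's settings:

```python
import numpy as np

alpha, beta, lam1, lam2, lam3 = 1.0, 0.1, 1e-4, 0.5, 1.0  # illustrative weights

def total_loss(r_hat, r_t, s_hat, s_t, W, v_hat, v_t, logp_a, entropy):
    loss_r      = (r_hat - r_t) ** 2                 # immediate-reward branch
    loss_recon  = float(np.sum((s_hat - s_t) ** 2))  # state-reconstruction branch
    reg         = float(np.sum(np.abs(W)))           # |W|_p with p = 1 (assumed)
    loss_critic = (v_hat - v_t) ** 2                 # value-estimation branch
    loss_actor  = -logp_a - 0.01 * entropy           # cross-entropy + entropy term
    return (alpha * loss_r + beta * loss_recon + lam1 * reg
            + lam2 * loss_critic + lam3 * loss_actor)

val = total_loss(r_hat=1.2, r_t=1.0, s_hat=np.zeros(2), s_t=np.zeros(2),
                 W=np.array([0.5, -0.5]), v_hat=2.0, v_t=2.5,
                 logp_a=float(np.log(0.25)), entropy=1.0)
print(round(val, 6))  # 1.541394
```

All branches contribute one scalar each, so a single backward pass through the weighted sum trains the shared trunk and every head jointly.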
First, in this training segment of length k, we have the input at each moment, the action taken, and the immediate reward obtained; what is missing is the value of each state. A predicted value is constructed for each of the k steps from the corresponding k-step return, specifically:

$$v_\pi(s_t)=r(s_t)+\gamma\, v_\pi(s_{t+1}).$$

For prediction, the value of the final step is predicted first, and the values are then updated backward step by step according to this formula, yielding the predicted value of each step. Note that under a fixed policy π, the state at the next moment is determined at any given moment.
Specifically,
$$v_\pi(s_k)=r(s_k)+\gamma\, v_\pi(s_{k+1}),$$
$$v_\pi(s_{k-1})=r(s_{k-1})+\gamma\, v_\pi(s_k),$$
$$v_\pi(s_{k-2})=r(s_{k-2})+\gamma\, v_\pi(s_{k-1}).$$
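The backward construction above can be sketched directly; the function name and the illustrative inputs are assumptions:

```python
# Sketch of the backward k-step value construction.
def k_step_values(rewards, v_boot, gamma=0.99):
    """rewards: r(s_1)..r(s_k); v_boot: the predicted v_pi(s_{k+1})."""
    values = [0.0] * len(rewards)
    running = v_boot
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # v(s_t) = r(s_t) + gamma * v(s_{t+1})
        values[t] = running
    return values

print(k_step_values([1.0, 0.0, 2.0], v_boot=4.0, gamma=0.5))  # [2.0, 2.0, 4.0]
```

Only the final-step value is a network prediction; every earlier value is filled in by the recursion, which is what makes a length-k segment sufficient for one round of training.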
For the auxiliary state reconstruction task, the high-dimensional perceptual input $s_t$ is used directly as the supervision signal of that branch, and training can proceed. Note that the information grabbed from the interface serves as input after preprocessing.
So far, we have the input, the immediate reward, the predicted value, the reconstructed state, and other information; the network can then be deployed and trained according to the training rules of deep networks and the training process of A2C. The network training referred to here is the standard deep network training and reinforcement learning training process; what the embodiment of the present invention seeks to protect is the designed network architecture, including the parts "state feature learning, successor feature learning, and joint policy learning".
To ensure that the network structure does not degenerate during training in environments where the immediate reward is relatively sparse, the learning gradient of the successor features is not back-propagated into the state features; this guarantees that the successor features are learned purely on top of the state features rather than learned directly from the input as just another feature. The auxiliary task also assists the learning of the state features.
The obtained state features and successor features are at different temporal resolutions and can be combined in two ways: one simply concatenates them as two kinds of features; the other treats the successor feature as a goal direction and modulates or weights the state feature with it, with policy learning performed after fusion.
Policy learning can take diverse forms; this example tests policy learning based on linear regression and policy learning based on an LSTM. The combined feature is fed into the chosen policy learning network, which directly outputs the action. After the Agent executes the corresponding action, it obtains the new state and related information, and a new round of training begins. Each network training pass grabs one segment of information and trains once; after new information is obtained, it trains again. This information consists of the input state, the executed action, the new state, the obtained reward, and so on, and is indispensable for training the framework. State feature learning, successor feature learning, and joint policy learning proceed continuously until a good policy (one achieving higher scores and better performance in the actual task) is finally obtained.
On the other hand, comparative tests were carried out to illustrate the performance of the above scheme. As shown in Fig. 3(a)-3(b), A2C and A2C/LSTM are networks realized with the baseline A2C algorithm, showing training performance in the game Breakout. The horizontal axis is the number of training samples, and the vertical axis is the score obtained per round. In Fig. 3(a)-3(b), /LR and /LSTM denote policy learning based on linear regression and policy learning based on LSTM, respectively. A2C is the baseline algorithm, and DSSF is the algorithm framework proposed by the present invention. As mentioned above, two feature-combination methods were attempted: simple concatenation and weighted fusion (marked with /F).
As can clearly be seen from Fig. 3(a)-3(b), the improved algorithms (Agent algorithms applying the framework of the present invention) converge faster than the baseline algorithm and require less training time before performance starts to improve. Finally, the algorithms reach a higher stability.
From the description of the above embodiments, those skilled in the art can clearly understand that the above embodiments can be realized by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, removable hard disk, etc.) and includes instructions to make a computer device (a personal computer, server, network device, etc.) execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A joint learning method of features and policies based on state features and successor features, characterized by comprising:
obtaining state features that characterize the input state by learning the mapping from input states to immediate rewards;
obtaining successor features by learning the mapping from state features to the value function; and
the obtained state features and successor features being at different temporal resolutions, fusing the state features with the successor features, and then learning the fusion result with a policy learning network of a selectable form.
2. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that:
the state features are learned by a convolutional neural network comprising convolutional layers and a final fully connected layer, wherein the convolutional layers produce the state features of the input state and the parameter w of the fully connected layer is the prediction weight; the prediction weight w is kept identical to the prediction weight w from the successor features to the value function, which is realized directly by weight sharing during training.
3. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that the state features and successor features are combined in two ways, both feature-combination modes are tested and their performance compared, and the better-performing method is selected to combine the state features and successor features: one way splices them as two kinds of features; the other treats the successor feature as a goal direction and modulates or weights the state feature with it.
4. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that learning the fusion result with a policy learning network of a selectable form comprises: policy learning based on linear regression, or policy learning based on an LSTM.
5. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that an auxiliary task, a state reconstruction task, is added while the state features are learned; the state reconstruction task and the immediate reward learning task share a bottom convolutional network, so training either task updates the parameters of the shared convolutional network and thus affects the learning of the state features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810601576.1A CN108898221B (en) | 2018-06-12 | 2018-06-12 | Joint learning method of characteristics and strategies based on state characteristics and subsequent characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108898221A true CN108898221A (en) | 2018-11-27 |
CN108898221B CN108898221B (en) | 2021-12-14 |
Family
ID=64344815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810601576.1A Active CN108898221B (en) | 2018-06-12 | 2018-06-12 | Joint learning method of characteristics and strategies based on state characteristics and subsequent characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108898221B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059646A (en) * | 2019-04-23 | 2019-07-26 | 暗物智能科技(广州)有限公司 | The method and Target Searching Method of training action plan model |
CN111585811A (en) * | 2020-05-06 | 2020-08-25 | 郑州大学 | Virtual optical network mapping method based on multi-agent deep reinforcement learning |
CN113763723A (en) * | 2021-09-06 | 2021-12-07 | 武汉理工大学 | Traffic signal lamp control system and method based on reinforcement learning and dynamic timing |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150138332A1 (en) * | 2004-09-17 | 2015-05-21 | Proximex Corporation | Adaptive multi-modal integrated biometric identification and surveillance systems |
CN104881678A (en) * | 2015-05-11 | 2015-09-02 | 中国科学技术大学 | Multitask learning method of model and characteristic united learning |
CN105809671A (en) * | 2016-03-02 | 2016-07-27 | 无锡北邮感知技术产业研究院有限公司 | Combined learning method for foreground region marking and depth order inferring |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
CN107045655A (en) * | 2016-12-07 | 2017-08-15 | 三峡大学 | Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan |
-
2018
- 2018-06-12 CN CN201810601576.1A patent/CN108898221B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150138332A1 (en) * | 2004-09-17 | 2015-05-21 | Proximex Corporation | Adaptive multi-modal integrated biometric identification and surveillance systems |
CN104881678A (en) * | 2015-05-11 | 2015-09-02 | 中国科学技术大学 | Multitask learning method of model and characteristic united learning |
CN105809671A (en) * | 2016-03-02 | 2016-07-27 | 无锡北邮感知技术产业研究院有限公司 | Combined learning method for foreground region marking and depth order inferring |
CN107045655A (en) * | 2016-12-07 | 2017-08-15 | 三峡大学 | Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
Non-Patent Citations (2)
Title |
---|
CHAO YU等: "Multiagent Learning of Coordination in Loosely Coupled Multiagent Systems", 《IEEE》 * |
李晓萌等: "基于独立学习的多智能体协作决策", 《控制与决策》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059646A (en) * | 2019-04-23 | 2019-07-26 | 暗物智能科技(广州)有限公司 | The method and Target Searching Method of training action plan model |
CN110059646B (en) * | 2019-04-23 | 2021-02-09 | 暗物智能科技(广州)有限公司 | Method for training action planning model and target searching method |
CN111585811A (en) * | 2020-05-06 | 2020-08-25 | 郑州大学 | Virtual optical network mapping method based on multi-agent deep reinforcement learning |
CN111585811B (en) * | 2020-05-06 | 2022-09-02 | 郑州大学 | Virtual optical network mapping method based on multi-agent deep reinforcement learning |
CN113763723A (en) * | 2021-09-06 | 2021-12-07 | 武汉理工大学 | Traffic signal lamp control system and method based on reinforcement learning and dynamic timing |
CN113763723B (en) * | 2021-09-06 | 2023-01-17 | 武汉理工大学 | Traffic signal lamp control system and method based on reinforcement learning and dynamic timing |
Also Published As
Publication number | Publication date |
---|---|
CN108898221B (en) | 2021-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Harvey et al. | Flexible diffusion modeling of long videos | |
Villegas et al. | Hierarchical long-term video prediction without supervision | |
Cheng et al. | Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion | |
CN109891897B (en) | Method for analyzing media content | |
CN108898221A (en) | The combination learning method of feature and strategy based on state feature and subsequent feature | |
CN110383298A (en) | Data efficient intensified learning for continuous control task | |
CN112052948B (en) | Network model compression method and device, storage medium and electronic equipment | |
CN111461325B (en) | Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem | |
CN113627596A (en) | Multi-agent confrontation method and system based on dynamic graph neural network | |
CN108809713A (en) | Monte Carlo tree searching method based on optimal resource allocation algorithm | |
CN108805611A (en) | Advertisement screening technique and device | |
CN114820871B (en) | Font generation method, model training method, device, equipment and medium | |
CN114139637B (en) | Multi-agent information fusion method and device, electronic equipment and readable storage medium | |
CN107016212A (en) | Intention analysis method based on dynamic Bayesian network | |
CN111282272B (en) | Information processing method, computer readable medium and electronic device | |
Wu et al. | Exploring the Task Cooperation in Multi-goal Visual Navigation. | |
Rao et al. | Distributed deep reinforcement learning using tensorflow | |
CN114626499A (en) | Embedded multi-agent reinforcement learning method using sparse attention to assist decision making | |
Tong et al. | Enhancing rolling horizon evolution with policy and value networks | |
Yanpeng | Hybrid kernel extreme learning machine for evaluation of athletes' competitive ability based on particle swarm optimization | |
Shen et al. | Efficient deep structure learning for resource-limited IoT devices | |
Tang et al. | A deep map transfer learning method for face recognition in an unrestricted smart city environment | |
CN115222773A (en) | Single-point motion learning method and device | |
Kalaivani et al. | Evolutionary game theory to predict the population growth in Few districts of Tamil Nadu and Kerala | |
Hao et al. | Learning representation with Q-irrelevance abstraction for reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||