CN108898221A - Joint learning method of features and policies based on state features and successor features - Google Patents

Joint learning method of features and policies based on state features and successor features

Info

Publication number
CN108898221A
CN108898221A (application CN201810601576.1A)
Authority
CN
China
Prior art keywords
feature
state
learning
subsequent
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810601576.1A
Other languages
Chinese (zh)
Other versions
CN108898221B (en)
Inventor
查正军
李厚强
冯晓云
李斌
王子磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN201810601576.1A priority Critical patent/CN108898221B/en
Publication of CN108898221A publication Critical patent/CN108898221A/en
Application granted granted Critical
Publication of CN108898221B publication Critical patent/CN108898221B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/06 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 - Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The invention discloses a joint learning method of state features and successor features, comprising: learning the mapping from the input state to the immediate reward to obtain state features that characterize the input state; learning the mapping from the state features to the value function to obtain successor features; the obtained state features and successor features are at different temporal resolutions, so the state features and successor features are fused, and a policy learning network of various forms then learns from the fused result. Compared with a traditional Agent network, the present invention uses sample information more efficiently; compared with other algorithms, the learning speed is noticeably faster, and the network also converges faster and obtains better learning performance.

Description

Joint learning method of features and policies based on state features and successor features
Technical field
The present invention relates to the field of deep reinforcement learning, and more particularly to a joint learning method of features and policies based on state features and successor features.
Background technique
Deep reinforcement learning (Deep Reinforcement Learning) is a sequential decision learning method based on deep networks. It merges deep learning and reinforcement learning, realizes end-to-end learning from states to actions, and continuously improves and implements the policy during interaction with the environment. In high-dimensional challenges, effective features are automatically extracted from perceptual information by deep neural networks, and policy learning and direct action output are carried out on this basis; that is, there is no hand-coded process in policy learning. Deep reinforcement learning can effectively solve the perception and decision problem of an intelligent agent (Agent) under high-dimensional challenges; it is a frontier research direction in the field of general artificial intelligence and has broad application prospects.
Deep reinforcement learning is the process in which the Agent obtains samples through continuous interaction with the environment and extracts effective information from these samples to train a deep policy network. Obviously, the core of the algorithm is how to extract features effectively from the samples. However, feature extraction is usually not designed explicitly when the Agent network is constructed, so the quality of the extracted features depends only on how well the network is trained. Since reinforcement learning is a highly dynamic, non-stationary process, massive training data are usually required to train the network. On the one hand, low-quality samples affect the training and convergence of the network; on the other hand, the training process makes poor use of the samples. These problems greatly increase the training cost of the network.
Summary of the invention
The object of the present invention is to provide a joint learning method of features and policies based on state features and successor features, which can improve sample utilization efficiency.
The object of the present invention is achieved through the following technical solutions:
A joint learning method of features and policies based on state features and successor features comprises:
learning the mapping from the input state to the immediate reward to obtain a state feature that characterizes the input state;
learning the mapping from the state feature to the value function to obtain a successor feature;
the obtained state feature and successor feature being at different temporal resolutions, fusing the state feature and the successor feature, and then learning from the fused result with a policy learning network of various forms.
As can be seen from the above technical solution provided by the present invention, compared with a traditional Agent network, the present invention improves the sample utilization efficiency of the algorithm and can use sample information more efficiently for policy learning. Under the same GPU/CPU hardware configuration and compared with the baseline algorithm, the learning speed is noticeably faster, and a usable policy can be learned from fewer samples. With limited samples, the policy network converges faster and obtains better learning performance.
Detailed description of the invention
To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a joint learning method of features and policies based on state features and successor features provided by an embodiment of the present invention;
Fig. 2 is a framework diagram of a joint learning method of features and policies based on state features and successor features provided by an embodiment of the present invention;
Fig. 3 is a schematic comparison of the performance of the scheme of the present invention and an existing scheme, provided by an embodiment of the present invention.
Specific embodiment
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a joint learning method of features and policies based on state features and successor features, aiming to solve the problem of low sample utilization efficiency in feature learning in conventional deep reinforcement learning.
This scheme analyzes the reinforcement learning formulation and designs the policy learning scheme with deep networks. For a reinforcement learning problem, the Agent, while interacting with the environment, receives the state s in real time, produces an action a, and receives the immediate reward r generated by the environment. The goal of policy learning is to learn a robust policy π that maximizes the accumulated discounted reward

v^π(s_0) = E_π[ Σ_{t=0}^{∞} γ^t r_t | s_0 ],

where γ is the discount factor, s_0 is the initial state, and the expectation is the accumulated discounted reward under policy π, computed from the current time t = 0. In particular, the learning objective of Q-learning is

V(s) = max_a Q(s, a) = max_a { r(s, a) + γ max_{a'} Q(δ(s, a), a') } = max_a { r + γ max_{a'} Q(s', a') },

where δ(s, a) = s' denotes the state reached by taking action a in state s. It can be seen that the Q value is bootstrapped, so it can be learned iteratively with function-approximation methods.
This scheme uses a deep network to obtain the immediate reward function r(s) = φ(s)^T w, where φ_s = φ(s) is the state feature extracted from the current state and w is the prediction weight. Substituting the immediate reward function into the calculation formula for v^π(s) and defining the successor feature ψ^π(s) = E_π[ Σ_{t=0}^{∞} γ^t φ(s_t) | s_0 = s ], the analysis yields the value function v^π(s) = ψ^π(s)^T w.
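Written out step by step, the decomposition above proceeds as follows (a reconstruction from the definitions given here, using the convention r(s_t) = φ(s_t)^T w):

```latex
\begin{aligned}
v^{\pi}(s) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, r(s_t)\ \Big|\ s_0 = s\Big]
            = \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, \phi(s_t)^{\top} w\ \Big|\ s_0 = s\Big] \\
           &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\, \phi(s_t)\ \Big|\ s_0 = s\Big]^{\top} w
            = \psi^{\pi}(s)^{\top} w .
\end{aligned}
```

Sharing the same prediction weight w between the reward function and the value function is therefore what makes ψ^π(s) a valid successor feature.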
The embodiment of the present invention mainly carries out the joint learning of features and policies. As shown in Fig. 1, it mainly includes:
Step 1: by learning the mapping from the input state (the input is a state generated by simply combining raw game frames, i.e., a raw-pixel-level, high-dimensional input) to the immediate reward, obtain the state feature that characterizes the input state.
The above process is learned by a convolutional neural network. The input of the network is the high-dimensional input and the output is the immediate reward; what the network learns is the mapping from the high-dimensional input to the immediate reward. The network mainly consists of convolutional layers and a final fully connected layer: the convolutional layers produce the state feature of the high-dimensional state input, and the parameter w of the fully connected layer is the prediction weight. It can be seen that the state feature is extracted automatically by the network and is not engineered by hand.
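As an illustration of such a network, a minimal sketch in PyTorch is given below; the specific layer sizes, the 84x84x4 input shape, and the class names are illustrative assumptions rather than values fixed by this description:

```python
import torch
import torch.nn as nn

class StateFeatureEncoder(nn.Module):
    """Convolutional encoder that maps a raw-pixel state to a state feature phi(s)."""
    def __init__(self, in_channels=4, feature_dim=256):
        super().__init__()
        self.conv = nn.Sequential(                      # assumed Atari-style convolutional stack
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        self.fc = nn.Linear(64 * 7 * 7, feature_dim)    # assumes 84x84 input frames

    def forward(self, state):                           # state: (B, C, 84, 84)
        h = self.conv(state).flatten(start_dim=1)
        return torch.relu(self.fc(h))                   # phi(s), shape (B, feature_dim)

class RewardHead(nn.Module):
    """Final fully connected layer whose weight w predicts the immediate reward r(s) = phi(s)^T w."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.w = nn.Linear(feature_dim, 1, bias=False)  # w is the prediction weight

    def forward(self, phi_s):
        return self.w(phi_s).squeeze(-1)                # predicted immediate reward
```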
In this step, the immediate reward function is represented by the prediction weight w from the state feature to the immediate reward, describing the spatial distribution of reward. On the one hand, the extraction of effective state information is the foundation of the whole network framework; on the other hand, learning the spatial distribution of reward helps to model the environment. Whether effective state features are learned is crucial to policy learning.
In addition, to improve sample utilization efficiency, an auxiliary task, namely a state reconstruction task, is added during the learning of the state feature. State reconstruction corresponds to the fully convolutional structure of the state feature extraction network and is composed of a deconvolution network. The state reconstruction task and the immediate-reward learning task (i.e., the mapping from the input state to the immediate reward, so that one input state feeds two network branches) share the bottom convolutional network; that is, training either task changes the parameters of the bottom convolutional network and thereby affects the learning of the state feature φ_s.
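A corresponding sketch of the auxiliary reconstruction branch, mirroring the assumed encoder above (layer sizes again illustrative):

```python
import torch
import torch.nn as nn

class StateReconstructionHead(nn.Module):
    """Auxiliary branch: deconvolution network that reconstructs the input state from phi(s).
    Mirrors the assumed encoder above; all layer sizes are illustrative."""
    def __init__(self, feature_dim=256, out_channels=4):
        super().__init__()
        self.fc = nn.Linear(feature_dim, 64 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, kernel_size=8, stride=4),
        )

    def forward(self, phi_s):
        h = torch.relu(self.fc(phi_s)).view(-1, 64, 7, 7)
        return self.deconv(h)                            # reconstructed state, same shape as the input
```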
Step 2: by learning the mapping from the state feature to the value function, obtain the successor feature.
In the embodiment of the present invention, the successor feature is determined by the preceding formula ψ^π(s) = E_π[ Σ_{t=0}^{∞} γ^t φ(s_t) | s_0 = s ]; here it is obtained by network learning, i.e., it is extracted automatically by the network and is not engineered by hand.
In the embodiment of the present invention, the prediction weight w from the successor feature to the value function is kept identical to the prediction weight w from the state feature to the immediate reward, which is realized directly as weight sharing during training. On the one hand, this guarantees, at the level of the formula analysis, the validity of the learned successor feature, i.e., r(s) = φ(s)^T w and v^π(s) = ψ^π(s)^T w; on the other hand, it realizes the joint training of the state feature and the successor feature.
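A minimal sketch of the successor-feature branch with the shared prediction weight might look as follows; the small two-layer MLP is an assumption, the essential points being that ψ is predicted from φ, that the same weight w is reused for the value, and that (as required later in this description) no gradient flows from this branch back into the state feature:

```python
import torch
import torch.nn as nn

class SuccessorFeatureHead(nn.Module):
    """Predicts the successor feature psi(s) from phi(s) and evaluates v(s) = psi(s)^T w,
    reusing the *same* weight w as the reward head (weight sharing)."""
    def __init__(self, feature_dim=256):
        super().__init__()
        self.psi = nn.Sequential(                       # assumed small MLP mapping phi -> psi
            nn.Linear(feature_dim, feature_dim), nn.ReLU(),
            nn.Linear(feature_dim, feature_dim),
        )

    def forward(self, phi_s, reward_head):
        psi_s = self.psi(phi_s.detach())                # successor-feature gradient does not reach phi
        value = reward_head.w(psi_s).squeeze(-1)        # shared w: v(s) = psi(s)^T w
        return psi_s, value
```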
Step 3: the obtained state feature and successor feature are at different temporal resolutions; after the state feature and the successor feature are fused, a policy learning network of various forms is used to learn from the fused result.
After the state feature and the successor feature have been learned through Step 1 and Step 2, the method of fusing the two kinds of features is learned. According to the preceding analysis, the two kinds of features are at different temporal resolutions but are potentially related; here the effect of two different fusion methods on policy learning is compared, in order to explore a suitable way of using the features. Correspondingly, feature learning and policy learning proceed synchronously, and the extraction of features and the learning of the policy reinforce each other.
In the embodiment of the present invention, the state feature and the successor feature can be combined in two ways; the two different feature combinations are tested and their performance compared, and the better-performing method is selected to combine the state feature and the successor feature. One way is to simply concatenate them as two classes of features; the other is to regard the successor feature as a goal direction, use the successor feature to modulate or weight the state feature, and then perform policy learning after fusion. The two combination methods are tested here to determine which combination is more effective in actual use, and the same applies to the two policy learning schemes below. In fact, unlike traditional networks, the method proposed by the present invention modularizes feature learning and policy learning, which enables both joint learning and module-by-module performance testing.
The policy learning network can take various forms; for example, it can be policy learning based on linear regression or policy learning based on an LSTM. The fused feature is fed into the specific policy learning network, which directly outputs the action. After the Agent executes the corresponding action, information such as the new state is obtained and a new round of training is carried out.
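A sketch of the fusion and policy heads, covering both combination methods (concatenation, and modulation/weighting by the successor feature) and both policy forms (linear and LSTM-based); the sigmoid gating used for modulation and the layer sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class FusionAndPolicy(nn.Module):
    """Fuses phi(s) and psi(s) (concatenation or modulation) and outputs an action distribution.
    The 'linear' head stands in for the linear-regression policy, the 'lstm' head for the
    LSTM-based policy; all sizes are illustrative assumptions."""
    def __init__(self, feature_dim=256, num_actions=4, fusion="concat", policy="linear"):
        super().__init__()
        self.fusion = fusion
        self.policy = policy
        fused_dim = 2 * feature_dim if fusion == "concat" else feature_dim
        if policy == "lstm":
            self.core = nn.LSTMCell(fused_dim, 256)
            self.pi = nn.Linear(256, num_actions)
        else:
            self.pi = nn.Linear(fused_dim, num_actions)

    def forward(self, phi_s, psi_s, lstm_state=None):
        if self.fusion == "concat":
            fused = torch.cat([phi_s, psi_s], dim=-1)   # simple concatenation of the two features
        else:
            fused = phi_s * torch.sigmoid(psi_s)        # successor feature modulates/weights phi
        if self.policy == "lstm":
            h, c = self.core(fused, lstm_state)
            return torch.softmax(self.pi(h), dim=-1), (h, c)
        return torch.softmax(self.pi(fused), dim=-1), None
```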
In the above scheme of the embodiment of the present invention, the state feature and the successor feature are tied to the learning of the Agent and are designed for the reward function and the value function. On the one hand, this guarantees that there is no traditional manual design in the feature extraction process and end-to-end learning is achieved; on the other hand, feature extraction is more targeted, which improves the utilization efficiency of the samples. The samples can thus be learned from quickly and accurately, effective policies are produced, and the Agent is encouraged to explore the space, so that the learning efficiency is further enhanced.
The overall framework of the above scheme of the embodiment of the present invention is shown in Fig. 2; it mainly includes three steps, feature learning, policy learning and auxiliary-task learning, and the main innovation lies in feature learning and its use. In Fig. 2, s_t is the high-dimensional perceptual input of the network; φ_s is the extracted state feature; ψ_s is the extracted successor feature; r is the immediate reward; v(s) is the value assessment of the state.
As shown in Fig. 2, after the state feature and the successor feature have been learned, the two features are combined and policy learning is performed. Any policy learning method can be selected, such as linear regression or LSTM-based policy learning. Policy learning and feature learning are carried out simultaneously on the same input, hence the name joint learning of features and policies based on state features and successor features. Feature learning extracts the features automatically through end-to-end network learning (state feature: from state to immediate reward; successor feature: from state to value function), and the shared prediction weight w guarantees the validity of feature learning and realizes the joint training of the two kinds of features.
For training, the immediate reward is an external reward, which may be sparse and hard to learn from. The state feature φ_s can be fed into the auxiliary-task network, which outputs a reconstruction ŝ; the gap between the output ŝ and the actual input state s_t then provides the Agent with an internal reward, which greatly improves the learning efficiency and generality of the features.
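Putting these pieces together, one forward pass through the framework of Fig. 2 could be wired as follows (a sketch reusing the illustrative modules above; the class name DSSFNetwork and the default sizes are assumptions):

```python
import torch
import torch.nn as nn

class DSSFNetwork(nn.Module):
    """Overall framework: shared encoder, reward head, reconstruction head,
    successor-feature head with shared w, and fusion + policy head."""
    def __init__(self, feature_dim=256, num_actions=4):
        super().__init__()
        self.encoder = StateFeatureEncoder(feature_dim=feature_dim)
        self.reward_head = RewardHead(feature_dim)
        self.recon_head = StateReconstructionHead(feature_dim)
        self.sf_head = SuccessorFeatureHead(feature_dim)
        self.policy_head = FusionAndPolicy(feature_dim, num_actions)

    def forward(self, s_t, lstm_state=None):
        phi_s = self.encoder(s_t)                              # state feature
        r_hat = self.reward_head(phi_s)                        # predicted immediate reward
        s_hat = self.recon_head(phi_s)                         # auxiliary reconstruction
        psi_s, v_hat = self.sf_head(phi_s, self.reward_head)   # successor feature and value estimate
        action_probs, lstm_state = self.policy_head(phi_s, psi_s, lstm_state)
        return action_probs, v_hat, r_hat, s_hat, lstm_state
```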
The basic training algorithm of the network updates the network parameters by stochastic gradient descent, and the update strategy of the algorithm is an algorithmic enhancement based on the A2C reinforcement learning framework. Both of these are standard algorithms with complete and comprehensive specifications.
For ease of understanding, an example is described below.
In this example, the baseline algorithm is the open-source A2C deep reinforcement learning algorithm from OpenAI, the interactive environment platform is the open-source OpenAI Gym, and the computing resource used is 1 GPU together with 8 CPU threads. During training, different game environments correspond to different Agents, and the network initialization and the initialization parameters of the algorithm are kept identical.
First, the A2C algorithm is built on the interactive interface of the Gym platform, and the sample collection and network training processes are set up; this part belongs to the general setup. Once configured, 8 different game environments can be instantiated simultaneously, each with one Agent interacting with the environment to obtain information such as states and rewards; the A2C algorithm obtains the data according to the algorithm flow, completes training, and synchronizes the Agents.
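A minimal sketch of instantiating several Gym environments side by side is shown below; the environment id, the simple list-based loop, and the older reset/step API are illustrative assumptions, whereas practical A2C implementations typically run the environments in subprocesses:

```python
import gym

NUM_ENVS = 8  # one Agent per environment, as in the example above

# Illustrative: 8 independent copies of the same game environment (older Gym step/reset API assumed)
envs = [gym.make("BreakoutNoFrameskip-v4") for _ in range(NUM_ENVS)]
states = [env.reset() for env in envs]

# One interaction step per environment; a real A2C setup would batch these observations,
# feed them through the shared network, and collect length-k training segments.
for i, env in enumerate(envs):
    action = env.action_space.sample()                # placeholder for the policy's action
    next_state, reward, done, info = env.step(action)
    states[i] = env.reset() if done else next_state
```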
The A2C algorithm grabs information from the interface, constructs the corresponding high-dimensional perceptual input s_t according to the algorithm, and extracts the corresponding immediate reward r. Note that the value of a state is itself an estimated quantity; it is used as a supervision signal during training, and the value targets are obtained in the training process.
A corresponding training step size k is set, and a segment of length k is intercepted for one training step. For the deep network, a corresponding loss function needs to be designed to guide the training of the network. In conjunction with the network architecture of the present invention, the designed loss function is as follows:
the overall loss function is obtained as a weighted sum of the loss functions of the individual branches, namely the immediate-reward prediction loss, the state reconstruction loss, the regularization term, the critic loss Loss(critic) and the actor loss Loss(actor), where α, β, λ1, λ2, λ3 are the weighting coefficients of these loss terms and are manually set hyperparameters.
Here, the (r̂_t − r_t)² term represents the error between the predicted immediate reward and the actual immediate reward, the ‖ŝ_t − s_t‖² term represents the error between the reconstructed state and the true input state, and |W|_p is a regularization term, a norm of the network parameters added to prevent the network from overfitting.
Loss(critic) refers to the loss function of the value-estimation branch. Defining v^π as the true value function under the policy and v̂ as its estimate, Loss(critic) = (v^π(s_t) − v̂(s_t))².
Loss(actor) refers to the loss function of the policy learning branch, whose output is the probability distribution P over the action space. Loss(actor) is composed of the cross-entropy between the predicted action probability distribution and the actually selected and executed action, together with the entropy of the distribution itself. This is the standard loss function of policy learning.
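A sketch of assembling the weighted total loss is given below; the pairing of each coefficient with a particular term, the use of mean-squared errors, the p = 1 norm, the default coefficient values, and the advantage-free actor term are illustrative assumptions, the description above only fixing that the total loss is a weighted sum of the branch losses (the value targets come from the k-step returns constructed below):

```python
import torch
import torch.nn.functional as F

def total_loss(pred_reward, true_reward,          # immediate-reward branch
               recon_state, true_state,           # auxiliary reconstruction branch
               pred_value, target_value,          # critic branch (k-step return targets)
               action_logits, taken_actions,      # actor branch
               params,                            # network parameters for the regularizer
               alpha=1.0, beta=0.5, lam1=1.0, lam2=1.0, lam3=1e-4, entropy_coef=0.01):
    reward_loss = F.mse_loss(pred_reward, true_reward)            # (r_hat - r_t)^2
    recon_loss = F.mse_loss(recon_state, true_state)              # (s_hat - s_t)^2
    critic_loss = F.mse_loss(pred_value, target_value)            # (v_pi - v_hat)^2
    log_probs = F.log_softmax(action_logits, dim=-1)
    probs = log_probs.exp()
    actor_loss = F.nll_loss(log_probs, taken_actions)             # cross-entropy with executed action
    entropy = -(probs * log_probs).sum(dim=-1).mean()             # entropy of the action distribution
    reg = sum(p.abs().sum() for p in params)                      # |W|_p with p = 1 assumed
    return (alpha * actor_loss - entropy_coef * entropy
            + beta * critic_loss
            + lam1 * reward_loss + lam2 * recon_loss + lam3 * reg)
```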
First, within this training segment of length k, the input at each moment, the action taken, and the immediate reward obtained are available; what is missing is the value of each state. A predicted value is constructed for each of the k steps according to the k-step return, specifically:
v^π(s_t) = r(s_t) + γ·v^π(s_{t+1})
For prediction, the value of the final step is predicted first, and the values are then propagated step by step backward in time according to this formula, obtaining the predicted value of each step. Note that under a fixed policy π, the state at the next moment is determined.
Specifically,
v^π(s_k) = r(s_k) + γ·v^π(s_{k+1})
v^π(s_{k−1}) = r(s_{k−1}) + γ·v^π(s_k)
v^π(s_{k−2}) = r(s_{k−2}) + γ·v^π(s_{k−1})
and so on down to the first step of the segment.
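A sketch of this backward recursion over a length-k segment (the value of the state after the last step is assumed to be bootstrapped from the critic):

```python
def k_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Backward recursion v(s_t) = r(s_t) + gamma * v(s_{t+1}) over a segment of length k.
    rewards: list of immediate rewards [r(s_1), ..., r(s_k)]
    bootstrap_value: critic's estimate of v(s_{k+1}) for the state after the segment."""
    values = [0.0] * len(rewards)
    next_value = bootstrap_value
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * next_value
        next_value = values[t]
    return values   # predicted value target for each of the k steps
```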
For the auxiliary state reconstruction task, the high-dimensional perceptual input s_t is used directly as the supervision signal of that branch, so it can be trained. Note that the information grabbed from the interface is used as the input after preprocessing.
At this point the inputs, the immediate rewards, the predicted values, the reconstructed states and other information are all available, and the network can be trained according to the training rules of deep networks and the training procedure of A2C. The network training referred to here is the standard deep-network and reinforcement-learning training procedure; what the embodiment of the present invention protects is the designed network architecture, including the parts "state feature learning, successor feature learning and joint policy learning".
To guarantee that the network structure does not degenerate during training in environments where the immediate reward is rather sparse, the learning gradient of the successor feature is not propagated back into the state feature; this guarantees that the successor feature is obtained solely from the state feature rather than learned directly from the input as an independent feature. The auxiliary task also assists the learning of the state feature.
The obtained state feature and successor feature are at different temporal resolutions and can be combined in two ways: one is to simply concatenate them as two classes of features; the other is to regard the successor feature as a goal direction, use it to modulate or weight the state feature, and then perform policy learning after fusion.
Policy learning can take various forms; this example tests policy learning based on linear regression and policy learning based on an LSTM. The combined feature is fed into the specific policy learning network, which directly outputs the action. After the Agent executes the corresponding action, information such as the new state is obtained and a new round of training is carried out. Each round of network training grabs one segment of information and performs one training step; after new information is obtained, another training step is performed. This information consists of the input state, the executed action, the new state obtained, the reward and so on, and is indispensable for training the framework. The processes of state feature learning, successor feature learning and joint policy learning continue until a good policy (one that achieves a higher score and better performance in the actual task) is finally obtained.
On the other hand, a comparative experiment was carried out to illustrate the performance of the above scheme of the embodiment of the present invention. As shown in Fig. 3(a)~Fig. 3(b), A2C and A2C/LSTM are networks realized with the baseline A2C algorithm, trained on the Breakout game. The horizontal axis is the number of training samples and the vertical axis is the score obtained per episode. In Fig. 3(a)~Fig. 3(b), /LR and /LSTM indicate that the policy learning is based on linear regression and on an LSTM, respectively. A2C is the baseline algorithm and DSSF is the algorithm framework proposed by the present invention. As mentioned above, the present invention has tried two methods of combining the features, simple concatenation and weighted fusion (marked with /F).
It can clearly be seen from Fig. 3(a)~Fig. 3(b) that the improved algorithms (Agent algorithms applying the framework designed by the present invention) converge faster than the baseline algorithms, and the training time needed before performance starts to improve is also shorter. Finally, the algorithms reach a higher and more stable level.
Through the description of the above embodiments, those skilled in the art can clearly understand that the above embodiments can be implemented by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solutions of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above are only preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution that can be easily thought of by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (5)

1. A joint learning method of features and policies based on state features and successor features, characterized by comprising:
learning the mapping from the input state to the immediate reward to obtain a state feature that characterizes the input state;
learning the mapping from the state feature to the value function to obtain a successor feature;
the obtained state feature and successor feature being at different temporal resolutions, fusing the state feature and the successor feature, and then learning from the fused result with a policy learning network of various forms.
2. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that:
the state feature is learned by a convolutional neural network comprising convolutional layers and a final fully connected layer, wherein the convolutional layers obtain the state feature of the input state and the parameter w of the fully connected layer is the prediction weight; the prediction weight w is kept identical to the prediction weight w from the successor feature to the value function, which is directly realized as weight sharing during training.
3. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that the state feature and the successor feature are combined in two ways, the two different feature combination methods are tested and their performance is compared, and the better-performing method is selected to combine the state feature and the successor feature: one is to concatenate them as two classes of features; the other is to regard the successor feature as a goal direction and use the successor feature to modulate or weight the state feature.
4. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that learning from the fused result with a policy learning network of various forms comprises: policy learning based on linear regression or policy learning based on an LSTM.
5. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that an auxiliary task, namely a state reconstruction task, is added during the learning of the state feature; the state reconstruction task and the immediate-reward learning task share a bottom convolutional network, i.e., training either task changes the parameters of the bottom convolutional network and thereby affects the learning of the state feature.
CN201810601576.1A 2018-06-12 2018-06-12 Joint learning method of features and policies based on state features and successor features Active CN108898221B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810601576.1A CN108898221B (en) 2018-06-12 2018-06-12 Joint learning method of features and policies based on state features and successor features

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810601576.1A CN108898221B (en) 2018-06-12 2018-06-12 Joint learning method of features and policies based on state features and successor features

Publications (2)

Publication Number Publication Date
CN108898221A true CN108898221A (en) 2018-11-27
CN108898221B CN108898221B (en) 2021-12-14

Family

ID=64344815

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810601576.1A Active CN108898221B (en) 2018-06-12 2018-06-12 Joint learning method of features and policies based on state features and successor features

Country Status (1)

Country Link
CN (1) CN108898221B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059646A (en) * 2019-04-23 2019-07-26 暗物智能科技(广州)有限公司 Method for training action planning model and target searching method
CN111585811A (en) * 2020-05-06 2020-08-25 郑州大学 Virtual optical network mapping method based on multi-agent deep reinforcement learning
CN113763723A (en) * 2021-09-06 2021-12-07 武汉理工大学 Traffic signal lamp control system and method based on reinforcement learning and dynamic timing

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150138332A1 (en) * 2004-09-17 2015-05-21 Proximex Corporation Adaptive multi-modal integrated biometric identification and surveillance systems
CN104881678A (en) * 2015-05-11 2015-09-02 中国科学技术大学 Multitask learning method of model and characteristic united learning
CN105809671A (en) * 2016-03-02 2016-07-27 无锡北邮感知技术产业研究院有限公司 Combined learning method for foreground region marking and depth order inferring
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150138332A1 (en) * 2004-09-17 2015-05-21 Proximex Corporation Adaptive multi-modal integrated biometric identification and surveillance systems
CN104881678A (en) * 2015-05-11 2015-09-02 中国科学技术大学 Multitask learning method of model and characteristic united learning
CN105809671A (en) * 2016-03-02 2016-07-27 无锡北邮感知技术产业研究院有限公司 Combined learning method for foreground region marking and depth order inferring
CN107045655A (en) * 2016-12-07 2017-08-15 三峡大学 Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan
CN106899026A (en) * 2017-03-24 2017-06-27 三峡大学 Intelligent power generation control method based on the multiple agent intensified learning with time warp thought

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHAO YU et al.: "Multiagent Learning of Coordination in Loosely Coupled Multiagent Systems", IEEE *
LI Xiaomeng et al.: "Multi-agent cooperative decision-making based on independent learning", Control and Decision *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059646A (en) * 2019-04-23 2019-07-26 暗物智能科技(广州)有限公司 Method for training action planning model and target searching method
CN110059646B (en) * 2019-04-23 2021-02-09 暗物智能科技(广州)有限公司 Method for training action planning model and target searching method
CN111585811A (en) * 2020-05-06 2020-08-25 郑州大学 Virtual optical network mapping method based on multi-agent deep reinforcement learning
CN111585811B (en) * 2020-05-06 2022-09-02 郑州大学 Virtual optical network mapping method based on multi-agent deep reinforcement learning
CN113763723A (en) * 2021-09-06 2021-12-07 武汉理工大学 Traffic signal lamp control system and method based on reinforcement learning and dynamic timing
CN113763723B (en) * 2021-09-06 2023-01-17 武汉理工大学 Traffic signal lamp control system and method based on reinforcement learning and dynamic timing

Also Published As

Publication number Publication date
CN108898221B (en) 2021-12-14

Similar Documents

Publication Publication Date Title
Harvey et al. Flexible diffusion modeling of long videos
Villegas et al. Hierarchical long-term video prediction without supervision
Cheng et al. Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion
CN109891897B (en) Method for analyzing media content
CN108898221A (en) The combination learning method of feature and strategy based on state feature and subsequent feature
CN110383298A (en) Data efficient intensified learning for continuous control task
CN112052948B (en) Network model compression method and device, storage medium and electronic equipment
CN111461325B (en) Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem
CN113627596A (en) Multi-agent confrontation method and system based on dynamic graph neural network
CN108809713A (en) Monte Carlo tree searching method based on optimal resource allocation algorithm
CN108805611A (en) Advertisement screening technique and device
CN114820871B (en) Font generation method, model training method, device, equipment and medium
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN107016212A (en) Intention analysis method based on dynamic Bayesian network
CN111282272B (en) Information processing method, computer readable medium and electronic device
Wu et al. Exploring the Task Cooperation in Multi-goal Visual Navigation.
Rao et al. Distributed deep reinforcement learning using tensorflow
CN114626499A (en) Embedded multi-agent reinforcement learning method using sparse attention to assist decision making
Tong et al. Enhancing rolling horizon evolution with policy and value networks
Yanpeng Hybrid kernel extreme learning machine for evaluation of athletes' competitive ability based on particle swarm optimization
Shen et al. Efficient deep structure learning for resource-limited IoT devices
Tang et al. A deep map transfer learning method for face recognition in an unrestricted smart city environment
CN115222773A (en) Single-point motion learning method and device
Kalaivani et al. Evolutionary game theory to predict the population growth in Few districts of Tamil Nadu and Kerala
Hao et al. Learning representation with Q-irrelevance abstraction for reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant