CN108898221A - Joint learning method of features and policies based on state features and successor features - Google Patents
- Publication number
- CN108898221A (application CN201810601576.1A)
- Authority
- CN
- China
- Prior art keywords
- feature
- state
- learning
- subsequent
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Abstract
The invention discloses a joint policy learning method based on state features and successor features, including: obtaining state features that characterize the input state by learning the mapping from input states to immediate rewards; obtaining successor features by learning the mapping from state features to the value function; and, since the obtained state features and successor features are at different temporal resolutions, fusing the state features with the successor features and then learning the fusion result with a policy learning network of a selectable form. Compared with a conventional Agent network, the invention uses sample information more efficiently; compared with other algorithms, learning is markedly faster, and the network also converges sooner and achieves better learning performance.
Description
Technical field
The present invention relates to the field of deep reinforcement learning, and in particular to a joint learning method of features and policies based on state features and successor features.
Background technique
Deep Reinforcement Learning is a sequential decision-making method based on deep networks. It merges deep learning and reinforcement learning, realizes end-to-end learning from states to actions, and continuously strengthens and refines the policy while interacting with the environment. In high-dimensional, challenging problems, a deep neural network automatically extracts effective features from perceptual information and performs policy learning and direct action output on that basis, i.e., there is no hard-coded step in policy learning. Deep reinforcement learning can effectively solve the perception and decision problems of an intelligent agent (Agent) in high-dimensional, challenging tasks; it is a frontier research direction of general artificial intelligence and has broad application prospects.
In deep reinforcement learning, the Agent obtains samples through continuous interaction with the environment and extracts effective information from those samples to train a deep policy network. Obviously, the core of such an algorithm is how to extract features effectively from the samples. However, feature extraction is usually not specially designed when the Agent network is constructed; the extraction quality depends solely on how well the network trains. Since reinforcement learning is a highly dynamic, non-stationary process, massive amounts of training data are usually needed for network training. On the one hand, low-quality samples impair training and convergence; on the other hand, the training process uses samples inefficiently. These problems greatly increase the training cost of the network.
Summary of the invention
The object of the present invention is to provide a joint learning method of features and policies based on state features and successor features that can improve sample utilization efficiency.
The object of the present invention is achieved through the following technical solutions:
A joint learning method of features and policies based on state features and successor features, including:
obtaining state features that characterize the input state by learning the mapping from input states to immediate rewards;
obtaining successor features by learning the mapping from state features to the value function;
the obtained state features and successor features being at different temporal resolutions, fusing the state features with the successor features, and then learning the fusion result with a policy learning network of a selectable form.
As can be seen from the technical solution above, compared with a conventional Agent network, the present invention improves the sample utilization efficiency of the algorithm and can use sample information more efficiently for policy learning. Under the same GPU/CPU hardware configuration, learning is markedly faster than the baseline algorithm, and an effective policy can be learned from fewer samples. With limited samples, the policy network converges faster and achieves better learning performance.
Detailed description of the invention
In order to describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. Apparently, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art may obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a joint learning method of features and policies based on state features and successor features provided by an embodiment of the present invention;
Fig. 2 is a framework diagram of a joint learning method of features and policies based on state features and successor features provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram comparing the performance of the scheme of the present invention with that of existing schemes.
Specific embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Apparently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The embodiment of the present invention provides a joint learning method of features and policies based on state features and successor features, aiming to solve the problem of low sample utilization efficiency in feature learning in conventional deep reinforcement learning.
This scheme analyzes the reinforcement learning formulation and designs a policy learning scheme with a deep network. In a reinforcement learning problem, the Agent, while interacting with the environment, receives the state s in real time, produces an action a, and receives the immediate reward r generated by the environment. The goal of policy learning is to learn a robust policy π that maximizes the accumulated discounted reward

$$v_\pi(s_0)=\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty}\gamma^{t} r_t \,\Big|\, s_0\Big],$$

where γ is the discount factor, s₀ is the initial state, and the expected accumulated discounted reward under policy π is computed starting from the current time t = 0. In particular, the learning objective of Q-learning is

$$V(s)=\max_a Q(s,a)=\max_a\{r(s,a)+\gamma\max_{a'}Q(\delta(s,a),a')\}=\max_a\{r+\gamma\max_{a'}Q(s',a')\},$$

where δ(s, a) = s' denotes the state transition. It can be seen that the Q value is bootstrapped, and function approximation can be used to learn it iteratively.
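As a concrete illustration of this bootstrapped target, the backup $Q(s,a)=r+\gamma\max_{a'}Q(s',a')$ can be iterated on a toy MDP. The following is a minimal sketch on an illustrative deterministic chain (the MDP, state names, and function name are assumptions for illustration, not part of the patented scheme):

```python
# Sketch of the bootstrapped Q target on a toy deterministic chain MDP
# (states 0 -> 1 -> 2, terminal at 2; reward 1.0 on entering the terminal).
def q_iteration(gamma=0.9, sweeps=50):
    Q = {0: {"right": 0.0}, 1: {"right": 0.0}}  # Q[s][a], one action "right"
    nxt = {0: 1, 1: 2}
    rew = {0: 0.0, 1: 1.0}
    for _ in range(sweeps):
        for s in Q:
            s2 = nxt[s]
            # Q(s,a) = r(s,a) + gamma * max_a' Q(s',a')  (0 at the terminal state)
            boot = max(Q[s2].values()) if s2 in Q else 0.0
            Q[s]["right"] = rew[s] + gamma * boot
    return Q

Q = q_iteration()
print(Q[1]["right"], Q[0]["right"])  # 1.0 0.9
```

Each sweep propagates the reward one step backward through the bootstrap term, which is exactly the iterative function-estimation process mentioned above.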
This scheme obtains the immediate reward function $r(s)=\phi^{T}(s)\,w$ with a deep network, where $\phi_s=\phi(s)$ is the state feature extracted from the current state and w is the prediction weight. Substituting this immediate reward function into the formula for $v_\pi(s)$ and defining the successor feature

$$\psi^{\pi}(s)=\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t)\,\Big|\,s_0=s\Big],$$

the analysis yields the value function $v_\pi(s)=\psi^{\pi}(s)^{T}w$.
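The decomposition $v_\pi(s)=\psi^{\pi}(s)^{T}w$ can be checked numerically on a single sampled trajectory; a minimal sketch, where the trajectory features and the weight vector are illustrative assumptions:

```python
import numpy as np

# If r(s) = phi(s)^T w, then the discounted reward sum along a trajectory
# equals psi^T w with psi = sum_t gamma^t phi(s_t).
gamma = 0.9
w = np.array([0.2, -0.1, 0.7])                      # prediction weight w
phis = [np.array([1.0, 0.0, 0.5]),                  # phi(s_0), phi(s_1), ...
        np.array([0.0, 1.0, 0.2]),
        np.array([0.3, 0.3, 0.0])]

psi = sum((gamma ** t) * phi for t, phi in enumerate(phis))          # successor feature
ret = sum((gamma ** t) * float(phi @ w) for t, phi in enumerate(phis))  # discounted rewards

print(abs(float(psi @ w) - ret) < 1e-12)  # True: v = psi^T w by linearity
```

The equality holds by linearity of the discounted sum, which is why sharing the single weight w between the reward branch and the value branch is consistent.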
The embodiment of the present invention mainly performs joint learning of features and policies; as shown in Fig. 1, it mainly includes:
Step 1: learn the mapping from the input state (the input is a state generated by simple combination of raw game frames, i.e., raw-pixel-level input of high dimensionality) to the immediate reward, to obtain state features that characterize the input state.
The above mapping is learned by a convolutional neural network: the input of the network is the high-dimensional state, the output is the immediate reward, and what the network learns is the mapping from the high-dimensional input to the immediate reward. The network consists mainly of convolutional layers and a final fully connected layer; the convolutional layers produce the state features of the high-dimensional state input, and the parameter w of the fully connected layer is the prediction weight. It can be seen that the state features are extracted automatically by the network rather than hand-designed.
In this step, the prediction weight w from state features to immediate rewards represents the spatial distribution of the immediate reward function. On the one hand, extracting the effective information of the state is the foundation of the whole network architecture; on the other hand, learning the spatial reward distribution helps model the environment. Whether effective state features are learned is crucial to policy learning.
In addition, to improve sample utilization efficiency, an auxiliary task, a state reconstruction task, is added while the state features are learned. State reconstruction mirrors the state-feature extraction network as a fully convolutional structure built from deconvolution layers. The state reconstruction task and the immediate reward learning task (i.e., the mapping from input states to immediate rewards; that is, one input state is used by two network branches) share the bottom convolutional network, so training either task updates the parameters of the shared convolutional trunk and thus affects the learning of the state feature $\phi_s$.
Step 2: obtain successor features by learning the mapping from state features to the value function.
In the embodiment of the present invention, the successor feature is determined by the preceding formula $\psi^{\pi}(s)=\mathbb{E}_\pi[\sum_{t=0}^{\infty}\gamma^{t}\phi(s_t)\mid s_0=s]$; here it is obtained by network learning and extracted automatically by the network rather than hand-designed.
In the embodiment of the present invention, the prediction weight w from successor features to the value function is kept identical to the prediction weight w from state features to immediate rewards, which is realized directly by weight sharing during training. On the one hand, this guarantees the validity of the successor features obtained at the level of the formula analysis, i.e., $r(s)=\phi^{T}(s)\,w$ and $v_\pi(s)=\psi^{\pi}(s)^{T}w$; on the other hand, it realizes the joint training of state features and successor features.
Step 3: the obtained state features and successor features being at different temporal resolutions, fuse the state features with the successor features and then learn the fusion result with a policy learning network of a selectable form.
After the state features and successor features are learned in Steps 1 and 2, the fusion of the two kinds of features is learned. According to the preceding analysis, the two kinds of features are at different temporal resolutions but intrinsically related; here, the effect of two different fusion methods on policy learning is compared to determine a suitable way of using the features. Accordingly, feature learning and policy learning proceed synchronously, and feature extraction and policy learning reinforce each other.
In the embodiment of the present invention, the state features and successor features can be combined in two ways; both feature-combination modes are tested and their performance compared, and the better-performing method is selected to combine the state features and successor features. One way simply concatenates the two kinds of features; the other treats the successor feature as a goal direction and uses it to modulate or weight the state feature, with policy learning again performed after fusion. The two combination modes are mainly tested here to determine which is more effective in actual use, and the same applies to the two policy learning schemes below. In fact, unlike a traditional network, the proposed method modularizes feature learning and policy learning, so joint learning can be realized while performance can also be tested module by module.
The policy learning network can take diverse forms, for example policy learning based on linear regression or policy learning based on an LSTM. The fused feature is fed into the chosen policy learning network, which directly outputs the action. After the Agent executes the corresponding action, it obtains the new state and related information, and a new round of training is carried out.
In the above scheme of the embodiment, the state features and successor features are tied to the Agent's learning: they are designed for the reward function and the value function. On the one hand, this guarantees that there is no traditional hand design in the feature extraction process, so end-to-end learning is realized; on the other hand, feature extraction is more targeted, which improves sample utilization efficiency. Samples can thus be learned quickly and accurately into an effective policy, the Agent is motivated to explore the space, and learning efficiency is further enhanced.
The overall framework of the scheme is shown in Fig. 2; it mainly includes three steps: feature learning, policy learning, and auxiliary task learning, with the main innovation lying in feature learning and its use. In Fig. 2, $s_t$ is the high-dimensional perceptual input of the network; $\phi_s$ is the extracted state feature; $\psi$ is the extracted successor feature; r is the immediate reward; v(s) is the value estimate of the state.
As shown in Fig. 2, after the state features and successor features are learned, the two features are combined and policy learning is carried out. Any policy learning method may be selected, such as linear regression or LSTM-based policy learning. Policy learning and feature learning are based on the same input and proceed simultaneously, hence the name joint learning of features and policies based on state features and successor features. Feature learning extracts features automatically through end-to-end network learning (state features: state to immediate reward; successor features: state to value function), and the shared prediction weight w guarantees the validity of feature learning and realizes the joint training of the two kinds of features.
For training, the immediate reward is an extrinsic reward; it may be sparse and hard to learn. The state feature $\phi_s$ can be fed into the auxiliary-task network, which outputs $\hat s$; the gap between the output $\hat s$ and the actual input state $s_t$ then provides the Agent with an internal reward, which greatly improves the learning efficiency and generality of the features.
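As a sketch, the reconstruction gap can be turned into such an internal signal, e.g. a negative mean squared error; the exact shaping is not specified by the text, so this form is an assumption:

```python
import numpy as np

# Internal reward from the reconstruction gap (assumed: negative MSE).
def internal_reward(s_true, s_recon):
    return -float(np.mean((s_true - s_recon) ** 2))

s_t = np.array([0.0, 1.0, 1.0, 0.0])     # illustrative input state
s_hat = np.array([0.1, 0.9, 1.0, 0.0])   # illustrative reconstruction
print(round(internal_reward(s_t, s_hat), 6))  # -0.005
```

A perfect reconstruction yields 0, and worse reconstructions yield more negative values, giving the Agent a dense training signal even when the extrinsic reward is sparse.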
The basic training algorithm of the network completes parameter updates via stochastic gradient descent, and the update strategy of the algorithm is algorithm enhancement under the A2C reinforcement learning framework. Both are standard algorithms that are fully and comprehensively specified.
For ease of understanding, an example is described below.
This example uses OpenAI's open-source A2C deep reinforcement learning algorithm as the baseline; the interactive environment platform is OpenAI's open-source Gym, and the computing resources are 1 GPU with 8 CPU threads. During training, different Agents face different game environments, and the network initialization and the algorithm's initialization parameters are kept consistent.
First, the A2C algorithm is built on the interactive interface of the Gym platform, and the sample acquisition and network training processes are set up; this part is a general setup. Once configured, 8 different game environments are instantiated simultaneously; in each environment one Agent interacts with the environment and obtains states, rewards, and other information, and the A2C algorithm obtains data according to the algorithm flow, completes training, and synchronizes the Agents.
The A2C algorithm grabs information from the interface, constructs the corresponding high-dimensional perceptual input $s_t$ according to the algorithm, and extracts the corresponding immediate reward r. Note that the value of a state is itself an estimate: it is used as supervision during training, and it is obtained over the course of training.
A training step size k is set, and a training segment of length k is intercepted for one round of training. For the deep network, a corresponding loss function must be designed to guide training. Combined with the network architecture of the invention, the designed loss function is as follows:

$$Loss=\alpha(\hat r-r_t)^2+\beta\|\hat s_t-s_t\|^2+\lambda_1|W|_p+\lambda_2\,Loss(critic)+\lambda_3\,Loss(actor),$$

i.e., the overall loss function is a weighted sum of the loss functions of the branches, where α, β, λ₁, λ₂, λ₃ are the weighting coefficients of the loss terms, manually set hyperparameters.
The $(\hat r-r_t)^2$ term represents the error between the predicted immediate reward and the actual immediate reward, $\|\hat s_t-s_t\|^2$ represents the error between the reconstructed state and the true input state, and $|W|_p$ is a regularization term, a norm of the network parameters added to prevent the network from overfitting.
Loss(critic) refers to the loss function of the value estimation branch. Defining $v_\pi$ as the true value function under the policy and $\hat v_\pi$ as its estimate, $Loss(critic)=(\hat v_\pi(s_t)-v_\pi(s_t))^2$.
Loss(actor) refers to the loss function of the policy learning branch, whose output is a probability distribution P over the action space. Define

$$Loss(actor)=-\log P(a_t)-c\,H(P),$$

composed of the cross-entropy between the predicted action probability distribution and the actually selected, executed action, and the entropy H(P) of the distribution itself (with weight c). This is the standard loss function of policy learning.
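Putting the branches together, the weighted overall loss can be sketched as follows; the coefficient values, the p = 1 norm choice, and the entropy weight are illustrative assumptions, not the patent's settings:

```python
import numpy as np

alpha, beta, lam1, lam2, lam3 = 1.0, 0.1, 1e-4, 0.5, 1.0  # illustrative weights

def total_loss(r_hat, r_t, s_hat, s_t, W, v_hat, v_t, logp_a, entropy):
    loss_r      = (r_hat - r_t) ** 2                 # immediate-reward branch
    loss_recon  = float(np.sum((s_hat - s_t) ** 2))  # state-reconstruction branch
    reg         = float(np.sum(np.abs(W)))           # |W|_p with p = 1 (assumed)
    loss_critic = (v_hat - v_t) ** 2                 # value-estimation branch
    loss_actor  = -logp_a - 0.01 * entropy           # cross-entropy + entropy term
    return (alpha * loss_r + beta * loss_recon + lam1 * reg
            + lam2 * loss_critic + lam3 * loss_actor)

val = total_loss(r_hat=1.2, r_t=1.0, s_hat=np.zeros(2), s_t=np.zeros(2),
                 W=np.array([0.5, -0.5]), v_hat=2.0, v_t=2.5,
                 logp_a=float(np.log(0.25)), entropy=1.0)
print(round(val, 6))  # 1.541394
```

All branches contribute one scalar each, so a single backward pass through the weighted sum trains the shared trunk and every head jointly.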
First, in this training segment of length k, we have the input at each moment, the action taken, and the immediate reward obtained; what is missing is the value of each state. A predicted value is constructed for each of the k steps from the corresponding k-step return, specifically:

$$v_\pi(s_t)=r(s_t)+\gamma\, v_\pi(s_{t+1}).$$

For prediction, the value of the final step is predicted first, and the values are then updated backward step by step according to this formula, yielding the predicted value of each step. Note that under a fixed policy π, the state at the next moment is determined at any given moment.
Specifically,
$$v_\pi(s_k)=r(s_k)+\gamma\, v_\pi(s_{k+1}),$$
$$v_\pi(s_{k-1})=r(s_{k-1})+\gamma\, v_\pi(s_k),$$
$$v_\pi(s_{k-2})=r(s_{k-2})+\gamma\, v_\pi(s_{k-1}).$$
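The backward construction above can be sketched directly; the function name and the illustrative inputs are assumptions:

```python
# Sketch of the backward k-step value construction.
def k_step_values(rewards, v_boot, gamma=0.99):
    """rewards: r(s_1)..r(s_k); v_boot: the predicted v_pi(s_{k+1})."""
    values = [0.0] * len(rewards)
    running = v_boot
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running  # v(s_t) = r(s_t) + gamma * v(s_{t+1})
        values[t] = running
    return values

print(k_step_values([1.0, 0.0, 2.0], v_boot=4.0, gamma=0.5))  # [2.0, 2.0, 4.0]
```

Only the final-step value is a network prediction; every earlier value is filled in by the recursion, which is what makes a length-k segment sufficient for one round of training.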
For the auxiliary state reconstruction task, the high-dimensional perceptual input $s_t$ is used directly as the supervision signal of that branch, and training can proceed. Note that the information grabbed from the interface serves as input after preprocessing.
So far, we have the input, the immediate reward, the predicted value, the reconstructed state, and other information; the network can then be deployed and trained according to the training rules of deep networks and the training process of A2C. The network training referred to here is the standard deep network training and reinforcement learning training process; what the embodiment of the present invention seeks to protect is the designed network architecture, including the parts "state feature learning, successor feature learning, and joint policy learning".
To ensure that the network structure does not degenerate during training in environments where the immediate reward is relatively sparse, the learning gradient of the successor features is not back-propagated into the state features; this guarantees that the successor features are learned purely on top of the state features rather than learned directly from the input as just another feature. The auxiliary task also assists the learning of the state features.
The obtained state features and successor features are at different temporal resolutions and can be combined in two ways: one simply concatenates them as two kinds of features; the other treats the successor feature as a goal direction and modulates or weights the state feature with it, with policy learning performed after fusion.
Policy learning can take diverse forms; this example tests policy learning based on linear regression and policy learning based on an LSTM. The combined feature is fed into the chosen policy learning network, which directly outputs the action. After the Agent executes the corresponding action, it obtains the new state and related information, and a new round of training begins. Each network training pass grabs one segment of information and trains once; after new information is obtained, it trains again. This information consists of the input state, the executed action, the new state, the obtained reward, and so on, and is indispensable for training the framework. State feature learning, successor feature learning, and joint policy learning proceed continuously until a good policy (one achieving higher scores and better performance in the actual task) is finally obtained.
On the other hand, comparative tests were carried out to illustrate the performance of the above scheme. As shown in Fig. 3(a)-3(b), A2C and A2C/LSTM are networks realized with the baseline A2C algorithm, showing training performance in the game Breakout. The horizontal axis is the number of training samples, and the vertical axis is the score obtained per round. In Fig. 3(a)-3(b), /LR and /LSTM denote policy learning based on linear regression and policy learning based on LSTM, respectively. A2C is the baseline algorithm, and DSSF is the algorithm framework proposed by the present invention. As mentioned above, two feature-combination methods were attempted: simple concatenation and weighted fusion (marked with /F).
As can clearly be seen from Fig. 3(a)-3(b), the improved algorithms (Agent algorithms applying the framework of the present invention) converge faster than the baseline algorithm and require less training time before performance starts to improve. Finally, the algorithms reach a higher stability.
From the description of the above embodiments, those skilled in the art can clearly understand that the above embodiments can be realized by software, or by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the above embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (a CD-ROM, USB flash drive, removable hard disk, etc.) and includes instructions to make a computer device (a personal computer, server, network device, etc.) execute the methods described in the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or substitution readily conceivable by a person skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A joint learning method of features and policies based on state features and successor features, characterized by comprising:
obtaining state features that characterize the input state by learning the mapping from input states to immediate rewards;
obtaining successor features by learning the mapping from state features to the value function; and
the obtained state features and successor features being at different temporal resolutions, fusing the state features with the successor features, and then learning the fusion result with a policy learning network of a selectable form.
2. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that:
the state features are learned by a convolutional neural network comprising convolutional layers and a final fully connected layer, wherein the convolutional layers produce the state features of the input state and the parameter w of the fully connected layer is the prediction weight; the prediction weight w is kept identical to the prediction weight w from the successor features to the value function, which is realized directly by weight sharing during training.
3. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that the state features and successor features are combined in two ways, both feature-combination modes are tested and their performance compared, and the better-performing method is selected to combine the state features and successor features: one way splices them as two kinds of features; the other treats the successor feature as a goal direction and modulates or weights the state feature with it.
4. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that learning the fusion result with a policy learning network of a selectable form comprises: policy learning based on linear regression, or policy learning based on an LSTM.
5. The joint learning method of features and policies based on state features and successor features according to claim 1, characterized in that an auxiliary task, a state reconstruction task, is added while the state features are learned; the state reconstruction task and the immediate reward learning task share a bottom convolutional network, so training either task updates the parameters of the shared convolutional network and thus affects the learning of the state features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810601576.1A CN108898221B (en) | 2018-06-12 | 2018-06-12 | Joint learning method of characteristics and strategies based on state characteristics and subsequent characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108898221A true CN108898221A (en) | 2018-11-27 |
CN108898221B CN108898221B (en) | 2021-12-14 |
Family
ID=64344815
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810601576.1A Active CN108898221B (en) | 2018-06-12 | 2018-06-12 | Joint learning method of characteristics and strategies based on state characteristics and subsequent characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108898221B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059646A (en) * | 2019-04-23 | 2019-07-26 | 暗物智能科技(广州)有限公司 | The method and Target Searching Method of training action plan model |
CN111585811A (en) * | 2020-05-06 | 2020-08-25 | 郑州大学 | Virtual optical network mapping method based on multi-agent deep reinforcement learning |
CN113763723A (en) * | 2021-09-06 | 2021-12-07 | 武汉理工大学 | Traffic signal lamp control system and method based on reinforcement learning and dynamic timing |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150138332A1 (en) * | 2004-09-17 | 2015-05-21 | Proximex Corporation | Adaptive multi-modal integrated biometric identification and surveillance systems |
CN104881678A (en) * | 2015-05-11 | 2015-09-02 | 中国科学技术大学 | Multitask learning method of model and characteristic united learning |
CN105809671A (en) * | 2016-03-02 | 2016-07-27 | 无锡北邮感知技术产业研究院有限公司 | Combined learning method for foreground region marking and depth order inferring |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
CN107045655A (en) * | 2016-12-07 | 2017-08-15 | 三峡大学 | Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan |
-
2018
- 2018-06-12 CN CN201810601576.1A patent/CN108898221B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150138332A1 (en) * | 2004-09-17 | 2015-05-21 | Proximex Corporation | Adaptive multi-modal integrated biometric identification and surveillance systems |
CN104881678A (en) * | 2015-05-11 | 2015-09-02 | 中国科学技术大学 | Multitask learning method of model and characteristic united learning |
CN105809671A (en) * | 2016-03-02 | 2016-07-27 | 无锡北邮感知技术产业研究院有限公司 | Combined learning method for foreground region marking and depth order inferring |
CN107045655A (en) * | 2016-12-07 | 2017-08-15 | 三峡大学 | Wolf pack clan strategy process based on the random consistent game of multiple agent and virtual generating clan |
CN106899026A (en) * | 2017-03-24 | 2017-06-27 | 三峡大学 | Intelligent power generation control method based on the multiple agent intensified learning with time warp thought |
Non-Patent Citations (2)
Title |
---|
CHAO YU等: "Multiagent Learning of Coordination in Loosely Coupled Multiagent Systems", 《IEEE》 * |
李晓萌等: "基于独立学习的多智能体协作决策", 《控制与决策》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059646A (en) * | 2019-04-23 | 2019-07-26 | 暗物智能科技(广州)有限公司 | The method and Target Searching Method of training action plan model |
CN110059646B (en) * | 2019-04-23 | 2021-02-09 | 暗物智能科技(广州)有限公司 | Method for training action planning model and target searching method |
CN111585811A (en) * | 2020-05-06 | 2020-08-25 | 郑州大学 | Virtual optical network mapping method based on multi-agent deep reinforcement learning |
CN111585811B (en) * | 2020-05-06 | 2022-09-02 | 郑州大学 | Virtual optical network mapping method based on multi-agent deep reinforcement learning |
CN113763723A (en) * | 2021-09-06 | 2021-12-07 | 武汉理工大学 | Traffic signal lamp control system and method based on reinforcement learning and dynamic timing |
CN113763723B (en) * | 2021-09-06 | 2023-01-17 | 武汉理工大学 | Traffic signal lamp control system and method based on reinforcement learning and dynamic timing |
Also Published As
Publication number | Publication date |
---|---|
CN108898221B (en) | 2021-12-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Harvey et al. | Flexible diffusion modeling of long videos | |
Villegas et al. | Hierarchical long-term video prediction without supervision | |
Cheng et al. | Cspn++: Learning context and resource aware convolutional spatial propagation networks for depth completion | |
CN109891897B (en) | Method for analyzing media content | |
CN108898221A (en) | The combination learning method of feature and strategy based on state feature and subsequent feature | |
CN110383298A (en) | Data efficient intensified learning for continuous control task | |
CN112052948B (en) | Network model compression method and device, storage medium and electronic equipment | |
CN111461325B (en) | Multi-target layered reinforcement learning algorithm for sparse rewarding environmental problem | |
CN113627596A (en) | Multi-agent confrontation method and system based on dynamic graph neural network | |
CN108809713A (en) | Monte Carlo tree searching method based on optimal resource allocation algorithm | |
CN108805611A (en) | Advertisement screening technique and device | |
CN114820871B (en) | Font generation method, model training method, device, equipment and medium | |
CN114139637B (en) | Multi-agent information fusion method and device, electronic equipment and readable storage medium | |
CN107016212A (en) | Intention analysis method based on dynamic Bayesian network | |
CN111282272B (en) | Information processing method, computer readable medium and electronic device | |
Wu et al. | Exploring the Task Cooperation in Multi-goal Visual Navigation. | |
Rao et al. | Distributed deep reinforcement learning using tensorflow | |
CN114626499A (en) | Embedded multi-agent reinforcement learning method using sparse attention to assist decision making | |
Tong et al. | Enhancing rolling horizon evolution with policy and value networks | |
Yanpeng | Hybrid kernel extreme learning machine for evaluation of athletes' competitive ability based on particle swarm optimization | |
Shen et al. | Efficient deep structure learning for resource-limited IoT devices | |
Tang et al. | A deep map transfer learning method for face recognition in an unrestricted smart city environment | |
CN115222773A (en) | Single-point motion learning method and device | |
Kalaivani et al. | Evolutionary game theory to predict the population growth in Few districts of Tamil Nadu and Kerala | |
Hao et al. | Learning representation with Q-irrelevance abstraction for reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||