CN112257872A - Target planning method for reinforcement learning - Google Patents

Target planning method for reinforcement learning Download PDF

Info

Publication number
CN112257872A
Authority
CN
China
Prior art keywords
vector
environment
actuator
action
planning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011192071.8A
Other languages
Chinese (zh)
Other versions
CN112257872B (en)
Inventor
Zhou Shihai (周世海)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Hongyun Zhihui Technology Co ltd
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202011192071.8A
Publication of CN112257872A
Application granted
Publication of CN112257872B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Feedback Control In General (AREA)

Abstract

The method converts the actions used in reinforcement learning into vector representations that carry intrinsic relations by means of word-vector embedding, uses these representations in a predictor and, combined with the environmental features of a given target, computes a planning path to the target state, thereby converting the sparse environment reward into a dense reward. At the same time, adversarial training between a planner and an actuator alleviates the local-optimum problem to a certain extent.

Description

Target planning method for reinforcement learning
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a target planning method for reinforcement learning.
Background
Reinforcement learning suffers from a local-optimum problem: when the state space is very large, the agent tends to settle on the highest-value policy among those it has explored so far, even though that policy is not the optimal one, so the agent cannot complete the specified task well.
Reinforcement learning also suffers from sparse rewards: while the agent explores the environment to execute a task, rewards are given only rarely, for example only when the final target is reached and not otherwise. This easily makes it difficult for the agent to grasp the given task goal at the beginning of training and further amplifies the interference from the local-optimum problem.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a target planning method for reinforcement learning that can overcome the local-optimum problem to a certain extent and convert sparse rewards into dense rewards inside the agent.
The technical solution adopted by the invention is a target planning method for reinforcement learning, comprising the following steps:
S1, collecting a plurality of converged agents that share the same action space, computing a vector representation of each action from the action sequences these agents produce while executing their tasks, assembling the vector representations into an action-vector dictionary that maps each action to its vector, and then placing an actuator to be trained, having the same action space, into the target training environment;
S2, extracting the action-related environment feature vector with a feature extractor and using it as the external input of the actuator;
S3, combining the environment feature vector of the current period extracted in S2 with the vector representation of the action output and executed by the actuator into a single vector, using it as the input of the environment feature predictor for the next period, and computing the environment feature vector of the next period with the environment feature predictor (a training sketch of such a predictor is given after these steps);
S4, giving the target environment of the task's final state and obtaining the target environment feature vector through the feature extractor;
S5, based on the distance between the current environment feature vector and the target environment feature vector, and with the goal of shortening this distance while reducing the number of iterations, performing iterative computation to obtain a planning sequence in which the iteratively obtained environment feature vectors correspond one-to-one with actions;
and S6, using the planning sequence as a training set to perform planning training on the actuator.
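For illustration only, a minimal training sketch of the environment feature predictor in S3 is given below, assuming an ordinary feed-forward network trained with a mean-squared-error loss on observed transitions; the dimensions, layer sizes and PyTorch realization are assumptions for the example, not requirements of the method.

    # Sketch (assumption): the environment feature predictor maps the combined
    # [current feature vector ; action vector] to the next-period feature vector
    # and is trained with an MSE loss on transitions gathered while the actuator acts.
    import torch
    import torch.nn as nn

    feat_dim, act_dim = 16, 16
    predictor = nn.Sequential(nn.Linear(feat_dim + act_dim, 64), nn.ReLU(),
                              nn.Linear(64, feat_dim))
    optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

    def predictor_step(feat_t, action_vec, feat_t1):
        # feat_t, feat_t1: environment feature vectors of consecutive periods;
        # action_vec: vector representation of the executed action.
        pred = predictor(torch.cat([feat_t, action_vec], dim=-1))
        loss = nn.functional.mse_loss(pred, feat_t1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()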
The invention has the following beneficial effects:
(1) Expressing each action in the action sequence as a vector gives each action a basic meaning and establishes similarity relations among actions; actions no longer exist in isolation, so the agent can read the relations between actions directly when planning toward a goal instead of having to rediscover them through massive exploration. This benefits multi-goal learning across multiple agents. Moreover, the action vectors only need to be obtained from action sequences realized under simple basic tasks, without considering the optimal-policy problem in a complex state space, and they can be reused as long as the action space stays the same.
(2) The feature extractor extracts the environmental features related to the action, so the input environment becomes linked to the action, and each element position in an action vector can be regarded as an influence factor on some feature of the environment. On this basis, the environment feature predictor fits the relation between action vectors and environment feature vectors, so it can internally learn how much each element of an action vector contributes to each environmental feature and thus accurately predict the environment features of the next period. If the raw action instruction were used directly as input, the predictor would have to decompose the instruction further and could not learn the influence of actions on the environment as well.
(3) On the basis of accurately predicting the environment features of the next period, shortening the distance between the target environment features and the current environment features in as few steps as possible is taken as the optimization objective, and a reinforcement learning environment and a planning agent are reconstructed: the current environment feature vector is fed to the agent to obtain a planned action, the planned action and the current environment feature vector are combined as input to the environment feature predictor to obtain a predicted environment feature vector, which is in turn fed back to the agent, and the iteration and optimization are repeated until a planning sequence that reaches the target environment features with few actions is obtained. In this way the sparse reward is converted into a distance problem between the current environment and the target environment, and because the intrinsic relations between actions are built by vectorizing them from action sequences, the distance is computed more accurately; this addresses the prior-art difficulties of making accurate predictions and of converting sparse rewards into dense, distance-based rewards.
Preferably, the action vector representation in S1 is obtained by treating the action sequence as a text sequence and applying the word-vector embedding principle from NLP, for example word2vec or similar methods, which converts the individual actions into action vectors that carry intrinsic relations.
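As an illustration of this preference, a minimal sketch using gensim's word2vec is shown below; the action names, vector size and training settings are assumptions made only for the example.

    # Sketch (assumption): build the action-vector dictionary by treating each
    # recorded action sequence of a converged agent as a "sentence".
    from gensim.models import Word2Vec

    # Hypothetical action sequences collected from agents sharing one action space.
    action_sequences = [
        ["forward", "forward", "turn_left", "forward", "grasp"],
        ["turn_right", "forward", "grasp", "release"],
        ["forward", "turn_left", "turn_left", "forward", "grasp"],
    ]

    # Skip-gram embeddings; vector_size would match the environment feature
    # vector dimension (see the same-dimension preference below).
    model = Word2Vec(action_sequences, vector_size=16, window=3,
                     min_count=1, sg=1, epochs=200)

    # Action-vector dictionary: action -> embedding vector.
    action_vector_dict = {a: model.wv[a] for a in model.wv.index_to_key}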
Preferably, the feature extraction in S2 involves a feature extractor and an actuator action predictor: the environment feature vectors output by the feature extractor in the current period and in the next period are combined into one vector and used as the input of the actuator action predictor, and the difference between the action output by the actuator in the current period and the action output by the actuator action predictor serves as the loss function of both the feature extractor and the actuator action predictor. This makes the features extracted by the feature extractor depend only on the action, ignores the parts of the environment the agent cannot influence, and links the environment feature vector to the action vector internally, so that the fitted relation between environment features and action vectors converges more easily when the environment feature vector is used as input to the environment feature predictor.
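A compact PyTorch sketch of this joint training is given below; flat observations, discrete actions, the layer sizes and the cross-entropy form of the "difference" loss are all assumptions for the example.

    # Sketch (assumptions: flat observation vectors, discrete actions).
    import torch
    import torch.nn as nn

    class FeatureExtractor(nn.Module):
        def __init__(self, obs_dim, feat_dim=16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                                     nn.Linear(64, feat_dim))

        def forward(self, obs):
            return self.net(obs)

    class ActuatorActionPredictor(nn.Module):
        # Predicts the actuator's action from the feature vectors of the
        # current period and the next period.
        def __init__(self, feat_dim=16, n_actions=5):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))

        def forward(self, feat_t, feat_t1):
            return self.net(torch.cat([feat_t, feat_t1], dim=-1))

    extractor = FeatureExtractor(obs_dim=32)
    action_predictor = ActuatorActionPredictor()
    optimizer = torch.optim.Adam(list(extractor.parameters()) +
                                 list(action_predictor.parameters()), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def extractor_step(obs_t, obs_t1, action_taken):
        # The mismatch between the action actually taken and the predicted
        # action is the shared loss of the extractor and the action predictor.
        logits = action_predictor(extractor(obs_t), extractor(obs_t1))
        loss = loss_fn(logits, action_taken)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()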
Preferably, S5 comprises:
S51, using the environment feature predictor as the environment function and a reinforcement-learning agent as the planner, the planner comprising a strategy device (its policy) and a valuator (its value estimator), and constructing a data loop between this environment and the agent;
S52, using the current environment feature vector as the input of the planner's strategy device to obtain the action output of the planner's strategy device;
S53, converting the action of the planner's strategy device into its vector representation according to the action-vector dictionary, merging it with the current environment feature vector and feeding the result to the environment feature predictor to predict the next-period environment feature vector for planning, using this predicted vector as the new input of the planner's strategy device, and iterating in this way to obtain a planning sequence;
and S54, judging the value of the planning sequence with the planner's valuator, and updating and optimizing the combination strategy of the planning sequence until convergence.
In this way a better planning sequence is obtained through iterative optimization based on the current environment feature vector and the target environment feature vector.
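A simplified sketch of this S51-S54 loop follows; the strategy device and the environment feature predictor are represented by caller-supplied callables, and stopping when the predicted features come within a tolerance of the target is an assumption about how "reaching" the target is detected.

    # Sketch of the planner's data loop (S51-S53); the valuator update of S54
    # would follow any standard actor-critic scheme and is not shown.
    import numpy as np

    def plan(strategy_device, env_feature_predictor, action_vector_dict,
             current_feat, target_feat, max_steps=50, tol=1e-2):
        # Rolls the strategy device through the predictor, collecting a
        # planning sequence of (environment feature vector, action) pairs.
        feat = current_feat
        planning_sequence = []
        for _ in range(max_steps):
            action = strategy_device(feat)              # S52: action output
            action_vec = action_vector_dict[action]     # S53: look up vector
            merged = np.concatenate([feat, action_vec])
            feat_next = env_feature_predictor(merged)   # predicted next feature
            planning_sequence.append((feat, action))
            # Dense signal: distance to the target environment features.
            if np.linalg.norm(feat_next - target_feat) < tol:
                break
            feat = feat_next                            # new strategy-device input
        return planning_sequence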
Preferably, S6 comprises:
S61, the actuator being a reinforcement-learning agent that comprises a strategy device and a valuator and is capable of environment exploration and self-training, judging whether the current actuator has started exploring the environment and training itself; if not, obtaining an initial planning sequence from the initial state at the time the actuator is placed in the training environment and the given target, training the actuator's strategy device with it, and then having the actuator start exploring the environment; if so, training the actuator's strategy device without using a planning sequence and going to S62;
S62, judging whether the current actuator's strategy device has converged; if not, the actuator continues exploring the environment and training itself; if it has converged, computing a planning sequence from the current environment feature vector and the target environment feature vector and going to S63;
S63, judging the value of the planning sequence and of the actuator's strategy against the actuator's task target; if the value of the planning sequence is higher, using the planning sequence as a training set to train the actuator's strategy device; if the value of the actuator's strategy is higher than or equal to that of the planning sequence, performing the iterative computation again to optimize the planning sequence and comparing the values repeatedly, the number of repetitions being N; once the number of repetitions reaches N, going to S64;
and S64, collecting the environment feature vectors of the actuator and the corresponding actions, training the planner's strategy device with them as a training set, and then returning to S61.
With this method, when the actuator first enters a sparse-reward environment it already has a strategic direction for executing the task instead of exploring aimlessly. The mutual training and confrontation between planner and actuator, in which two different reward mechanisms pursue a common goal, can break the local optima of both sides. Using vector representations of the actions improves the precision of the planner's predictions and plans. When the joint training of planner and actuator converges, training is considered complete and an actuator strategy that executes the task target well is obtained.
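The alternating scheme of S61-S64 could be organized along the lines of the skeleton below; the actuator, planner and environment objects and every helper name (has_started_exploring, policy_converged, plan_fn, value_fn, joint_converged_fn and so on) are hypothetical interfaces, not components prescribed by the invention.

    # Skeleton of the planner/actuator confrontation (S61-S64).
    def train(actuator, planner, env, target_feat, plan_fn, value_fn,
              joint_converged_fn, N=5):
        while not joint_converged_fn(actuator, planner):
            if not actuator.has_started_exploring():
                # S61: bootstrap the strategy device from an initial planning
                # sequence for the initial state and the given target.
                seq = plan_fn(planner, env.initial_feature(), target_feat)
                actuator.train_policy_on(seq)
                actuator.start_exploring(env)
                continue
            actuator.explore_and_self_train(env)        # S61: own training
            if not actuator.policy_converged():
                continue                                # S62: keep exploring
            # S62: converged, so compute a plan for the current state.
            seq = plan_fn(planner, env.current_feature(), target_feat)
            for _ in range(N):                          # S63: compare values
                if value_fn(seq) > value_fn(actuator.policy):
                    actuator.train_policy_on(seq)       # plan wins
                    break
                seq = plan_fn(planner, env.current_feature(), target_feat)
            else:
                # S64: after N comparisons the planner learns from the
                # actuator's collected (feature vector, action) pairs.
                planner.train_policy_on(actuator.collected_transitions())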
Preferably, the vector representation of the action and the environment feature vector have the same dimension, which strengthens their intrinsic connection.
Preferably, before they are combined into the input of the environment feature predictor, the vector representation of the action and the environment feature vector are each normalized, which reduces the influence of numerical scale and helps construct the fitted relation.
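As a small illustration, the normalization before merging might look like the following; L2 normalization is only one possible choice and an assumption of this example.

    # Sketch: normalize the action vector and the environment feature vector
    # separately before merging them into the predictor's input.
    import numpy as np

    def predictor_input(action_vec, env_feat_vec, eps=1e-8):
        a = action_vec / (np.linalg.norm(action_vec) + eps)
        f = env_feat_vec / (np.linalg.norm(env_feat_vec) + eps)
        return np.concatenate([a, f])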
Detailed Description
The following detailed description further describes the invention so that those skilled in the art can implement it by reference to the description; it is not intended to limit the scope of the invention to the embodiments described here.
The technical solution adopted by the invention is a target planning method for reinforcement learning, comprising the following steps.
S1, collecting a plurality of converged agents that share the same action space, computing a vector representation of each action from the action sequences these agents produce while executing their tasks, assembling the vector representations into an action-vector dictionary that maps each action to its vector, and then placing an actuator to be trained, having the same action space, into the target training environment. The action vector representations are obtained by treating the action sequence as a text sequence and applying the word-vector embedding principle from NLP, for example word2vec or similar methods, which converts the individual actions into action vectors that carry intrinsic relations.
S2, extracting the action-related environment feature vector with a feature extractor and using it as the external input of the actuator. The feature extraction involves the feature extractor and an actuator action predictor: the environment feature vectors output by the feature extractor in the current period and in the next period are combined into one vector and used as the input of the actuator action predictor, and the difference between the action output by the actuator in the current period and the action output by the actuator action predictor serves as the loss function of both the feature extractor and the actuator action predictor. This makes the features extracted by the feature extractor depend only on the action, ignores the parts of the environment the agent cannot influence, and links the environment feature vector to the action vector internally, so that the fitted relation between environment features and action vectors converges more easily when the environment feature vector is used as input to the environment feature predictor.
S3, combining the environment feature vector of the current period extracted in S2 with the vector representation of the action output and executed by the actuator into a single vector, using it as the input of the environment feature predictor for the next period, and computing the environment feature vector of the next period with the environment feature predictor.
S4, giving the target environment of the task's final state and obtaining the target environment feature vector through the feature extractor.
S5, based on the distance between the current environment feature vector and the target environment feature vector, and with the goal of shortening this distance while reducing the number of iterations, performing iterative computation to obtain a planning sequence in which the iteratively obtained environment feature vectors correspond one-to-one with actions.
S5 is specifically developed as follows:
S51, using the environment feature predictor as the environment function and a reinforcement-learning agent as the planner, the planner comprising a strategy device and a valuator, and constructing a data loop between this environment and the agent;
S52, using the current environment feature vector as the input of the planner's strategy device to obtain the action output of the planner's strategy device;
S53, converting the action of the planner's strategy device into its vector representation according to the action-vector dictionary, merging it with the current environment feature vector and feeding the result to the environment feature predictor to predict the next-period environment feature vector for planning, using this predicted vector as the new input of the planner's strategy device, and iterating in this way to obtain a planning sequence;
and S54, judging the value of the planning sequence with the planner's valuator, and updating and optimizing the combination strategy of the planning sequence until convergence.
In this way a better planning sequence is obtained through iterative optimization based on the current environment feature vector and the target environment feature vector.
S6, using the planning sequence as a training set to perform planning training on the actuator.
S6 is specifically developed as follows:
S61, the actuator being a reinforcement-learning agent that comprises a strategy device and a valuator and is capable of environment exploration and self-training, judging whether the current actuator has started exploring the environment and training itself; if not, obtaining an initial planning sequence from the initial state at the time the actuator is placed in the training environment and the given target, training the actuator's strategy device with it, and then having the actuator start exploring the environment; if so, training the actuator's strategy device without using a planning sequence and going to S62;
S62, judging whether the current actuator's strategy device has converged; if not, the actuator continues exploring the environment and training itself; if it has converged, computing a planning sequence from the current environment feature vector and the target environment feature vector and going to S63;
S63, judging the value of the planning sequence and of the actuator's strategy against the actuator's task target; if the value of the planning sequence is higher, using the planning sequence as a training set to train the actuator's strategy device; if the value of the actuator's strategy is higher than or equal to that of the planning sequence, performing the iterative computation again to optimize the planning sequence and comparing the values repeatedly, the number of repetitions being N; once the number of repetitions reaches N, going to S64;
and S64, collecting the environment feature vectors of the actuator and the corresponding actions, training the planner's strategy device with them as a training set, and then returning to S61.
With this method, when the actuator first enters a sparse-reward environment it already has a strategic direction for executing the task instead of exploring aimlessly. The mutual training and confrontation between planner and actuator, in which two different reward mechanisms pursue a common goal, can break the local optima of both sides. Using vector representations of the actions improves the precision of the planner's predictions and plans. When the joint training of planner and actuator converges, training is considered complete and an actuator strategy that executes the task target well is obtained.
In the above, expressing each action in the action sequence as a vector gives each action a basic meaning and establishes similarity relations among actions; actions no longer exist in isolation, so the agent can read the relations between actions directly when planning toward a goal instead of having to rediscover them through massive exploration. This benefits multi-goal learning across multiple agents. Moreover, the action vectors only need to be obtained from action sequences realized under simple basic tasks, without considering the optimal-policy problem in a complex state space, and they can be reused as long as the action space stays the same.
In the above, the feature extractor extracts the environmental features related to the action, so the input environment becomes linked to the action, and each element position in an action vector can be regarded as an influence factor on some feature of the environment. On this basis, the environment feature predictor fits the relation between action vectors and environment feature vectors, so it can internally learn how much each element of an action vector contributes to each environmental feature and thus accurately predict the environment features of the next period. If the raw action instruction were used directly as input, the predictor would have to decompose the instruction further and could not learn the influence of actions on the environment as well.
In the above, on the basis of accurately predicting the environment features of the next period, shortening the distance between the target environment features and the current environment features in as few steps as possible is taken as the optimization objective, and a reinforcement learning environment and a planning agent are reconstructed: the current environment feature vector is fed to the agent to obtain a planned action, the planned action and the current environment feature vector are combined as input to the environment feature predictor to obtain a predicted environment feature vector, which is in turn fed back to the agent, and the iteration and optimization are repeated until a planning sequence that reaches the target environment features with few actions is obtained. In this way the sparse reward is converted into a distance problem between the current environment and the target environment, and because the intrinsic relations between actions are built by vectorizing them from action sequences, the distance is computed more accurately; this addresses the prior-art difficulties of making accurate predictions and of converting sparse rewards into dense, distance-based rewards.
The vector representation of the action and the environment feature vector have the same dimension, which strengthens their intrinsic connection.
Before they are combined into the input of the environment feature predictor, the vector representation of the action and the environment feature vector are each normalized, which reduces the influence of numerical scale and helps construct the fitted relation.
The predictor, the strategy device and the valuator of the invention can be realized with a BP neural network, a recurrent neural network, a Transformer, a graph neural network and the like, and a period in the invention may be a moment, a time interval, a sequence and the like.
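As one example of the listed realizations, a recurrent environment feature predictor for the case where a period is a sequence of steps could look like the sketch below; the GRU choice and the layer sizes are assumptions for illustration.

    # Sketch: recurrent realization of the environment feature predictor for
    # periods that are sequences of steps rather than single moments.
    import torch
    import torch.nn as nn

    class RecurrentFeaturePredictor(nn.Module):
        def __init__(self, feat_dim=16, act_dim=16, hidden=64):
            super().__init__()
            self.gru = nn.GRU(feat_dim + act_dim, hidden, batch_first=True)
            self.head = nn.Linear(hidden, feat_dim)

        def forward(self, feat_seq, action_vec_seq):
            # feat_seq: (batch, T, feat_dim); action_vec_seq: (batch, T, act_dim)
            x = torch.cat([feat_seq, action_vec_seq], dim=-1)
            out, _ = self.gru(x)
            return self.head(out[:, -1])   # feature vector of the next period

    model = RecurrentFeaturePredictor()
    next_feat = model(torch.randn(2, 8, 16), torch.randn(2, 8, 16))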

Claims (7)

1. A target planning method for reinforcement learning, comprising:
S1, collecting a plurality of converged agents that share the same action space, computing a vector representation of each action from the action sequences these agents produce while executing their tasks, assembling the vector representations into an action-vector dictionary that maps each action to its vector, and then placing an actuator to be trained, having the same action space, into the target training environment;
S2, extracting the action-related environment feature vector with a feature extractor and using it as the external input of the actuator;
S3, combining the environment feature vector of the current period extracted in S2 with the vector representation of the action output and executed by the actuator into a single vector, using it as the input of the environment feature predictor for the next period, and computing the environment feature vector of the next period with the environment feature predictor;
S4, giving the target environment of the task's final state and obtaining the target environment feature vector through the feature extractor;
S5, based on the distance between the current environment feature vector and the target environment feature vector, and with the goal of shortening this distance while reducing the number of iterations, performing iterative computation to obtain a planning sequence in which the iteratively obtained environment feature vectors correspond one-to-one with actions;
and S6, using the planning sequence as a training set to perform planning training on the actuator.
2. The target planning method for reinforcement learning of claim 1, wherein the action vector representation in S1 is obtained by treating the action sequence as a text sequence and applying the word-vector embedding principle from NLP.
3. The target planning method for reinforcement learning of claim 1, wherein the feature extraction in S2 involves a feature extractor and an actuator action predictor, the environment feature vectors output by the feature extractor in the current period and in the next period are combined into one vector and used as the input of the actuator action predictor, and the difference between the action output by the actuator in the current period and the action output by the actuator action predictor is used as the loss function of the feature extractor and the actuator action predictor.
4. The target planning method for reinforcement learning of claim 1, wherein S5 comprises:
S51, using the environment feature predictor as the environment function and a reinforcement-learning agent as the planner, the planner comprising a strategy device and a valuator, and constructing a data loop between this environment and the agent;
S52, using the current environment feature vector as the input of the planner's strategy device to obtain the action output of the planner's strategy device;
S53, converting the action of the planner's strategy device into its vector representation according to the action-vector dictionary, merging it with the current environment feature vector and feeding the result to the environment feature predictor to predict the next-period environment feature vector for planning, using this predicted vector as the new input of the planner's strategy device, and iterating in this way to obtain a planning sequence;
and S54, judging the value of the planning sequence with the planner's valuator, and updating and optimizing the combination strategy of the planning sequence until convergence.
5. The target planning method for reinforcement learning of claim 1, wherein S6 comprises:
S61, the actuator being a reinforcement-learning agent that comprises a strategy device and a valuator and is capable of environment exploration and self-training, judging whether the current actuator has started exploring the environment and training itself; if not, obtaining an initial planning sequence from the initial state at the time the actuator is placed in the training environment and the given target, training the actuator's strategy device with it, and then having the actuator start exploring the environment; if so, training the actuator's strategy device without using a planning sequence and going to S62;
S62, judging whether the current actuator's strategy device has converged; if not, the actuator continues exploring the environment and training itself; if it has converged, computing a planning sequence from the current environment feature vector and the target environment feature vector and going to S63;
S63, judging the value of the planning sequence and of the actuator's strategy against the actuator's task target; if the value of the planning sequence is higher, using the planning sequence as a training set to train the actuator's strategy device; if the value of the actuator's strategy is higher than or equal to that of the planning sequence, performing the iterative computation again to optimize the planning sequence and comparing the values repeatedly, the number of repetitions being N; once the number of repetitions reaches N, going to S64;
and S64, collecting the environment feature vectors of the actuator and the corresponding actions, training the planner's strategy device with them as a training set, and then returning to S61.
6. The target planning method for reinforcement learning of claim 1, wherein the vector representation of the action and the environment feature vector have the same dimension.
7. The target planning method for reinforcement learning of claim 3, wherein the vector representation of the action and the environment feature vector are each normalized before being combined into the input of the environment feature predictor.
CN202011192071.8A 2020-10-30 2020-10-30 Target planning method for reinforcement learning Active CN112257872B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011192071.8A CN112257872B (en) 2020-10-30 2020-10-30 Target planning method for reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011192071.8A CN112257872B (en) 2020-10-30 2020-10-30 Target planning method for reinforcement learning

Publications (2)

Publication Number Publication Date
CN112257872A 2021-01-22
CN112257872B 2022-09-13

Family

ID=74268493

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011192071.8A Active CN112257872B (en) 2020-10-30 2020-10-30 Target planning method for reinforcement learning

Country Status (1)

Country Link
CN (1) CN112257872B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113467487A (en) * 2021-09-06 2021-10-01 中国科学院自动化研究所 Path planning model training method, path planning device and electronic equipment
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214469A1 (en) * 2011-08-26 2014-07-31 Bae Systems Plc Goal-based planning system
CN108288097A (en) * 2018-01-24 2018-07-17 中国科学技术大学 Higher-dimension Continuous action space discretization heuristic approach in intensified learning task
US20190244099A1 (en) * 2018-02-05 2019-08-08 Deepmind Technologies Limited Continual reinforcement learning with a multi-task agent
CN110794832A (en) * 2019-10-21 2020-02-14 同济大学 Mobile robot path planning method based on reinforcement learning
US10600004B1 (en) * 2017-11-03 2020-03-24 Am Mobileapps, Llc Machine-learning based outcome optimization
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140214469A1 (en) * 2011-08-26 2014-07-31 Bae Systems Plc Goal-based planning system
US10600004B1 (en) * 2017-11-03 2020-03-24 Am Mobileapps, Llc Machine-learning based outcome optimization
CN108288097A (en) * 2018-01-24 2018-07-17 中国科学技术大学 Higher-dimension Continuous action space discretization heuristic approach in intensified learning task
US20190244099A1 (en) * 2018-02-05 2019-08-08 Deepmind Technologies Limited Continual reinforcement learning with a multi-task agent
CN110794832A (en) * 2019-10-21 2020-02-14 同济大学 Mobile robot path planning method based on reinforcement learning
CN111062491A (en) * 2019-12-13 2020-04-24 周世海 Intelligent agent unknown environment exploration method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YANG WEIYI (杨惟轶) et al.: "A Survey of Research on the Sparse Reward Problem in Deep Reinforcement Learning" (深度强化学习中稀疏奖励问题研究综述), Computer Science (《计算机科学》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113467487A (en) * 2021-09-06 2021-10-01 中国科学院自动化研究所 Path planning model training method, path planning device and electronic equipment
CN113879339A (en) * 2021-12-07 2022-01-04 阿里巴巴达摩院(杭州)科技有限公司 Decision planning method for automatic driving, electronic device and computer storage medium

Also Published As

Publication number Publication date
CN112257872B (en) 2022-09-13

Similar Documents

Publication Publication Date Title
CN107665230B (en) Training method and device of user behavior prediction model for intelligent home control
CN102402712B (en) Robot reinforced learning initialization method based on neural network
CN112257872B (en) Target planning method for reinforcement learning
Ahmadi et al. Learning fuzzy cognitive maps using imperialist competitive algorithm
CN112016736A (en) Photovoltaic power generation power control method based on gate control convolution and attention mechanism
CN113780002A (en) Knowledge reasoning method and device based on graph representation learning and deep reinforcement learning
Schubert et al. A generalist dynamics model for control
CN113935489A (en) Variational quantum model TFQ-VQA based on quantum neural network and two-stage optimization method thereof
CN117349319A (en) Database query optimization method based on deep reinforcement learning
Khan et al. Motion planning for a snake robot using double deep q-learning
CN112241802A (en) Interval prediction method for wind power
CN116862080B (en) Carbon emission prediction method and system based on double-view contrast learning
Gabrijel et al. On-line identification and reconstruction of finite automata with generalized recurrent neural networks
Abiyev Fuzzy wavelet neural network for prediction of electricity consumption
Zha et al. Learning from ambiguous demonstrations with self-explanation guided reinforcement learning
CN116030537B (en) Three-dimensional human body posture estimation method based on multi-branch attention-seeking convolution
CN117154845A (en) Power grid operation adjustment method based on generation type decision model
CN116010621B (en) Rule-guided self-adaptive path generation method
Liu et al. Dap: Domain-aware prompt learning for vision-and-language navigation
Ghazali et al. Dynamic ridge polynomial neural networks in exchange rates time series forecasting
CN114372418A (en) Wind power space-time situation description model establishing method
Dong et al. Neural networks and AdaBoost algorithm based ensemble models for enhanced forecasting of nonlinear time series
Massel et al. Contingency management and semantic modeling in energy sector
Li et al. A Short-Term Load Forecasting Method for Large-Scale Power Distribution Systems Based on a Novel Spatio-Temporal Neural Network
Richard et al. A comprehensive benchmark of neural networks for system identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240605

Address after: 310000, Block H5, 19th Floor, Building A, Paradise Software Park, No. 3 Xidoumen Road, Xihu District, Hangzhou City, Zhejiang Province (self declared)

Patentee after: Zhejiang Hongyun Zhihui Technology Co.,Ltd.

Country or region after: China

Address before: 315722 No.14 Dongxi Road, Xizhou Town, Xiangshan County, Ningbo City, Zhejiang Province

Patentee before: Zhou Shihai

Country or region before: China