Background
Reinforcement learning involves a local-optimum problem: when the state space is very large, the agent tends to settle on the highest-value policy among those it has explored so far, even though that policy is not the optimal one, so the agent cannot complete the specified task well.
In addition, reinforcement learning suffers from sparse rewards: while an agent explores the environment to perform a task, rewards are given only rarely, for example only when the final target is reached and not otherwise. This makes it hard for the agent to master the given task goal in early training, and further aggravates the interference from the local-optimum problem.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a target planning method for reinforcement learning that can overcome the local-optimum problem to a certain extent and can convert sparse external rewards into dense rewards inside the agent.
The technical scheme adopted by the invention is a target planning method for reinforcement learning, comprising the following steps:
S1, collecting a plurality of converged agents that share the same action space; computing, from the action sequences those agents produce when executing tasks, a vector representation of each action; assembling these representations into an action-to-vector dictionary; and then placing an executor to be trained, with the same action space, into a target training environment;
S2, extracting, through a feature extractor, an environment feature vector related to the action, to serve as the external input of the executor;
S3, combining the environment feature vector extracted in the current period in S2 with the vector representation of the action output and executed by the executor into one vector, using this vector as the input of an environment feature predictor, and computing with the environment feature predictor the environment feature vector of the next period;
S4, giving a target environment for the final state of the task, and obtaining a target environment feature vector through the feature extractor;
S5, based on the distance between the current environment feature vector and the target environment feature vector, and aiming to shorten that distance while reducing the number of iterations, performing iterative computation to obtain a planning sequence in which the iteratively obtained environment feature vectors correspond one-to-one with actions;
and S6, using the planning sequence as a training set to carry out planning training on the executor.
The invention has the following beneficial effects:
(1) Each action in the action sequence is expressed as a vector, which gives each action a basic meaning and establishes similarity relations between actions. Actions no longer exist in isolation: when planning toward a goal, the agent can read the relations between actions directly instead of rediscovering them through massive exploration, which benefits multi-goal learning across multiple agents. Moreover, the action vectors only need to be obtained from action sequences produced on a simple basic task, without solving the optimal-policy problem in a complex state space, and they can be reused in any setting with the same action space.
(2) The feature extractor extracts environment features related to the action, so the input environment is linked to the action, and each element position in an action's vector can be regarded as an influence factor on some feature of the environment. On this basis, the environment feature predictor fits the relation between action vectors and environment feature vectors; it can thus internalize the action vectors and, from each element's contribution to the environment features, accurately predict the environment features of the next period. If the raw action instruction were used directly as input, the predictor would have to decompose the instruction further and could not learn the influence of actions on the environment as well.
(3) On the basis of accurately predicting the next period's environment features, and taking the rapid shortening of the distance between the target environment features and the current environment features as the optimization target, a reinforcement learning environment and a planning agent are reconstructed: the current environment feature vector is fed to the agent to obtain a planned action; the planned action and the current environment feature vector are combined as the input of the environment feature predictor, yielding a predicted environment feature vector; the predicted vector is then fed back to the agent as its next input; and the iteration repeats until a planning sequence with few actions that reaches the target environment features is obtained. This converts the sparse reward into a distance problem between the current environment and the target environment, and because the internal relations among actions are built in through vectorizing the action sequences, the distance is computed more accurately. It thereby solves the prior-art problems that accurate prediction is difficult and that sparse rewards cannot be well converted into dense, distance-based optimization.
Preferably, the action vector representation in S1 is obtained by regarding the action sequence as a text sequence and applying the word-vector embedding principle from NLP, for example word2vec or a similar method; this converts the actions into action vectors with intrinsic relations among them.
Preferably, the feature extraction in S2 uses the feature extractor together with an executor action predictor: the environment feature vectors output by the feature extractor in the current period and in the next period are combined into one vector and used as the input of the executor action predictor, and the difference between the action actually output by the executor in the current period and the action output by the executor action predictor is used as the loss function of both the feature extractor and the executor action predictor. This makes the environment features extracted by the feature extractor relate only to the action, ignores the parts of the environment that the agent cannot influence, and links the environment feature vector and the action vector internally, so that when the environment feature vector is used as the input of the environment feature predictor, the fitted relation between environment features and action vectors converges more easily.
Preferably, S5 comprises:
S51, using the environment feature predictor as the environment function and a reinforcement-learning agent as the planner, the planner comprising a policy module and a value estimator, and constructing a data loop between the environment and the agent;
S52, feeding the current environment feature vector to the planner's policy module to obtain the action output of the policy module;
S53, converting the action from the planner's policy module into a vector representation via the action vector dictionary, merging it with the current environment feature vector, and feeding the result to the environment feature predictor to predict the next-period environment feature vector for planning; using that predicted vector as the new input of the planner's policy module; and iterating in turn to obtain a planning sequence;
and S54, judging the value of the sequence with the planner's value estimator, and updating the policy to optimize the planning sequence until convergence.
This method obtains a good planning sequence through iterative optimization based on the current environment feature vector and the target environment feature vector.
Preferably, S6 comprises:
S61, the executor being a reinforcement-learning agent that comprises a policy module and a value estimator and can explore the environment and train itself: judging whether the current executor has started environment exploration and self-training; if not, obtaining an initial planning sequence from the initial state and the given target at the time the executor is placed into the training environment, training the executor's policy module on it, and then entering the exploration state; if so, training the executor's policy module without using a planning sequence, and proceeding to S62;
S62, judging whether the current executor's policy module has converged; if not, the executor continues environment exploration and self-training; if it has converged, computing a planning sequence from the current environment feature vector and the target environment feature vector, and proceeding to S63;
S63, judging the values of the planning sequence and of the executor's policy against the executor's task target; if the planning sequence has the higher value, using it as a training set to train the executor's policy module; if the executor's policy value is higher than or equal to that of the planning sequence, performing the iterative computation again to optimize the planning sequence and repeating the value comparison, up to N repetitions; once the number of repetitions reaches N, proceeding to S64;
and S64, collecting the executor's environment feature vectors and their corresponding actions, using them as a training set to train the planner's policy module, and then returning to S61.
With this method, when the executor first enters a sparse-reward environment, it already holds a set of policy directions for executing the task, avoiding aimless exploration. Through the mutual training and competition of the planner and the executor, two different reward mechanisms serve a common goal, which can break the local optima of both sides. The vector representation of actions improves the planner's prediction and planning precision. When the joint training of the planner and the executor converges, training is considered finished, and an executor policy that performs the task target well is obtained.
Preferably, the vector representation of the action and the environment feature vector have the same dimension, which gives them a stronger intrinsic link.
Preferably, before being combined into the input of the environment feature predictor, the vector representation of the action and the environment feature vector are each normalized, which reduces the influence of numerical magnitude and helps build the fitted relation.
Detailed Description
The following detailed description further describes the invention so that those skilled in the art can, with reference to it, produce further embodiments; it is not intended to limit the scope of the invention to the particular embodiments described herein.
The technical scheme adopted by the invention is a target planning method for reinforcement learning, comprising the following steps:
S1, collecting a plurality of converged agents that share the same action space; computing, from the action sequences those agents produce when executing tasks, a vector representation of each action; assembling these representations into an action-to-vector dictionary; and then placing an executor to be trained, with the same action space, into a target training environment. The action vector representations are obtained by regarding the action sequences as text sequences and applying the word-vector embedding principle from NLP, for example word2vec or a similar method, which converts the actions into action vectors with intrinsic relations among them.
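As an illustration only, the mapping from action sequences to an action vector dictionary can be sketched as follows. The action names and traces are invented, and a co-occurrence matrix plus truncated SVD stands in for word2vec; any word-embedding method applied to the same "sentences" of actions would serve.

```python
import numpy as np

# Hypothetical action traces from converged agents, treated as sentences of tokens.
action_sequences = [
    ["move", "move", "grab", "lift", "place"],
    ["move", "grab", "lift", "move", "place"],
    ["move", "move", "move", "grab", "place"],
]

actions = sorted({a for seq in action_sequences for a in seq})
idx = {a: i for i, a in enumerate(actions)}

# Count co-occurrences within a +/-2 window, a simple stand-in for word2vec context.
window = 2
C = np.zeros((len(actions), len(actions)))
for seq in action_sequences:
    for i, a in enumerate(seq):
        for j in range(max(0, i - window), min(len(seq), i + window + 1)):
            if j != i:
                C[idx[a], idx[seq[j]]] += 1.0

# Truncated SVD of the log-smoothed counts yields dense action vectors; actions that
# appear in similar contexts end up with similar vectors.
U, S, _ = np.linalg.svd(np.log1p(C))
dim = 3  # embedding dimension; in the method this matches the environment feature dimension
action_vectors = {a: U[idx[a], :dim] * S[:dim] for a in actions}
```

The resulting `action_vectors` dictionary plays the role of the action-to-vector dictionary of S1.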
S2, extracting, through a feature extractor, an environment feature vector related to the action, to serve as the external input of the executor. The feature extraction uses the feature extractor together with an executor action predictor: the environment feature vectors output by the feature extractor in the current period and in the next period are combined into one vector and used as the input of the executor action predictor, and the difference between the action actually output by the executor in the current period and the action output by the executor action predictor is used as the loss function of both the feature extractor and the executor action predictor. This makes the environment features extracted by the feature extractor relate only to the action, ignores the parts of the environment that the agent cannot influence, and links the environment feature vector and the action vector internally, so that when the environment feature vector is used as the input of the environment feature predictor, the fitted relation between environment features and action vectors converges more easily.
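A minimal sketch of this loss, with single linear layers standing in for the feature extractor and the executor action predictor; all dimensions, weights, and the sample action are illustrative assumptions, and in practice both modules would be trainable networks as described above:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, feat_dim, n_actions = 16, 8, 4

# Feature extractor and executor action predictor as single linear layers (illustrative).
W_feat = rng.normal(scale=0.1, size=(obs_dim, feat_dim))
W_pred = rng.normal(scale=0.1, size=(2 * feat_dim, n_actions))

def extract(obs):
    # Environment observation -> environment feature vector.
    return np.tanh(obs @ W_feat)

def predict_action_logits(f_t, f_next):
    # Current and next-period features combined -> predicted action scores.
    return np.concatenate([f_t, f_next]) @ W_pred

obs_t, obs_next = rng.normal(size=obs_dim), rng.normal(size=obs_dim)
logits = predict_action_logits(extract(obs_t), extract(obs_next))

# The loss is the difference between the executor's actual action and the predicted
# one (cross-entropy here); minimizing it trains both modules, so the features keep
# only the action-relevant part of the environment.
actual_action = 2  # assumed index of the action the executor actually took
log_probs = logits - np.log(np.exp(logits).sum())
loss = -log_probs[actual_action]
```

Because the supervision signal is the executor's own action, parts of the environment the agent cannot influence contribute nothing to reducing this loss and are filtered out of the features.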
S3, combining the environment feature vector extracted in the current period in S2 with the vector representation of the action output and executed by the executor into one vector, using this vector as the input of an environment feature predictor, and computing with the environment feature predictor the environment feature vector of the next period;
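The predictor's input construction, including the per-vector normalization described later, can be sketched as follows; the dimensions, random weights, and single linear layer are illustrative assumptions standing in for a trained environment feature predictor:

```python
import numpy as np

rng = np.random.default_rng(1)
feat_dim, act_dim = 8, 8  # the method keeps action vectors and features the same size

W_fwd = rng.normal(scale=0.1, size=(feat_dim + act_dim, feat_dim))

def predict_next_feature(feat, action_vec):
    # Normalize each part, concatenate, and map to the next-period feature vector.
    feat = feat / (np.linalg.norm(feat) + 1e-8)
    action_vec = action_vec / (np.linalg.norm(action_vec) + 1e-8)
    return np.tanh(np.concatenate([feat, action_vec]) @ W_fwd)

f_next = predict_next_feature(rng.normal(size=feat_dim), rng.normal(size=act_dim))
```

Training this predictor against the feature vector actually observed in the next period would fit the relation between action vectors and environment features.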
S4, giving a target environment for the final state of the task, and obtaining a target environment feature vector through the feature extractor;
S5, based on the distance between the current environment feature vector and the target environment feature vector, and aiming to shorten that distance while reducing the number of iterations, performing iterative computation to obtain a planning sequence in which the iteratively obtained environment feature vectors correspond one-to-one with actions;
In S5, the process is specifically developed as follows:
S51, using the environment feature predictor as the environment function and a reinforcement-learning agent as the planner, the planner comprising a policy module and a value estimator, and constructing a data loop between the environment and the agent;
S52, feeding the current environment feature vector to the planner's policy module to obtain the action output of the policy module;
S53, converting the action from the planner's policy module into a vector representation via the action vector dictionary, merging it with the current environment feature vector, and feeding the result to the environment feature predictor to predict the next-period environment feature vector for planning; using that predicted vector as the new input of the planner's policy module; and iterating in turn to obtain a planning sequence;
and S54, judging the value of the sequence with the planner's value estimator, and updating the policy to optimize the planning sequence until convergence.
This method obtains a good planning sequence through iterative optimization based on the current environment feature vector and the target environment feature vector.
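The planning loop of S51 to S53 can be sketched as below. The random forward model and the greedy one-step lookahead are illustrative stand-ins: in the method itself, the policy module is trained by reinforcement learning and its output is judged by the value estimator (S54) rather than chosen by enumerating actions.

```python
import numpy as np

rng = np.random.default_rng(2)
feat_dim = 8
# Hypothetical action vector dictionary and forward-model weights.
action_vectors = {a: rng.normal(size=feat_dim) for a in ["move", "grab", "lift", "place"]}
W_fwd = rng.normal(scale=0.3, size=(2 * feat_dim, feat_dim))

def predict_next(feat, avec):
    # Stand-in for the trained environment feature predictor.
    return np.tanh(np.concatenate([feat, avec]) @ W_fwd)

def greedy_policy(feat, target):
    # Stand-in for the planner's policy module: pick the action whose predicted
    # next feature lies closest to the target feature.
    return min(action_vectors,
               key=lambda a: np.linalg.norm(predict_next(feat, action_vectors[a]) - target))

feat = rng.normal(size=feat_dim)              # current environment feature vector
target = np.tanh(rng.normal(size=feat_dim))   # target environment feature vector

plan = []
for _ in range(10):                           # bounded number of planning iterations
    a = greedy_policy(feat, target)
    feat = predict_next(feat, action_vectors[a])  # predicted feature becomes the new input
    plan.append((a, feat))
    if np.linalg.norm(feat - target) < 0.5:   # close enough to the target: stop
        break
```

Each entry of `plan` pairs an action with the predicted environment feature vector it leads to, which is the one-to-one correspondence required by S5.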
S6, using the planning sequence as a training set to carry out planning training on the executor.
In S6, the process is specifically developed as follows:
S61, the executor being a reinforcement-learning agent that comprises a policy module and a value estimator and can explore the environment and train itself: judging whether the current executor has started environment exploration and self-training; if not, obtaining an initial planning sequence from the initial state and the given target at the time the executor is placed into the training environment, training the executor's policy module on it, and then entering the exploration state; if so, training the executor's policy module without using a planning sequence, and proceeding to S62;
S62, judging whether the current executor's policy module has converged; if not, the executor continues environment exploration and self-training; if it has converged, computing a planning sequence from the current environment feature vector and the target environment feature vector, and proceeding to S63;
S63, judging the values of the planning sequence and of the executor's policy against the executor's task target; if the planning sequence has the higher value, using it as a training set to train the executor's policy module; if the executor's policy value is higher than or equal to that of the planning sequence, performing the iterative computation again to optimize the planning sequence and repeating the value comparison, up to N repetitions; once the number of repetitions reaches N, proceeding to S64;
and S64, collecting the executor's environment feature vectors and their corresponding actions, using them as a training set to train the planner's policy module, and then returning to S61.
With this method, when the executor first enters a sparse-reward environment, it already holds a set of policy directions for executing the task, avoiding aimless exploration. Through the mutual training and competition of the planner and the executor, two different reward mechanisms serve a common goal, which can break the local optima of both sides. The vector representation of actions improves the planner's prediction and planning precision. When the joint training of the planner and the executor converges, training is considered finished, and an executor policy that performs the task target well is obtained.
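The alternation of S61 to S64 can be summarized as control flow. Every helper below (`plan`, `value_of`, the convergence test) is a random or opaque placeholder, so the sketch shows only the branching logic, not real training:

```python
import random

random.seed(0)

def plan(tag):
    # Placeholder planner: returns an opaque "plan" token.
    return f"plan-from-{tag}"

def value_of(thing):
    # Placeholder value estimate.
    return random.random()

def joint_training(max_rounds=20, N=3):
    explored = False
    log = []
    for _ in range(max_rounds):
        if not explored:                       # S61: bootstrap from an initial plan
            log.append("train executor on initial plan")
            explored = True
            continue
        log.append("executor explores and self-trains")  # S61, already-exploring branch
        if random.random() < 0.5:              # S62: policy not yet converged
            continue
        p = plan("current-feature")            # S62: converged, so compute a plan
        for _ in range(N):                     # S63: compare plan vs policy value
            if value_of(p) > value_of("executor-policy"):
                log.append("train executor on plan")
                break
            p = plan("current-feature")        # plan lost: re-plan and compare again
        else:                                  # S64: plan lost N comparisons in a row
            log.append("train planner on executor feature-action pairs")
    return log

training_log = joint_training()
```

The `for`/`else` in the middle mirrors S63 and S64: the executor is retrained only when a plan wins the value comparison, and after N losing plans it is the planner that gets retrained from the executor's feature-action pairs.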
In the above, each action in the action sequence is expressed as a vector, which gives each action a basic meaning and establishes similarity relations between actions. Actions no longer exist in isolation: when performing goal planning, the agent can read the relations between actions directly instead of rediscovering them through massive exploration, which benefits multi-goal learning across multiple agents. Moreover, the action vectors only need to be obtained from action sequences produced on a simple basic task, without solving the optimal-policy problem in a complex state space, and they can be reused in any setting with the same action space.
In the above, the feature extractor extracts environment features related to the action, so the input environment is linked to the action, and each element position in an action's vector can be regarded as an influence factor on some feature of the environment. On this basis, the environment feature predictor fits the relation between action vectors and environment feature vectors; it can thus internalize the action vectors and, from each element's contribution to the environment features, accurately predict the environment features of the next period. If the raw action instruction were used directly as input, the predictor would have to decompose the instruction further and could not learn the influence of actions on the environment as well.
In the above, on the basis of accurately predicting the next period's environment features, the rapid shortening of the distance between the target environment features and the current environment features is taken as the optimization target, and a reinforcement learning environment and a planning agent are reconstructed: the current environment feature vector is fed to the agent to obtain a planned action; the planned action and the current environment feature vector are combined as the input of the environment feature predictor, yielding a predicted environment feature vector; the predicted vector is then fed back to the agent as its next input; and the iteration repeats until a planning sequence with few actions that reaches the target environment features is obtained. This converts the sparse reward into a distance problem between the current environment and the target environment, and because the internal relations among actions are built in through vectorizing the action sequences, the distance is computed more accurately. It thereby solves the prior-art problems that accurate prediction is difficult and that sparse rewards cannot be well converted into dense, distance-based optimization.
The vector representation of the action and the environment feature vector have the same dimension, which gives them a stronger intrinsic link.
Before being combined into the input of the environment feature predictor, the vector representation of the action and the environment feature vector are each normalized, which reduces the influence of numerical magnitude and helps build the fitted relation.
The predictor, the policy module, and the value estimator of the invention can each be implemented with a BP neural network, a recurrent neural network, a Transformer, a graph neural network, or the like; a period in this invention may be a moment, a time span, a sequence, or the like.