CN115688858A - Fine-grained expert behavior imitation learning method, device, medium and terminal - Google Patents

Fine-grained expert behavior imitation learning method, device, medium and terminal

Info

Publication number
CN115688858A
Authority
CN
China
Prior art keywords
network model
information
prediction network
value
reward
Prior art date
Legal status
Granted
Application number
CN202211285500.5A
Other languages
Chinese (zh)
Other versions
CN115688858B (en)
Inventor
漆舒汉
孙志航
殷俊
黄新昊
万乐
王轩
张加佳
王强
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211285500.5A
Publication of CN115688858A
Application granted
Publication of CN115688858B
Legal status: Active


Abstract

The invention discloses a fine-grained expert behavior imitation learning method, a device, a medium and a terminal. The method comprises: obtaining current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion information and state information of the current actions; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; and training the preset prediction network model according to the single reward value and the task reward value. The method reduces the training difficulty, improves the training efficiency, and can learn a strategy close to the expert behavior pattern in a high-dimensional state and action space without collecting a large amount of expert data.

Description

Fine-grained expert behavior imitation learning method, device, medium and terminal
Technical Field
The invention relates to the field of imitation learning, and in particular to a fine-grained expert behavior imitation learning method, device, medium and terminal.
Background
Existing imitation learning mostly adopts behavior cloning methods and inverse reinforcement learning methods. A behavior cloning method can learn the mapping relation from expert states to expert actions, but it is difficult to learn this mapping directly from a high-dimensional space, and problems of distribution drift and compounding errors can be encountered in the environment of an incomplete-information three-dimensional video game. An inverse reinforcement learning method involves two reinforcement learning processes and therefore generally suffers from high training difficulty, low efficiency and instability. In addition, both methods often need a large amount of expert data to obtain relatively good results, and it is difficult to collect a large amount of high-quality expert data.
Disclosure of Invention
In view of the defects of the prior art, the present application aims to provide a fine-grained expert behavior imitation learning method, apparatus, medium and terminal, so as to solve the problems that learning is very difficult when a traditional imitation learning method imitates directly from a high-dimensional state and action space, and that the finally obtained strategy deviates greatly from the expert strategy.
In order to solve the above technical problem, a first aspect of the embodiments of the present application provides a fine-grained expert behavior imitation learning method, where the method includes:
acquiring current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current action state information;
calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
As a further improved technical solution, the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
As a further improved technical solution, before the current environment state information of the agent is obtained, expert decision data are obtained in advance.
As a further improvement, the acquiring current environmental state information of the agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current state information of the actions includes:
acquiring current environment state information of an agent, and inputting the current environment state information into the operation prediction network model to obtain the prediction information, wherein the current environment state information comprises coordinate information, angle information and posture information, and the prediction information is action probability distribution;
selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding actions according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and acquiring task completion information and the current state information of the action.
As a further improved technical solution, the calculating of a single reward value according to the state information of the action and the calculating of a task reward value according to the task completion information includes:
comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein the comparison of the state information of the action with the expert decision data is to compare an action key frame in the state information of the action with an expert key frame in the expert decision data;
and calculating a task reward value according to the task completion condition information.
As a further improved technical solution, the training of a preset prediction network model according to the single reward value and the task reward value, adding the task reward value to a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model includes:
initializing the agent to a random sampling state;
training the preset prediction network model according to the single reward value and the task reward value in a curriculum learning manner, adding a plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model;
and returning the strategy output by the trained prediction network model.
As a further improved technical solution, the training of the preset prediction network model in a curriculum learning manner according to the single reward value and the task reward value, the adding of a plurality of single imitation reward values to the task reward value to obtain a total reward value, and, when the total reward value is greater than a threshold value, the completing of the training of the preset prediction network model to obtain the trained prediction network model comprises the following steps:
selecting the expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
and selecting the expert decision data of the cumulative time period to repeat the process of imitation learning training on the preset prediction network model and judging whether to train again, until the cumulative time period equals a full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than an episode reward threshold value, whereupon the training of the preset prediction network model is completed and the trained prediction network model is obtained.
A second aspect of the embodiments of the present application provides a fine-grained expert behavior imitation learning device, including:
the information acquisition module, which is used for acquiring current environment state information of the agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current state information of the actions;
the reward value calculation module, which is used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion information;
and the model training module, which is used for training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
A third aspect of embodiments of the present application provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the fine-grained expert behavioral imitation learning method as described in any above.
A fourth aspect of the embodiments of the present application provides a terminal device, including: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes the connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the fine-grained expert behavior imitation learning method as described in any one of the above.
Beneficial effects: compared with the prior art, the fine-grained expert behavior imitation learning method comprises: obtaining current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion information and current state information of the actions; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
Drawings
FIG. 1 is a flow chart of a fine-grained expert behavior imitation learning method of the present invention.
Fig. 2 is a schematic structural diagram of a terminal device provided in the present invention.
Fig. 3 is a block diagram of the apparatus provided by the present invention.
FIG. 4 is a diagram of the fine-grained expert behavior imitation learning algorithm provided by the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present application are given in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
The inventor finds that the prior art has the following problems through research:
(1) Game agents are usually trained to learn strategies through imitation learning using two classes of methods: behavior cloning and imitation learning based on inverse reinforcement learning. A behavior cloning algorithm is a supervised learning method: the state given by the environment is used as the feature, the actions the agent can execute are used as the labels, the method tries to minimize the action difference between the agent strategy and the expert strategy, and the imitation learning task is thereby reduced to an ordinary regression or classification task. Imitation learning based on inverse reinforcement learning divides the imitation learning process into two sub-processes, inverse reinforcement learning and reinforcement learning, which are iterated repeatedly: inverse reinforcement learning is used to infer a reward function that fits the expert decision data, and reinforcement learning learns a strategy based on that reward function. Generative adversarial imitation learning developed from imitation learning based on inverse reinforcement learning; it is characterized by using an adversarial network framework to solve the imitation learning problem and can be extended to practical applications.
The behavior cloning method learns the mapping relation from the expert state to the expert action, but in the environment of an incomplete-information three-dimensional video game it is difficult to learn this mapping directly from a high-dimensional space, and problems of distribution drift and compounding errors can be encountered. The inverse reinforcement learning method involves two reinforcement learning processes and generally suffers from high training difficulty, low efficiency and instability. In addition, both methods often require a large amount of expert data to obtain relatively good results, and collecting a large amount of high-quality expert data is often difficult.
In order to solve the above problems, various non-limiting embodiments of the present application will be described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the fine-grained expert behavior imitation learning method provided in the embodiment of the present application includes the following steps:
s1, acquiring current environment state information of an intelligent agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current action state information;
the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
Specifically, the parameters of the intermediate layers of the operation prediction network model need to be trained with the corresponding deep reinforcement learning strategy. For example, the encoder of the operation prediction network model takes the current game state information as input, including the position, moving direction and other information of each agent; the input dimension of the encoder can be set to 96 and its output dimension to 256; the input dimension of the decoder can be set to 256, the query vector dimension to 64, and the number of attention heads to 4. The optimizer of the operation prediction network model can use the Adam optimizer with a learning rate of 0.001, the Gaussian noise variance can be set to 0.1, and the discount factor to 0.9. A multi-process method can be used to distribute the environment over 32 processes to accelerate the training of the whole operation prediction network model.
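For illustration, a minimal PyTorch sketch of such an operation prediction network with the dimensions quoted above (96-dimensional state input, 256-dimensional encoder output, 4 attention heads giving 64-dimensional per-head queries, Adam with learning rate 0.001) is given below; the class name, the layer layout and the number of actions are assumptions of the sketch and are not specified by the description.

import torch
import torch.nn as nn

class OperationPredictionNet(nn.Module):
    """Sketch of an operation prediction network: encoder -> attention decoder -> action logits.
    Only the dimensions follow the example settings above; everything else is illustrative."""
    def __init__(self, state_dim=96, hidden_dim=256, num_heads=4, num_actions=12):
        super().__init__()
        # Encoder maps the 96-dimensional environment state to a 256-dimensional feature.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Decoder attends over the encoded feature with 4 heads (256 / 4 = 64-dimensional queries per head).
        self.decoder = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, state):
        h = self.encoder(state).unsqueeze(1)            # (batch, 1, 256)
        attended, _ = self.decoder(h, h, h)             # self-attention over the encoded feature
        logits = self.policy_head(attended.squeeze(1))  # unnormalised action scores
        return torch.distributions.Categorical(logits=logits)

# Optimizer settings mentioned above (Adam, learning rate 0.001).
model = OperationPredictionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)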
Before the current environment state information of the agent is obtained, expert decision data are obtained in advance.
Specifically, a human expert plays the game normally, making decisions from the obtained state information and taking actions; the generated state-action pair information is the expert decision data. The scheme needs at least one complete piece of expert decision data to imitate: the state at each moment in the game and the action corresponding to that moment are integrated into one complete piece of expert data. The expert decision data reflect, to a certain extent, the human strategy for playing the game and guide the agent in the subsequent learning process. The state information obtained may differ depending on the game environment, but the coordinate information and the angle information are the most basic requirements; with at least these two kinds of information, subsequent learning can be carried out.
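For concreteness, one complete piece of expert decision data can be held as an ordered sequence of per-step state-action records, as in the following sketch; the field names and types are illustrative assumptions rather than part of the original description.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExpertStep:
    """One time step of expert decision data: the observed state and the action the expert took."""
    position: Tuple[float, float, float]  # coordinate information (x, y, z)
    angle: float                          # facing direction
    posture: int                          # e.g. 0=standing, 1=squatting, 2=running, 3=leaning (illustrative encoding)
    action: int                           # the action the expert executed in this state

# A complete expert trajectory is simply the sequence of per-step records for one full game.
ExpertTrajectory = List[ExpertStep]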
The obtaining of the current environment state information of the agent, the inputting of the current environment state information into a preset prediction network model to obtain prediction information, the controlling of the agent to execute corresponding actions according to the prediction information, and the acquiring of task completion information and the current action state information comprises the following steps:
s101, obtaining current environment state information of an agent, and inputting the current environment state information into the operation prediction network model to obtain the prediction information, wherein the current environment state information comprises coordinate information, angle information and posture information, and the prediction information is action probability distribution;
s102, selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding action according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and S103, acquiring task completion condition information and the current state information of the action.
Specifically, the current environment state information of the agent in the environment at the current moment is acquired. The environment may be a game environment, and the current environment state information includes the coordinate information, angle information and posture information of the agent in the environment. The types of information acquired correspond to the acquired expert decision data: if the expert decision data contain only coordinate information and angle information, the current environment state information only needs to include coordinate information and angle information;
The current environment state information is then input into the preset prediction network model, which outputs prediction information in the form of an action probability distribution, i.e. the probability distribution over the actions the agent can execute in the current state. The agent samples one piece of action operation information according to this probability distribution, together with the probability corresponding to that action operation information, and is controlled to execute the corresponding action in the environment through the selected action operation information. The action executed by the agent is collected, the current environment state information of the agent is updated, and the task completion information is collected; the task completion information includes information related to a specific task, for example whether a certain task has been completed, and the task can be preset.
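A minimal sketch of this single interaction step is given below, assuming the prediction network returns a categorical action distribution; the environment interface (get_state, step) and the helper name are hypothetical placeholders, not part of the original disclosure.

import torch

def act_once(model, env):
    """Query the prediction network for an action distribution, sample one action,
    execute it, and collect the updated state and task-completion information."""
    state = torch.as_tensor(env.get_state(), dtype=torch.float32).unsqueeze(0)
    dist = model(state)                # action probability distribution output by the prediction network
    action = dist.sample()             # sample one action according to the probabilities
    log_prob = dist.log_prob(action)   # (log) probability of the chosen action, kept for later training
    next_state, task_info = env.step(action.item())  # hypothetical environment step returning state and task info
    return action.item(), log_prob, next_state, task_info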
S2, calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
Specifically, the imitation reward function mainly considers the behavior imitation reward and the task reward: the single reward value is calculated to encourage the agent to imitate the actions and trajectory of the expert, and the task reward value is calculated to encourage the agent to complete the set task. The two parts of the reward are each given a weight and are added to form the total reward value.
Wherein, the calculation of the single reward value according to the state information of the action and the calculation of the task reward value according to the task completion condition information comprises the following steps:
s201, comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein the comparison of the state information of the action with the expert decision data is to compare an action key frame in the state information of the action with an expert key frame in the expert decision data;
and S202, calculating a task reward value according to the task completion condition information.
Specifically, the behavior imitation reward is divided into several parts, covering the position, speed, angle and the various actions taken by the expert, and mainly encourages the agent to move to where the expert is and to make actions similar to the expert's. The rewards of these parts are then multiplied to form the final single reward value, which guides the agent to adopt the same action pattern as the expert. The task reward is usually set specifically according to the concrete task scenario and is used to guide the agent to complete the set task;
In this embodiment, a key-frame alignment method is used to calculate the single imitation reward value: the moments at which the expert makes key actions are selected as key frames, and the agent is encouraged to take actions consistent with the expert's at these key frames, so that its behavior is as close as possible to the expert's at those moments. Compared with frame-by-frame alignment, key-frame alignment preserves the diversity of the strategy, so that the agent's actions are not completely identical to the expert's, which is helpful for further application.
Taking a game as an example, a position reward r_l, a speed reward r_v, an angle reward r_r and a posture reward r_p are designed in the game scene to encourage the agent, from multiple aspects, to approach the target positions of the expert trajectory and take actions similar to the expert's, thereby producing a behavior pattern similar to the expert's. The specific function forms are:

r_l = exp(w_l * Σ_{i=1..3} |l_agent^i − l_expert^i|)

r_v = exp(w_v * |v_agent − v_expert|)

r_r = exp(w_r * |r_agent − r_expert|)

r_p = exp(w_p * |p_agent − p_expert|)

where w_l, w_v, w_r, w_p respectively represent the weight of each reward; l_agent^i and l_expert^i with i = 1, 2, 3 respectively represent the position information of the agent and the expert along the x, y and z coordinate axes in the three-dimensional environment; v_agent, v_expert and r_agent, r_expert represent the velocity information and the angle information of the agent and the expert; and p_agent, p_expert represent the posture information of the agent and the expert, including standing, squatting, running, leaning, etc. The behavior imitation reward function is:

r_ref = r_l * r_v * r_r * r_p

The task imitation reward function is:

r_task = w_a * Σ_i c_i

where w_a represents the weight and c_i indicates whether the i-th task-related action made by the agent is consistent with the expert's. The total reward function combining the behavior imitation reward and the task imitation reward is:

r = w_ref * r_ref + w_task * r_task

where w_ref and w_task are respectively the weights of the behavior imitation reward item and the task-related reward item.
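A sketch of computing these rewards from per-step agent and expert information follows; the weight values, the dictionary keys and the sign convention (negative weights so that each factor decays as the agent deviates from the expert) are illustrative assumptions, not values given in the description.

import numpy as np

def imitation_reward(agent, expert, w_l=-1.0, w_v=-0.5, w_r=-0.5, w_p=-0.5):
    """Behavior imitation reward r_ref = r_l * r_v * r_r * r_p.
    Negative weights (an assumption) make each factor decay with the deviation from the expert."""
    r_l = np.exp(w_l * np.abs(np.asarray(agent["pos"]) - np.asarray(expert["pos"])).sum())
    r_v = np.exp(w_v * abs(agent["vel"] - expert["vel"]))
    r_r = np.exp(w_r * abs(agent["angle"] - expert["angle"]))
    r_p = np.exp(w_p * abs(agent["posture"] - expert["posture"]))  # posture encoded as a number (illustrative)
    return r_l * r_v * r_r * r_p

def task_reward(agent_actions, expert_actions, w_a=1.0):
    """Task imitation reward: w_a times the number of task-related actions consistent with the expert."""
    return w_a * sum(int(a == e) for a, e in zip(agent_actions, expert_actions))

def total_reward(r_ref, r_task, w_ref=0.7, w_task=0.3):
    """Total reward r = w_ref * r_ref + w_task * r_task (weights are illustrative)."""
    return w_ref * r_ref + w_task * r_task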
And S3, training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
Specifically, the preset prediction network model is trained according to the single reward value, and its parameters are updated in each training. In each round, according to the designed imitation reward function, the corresponding single reward value can be calculated from the state information and the corresponding expert state information. At each moment, the decision information is stored in the memory pool of the model and used for later training, so that a strategy close to the expert's can be learned. The decision information includes the state the agent obtains from the environment at a certain moment, the action operation information the agent samples according to the probability distribution, the probability of selecting that action operation information, and the single reward value.
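A minimal sketch of such a memory pool is given below; it stores the four items of decision information listed above, while the class and method names and the capacity are illustrative assumptions.

from collections import deque

class MemoryPool:
    """Stores, for each time step, the state obtained from the environment, the sampled action,
    the (log) probability with which it was chosen, and the single reward value."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, log_prob, reward):
        self.buffer.append((state, action, log_prob, reward))

    def sample_all(self):
        # Return everything collected in this round of interaction for a training update.
        return list(self.buffer)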
The training of a preset prediction network model according to the single reward value and the task reward value, the adding of the task reward value and a plurality of single reward values of each episode to obtain a total reward value, the completing of the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and the returning of the strategy output by the trained prediction network model comprise the following steps:
s301, initializing the agent to a random sampling state;
S302, training the preset prediction network model according to the single reward value and the task reward value in a curriculum learning manner, adding a plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain the trained prediction network model;
and S303, returning the strategy output by the trained prediction network model.
The training of the preset prediction network model in a curriculum learning manner according to the single reward value and the task reward value, the adding of a plurality of single imitation reward values and the task reward value to obtain a total reward value, and the completing of the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain the trained prediction network model comprise the following steps:
S3021, selecting the expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
S3022, judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
S3023, selecting the expert decision data of the cumulative time period to repeat the process of imitation learning training on the preset prediction network model and judging whether to train again, until the cumulative time period equals the full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than an episode reward threshold value, whereupon the training of the preset prediction network model is completed and the trained prediction network model is obtained.
Specifically, in one training the agent needs to interact with the environment for multiple rounds, and training is performed according to the data obtained from the interaction. At the start of each round the agent obtains an initial state, including its coordinates, facing direction and so on, and continues the subsequent interaction from that state. If the agent's initial state were the same in every round, it might be difficult to learn the subsequent trajectory. Therefore, the agent is initialized to a randomly sampled state: at the start of each round a state is randomly sampled from the expert decision data and taken as the agent's initial state, so that the agent starts from a state on or near the trajectory it needs to imitate.
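A sketch of this reference-state initialization is given below, assuming the environment exposes a reset_to_state interface and that expert steps carry position, angle and posture fields; both interfaces are assumptions of the sketch.

import random

def reset_from_expert(env, expert_trajectory):
    """Pick a random time step from the expert decision data and start the round from that state,
    so the agent does not always begin from the same initial state."""
    start = random.randrange(len(expert_trajectory))
    step = expert_trajectory[start]
    env.reset_to_state(position=step.position, angle=step.angle, posture=step.posture)
    return start  # index of the expert frame the agent should imitate from here on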
In curriculum learning, the curriculum is mainly set in an inheritance manner: through inheritance, the capability of the agent on short sequences is continuously built upon, and the curriculum tasks are arranged from simple to difficult, progressing from imitating the trajectory generated by the expert to imitating the actions taken by the expert. Specifically, the expert decision data within a preset time period are selected to perform imitation learning training on the preset prediction network model, and all single imitation reward values within the preset time period are added to obtain a single-segment imitation reward value. For example: a 10-second segment of the expert decision trajectory is used for imitation learning training of the preset prediction network model, one round of training lasts 10 seconds, and all single imitation reward values in one round are added to obtain the single-segment imitation reward value;
Further, whether to train again is judged. If the single-segment imitation reward value is smaller than the single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model needs to be repeated. If the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
Further, the expert decision data of the cumulative time period are selected to repeat the imitation learning training of the preset prediction network model and the judgment of whether to train again, until the cumulative time period equals the full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than the episode reward threshold value; the training of the preset prediction network model is then completed and the trained prediction network model is obtained. For example: if the expert takes 100 seconds to play a complete game, 100 seconds of expert decision trajectory need to be collected, so the total expert decision trajectory for the game lasts 100 seconds, and a complete game played by the agent also takes 100 seconds, i.e. the full-episode time period equals 100 seconds. During training, the 100-second expert decision trajectory can be divided into 10 segments of 10 seconds each; in the next training stage, a new 10-second expert decision trajectory segment is added on the basis of the original 10-second segment, so that the cumulative time period becomes 20 seconds, and training continues on the basis of the model trained in the previous stage;
Further, the above steps are repeated. After multiple training stages, the length of the expert trajectory the model can imitate keeps increasing and the cumulative time period keeps accumulating, until the cumulative time period equals the full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than the episode reward threshold value; the training of the preset prediction network model is then completed and the curriculum learning ends. Here, the sum of all single imitation reward values within the full-episode time period is the value obtained by adding all single imitation reward values in the full-episode time period each time the agent completes a whole game, and the time the agent takes to complete the whole game equals the time the imitated expert takes to complete the whole game;
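The curriculum described above can be sketched as the following loop; the train_segment helper, the threshold values and the 10-second/100-second horizons reused from the example are illustrative assumptions rather than the original implementation.

def curriculum_training(train_segment, total_seconds=100, step_seconds=10,
                        segment_threshold=50.0, episode_threshold=500.0):
    """Inheritance-style curriculum: each stage reuses the model trained on the previous,
    shorter expert segment and adds a new time period once the current one is passed."""
    horizon = step_seconds
    while True:
        # Train on expert decision data covering [0, horizon) seconds; train_segment is assumed to
        # return the summed single imitation rewards over the segment and the task reward value.
        segment_reward, task_value = train_segment(horizon)
        if segment_reward <= segment_threshold:
            continue  # segment not passed (or terminated early): repeat training on the same segment
        if horizon >= total_seconds and segment_reward + task_value > episode_threshold:
            break     # whole game imitated and the episode-level total reward clears the threshold
        horizon = min(horizon + step_seconds, total_seconds)  # add a new time period to the curriculum
    return horizon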
Further, by setting an early-termination condition, the problem of invalid exploration can be alleviated and the training time shortened. If the agent gets stuck in a certain state during training and cannot successfully learn the target action, the episode needs to be terminated early to avoid the waste of resources caused by continuing the imitation. The early-termination condition for one round is that the agent cannot, or can hardly, continue to advance along the trajectory, which can be judged from the coordinate information in the state information and is divided into two cases: (1) the agent stays in one place or moves back and forth within a small range; (2) the agent's route deviates too much from the trajectory. In either case the agent can be considered stuck, and the round can be ended directly.
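A sketch of checking these two early-termination conditions from the coordinate information is given below; the radius, deviation limit and window length are illustrative assumptions.

import numpy as np

def should_terminate_early(recent_positions, expert_position,
                           stuck_radius=1.0, max_deviation=10.0, window=50):
    """Return True if the round should be ended early:
    (1) the agent has stayed within a small radius over the last `window` steps, or
    (2) the agent's current position deviates too far from the expert trajectory."""
    if not recent_positions:
        return False
    if len(recent_positions) >= window:
        recent = np.asarray(recent_positions[-window:])
        if np.linalg.norm(recent - recent.mean(axis=0), axis=1).max() < stuck_radius:
            return True  # stuck in place or moving back and forth in a small range
    deviation = np.linalg.norm(np.asarray(recent_positions[-1]) - np.asarray(expert_position))
    return deviation > max_deviation  # route deviates too much from the expert trajectory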
Based on the above fine-grained expert behavior imitation learning method, this embodiment provides a fine-grained expert behavior imitation learning device, which includes:
the intelligent agent comprises an information acquisition module 1, a task execution module and a task execution module, wherein the information acquisition module 1 is used for acquiring current environment state information of the intelligent agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current state information of the actions;
the reward value calculation module 2 is used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion condition information;
and the model training module 3, which is used for training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
In addition, it is worth noting that the working process of the fine-grained expert behavior imitation learning device provided in this embodiment is the same as that of the fine-grained expert behavior imitation learning method; for details, reference may be made to the working process of the method, which is not repeated here.
Based on the fine-grained expert behavior imitation learning method described above, the present embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors, to implement the steps in the fine-grained expert behavior imitation learning method described in the above embodiment.
As shown in fig. 2, based on the fine-grained expert behavior imitation learning method, the present application further provides a terminal device, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, and may further include a communication Interface (Communications Interface) 23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. Processor 20 may call logic instructions in memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional applications and data processing, i.e. implements the methods in the above embodiments, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high speed random access memory and may also include a non-volatile memory. For example, a variety of media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, may also be transient storage media.
Compared with the prior art, the fine-grained expert behavior imitation learning method comprises: obtaining current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion information and current state information of the actions; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.
Naturally, the above embodiments of the present invention are described in detail, but it should not be understood that the scope of the present invention is limited thereto, and other various embodiments of the present invention can be obtained by those skilled in the art without any creative work based on the embodiments, and the scope of the present invention is subject to the appended claims.

Claims (10)

1. A fine-grained expert behavior imitation learning method, characterized by comprising the following steps:
acquiring current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current action state information;
calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
2. The fine-grained expert behavior imitation learning method according to claim 1, wherein the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
3. The fine-grained expert behavior imitation learning method of claim 2, wherein obtaining the current environmental state information of the agent is preceded by obtaining expert decision data in advance.
4. The fine-grained expert behavior imitation learning method according to claim 3, wherein the obtaining current environmental state information of an agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion condition information and current state information of the actions comprises:
acquiring current environment state information of an agent, and inputting the current environment state information into the operation prediction network model to obtain the prediction information, wherein the current environment state information comprises coordinate information, angle information and posture information, and the prediction information is action probability distribution;
selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding actions according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and acquiring task completion information and the current state information of the action.
5. The fine-grained expert behavior imitation learning method according to claim 4, wherein the calculating of the single reward value according to the state information of the action and the calculating of the task reward value according to the task completion information comprises:
comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein the comparison of the state information of the action with the expert decision data is to compare an action key frame in the state information of the action with an expert key frame in the expert decision data;
and calculating a task reward value according to the task completion condition information.
6. The fine-grained expert behavior imitation learning method according to claim 5, wherein the training of a preset prediction network model according to the single reward value and the task reward value, the adding of the task reward value and a plurality of single reward values of each episode to obtain a total reward value, the completing of the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and the returning of the strategy output by the trained prediction network model comprises:
initializing the agent to a random sampling state;
training the preset prediction network model according to the single reward value and the task reward value in a curriculum learning manner, adding a plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain the trained prediction network model;
and returning the strategy output by the trained prediction network model.
7. The fine-grained expert behavior imitation learning method according to claim 6, wherein the training of the preset prediction network model in a curriculum learning manner according to the single reward value and the task reward value, the adding of a plurality of single imitation reward values and the task reward value to obtain a total reward value, and, when the total reward value is greater than a threshold value, the completing of the training of the preset prediction network model to obtain the trained prediction network model comprises:
selecting the expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
and selecting the expert decision data of the cumulative time period to repeat the process of imitation learning training on the preset prediction network model and judging whether to train again, until the cumulative time period equals a full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than an episode reward threshold value, whereupon the training of the preset prediction network model is completed and the trained prediction network model is obtained.
8. A fine-grained expert behavior imitation learning device, comprising:
the information acquisition module, which is used for acquiring current environment state information of the agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current state information of the actions;
the reward value calculation module is used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion condition information;
and the model training module, which is used for training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
9. A computer readable storage medium, storing one or more programs, the one or more programs being executable by one or more processors to perform the steps in the fine-grained expert behavioral imitation learning method of any one of claims 1-7.
10. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes the connection communication between the processor and the memory;
the processor when executing the computer readable program performs the steps in the fine-grained expert behavior imitation learning method of any of claims 1-7.
CN202211285500.5A 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal Active CN115688858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211285500.5A CN115688858B (en) 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal


Publications (2)

Publication Number Publication Date
CN115688858A true CN115688858A (en) 2023-02-03
CN115688858B CN115688858B (en) 2024-02-09

Family

ID=85066632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211285500.5A Active CN115688858B (en) 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal

Country Status (1)

Country Link
CN (1) CN115688858B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
WO2021184530A1 (en) * 2020-03-18 2021-09-23 清华大学 Reinforcement learning-based label-free six-dimensional item attitude prediction method and device
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN114048834A (en) * 2021-11-05 2022-02-15 哈尔滨工业大学(深圳) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion
CN114307160A (en) * 2021-12-10 2022-04-12 腾讯科技(深圳)有限公司 Method for training intelligent agent


Also Published As

Publication number Publication date
CN115688858B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN111488988B (en) Control strategy simulation learning method and device based on counterstudy
Knox et al. Tamer: Training an agent manually via evaluative reinforcement
Loiacono et al. The 2009 simulated car racing championship
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN109523029A (en) For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN109063823B (en) Batch A3C reinforcement learning method for exploring 3D maze by intelligent agent
CN109978012A (en) It is a kind of based on combine the improvement Bayes of feedback against intensified learning method
CN110516389B (en) Behavior control strategy learning method, device, equipment and storage medium
Efthymiadis et al. Using plan-based reward shaping to learn strategies in starcraft: Broodwar
CN110390399A (en) A kind of efficient heuristic approach of intensified learning
CN113379027A (en) Method, system, storage medium and application for generating confrontation interactive simulation learning
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
Hafez et al. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination
CN114404975B (en) Training method, device, equipment, storage medium and program product of decision model
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN114861368A (en) Method for constructing railway longitudinal section design learning model based on near-end strategy
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN115688858A (en) Fine-grained expert behavior simulation learning method, device, medium and terminal
CN116992928A (en) Multi-agent reinforcement learning method for fair self-adaptive traffic signal control
CN116540535A (en) Progressive strategy migration method based on self-adaptive dynamics model
CN116047902A (en) Method, device, equipment and storage medium for navigating robots in crowd
CN116306947A (en) Multi-agent decision method based on Monte Carlo tree exploration
CN112884129B (en) Multi-step rule extraction method, device and storage medium based on teaching data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant