CN115688858A - Fine-grained expert behavior imitation learning method, device, medium and terminal - Google Patents

Fine-grained expert behavior imitation learning method, device, medium and terminal

Info

Publication number
CN115688858A
Authority
CN
China
Prior art keywords
network model
information
prediction network
value
reward
Prior art date
Legal status
Granted
Application number
CN202211285500.5A
Other languages
Chinese (zh)
Other versions
CN115688858B (en)
Inventor
漆舒汉
孙志航
殷俊
黄新昊
万乐
王轩
张加佳
王强
Current Assignee
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211285500.5A
Publication of CN115688858A
Application granted
Publication of CN115688858B
Legal status: Active


Abstract

The invention discloses a fine-grained expert behavior imitation learning method, a device, a medium and a terminal. The method comprises: obtaining current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion information and state information of the current actions; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; and training the preset prediction network model according to the single reward value and the task reward value. The method reduces the training difficulty, improves the training efficiency, and can learn a strategy close to the expert behavior pattern in a high-dimensional state and action space without collecting a large amount of expert data.

Description

Fine-grained expert behavior imitation learning method, device, medium and terminal
Technical Field
The invention relates to the field of imitation learning, and in particular to a fine-grained expert behavior imitation learning method, device, medium and terminal.
Background
Existing imitation learning mostly adopts behavior cloning methods and inverse reinforcement learning methods. A behavior cloning method can learn the mapping relation from expert states to expert actions, but it is difficult to learn this mapping directly from a high-dimensional space, and problems of distribution drift and compounding errors can be encountered in the environment of an incomplete-information three-dimensional video game. An inverse reinforcement learning method involves two reinforcement learning processes and therefore generally suffers from high training difficulty, low efficiency and instability. In addition, both methods often need a large amount of expert data to obtain relatively good results, and it is difficult to collect a large amount of high-quality expert data.
Disclosure of Invention
In view of the defects of the prior art, the present application aims to provide a fine-grained expert behavior imitation learning method, apparatus, medium and terminal, so as to solve the problems that learning is very difficult when a traditional imitation learning method imitates directly from a high-dimensional state and action space, and that the finally obtained strategy deviates greatly from the expert strategy.
In order to solve the above technical problem, a first aspect of the embodiments of the present application provides a fine-grained expert behavior imitation learning method, where the method includes:
acquiring current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current action state information;
calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
As a further improved technical solution, the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
As a further improved technical solution, before the current environment state information of the agent is obtained, expert decision data are obtained in advance.
As a further improvement, the acquiring current environmental state information of the agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current state information of the actions includes:
acquiring current environment state information of an agent, and inputting the current environment state information into the operation prediction network model to obtain the prediction information, wherein the current environment state information comprises coordinate information, angle information and posture information, and the prediction information is action probability distribution;
selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding actions according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and acquiring task completion information and the current state information of the action.
As a further improved technical solution, the calculating of a single reward value according to the state information of the action and the calculating of a task reward value according to the task completion information includes:
comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein the comparison of the state information of the action with the expert decision data is to compare an action key frame in the state information of the action with an expert key frame in the expert decision data;
and calculating a task reward value according to the task completion condition information.
As a further improved technical solution, the training of a preset prediction network model according to the single reward value and the task reward value, adding the task reward value to a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model includes:
initializing the agent to a random sampling state;
training the preset prediction network model according to the single reward value and the task reward value in a curriculum learning manner, adding a plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model;
and returning the strategy output by the trained prediction network model.
As a further improved technical solution, the training of the preset prediction network model in a curriculum learning manner according to the single reward value and the task reward value, the adding of a plurality of single imitation reward values to the task reward value to obtain a total reward value, and, when the total reward value is greater than a threshold value, the completing of the training of the preset prediction network model to obtain the trained prediction network model comprises the following steps:
selecting the expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
and selecting the expert decision data of the cumulative time period to repeat the process of imitation learning training on the preset prediction network model and judging whether to train again, until the cumulative time period equals a full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than an episode reward threshold value, whereupon the training of the preset prediction network model is completed and the trained prediction network model is obtained.
A second aspect of the embodiments of the present application provides a fine-grained expert behavior imitation learning device, including:
the information acquisition module, which is used for acquiring current environment state information of the agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current state information of the actions;
the reward value calculation module, which is used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion information;
and the model training module, which is used for training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
A third aspect of embodiments of the present application provides a computer readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps in the fine-grained expert behavioral imitation learning method as described in any above.
A fourth aspect of the embodiments of the present application provides a terminal device, including: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes the connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the fine-grained expert behavior imitation learning method as described in any one of the above.
Beneficial effects: compared with the prior art, the fine-grained expert behavior imitation learning method comprises: obtaining current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion information and current state information of the actions; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
Drawings
FIG. 1 is a flow chart of a fine-grained expert behavior imitation learning method of the present invention.
Fig. 2 is a schematic structural diagram of a terminal device provided in the present invention.
Fig. 3 is a block diagram of the apparatus provided by the present invention.
FIG. 4 is a diagram of the fine-grained expert behavior imitation learning algorithm provided by the present invention.
The implementation, functional features and advantages of the present invention will be further described with reference to the accompanying drawings.
Detailed Description
To facilitate an understanding of the present application, the present application will now be described more fully with reference to the accompanying drawings. Preferred embodiments of the present application are given in the accompanying drawings. This application may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
The inventor finds that the prior art has the following problems through research:
(1) Game agents are usually trained to learn strategies through imitation learning using two classes of methods: behavior cloning and imitation learning based on inverse reinforcement learning. A behavior cloning algorithm is a supervised learning method: the state given by the environment is used as the feature, the actions the agent can execute are used as the labels, the method tries to minimize the action difference between the agent strategy and the expert strategy, and the imitation learning task is thereby reduced to an ordinary regression or classification task. Imitation learning based on inverse reinforcement learning divides the imitation learning process into two sub-processes, inverse reinforcement learning and reinforcement learning, which are iterated repeatedly: inverse reinforcement learning is used to infer a reward function that fits the expert decision data, and reinforcement learning learns a strategy based on that reward function. Generative adversarial imitation learning developed from imitation learning based on inverse reinforcement learning; it is characterized by using an adversarial network framework to solve the imitation learning problem and can be extended to practical applications.
The behavior cloning method learns the mapping relation from the expert state to the expert action, but in the environment of an incomplete-information three-dimensional video game it is difficult to learn this mapping directly from a high-dimensional space, and problems of distribution drift and compounding errors can be encountered. The inverse reinforcement learning method involves two reinforcement learning processes and generally suffers from high training difficulty, low efficiency and instability. In addition, both methods often require a large amount of expert data to obtain relatively good results, and collecting a large amount of high-quality expert data is often difficult.
In order to solve the above problems, various non-limiting embodiments of the present application will be described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the fine-grained expert behavior imitation learning method provided in the embodiment of the present application includes the following steps:
s1, acquiring current environment state information of an intelligent agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current action state information;
the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
Specifically, the parameters of the intermediate layers of the operation prediction network model need to be trained with the corresponding deep reinforcement learning strategy. For example, the encoder of the operation prediction network model takes the current game state information as input, including the position, moving direction and other information of each agent; the input dimension of the encoder can be set to 96 and its output dimension to 256; the input dimension of the decoder can be set to 256, the query vector dimension to 64, and the number of attention heads to 4. The optimizer of the operation prediction network model can use the Adam optimizer with a learning rate of 0.001, the Gaussian noise variance can be set to 0.1, and the discount factor to 0.9. A multi-process method can be used to distribute the environment over 32 processes to accelerate the training of the whole operation prediction network model.
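For illustration, a minimal PyTorch sketch of such an operation prediction network with the dimensions quoted above (96-dimensional state input, 256-dimensional encoder output, 4 attention heads giving 64-dimensional per-head queries, Adam with learning rate 0.001) is given below; the class name, the layer layout and the number of actions are assumptions of the sketch and are not specified by the description.

import torch
import torch.nn as nn

class OperationPredictionNet(nn.Module):
    """Sketch of an operation prediction network: encoder -> attention decoder -> action logits.
    Only the dimensions follow the example settings above; everything else is illustrative."""
    def __init__(self, state_dim=96, hidden_dim=256, num_heads=4, num_actions=12):
        super().__init__()
        # Encoder maps the 96-dimensional environment state to a 256-dimensional feature.
        self.encoder = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Decoder attends over the encoded feature with 4 heads (256 / 4 = 64-dimensional queries per head).
        self.decoder = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads, batch_first=True)
        self.policy_head = nn.Linear(hidden_dim, num_actions)

    def forward(self, state):
        h = self.encoder(state).unsqueeze(1)            # (batch, 1, 256)
        attended, _ = self.decoder(h, h, h)             # self-attention over the encoded feature
        logits = self.policy_head(attended.squeeze(1))  # unnormalised action scores
        return torch.distributions.Categorical(logits=logits)

# Optimizer settings mentioned above (Adam, learning rate 0.001).
model = OperationPredictionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)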
Before the current environment state information of the agent is obtained, expert decision data are obtained in advance.
Specifically, a human expert plays the game normally, making decisions from the obtained state information and taking actions; the generated state-action pair information is the expert decision data. The scheme needs at least one complete piece of expert decision data to imitate: the state at each moment in the game and the action corresponding to that moment are integrated into one complete piece of expert data. The expert decision data reflect, to a certain extent, the human strategy for playing the game and guide the agent in the subsequent learning process. The state information obtained may differ depending on the game environment, but the coordinate information and the angle information are the most basic requirements; with at least these two kinds of information, subsequent learning can be carried out.
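For concreteness, one complete piece of expert decision data can be held as an ordered sequence of per-step state-action records, as in the following sketch; the field names and types are illustrative assumptions rather than part of the original description.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExpertStep:
    """One time step of expert decision data: the observed state and the action the expert took."""
    position: Tuple[float, float, float]  # coordinate information (x, y, z)
    angle: float                          # facing direction
    posture: int                          # e.g. 0=standing, 1=squatting, 2=running, 3=leaning (illustrative encoding)
    action: int                           # the action the expert executed in this state

# A complete expert trajectory is simply the sequence of per-step records for one full game.
ExpertTrajectory = List[ExpertStep]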
The obtaining of the current environment state information of the agent, the inputting of the current environment state information into a preset prediction network model to obtain prediction information, the controlling of the agent to execute corresponding actions according to the prediction information, and the acquiring of task completion information and the current action state information comprises the following steps:
s101, obtaining current environment state information of an agent, and inputting the current environment state information into the operation prediction network model to obtain the prediction information, wherein the current environment state information comprises coordinate information, angle information and posture information, and the prediction information is action probability distribution;
s102, selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding action according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and S103, acquiring task completion condition information and the current state information of the action.
Specifically, the current environment state information of the agent in the environment at the current moment is acquired. The environment may be a game environment, and the current environment state information includes the coordinate information, angle information and posture information of the agent in the environment. The types of information acquired correspond to the acquired expert decision data: if the expert decision data contain only coordinate information and angle information, the current environment state information only needs to include coordinate information and angle information;
The current environment state information is then input into the preset prediction network model, which outputs prediction information in the form of an action probability distribution, i.e. the probability distribution over the actions the agent can execute in the current state. The agent samples one piece of action operation information according to this probability distribution, together with the probability corresponding to that action operation information, and is controlled to execute the corresponding action in the environment through the selected action operation information. The action executed by the agent is collected, the current environment state information of the agent is updated, and the task completion information is collected; the task completion information includes information related to a specific task, for example whether a certain task has been completed, and the task can be preset.
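A minimal sketch of this single interaction step is given below, assuming the prediction network returns a categorical action distribution; the environment interface (get_state, step) and the helper name are hypothetical placeholders, not part of the original disclosure.

import torch

def act_once(model, env):
    """Query the prediction network for an action distribution, sample one action,
    execute it, and collect the updated state and task-completion information."""
    state = torch.as_tensor(env.get_state(), dtype=torch.float32).unsqueeze(0)
    dist = model(state)                # action probability distribution output by the prediction network
    action = dist.sample()             # sample one action according to the probabilities
    log_prob = dist.log_prob(action)   # (log) probability of the chosen action, kept for later training
    next_state, task_info = env.step(action.item())  # hypothetical environment step returning state and task info
    return action.item(), log_prob, next_state, task_info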
S2, calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
Specifically, the imitation reward function mainly considers the behavior imitation reward and the task reward: the single reward value is calculated to encourage the agent to imitate the actions and trajectory of the expert, and the task reward value is calculated to encourage the agent to complete the set task. The two parts of the reward are each given a weight and are added to form the total reward value.
Wherein, the calculation of the single reward value according to the state information of the action and the calculation of the task reward value according to the task completion condition information comprises the following steps:
s201, comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein the comparison of the state information of the action with the expert decision data is to compare an action key frame in the state information of the action with an expert key frame in the expert decision data;
and S202, calculating a task reward value according to the task completion condition information.
Specifically, the behavior imitation reward is divided into several parts, covering the position, speed, angle and the various actions taken by the expert, and mainly encourages the agent to move to where the expert is and to make actions similar to the expert's. The rewards of these parts are then multiplied to form the final single reward value, which guides the agent to adopt the same action pattern as the expert. The task reward is usually set specifically according to the concrete task scenario and is used to guide the agent to complete the set task;
In this embodiment, a key-frame alignment method is used to calculate the single imitation reward value: the moments at which the expert makes key actions are selected as key frames, and the agent is encouraged to take actions consistent with the expert's at these key frames, so that its behavior is as close as possible to the expert's at those moments. Compared with frame-by-frame alignment, key-frame alignment preserves the diversity of the strategy, so that the agent's actions are not completely identical to the expert's, which is helpful for further application.
Taking a game as an example, a position reward r_l, a speed reward r_v, an angle reward r_r and a posture reward r_p are designed in the game scene to encourage the agent, from multiple aspects, to approach the target positions of the expert trajectory and take actions similar to the expert's, thereby producing a behavior pattern similar to the expert's. The specific function forms are:

r_l = exp(w_l * Σ_{i=1..3} |l_agent^i − l_expert^i|)

r_v = exp(w_v * |v_agent − v_expert|)

r_r = exp(w_r * |r_agent − r_expert|)

r_p = exp(w_p * |p_agent − p_expert|)

where w_l, w_v, w_r, w_p respectively represent the weight of each reward; l_agent^i and l_expert^i with i = 1, 2, 3 respectively represent the position information of the agent and the expert along the x, y and z coordinate axes in the three-dimensional environment; v_agent, v_expert and r_agent, r_expert represent the velocity information and the angle information of the agent and the expert; and p_agent, p_expert represent the posture information of the agent and the expert, including standing, squatting, running, leaning, etc. The behavior imitation reward function is:

r_ref = r_l * r_v * r_r * r_p

The task imitation reward function is:

r_task = w_a * Σ_i c_i

where w_a represents the weight and c_i indicates whether the i-th task-related action made by the agent is consistent with the expert's. The total reward function combining the behavior imitation reward and the task imitation reward is:

r = w_ref * r_ref + w_task * r_task

where w_ref and w_task are respectively the weights of the behavior imitation reward item and the task-related reward item.
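A sketch of computing these rewards from per-step agent and expert information follows; the weight values, the dictionary keys and the sign convention (negative weights so that each factor decays as the agent deviates from the expert) are illustrative assumptions, not values given in the description.

import numpy as np

def imitation_reward(agent, expert, w_l=-1.0, w_v=-0.5, w_r=-0.5, w_p=-0.5):
    """Behavior imitation reward r_ref = r_l * r_v * r_r * r_p.
    Negative weights (an assumption) make each factor decay with the deviation from the expert."""
    r_l = np.exp(w_l * np.abs(np.asarray(agent["pos"]) - np.asarray(expert["pos"])).sum())
    r_v = np.exp(w_v * abs(agent["vel"] - expert["vel"]))
    r_r = np.exp(w_r * abs(agent["angle"] - expert["angle"]))
    r_p = np.exp(w_p * abs(agent["posture"] - expert["posture"]))  # posture encoded as a number (illustrative)
    return r_l * r_v * r_r * r_p

def task_reward(agent_actions, expert_actions, w_a=1.0):
    """Task imitation reward: w_a times the number of task-related actions consistent with the expert."""
    return w_a * sum(int(a == e) for a, e in zip(agent_actions, expert_actions))

def total_reward(r_ref, r_task, w_ref=0.7, w_task=0.3):
    """Total reward r = w_ref * r_ref + w_task * r_task (weights are illustrative)."""
    return w_ref * r_ref + w_task * r_task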
And S3, training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
Specifically, the preset prediction network model is trained according to the single reward value, and its parameters are updated in each training. In each round, according to the designed imitation reward function, the corresponding single reward value can be calculated from the state information and the corresponding expert state information. At each moment, the decision information is stored in the memory pool of the model and used for later training, so that a strategy close to the expert's can be learned. The decision information includes the state the agent obtains from the environment at a certain moment, the action operation information the agent samples according to the probability distribution, the probability of selecting that action operation information, and the single reward value.
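A minimal sketch of such a memory pool is given below; it stores the four items of decision information listed above, while the class and method names and the capacity are illustrative assumptions.

from collections import deque

class MemoryPool:
    """Stores, for each time step, the state obtained from the environment, the sampled action,
    the (log) probability with which it was chosen, and the single reward value."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, log_prob, reward):
        self.buffer.append((state, action, log_prob, reward))

    def sample_all(self):
        # Return everything collected in this round of interaction for a training update.
        return list(self.buffer)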
The training of a preset prediction network model according to the single reward value and the task reward value, the adding of the task reward value and a plurality of single reward values of each episode to obtain a total reward value, the completing of the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and the returning of the strategy output by the trained prediction network model comprise the following steps:
s301, initializing the agent to a random sampling state;
S302, training the preset prediction network model according to the single reward value and the task reward value in a curriculum learning manner, adding a plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain the trained prediction network model;
and S303, returning the strategy output by the trained prediction network model.
The training of the preset prediction network model in a curriculum learning manner according to the single reward value and the task reward value, the adding of a plurality of single imitation reward values and the task reward value to obtain a total reward value, and the completing of the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain the trained prediction network model comprise the following steps:
S3021, selecting the expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
S3022, judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
S3023, selecting the expert decision data of the cumulative time period to repeat the process of imitation learning training on the preset prediction network model and judging whether to train again, until the cumulative time period equals the full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than an episode reward threshold value, whereupon the training of the preset prediction network model is completed and the trained prediction network model is obtained.
Specifically, in one training the agent needs to interact with the environment for multiple rounds, and training is performed according to the data obtained from the interaction. At the start of each round the agent obtains an initial state, including its coordinates, facing direction and so on, and continues the subsequent interaction from that state. If the agent's initial state were the same in every round, it might be difficult to learn the subsequent trajectory. Therefore, the agent is initialized to a randomly sampled state: at the start of each round a state is randomly sampled from the expert decision data and taken as the agent's initial state, so that the agent starts from a state on or near the trajectory it needs to imitate.
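A sketch of this reference-state initialization is given below, assuming the environment exposes a reset_to_state interface and that expert steps carry position, angle and posture fields; both interfaces are assumptions of the sketch.

import random

def reset_from_expert(env, expert_trajectory):
    """Pick a random time step from the expert decision data and start the round from that state,
    so the agent does not always begin from the same initial state."""
    start = random.randrange(len(expert_trajectory))
    step = expert_trajectory[start]
    env.reset_to_state(position=step.position, angle=step.angle, posture=step.posture)
    return start  # index of the expert frame the agent should imitate from here on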
In curriculum learning, the curriculum is mainly set in an inheritance manner: through inheritance, the capability of the agent on short sequences is continuously built upon, and the curriculum tasks are arranged from simple to difficult, progressing from imitating the trajectory generated by the expert to imitating the actions taken by the expert. Specifically, the expert decision data within a preset time period are selected to perform imitation learning training on the preset prediction network model, and all single imitation reward values within the preset time period are added to obtain a single-segment imitation reward value. For example: a 10-second segment of the expert decision trajectory is used for imitation learning training of the preset prediction network model, one round of training lasts 10 seconds, and all single imitation reward values in one round are added to obtain the single-segment imitation reward value;
Further, whether to train again is judged. If the single-segment imitation reward value is smaller than the single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model needs to be repeated. If the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
Further, the expert decision data of the cumulative time period are selected to repeat the imitation learning training of the preset prediction network model and the judgment of whether to train again, until the cumulative time period equals the full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than the episode reward threshold value; the training of the preset prediction network model is then completed and the trained prediction network model is obtained. For example: if the expert takes 100 seconds to play a complete game, 100 seconds of expert decision trajectory need to be collected, so the total expert decision trajectory for the game lasts 100 seconds, and a complete game played by the agent also takes 100 seconds, i.e. the full-episode time period equals 100 seconds. During training, the 100-second expert decision trajectory can be divided into 10 segments of 10 seconds each; in the next training stage, a new 10-second expert decision trajectory segment is added on the basis of the original 10-second segment, so that the cumulative time period becomes 20 seconds, and training continues on the basis of the model trained in the previous stage;
Further, the above steps are repeated. After multiple training stages, the length of the expert trajectory the model can imitate keeps increasing and the cumulative time period keeps accumulating, until the cumulative time period equals the full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than the episode reward threshold value; the training of the preset prediction network model is then completed and the curriculum learning ends. Here, the sum of all single imitation reward values within the full-episode time period is the value obtained by adding all single imitation reward values in the full-episode time period each time the agent completes a whole game, and the time the agent takes to complete the whole game equals the time the imitated expert takes to complete the whole game;
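The curriculum described above can be sketched as the following loop; the train_segment helper, the threshold values and the 10-second/100-second horizons reused from the example are illustrative assumptions rather than the original implementation.

def curriculum_training(train_segment, total_seconds=100, step_seconds=10,
                        segment_threshold=50.0, episode_threshold=500.0):
    """Inheritance-style curriculum: each stage reuses the model trained on the previous,
    shorter expert segment and adds a new time period once the current one is passed."""
    horizon = step_seconds
    while True:
        # Train on expert decision data covering [0, horizon) seconds; train_segment is assumed to
        # return the summed single imitation rewards over the segment and the task reward value.
        segment_reward, task_value = train_segment(horizon)
        if segment_reward <= segment_threshold:
            continue  # segment not passed (or terminated early): repeat training on the same segment
        if horizon >= total_seconds and segment_reward + task_value > episode_threshold:
            break     # whole game imitated and the episode-level total reward clears the threshold
        horizon = min(horizon + step_seconds, total_seconds)  # add a new time period to the curriculum
    return horizon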
Further, by setting an early-termination condition, the problem of invalid exploration can be alleviated and the training time shortened. If the agent gets stuck in a certain state during training and cannot successfully learn the target action, the episode needs to be terminated early to avoid the waste of resources caused by continuing the imitation. The early-termination condition for one round is that the agent cannot, or can hardly, continue to advance along the trajectory, which can be judged from the coordinate information in the state information and is divided into two cases: (1) the agent stays in one place or moves back and forth within a small range; (2) the agent's route deviates too much from the trajectory. In either case the agent can be considered stuck, and the round can be ended directly.
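A sketch of checking these two early-termination conditions from the coordinate information is given below; the radius, deviation limit and window length are illustrative assumptions.

import numpy as np

def should_terminate_early(recent_positions, expert_position,
                           stuck_radius=1.0, max_deviation=10.0, window=50):
    """Return True if the round should be ended early:
    (1) the agent has stayed within a small radius over the last `window` steps, or
    (2) the agent's current position deviates too far from the expert trajectory."""
    if not recent_positions:
        return False
    if len(recent_positions) >= window:
        recent = np.asarray(recent_positions[-window:])
        if np.linalg.norm(recent - recent.mean(axis=0), axis=1).max() < stuck_radius:
            return True  # stuck in place or moving back and forth in a small range
    deviation = np.linalg.norm(np.asarray(recent_positions[-1]) - np.asarray(expert_position))
    return deviation > max_deviation  # route deviates too much from the expert trajectory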
Based on the above fine-grained expert behavior imitation learning method, this embodiment provides a fine-grained expert behavior imitation learning device, which includes:
the intelligent agent comprises an information acquisition module 1, a task execution module and a task execution module, wherein the information acquisition module 1 is used for acquiring current environment state information of the intelligent agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current state information of the actions;
the reward value calculation module 2 is used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion condition information;
and the model training module 3, which is used for training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
In addition, it is worth noting that the working process of the fine-grained expert behavior imitation learning device provided in this embodiment is the same as that of the fine-grained expert behavior imitation learning method; for details, reference may be made to the working process of the method, which is not repeated here.
Based on the fine-grained expert behavior imitation learning method described above, the present embodiment provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors, to implement the steps in the fine-grained expert behavior imitation learning method described in the above embodiment.
As shown in fig. 2, based on the fine-grained expert behavior imitation learning method, the present application further provides a terminal device, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, and may further include a communication Interface (Communications Interface) 23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 can communicate with each other through the bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. Processor 20 may call logic instructions in memory 22 to perform the methods in the embodiments described above.
Furthermore, the logic instructions in the memory 22 may be implemented in software functional units and stored in a computer readable storage medium when sold or used as a stand-alone product.
The memory 22, which is a computer-readable storage medium, may be configured to store a software program, a computer-executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes the functional applications and data processing, i.e. implements the methods in the above embodiments, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal device, and the like. Further, the memory 22 may include a high speed random access memory and may also include a non-volatile memory. For example, a variety of media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, may also be transient storage media.
Compared with the prior art, the fine-grained expert behavior imitation learning method comprises: obtaining current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion information and current state information of the actions; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
It is to be understood that the invention is not limited to the examples described above, but that modifications and variations may be effected thereto by those of ordinary skill in the art in light of the foregoing description, and that all such modifications and variations are intended to be within the scope of the invention as defined by the appended claims.
Naturally, the above embodiments of the present invention are described in detail, but it should not be understood that the scope of the present invention is limited thereto, and other various embodiments of the present invention can be obtained by those skilled in the art without any creative work based on the embodiments, and the scope of the present invention is subject to the appended claims.

Claims (10)

1. A fine-grained expert behavior imitation learning method, characterized by comprising the following steps:
acquiring current environment state information of an agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current action state information;
calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
2. The fine-grained expert behavior imitation learning method according to claim 1, wherein the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
3. The fine-grained expert behavior imitation learning method of claim 2, wherein obtaining the current environmental state information of the agent is preceded by obtaining expert decision data in advance.
4. The fine-grained expert behavior imitation learning method according to claim 3, wherein the obtaining current environmental state information of an agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion condition information and current state information of the actions comprises:
acquiring current environment state information of an agent, and inputting the current environment state information into the operation prediction network model to obtain the prediction information, wherein the current environment state information comprises coordinate information, angle information and posture information, and the prediction information is action probability distribution;
selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding actions according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and acquiring task completion information and the current state information of the action.
5. The fine-grained expert behavior imitation learning method according to claim 4, wherein the calculating of the single reward value according to the state information of the action and the calculating of the task reward value according to the task completion information comprises:
comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein the comparison of the state information of the action with the expert decision data is to compare an action key frame in the state information of the action with an expert key frame in the expert decision data;
and calculating a task reward value according to the task completion condition information.
6. The fine-grained expert behavior imitation learning method according to claim 5, wherein the training of a preset prediction network model according to the single reward value and the task reward value, the adding of the task reward value and a plurality of single reward values of each episode to obtain a total reward value, the completing of the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and the returning of the strategy output by the trained prediction network model comprises:
initializing the agent to a random sampling state;
training the preset prediction network model according to the single reward value and the task reward value in a curriculum learning manner, adding a plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain the trained prediction network model;
and returning the strategy output by the trained prediction network model.
7. The fine-grained expert behavior imitation learning method according to claim 6, wherein the training of the preset prediction network model in a curriculum learning manner according to the single reward value and the task reward value, the adding of a plurality of single imitation reward values and the task reward value to obtain a total reward value, and, when the total reward value is greater than a threshold value, the completing of the training of the preset prediction network model to obtain the trained prediction network model comprises:
selecting the expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold value, or an early-termination condition is triggered, the single-segment training is not passed, and the imitation learning training of the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold value, the single-segment training is passed, and a new time period is added on the basis of the preset time period to obtain a cumulative time period;
and selecting the expert decision data of the cumulative time period to repeat the process of imitation learning training on the preset prediction network model and judging whether to train again, until the cumulative time period equals a full-episode time period and the sum of the task reward value and all single imitation reward values within the full-episode time period is greater than an episode reward threshold value, whereupon the training of the preset prediction network model is completed and the trained prediction network model is obtained.
8. A fine-grained expert behavior imitation learning device, comprising:
the information acquisition module, which is used for acquiring current environment state information of the agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and acquiring task completion information and current state information of the actions;
the reward value calculation module is used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion condition information;
and the model training module, which is used for training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each episode to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning a strategy output by the trained prediction network model.
9. A computer readable storage medium, storing one or more programs, the one or more programs being executable by one or more processors to perform the steps in the fine-grained expert behavioral imitation learning method of any one of claims 1-7.
10. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes the connection communication between the processor and the memory;
the processor when executing the computer readable program performs the steps in the fine-grained expert behavior imitation learning method of any of claims 1-7.
CN202211285500.5A 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal Active CN115688858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211285500.5A CN115688858B (en) 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal


Publications (2)

Publication Number Publication Date
CN115688858A true CN115688858A (en) 2023-02-03
CN115688858B CN115688858B (en) 2024-02-09

Family

ID=85066632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211285500.5A Active CN115688858B (en) 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal

Country Status (1)

Country Link
CN (1) CN115688858B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
WO2021184530A1 (en) * 2020-03-18 2021-09-23 清华大学 Reinforcement learning-based label-free six-dimensional item attitude prediction method and device
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN114048834A (en) * 2021-11-05 2022-02-15 哈尔滨工业大学(深圳) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion
CN114307160A (en) * 2021-12-10 2022-04-12 腾讯科技(深圳)有限公司 Method for training intelligent agent


Also Published As

Publication number Publication date
CN115688858B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN112668235B (en) Robot control method based on off-line model pre-training learning DDPG algorithm
CN111488988B (en) Control strategy simulation learning method and device based on counterstudy
Knox et al. Tamer: Training an agent manually via evaluative reinforcement
Loiacono et al. The 2009 simulated car racing championship
CN111026272B (en) Training method and device for virtual object behavior strategy, electronic equipment and storage medium
CN109523029A (en) For the adaptive double from driving depth deterministic policy Gradient Reinforcement Learning method of training smart body
CN109063823B (en) Batch A3C reinforcement learning method for exploring 3D maze by intelligent agent
CN109978012A (en) It is a kind of based on combine the improvement Bayes of feedback against intensified learning method
CN110516389B (en) Behavior control strategy learning method, device, equipment and storage medium
Efthymiadis et al. Using plan-based reward shaping to learn strategies in starcraft: Broodwar
CN110390399A (en) A kind of efficient heuristic approach of intensified learning
CN113379027A (en) Method, system, storage medium and application for generating confrontation interactive simulation learning
CN111282272A (en) Information processing method, computer readable medium and electronic device
CN116147627A (en) Mobile robot autonomous navigation method combining deep reinforcement learning and internal motivation
Hafez et al. Improving robot dual-system motor learning with intrinsically motivated meta-control and latent-space experience imagination
CN114404975B (en) Training method, device, equipment, storage medium and program product of decision model
CN116090549A (en) Knowledge-driven multi-agent reinforcement learning decision-making method, system and storage medium
CN114861368A (en) Method for constructing railway longitudinal section design learning model based on near-end strategy
Tong et al. Enhancing rolling horizon evolution with policy and value networks
CN115688858A (en) Fine-grained expert behavior simulation learning method, device, medium and terminal
CN116992928A (en) Multi-agent reinforcement learning method for fair self-adaptive traffic signal control
CN116540535A (en) Progressive strategy migration method based on self-adaptive dynamics model
CN116047902A (en) Method, device, equipment and storage medium for navigating robots in crowd
CN116306947A (en) Multi-agent decision method based on Monte Carlo tree exploration
CN112884129B (en) Multi-step rule extraction method, device and storage medium based on teaching data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant