CN115688858B - Fine granularity expert behavior imitation learning method, device, medium and terminal - Google Patents

Fine granularity expert behavior imitation learning method, device, medium and terminal Download PDF

Info

Publication number
CN115688858B
CN115688858B (application number CN202211285500.5A)
Authority
CN
China
Prior art keywords
network model
information
value
rewards
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211285500.5A
Other languages
Chinese (zh)
Other versions
CN115688858A (en)
Inventor
漆舒汉
孙志航
殷俊
黄新昊
万乐
王轩
张加佳
王强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN202211285500.5A priority Critical patent/CN115688858B/en
Publication of CN115688858A publication Critical patent/CN115688858A/en
Application granted granted Critical
Publication of CN115688858B publication Critical patent/CN115688858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a fine-grained expert behavior imitation learning method, device, medium and terminal. The method comprises: obtaining the current environmental state information of an agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute a corresponding action according to the prediction information, and collecting task completion information and state information of the current action; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, and, when the total reward value is larger than a threshold value, completing the training of the preset prediction network model and returning the output strategy.

Description

Fine granularity expert behavior imitation learning method, device, medium and terminal
Technical Field
The invention relates to the field of imitation learning, in particular to a fine-grained expert behavior imitation learning method, a device, a medium and a terminal.
Background
Existing imitation learning mostly uses the behavior cloning method and the inverse reinforcement learning method. The behavior cloning method can learn the mapping from expert states to expert actions, but in an incomplete-information three-dimensional video game environment it is very difficult to learn this mapping directly from a high-dimensional space, and the method runs into distribution drift and compounding errors. In addition, both methods usually require a large amount of expert data to obtain reasonably good results, and collecting a large amount of high-quality expert data is often difficult.
Disclosure of Invention
In view of the shortcomings of the prior art, the purpose of the present application is to provide a fine-grained expert behavior imitation learning method, device, medium and terminal, so as to solve the problems that traditional imitation learning methods learn with difficulty when imitating directly from a high-dimensional state and action space, and that the finally obtained strategy deviates greatly from the expert strategy.
To solve the above technical problem, a first aspect of the embodiments of the present application provides a fine-grained expert behavior imitation learning method, where the method includes:
acquiring current environmental state information of an intelligent agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and collecting task completion condition information and current state information of the actions;
calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
As a further improved technical solution, the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
As a further improved technical solution, the method further comprises acquiring expert decision data in advance before acquiring the current environmental state information of the agent.
As a further improved technical solution, the obtaining the current environmental status information of the agent, inputting the current environmental status information into a preset prediction network model to obtain prediction information, controlling the agent to execute a corresponding action according to the prediction information, and collecting task completion status information and current status information of the action includes:
acquiring current environmental state information of an agent, and inputting the current environmental state information into the operation prediction network model to obtain prediction information, wherein the current environmental state information comprises coordinate information, angle information and posture information, and the prediction information is an action probability distribution;
selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding actions according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and collecting task completion information and state information of the current action.
As a further improved technical solution, the calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information includes:
comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein comparing the state information of the action with the expert decision data means comparing an action keyframe in the state information of the action with an expert keyframe in the expert decision data;
and calculating a task rewarding value according to the task completion condition information.
As a further improved technical solution, the training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model includes:
initializing the intelligent agent to a random sampling state;
training the preset prediction network model in a curriculum learning mode according to the single reward value and the task reward value, adding the plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than the threshold value to obtain a trained prediction network model;
and returning the strategy output by the trained prediction network model.
As a further improved technical solution, the training the preset prediction network model in a curriculum learning mode according to the single reward value and the task reward value, adding the plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is greater than the threshold value to obtain a trained prediction network model includes:
selecting expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold or an early termination condition is triggered, the single segment has not been passed, and the imitation learning training on the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold, a new time segment is added on the basis of the preset time period to obtain a cumulative time period;
and selecting the expert decision data of the cumulative time period, repeating the imitation learning training on the preset prediction network model and the process of judging whether to train again, until the cumulative time period equals the duration of one full game and the sum of the task reward value and all single imitation reward values within that game is greater than a one-game reward threshold, completing the training of the preset prediction network model and obtaining the trained prediction network model.
A second aspect of the embodiments of the present application provides a fine-grained expert behavior imitation learning apparatus, including:
the information acquisition module is used for acquiring current environment state information of the intelligent agent, inputting the current environment state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current state information of the actions;
the reward value calculation module is used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion information;
the model training module is used for training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
A third aspect of the embodiments of the present application provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the fine-grained expert behavior imitation learning method as described in any one of the above.
A fourth aspect of the present embodiment provides a terminal device, including: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the fine-grained expert behavior imitation learning method as described in any one of the above.
The beneficial effects are that: compared with the prior art, the fine-grained expert behavior imitation learning method comprises: obtaining the current environmental state information of an agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute a corresponding action according to the prediction information, and collecting task completion information and state information of the current action; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
Drawings
FIG. 1 is a flow chart of the fine-grained expert behavior imitation learning method of the present invention.
Fig. 2 is a schematic structural diagram of a terminal device provided by the present invention.
Fig. 3 is a block diagram of the structure of the device provided by the invention.
FIG. 4 is a diagram of the improved fine-grained expert behavior imitation learning algorithm of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
In order to facilitate an understanding of the present application, a more complete description of the present application will now be provided with reference to the relevant figures. Preferred embodiments of the present application are shown in the accompanying drawings. This application may, however, be embodied in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
The inventors have found that the following problems exist in the prior art:
(1) When imitation learning is used to train game agents to learn strategies, behavior cloning and imitation learning based on inverse reinforcement learning are generally used. A behavior cloning algorithm is a supervised learning method: it takes the states given by the environment as features and the actions the agent can execute as labels, tries to minimize the difference between the actions of the agent strategy and the expert strategy, and thereby reduces the imitation learning task to an ordinary regression or classification task. Imitation learning based on inverse reinforcement learning divides the imitation learning process into two sub-processes, inverse reinforcement learning and reinforcement learning, which are iterated repeatedly: inverse reinforcement learning is used to derive a reward function that fits the expert decision data, and reinforcement learning learns a strategy based on this reward function. Generative adversarial imitation learning is developed from imitation learning based on inverse reinforcement learning.
The behavior cloning method learns the mapping from expert states to expert actions, but in an incomplete-information three-dimensional video game environment it is difficult to learn this mapping directly from a high-dimensional space, and the problems of distribution drift and compounding errors arise. The inverse reinforcement learning method, because it involves two reinforcement learning processes, generally suffers from high training difficulty, low efficiency and instability. In addition, both methods usually require a large amount of expert data to obtain relatively good results, and collecting a large amount of high-quality expert data is often difficult.
In order to solve the above problems, various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
As shown in FIG. 1, the fine-grained expert behavior imitation learning method provided in the embodiment of the present application includes the following steps:
s1, acquiring current environmental state information of an agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute corresponding actions according to the prediction information, and collecting task completion condition information and current state information of the actions;
the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
Specifically, the parameters of the intermediate layers of the operation prediction network model need to be trained with a corresponding deep reinforcement learning strategy. For example, the encoder of the operation prediction network model takes as input the current game state information, including the position and moving direction of each agent; the input dimension of the encoder may be set to 96 and its output dimension to 256, and the input dimension of the decoder may be set to 256. In the attention parameters of the communication information processing module, the query vector dimension may be set to 64 and the number of attention heads to 4. The optimizer of the operation prediction network model may be the Adam optimizer, the learning rate may be set to 0.001, the Gaussian noise variance to 0.1 and the discount factor to 0.9. Meanwhile, multiprocessing may be used to distribute the environment across 32 processes, so as to accelerate the training of the whole operation prediction network model.
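As a non-limiting illustration only, the configuration described above could be wired up roughly as follows. This is a sketch in PyTorch, not the patented implementation; the module names, the action count of 12 and the overall network shape are assumptions made for the example.

```python
import torch
import torch.nn as nn

class OperationPredictionModel(nn.Module):
    """Sketch of an operation prediction network using the dimensions quoted above."""
    def __init__(self, state_dim=96, hidden_dim=256, num_actions=12, num_heads=4):
        super().__init__()
        # Encoder: raw game state (positions, moving directions, ...) -> 256-dimensional feature
        self.encoder = nn.Sequential(nn.Linear(state_dim, hidden_dim), nn.ReLU())
        # Communication information processing module: 4 attention heads over a 256-d
        # embedding give 64-dimensional queries per head, as in the text above.
        self.attention = nn.MultiheadAttention(embed_dim=hidden_dim, num_heads=num_heads,
                                               batch_first=True)
        # Decoder: 256-dimensional feature -> unnormalized action scores
        self.decoder = nn.Linear(hidden_dim, num_actions)

    def forward(self, state):
        # state: (batch, num_agents, state_dim)
        h = self.encoder(state)
        h, _ = self.attention(h, h, h)   # agents attend to each other's encoded features
        return self.decoder(h)

model = OperationPredictionModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # learning rate 0.001 as above
gamma = 0.9                                                # discount factor as above
```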
The method comprises the steps of acquiring expert decision data in advance before acquiring the current environmental state information of the intelligent agent.
Specifically, a human expert plays the game normally, making decisions from the obtained state information and taking actions; the generated state-action pair information is the expert decision data. At least one piece of complete expert decision data is needed for imitation: the state at each moment of a complete game, together with the action set corresponding to that moment, constitutes complete expert data. The expert decision data embodies, to a certain extent, the strategy of the human player and guides the agent in the subsequent learning process. The state information obtained may differ depending on the game environment, but coordinate information and angle information are the most basic requirement; with at least these two kinds of information, further learning can be carried out.
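For concreteness, one possible representation of such expert decision data is sketched below as a list of per-moment state-action records; the class and field names are illustrative assumptions, not terms from the patent.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExpertFrame:
    """One state-action pair recorded while the human expert plays."""
    t: float                # timestamp within the game, in seconds
    position: List[float]   # x, y, z coordinates
    angle: float            # facing / view angle
    posture: str            # e.g. "standing", "squatting", "running", "sideways"
    actions: List[str]      # the action set taken at this moment

@dataclass
class ExpertTrajectory:
    """A complete game played by the expert: the state and action set at every moment."""
    frames: List[ExpertFrame] = field(default_factory=list)

    def segment(self, start_s: float, end_s: float) -> "ExpertTrajectory":
        """Sub-trajectory within [start_s, end_s); used later for the curriculum segments."""
        return ExpertTrajectory([f for f in self.frames if start_s <= f.t < end_s])
```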
The method for acquiring the current environmental state information of the intelligent agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and acquiring task completion condition information and current state information of the actions comprises the following steps:
s101, acquiring current environmental state information of an intelligent agent, and inputting the current environmental state information into the operation prediction network model to obtain prediction information, wherein the current environmental state information comprises coordinate information, angle information and gesture information, and the prediction information is motion probability distribution;
s102, selecting one piece of action operation information based on the action probability distribution sampling, and executing a corresponding action according to the action operation information, wherein each piece of action operation information corresponds to one probability;
s103, collecting task completion condition information and state information of the current action.
Specifically, the current environmental state information of the agent in the environment at the current moment is first acquired; the environment may be a game environment. The current environmental state information comprises the coordinate information, angle information and posture information of the agent in the environment. The types of current environmental state information collected correspond to the acquired expert decision data: if the expert decision data only contains coordinate information and angle information, the current environmental state information only needs to contain those two items;
The current environmental state information is then input into the preset prediction network model, which outputs the prediction information. The prediction information is an action probability distribution, i.e. the probability distribution over the actions the agent may execute in the current state. The agent samples one piece of action operation information according to this probability distribution, and the selected action operation information corresponds to one probability. The agent is then controlled, through the selected action operation information, to execute the corresponding action in the environment; the executed action is collected and the current environmental state information of the agent is updated. Meanwhile, the task completion information is collected; it comprises various information related to the specific task, such as whether a certain task has been completed, and the tasks can be preset.
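A minimal sketch of this sampling step is given below, assuming the prediction network outputs action logits as in the earlier sketch; the environment interface (`env.step`) is a hypothetical placeholder.

```python
import torch

def act(model, state, env):
    """Sample one action from the predicted action probability distribution and execute it."""
    logits = model(state)                               # unnormalized scores from the prediction network
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                              # the sampled action operation information
    log_prob = dist.log_prob(action)                    # (log) probability of the chosen action
    next_state, task_info = env.step(action.item())     # execute the action; collect task completion info
    return action, log_prob, next_state, task_info
```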
S2, calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
specifically, the simulated rewarding function mainly considers behavior simulated rewards and task rewards, the action and the track of the simulated expert are encouraged by the agent through calculating a single rewards value, the agent is encouraged to complete the set task through calculating the task rewards value, and the rewards of the two parts are respectively provided with a weight, and the total rewards value is obtained after adding.
The step of calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion information comprises the following steps:
S201, comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single imitation reward value according to the difference information, wherein comparing the state information of the action with the expert decision data means comparing an action keyframe in the state information of the action with an expert keyframe in the expert decision data;
s202, calculating a task rewarding value according to the task completion condition information.
Specifically, the behavior imitation reward is divided into several parts, covering the position, speed and angle of the expert as well as the various actions taken by the expert, and mainly encourages the agent to move to the expert's position and make actions similar to the expert's. These partial rewards are then multiplied together to form the final single reward value, which guides the agent to act in the same way as the expert. The task reward usually needs to be set specifically for the concrete task scene and is used to guide the agent to complete the set task;
In this embodiment, a keyframe alignment mode is adopted to calculate the single imitation reward value: the moments at which the expert makes key actions are selected as keyframes, so that the agent is made as close as possible to the expert actions at those moments and is encouraged to take actions consistent with the expert at the keyframes. Compared with frame-by-frame alignment, keyframe alignment preserves the diversity of strategies, so that the behavior of the agent is not required to be completely consistent with the expert, which is helpful for further application.
For example, in a certain game scene, a position reward r_l, a speed reward r_v, an angle reward r_r and a posture reward r_p are designed, so that the agent is encouraged from multiple aspects to approach the target position on the expert trajectory and to take actions similar to the expert, thereby producing a behavior pattern similar to the expert. The speed and angle rewards take the form:
r_v = exp(w_v * |v_agent - v_expert|)
r_r = exp(w_r * |r_agent - r_expert|)
wherein w_l, w_v, w_r and w_p respectively represent the weight of each reward; in a three-dimensional environment, i = 1, 2, 3 in the position reward r_l denote the x, y and z coordinate dimensions; v_agent and v_expert denote the speed information of the agent and the expert, r_agent and r_expert denote their angle information, and the posture information used in r_p covers standing, squatting, running, sideways and the like. The behavior imitation reward function is:
r_ref = r_l * r_v * r_r * r_p
The task imitation reward r_task sums, over the tasks, a weighted consistency term, wherein w_a represents the weight and the i-th term indicates whether the action associated with the i-th task made by the agent is consistent with the expert.
The total reward function combining the behavior imitation reward and the task reward is:
r = w_ref * r_ref + w_task * r_task
wherein w_ref and w_task are respectively the weights of the behavior imitation reward and the task-related reward.
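A per-step sketch of these reward terms is given below. The weight values are placeholders, written here as negative numbers under the assumption that each factor should peak at 1.0 when the agent exactly matches the expert; the text does not state the concrete weights.

```python
import math

def behavior_imitation_reward(agent_pos, expert_pos, agent_v, expert_v,
                              agent_angle, expert_angle, agent_posture, expert_posture,
                              w_l=-1.0, w_v=-0.5, w_r=-0.5, w_p=-0.3):
    """Behavior imitation reward r_ref = r_l * r_v * r_r * r_p for one keyframe."""
    pos_err = sum(abs(a - e) for a, e in zip(agent_pos, expert_pos))   # i = 1, 2, 3 -> x, y, z
    r_l = math.exp(w_l * pos_err)
    r_v = math.exp(w_v * abs(agent_v - expert_v))
    r_r = math.exp(w_r * abs(agent_angle - expert_angle))
    r_p = math.exp(w_p * (0.0 if agent_posture == expert_posture else 1.0))
    return r_l * r_v * r_r * r_p

def total_reward(r_ref, r_task, w_ref=0.7, w_task=0.3):
    """Total reward r = w_ref * r_ref + w_task * r_task (the weights shown are illustrative)."""
    return w_ref * r_ref + w_task * r_task
```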
S3, training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
Specifically, the preset prediction network model is trained according to the single reward value, and the parameters of the preset prediction network model are updated in each training. Concretely, in each round, the corresponding single reward value can be calculated from the state information and the corresponding expert state information according to the designed imitation reward function. At each moment, the decision information is stored in the memory pool of the model for later training, so that a strategy close to the expert's can be learned. The decision information comprises the state obtained by the agent from the environment at a certain moment, the action operation information the agent selected by sampling according to the probability distribution, the probability of selecting that action operation information, and the single reward value.
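One possible shape for such a memory pool is a plain transition buffer, as sketched below; the field names and the `act`/`total_reward` helpers referenced in the usage comment come from the earlier illustrative sketches and are assumptions.

```python
from collections import namedtuple

# state, sampled action, (log) probability of that action, single reward value
Transition = namedtuple("Transition", ["state", "action", "log_prob", "reward"])

class MemoryPool:
    """Stores the decision information of every time step of a round for later training."""
    def __init__(self):
        self.transitions = []

    def push(self, state, action, log_prob, reward):
        self.transitions.append(Transition(state, action, log_prob, reward))

    def clear(self):
        self.transitions = []

# usage inside the interaction loop (env/model as in the earlier sketches):
#   action, log_prob, next_state, task_info = act(model, state, env)
#   reward = total_reward(behavior_imitation_reward(...), task_reward)
#   memory.push(state, action, log_prob, reward)
```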
The step of training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is greater than the threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model comprises the following steps:
s301, initializing the intelligent agent to a random sampling state;
s302, training a preset prediction network model by adopting a course learning mode according to the single rewarding value and the task rewarding value, adding a plurality of single simulation rewarding values and the task rewarding value to obtain a total rewarding value, and completing training the preset prediction network model when the total rewarding value is greater than a threshold value to obtain a trained prediction network model;
and S303, returning the strategy output by the trained prediction network model.
The step of training the preset prediction network model in a curriculum learning mode according to the single reward value and the task reward value, adding the plurality of single imitation reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is larger than the threshold value to obtain a trained prediction network model comprises the following steps:
S3021, selecting expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single imitation reward values within the preset time period to obtain a single-segment imitation reward value;
S3022, judging whether to train again: if the single-segment imitation reward value is smaller than a single-segment reward threshold or an early termination condition is triggered, the single segment has not been passed, and the imitation learning training on the preset prediction network model is repeated; if the single-segment imitation reward value is larger than the single-segment reward threshold, a new time segment is added on the basis of the preset time period to obtain a cumulative time period;
S3023, selecting the expert decision data of the cumulative time period, repeating the imitation learning training on the preset prediction network model and the process of judging whether to train again, until the cumulative time period equals the duration of one full game and the sum of the task reward value and all single imitation reward values within that game is greater than a one-game reward threshold, at which point the training of the preset prediction network model is completed and the trained prediction network model is obtained.
Specifically, in one training, the agent is required to interact with the environment for several rounds, and training is performed on the data obtained from these interactions. At the start of each round the agent obtains an initial state, including its coordinates, facing direction and so on, and continues the subsequent interaction from that initial state. If the initial state of every round were the same, the subsequent trajectory might be hard to learn, so the agent needs to be initialized to a random sampling state, i.e. a state randomly sampled from the expert data: at the start of each round a state is selected from the expert decision data and used as the initial state of the agent, near the beginning of the trajectory segment to be imitated;
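A small sketch of this random-sampling initialization, reusing the assumed ExpertTrajectory class from above; the engine call `env.reset_to` is hypothetical.

```python
import random

def reset_from_expert(env, expert_traj, segment_start_s, segment_end_s):
    """Initialize the agent to a state randomly sampled from the expert segment being imitated."""
    segment = expert_traj.segment(segment_start_s, segment_end_s)
    frame = random.choice(segment.frames)           # state randomly sampled from expert data
    return env.reset_to(position=frame.position,    # hypothetical engine interface
                        angle=frame.angle,
                        posture=frame.posture)
```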
In curriculum learning, the curriculum is determined in advance from prior knowledge and then kept fixed. The curriculum learning mode here mainly sets up the curriculum in an inheritance manner: the capability of the short-sequence agent is inherited and continuously improved upon, and curriculum tasks from simple to difficult are set, progressing from imitating the trajectory generated by the expert to imitating the actions taken by the expert. Specifically, expert decision data within a preset time period is first selected to perform imitation learning training on the preset prediction network model, and all single imitation reward values within the preset time period are added to obtain a single-segment imitation reward value. For example: an expert decision trajectory segment of 10 seconds is adopted to perform imitation learning training on the preset prediction network model, one round of training lasts 10 seconds, and all single imitation reward values within one round are added to obtain the single-segment imitation reward value;
further, whether to train again is judged. If the single-segment imitation reward value is smaller than the single-segment reward threshold, or an early termination condition is triggered, the single segment has not been passed, and the imitation learning training on the preset prediction network model is repeated. If the single-segment imitation reward value is larger than the single-segment reward threshold, the single-segment training has been passed, and a new time segment needs to be added on the basis of the preset time period to obtain a cumulative time period;
further, the expert decision data of the cumulative time period is selected, and the process of performing imitation learning training on the preset prediction network model and judging whether to train again is repeated, until the cumulative time period equals the duration of one full game and the sum of the task reward value and all single imitation reward values within that game is greater than the one-game reward threshold, at which point the training of the preset prediction network model is completed and the trained prediction network model is obtained. For example: the expert takes 100 seconds to play a complete game, so 100 seconds of expert decision trajectory need to be collected, i.e. the total segment time of the expert decision trajectory in one game is 100 seconds, and the agent likewise needs 100 seconds to play the complete game when trained. The 100 seconds of expert decision trajectory can be divided into 10 segments of 10 seconds each: training first uses one 10-second expert decision trajectory segment, and in the next stage a new 10-second segment is appended to the original one, so that the existing cumulative time period is 20 seconds; meanwhile, training continues on the basis of the model trained in the previous stage;
further, the above steps are repeated. After multiple trainings, the length of the expert trajectory the model can imitate keeps increasing, and the cumulative time period keeps accumulating, until the cumulative time period equals the duration of one full game and the sum of the task reward value and the single imitation reward values within that game is greater than the one-game reward threshold, at which point the training of the preset prediction network model is completed and the curriculum learning ends. Here the sum of the single imitation reward values within one game is the value obtained by adding all single imitation reward values of a game after the agent finishes it, and the time the agent takes to finish a game equals the time the imitated expert takes to finish a game;
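The segment-growing curriculum described above could be organized roughly as follows. This is a sketch: the thresholds and the 10-second segment length are placeholders, and `run_round` is an assumed helper that plays one round with the earlier sketches (including early termination) and returns the summed reward of that round.

```python
def curriculum_train(model, env, expert_traj, segment_s=10.0,
                     segment_threshold=50.0, game_threshold=500.0):
    """Grow the imitated expert segment from segment_s seconds up to the full game."""
    game_length = expert_traj.frames[-1].t        # duration of one full expert game
    horizon = segment_s                           # current cumulative time period
    while True:
        total = run_round(model, env, expert_traj, 0.0, horizon)  # interact, update model, sum rewards
        if horizon < game_length:
            # single-segment stage: only lengthen the segment once it has been passed
            if total > segment_threshold:
                horizon = min(horizon + segment_s, game_length)
        else:
            # full-game stage: stop when task reward + imitation rewards exceed the one-game threshold
            if total > game_threshold:
                break
    return model
```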
further, by setting a trigger for the early termination condition, the problem of invalid exploration can be alleviated and the training time shortened. If the agent gets stuck in a certain state during training and cannot successfully learn the target action, the game needs to be terminated early to avoid the resource waste of continued imitation. The early termination condition is triggered when, within one round of training, the agent cannot or can hardly continue to advance along the trajectory; this can be judged from the coordinate information in the state information and falls into two cases: 1. the agent stays at a certain location or moves back and forth within a small range; 2. the agent's route deviates too much from the trajectory. In either case the agent can be considered stuck, and the round can be ended directly at that point.
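A sketch of such a coordinate-based check; the distance thresholds and the idea of testing a short window of recent positions are assumptions for the example.

```python
import math

def should_terminate_early(recent_positions, expert_positions,
                           move_eps=0.5, deviation_max=5.0):
    """True if the agent is stuck in place or has strayed too far from the expert trajectory."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Case 1: staying at one location or moving back and forth within a small range
    stuck = all(dist(p, recent_positions[0]) < move_eps for p in recent_positions)

    # Case 2: the route deviates too much from the expert trajectory
    deviated = min(dist(recent_positions[-1], e) for e in expert_positions) > deviation_max
    return stuck or deviated
```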
Based on the above fine-grained expert behavior imitation learning method, the present embodiment provides a fine-grained expert behavior imitation learning device, including:
the information acquisition module 1, used for acquiring the current environmental state information of the agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute a corresponding action according to the prediction information, and collecting task completion information and state information of the current action;
the reward value calculation module 2, used for calculating a single reward value according to the state information of the action and calculating a task reward value according to the task completion information;
the model training module 3, used for training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
In addition, it should be noted that the working process of the fine-grained expert behavior imitation learning device provided in this embodiment is the same as that of the fine-grained expert behavior imitation learning method described above; for details, reference may be made to the working process of the method, which is not repeated here.
Based on the above fine-grained expert behavior imitation learning method, the present embodiment provides a computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the fine-grained expert behavior imitation learning method as described in the above embodiments.
As shown in fig. 2, based on the above fine-grained expert behavior simulation learning method, the present application also provides a terminal device, which includes at least one processor (processor) 20; a display screen 21; and a memory (memory) 22, which may also include a communication interface (Communications Interface) 23 and a bus 24. Wherein the processor 20, the display 21, the memory 22 and the communication interface 23 may communicate with each other via a bus 24. The display screen 21 is configured to display a user guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may invoke logic instructions in the memory 22 to perform the methods of the embodiments described above.
Further, the logic instructions in the memory 22 described above may be implemented in the form of software functional units and stored in a computer readable storage medium when sold or used as a stand alone product.
The memory 22, as a computer readable storage medium, may be configured to store a software program, a computer executable program, such as program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 performs functional applications and data processing, i.e. implements the methods of the embodiments described above, by running software programs, instructions or modules stored in the memory 22.
The memory 22 may include a storage program area that may store an operating system, at least one application program required for functions, and a storage data area; the storage data area may store data created according to the use of the terminal device, etc. In addition, the memory 22 may include high-speed random access memory, and may also include nonvolatile memory. For example, a plurality of media capable of storing program codes such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or a transitory storage medium may be used.
Compared with the prior art, the fine-grained expert behavior imitation learning method provided by the present application comprises: obtaining the current environmental state information of an agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute a corresponding action according to the prediction information, and collecting task completion information and state information of the current action; calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information; training the preset prediction network model according to the single reward value and the task reward value, adding the task reward value and the plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.
The above examples of the present invention are described in some detail, but they should not be construed as limiting the scope of the invention. Various other embodiments are possible; based on the above, those skilled in the art can obtain other embodiments without any inventive effort, all of which fall within the scope of the invention as defined in the appended claims.

Claims (7)

1. A fine-grained expert behavior imitation learning method, characterized by comprising:
acquiring current environmental state information of an intelligent agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the intelligent agent to execute corresponding actions according to the prediction information, and collecting task completion condition information and current state information of the actions;
calculating a single reward value according to the state information of the action, and calculating a task reward value according to the task completion information;
training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model;
wherein the training a preset prediction network model according to the single reward value and the task reward value, adding the task reward value and a plurality of single reward values of each game to obtain a total reward value, completing the training of the preset prediction network model when the total reward value is larger than a threshold value to obtain a trained prediction network model, and returning the strategy output by the trained prediction network model comprises:
initializing the intelligent agent to a random sampling state;
training the preset prediction network model in a curriculum learning mode according to the single reward value and the task reward value, adding the plurality of single reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is larger than the threshold value to obtain a trained prediction network model;
returning the strategy output by the trained prediction network model;
the training the preset prediction network model in a curriculum learning mode according to the single reward value and the task reward value, adding the plurality of single reward values and the task reward value to obtain a total reward value, and completing the training of the preset prediction network model when the total reward value is larger than the threshold value to obtain a trained prediction network model comprises:
selecting expert decision data within a preset time period to perform imitation learning training on the preset prediction network model, and adding all single reward values within the preset time period to obtain a single-segment reward value;
judging whether to train again: if the single-segment reward value is smaller than a single-segment reward threshold or an early termination condition is triggered, the single segment has not been passed, and the imitation learning training on the preset prediction network model is repeated; if the single-segment reward value is larger than the single-segment reward threshold, a new time segment is added on the basis of the preset time period to obtain a cumulative time period;
and selecting the expert decision data of the cumulative time period, repeating the imitation learning training on the preset prediction network model and the process of judging whether to train again, until the cumulative time period equals the duration of one full game and the sum of the task reward value and all single reward values within that game is greater than a one-game reward threshold, completing the training of the preset prediction network model and obtaining the trained prediction network model.
2. The fine-grained expert behavior imitation learning method according to claim 1, wherein the preset prediction network model is an operation prediction network model constructed based on a deep reinforcement learning method.
3. The fine-grained expert behavior imitation learning method according to claim 2, wherein before the acquiring of the current environmental state information of the agent, the method further comprises acquiring expert decision data in advance.
4. The fine-grained expert behavior imitation learning method according to claim 3, wherein the acquiring the current environmental state information of the agent, inputting the current environmental state information into a preset prediction network model to obtain prediction information, controlling the agent to execute a corresponding action according to the prediction information, and collecting task completion information and current state information of the action comprises:
acquiring current environmental state information of an agent, and inputting the current environmental state information into the operation prediction network model to obtain prediction information, wherein the current environmental state information comprises coordinate information, angle information and posture information, and the prediction information is an action probability distribution;
selecting one piece of action operation information based on the action probability distribution sampling, and executing corresponding actions according to the action operation information, wherein each piece of action operation information corresponds to one probability;
and collecting task completion information and state information of the current action.
5. The fine-grained expert behavior imitation learning method according to claim 4, wherein the calculating a single reward value based on the state information of the action and calculating a task reward value based on the task completion information comprises:
comparing the state information of the action with the expert decision data to obtain difference information, and calculating a single reward value according to the difference information, wherein comparing the state information of the action with the expert decision data means comparing an action keyframe in the state information of the action with an expert keyframe in the expert decision data;
and calculating a task rewarding value according to the task completion condition information.
6. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps in the fine-grained expert behavior imitation learning method of any of claims 1-5.
7. A terminal device, comprising: a processor, a memory, and a communication bus; the memory has stored thereon a computer readable program executable by the processor;
the communication bus realizes connection communication between the processor and the memory;
the processor, when executing the computer readable program, implements the steps in the fine-grained expert behavior imitation learning method of any of claims 1-5.
CN202211285500.5A 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal Active CN115688858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211285500.5A CN115688858B (en) 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211285500.5A CN115688858B (en) 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal

Publications (2)

Publication Number Publication Date
CN115688858A CN115688858A (en) 2023-02-03
CN115688858B true CN115688858B (en) 2024-02-09

Family

ID=85066632

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211285500.5A Active CN115688858B (en) 2022-10-20 2022-10-20 Fine granularity expert behavior imitation learning method, device, medium and terminal

Country Status (1)

Country Link
CN (1) CN115688858B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN111401556A (en) * 2020-04-22 2020-07-10 清华大学深圳国际研究生院 Selection method of opponent type imitation learning winning incentive function
CN111580385A (en) * 2020-05-11 2020-08-25 深圳阿米嘎嘎科技有限公司 Robot walking control method, system and medium based on deep reinforcement learning
WO2021184530A1 (en) * 2020-03-18 2021-09-23 清华大学 Reinforcement learning-based label-free six-dimensional item attitude prediction method and device
CN113688977A (en) * 2021-08-30 2021-11-23 浙江大学 Confrontation task oriented man-machine symbiosis reinforcement learning method and device, computing equipment and storage medium
CN114048834A (en) * 2021-11-05 2022-02-15 哈尔滨工业大学(深圳) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion
CN114307160A (en) * 2021-12-10 2022-04-12 腾讯科技(深圳)有限公司 Method for training intelligent agent


Also Published As

Publication number Publication date
CN115688858A (en) 2023-02-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant