CN114841362A - Method for collecting imitation learning data by using virtual reality technology - Google Patents

Method for collecting imitation learning data by using virtual reality technology

Info

Publication number
CN114841362A
Authority
CN
China
Prior art keywords
learning
virtual
virtual reality
model object
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210331565.2A
Other languages
Chinese (zh)
Inventor
王春鹏
石翔慧
盖新宇
张岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202210331565.2A priority Critical patent/CN114841362A/en
Publication of CN114841362A publication Critical patent/CN114841362A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00: Manipulating 3D models or images for computer graphics
    • G06T19/006: Mixed reality

Abstract

The invention discloses a method for collecting imitation learning data using virtual reality technology, which belongs to the technical field of virtual reality and comprises the following steps: step one: acquiring scene image data, imitating a real scene, and building a virtual scene in a three-dimensional engine; step two: setting up, to scale, at least one operable virtual model object in the virtual scene as the agent. Compared with the traditional approach of human demonstration via keyboard, the method combines imitation learning with virtual reality technology, providing a feasible scheme for training agents with complex behavior, facilitating the collection of imitation learning data, and improving model training efficiency; imitation learning data collection is realized with virtual reality, and model training efficiency is improved.

Description

Method for collecting imitation learning data by using virtual reality technology
Technical Field
The invention relates to the technical field of virtual reality, and in particular to a method for collecting imitation learning data by using virtual reality technology.
Background
In recent years, with continuous breakthroughs in artificial intelligence and the maturing of the related algorithms, AI agents have gradually penetrated various fields and shown good application results. Unity Machine Learning Agents (ML-Agents) is an open-source Unity plugin that allows users to train intelligent agents in game and simulation environments; agents can be trained with reinforcement learning, imitation learning, neuroevolution, or other machine learning methods, and controlled through a simple, easy-to-use Python API.
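As a point of reference for the Python API mentioned above, the following is a minimal sketch of driving a Unity build through the ML-Agents low-level Python API (mlagents_envs); the executable name is an assumption, and random actions are used only to exercise the loop:

```python
# Minimal sketch of the ML-Agents low-level Python API (mlagents_envs).
# The executable name "TennisScene" is an illustrative assumption.
import numpy as np
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple

env = UnityEnvironment(file_name="TennisScene")  # hypothetical Unity build
env.reset()
behavior_name = list(env.behavior_specs)[0]
spec = env.behavior_specs[behavior_name]

for _ in range(10):
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    n = len(decision_steps)
    # Random continuous actions in [-1, 1], one row per requesting agent.
    action = ActionTuple(continuous=np.random.uniform(
        -1.0, 1.0, size=(n, spec.action_spec.continuous_size)))
    env.set_actions(behavior_name, action)
    env.step()
env.close()
```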
Reinforcement learning can train agents that far exceed human performance by interacting with the environment to maximize return, but training times are often very long. Imitation learning, by contrast, extracts knowledge from demonstrations by human experts or artificially created agents in order to replicate their behavior. Combining imitation learning, that is, performing reinforcement learning on top of human demonstrations, can greatly reduce training time and improve efficiency.
But for agents with complex behavior, it is difficult or even impossible to produce demonstrations with a keyboard, and demonstration quality suffers. Model performance depends heavily on demonstration quality, which makes training a complex agent with imitation learning impractical, and inefficient, because the desired effect is reached only after a large amount of training time.
Disclosure of Invention
The invention aims to remedy the defects in the prior art by providing a method for collecting imitation learning data using virtual reality technology.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for mock learning data collection using virtual reality technology, comprising the steps of:
the method comprises the following steps: acquiring scene image data, imitating a real scene, and building a virtual scene in a three-dimensional engine;
step two: setting up at least one operation virtual model object as a proxy in a virtual scene in a geometric mode;
step three: according to a specific target to be realized, compiling codes by using a Unity plug-in ML-Agents to complete state input, reward setting and action output of the intelligent agent;
step four: configuring reinforcement learning training parameters, training and checking effects;
step five: configuring simulation learning training parameters, and performing human demonstration by using a virtual reality tracker to finish data collection of simulation learning;
step six: performing simulation learning, training and checking effects on the basis of human demonstration;
step seven: and analyzing the result to obtain an optimal scheme.
Further, for step three, the virtual environment is built in Unity 3D.
Further, in step three, the vector action space of the virtual model object is of the Continuous type, with 5 variable action parameters in total: movement of the virtual model object along the x-, y-, and z-axes, and rotation of the virtual model object about its x- and z-axes.
Further, in step three, the Unity plug-in ML-Agents is used to complete the state input, reward setting, and action output of the virtual model object.
Further, in step six, the virtual model object performs reinforcement learning on the basis of the demonstration, cyclically training the policy model on the continuously input states, reward information, and action outputs.
Further, in step six, a virtual model object using the pure reinforcement learning model is trained in comparison with a virtual model object using the reinforcement learning + imitation learning model, where adding a human demonstration accelerates arriving at the policy result.
Further, in step five, a VR controller is used for data collection; its rotation and movement parameters yield the control parameters, completing data collection (see the sketch following this list).
Further, in step five, the imitation learning training demonstration uses an imitation learning algorithm combining the BC algorithm and the GAIL algorithm.
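To illustrate how a VR controller's pose could be mapped onto the 5-parameter continuous action space described above, the following sketch uses a hypothetical read_vr_pose() helper; actual pose acquisition depends on the VR SDK in use and is not specified by the patent:

```python
# Sketch: map frame-to-frame VR controller pose deltas to the 5 continuous
# action parameters (x/y/z translation, rotation about x and z).
# read_vr_pose() is a hypothetical placeholder for the VR SDK call.
import numpy as np

def read_vr_pose():
    # Placeholder: a real setup would query the VR SDK here.
    return np.zeros(3), 0.0, 0.0  # position (x, y, z), rotation about x, z

prev_pos, prev_rx, prev_rz = np.zeros(3), 0.0, 0.0

def vr_to_action():
    global prev_pos, prev_rx, prev_rz
    pos, rx, rz = read_vr_pose()
    # Pose deltas since the last frame become the 5-dim continuous action.
    action = np.concatenate([pos - prev_pos, [rx - prev_rx, rz - prev_rz]])
    prev_pos, prev_rx, prev_rz = pos, rx, rz
    return np.clip(action, -1.0, 1.0)  # ML-Agents expects values in [-1, 1]
```

In step five, each frame's output of vr_to_action() would be fed to the agent as its heuristic action and recorded by the Demonstration Recorder as one state-action pair.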
In summary, compared with the traditional approach of human demonstration via keyboard, the method combines imitation learning with virtual reality technology and provides a feasible scheme for training agents with complex behavior, which is of great significance;
combining imitation learning with other tools capable of controlling the agent (such as a depth camera) can replace keyboard operation, facilitating the collection of imitation learning data and improving model training efficiency;
imitation learning data collection can thus be realized with virtual reality, improving model training efficiency.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a logic diagram of the method for imitation learning data collection using virtual reality technology according to the present invention;
FIG. 2 is a schematic illustration of the demonstration of embodiment two of the method for imitation learning data collection using virtual reality technology according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments.
The present invention will be described in detail below in order to better understand the technical solution of the present invention.
Unity ML-Agents is an open-source Unity plugin that allows users to train intelligent agents in game and simulation environments; the agents may be trained using reinforcement learning, imitation learning, neuroevolution, or other machine learning methods, controlled through a simple, easy-to-use Python API. It supports several deep reinforcement learning algorithms (PPO, SAC, MA-POCA, self-play), and supports learning from demonstration through two imitation learning algorithms (BC and GAIL).
proximal Policy Optimization (near-end Policy Optimization algorithm), PPO for short, is a Policy gradient method for reinforcement learning, and allowsParallel agent interactions with the environment are sampled and agent objectives are optimized by stochastic gradient descent. The core idea is that the action probability of the above strategy is divided by the action probability of the current strategy
Figure BDA0003573250710000051
The objective function is constrained to ensure that large policy updates do not occur. Using PPO optimized clipping instead of the objective loss function:
Figure BDA0003573250710000052
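For concreteness, a compact sketch of the clipped surrogate objective above, written in PyTorch; ML-Agents implements this internally, so the function name and tensor shapes here are illustrative only:

```python
# Sketch of the PPO clipped surrogate loss from the formula above.
# Inputs are per-step tensors of log-probabilities and advantage estimates.
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    ratio = torch.exp(log_probs - old_log_probs)  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Negated because optimizers minimize; the objective itself is maximized.
    return -torch.min(unclipped, clipped).mean()
```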
example one
Referring to fig. 1, a method for imitation learning data collection using virtual reality technology,
the method comprises the following steps:
the method comprises the following steps: acquiring scene image data, imitating a real scene, and building a virtual scene in a three-dimensional engine;
step two: setting up at least one operation virtual model object as a proxy in a virtual scene in a geometric mode;
step three: according to a specific target to be realized, writing codes by using a Unity plug-in ML-Agents to complete state input, reward setting and action output of the intelligent agent;
step four: training for reinforcement learning: configuring reinforcement learning training parameters, training and checking effects;
step five: training to simulate learning: configuring simulation learning training parameters, and performing human demonstration by using a virtual reality tracker to finish data collection of simulation learning;
step six: training of imitation learning is carried out on the basis of demonstration, and the simulation learning + reinforcement learning is realized: performing simulation learning, training and checking effects on the basis of human demonstration;
step seven: and analyzing the result to obtain an optimal scheme.
The analysis mode is to compare the effect of model training by adopting a reinforcement learning mode and a mode of simulating learning and reinforcement learning.
Example two
Referring to fig. 2, building on embodiment one, a tennis scene is built as an example; the method specifically includes the following steps:
Step one: acquiring tennis court scene image data, imitating the real scene, and building a virtual tennis court scene in a three-dimensional engine;
Step two: setting up, to scale, two operable virtual model objects in the virtual tennis scene as agents, which share policy model parameters according to their agent rules;
specifically, agents with the same agent rules use the same policy model, and otherwise use different policy models.
It should be noted that the policy model adopts Proximal Policy Optimization (PPO), in which the ratio of the action probability under the current policy to the action probability under the previous policy,

$$r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},$$

constrains the objective function, ensuring that excessively large policy updates do not occur.
PPO uses the clipped surrogate objective in place of the usual policy-gradient loss:

$$L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right].$$
Step three: according to the specific target to be realized, writing code with the Unity plug-in ML-Agents to complete the state input, reward setting, and action output of the agent;
Step four: reinforcement learning training: configuring reinforcement learning training parameters, training, and checking the effect;
Step five: imitation learning training: configuring imitation learning training parameters, performing a human demonstration with a virtual reality tracker, and completing the data collection for imitation learning, adopting the reinforcement learning + imitation learning (BC + GAIL) mode;
when performing imitation learning training, the agent's Behavior Type is set to Heuristic Only, and a Demonstration Recorder component is added to record the human demonstration.
The vector action space of the virtual model object is of the Continuous type, with 5 variable action parameters in total: movement of the virtual model object along the x-, y-, and z-axes, and rotation of the virtual model object about its x- and z-axes.
It should be noted that the imitation learning training demonstration uses an imitation learning algorithm that employs the BC algorithm and the GAIL algorithm simultaneously, as further illustrated and described below:
Behavior Cloning, BC for short;
Generative Adversarial Imitation Learning, GAIL for short; the two may be used together.
BC: typically used as pre-training. The idea is to train the agent's policy network to be as close as possible to the behavior pattern of the human demonstration data; that is, given the same state input s, it should produce a similar output a.
The optimization target is no different from supervised learning: each state s corresponds to the input features, the action output by the expert corresponds to the label, and the model's output a only needs to approach the label:
$$\min_\theta \ \mathbb{E}_{(s,\,a^{*}) \sim \mathcal{D}}\left[\,\lVert \pi_\theta(s) - a^{*} \rVert^{2}\,\right]$$

(written here as a mean-squared-error objective over the demonstration set D, matching the continuous action space).
Before using BC, it should be ensured that a demonstration file has been recorded, i.e., that the demonstration data has been collected;
some teaching data, namely a number of state-action pairs, must be collected and used as training data for the policy network so that it can imitate the behavior.
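A minimal sketch of the BC optimization just described, treating the demonstration state-action pairs as a supervised dataset; network sizes (8-dim state, 5-dim action) and the data source are assumptions:

```python
# Sketch: behavior cloning as supervised regression on demo pairs.
# State dim 8 and action dim 5 are illustrative assumptions.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def bc_update(states, expert_actions):
    """One gradient step pushing pi_theta(s) toward the expert label a*."""
    pred = policy(states)                  # model output a
    loss = loss_fn(pred, expert_actions)   # distance to the label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```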
Generative Adversarial Imitation Learning, GAIL for short, can learn a policy directly from expert data. GAIL belongs to the field of inverse reinforcement learning (inverse RL): whereas reinforcement learning learns the optimal policy from the reward and next state s_{t+1} given by the environment,
inverse reinforcement learning enhances the policy's reward signal from recorded expert demonstrations of given s_t and a_t; its advantage is that it does not directly supervise the policy, making it more general.
The goal of the GAIL algorithm is to find a saddle point (π, D) in the following equation:
$$\mathbb{E}_{\pi}\left[\log D(s,a)\right] + \mathbb{E}_{\pi_E}\left[\log\left(1 - D(s,a)\right)\right] - \lambda H(\pi)$$

(the standard GAIL objective, where π_E is the expert policy and H(π) is the policy's entropy).
Two approximation functions are defined to represent π and D, namely π_θ and D_ω : S × A → (0, 1). An Adam optimizer performs gradient ascent on ω, and a TRPO step performs gradient descent on θ.
GAIL is equivalent to an additional intrinsic reward and can be trained in combination with the extrinsic reward of reinforcement learning. The higher the expert reward is set, the more the agent tends to mimic the expert's behavior in the environment. By setting a reasonable upper limit on this reward, the agent imitates the expert's behavior to a certain extent while still exploring the environment, so as to find a better policy.
From the above, BC and GAIL can significantly enhance the effect of reinforcement learning. Behavioral Cloning amounts to pre-training and is used only in the early stage; Generative Adversarial Imitation Learning can run through the whole reinforcement learning process, equivalent to adding an intrinsic reward: the closer the policy is to the expert demonstration, the greater the reward, so the agent can explore better solutions.
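As a sketch of the intrinsic reward GAIL adds, the following uses the common convention matching the saddle-point objective above, where D scores policy samples and the policy's reward is -log D(s, a); network sizes are assumptions, and ML-Agents' internal implementation may differ:

```python
# Sketch: GAIL discriminator and the intrinsic reward it induces.
# State dim 8 and action dim 5 are illustrative assumptions.
import torch
import torch.nn as nn

disc = nn.Sequential(nn.Linear(8 + 5, 64), nn.ReLU(),
                     nn.Linear(64, 1), nn.Sigmoid())  # D: S x A -> (0, 1)

def gail_reward(state, action):
    """Intrinsic reward -log D(s, a): high when the sample looks expert-like."""
    d = disc(torch.cat([state, action], dim=-1))
    return -torch.log(d + 1e-8)  # added to the extrinsic RL reward
```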
In the present application, the time needed to achieve the same result is greatly reduced with the reinforcement learning + BC + GAIL method.
Step six: training on the basis of the demonstration, realizing imitation learning + reinforcement learning: performing imitation learning on the basis of the human demonstration, training, and checking the effect;
specifically, the basic demonstration parameters are input through the VR controller (i.e., the human demonstration), after which the model continues training gradually without further human demonstration.
Step seven: analyzing the results to obtain the optimal scheme.
The scheme found by the analysis to accumulate the most reward is the optimal action scheme.
In a specific embodiment of the present application, for step three, the virtual environment is built in Unity 3D.
In a specific embodiment of the present application, in step three, the Unity plug-in ML-Agents is used to complete the state input, reward setting, and action output of the virtual model object.
In a specific embodiment of the present application, in step six, the virtual model object performs reinforcement learning on the basis of the demonstration, cyclically training the policy model on the continuously input states, reward information, and action outputs.
In a specific embodiment of the present application, in step six, the virtual model object using the pure reinforcement learning model is trained in comparison with the virtual model object using the reinforcement learning + imitation learning model, where adding a human demonstration accelerates arriving at the policy result.
In a specific embodiment of the present application, in step five, a VR controller is used for data collection; control parameters are obtained from the rotation and movement parameters of the VR controller, completing data collection.
The above description covers only preferred embodiments of the present invention, but the scope of the present invention is not limited thereto; any equivalent substitution or change that a person skilled in the art could make within the technical scope disclosed by the present invention, according to the technical solutions of the present invention and their inventive concept, shall fall within the scope of the present invention.

Claims (8)

1. A method for imitation learning data collection using virtual reality technology, characterized by comprising the following steps:
step one: acquiring scene image data, imitating a real scene, and building a virtual scene in a three-dimensional engine;
step two: setting up, to scale, at least one operable virtual model object in the virtual scene as the agent;
step three: according to the specific target to be realized, writing code with the Unity plug-in ML-Agents to complete the state input, reward setting, and action output of the agent;
step four: configuring reinforcement learning training parameters, training, and checking the effect;
step five: configuring imitation learning training parameters, and performing a human demonstration with a virtual reality tracker to complete the data collection for imitation learning;
step six: performing imitation learning on the basis of the human demonstration, training, and checking the effect;
step seven: analyzing the results to obtain the optimal scheme.
2. The method for imitation learning data collection using virtual reality technology according to claim 1, characterized in that, for step three, the virtual environment is built in Unity 3D.
3. The method for imitation learning data collection using virtual reality technology according to claim 2, characterized in that, in step three, the vector action space of the virtual model object is of the Continuous type, with 5 variable action parameters in total, including movement of the virtual model object along the x-, y-, and z-axes and rotation of the virtual model object about its x- and z-axes.
4. The method for imitation learning data collection using virtual reality technology according to claim 3, characterized in that, for step three, the Unity plug-in ML-Agents is used to complete the state input, reward setting, and action output of the virtual model object.
5. The method for imitation learning data collection using virtual reality technology according to claim 4, characterized in that, in step six, the virtual model object performs reinforcement learning on the basis of the demonstration, cyclically training the policy model on the continuously input states, reward information, and action outputs.
6. The method for imitation learning data collection using virtual reality technology according to claim 5, characterized in that, in step six, the virtual model object using the reinforcement learning model is trained in comparison with the virtual model object using the reinforcement learning + imitation learning model, where adding a human demonstration accelerates arriving at the policy result.
7. The method for imitation learning data collection using virtual reality technology according to claim 6, characterized in that, in step five, a VR controller is used for data collection, and control parameters are obtained from the rotation and movement parameters of the VR controller, completing data collection.
8. The method for imitation learning data collection using virtual reality technology according to claim 7, characterized in that, in step five, the imitation learning training demonstration uses an imitation learning algorithm that employs the BC algorithm and the GAIL algorithm simultaneously.
CN202210331565.2A 2022-03-30 2022-03-30 Method for collecting imitation learning data by using virtual reality technology Pending CN114841362A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210331565.2A CN114841362A (en) 2022-03-30 2022-03-30 Method for collecting imitation learning data by using virtual reality technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210331565.2A CN114841362A (en) 2022-03-30 2022-03-30 Method for collecting imitation learning data by using virtual reality technology

Publications (1)

Publication Number Publication Date
CN114841362A true CN114841362A (en) 2022-08-02

Family

ID=82564011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210331565.2A Pending CN114841362A (en) 2022-03-30 2022-03-30 Method for collecting imitation learning data by using virtual reality technology

Country Status (1)

Country Link
CN (1) CN114841362A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858574A (en) * 2018-12-14 2019-06-07 启元世界(北京)信息技术服务有限公司 The autonomous learning method and system of intelligent body towards man-machine coordination work
CN110496377A (en) * 2019-08-19 2019-11-26 华南理工大学 A kind of virtual table tennis forehand hit training method based on intensified learning
CN111983922A (en) * 2020-07-13 2020-11-24 广州中国科学院先进技术研究所 Robot demonstration teaching method based on meta-simulation learning
CN112162564A (en) * 2020-09-25 2021-01-01 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN113677485A (en) * 2019-01-23 2021-11-19 谷歌有限责任公司 Efficient adaptation of robot control strategies for new tasks using meta-learning based on meta-mimic learning and meta-reinforcement learning
CN114021330A (en) * 2021-10-28 2022-02-08 武汉中海庭数据技术有限公司 Simulated traffic scene building method and system and intelligent vehicle control method

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109858574A (en) * 2018-12-14 2019-06-07 启元世界(北京)信息技术服务有限公司 The autonomous learning method and system of intelligent body towards man-machine coordination work
CN113677485A (en) * 2019-01-23 2021-11-19 谷歌有限责任公司 Efficient adaptation of robot control strategies for new tasks using meta-learning based on meta-mimic learning and meta-reinforcement learning
CN110496377A (en) * 2019-08-19 2019-11-26 华南理工大学 A kind of virtual table tennis forehand hit training method based on intensified learning
CN111983922A (en) * 2020-07-13 2020-11-24 广州中国科学院先进技术研究所 Robot demonstration teaching method based on meta-simulation learning
CN112162564A (en) * 2020-09-25 2021-01-01 南京大学 Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN114021330A (en) * 2021-10-28 2022-02-08 武汉中海庭数据技术有限公司 Simulated traffic scene building method and system and intelligent vehicle control method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ROHIT JENA et al.: "Augmenting GAIL with BC for sample efficient imitation learning", arXiv, 9 November 2020 (2020-11-09), pages 3-5 *

Similar Documents

Publication Publication Date Title
Risi et al. Increasing generality in machine learning through procedural content generation
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
Song et al. Arena: A general evaluation platform and building toolkit for multi-agent intelligence
CN110516389A (en) Learning method, device, equipment and the storage medium of behaviour control strategy
Liapis et al. Sentient World: Human-Based Procedural Cartography: An Experiment in Interactive Sketching and Iterative Refining
El Gourari et al. The implementation of deep reinforcement learning in e-learning and distance learning: Remote practical work
CN111282272A (en) Information processing method, computer readable medium and electronic device
Xu et al. Composite Motion Learning with Task Control
Yang et al. Adaptive inner-reward shaping in sparse reward games
Rowe et al. Toward automated scenario generation with deep reinforcement learning in gift
CN114841362A (en) Method for collecting imitation learning data by using virtual reality technology
CN115797517A (en) Data processing method, device, equipment and medium of virtual model
Espinosa Leal et al. Reinforcement learning for extended reality: designing self-play scenarios
CN114186696A (en) Visual system and method for AI training teaching
Dinerstein et al. Learning policies for embodied virtual agents through demonstration
Browne et al. Guest editorial: General games
CN112017265A (en) Virtual human motion simulation method based on graph neural network
Feng et al. Recognizing Multiplayer Behaviors Using Synthetic Training Data
Li Design and implement of soccer player AI training system using unity ML-agents
Kang et al. Animation Character Generation and Optimization Algorithm Based on Computer Aided Design and Virtual Reality
Zhu et al. Deep neuro-evolution: Evolving neural network for character locomotion controller
Jianbo et al. Design of amazon chess evaluation function based on reinforcement learning
Kanervisto Advances in deep learning for playing video games
Davies et al. Modelling pervasive environments using bespoke and commercial game-based simulators
Plechawska-Wójcik et al. Professionalized master theses as a result of cooperation between university and industry

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination