WO2023101112A1

WO2023101112A1 - Method for offline meta-reinforcement learning of multiple tasks and computing device for performing same

Info

Publication number: WO2023101112A1
Application number: PCT/KR2022/006128
Authority: WO
Inventors: 장병탁; 김준호
Original assignee: 서울대학교 산학협력단
Priority date: 2021-12-01
Filing date: 2022-04-28
Publication date: 2023-06-08

Abstract

A reinforcement learning method for establishing one policy for performing multiple tasks comprises the steps of: using data on the multiple tasks to generate a meta-model in which environments for the multiple tasks are approximated in a single model; and teaching the one policy by using the meta-model. Accordingly, since reinforcement learning can be performed through a model approximated using data prepared in advance without interaction with actual environments, learning time is dramatically reduced. In addition, since environment models for multiple tasks are approximated in a single meta-model, knowledge is readily shared amongst the tasks, and thus a policy having better performance can be obtained than in a case where each of the tasks is learned separately.

Description

Offline meta-reinforcement learning method for multiple tasks and computing device for performing the same

Embodiments disclosed herein relate to an offline meta-reinforcement learning method for a plurality of tasks and a computing device for performing the same.

This study was conducted as part of the ICT convergence industry innovative technology development project of the Ministry of Science and ICT and the Institute of Information & Communications Technology Planning & Evaluation, "Development of robot hand manipulation intelligence that learns how to handle various objects with a tactile robot hand and procedures" (IITP-2018-0-00622).

This study was conducted as part of the project "Development of learning technology that mimics human demonstration in a virtual reality environment for robots that support humans through physical interaction", an international cooperation project for industrial technology by the Ministry of Trade, Industry and Energy and the Korea Institute for Advancement of Technology (KIAT-P0006720).

This study was conducted as part of the project "(SW Star Lab) Development of cognitive agent SW based on daily life learning" of the Ministry of Science and ICT and the Institute of Information & Communications Technology Planning & Evaluation (IITP-2015-0-00310).

When reinforcement learning is applied to a real robot rather than a simulation, it takes a lot of time for the robot to interact with the environment, so it is difficult to learn by collecting data in real time. Therefore, research on offline reinforcement learning technology that obtains good policies without interaction with the environment is steadily being conducted, and it is applied to logistics systems or unmanned systems that require robot automation.

However, offline reinforcement learning technology has a limitation in that reinforcement learning must be performed using only given data.

To overcome this limitation, research was conducted on a reinforcement learning method for obtaining one policy that can perform all tasks well in a setting where data on a plurality of tasks is given but direct interaction with the environment for each task is not possible. As a result, the reinforcement learning method according to the embodiments introduced in this specification was derived.

Meanwhile, the above-mentioned background art is technical information that the inventor possessed for, or acquired in the process of, deriving the present invention, and is not necessarily known art disclosed to the general public prior to the filing of the present invention.

Through the embodiments disclosed in this specification, it is intended to provide a reinforcement learning method for establishing one policy for performing a plurality of tasks and a computing device for performing the same.

According to the embodiments disclosed herein, a metamodel approximating the environments for a plurality of tasks as one model is generated using data on the plurality of tasks, and a policy is learned using the generated metamodel, so that one policy capable of performing all of the plurality of tasks well can be learned.

According to any one of the above-described means for solving the problem, reinforcement learning can be performed through a model approximated using pre-prepared data without interaction with the real environment, so the time required for learning is drastically reduced. In addition, since the environment models for various tasks are approximated as one metamodel, knowledge is readily shared between tasks, and a policy with better performance can be obtained than when each task is learned separately.

Effects obtainable from the disclosed embodiments are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those skilled in the art to which the disclosed embodiments belong.

FIG. 1 is a diagram showing the configuration of a computing device that performs a reinforcement learning method for establishing one policy for performing a plurality of tasks according to an embodiment.

FIGS. 2 and 3 are flowcharts for explaining a reinforcement learning method for establishing one policy for performing a plurality of tasks according to an embodiment.

FIG. 4 is a diagram for explaining a process of learning a global environment model (meta-model) and a context encoder according to an embodiment, and FIG. 5 is a diagram for explaining a process of learning a policy agent according to an embodiment.

As a technical means for achieving the above-described technical problem, according to an embodiment, a reinforcement learning method for establishing one policy for performing a plurality of tasks may include generating, using data on the plurality of tasks, a metamodel in which the environments for the plurality of tasks are approximated as one model, and learning the one policy using the metamodel.

According to another embodiment, there is provided a computer program for performing a reinforcement learning method for establishing one policy for performing a plurality of tasks, wherein the reinforcement learning method may include generating, using data on the plurality of tasks, a metamodel in which the environments for the plurality of tasks are approximated as one model, and learning the one policy using the metamodel.

According to another embodiment, there is provided a computer-readable recording medium on which a program for performing a reinforcement learning method for establishing one policy for performing a plurality of tasks is recorded, wherein the reinforcement learning method may include generating, using data on the plurality of tasks, a metamodel that approximates the environments for the plurality of tasks as one model, and learning the one policy using the metamodel.

According to another embodiment, a computing device for performing a reinforcement learning method of establishing one policy for performing a plurality of tasks includes an input/output unit for receiving data and commands related to reinforcement learning and outputting reinforcement learning results, a storage unit for storing data and programs for performing reinforcement learning, and a control unit that includes at least one processor and performs reinforcement learning by executing the program, wherein the control unit, by executing the program, may generate a metamodel that approximates the environments for the plurality of tasks as one model using data on the plurality of tasks, and may learn the one policy using the metamodel.

Hereinafter, various embodiments will be described in detail with reference to the accompanying drawings. Embodiments described below may be modified and implemented in various different forms. In order to more clearly describe the characteristics of the embodiments, detailed descriptions of matters widely known to those skilled in the art to which the following embodiments belong are omitted. And, in the drawings, parts irrelevant to the description of the embodiments are omitted, and similar reference numerals are attached to similar parts throughout the specification.

Throughout the specification, when a component is said to be “connected” to another component, this includes not only the case of being “directly connected” but also the case of being “connected with another component intervening therebetween”. In addition, when a certain component "includes" a certain component, this means that other components may be further included without excluding other components unless otherwise specified.

Embodiments described in this specification relate to reinforcement learning that establishes a policy for performing all of a plurality of tasks well. In particular, in a state in which data for the plurality of tasks is prepared in advance, the embodiments allow reinforcement learning to be performed effectively without additional interaction by approximating an environment model for the plurality of tasks using that data.

Hereinafter, a configuration of a computing device for performing reinforcement learning will be briefly described, and then a process in which the computing device performs reinforcement learning will be described in detail.

FIG. 1 is a diagram showing the configuration of a computing device that performs a reinforcement learning method for establishing one policy for performing a plurality of tasks according to an embodiment. Referring to FIG. 1, a computing device 100 according to an embodiment may include an input/output unit 110, a control unit 120, and a storage unit 130.

The input/output unit 110 is a component for receiving data and commands related to reinforcement learning and outputting results of reinforcement learning. The input/output unit 110 may include various types of input devices (e.g., a keyboard or a touch screen) for receiving input from a user, and may also include a connection port or communication module for transmitting and receiving data necessary for reinforcement learning.

The control unit 120 (also referred to below as the controller 120) is a component including at least one processor such as a CPU, and can perform reinforcement learning according to the process presented below by executing a program stored in the storage unit 130. A method for the controller 120 to perform reinforcement learning will be described in detail below.

The storage unit 130 is a component in which files and programs can be stored and may be configured through various types of memories. In particular, data and programs that allow the controller 120 to perform reinforcement learning according to the process presented below may be stored in the storage unit 130.

Hereinafter, a process of performing reinforcement learning in which one policy for performing a plurality of tasks is established by the controller 120 executing a program stored in the storage unit 130 will be described in detail. The processes described below are performed by the control unit 120 executing a program stored in the storage unit 130 unless otherwise specified.

A reinforcement learning method for establishing one policy for performing a plurality of tasks according to an embodiment proceeds, in a situation in which data for the plurality of tasks is given and without additional interaction, by creating a metamodel (global environment model) that approximates the environments for the plurality of tasks as one model and then learning a policy through the created metamodel. With this reinforcement learning method, since the environment models for multiple tasks are approximated as a single metamodel, knowledge is readily shared between tasks, and a policy with better performance can be obtained than when each task is trained separately.

Referring to FIG. 2, in step 201, the control unit 120 may generate a metamodel (global environment model) approximating the environments for a plurality of tasks as one model using data on the plurality of tasks.

Here, the data for the plurality of tasks includes transition tuples composed of a state, an action, a reward, and a next state for each task, and the data for each task may be stored in advance in a buffer corresponding to that task.
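The per-task buffers of transition tuples described above can be sketched as follows. This is a minimal illustration only: the names `Transition` and `build_task_buffers`, the task identifiers, and the toy values are hypothetical and do not appear in the specification.

```python
from collections import namedtuple

# One transition: state s, action a, reward r, next state s'.
# The field names are illustrative assumptions.
Transition = namedtuple("Transition", ["state", "action", "reward", "next_state"])


def build_task_buffers(raw_data):
    """Group pre-collected transitions into one buffer (a list) per task.

    raw_data maps a task id to an iterable of (s, a, r, s') tuples.
    """
    return {
        task_id: [Transition(*t) for t in transitions]
        for task_id, transitions in raw_data.items()
    }


# Toy data for two hypothetical tasks with 2-D states and 1-D actions.
raw = {
    "task_0": [((0.0, 0.0), (1.0,), 0.5, (0.1, 0.0)),
               ((0.1, 0.0), (1.0,), 0.6, (0.2, 0.0))],
    "task_1": [((0.0, 0.0), (-1.0,), -0.5, (-0.1, 0.0))],
}
buffers = build_task_buffers(raw)
```

Keeping one buffer per task, rather than one pooled buffer, is what later allows a context vector to be computed from transitions that all belong to the same task.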

FIG. 3 shows detailed steps included in step 201 of FIG. 2. Referring to FIG. 3, in step 301, the controller 120 generates context vectors by performing embedding on transition tuples for each of a plurality of tasks.

Subsequently, in step 302, the control unit 120 may generate a global environment model (meta-model) by approximating an environment model for each task by conditioning the context vectors, and may train the global environment model.

The control unit 120 may implement a context encoder, a metamodel generator, and a policy agent learning module by executing programs stored in the storage unit 130. These three components are not physically separated hardware components, but correspond to software components classified for convenience according to the operations they perform.

The context encoder creates a context vector by performing embedding on the transitions of each of a plurality of tasks, and the metamodel generator creates a metamodel (global environment model) approximated for all tasks by conditioning the context vector, and the policy The agent learning module uses a metamodel to learn one policy for all tasks. A specific method for the control unit 120 to learn the context encoder, metamodel, and policy agent will be described in detail with reference to FIGS. 4 and 5 below.

According to an embodiment, the process in which the computing device 100 performs reinforcement learning consists broadly of (1) a process of learning a global environment model (meta-model) and a context encoder (a process of generating a meta-model) and (2) a process of learning the policy agent. Hereinafter, examples in which the computing device 100 performs these two processes will be described on the assumption that M transition tuples (state (s), action (a), reward (r), next state (s')) are stored in the buffer for each of the plurality of tasks.

FIG. 4 is a diagram for explaining a process of learning a global environment model (meta-model) and a context encoder, and FIG. 5 is a diagram for explaining a process of learning a policy agent.

(1) The process of learning the global environment model (meta-model) and context encoder (the process of creating a meta-model)

The controller 120 of the computing device 100 selects N tasks from among a plurality of tasks. The controller 120 performs a process using data of the selected N tasks, and the task buffer 410 shown in FIG. 4 is a buffer corresponding to any one of the selected N tasks. That is, the controller 120 may perform the process described with reference to FIG. 4 for each of the selected N tasks.

Referring to FIG. 4, the controller 120 randomly samples some of the transition tuples stored in the task buffer 410 as a first transition set and a second transition set. The first transition set and the second transition set are each sampled randomly and independently, so some of the transition tuples included in the two sets may overlap. The first transition set is used as an input to the context encoder 420, and the second transition set is used to train the metamodel 430.
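The independent sampling of two transition sets from one task buffer can be sketched as follows. The function name, the batch size, and the choice to sample without replacement inside each set are assumptions made for illustration; the specification only states that the two sets are sampled separately and may overlap.

```python
import random


def sample_transition_sets(task_buffer, batch_size, rng):
    """Draw two transition sets from one task buffer, independently.

    Each set is drawn without replacement within itself (an assumption),
    but because the two draws are separate, the same tuple may appear
    in both sets.
    """
    k = min(batch_size, len(task_buffer))
    context_set = rng.sample(task_buffer, k)   # input to the context encoder
    training_set = rng.sample(task_buffer, k)  # used to train the metamodel
    return context_set, training_set


rng = random.Random(0)
# Hypothetical buffer of M = 10 transitions, each an (s, a, r, s') tuple.
task_buffer = [((float(i),), (1.0,), 0.1 * i, (float(i + 1),)) for i in range(10)]
first_set, second_set = sample_transition_sets(task_buffer, batch_size=4, rng=rng)
```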

The control unit 120 applies the sampled first transition set as an input to the context encoder 420 to obtain, through embedding, a first context vector describing each task. The context vector z is expressed as Equation 1 below.

[Equation 1]

As can be seen in FIG. 4, (s, a) extracted from the second transition set is applied as an input to the metamodel 430 together with the first context vector z as a conditioning input, and the metamodel 430 outputs (s', r) as the third transition set.
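The data flow through the context encoder and the context-conditioned metamodel can be sketched as follows. Equation 1 is not reproduced in this text, so the encoder here is a stand-in: it embeds each transition with a linear map and mean-pools the results, which is only one possible design. The linear metamodel, the weight matrices `W_enc` and `W_model`, and all dimensions are likewise hypothetical.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def encode_context(transitions, W_enc):
    """Embed each (s, a, r, s') tuple and mean-pool into one context vector z.

    Mean pooling is an assumption; it makes z independent of tuple order.
    """
    embeddings = []
    for s, a, r, s_next in transitions:
        flat = list(s) + list(a) + [r] + list(s_next)
        embeddings.append(matvec(W_enc, flat))
    n = len(embeddings)
    return [sum(col) / n for col in zip(*embeddings)]


def metamodel_predict(s, a, z, W_model):
    """Predict (s', r) from (s, a), conditioned on the context vector z."""
    out = matvec(W_model, list(s) + list(a) + list(z))
    return out[:len(s)], out[len(s)]


rng = random.Random(0)
state_dim, action_dim, z_dim = 2, 1, 3
enc_in = 2 * state_dim + action_dim + 1
W_enc = [[rng.uniform(-1, 1) for _ in range(enc_in)] for _ in range(z_dim)]
W_model = [[rng.uniform(-1, 1) for _ in range(state_dim + action_dim + z_dim)]
           for _ in range(state_dim + 1)]

# A toy first transition set for one task.
first_set = [((0.0, 0.0), (1.0,), 0.5, (0.1, 0.0)),
             ((0.1, 0.0), (1.0,), 0.6, (0.2, 0.0))]
z = encode_context(first_set, W_enc)
s_next_pred, r_pred = metamodel_predict((0.1, 0.0), (1.0,), z, W_model)
```

In practice both components would be trainable neural networks; the point of the sketch is only that the same metamodel serves every task and is told which task it is modelling through z.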

The control unit 120 can train the metamodel 430 using the second transition set and the third transition set through the loss function expressed by Equation 2 below. The loss function of Equation 2 includes mean squared errors of the second transition set and the third transition set.

[Equation 2]

Here, T denotes the number of task types among the plurality of tasks.
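A mean-squared-error loss between the metamodel's predictions (the third transition set) and the sampled targets (the second transition set) can be sketched as follows. Equation 2 is not reproduced in this text, so the exact weighting is unknown; averaging the per-task error uniformly over tasks, and the function names used, are assumptions.

```python
def transition_mse(predicted, target):
    """Mean squared error between predicted and target (s', r) pairs."""
    total, count = 0.0, 0
    for (s_pred, r_pred), (s_true, r_true) in zip(predicted, target):
        for p, t in zip(list(s_pred) + [r_pred], list(s_true) + [r_true]):
            total += (p - t) ** 2
            count += 1
    return total / count


def metamodel_loss(per_task_predictions, per_task_targets):
    """Average the per-task MSE over the selected tasks (assumed weighting)."""
    losses = [transition_mse(p, t)
              for p, t in zip(per_task_predictions, per_task_targets)]
    return sum(losses) / len(losses)


# Two hypothetical tasks with one transition each, given as ((s'), r).
preds = [[((1.0, 0.0), 0.5)], [((0.0, 2.0), 1.0)]]
targets = [[((1.0, 0.0), 0.5)], [((0.0, 0.0), 1.0)]]
loss = metamodel_loss(preds, targets)
```

In this toy example the first task is predicted exactly, while the second task has a single squared error of 4 spread over three predicted components, so the per-task errors are 0 and 4/3 and their average is 2/3.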

By repeating the process described above (from the operation of selecting N tasks to the operation of calculating the loss function of Equation 2), the control unit 120 can train the context encoder 420 and the metamodel 430 so that the error of the loss function does not exceed a preset threshold.

(2) The process of training the policy agent

The controller 120 of the computing device 100 selects N tasks from among a plurality of tasks. The controller 120 performs a process using data of the selected N tasks, and the task buffer 510 shown in FIG. 5 is a buffer corresponding to any one of the selected N tasks. That is, the controller 120 may perform the process described with reference to FIG. 5 for each of the selected N tasks.

Referring to FIG. 5, the controller 120 randomly samples some of the transition tuples stored in the task buffer 510 as a fourth transition set and a fifth transition set. The fourth transition set and the fifth transition set are each sampled randomly and independently, so some of the transition tuples included in the two sets may overlap. The fourth transition set is used as an input to the context encoder 520, and the fifth transition set is used to train the policy agent.

The control unit 120 applies the sampled fourth transition set as an input to the context encoder 520 to obtain, through embedding, a second context vector describing each task. The context vector z is expressed as in Equation 1 above.

The control unit 120 obtains a sixth transition set (s', r)_new by applying the second context vector z, as a conditioning input, and (s, a) extracted from the fifth transition set to the metamodel 530 as inputs. A penalty reflecting the uncertainty of the model may then be applied to the reward r. In detail, the controller 120 may penalize the reward r in the sixth transition set (s', r)_new by calculating the Frobenius norm of the variance matrix of the model's prediction distribution. Equation 3 for calculating the reward r_new reflecting the uncertainty penalty is as follows.

[Equation 3]
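An uncertainty penalty based on the Frobenius norm of a variance matrix can be sketched as follows. Equation 3 is not reproduced in this text, so the subtractive form used here and the weight `penalty_weight` are assumptions chosen for illustration, not the formula of the specification.

```python
import math


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a matrix (list of rows)."""
    return math.sqrt(sum(x * x for row in matrix for x in row))


def penalized_reward(reward, variance_matrix, penalty_weight=1.0):
    """Subtract an uncertainty penalty from a model-predicted reward.

    penalty_weight is a hypothetical hyperparameter, not taken from the source.
    """
    return reward - penalty_weight * frobenius_norm(variance_matrix)


# Hypothetical diagonal variance matrix of the model's prediction distribution.
variance = [[3.0, 0.0],
            [0.0, 4.0]]
r_new = penalized_reward(reward=10.0, variance_matrix=variance, penalty_weight=0.5)
```

Here the Frobenius norm is sqrt(9 + 16) = 5, so with a weight of 0.5 the reward of 10 is reduced to 7.5. The larger the model's predictive variance for a transition, the less the policy is rewarded for relying on it.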

Using the transition set newly obtained by applying the penalty (the sixth transition set in which the uncertainty penalty is reflected in the reward) and the second context vector, the control unit 120 can learn the policy π, the state value function V, and the action value function Q through the loss functions expressed by Equations 4 to 6 below.

[Equation 4]

[Equation 5]

[Equation 6]

According to the above-described embodiments, reinforcement learning can be performed through a model approximated using pre-prepared data without interaction with the real environment, so the time required for learning is drastically reduced. In addition, since the environment models for various tasks are approximated as one metamodel, knowledge is readily shared between tasks, and thus a policy with better performance can be obtained than when each task is trained separately.

The term '~unit' used in the above embodiments means software or a hardware component such as a field programmable gate array (FPGA) or an ASIC, and a '~unit' performs certain roles. However, '~unit' is not limited to software or hardware. A '~unit' may be configured to reside in an addressable storage medium and may be configured to execute on one or more processors. Therefore, as an example, a '~unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

Functions provided within components and '~units' may be combined into a smaller number of components and '~units' or further separated into additional components and '~units'.

In addition, components and '~units' may be implemented to run on one or more CPUs in a device or a secure multimedia card.

The offline meta-reinforcement learning method for a plurality of tasks according to the embodiments described with reference to FIGS. 2 to 5 can also be implemented in the form of a computer-readable medium that stores instructions and data executable by a computer. In this case, instructions and data may be stored in the form of program code, and when executed by a processor, a predetermined program module may be generated to perform a predetermined operation. Also, computer-readable media can be any available media that can be accessed by a computer and include both volatile and non-volatile media and both removable and non-removable media. Also, a computer-readable medium may be a computer recording medium, which may include volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. For example, the computer recording medium may be a storage medium such as an HDD or SSD, an optical recording medium such as a CD, DVD, or Blu-ray disc, or a memory included in a server accessible through a network.

In addition, the offline meta-reinforcement learning method for a plurality of tasks according to the embodiments described with reference to FIGS. 2 to 5 may be implemented as a computer program (or computer program product) including instructions executable by a computer. A computer program includes programmable machine instructions processed by a processor and may be implemented in a high-level programming language, an object-oriented programming language, assembly language, or machine language. Also, the computer program may be recorded on a tangible computer-readable recording medium (e.g., a memory, a hard disk, a magnetic/optical medium, or a solid-state drive (SSD)).

Therefore, the offline meta-reinforcement learning method for a plurality of tasks according to the embodiments described with reference to FIGS. 2 to 5 can be implemented by a computing device executing the computer program described above. A computing device may include at least some of a processor, a memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and a storage device. Each of these components is connected to the others using various buses and may be mounted on a common motherboard or in any other suitable manner.

Here, the processor may process instructions within the computing device. Such instructions include, for example, instructions stored in the memory or the storage device to display graphic information for providing a GUI (Graphic User Interface) on an external input/output device, such as a display connected to the high-speed interface. As another example, multiple processors and/or multiple buses may be used along with multiple memories and memory types as appropriate. Also, the processor may be implemented as a chipset comprising chips including a plurality of independent analog and/or digital processors.

Memory also stores information within the computing device. In one example, the memory may consist of a volatile memory unit or a collection thereof. As another example, the memory may be composed of a non-volatile memory unit or a collection thereof. Memory may also be another form of computer readable medium, such as, for example, a magnetic or optical disk.

Also, the storage device may provide a large amount of storage space to the computing device. A storage device may be a computer-readable medium or a component that includes such a medium, and may include, for example, devices in a storage area network (SAN) or other components, such as a floppy disk device, a hard disk device, an optical disk device, or a tape device, flash memory, or other semiconductor memory device or device array of the like.

The above-described embodiments are for illustrative purposes, and those skilled in the art to which the above-described embodiments belong will understand that they can be easily modified into other specific forms without changing the technical spirit or essential features of the above-described embodiments. Therefore, it should be understood that the above-described embodiments are illustrative in all respects and not restrictive. For example, each component described as a single type may be implemented in a distributed manner, and similarly, components described as distributed may be implemented in a combined form.

The scope to be protected through this specification is indicated by the following claims rather than the detailed description above, and should be construed to include all changes or modifications derived from the meaning and scope of the claims and equivalent concepts thereof.

Claims

A reinforcement learning method for establishing one policy for performing a plurality of tasks, the method comprising:

generating a metamodel in which an environment for the plurality of tasks is approximated as one model by using data on the plurality of tasks; and

learning the one policy using the metamodel.
The method according to claim 1,

wherein the data for the plurality of tasks

includes transition tuples composed of a state, an action, a reward, and a next state for each of the plurality of tasks.
The method according to claim 1,

wherein the step of generating the metamodel comprises:

generating context vectors by performing embedding on transition tuples for each of the plurality of tasks; and

training the metamodel by conditioning on the generated context vectors.
The method according to claim 1,

wherein M transition tuples for each of the plurality of tasks are stored in advance in a buffer corresponding to that task, and

wherein the step of generating the metamodel comprises:

selecting N tasks from among the plurality of tasks;

randomly sampling a first transition set and a second transition set from the buffer of each of the N tasks;

embedding a first context vector for each of the N tasks by inputting the first transition set to a context encoder;

obtaining a third transition set by inputting the first context vector and the second transition set to the metamodel; and

training the context encoder and the metamodel using a loss function including mean squared errors of the second transition set and the third transition set.
The method according to claim 1,

wherein M transition tuples for each of the plurality of tasks are stored in advance in a buffer corresponding to that task, and

wherein the step of learning the one policy comprises:

selecting N tasks from among the plurality of tasks;

randomly sampling a fourth transition set and a fifth transition set from the buffer of each of the N tasks;

embedding a second context vector for each of the N tasks by inputting the fourth transition set to the context encoder;

obtaining a sixth transition set by inputting the second context vector and the fifth transition set to the metamodel;

reflecting a penalty due to uncertainty in the reward of the transition tuples included in the sixth transition set; and

learning the one policy using a loss function including the sixth transition set in which the penalty is reflected and the second context vector.
A computer-readable recording medium storing a program for causing a computer to execute the method according to claim 1.
A computer program executed by a computing device and stored in a recording medium to perform the method according to claim 1.
A computing device for performing a reinforcement learning method for establishing one policy for performing a plurality of tasks, the computing device comprising:

an input/output unit for receiving data and commands related to reinforcement learning and outputting reinforcement learning results;

a storage unit for storing data and programs for performing reinforcement learning; and

a control unit that includes at least one processor and performs reinforcement learning by executing the program,

wherein the control unit, by executing the program,

generates a metamodel that approximates an environment for the plurality of tasks as one model using data for the plurality of tasks, and learns the one policy using the metamodel.
The computing device according to claim 8,

wherein the data for the plurality of tasks

includes transition tuples composed of a state, an action, a reward, and a next state for each of the plurality of tasks.
The computing device according to claim 8,

wherein, in generating the metamodel, the control unit

generates context vectors by performing embedding on transition tuples for each of the plurality of tasks, and trains the metamodel by conditioning on the generated context vectors.
The computing device according to claim 8,

wherein M transition tuples for each of the plurality of tasks are stored in advance in a buffer corresponding to that task, and

wherein, in generating the metamodel, the control unit

selects N tasks from among the plurality of tasks, randomly samples a first transition set and a second transition set from the buffer of each of the N tasks, embeds a first context vector for each of the N tasks by inputting the first transition set to a context encoder, obtains a third transition set by inputting the first context vector and the second transition set to the metamodel, and trains the context encoder and the metamodel using a loss function including mean squared errors of the second transition set and the third transition set.
The computing device according to claim 8,

wherein M transition tuples for each of the plurality of tasks are stored in advance in a buffer corresponding to that task, and

wherein, in learning the one policy, the control unit

selects N tasks from among the plurality of tasks, randomly samples a fourth transition set and a fifth transition set from the buffer of each of the N tasks, embeds a second context vector for each of the N tasks by inputting the fourth transition set to the context encoder, obtains a sixth transition set by inputting the second context vector and the fifth transition set to the metamodel, reflects a penalty due to uncertainty in the reward of the transition tuples included in the sixth transition set, and learns the one policy using a loss function including the sixth transition set in which the penalty is reflected and the second context vector.