WO2022009859A1 - Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program - Google Patents

Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program Download PDF

Info

Publication number
WO2022009859A1
Authority
WO
WIPO (PCT)
Prior art keywords
reinforcement learning
end effector
information
gripping mechanism
training model
Prior art date
Application number
PCT/JP2021/025392
Other languages
French (fr)
Japanese (ja)
Inventor
康博 藤田
Original Assignee
Preferred Networks, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Preferred Networks, Inc.
Publication of WO2022009859A1 publication Critical patent/WO2022009859A1/en

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J13/00Controls for manipulators
    • B25J13/08Controls for manipulators by means of sensing devices, e.g. viewing or touching devices

Definitions

  • This disclosure relates to a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program.
  • A reinforcement learning system is known that performs reinforcement learning of the operation of an end effector, taking as input an image captured by a fixed camera, so that a predetermined operation (for example, a gripping operation by the end effector) succeeds on a specified type of object among a plurality of types of objects placed in a predetermined area.
  • According to such a reinforcement learning system, if the specified type of object is placed at a position where it can be photographed, the success probability of the predetermined operation can be increased by repeating reinforcement learning.
  • On the other hand, if the specified type of object is not placed at a position where it can be photographed, reinforcement learning cannot proceed and the success probability of the predetermined operation cannot be increased.
  • the present disclosure provides a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of a predetermined operation on an object.
  • The reinforcement learning device according to one aspect of the present disclosure has, for example, the following configuration. That is, it has at least one memory and at least one processor, and the at least one processor is configured to be able to: input information about a captured image taken by an image pickup device whose position and/or posture change, together with information about a target object image indicating the object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and update the parameters of the training model based on the operation result for the object when the operation of the end effector is controlled based on the information output by the training model.
  • FIG. 1 is a diagram showing an example of a system configuration of a reinforcement learning system.
  • FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
  • FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
  • FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
  • FIG. 5 is a first diagram showing an execution example of the reinforcement learning process.
  • FIG. 6 is a second diagram showing an execution example of the reinforcement learning process.
  • FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
  • FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
  • FIG. 1 is a diagram showing an example of a system configuration of a reinforcement learning system.
  • the reinforcement learning system 100 includes a manipulator 110 and a reinforcement learning device 120.
  • The manipulator 110 is a device that performs a predetermined operation on a specified type of object (the object to be operated, as indicated by a target object image) from an object group 130 in which a plurality of types of objects are placed in a mixed state.
  • The main body 113 of the manipulator 110 has a plurality of arms connected via a plurality of joints, and is configured so that the position and posture of the tip portion of the main body 113 of the manipulator 110 are controlled by controlling each joint angle.
  • a gripping mechanism portion 111 (an example of an end effector) that performs a predetermined operation (a gripping operation in this embodiment) for an object of a specified type is attached to the tip portion of the main body portion 113 of the manipulator 110.
  • the gripping operation for the specified type of object is performed by controlling the opening and closing of the gripping mechanism unit 111.
  • an image pickup device 112 is attached to the tip portion of the main body 113 of the manipulator 110. That is, the image pickup device 112 is configured so that the position and the posture change with the change of the position and the posture of the gripping mechanism portion 111.
  • The image pickup device 112 outputs a captured image consisting of R, G, and B channel images at a predetermined frame period.
  • Alternatively, the image pickup device 112 may output, at a predetermined frame period, a captured image that includes distance information to each position on the object surface in addition to the R, G, and B channel images.
  • Alternatively, the image pickup device 112 may output a distance image containing distance information to each position on the object surface at a predetermined frame period.
  • The captured image captured by the image pickup device 112 may also be a moving image.
  • In the following description, the image pickup device 112 is assumed to output a captured image consisting of R, G, and B channel images at a predetermined frame period.
  • A drive control device 115 that "controls the operation of the gripping mechanism unit 111" (that is, controls the position and posture of the gripping mechanism unit 111 and the opening/closing of the gripping mechanism unit 111) is built into the support base 114 that supports the main body 113 of the manipulator 110.
  • the drive control device 115 acquires a captured image captured by the image pickup device 112 and transmits it to the reinforcement learning device 120. Further, the drive control device 115 acquires sensor signals detected by various sensors (not shown) arranged in the grip mechanism portion 111 of the manipulator 110 and the main body portion 113, and transmits them to the reinforcement learning device 120.
  • the drive control device 115 acquires information for controlling the operation of the gripping mechanism unit 111 from the reinforcement learning device 120 in response to the transmission of the captured image and the sensor signal.
  • The information for controlling the operation of the gripping mechanism unit 111 referred to here may include any command regarding the operation of the gripping mechanism unit 111, such as information (a target value) indicating the state of the gripping mechanism unit 111 after the operation, or specific operation amounts and control amounts for controlling the position and posture of the gripping mechanism unit 111 and its opening/closing.
  • the information for controlling the operation of the gripping mechanism unit 111 may include information for controlling the operation of the manipulator 110.
  • In the following description, the drive control device 115 is assumed to acquire information indicating the state after the operation of the gripping mechanism unit 111 as an example of the information for controlling the operation of the gripping mechanism unit 111.
  • When the drive control device 115 acquires the information indicating the state after the operation of the gripping mechanism unit 111, it controls the position and posture of the gripping mechanism unit 111 and the opening/closing of the gripping mechanism unit 111 based on various sensor signals (information indicating the state before the operation of the gripping mechanism unit 111).
  • The reinforcement learning device 120 has a reinforcement learning model (an example of a training model) that outputs information indicating the state after the operation of the gripping mechanism unit 111, taking as inputs the captured image transmitted from the drive control device 115 and the target object image indicating the object to be gripped by the gripping mechanism unit 111.
  • For the reinforcement learning model, for example, a neural network may be used.
  • Instead of inputting the captured image itself, a feature amount extracted from the captured image may be input.
  • The feature amount extracted from the captured image is, for example, a feature amount output from an intermediate layer when the captured image is input into a neural network, as sketched below.
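  • As a purely illustrative aid (not part of the disclosure), the following sketch shows how such an intermediate-layer feature amount could be extracted from a captured image using a generic, off-the-shelf convolutional network; the choice of network and layer is an assumption.

```python
import torch
import torchvision.models as models

# Hedged illustration (not the disclosed method): extract an intermediate-layer
# feature amount from a captured image using a generic CNN backbone.
backbone = models.resnet18(weights=None)   # any off-the-shelf CNN would do
features = {}
hook = backbone.layer3.register_forward_hook(
    lambda module, inp, out: features.update(feature=out))
captured_image = torch.randn(1, 3, 224, 224)   # dummy RGB captured image
backbone(captured_image)
intermediate_feature = features["feature"]     # feature amount from the intermediate layer
hook.remove()
```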
  • The information regarding the target object image input to the reinforcement learning model may be a captured image consisting of R, G, and B channel images, or a captured image that includes distance information to each position on the object surface in addition to the R, G, and B channel images.
  • The target object image may also be a distance image containing distance information to each position on the object surface.
  • The target object image may also be a moving image.
  • Instead of the target object image itself, a feature amount extracted from the target object image (for example, a feature amount output from an intermediate layer when the target object image is input into a neural network) may be input.
  • In the following description, a captured image consisting of R, G, and B channel images is assumed to be input.
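  • The following is a minimal, hypothetical sketch of a training model of the kind described above: a neural network that takes the captured image and the target object image as inputs and outputs information indicating the state of the gripping mechanism unit 111 after the operation. The layer sizes and the output parameterization (position, posture as a quaternion, gripper open/close) are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class GraspTrainingModel(nn.Module):
    """Sketch of a training model: captured image + target object image ->
    information indicating the post-operation state of the gripping mechanism
    unit 111. Layer sizes and the output parameterization are assumptions."""

    def __init__(self):
        super().__init__()
        # Shared convolutional encoder over the 6-channel stack of the
        # captured image (RGB) and the target object image (RGB).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head producing the post-operation state: 3D position, 4D posture
        # (quaternion), and a gripper open/close value.
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 8),
        )

    def forward(self, captured_image, target_object_image):
        x = torch.cat([captured_image, target_object_image], dim=1)
        out = self.head(self.encoder(x))
        position, posture, grip = out[:, :3], out[:, 3:7], out[:, 7:]
        return position, posture, torch.sigmoid(grip)
```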
  • The reinforcement learning device 120 acquires the operation result for the object to be gripped (for example, the determination result of whether or not the gripping operation was successful).
  • the reinforcement learning device 120 updates the model parameters of the reinforcement learning model based on the acquired operation result.
  • Reinforcement learning is performed using information on the captured image taken by the image pickup device 112, whose position and posture change with the change in the position and posture of the gripping mechanism unit 111.
  • Therefore, even when the object to be gripped cannot be photographed in the initial state, the gripping mechanism unit 111 can be operated so that the object to be gripped becomes photographable in the course of reinforcement learning. That is, according to the present embodiment, it is possible to provide a reinforcement learning system 100 capable of increasing the success probability of the gripping operation regardless of the placement state of the object to be gripped.
  • In the following description, the vertical direction of the drawing in FIG. 1 is defined as the Z-axis direction, the horizontal direction as the Y-axis direction, and the depth direction as the X-axis direction.
  • Next, the hardware configuration of the manipulator 110 (here, the mechanical system is omitted and only the hardware configuration of the control system is shown) and the hardware configuration of the reinforcement learning device 120 constituting the reinforcement learning system 100 will be described with reference to FIG. 2.
  • FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
  • the manipulator 110 has a sensor group 211 and an actuator group 212 in addition to the image pickup device 112 and the drive control device 115.
  • the sensor group 211 includes n sensors.
  • The n sensors include at least a sensor for calculating the position and posture of the gripping mechanism unit 111 (a sensor for measuring each joint angle of the main body 113) and a sensor for detecting the opening and closing of the gripping mechanism unit 111.
  • the actuator group 212 includes m actuators.
  • The m actuators include at least an actuator for controlling the position and posture of the gripping mechanism unit 111 (an actuator for controlling each joint angle of the main body 113) and an actuator for controlling the opening and closing of the gripping mechanism unit 111.
  • the drive control device 115 includes a sensor signal processing device 201, an actuator drive device 202, and a controller 203.
  • the sensor signal processing device 201 receives the sensor signal transmitted from the sensor group 211 and notifies the controller 203 of the sensor signal data.
  • the actuator drive device 202 acquires the control signal data from the controller 203 and transmits the control signal to the actuator group 212.
  • the controller 203 acquires the captured image transmitted from the image pickup device 112 and transmits it to the reinforcement learning device 120. Further, the controller 203 transmits the sensor signal data notified from the sensor signal processing device 201 to the reinforcement learning device 120.
  • the controller 203 acquires information indicating the state after the operation of the gripping mechanism unit 111 from the reinforcement learning device 120 in response to the transmission of the captured image and the sensor signal data. Further, when the controller 203 acquires the information indicating the state after the operation of the gripping mechanism unit 111, the controller 203 generates the control signal data for operating the actuator group 212 based on the sensor signal data and notifies the actuator drive device 202.
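  • For orientation only, and under the assumption that the received post-operation state is first resolved into target joint angles (for example by inverse kinematics, which is not detailed in the disclosure), the controller-side generation of control signal data could look like the following simple proportional feedback sketch; it is not the disclosed method.

```python
def generate_control_signal_data(target_joint_angles, current_joint_angles, kp=1.0):
    """Hedged illustration: proportional control toward assumed target joint angles.
    The mapping from the post-operation state of the gripping mechanism unit to
    joint angles is assumed to be handled elsewhere."""
    return [kp * (target - current)
            for target, current in zip(target_joint_angles, current_joint_angles)]
```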
  • The reinforcement learning device 120 has a processor 221, a main storage device (memory) 222, an auxiliary storage device 223, a network interface 224, and a device interface 225 as components.
  • the reinforcement learning device 120 is realized as a computer in which these components are connected via a bus 226.
  • In FIG. 2, the reinforcement learning device 120 is shown as including one of each component, but it may include a plurality of the same component. Further, although one reinforcement learning device 120 is shown in the example of FIG. 2, the reinforcement learning program may be installed in a plurality of reinforcement learning devices, and each of the plurality of reinforcement learning devices may execute the same part or a different part of the processing of the reinforcement learning program. In this case, the reinforcement learning devices may take the form of distributed computing in which the processing as a whole is executed while the devices communicate with each other via the network interface 224 or the like. That is, the reinforcement learning device 120 may be configured as a system that realizes its functions by having one or a plurality of computers execute instructions stored in one or a plurality of storage devices. Further, various data transmitted from the drive control device 115 may be processed by one or a plurality of reinforcement learning devices provided on the cloud, and the processing results may be transmitted back to the drive control device 115.
  • Various operations of the reinforcement learning device 120 may be executed in parallel using one or a plurality of processors, or using a plurality of reinforcement learning devices that communicate via the communication network 240. Various operations may also be distributed to a plurality of arithmetic cores in the processor 221 and executed in parallel. Further, some or all of the processes, means, and the like of the present disclosure may be executed by an external device 230 (at least one of a processor and a storage device) provided on the cloud and capable of communicating with the reinforcement learning device 120 via the communication network 240. In this way, the reinforcement learning device 120 may take the form of parallel computing by one or a plurality of computers.
  • The processor 221 may be an electronic circuit (processing circuitry such as a CPU, GPU, FPGA, or ASIC). The processor 221 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 221 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. Further, the processor 221 may include an arithmetic function based on quantum computing.
  • the processor 221 performs various calculations based on various data and instructions input from each device and the like of the internal configuration of the reinforcement learning device 120, and outputs the calculation result and the control signal to each device and the like.
  • the processor 221 controls each component included in the reinforcement learning device 120 by executing an OS (Operating System), an application, or the like.
  • the processor 221 may refer to one or more electronic circuits arranged on one chip, or may refer to one or more electronic circuits arranged on two or more chips or devices. When a plurality of electronic circuits are used, each electronic circuit may communicate by wire or wirelessly.
  • the main storage device 222 is a storage device that stores instructions executed by the processor 221 and various data, and various data stored in the main storage device 222 is read out by the processor 221.
  • the auxiliary storage device 223 is a storage device other than the main storage device 222. It should be noted that these storage devices mean arbitrary electronic components capable of storing various data, and may be semiconductor memories.
  • the semiconductor memory may be either a volatile memory or a non-volatile memory.
  • the storage device for storing various data in the reinforcement learning device 120 may be realized by the main storage device 222 or the auxiliary storage device 223, or may be realized by the built-in memory built in the processor 221.
  • A plurality of processors 221 may be connected (coupled) to one main storage device 222, or a single processor 221 may be connected.
  • A plurality of main storage devices 222 may be connected (coupled) to one processor 221.
  • The configuration may also be one in which a processor is connected (coupled) to at least one main storage device 222. Further, this configuration may be realized by the main storage devices 222 and processors 221 included in a plurality of reinforcement learning devices 120.
  • The main storage device 222 may also be integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache).
  • the network interface 224 is an interface for connecting to the communication network 240 wirelessly or by wire.
  • the network interface 224 may exchange various data with the drive control device 115 and other external devices 230 connected via the communication network 240.
  • The communication network 240 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or a combination thereof, as long as information is exchanged over it between the computer and the drive control device 115 or another external device 230.
  • An example of a WAN is the Internet, an example of a LAN is IEEE 802.11 or Ethernet, and an example of a PAN is Bluetooth (registered trademark) or NFC (Near Field Communication).
  • the device interface 225 is an interface such as USB that directly connects to the external device 250.
  • the external device 250 is a device connected to a computer.
  • the external device 250 may be an input device as an example.
  • the input device is, for example, a device such as a camera, a microphone, a motion capture, various sensors, a keyboard, a mouse, or a touch panel, and gives acquired information to a computer. Further, it may be a device having an input unit such as a personal computer, a tablet terminal, or a smartphone, a memory, and a processor.
  • the external device 250 may be an output device as an example.
  • The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or a speaker or the like that outputs audio. It may also be a device having an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.
  • the external device 250 may be a storage device (memory).
  • For example, the external device 250 may be a network storage or the like, or a storage such as an HDD.
  • the external device 250 may be a device having some functions of the components of the reinforcement learning device 120. That is, the computer may transmit or receive a part or all of the processing result of the external device 250.
  • FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
  • the reinforcement learning device 120 has an update unit 310, a state input unit 320, and a reinforcement learning model 330.
  • The update unit 310 has a reward calculation unit 311 and updates the model parameters of the reinforcement learning model 330. Specifically, the update unit 310 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111. The reward calculation unit 311 then calculates the reward based on the determination result acquired by the update unit 310. The update unit 310 then updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (the information indicating the change in state, the reward, and so on).
  • the determination of whether or not the gripping operation for the object to be gripped is successful may be automatically performed based on, for example, a captured image.
  • the user of the reinforcement learning system 100 may determine whether or not the gripping operation for the object to be gripped is successful.
  • the update unit 310 may calculate the reward based on information other than the determination result of whether or not the gripping operation is successful.
  • For example, the update unit 310 may calculate the reward based on various information such as the operation time or the number of operations required for the gripping operation to succeed, or the magnitude (energy efficiency) of the motion of the entire manipulator 110 during the gripping operation, as illustrated in the sketch below.
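  • As a hedged illustration of the reward calculation described above, the sketch below combines the grip-success determination with optional penalty terms for operation time, number of operations, and energy consumption; the specific terms and weights are assumptions, not values given in the disclosure.

```python
def compute_reward(grip_succeeded: bool,
                   elapsed_time_s: float = 0.0,
                   num_operations: int = 0,
                   energy_used_j: float = 0.0) -> float:
    """Illustrative reward: a success bonus minus assumed penalties for the
    operation time, the number of operations, and the energy used."""
    reward = 1.0 if grip_succeeded else 0.0
    reward -= 0.01 * elapsed_time_s    # penalize long gripping operations
    reward -= 0.01 * num_operations    # penalize many operations
    reward -= 0.001 * energy_used_j    # reward energy efficiency
    return reward
```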
  • the state input unit 320 acquires the captured image transmitted from the drive control device 115 and the target object image input by the user, and notifies the reinforcement learning model 330.
  • In the reinforcement learning model 330, the model parameters are updated by the update unit 310. The reinforcement learning model 330 with updated model parameters receives the captured image and the target object image notified from the state input unit 320 as inputs, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • The reinforcement learning model 330 outputs, as the information indicating the state after the operation of the gripping mechanism unit 111, for example, information indicating the position and posture of the gripping mechanism unit 111 after the operation and information indicating the opening and closing of the gripping mechanism unit 111 after the operation.
  • 3b in FIG. 3 shows another functional configuration example.
  • In this case, the state input unit 320 of the reinforcement learning device 120 is configured to acquire, in addition to the captured image and the target object image, information indicating the state of the gripping mechanism unit 111 before the operation (that is, the current state), and to notify the reinforcement learning model 330 of it.
  • The information indicating the state before the operation (the current state) of the gripping mechanism unit 111 referred to here includes, for example, information indicating the position and posture of the gripping mechanism unit 111 before the operation (currently) and information indicating the opening and closing of the gripping mechanism unit 111 before the operation (currently).
  • In this case, the reinforcement learning model 330 receives as inputs the captured image, the target object image, and the information indicating the state before the operation (the current state) of the gripping mechanism unit 111 notified from the state input unit 320, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
  • the reinforcement learning process shown in FIG. 4 is only an example, and a model for which reinforcement learning has been completed may be generated by executing the reinforcement learning process by another model generation method.
  • In step S401, the state input unit 320 of the reinforcement learning device 120 acquires the target object image.
  • In step S402, the state input unit 320 of the reinforcement learning device 120 acquires a captured image.
  • In step S403, if the state input unit 320 of the reinforcement learning device 120 is configured to acquire information indicating the state of the gripping mechanism unit 111 before the operation (the current state), it acquires that information.
  • In step S404, the reinforcement learning model 330 of the reinforcement learning device 120 receives as inputs the target object image and the captured image (and, where so configured, the information indicating the state of the gripping mechanism unit 111 before the operation), and outputs information indicating the state after the operation of the gripping mechanism unit 111. Here, the reinforcement learning model 330 is assumed to be configured to comprehensively output various information as the information indicating the state after the operation of the gripping mechanism unit 111.
  • As a result, the motion of the gripping mechanism unit 111 during the reinforcement learning process includes both the optimum motion selected from the set of possible motions and motions randomly selected from the set of possible motions.
  • In step S405, the reinforcement learning device 120 transmits the information indicating the post-operation state of the gripping mechanism unit 111 output by the reinforcement learning model 330 to the drive control device 115.
  • In step S406, the update unit 310 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • In step S407, the update unit 310 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and the reward calculation unit 311 of the reinforcement learning device 120 calculates the reward based on the acquired determination result.
  • In step S408, the update unit 310 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (the information indicating the change in state, the reward, and so on).
  • In step S409, the state input unit 320 of the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
  • If it is determined in step S409 not to switch to a different target object image (NO in step S409), the process returns to step S402.
  • On the other hand, if it is determined in step S409 to switch to a different target object image (YES in step S409), the process proceeds to step S410.
  • In step S410, the update unit 310 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied.
  • the end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, and one example thereof is a target success probability of a gripping operation for a predetermined object.
  • If it is determined in step S410 that the end condition of the reinforcement learning process is not satisfied (NO in step S410), the process returns to step S401. On the other hand, if it is determined that the end condition is satisfied (YES in step S410), the reinforcement learning process ends.
  • The reinforcement learning model 330 after the reinforcement learning process is completed is applied, as a reinforcement-learned model, to a device (object manipulation device) that outputs information for controlling the operation of the gripping mechanism unit 111 to the drive control device 115.
  • The reinforcement-learned model applied to the object manipulation device executes the processing of steps S401 to S405 in FIG. 4 (that is, it does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like). Further, in step S404, it outputs the optimum information as the information indicating the state after the operation of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 does not comprehensively perform various operations but performs the optimum operation selected from the set of possible operations.
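  • For orientation only, the reinforcement learning process of FIG. 4 (steps S401 to S410) can be paraphrased as the loop sketched below. The object names (env, model, update_unit) and their methods are hypothetical placeholders introduced for illustration; they are not part of the disclosure.

```python
def reinforcement_learning_process(env, model, update_unit, end_condition):
    """Illustrative paraphrase of the flow in FIG. 4 (steps S401 to S410)."""
    while True:
        target_object_image = env.get_target_object_image()          # S401
        while True:
            captured_image = env.get_captured_image()                # S402
            state_before = env.get_gripper_state()                   # S403 (optional)
            post_operation_state = model.infer(                      # S404
                target_object_image, captured_image, state_before)
            env.send_to_drive_control_device(post_operation_state)   # S405
            state_change = env.get_state_change()                    # S406
            grip_succeeded = env.judge_grip_success()                 # S407
            reward = update_unit.compute_reward(grip_succeeded)
            update_unit.update_model_parameters(model, state_change, reward)  # S408
            if env.switch_target_object():                           # S409
                break
        if end_condition():                                          # S410
            return model
```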
  • FIGS. 5 and 6 are first and second diagrams showing an execution example of the reinforcement learning process.
  • the reinforcement learning device 120 recognizes the object 511 included in the target object image 510 as the object to be gripped.
  • The user can specify any object included in the object group 130 as the object to be gripped.
  • the arrow 500 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 at the time when the object 511 is recognized as the object to be gripped. Further, the captured image 521 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 500.
  • In this state, the object 511 to be gripped is shielded by another object 512, and the image pickup device 112 cannot photograph the object 511. That is, in this state, the object 511 cannot be gripped.
  • the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 in order to change the position and posture of the gripping mechanism unit 111 so that the object 511 to be gripped can be photographed.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state after the operation of the gripping mechanism unit 111.
  • the arrow 501 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 after the change, which is changed by controlling the operation of the gripping mechanism portion 111. Further, the captured image 522 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 501.
  • As a result, the object 511 to be gripped can be photographed by photographing the object group 130 from the lateral direction (the X-axis direction).
  • Next, the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 in order to further change the position and posture of the gripping mechanism unit 111 so that the object 511 to be gripped can be gripped.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state after the operation of the gripping mechanism unit 111.
  • the arrow 601 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 after the change, which is changed by controlling the operation of the manipulator 110. Further, the captured image 611 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 601.
  • As a result, the gripping mechanism unit 111 is brought closer to the object 511 to be gripped, so that the object 511 can be gripped.
  • the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 so that the gripping mechanism unit 111 grips the object 511 to be gripped.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state of the gripping mechanism unit 111 after the operation.
  • The arrow 602 indicates the position and posture (shooting position and shooting direction) of the image pickup device 112 after the change caused by controlling the operation of the gripping mechanism unit 111 (the state in which the object 511 has been gripped and lifted to a predetermined height). Further, the captured image 612 shows a captured image of the object 511 taken under the position and posture indicated by the arrow 602.
  • As is clear from the above description, the reinforcement learning system 100 according to the first embodiment inputs the captured image taken by the image pickup device, whose position and posture change as the position and posture of the gripping mechanism unit change, and the target object image indicating the object to be gripped by the gripping mechanism unit, into the reinforcement learning model that outputs information indicating the state after the operation of the gripping mechanism unit.
  • It then updates the model parameters of the reinforcement learning model based on the operation result for the object to be gripped (the determination result of whether or not the gripping operation by the end effector was successful) when the operation of the gripping mechanism unit is controlled based on the information indicating the state after the operation of the gripping mechanism unit.
  • This makes it possible to control the operation so that the gripping mechanism unit can photograph the object to be gripped in the course of reinforcement learning.
  • As a result, according to the first embodiment, it is possible to provide a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of the gripping operation on the specified type of object regardless of its placement state.
  • FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
  • the reinforcement learning device 120 according to the second embodiment has an update unit 710 and a reinforcement learning model 720.
  • For the reinforcement learning model 720, for example, a neural network may be used.
  • the update unit 710 has a reward calculation unit 711 and a parameter update unit 712, and updates the model parameters of the reinforcement learning model 720.
  • Specifically, the update unit 710 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • the reward calculation unit 711 calculates the reward based on the determination result of whether or not the gripping operation for the object to be gripped is successful. Since the method for determining whether or not the gripping operation for the object to be gripped has been successful and the method for calculating the reward have already been described in the first embodiment, the description thereof will be omitted here.
  • the parameter update unit 712 updates each model parameter of the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 included in the reinforcement learning model 720.
  • The parameter update unit 712 updates the model parameters based on the information indicating the change in state acquired by the update unit 710, the reward (immediate reward) calculated by the reward calculation unit 711, and the predicted value of the expected value (Q value) of the discounted cumulative reward calculated by the expected value calculation unit 724, which will be described later.
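  • Although the disclosure does not fix a particular update rule, updating the model parameters from the state change, the immediate reward, and the predicted Q value of the discounted cumulative reward is consistent with a standard temporal-difference update. The PyTorch sketch below is one such assumed implementation; the transition structure, the discount factor gamma, and the helper max_q_over_candidates (a search over candidate next motions, cf. the adjustment unit 725) are all assumptions.

```python
import torch
import torch.nn.functional as F

def update_model_parameters(q_model, optimizer, transition,
                            max_q_over_candidates, gamma=0.99):
    """Illustrative TD update for Q(s, a, g), the expected discounted
    cumulative reward (details are assumptions, not the disclosed method)."""
    (captured_image, target_image, state, action,
     reward, next_captured_image, next_state, done) = transition
    q_pred = q_model(captured_image, target_image, state, action)
    with torch.no_grad():
        # Hypothetical helper: maximum Q value over candidate next motions.
        q_next = max_q_over_candidates(q_model, next_captured_image,
                                       target_image, next_state)
        target = reward + gamma * q_next * (1.0 - done)
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```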
  • In the reinforcement learning model 720, the model parameters are updated by the update unit 710. The reinforcement learning model 720 with updated model parameters receives as inputs the captured image, the target object image, and information indicating the state (s) of the gripping mechanism unit 111 before the operation, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • the reinforcement learning model 720 has an image analysis unit 721, a state and motion input unit 722, an addition unit 723, an expected value calculation unit 724, and an adjustment unit 725.
  • The image analysis unit 721 executes processing on the captured image transmitted from the drive control device 115 and the target object image (g) input by the user, and outputs the execution result to the addition unit 723.
  • the image analysis unit 721 is configured by using, for example, a neural network. More specifically, the image analysis unit 721 is composed of, for example, a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
  • The state and motion input unit 722 executes processing on the information indicating the state (s) of the gripping mechanism unit 111 before the operation and the information indicating the motion (a) of the gripping mechanism unit 111, and outputs the execution result to the addition unit 723.
  • The state and motion input unit 722 is configured by using, for example, a neural network. More specifically, the state and motion input unit 722 is composed of a first linear layer, a second linear layer, a shape conversion layer, and the like. Further, in order to search for the maximum Q value calculated by the expected value calculation unit 724 (described later), information indicating the motion (a) of the gripping mechanism unit 111, as adjusted by the adjustment unit 725, is input to the state and motion input unit 722 a predetermined number of times (for example, 20 times).
  • The addition unit 723 adds the execution result output from the image analysis unit 721 and the execution result output from the state and motion input unit 722, and inputs the sum to the expected value calculation unit 724.
  • The expected value calculation unit 724 executes processing on the sum, produced by the addition unit 723, of the execution result of the image analysis unit 721 and the execution result of the state and motion input unit 722, and calculates the Q value (Q(s, a, g)).
  • The expected value calculation unit 724 calculates as many Q values as there are pieces of information indicating the motion (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725.
  • the expected value calculation unit 724 is configured by using, for example, a neural network. More specifically, the expected value calculation unit 724 is composed of a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
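  • Purely as a sketch of the structure just described (image analysis unit 721, state and motion input unit 722, addition unit 723, and expected value calculation unit 724), the following assumes concrete layer and feature-map sizes; the actual dimensions are not specified in the disclosure.

```python
import torch
import torch.nn as nn

class QModel(nn.Module):
    """Sketch of the reinforcement learning model 720: computes Q(s, a, g)
    from the captured/target images (g), the pre-operation state (s), and a
    candidate motion (a). All sizes are illustrative assumptions."""

    def __init__(self, state_dim=8, action_dim=8):
        super().__init__()
        # Image analysis unit 721: convolution and MaxPooling layers.
        self.image_analysis = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # State and motion input unit 722: linear layers whose output is
        # shape-converted so that it can be added to the image feature map.
        self.state_and_motion = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 64),
        )
        # Expected value calculation unit 724: convolution layers followed by
        # a scalar Q-value output.
        self.expected_value = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, captured_image, target_object_image, state, action):
        g = torch.cat([captured_image, target_object_image], dim=1)
        feature_map = self.image_analysis(g)
        sa = self.state_and_motion(torch.cat([state, action], dim=-1))
        # Addition unit 723: broadcast the state/motion embedding over the map.
        combined = feature_map + sa[:, :, None, None]
        return self.expected_value(combined)
```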
  • the adjustment unit 725 adjusts the information indicating the operation (a) of the gripping mechanism unit 111 every time the Q value is calculated by the expected value calculation unit 724, and inputs the information to the state and operation input unit 722.
  • the adjusting unit 725 adjusts the information indicating the operation (a) of the gripping mechanism unit 111 a predetermined number of times (for example, 20 times), and extracts the maximum Q value from the Q values calculated during that period.
  • The adjustment unit 725 specifies information indicating one motion (a) from the set of possible motions of the gripping mechanism unit 111, for example based on the ε-greedy method.
  • Specifically, it may specify the information indicating the motion (a) corresponding to the maximum Q value, or it may specify information indicating a randomly selected motion (a).
  • The adjustment unit 725 then derives information indicating the state after the operation of the gripping mechanism unit 111 based on the information indicating the specified motion (a) of the gripping mechanism unit 111 and the information indicating the state (s) of the gripping mechanism unit 111 before the operation, and transmits it to the drive control device 115.
  • As a result, the motion of the gripping mechanism unit 111 during the reinforcement learning process includes both the optimum motion selected from the set of possible motions (the motion that maximizes the Q value) and motions randomly selected from the set of possible motions.
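  • As a hedged illustration of the adjustment unit 725 and the ε-greedy selection described above, the sketch below evaluates the Q value for a fixed number of candidate motions (for example, 20) and either takes the motion with the maximum Q value or a randomly selected one; the function sample_candidate_motion is a hypothetical helper that draws one motion from the set of possible motions of the gripping mechanism unit.

```python
import random
import torch

def select_motion(q_model, captured_image, target_object_image, state,
                  sample_candidate_motion, num_candidates=20, epsilon=0.1):
    """Illustrative ε-greedy selection over candidate motions (assumptions)."""
    if random.random() < epsilon:
        return sample_candidate_motion()       # exploration: random motion
    best_motion, best_q = None, float("-inf")
    for _ in range(num_candidates):            # e.g. 20 adjustments of (a)
        motion = sample_candidate_motion()
        with torch.no_grad():
            q = q_model(captured_image, target_object_image,
                        state, motion).item()
        if q > best_q:
            best_motion, best_q = motion, q
    return best_motion                         # exploitation: maximum-Q motion
```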
  • the functional configuration shown in FIG. 7 is merely an example, and the reinforcement learning model 720 may be configured by another functional configuration.
  • In the present embodiment, the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 are each configured using a neural network, but the reinforcement learning model 720 as a whole may instead be configured using a single neural network.
  • The above description covers the functions during the reinforcement learning process; the functions after the reinforcement learning process is completed are the same as in the first embodiment. That is, after the reinforcement learning process is completed, the update unit 710 does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like. Further, the adjustment unit 725 outputs, as the information indicating the state after the operation of the gripping mechanism unit 111, the information derived based on the optimum information (the information indicating the motion (a) that maximizes the Q value). As a result, according to the reinforcement-learned model, it is possible to acquire a behavior rule that maximizes the expected value (Q value) of the discounted cumulative reward.
  • FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
  • the reinforcement learning process shown in FIG. 8 is merely an example, and a model for which reinforcement learning has been completed may be generated by executing the reinforcement learning process by another model generation method.
  • In step S801, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a target object image.
  • In step S802, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a captured image.
  • In step S803, the reinforcement learning model 720 of the reinforcement learning device 120 acquires information indicating the state (s) of the gripping mechanism unit 111 before the operation (the current state).
  • In steps S804 to S807, information indicating one motion (a) is specified from the set of possible motions, for example based on the ε-greedy method, and information indicating the state after the operation of the gripping mechanism unit 111 is comprehensively output.
  • When specifying the information indicating the motion (a) corresponding to the maximum Q value, steps S804 to S806 are executed and the process then proceeds to step S807. When specifying information indicating a motion (a) randomly selected from the set of possible motions, the process proceeds directly to step S807.
  • In step S804, the reinforcement learning model 720 of the reinforcement learning device 120 calculates the Q value.
  • In step S805, the reinforcement learning model 720 of the reinforcement learning device 120 determines whether or not the Q value has been calculated a predetermined number of times. If it is determined in step S805 that the Q value has not yet been calculated the predetermined number of times (NO in step S805), the process proceeds to step S806.
  • In step S806, the reinforcement learning model 720 of the reinforcement learning device 120 adjusts the information indicating the motion (a) of the gripping mechanism unit 111, and the process returns to step S804.
  • On the other hand, if it is determined in step S805 that the Q value has been calculated the predetermined number of times (YES in step S805), the process proceeds to step S807.
  • In step S807, when steps S804 to S806 have been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies the information indicating the motion (a) corresponding to the maximum Q value, derives information indicating the state after the operation of the gripping mechanism unit 111, and transmits it to the drive control device 115. When steps S804 to S806 have not been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies information indicating a randomly selected motion (a), derives information indicating the state after the operation of the gripping mechanism unit 111, and transmits it to the drive control device 115.
  • In step S808, the update unit 710 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • In step S809, the update unit 710 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and calculates the immediate reward. Further, the update unit 710 of the reinforcement learning device 120 acquires the predicted value of the expected value (Q value) of the discounted cumulative reward calculated by the expected value calculation unit 724.
  • In step S810, the update unit 710 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 720 using the acquired information indicating the change in state, the calculated immediate reward, and the acquired predicted value (Q value) of the discounted cumulative reward.
  • In step S811, the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
  • If it is determined in step S811 not to switch to a different target object image (NO in step S811), the process returns to step S802.
  • On the other hand, if it is determined in step S811 to switch to a different target object image (YES in step S811), the process proceeds to step S812.
  • In step S812, the update unit 710 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied.
  • the end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, and one example thereof is a target success probability of a gripping operation for a predetermined object.
  • If it is determined in step S812 that the end condition of the reinforcement learning process is not satisfied (NO in step S812), the process returns to step S801.
  • On the other hand, if it is determined in step S812 that the end condition of the reinforcement learning process is satisfied (YES in step S812), the reinforcement learning process ends.
  • the reinforcement learning model 720 after the reinforcement learning process is completed is applied to the object operation device as the reinforcement learning completed model.
  • The reinforcement-learned model applied to the object manipulation device executes the processing of steps S801 to S807 in FIG. 8 (that is, it does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like). Further, in step S807, it outputs the optimum information as the information indicating the state after the operation of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 does not comprehensively perform various operations but performs the optimum operation selected from the set of possible operations (the operation that maximizes the Q value).
  • the predetermined operation performed on the specified type of object is not limited to the gripping operation, and may be any other operation. That is, the end effector attached to the tip end portion of the main body portion 113 of the manipulator 110 is not limited to the gripping mechanism portion 111, and may be any other operation mechanism portion.
  • The arbitrary operations referred to here include, for example, a pressing operation of pushing the specified type of object, a suction operation of sucking the specified type of object, and an attraction operation of attracting the specified type of object with an electromagnet.
  • In the first and second embodiments, the image pickup device is attached to the tip portion of the manipulator, but the attachment position of the image pickup device is not limited to the tip portion of the manipulator. It may be any position as long as the position and posture of the image pickup device change according to the change in the position and posture of the gripping mechanism unit.
  • the gripping mechanism unit and the image pickup device may be attached to different manipulators, for example, and the above-mentioned reinforcement learning model can be applied even in that case.
  • the reinforcement learning model in this case may be configured to output information for controlling at least one of the position and the posture of the image pickup device in addition to the information for controlling the operation of the gripping mechanism unit.
  • In the first and second embodiments, the information indicating the state before the operation of the gripping mechanism unit that is input to the reinforcement learning model has been described as including information indicating the position and posture of the gripping mechanism unit and information indicating its opening and closing. However, the information indicating the state before the operation of the gripping mechanism unit is not limited to these, and other information may be input.
  • In the first and second embodiments, the manipulator 110 and the reinforcement learning device 120 are configured as separate bodies, but the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) may be configured integrally. Alternatively, the drive control device 115 and the reinforcement learning device 120 (or the object manipulation device) may be configured integrally.
  • In the first and second embodiments, the reinforcement learning process has been described as being performed while the operation of the gripping mechanism unit 111 is actually controlled based on the information, output from the reinforcement learning device 120, indicating the state after the operation of the gripping mechanism unit 111. However, it is not necessary to actually control the operation of the gripping mechanism unit 111, and the reinforcement learning process may be performed using a simulator simulating the actual environment. In this case, the image pickup device may also change its position and posture and perform shooting on a simulator simulating the actual environment. Further, the predetermined operation on the object to be operated and the generation of the operation result may also be performed on a simulator simulating the actual environment.
  • the reinforcement learning device 120 may perform reinforcement learning processing in the case where the manipulator 110 to which the end effector is not attached to the tip portion operates the object to be operated by the main body 113. In this case, the reinforcement learning device 120 may output information for controlling the operation of the tip portion of the main body 113 of the manipulator 110.
  • In the first and second embodiments, both the position and posture of the tip portion of the main body 113 of the manipulator 110 are changed, but the configuration may be such that at least one of the position and the posture changes. That is, the gripping mechanism unit 111 may be configured so that at least one of its position and posture changes. Further, the image pickup device 112 may be configured so that at least one of its position and posture changes with the change of at least one of the position and posture of the gripping mechanism unit 111. In this case, the reinforcement learning device 120 may output, as the information for controlling the operation of the gripping mechanism unit 111, information for controlling at least one of the position and posture of the gripping mechanism unit 111 and information for controlling the opening and closing of the gripping mechanism unit 111.
  • In the present specification (including the claims), when the expression "at least one of a, b, and c" or "at least one of a, b, or c" (including similar expressions) is used, it includes any of a, b, c, a-b, a-c, b-c, or a-b-c. It may also include multiple instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes adding an element other than the listed elements (a, b, and c), such as having d as in a-b-c-d.
  • When the terms "connected" and "coupled" are used in the present specification (including the claims), they are intended as non-limiting terms that include direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and so on. The terms should be interpreted as appropriate according to the context in which they are used, but connection/coupling forms that are not intentionally or naturally excluded should not be interpreted as excluded from these terms.
  • In the present specification (including the claims), the expression that an element A is "configured to" perform an operation B may include that the physical structure of the element A has a configuration capable of executing the operation B, and that a permanent or temporary setting (setting/configuration) of the element A is configured/set to actually execute the operation B.
  • For example, when the element A is a general-purpose processor, it suffices that the processor has a hardware configuration capable of executing the operation B and is configured to actually execute the operation B by a permanent or temporary program (instructions).
  • When the element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices that the circuit structure of the processor is implemented so as to actually execute the operation B, regardless of whether or not control instructions and data are actually attached.
  • In the present specification (including the claims), when a plurality of pieces of hardware perform predetermined processes, the pieces of hardware may cooperate to perform the predetermined processes, or some of the hardware may perform all of the predetermined processes. Further, some of the hardware may perform part of the predetermined processes and other hardware may perform the rest.
  • In the present specification (including the claims), when an expression such as "one or more pieces of hardware perform a first process and the one or more pieces of hardware perform a second process" is used, the hardware that performs the first process and the hardware that performs the second process may be the same or different. That is, the hardware that performs the first process and the hardware that performs the second process need only be included in the one or more pieces of hardware.
  • the hardware may include an electronic circuit, a device including the electronic circuit, or the like.
  • When a plurality of storage devices (memories) store data, each storage device among the plurality of storage devices may store only part of the data or may store the whole of the data.
100: Reinforcement learning system, 110: Manipulator, 111: Gripping mechanism unit, 112: Imaging device, 113: Main body unit, 115: Drive control device, 120: Reinforcement learning device, 310: Update unit, 311: Reward calculation unit, 320: State input unit, 330: Reinforcement learning model, 510: Target object image, 511: Object, 521, 522: Captured image, 611, 612: Captured image, 710: Update unit, 711: Reward calculation unit, 712: Parameter update unit, 720: Reinforcement learning model, 721: Image analysis unit, 722: State and action input unit, 723: Addition unit, 724: Expected value calculation unit, 725: Adjustment unit

Abstract

Provided are a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the probability of success of a prescribed manipulation on an object. The reinforcement learning device has at least one memory and at least one processor, and the at least one processor is configured to be capable of: inputting information relating to a captured image captured by an imaging device at least one of whose position and orientation changes, and information relating to a target object image indicating an object to be manipulated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and updating a parameter of the training model on the basis of the result of manipulation of the object in a case where the operation of the end effector is controlled on the basis of the information output by the training model.

Description

Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program

The present disclosure relates to a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program.

A reinforcement learning system is known that performs reinforcement learning of the operation of an end effector, using as input an image captured by a fixed camera, so that a predetermined operation (for example, a gripping operation by the end effector) succeeds for a specified type of object among a plurality of types of objects placed in a predetermined area.

According to such a reinforcement learning system, if an object of the specified type is placed in a position where it can be photographed, the success probability of the predetermined operation can be increased by repeating reinforcement learning. On the other hand, if the object of the specified type is not placed in a position where it can be photographed, reinforcement learning cannot proceed and the success probability of the predetermined operation cannot be increased.

The present disclosure provides a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of a predetermined operation on an object.
The reinforcement learning device according to one aspect of the present disclosure has, for example, the following configuration. That is, the reinforcement learning device has at least one memory and at least one processor, and the at least one processor is configured to be capable of executing: inputting information on a captured image taken by an imaging device at least one of whose position and posture changes, and information on a target object image indicating an operation target object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and updating a parameter of the training model based on an operation result for the object in a case where the operation of the end effector is controlled based on the information output by the training model.
FIG. 1 is a diagram showing an example of the system configuration of the reinforcement learning system.
FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
FIG. 5 is a first diagram showing an execution example of the reinforcement learning process.
FIG. 6 is a second diagram showing an execution example of the reinforcement learning process.
FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
Hereinafter, each embodiment will be described with reference to the attached drawings. In the present specification and drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and duplicate description thereof is omitted.
[First Embodiment]
<System configuration of the reinforcement learning system>
First, the system configuration of the reinforcement learning system will be described. FIG. 1 is a diagram showing an example of the system configuration of the reinforcement learning system. As shown in FIG. 1, the reinforcement learning system 100 includes a manipulator 110 and a reinforcement learning device 120.
The manipulator 110 is, for example, a device that performs a predetermined operation on an object of a specified type (the operation target object indicated by a target object image) from among an object group 130 in which a plurality of types of objects are placed in a mixed state.

The main body 113 of the manipulator 110 has a plurality of arms connected via a plurality of joints, and is configured such that the position and posture of the tip portion of the main body 113 of the manipulator 110 are controlled by controlling each joint angle.
A gripping mechanism unit 111 (an example of an end effector) that performs a predetermined operation (a gripping operation in the present embodiment) on the object of the specified type is attached to the tip portion of the main body 113 of the manipulator 110. The gripping operation on the object of the specified type is performed by controlling the opening and closing of the gripping mechanism unit 111.

Further, an imaging device 112 is attached to the tip portion of the main body 113 of the manipulator 110. That is, the imaging device 112 is configured such that its position and posture change with changes in the position and posture of the gripping mechanism unit 111. The imaging device 112 outputs captured images including R-value, G-value, and B-value images at a predetermined frame period. Alternatively, the imaging device 112 may output, at a predetermined frame period, captured images including distance information to each position on the object surface in addition to the R-value, G-value, and B-value images, or may output distance images including distance information to each position on the object surface. The captured image taken by the imaging device 112 may also be a moving image. In the following, for simplicity of description, the imaging device 112 is described, as an example, as outputting captured images including R-value, G-value, and B-value images at a predetermined frame period.
Further, a drive control device 115 that "controls the operation of the gripping mechanism unit 111" (controls the position and posture of the gripping mechanism unit 111 and the opening and closing of the gripping mechanism unit 111) is built into the support base 114 that supports the main body 113 of the manipulator 110.

The drive control device 115 acquires the captured images taken by the imaging device 112 and transmits them to the reinforcement learning device 120. The drive control device 115 also acquires sensor signals detected by various sensors (not shown) arranged in the gripping mechanism unit 111 and the main body 113 of the manipulator 110, and transmits them to the reinforcement learning device 120.
Further, in response to transmitting the captured image and the sensor signals, the drive control device 115 acquires, from the reinforcement learning device 120, information for controlling the operation of the gripping mechanism unit 111. The information for controlling the operation of the gripping mechanism unit 111 referred to here may include any command relating to the operation of the gripping mechanism unit 111, for example:
- information indicating the post-operation state of the gripping mechanism unit 111 (a target value), and
- specific operation amounts and control amounts for controlling the position and posture of the gripping mechanism unit 111 and the opening and closing of the gripping mechanism unit 111.
The information for controlling the operation of the gripping mechanism unit 111 may also include information for controlling the operation of the manipulator 110. In the following, the drive control device 115 is described as acquiring, as an example of the information for controlling the operation of the gripping mechanism unit 111, information indicating the post-operation state of the gripping mechanism unit 111 (a simple data-structure sketch of such information is given below).
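As a purely illustrative sketch (not part of the present disclosure), information indicating the post-operation target state of the gripping mechanism unit 111 could be represented in Python roughly as follows; all field names and example values are hypothetical assumptions.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class GripperTargetState:
    # Hypothetical post-operation target state of the gripping mechanism unit:
    # position in meters, posture as roll/pitch/yaw in radians, and whether
    # the gripper should be open after the operation.
    position_xyz: Tuple[float, float, float]
    posture_rpy: Tuple[float, float, float]
    gripper_open: bool

# Example: move the gripper above the object group with the fingers open.
command = GripperTargetState(
    position_xyz=(0.40, 0.10, 0.30),
    posture_rpy=(0.0, 1.57, 0.0),
    gripper_open=True,
)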
Further, upon acquiring the information indicating the post-operation state of the gripping mechanism unit 111, the drive control device 115 controls, based on the various sensor signals (information indicating the pre-operation state of the gripping mechanism unit 111):
- the various actuators (not shown) in the gripping mechanism unit 111 of the manipulator 110, and
- the various actuators (not shown) in the main body 113 of the manipulator 110.
As a result, the position and posture of the gripping mechanism unit 111 and the opening and closing of the gripping mechanism unit 111 are controlled.
The reinforcement learning device 120 has a reinforcement learning model (an example of a training model) that receives as input the captured image transmitted from the drive control device 115 and a target object image indicating the gripping target object to be gripped by the gripping mechanism unit 111, and outputs information indicating the post-operation state of the gripping mechanism unit 111. For the reinforcement learning model, for example, a neural network may be used.

When inputting the information on the captured image transmitted from the drive control device 115 into the reinforcement learning model, a feature amount extracted from the captured image may be input instead of the captured image itself. The feature amount extracted from the captured image is, for example, a feature amount output from an intermediate layer when the captured image is input into a neural network.

The information on the target object image input into the reinforcement learning model may be a captured image including R-value, G-value, and B-value images, a captured image including R-value, G-value, and B-value images together with distance information to each position on the object surface, or a distance image including distance information to each position on the object surface. The target object image may also be a moving image. Further, instead of the target object image itself, a feature amount extracted from the target object image (for example, a feature amount output from an intermediate layer when the target object image is input into a neural network) may be input into the reinforcement learning model. In the following, for simplicity of description, a captured image including R-value, G-value, and B-value images is assumed to be input as an example of the target object image.
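For illustration only, such a training model could be organized as in the following PyTorch-style sketch, which receives the captured image and the target object image and outputs a post-operation pose and an open/close value for the gripping mechanism unit; the layer sizes, names, and output encoding are assumptions introduced here and are not taken from the present disclosure.

import torch
import torch.nn as nn

class PolicyModelSketch(nn.Module):
    # Hypothetical model: (captured image, target object image) ->
    # (post-operation pose, gripper open/close probability).
    def __init__(self):
        super().__init__()
        # Shared image encoder applied to both input images.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head mapping the concatenated features to 6 pose values
        # (x, y, z, roll, pitch, yaw) and 1 open/close logit.
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 7))

    def forward(self, captured_img, target_img):
        features = torch.cat(
            [self.encoder(captured_img), self.encoder(target_img)], dim=1)
        out = self.head(features)
        return out[:, :6], torch.sigmoid(out[:, 6:])

# Example call with dummy 128x128 RGB images.
model = PolicyModelSketch()
pose, grip_open = model(torch.zeros(1, 3, 128, 128), torch.zeros(1, 3, 128, 128))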
Further, when the operation of the gripping mechanism unit 111 is controlled based on the information indicating the post-operation state of the gripping mechanism unit 111 output by the reinforcement learning model, the reinforcement learning device 120 acquires the operation result for the gripping target object (for example, a determination result as to whether the gripping operation succeeded). The reinforcement learning device 120 then updates the model parameters of the reinforcement learning model based on the acquired operation result.
As described above, in order to increase the success probability of the gripping operation when gripping an object of a designated type from among the object group 130 in which a plurality of types of objects are placed in a mixed state, the reinforcement learning system 100 performs reinforcement learning using information on captured images taken by the imaging device 112, whose position and posture change with changes in the position and posture of the gripping mechanism unit 111.

As a result, even when, for example, the gripping target object is not placed in a position where it can be photographed, the gripping mechanism unit 111 can be operated so that the gripping target object becomes photographable in the course of reinforcement learning. That is, according to the present embodiment, it is possible to provide a reinforcement learning system 100 capable of increasing the success probability of the gripping operation regardless of the placement state of the gripping target object.
In the present embodiment, as indicated by reference numeral 140, the vertical direction of the drawing sheet of FIG. 1 is defined as the Z-axis direction, the horizontal direction of the drawing sheet of FIG. 1 as the Y-axis direction, and the depth direction of the drawing sheet of FIG. 1 as the X-axis direction.
<Hardware configuration of each device constituting the reinforcement learning system>
Next, the hardware configuration of the manipulator 110 (here, the mechanical system is omitted and only the hardware configuration of the control system is shown) and the hardware configuration of the reinforcement learning device 120, which constitute the reinforcement learning system 100, will be described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
(1) Hardware configuration of the manipulator
As shown in FIG. 2, the manipulator 110 has a sensor group 211 and an actuator group 212 in addition to the imaging device 112 and the drive control device 115.
The sensor group 211 includes n sensors. In the present embodiment, the n sensors include at least:
- sensors for calculating the position and posture of the gripping mechanism unit 111 (sensors that measure each joint angle of the main body 113), and
- a sensor that detects the opening and closing of the gripping mechanism unit 111.
The actuator group 212 includes m actuators. In the present embodiment, the m actuators include at least:
- actuators for controlling the position and posture of the gripping mechanism unit 111 (actuators for controlling each joint angle of the main body 113), and
- an actuator for controlling the opening and closing of the gripping mechanism unit 111.
The drive control device 115 has a sensor signal processing device 201, an actuator drive device 202, and a controller 203. The sensor signal processing device 201 receives the sensor signals transmitted from the sensor group 211 and notifies the controller 203 of the sensor signal data. The actuator drive device 202 acquires control signal data from the controller 203 and transmits control signals to the actuator group 212.

The controller 203 acquires the captured images transmitted from the imaging device 112 and transmits them to the reinforcement learning device 120. The controller 203 also transmits the sensor signal data notified from the sensor signal processing device 201 to the reinforcement learning device 120.

Further, in response to transmitting the captured image and the sensor signal data, the controller 203 acquires, from the reinforcement learning device 120, information indicating the post-operation state of the gripping mechanism unit 111. Upon acquiring the information indicating the post-operation state of the gripping mechanism unit 111, the controller 203 generates, based on the sensor signal data, control signal data for operating the actuator group 212 and notifies the actuator drive device 202 of it.
(2) Hardware configuration of the reinforcement learning device
Next, the hardware configuration of the reinforcement learning device 120 will be described. As shown in FIG. 2, the reinforcement learning device 120 has, as its components, a processor 221, a main storage device (memory) 222, an auxiliary storage device 223, a network interface 224, and a device interface 225. The reinforcement learning device 120 is realized as a computer in which these components are connected via a bus 226.
In the example of FIG. 2, the reinforcement learning device 120 is shown as including one of each component, but it may include a plurality of the same component. Further, although one reinforcement learning device 120 is shown in the example of FIG. 2, the reinforcement learning program may be installed on a plurality of reinforcement learning devices, and each of the plurality of reinforcement learning devices may be configured to execute the same or a different part of the processing of the reinforcement learning program. In this case, the reinforcement learning devices may take the form of distributed computing in which the whole processing is executed by the devices communicating with one another via the network interface 224 or the like. That is, the reinforcement learning device 120 may be configured as a system that realizes its functions by one or more computers executing instructions stored in one or more storage devices. Alternatively, the various data transmitted from the drive control device 115 may be processed by one or more reinforcement learning devices provided on a cloud, and the processing results may be transmitted to the drive control device 115.

The various operations of the reinforcement learning device 120 may be executed in parallel using one or more processors, or using a plurality of reinforcement learning devices that communicate via the communication network 240. The various operations may also be distributed to a plurality of arithmetic cores in the processor 221 and executed in parallel. Further, some or all of the processes, means, and the like of the present disclosure may be executed by an external device 230 (at least one of a processor and a storage device) provided on a cloud capable of communicating with the reinforcement learning device 120 via the communication network 240. In this way, the reinforcement learning device 120 may take the form of parallel computing by one or more computers.
The processor 221 may be an electronic circuit (a processing circuit, processing circuitry, a CPU, a GPU, an FPGA, an ASIC, or the like). The processor 221 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 221 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 221 may also include an arithmetic function based on quantum computing.
The processor 221 performs various operations based on various data and instructions input from the devices and the like of the internal configuration of the reinforcement learning device 120, and outputs operation results and control signals to the devices and the like. The processor 221 controls the components included in the reinforcement learning device 120 by executing an OS (Operating System), applications, and the like.

The processor 221 may refer to one or more electronic circuits arranged on one chip, or to one or more electronic circuits arranged on two or more chips or devices. When a plurality of electronic circuits are used, the electronic circuits may communicate by wire or wirelessly.
The main storage device 222 is a storage device that stores instructions executed by the processor 221, various data, and the like, and the various data stored in the main storage device 222 are read out by the processor 221. The auxiliary storage device 223 is a storage device other than the main storage device 222. These storage devices mean arbitrary electronic components capable of storing various data, and may be semiconductor memories. A semiconductor memory may be either a volatile memory or a non-volatile memory. The storage device for storing various data in the reinforcement learning device 120 may be realized by the main storage device 222 or the auxiliary storage device 223, or may be realized by a built-in memory incorporated in the processor 221.

A plurality of processors 221 may be connected (coupled) to one main storage device 222, or a single processor 221 may be connected. Alternatively, a plurality of main storage devices 222 may be connected (coupled) to one processor 221. When the reinforcement learning device 120 is composed of at least one main storage device 222 and a plurality of processors 221 connected (coupled) to the at least one main storage device 222, a configuration in which at least one of the plurality of processors 221 is connected (coupled) to the at least one main storage device 222 may be included. This configuration may also be realized by the main storage devices 222 and the processors 221 included in a plurality of reinforcement learning devices 120. Further, a configuration in which the main storage device 222 is integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.
The network interface 224 is an interface for connecting to the communication network 240 wirelessly or by wire. For the network interface 224, an appropriate interface such as one conforming to an existing communication standard is used. Via the network interface 224, various data may be exchanged with the drive control device 115 and other external devices 230 connected through the communication network 240. The communication network 240 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or a combination thereof, as long as information is exchanged between the computer and the drive control device 115 or other external devices 230. Examples of a WAN include the Internet, examples of a LAN include IEEE 802.11 and Ethernet, and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).

The device interface 225 is an interface such as USB that directly connects to an external device 250.
The external device 250 is a device connected to the computer. As one example, the external device 250 may be an input device. The input device is, for example, a device such as a camera, a microphone, a motion-capture device, various sensors, a keyboard, a mouse, or a touch panel, and provides acquired information to the computer. It may also be a device including an input unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.

As another example, the external device 250 may be an output device. The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or a speaker or the like that outputs sound. It may also be a device including an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.

The external device 250 may also be a storage device (memory). For example, the external device 250 may be a network storage or the like, or a storage such as an HDD.

The external device 250 may also be a device having some of the functions of the components of the reinforcement learning device 120. That is, the computer may transmit or receive part or all of the processing results of the external device 250.
<Functional configuration of the reinforcement learning device>
Next, as the functional configuration of the reinforcement learning device 120, two functional configuration examples will be described here. FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device. As shown in 3a of FIG. 3, the reinforcement learning device 120 has an update unit 310, a state input unit 320, and a reinforcement learning model 330.
The update unit 310 has a reward calculation unit 311 and updates the model parameters of the reinforcement learning model 330. Specifically, the update unit 310 acquires a determination result as to whether the gripping operation on the gripping target object succeeded, and information indicating the change in state caused by the operation of the gripping mechanism unit 111 being controlled. The reward calculation unit 311 calculates a reward based on the determination result acquired by the update unit 310. The update unit 310 then updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (information indicating changes in state, rewards, and the like).

The determination as to whether the gripping operation on the gripping target object succeeded may be made automatically, for example, based on a captured image. Alternatively, a user of the reinforcement learning system 100 may make this determination.

The reward calculation method described above is only an example, and the update unit 310 may calculate the reward based on information other than the determination result as to whether the gripping operation succeeded. For example, the update unit 310 may calculate the reward based on various information such as the operation time or the number of operations required for the gripping operation to succeed, or the magnitude of the motion of the entire manipulator 110 during the gripping operation (energy efficiency).
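A minimal sketch of such a reward calculation is shown below; the weighting of the success flag, the number of operations, and the energy measure is a hypothetical choice and is not specified in the present disclosure.

def compute_reward(grasp_succeeded: bool, num_operations: int, energy_used: float) -> float:
    # Hypothetical reward: a bonus for a successful grasp, minus small
    # penalties for the number of operations and the energy consumed.
    reward = 1.0 if grasp_succeeded else 0.0
    reward -= 0.01 * num_operations
    reward -= 0.001 * energy_used
    return reward

# Example: a grasp that succeeded after 8 operations.
print(compute_reward(True, num_operations=8, energy_used=12.5))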
The state input unit 320 acquires the captured image transmitted from the drive control device 115 and the target object image input by the user, and notifies the reinforcement learning model 330 of them.
The model parameters of the reinforcement learning model 330 are updated by the update unit 310. After the model parameters are updated, the reinforcement learning model 330 receives as input the captured image and the target object image notified from the state input unit 320, and outputs information indicating the post-operation state of the gripping mechanism unit 111. In the present embodiment, the reinforcement learning model 330 outputs, as the information indicating the post-operation state of the gripping mechanism unit 111, for example:
- information indicating the post-operation position and posture of the gripping mechanism unit 111, and
- information indicating the post-operation opening and closing of the gripping mechanism unit 111.
On the other hand, 3b of FIG. 3 shows another functional configuration example. As shown in 3b of FIG. 3, the state input unit 320 of the reinforcement learning device 120 is configured to acquire, in addition to the captured image and the target object image, information indicating the pre-operation (current) state of the gripping mechanism unit 111, and to notify the reinforcement learning model 330 of it. The information indicating the pre-operation (current) state of the gripping mechanism unit 111 referred to here includes, for example:
- information indicating the pre-operation (current) position and posture of the gripping mechanism unit 111, and
- information indicating the pre-operation (current) opening and closing of the gripping mechanism unit 111.
In this case, the reinforcement learning model 330 receives as input the captured image, the target object image, and the information indicating the pre-operation (current) state of the gripping mechanism unit 111 notified from the state input unit 320, and outputs information indicating the post-operation state of the gripping mechanism unit 111.
<Flow of the reinforcement learning process>
Next, the flow of the reinforcement learning process performed by the reinforcement learning device 120 will be described. FIG. 4 is a first flowchart showing the flow of the reinforcement learning process. The flow of the reinforcement learning process will be described below with reference to FIG. 4. The reinforcement learning process shown in FIG. 4 is merely an example, and a reinforcement-learned model may be generated by executing the reinforcement learning process according to another model generation method.
In step S401, the state input unit 320 of the reinforcement learning device 120 acquires the target object image.

In step S402, the state input unit 320 of the reinforcement learning device 120 acquires a captured image.

In step S403, if the state input unit 320 of the reinforcement learning device 120 is configured to acquire information indicating the pre-operation (current) state of the gripping mechanism unit 111, it acquires the information indicating the pre-operation (current) state of the gripping mechanism unit 111.

In step S404, the reinforcement learning model 330 of the reinforcement learning device 120 receives the target object image and the captured image (and the information indicating the pre-operation state of the gripping mechanism unit 111) as input, and outputs information indicating the post-operation state of the gripping mechanism unit 111. Here, the reinforcement learning model 330 is assumed to be configured to output various information comprehensively as the information indicating the post-operation state of the gripping mechanism unit 111. As a result, the operations of the gripping mechanism unit 111 during the reinforcement learning process include both the optimum operation selected from the set of possible operations and operations selected at random from the set of possible operations.
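One common way to mix the optimum operation with randomly selected operations during training is an epsilon-greedy rule, sketched below for illustration; the exploration rate and the toy actions are assumptions and are not specified in the present disclosure.

import random

def select_action(candidate_actions, value_fn, epsilon=0.1):
    # With probability epsilon, explore by picking a random action from the
    # set of possible actions; otherwise pick the action whose estimated
    # value is highest (the optimum action under the current model).
    if random.random() < epsilon:
        return random.choice(candidate_actions)
    return max(candidate_actions, key=value_fn)

# Example with toy actions and a toy value function.
actions = ["move_up", "move_left", "close_gripper"]
print(select_action(actions, value_fn=len, epsilon=0.1))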
In step S405, the reinforcement learning device 120 transmits the information indicating the post-operation state of the gripping mechanism unit 111 output by the reinforcement learning model 330 to the drive control device 115.

In step S406, the update unit 310 of the reinforcement learning device 120 acquires information indicating the change in state caused by the operation of the gripping mechanism unit 111 being controlled.

In step S407, the update unit 310 of the reinforcement learning device 120 acquires a determination result as to whether the gripping operation on the gripping target object succeeded, and the reward calculation unit 311 of the reinforcement learning device 120 calculates a reward based on the acquired determination result.

In step S408, the update unit 310 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (information indicating changes in state, rewards, and the like).
In step S409, the state input unit 320 of the reinforcement learning device 120 determines whether to switch from the current target object image to a different target object image.

If it is determined in step S409 not to switch to a different target object image (NO in step S409), the process returns to step S402.

On the other hand, if it is determined in step S409 to switch to a different target object image (YES in step S409), the process proceeds to step S410.
In step S410, the update unit 310 of the reinforcement learning device 120 determines whether the termination condition of the reinforcement learning process is satisfied. The termination condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, such as a target success probability of the gripping operation for a predetermined object.

If it is determined in step S410 that the termination condition of the reinforcement learning process is not satisfied (NO in step S410), the process returns to step S401.

On the other hand, if it is determined in step S410 that the termination condition of the reinforcement learning process is satisfied (YES in step S410), the reinforcement learning process is terminated. The reinforcement learning model 330 after the reinforcement learning process is completed is applied, as a reinforcement-learned model, to a device that outputs information for controlling the operation of the gripping mechanism unit 111 to the drive control device 115 (referred to as an object manipulation device).

The reinforcement-learned model applied to the object manipulation device executes the processes of steps S401 to S405 of FIG. 4 (that is, it does not acquire information indicating changes in state, calculate rewards, update model parameters, or the like). In step S404, it is configured to output the optimum information as the information indicating the post-operation state of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 performs the optimum operation selected from the set of possible operations instead of comprehensively performing various operations.
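Restated as a Python-style sketch, the flow of FIG. 4 might look as follows; env, model, and update_unit and all of their method names are placeholders introduced here for illustration only.

def reinforcement_learning_loop(env, model, update_unit):
    # Hypothetical loop corresponding to steps S401 to S410 of FIG. 4.
    while True:
        target_img = env.get_target_image()                    # S401
        while True:
            captured_img = env.get_captured_image()            # S402
            state = env.get_gripper_state()                    # S403 (optional input)
            action = model.act(captured_img, target_img, state)     # S404
            env.send_command(action)                            # S405
            change = env.observe_state_change()                 # S406
            success = env.grasp_succeeded()                     # S407
            reward = update_unit.compute_reward(success)        # S407
            update_unit.update_parameters(model, change, reward)    # S408
            if env.switch_target_image():                       # S409
                break
        if env.termination_condition_met():                     # S410
            break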
<Execution example of the reinforcement learning process>
Next, an execution example of the reinforcement learning process by the reinforcement learning system 100 will be described. FIGS. 5 and 6 are first and second diagrams showing an execution example of the reinforcement learning process. When the target object image 510 shown in 5a of FIG. 5 is input by the user, the reinforcement learning device 120 recognizes the object 511 included in the target object image 510 as the gripping target object.
By adopting a configuration in which the type of the gripping target object is specified by inputting the target object image 510 in this way, the reinforcement learning device 120 allows the user to specify any object included in the object group 130 as the gripping target object.

In 5b of FIG. 5, the arrow 500 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 at the time when the object 511 is recognized as the gripping target object. The captured image 521 shows the captured image obtained when the object group 130 is photographed under the position and posture indicated by the arrow 500.

As shown in the captured image 521, when the object group 130 is photographed from above (the Z-axis direction), the gripping target object 511 is occluded by another object 512, and the imaging device 112 cannot photograph the object 511. That is, in this state, the object 511 cannot be gripped.
Therefore, the reinforcement learning device 120 outputs information indicating the post-operation state of the gripping mechanism unit 111 so as to change the position and posture of the gripping mechanism unit 111 such that the gripping target object 511 becomes photographable. The drive control device 115 then controls the operation of the gripping mechanism unit 111 based on the information indicating the post-operation state of the gripping mechanism unit 111.

In 5c of FIG. 5, the arrow 501 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 after the change caused by controlling the operation of the gripping mechanism unit 111. The captured image 522 shows the captured image obtained when the object group 130 is photographed under the position and posture indicated by the arrow 501.

As shown in the captured image 522, by photographing the object group 130 from the lateral direction (the X-axis direction), the gripping target object 511 has become photographable.

Therefore, the reinforcement learning device 120 outputs information indicating the post-operation state of the gripping mechanism unit 111 so as to further change the position and posture of the gripping mechanism unit 111 such that the gripping target object 511 becomes graspable. The drive control device 115 then controls the operation of the gripping mechanism unit 111 based on the information indicating the post-operation state of the gripping mechanism unit 111.
In 6a of FIG. 6, the arrow 601 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 after the change caused by controlling the operation of the manipulator 110. The captured image 611 shows the captured image obtained when the object group 130 is photographed under the position and posture indicated by the arrow 601.

As shown in the captured image 611, by approaching the gripping target object 511, the gripping target object 511 has become graspable.

Therefore, the reinforcement learning device 120 outputs information indicating the post-operation state of the gripping mechanism unit 111 so as to cause the gripping mechanism unit 111 to grip the gripping target object 511. The drive control device 115 then controls the operation of the gripping mechanism unit 111 based on the information indicating the post-operation state of the gripping mechanism unit 111.

In 6b of FIG. 6, the arrow 602 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 after the change caused by controlling the operation of the gripping mechanism unit 111 (a state in which the object 511 has been gripped and lifted to a predetermined height). The captured image 612 shows the captured image obtained when the object 511 is photographed under the position and posture indicated by the arrow 602.
In this way, by configuring the position and posture of the imaging device 112 to change with changes in the position and posture of the gripping mechanism unit 111 and performing reinforcement learning using the captured images taken by the imaging device:
- evaluation from a long-term perspective, such as how the appearance from the imaging device changes, becomes possible; and
- in the course of trying the operation of gripping the gripping target object, the operation of searching for the gripping target object can also be tried.
That is, in the course of reinforcement learning, the operation of the gripping mechanism unit can be controlled so that the gripping target object becomes photographable. As a result, the success probability of the gripping operation can be increased regardless of the placement state of the gripping target object.
<Summary>
As is clear from the above description, the reinforcement learning system 100 according to the first embodiment:
- inputs a captured image taken by an imaging device whose position and posture change with changes in the position and posture of the gripping mechanism unit, and a target object image indicating the gripping target object to be gripped by the gripping mechanism unit, into a reinforcement learning model that outputs information indicating the post-operation state of the gripping mechanism unit; and
- updates the model parameters of the reinforcement learning model based on the operation result for the gripping target object (a determination result as to whether the gripping operation by the end effector succeeded) in a case where the operation of the gripping mechanism unit is controlled based on the information indicating the post-operation state of the gripping mechanism unit.
As a result, according to the reinforcement learning system 100, even when the gripping target object is placed so as to be occluded, the operation of the gripping mechanism unit can be controlled so that the gripping target object becomes photographable in the course of reinforcement learning.
That is, according to the first embodiment, it is possible to provide a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of the gripping operation for a specified type of object regardless of the placement state.
[Second Embodiment]
In the second embodiment, a case where reinforcement learning is performed by Q-learning will be described. The second embodiment will be described below, focusing on the differences from the first embodiment.
<Functional configuration of the reinforcement learning device>
First, a functional configuration example of the reinforcement learning device 120 according to the second embodiment will be described. FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device. As shown in FIG. 7, the reinforcement learning device 120 according to the second embodiment has an update unit 710 and a reinforcement learning model 720. For the reinforcement learning model 720, for example, a neural network may be used.
The update unit 710 has a reward calculation unit 711 and a parameter update unit 712, and updates the model parameters of the reinforcement learning model 720.

Specifically, the update unit 710 acquires a determination result as to whether the gripping operation on the gripping target object succeeded, and information indicating the change in state caused by the operation of the gripping mechanism unit 111 being controlled.

The reward calculation unit 711 calculates a reward based on the determination result as to whether the gripping operation on the gripping target object succeeded. The method for determining whether the gripping operation succeeded and the method for calculating the reward have already been described in the first embodiment, so their description is omitted here.
The parameter update unit 712 updates the model parameters of the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 included in the reinforcement learning model 720. The parameter update unit 712 updates the model parameters based on:
- the information indicating the change in state acquired by the update unit 710,
- the reward (immediate reward) calculated by the reward calculation unit 711, and
- the predicted value of the expected discounted cumulative reward (Q value) calculated by the expected value calculation unit 724, which will be described later.
A sketch of this update step is given after this list.
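As a hedged sketch only (not the patent's actual update rule), a one-step Q-learning update driven by these three inputs could look like the following; the discount factor, the optimizer, and the batch layout are assumptions introduced for illustration.

import torch
import torch.nn.functional as F

def q_learning_update(q_model, optimizer, batch, gamma=0.99):
    """One-step Q-learning update of the model parameters.

    batch is assumed to hold the captured image, the target object image,
    the state s, the motion a, the immediate reward r, and the predicted
    maximum Q value for the next state (max over a' of Q(s', a', g)).
    """
    q_pred = q_model(batch["image"], batch["goal_image"],
                     batch["state"], batch["action"])               # Q(s, a, g)
    with torch.no_grad():
        td_target = batch["reward"] + gamma * batch["next_max_q"]   # r + gamma * max Q(s', ., g)
    loss = F.mse_loss(q_pred, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()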
The model parameters of the reinforcement learning model 720 are updated by the update unit 710. After the model parameters have been updated, the reinforcement learning model 720 takes as input the captured image, the target object image, and information indicating the state (s) of the gripping mechanism unit 111 before its operation, and outputs information indicating the state of the gripping mechanism unit 111 after its operation.
Specifically, as shown in FIG. 7, the reinforcement learning model 720 has an image analysis unit 721, a state and motion input unit 722, an addition unit 723, an expected value calculation unit 724, and an adjustment unit 725.
The image analysis unit 721 processes the captured image transmitted from the drive control device 115 and the target object (g) image input by the user, and outputs the result to the addition unit 723. The image analysis unit 721 is configured using, for example, a neural network. More specifically, the image analysis unit 721 is composed of, for example, a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
The state and motion input unit 722 processes information indicating the state (s) of the gripping mechanism unit 111 before its operation and information indicating the motion (a) of the gripping mechanism unit 111, and outputs the result to the addition unit 723. The state and motion input unit 722 is configured using, for example, a neural network. More specifically, the state and motion input unit 722 is composed of a first linear layer, a second linear layer, a shape conversion layer, and the like. In order to search for the maximum Q value calculated by the expected value calculation unit 724 (described later), information indicating the motion (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725 is input to the state and motion input unit 722 a predetermined number of times (for example, 20 times).
The addition unit 723 adds the output of the image analysis unit 721 and the output of the state and motion input unit 722, and inputs the sum to the expected value calculation unit 724.
The expected value calculation unit 724 takes as input the sum, produced by the addition unit 723, of the output of the image analysis unit 721 and the output of the state and motion input unit 722, and calculates the Q value (Q(s, a, g)). The expected value calculation unit 724 calculates as many Q values as there are pieces of information indicating the motion (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725. The expected value calculation unit 724 is configured using, for example, a neural network. More specifically, the expected value calculation unit 724 is composed of a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
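A minimal sketch of how these three sub-networks could be wired together is shown below, assuming PyTorch, concrete layer sizes, a 7-dimensional state vector, and a 7-dimensional motion vector, none of which are specified by the patent; it is an illustration, not the patented implementation.

import torch
import torch.nn as nn

class QModel(nn.Module):
    """Sketch of the reinforcement learning model 720: image analysis unit
    721 (conv + MaxPooling), state and motion input unit 722 (linear layers),
    addition unit 723, and expected value calculation unit 724 (Q head).
    All layer sizes are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        # Image analysis unit 721: the captured image and the target object
        # image are stacked on the channel axis and passed through
        # conv + MaxPooling blocks.
        self.image_analysis = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # State and motion input unit 722: linear layers over the
        # concatenated state (s) and motion (a) vectors.
        self.state_action = nn.Sequential(
            nn.Linear(7 + 7, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Expected value calculation unit 724: further conv + MaxPooling
        # layers and a final head producing Q(s, a, g).
        self.q_head = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(1),
        )

    def forward(self, image, goal_image, state, action):
        img_feat = self.image_analysis(torch.cat([image, goal_image], dim=1))
        sa_feat = self.state_action(torch.cat([state, action], dim=-1))
        # Addition unit 723: broadcast the state-and-motion feature over the
        # spatial feature map and add the two results.
        sa_map = sa_feat[:, :, None, None].expand_as(img_feat)
        return self.q_head(img_feat + sa_map).squeeze(-1)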
The adjustment unit 725 adjusts the information indicating the motion (a) of the gripping mechanism unit 111 each time a Q value is calculated by the expected value calculation unit 724, and inputs it to the state and motion input unit 722. The adjustment unit 725 adjusts the information indicating the motion (a) of the gripping mechanism unit 111 a predetermined number of times (for example, 20 times), and extracts the maximum Q value from the Q values calculated in the process. The adjustment unit 725 specifies information indicating one motion (a) from the set of possible motions of the gripping mechanism unit 111, for example, based on the ε-greedy method.
According to the ε-greedy method, the specified information sometimes indicates the motion (a) corresponding to the maximum Q value and sometimes indicates a randomly selected motion (a).
Further, the adjustment unit 725 derives information indicating the state of the gripping mechanism unit 111 after its operation, based on the specified information indicating the motion (a) of the gripping mechanism unit 111 and the information indicating the state (s) of the gripping mechanism unit 111 before its operation, and transmits it to the drive control device 115.
As described above, by using the ε-greedy method, the reinforcement learning device 120 according to the second embodiment can comprehensively output various information as the information indicating the state of the gripping mechanism unit 111 after its operation. As a result, the motions of the gripping mechanism unit 111 during the reinforcement learning process include both the optimum motion selected from the set of possible motions (the motion that maximizes the Q value) and motions selected at random from the set of possible motions.
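As a hedged sketch of this ε-greedy selection (the candidate sampler and the value of ε are assumptions; the patent only gives 20 adjustments as an example):

import random
import torch

def select_action(q_model, image, goal_image, state,
                  sample_action, n_candidates=20, epsilon=0.1):
    """ε-greedy selection over candidate motions of the gripping mechanism.

    sample_action() is assumed to draw one motion (a) from the set of
    possible motions; q_model is assumed to return Q(s, a, g).
    """
    candidates = [sample_action() for _ in range(n_candidates)]
    if random.random() < epsilon:
        return random.choice(candidates)             # randomly selected motion
    with torch.no_grad():
        q_values = [q_model(image, goal_image, state, a) for a in candidates]
    best = max(range(n_candidates), key=lambda i: float(q_values[i]))
    return candidates[best]                           # motion with the maximum Q value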
The functional configuration shown in FIG. 7 is merely an example of a configuration of the reinforcement learning model 720 that realizes such functions, and the reinforcement learning model 720 may be configured with another functional configuration. For example, in the above description, the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 are each configured using a neural network, but the entire reinforcement learning model 720 may be configured using a single neural network.
The above description covers the functions during the reinforcement learning process; the functions after the reinforcement learning process is completed are the same as in the first embodiment. That is, after the reinforcement learning process is completed, the update unit 710 no longer acquires information indicating the change in state, calculates the reward, or updates the model parameters. In addition, the adjustment unit 725 outputs the optimum information as the information indicating the state of the gripping mechanism unit 111 after its operation (the post-operation state derived from the information indicating the motion (a) that maximizes the Q value). As a result, the reinforcement-learned model acquires a policy that maximizes the expected discounted cumulative reward (Q value).
<Flow of reinforcement learning process>
Next, the flow of the reinforcement learning process by the reinforcement learning device 120 according to the second embodiment will be described. FIG. 8 is a second flowchart showing the flow of the reinforcement learning process. The flow is described below with reference to FIG. 8. The reinforcement learning process shown in FIG. 8 is merely an example, and a reinforcement-learned model may be generated by executing the reinforcement learning process with another model generation method.
In step S801, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a target object image.
In step S802, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a captured image.
In step S803, the reinforcement learning model 720 of the reinforcement learning device 120 acquires information indicating the current state (s) of the gripping mechanism unit 111 before its operation.
In steps S804 to S807, the reinforcement learning model 720 specifies information indicating one motion (a) from the set of possible motions, for example based on the ε-greedy method, and comprehensively outputs information indicating the state of the gripping mechanism unit 111 after its operation.
Specifically, when specifying the information indicating the motion (a) corresponding to the maximum Q value from the set of possible motions, steps S804 to S806 are executed before proceeding to step S807. When specifying information indicating a randomly selected motion (a) from the set of possible motions, the process proceeds directly to step S807.
In step S804, the reinforcement learning model 720 of the reinforcement learning device 120 calculates the Q value.
In step S805, the reinforcement learning model 720 of the reinforcement learning device 120 determines whether or not the Q value has been calculated a predetermined number of times. If it is determined in step S805 that the Q value has not yet been calculated the predetermined number of times (NO in step S805), the process proceeds to step S806.
In step S806, the reinforcement learning model 720 of the reinforcement learning device 120 adjusts the information indicating the motion (a) of the gripping mechanism unit 111, and the process returns to step S804.
On the other hand, if it is determined in step S805 that the Q value has been calculated the predetermined number of times (YES in step S805), the process proceeds to step S807.
In step S807, if steps S804 to S806 have been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies the information indicating the motion (a) corresponding to the maximum Q value, derives information indicating the state of the gripping mechanism unit 111 after its operation, and transmits it to the drive control device 115. If steps S804 to S806 have not been executed, the reinforcement learning model 720 specifies information indicating a randomly selected motion (a), derives information indicating the state of the gripping mechanism unit 111 after its operation, and transmits it to the drive control device 115.
In step S808, the update unit 710 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
In step S809, the update unit 710 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation on the object to be gripped succeeded, and calculates the immediate reward. The update unit 710 of the reinforcement learning device 120 also acquires the predicted value of the expected discounted cumulative reward (Q value) calculated by the expected value calculation unit 724.
In step S810, the update unit 710 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 720 using the acquired information indicating the change in state, the calculated immediate reward, and the acquired predicted value of the expected discounted cumulative reward (Q value).
In step S811, the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
If it is determined in step S811 not to switch to a different target object image (NO in step S811), the process returns to step S802.
On the other hand, if it is determined in step S811 to switch to a different target object image (YES in step S811), the process proceeds to step S812.
In step S812, the update unit 710 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied. The end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100; one example is a target success probability of the gripping operation on a predetermined object.
If it is determined in step S812 that the end condition of the reinforcement learning process is not satisfied (NO in step S812), the process returns to step S801.
On the other hand, if it is determined in step S812 that the end condition of the reinforcement learning process is satisfied (YES in step S812), the reinforcement learning process is terminated. The reinforcement learning model 720 after the reinforcement learning process is completed is applied to the object manipulation device as a reinforcement-learned model.
The reinforcement-learned model applied to the object manipulation device executes the processes of steps S801 to S807 in FIG. 8 (that is, it does not acquire information indicating the change in state, calculate the reward, or update the model parameters). In step S807, it is configured so that the optimum information is output as the information indicating the state of the gripping mechanism unit 111 after its operation. In other words, unlike during the reinforcement learning process, the gripping mechanism unit 111 performs the optimum motion selected from the set of possible motions (the motion that maximizes the Q value), instead of comprehensively performing various motions.
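Putting steps S801 to S812 together, a hedged sketch of the outer loop might look as follows; the environment interface (get_goal_image, get_image, get_state, execute, grasp_succeeded, should_switch_goal) and the helper callables are assumptions introduced only to make the flow concrete, not part of the disclosure.

def reinforcement_learning_loop(env, q_model, optimizer,
                                select_action, compute_reward,
                                update_parameters, end_condition):
    """Sketch of the flow of FIG. 8 (steps S801 to S812)."""
    while not end_condition():                                       # S812
        goal_image = env.get_goal_image()                            # S801
        switch_goal = False
        while not switch_goal:
            image = env.get_image()                                  # S802
            state = env.get_state()                                  # S803
            action = select_action(q_model, image, goal_image, state)  # S804-S807
            next_state = env.execute(state, action)                  # drive control of unit 111
            reward = compute_reward(env.grasp_succeeded())           # S808-S809
            update_parameters(q_model, optimizer,                    # S810
                              (image, goal_image, state, action,
                               reward, next_state))
            switch_goal = env.should_switch_goal()                   # S811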
<Summary>
As is clear from the above description, the reinforcement learning system 100 according to the second embodiment provides the same effects as the first embodiment.
[Third Embodiment]
In the first and second embodiments described above, a case where a gripping operation is performed on a specified type of object has been described. However, the predetermined operation performed on the specified type of object is not limited to a gripping operation and may be any other operation. That is, the end effector attached to the tip of the main body 113 of the manipulator 110 is not limited to the gripping mechanism unit 111 and may be any other operation mechanism unit. The arbitrary operations referred to here include, for example, a pressing operation that pushes a specified type of object, a suction operation that picks up a specified type of object by vacuum, and an attraction operation that picks up a specified type of object with an electromagnet or the like.
In the first and second embodiments described above, the image pickup device is attached to the tip of the manipulator, but the attachment position of the image pickup device is not limited to the tip of the manipulator. Any other position may be used as long as the position and posture of the image pickup device change in accordance with changes in the position and posture of the gripping mechanism unit.
The gripping mechanism unit and the image pickup device may, for example, be attached to different manipulators, and the reinforcement learning model described above is also applicable in that case. The reinforcement learning model in this case may be configured to output, in addition to the information for controlling the operation of the gripping mechanism unit, information for controlling at least one of the position and posture of the image pickup device.
In the first and second embodiments, the information indicating the state of the gripping mechanism unit before its operation that is input to the reinforcement learning model was described as including information indicating the position and posture of the gripping mechanism unit and information indicating its opening and closing. However, the information indicating the state of the gripping mechanism unit before its operation is not limited to these, and other information may be input.
In the first and second embodiments, the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) are configured as separate bodies, but the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) may be configured as a single body. Alternatively, the drive control device 115 and the reinforcement learning device 120 (or the object manipulation device) may be configured as a single body.
In the first and second embodiments, the reinforcement learning process was described as being performed by actually controlling the operation of the gripping mechanism unit 111 based on the information, output by the reinforcement learning device 120, indicating the state of the gripping mechanism unit 111 after its operation. However, it is not necessary to actually control the operation of the gripping mechanism unit 111; the reinforcement learning process may be performed using a simulator that mimics the real environment. In this case, the image pickup device may also be configured to change its position and posture and to capture images on a simulator that mimics the real environment. Likewise, the predetermined operation on the object to be operated and the generation of the operation result may be performed on a simulator that mimics the real environment.
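As an illustration of this point only (the class and method names below are assumptions, not part of the disclosure), the loop sketched earlier needs nothing more than an environment object with a fixed interface, so a simulator that mimics the real environment can be swapped in for the real manipulator:

class SimulatedGraspEnv:
    """Sketch of a simulator-backed environment exposing the same interface
    as a real-manipulator environment."""

    def __init__(self, simulator):
        self.sim = simulator  # physics simulator mimicking the real environment

    def get_goal_image(self):
        return self.sim.render_goal_object()

    def get_image(self):
        # The simulated camera pose follows the simulated gripping mechanism.
        return self.sim.render_camera()

    def get_state(self):
        # Position/posture and open/close state of the simulated gripper.
        return self.sim.gripper_state()

    def execute(self, state, action):
        # Move the simulated gripping mechanism and return its new state.
        return self.sim.step(state, action)

    def grasp_succeeded(self):
        # Operation result generated in simulation.
        return self.sim.check_grasp()

    def should_switch_goal(self):
        return self.sim.episode_finished()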
In the first and second embodiments, the reinforcement learning device 120 was described as performing the reinforcement learning process for the case where an end effector is attached to the tip of the main body 113 of the manipulator 110. However, the reinforcement learning device 120 may also perform the reinforcement learning process for the case where a manipulator 110 with no end effector attached to its tip operates the object to be operated with the main body 113. In this case, the reinforcement learning device 120 may output information for controlling the operation of the tip of the main body 113 of the manipulator 110.
In the first and second embodiments, the tip of the main body 113 of the manipulator 110 was described as being configured so that its position and posture change, but it may be configured so that at least one of its position and posture changes. That is, the gripping mechanism unit 111 may be configured so that at least one of its position and posture changes. The image pickup device 112 may also be configured so that at least one of its position and posture changes in accordance with a change in at least one of the position and posture of the gripping mechanism unit 111. In this case, the reinforcement learning device 120 may output, as the information for controlling the operation of the gripping mechanism unit 111, information for controlling at least one of the position and posture of the gripping mechanism unit 111 and information for controlling the opening and closing of the gripping mechanism unit 111.
[Other Embodiments]
In this specification (including the claims), when the expression "at least one (of) a, b and c" or "at least one (of) a, b or c" (or a similar expression) is used, it includes any of a, b, c, a-b, a-c, b-c, or a-b-c. It may also include multiple instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes the addition of elements other than the listed elements (a, b and c), such as a-b-c-d having d.
In this specification (including the claims), when expressions such as "with data as input", "based on data", "according to data", or "in response to data" (or similar expressions) are used, unless otherwise noted, they include the case where the various data themselves are used as input and the case where data obtained by performing some processing on the various data (for example, noise-added data, normalized data, or intermediate representations of the various data) are used as input. When it is stated that some result is obtained "based on", "according to", or "in response to" data, this includes the case where the result is obtained based only on that data, and may also include the case where the result is obtained under the influence of other data, factors, conditions, and/or states. When it is stated that "data is output", unless otherwise noted, this includes the case where the various data themselves are used as output and the case where data obtained by performing some processing on the various data (for example, noise-added data, normalized data, or intermediate representations of the various data) are used as output.
In this specification (including the claims), when the terms "connected" and "coupled" are used, they are intended as non-limiting terms that include any of direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and the like. The terms should be interpreted appropriately according to the context in which they are used, but any connection/coupling form that is not intentionally or naturally excluded should be interpreted as being included in the terms in a non-limiting manner.
In this specification (including the claims), when the expression "A configured to B" is used, it may include that the physical structure of element A has a configuration capable of executing operation B, and that a permanent or temporary setting/configuration of element A is configured/set to actually execute operation B. For example, when element A is a general-purpose processor, it suffices that the processor has a hardware configuration capable of executing operation B and is configured to actually execute operation B by a permanent or temporary setting of programs (instructions). When element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices that the circuit structure of the processor is implemented so as to actually execute operation B, regardless of whether control instructions and data are actually attached.
In this specification (including the claims), when terms meaning inclusion or possession (for example, "comprising/including" and "having") are used, they are intended as open-ended terms, including the case of containing or possessing objects other than the object indicated by the object of the term. When the object of these terms meaning inclusion or possession is an expression that does not specify a quantity or that suggests the singular (an expression with "a" or "an" as an article), the expression should be interpreted as not being limited to a specific number.
In this specification (including the claims), even if an expression such as "one or more" or "at least one" is used in one place and an expression that does not specify a quantity or that suggests the singular (an expression with "a" or "an" as an article) is used in another place, the latter expression is not intended to mean "one". In general, an expression that does not specify a quantity or that suggests the singular (an expression with "a" or "an" as an article) should be interpreted as not necessarily being limited to a specific number.
In this specification, when it is stated that a specific effect (advantage/result) is obtained for a specific configuration of a certain embodiment, it should be understood, unless there is a particular reason otherwise, that the effect is also obtained for one or more other embodiments having that configuration. However, it should be understood that the presence or absence of the effect generally depends on various factors, conditions, and/or states, and that the effect is not always obtained by the configuration. The effect is merely obtained by the configuration described in the embodiments when various factors, conditions, and/or states are satisfied, and in an invention according to a claim defining that configuration or a similar configuration, the effect is not necessarily obtained.
In this specification (including the claims), when a plurality of pieces of hardware perform a predetermined process, the pieces of hardware may cooperate to perform the predetermined process, or some of the hardware may perform all of the predetermined process. Some of the hardware may perform a part of the predetermined process and other hardware may perform the rest. When an expression such as "one or more pieces of hardware perform a first process and the one or more pieces of hardware perform a second process" is used in this specification (including the claims), the hardware that performs the first process and the hardware that performs the second process may be the same or different. That is, it suffices that the hardware that performs the first process and the hardware that performs the second process are included in the one or more pieces of hardware. The hardware may include an electronic circuit, a device including an electronic circuit, or the like.
In this specification (including the claims), when a plurality of storage devices (memories) store data, an individual storage device (memory) among the plurality of storage devices (memories) may store only a part of the data or may store the whole of the data.
Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the individual embodiments described above. Various additions, changes, replacements, partial deletions, and the like are possible without departing from the conceptual idea and spirit of the invention derived from the contents defined in the claims and their equivalents. For example, in all of the embodiments described above, the numerical values used in the description are shown as examples and are not limited thereto. The order of the operations in the embodiments is likewise shown as an example and is not limited thereto.
This application claims priority based on Japanese Patent Application No. 2020-119349 filed on July 10, 2020, the entire contents of which are incorporated herein by reference.
100: Reinforcement learning system
110: Manipulator
111: Gripping mechanism unit
112: Image pickup device
113: Main body unit
115: Drive control device
120: Reinforcement learning device
310: Update unit
311: Reward calculation unit
320: State input unit
330: Reinforcement learning model
510: Target object image
511: Object
521, 522: Captured image
611, 612: Captured image
710: Update unit
711: Reward calculation unit
712: Parameter update unit
720: Reinforcement learning model
721: Image analysis unit
722: State and motion input unit
723: Addition unit
724: Expected value calculation unit
725: Adjustment unit

Claims (13)

1.  A reinforcement learning device comprising:
     at least one memory; and
     at least one processor,
     wherein the at least one processor is configured to be able to execute:
      inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling an operation of the end effector; and
      updating parameters of the training model based on an operation result for the object obtained when the operation of the end effector is controlled based on the information output by the training model.
2.  The reinforcement learning device according to claim 1, wherein said at least one of the position and the posture of the image pickup device changes in accordance with at least one of a position and a posture of the end effector.
3.  The reinforcement learning device according to claim 2, wherein the image pickup device is attached to the end effector.
4.  The reinforcement learning device according to claim 1, wherein said at least one of the position and the posture of the image pickup device is controlled based on an output from the training model.
5.  The reinforcement learning device according to claim 1, wherein the end effector is a gripping mechanism unit that grips the object, and
     the at least one processor updates the parameters of the training model based on a determination result of whether or not a gripping operation on the object by the gripping mechanism unit has succeeded.
6.  The reinforcement learning device according to claim 5, wherein the at least one processor inputs, into the training model, information about at least one of a position and a posture of the gripping mechanism unit before its operation and information about opening and closing of the gripping mechanism unit before its operation.
7.  The reinforcement learning device according to claim 6, wherein the training model outputs information about at least one of a position and a posture of the gripping mechanism unit after its operation and information about opening and closing of the gripping mechanism unit after its operation.
8.  The reinforcement learning device according to claim 1, wherein a predetermined operation on the object by the end effector and a change in said at least one of the position and the posture of the image pickup device are executed on a simulator.
9.  A reinforcement learning system comprising:
     the reinforcement learning device according to any one of claims 1 to 8; and
     a manipulator to which the end effector and the image pickup device are attached.
10.  An object manipulation device comprising:
      at least one memory that stores a training model whose parameters have been updated by reinforcement learning; and
      at least one processor,
      wherein the at least one processor is configured to be able to execute:
       inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into the training model; and
       controlling an operation of the end effector based on information output by the training model.
11.  The object manipulation device according to claim 10, further comprising:
      the end effector; and
      the image pickup device.
12.  A model generation method executed by at least one processor, the method comprising:
      inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling an operation of the end effector; and
      updating parameters of the training model based on an operation result for the object obtained when the operation of the end effector is controlled based on the information output by the training model.
13.  A reinforcement learning program for causing at least one computer to execute:
      inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling an operation of the end effector; and
      updating parameters of the training model based on an operation result for the object obtained when the operation of the end effector is controlled based on the information output by the training model.
PCT/JP2021/025392 2020-07-10 2021-07-06 Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program WO2022009859A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020119349A JP2023145809A (en) 2020-07-10 2020-07-10 Reinforcement learning device, reinforcement learning system, object operation device, model generation method and reinforcement learning program
JP2020-119349 2020-07-10

Publications (1)

Publication Number Publication Date
WO2022009859A1 true WO2022009859A1 (en) 2022-01-13

Family

ID=79553121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/025392 WO2022009859A1 (en) 2020-07-10 2021-07-06 Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program

Country Status (2)

Country Link
JP (1) JP2023145809A (en)
WO (1) WO2022009859A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017030135A (en) * 2015-07-31 2017-02-09 ファナック株式会社 Machine learning apparatus, robot system, and machine learning method for learning workpiece take-out motion
JP2019171540A (en) * 2018-03-29 2019-10-10 ファナック株式会社 Machine learning device, robot control device using machine learning device, robot vision system, and machine learning method

Also Published As

Publication number Publication date
JP2023145809A (en) 2023-10-12

Similar Documents

Publication Publication Date Title
JP4032793B2 (en) Charging system, charging control method, robot apparatus, charging control program, and recording medium
JP6963041B2 (en) Local feature model updates based on modifications to robot actions
US9387589B2 (en) Visual debugging of robotic tasks
Ott et al. A humanoid two-arm system for dexterous manipulation
JP3855812B2 (en) Distance measuring method, apparatus thereof, program thereof, recording medium thereof, and robot apparatus mounted with distance measuring apparatus
JP6931457B2 (en) Motion generation method, motion generator, system and computer program
CN110815258B (en) Robot teleoperation system and method based on electromagnetic force feedback and augmented reality
JP2009157948A (en) Robot apparatus, face recognition method, and face recognition apparatus
JP2021000678A (en) Control system and control method
CN114080583A (en) Visual teaching and repetitive motion manipulation system
US20220331962A1 (en) Determining environment-conditioned action sequences for robotic tasks
JP7458741B2 (en) Robot control device and its control method and program
JP2022543926A (en) System and Design of Derivative-Free Model Learning for Robotic Systems
JP6811465B2 (en) Learning device, learning method, learning program, automatic control device, automatic control method and automatic control program
CN114641375A (en) Dynamic programming controller
WO2019230399A1 (en) Robot control device, system, information processing method, and program
JP2022061022A (en) Technique of assembling force and torque guidance robot
WO2022134702A1 (en) Action learning method and apparatus, storage medium, and electronic device
JP2003266349A (en) Position recognition method, device thereof, program thereof, recording medium thereof, and robot device provided with position recognition device
WO2022009859A1 (en) Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program
JP2004298975A (en) Robot device and obstacle searching method
JP5659787B2 (en) Operation environment model construction system and operation environment model construction method
JP2021091022A (en) Robot control device, learned model, robot control method, and program
JP2003271958A (en) Method and processor for processing image, program therefor, recording medium therefor, and robot system of type mounted with image processor
JP2020017206A (en) Information processing apparatus, action determination method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21838554

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21838554

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP