WO2022009859A1 - Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program - Google Patents

Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program Download PDF

Info

Publication number
WO2022009859A1
Authority
WO
WIPO (PCT)
Prior art keywords
reinforcement learning
end effector
information
gripping mechanism
training model
Prior art date
Application number
PCT/JP2021/025392
Other languages
French (fr)
Japanese (ja)
Inventor
康博 藤田
Original Assignee
Preferred Networks, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Preferred Networks, Inc.
Publication of WO2022009859A1 publication Critical patent/WO2022009859A1/en

Links

Images

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J13/00Controls for manipulators
    • B25J13/08Controls for manipulators by means of sensing devices, e.g. viewing or touching devices

Definitions

  • This disclosure relates to a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program.
  • A reinforcement learning system is known that performs reinforcement learning of the operation of an end effector, taking as input an image captured by a fixed camera, so that a predetermined operation (for example, a gripping operation by the end effector) succeeds on a specified type of object among a plurality of types of objects placed in a predetermined area.
  • According to such a reinforcement learning system, if the specified type of object is placed at a position where it can be photographed, the success probability of the predetermined operation can be increased by repeating reinforcement learning.
  • On the other hand, if the specified type of object is not placed at a position where it can be photographed, reinforcement learning cannot proceed and the success probability of the predetermined operation cannot be increased.
  • the present disclosure provides a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of a predetermined operation on an object.
  • The reinforcement learning device according to one aspect of the present disclosure has, for example, the following configuration. That is, it has at least one memory and at least one processor, and the at least one processor is configured to be able to: input information about a captured image taken by an image pickup device whose position and/or posture change, together with information about a target object image indicating the object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and update the parameters of the training model based on the operation result for the object when the operation of the end effector is controlled based on the information output by the training model.
  • FIG. 1 is a diagram showing an example of a system configuration of a reinforcement learning system.
  • FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
  • FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
  • FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
  • FIG. 5 is a first diagram showing an execution example of the reinforcement learning process.
  • FIG. 6 is a second diagram showing an execution example of the reinforcement learning process.
  • FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
  • FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
  • FIG. 1 is a diagram showing an example of a system configuration of a reinforcement learning system.
  • the reinforcement learning system 100 includes a manipulator 110 and a reinforcement learning device 120.
  • The manipulator 110 is a device that performs a predetermined operation on a specified type of object (the object to be operated, as indicated by a target object image) from an object group 130 in which a plurality of types of objects are placed in a mixed state.
  • The main body 113 of the manipulator 110 has a plurality of arms connected via a plurality of joints, and is configured so that the position and posture of the tip portion of the main body 113 of the manipulator 110 are controlled by controlling each joint angle.
  • a gripping mechanism portion 111 (an example of an end effector) that performs a predetermined operation (a gripping operation in this embodiment) for an object of a specified type is attached to the tip portion of the main body portion 113 of the manipulator 110.
  • the gripping operation for the specified type of object is performed by controlling the opening and closing of the gripping mechanism unit 111.
  • an image pickup device 112 is attached to the tip portion of the main body 113 of the manipulator 110. That is, the image pickup device 112 is configured so that the position and the posture change with the change of the position and the posture of the gripping mechanism portion 111.
  • The image pickup device 112 outputs a captured image consisting of R, G, and B channel images at a predetermined frame period.
  • Alternatively, the image pickup device 112 may output, at a predetermined frame period, a captured image that includes distance information to each position on the object surface in addition to the R, G, and B channel images.
  • Alternatively, the image pickup device 112 may output a distance image containing distance information to each position on the object surface at a predetermined frame period.
  • The captured image captured by the image pickup device 112 may also be a moving image.
  • In the following description, the image pickup device 112 is assumed to output a captured image consisting of R, G, and B channel images at a predetermined frame period.
  • A drive control device 115 that "controls the operation of the gripping mechanism unit 111" (that is, controls the position and posture of the gripping mechanism unit 111 and the opening/closing of the gripping mechanism unit 111) is built into the support base 114 that supports the main body 113 of the manipulator 110.
  • the drive control device 115 acquires a captured image captured by the image pickup device 112 and transmits it to the reinforcement learning device 120. Further, the drive control device 115 acquires sensor signals detected by various sensors (not shown) arranged in the grip mechanism portion 111 of the manipulator 110 and the main body portion 113, and transmits them to the reinforcement learning device 120.
  • the drive control device 115 acquires information for controlling the operation of the gripping mechanism unit 111 from the reinforcement learning device 120 in response to the transmission of the captured image and the sensor signal.
  • The information for controlling the operation of the gripping mechanism unit 111 referred to here may include any command regarding the operation of the gripping mechanism unit 111, such as information (a target value) indicating the state of the gripping mechanism unit 111 after the operation, or specific operation amounts and control amounts for controlling the position and posture of the gripping mechanism unit 111 and its opening/closing.
  • the information for controlling the operation of the gripping mechanism unit 111 may include information for controlling the operation of the manipulator 110.
  • In the following description, the drive control device 115 is assumed to acquire information indicating the state after the operation of the gripping mechanism unit 111 as an example of the information for controlling the operation of the gripping mechanism unit 111.
  • When the drive control device 115 acquires the information indicating the state after the operation of the gripping mechanism unit 111, it controls the position and posture of the gripping mechanism unit 111 and the opening/closing of the gripping mechanism unit 111 based on various sensor signals (information indicating the state before the operation of the gripping mechanism unit 111).
  • The reinforcement learning device 120 has a reinforcement learning model (an example of a training model) that outputs information indicating the state after the operation of the gripping mechanism unit 111, taking as inputs the captured image transmitted from the drive control device 115 and the target object image indicating the object to be gripped by the gripping mechanism unit 111.
  • For the reinforcement learning model, for example, a neural network may be used.
  • Instead of inputting the captured image itself, a feature amount extracted from the captured image may be input.
  • The feature amount extracted from the captured image is, for example, a feature amount output from an intermediate layer when the captured image is input into a neural network, as sketched below.
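  • As a purely illustrative aid (not part of the disclosure), the following sketch shows how such an intermediate-layer feature amount could be extracted from a captured image using a generic, off-the-shelf convolutional network; the choice of network and layer is an assumption.

```python
import torch
import torchvision.models as models

# Hedged illustration (not the disclosed method): extract an intermediate-layer
# feature amount from a captured image using a generic CNN backbone.
backbone = models.resnet18(weights=None)   # any off-the-shelf CNN would do
features = {}
hook = backbone.layer3.register_forward_hook(
    lambda module, inp, out: features.update(feature=out))
captured_image = torch.randn(1, 3, 224, 224)   # dummy RGB captured image
backbone(captured_image)
intermediate_feature = features["feature"]     # feature amount from the intermediate layer
hook.remove()
```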
  • The information regarding the target object image input to the reinforcement learning model may be a captured image consisting of R, G, and B channel images, or a captured image that includes distance information to each position on the object surface in addition to the R, G, and B channel images.
  • The target object image may also be a distance image containing distance information to each position on the object surface.
  • The target object image may also be a moving image.
  • Instead of the target object image itself, a feature amount extracted from the target object image (for example, a feature amount output from an intermediate layer when the target object image is input into a neural network) may be input.
  • In the following description, a captured image consisting of R, G, and B channel images is assumed to be input.
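  • The following is a minimal, hypothetical sketch of a training model of the kind described above: a neural network that takes the captured image and the target object image as inputs and outputs information indicating the state of the gripping mechanism unit 111 after the operation. The layer sizes and the output parameterization (position, posture as a quaternion, gripper open/close) are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class GraspTrainingModel(nn.Module):
    """Sketch of a training model: captured image + target object image ->
    information indicating the post-operation state of the gripping mechanism
    unit 111. Layer sizes and the output parameterization are assumptions."""

    def __init__(self):
        super().__init__()
        # Shared convolutional encoder over the 6-channel stack of the
        # captured image (RGB) and the target object image (RGB).
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head producing the post-operation state: 3D position, 4D posture
        # (quaternion), and a gripper open/close value.
        self.head = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 8),
        )

    def forward(self, captured_image, target_object_image):
        x = torch.cat([captured_image, target_object_image], dim=1)
        out = self.head(self.encoder(x))
        position, posture, grip = out[:, :3], out[:, 3:7], out[:, 7:]
        return position, posture, torch.sigmoid(grip)
```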
  • The reinforcement learning device 120 acquires the operation result for the object to be gripped (for example, the determination result of whether or not the gripping operation was successful).
  • the reinforcement learning device 120 updates the model parameters of the reinforcement learning model based on the acquired operation result.
  • Reinforcement learning is performed using information on the captured image taken by the image pickup device 112, whose position and posture change with the change in the position and posture of the gripping mechanism unit 111.
  • Therefore, even when the object to be gripped cannot be photographed in the initial state, the gripping mechanism unit 111 can be operated so that the object to be gripped becomes photographable in the course of reinforcement learning. That is, according to the present embodiment, it is possible to provide a reinforcement learning system 100 capable of increasing the success probability of the gripping operation regardless of the placement state of the object to be gripped.
  • In the following description, the vertical direction of the drawing in FIG. 1 is defined as the Z-axis direction, the horizontal direction as the Y-axis direction, and the depth direction as the X-axis direction.
  • Next, the hardware configuration of the manipulator 110 (here, the mechanical system is omitted and only the hardware configuration of the control system is shown) and the hardware configuration of the reinforcement learning device 120 constituting the reinforcement learning system 100 will be described with reference to FIG. 2.
  • FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
  • the manipulator 110 has a sensor group 211 and an actuator group 212 in addition to the image pickup device 112 and the drive control device 115.
  • the sensor group 211 includes n sensors.
  • The n sensors include at least a sensor for calculating the position and posture of the gripping mechanism unit 111 (a sensor for measuring each joint angle of the main body 113) and a sensor for detecting the opening and closing of the gripping mechanism unit 111.
  • the actuator group 212 includes m actuators.
  • The m actuators include at least an actuator for controlling the position and posture of the gripping mechanism unit 111 (an actuator for controlling each joint angle of the main body 113) and an actuator for controlling the opening and closing of the gripping mechanism unit 111.
  • the drive control device 115 includes a sensor signal processing device 201, an actuator drive device 202, and a controller 203.
  • the sensor signal processing device 201 receives the sensor signal transmitted from the sensor group 211 and notifies the controller 203 of the sensor signal data.
  • the actuator drive device 202 acquires the control signal data from the controller 203 and transmits the control signal to the actuator group 212.
  • the controller 203 acquires the captured image transmitted from the image pickup device 112 and transmits it to the reinforcement learning device 120. Further, the controller 203 transmits the sensor signal data notified from the sensor signal processing device 201 to the reinforcement learning device 120.
  • the controller 203 acquires information indicating the state after the operation of the gripping mechanism unit 111 from the reinforcement learning device 120 in response to the transmission of the captured image and the sensor signal data. Further, when the controller 203 acquires the information indicating the state after the operation of the gripping mechanism unit 111, the controller 203 generates the control signal data for operating the actuator group 212 based on the sensor signal data and notifies the actuator drive device 202.
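  • For orientation only, and under the assumption that the received post-operation state is first resolved into target joint angles (for example by inverse kinematics, which is not detailed in the disclosure), the controller-side generation of control signal data could look like the following simple proportional feedback sketch; it is not the disclosed method.

```python
def generate_control_signal_data(target_joint_angles, current_joint_angles, kp=1.0):
    """Hedged illustration: proportional control toward assumed target joint angles.
    The mapping from the post-operation state of the gripping mechanism unit to
    joint angles is assumed to be handled elsewhere."""
    return [kp * (target - current)
            for target, current in zip(target_joint_angles, current_joint_angles)]
```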
  • The reinforcement learning device 120 has a processor 221, a main storage device (memory) 222, an auxiliary storage device 223, a network interface 224, and a device interface 225 as components.
  • the reinforcement learning device 120 is realized as a computer in which these components are connected via a bus 226.
  • In FIG. 2, the reinforcement learning device 120 is shown as including one of each component, but it may include a plurality of the same component. Further, although one reinforcement learning device 120 is shown in the example of FIG. 2, the reinforcement learning program may be installed in a plurality of reinforcement learning devices, and each of the plurality of reinforcement learning devices may execute the same part or a different part of the processing of the reinforcement learning program. In this case, the reinforcement learning devices may take the form of distributed computing in which the processing as a whole is executed while the devices communicate with each other via the network interface 224 or the like. That is, the reinforcement learning device 120 may be configured as a system that realizes its functions by having one or a plurality of computers execute instructions stored in one or a plurality of storage devices. Further, various data transmitted from the drive control device 115 may be processed by one or a plurality of reinforcement learning devices provided on the cloud, and the processing results may be transmitted back to the drive control device 115.
  • Various operations of the reinforcement learning device 120 may be executed in parallel using one or a plurality of processors, or using a plurality of reinforcement learning devices that communicate via the communication network 240. Various operations may also be distributed to a plurality of arithmetic cores in the processor 221 and executed in parallel. Further, some or all of the processes, means, and the like of the present disclosure may be executed by an external device 230 (at least one of a processor and a storage device) provided on the cloud and capable of communicating with the reinforcement learning device 120 via the communication network 240. In this way, the reinforcement learning device 120 may take the form of parallel computing by one or a plurality of computers.
  • The processor 221 may be an electronic circuit (processing circuitry such as a CPU, GPU, FPGA, or ASIC). The processor 221 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 221 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. Further, the processor 221 may include an arithmetic function based on quantum computing.
  • the processor 221 performs various calculations based on various data and instructions input from each device and the like of the internal configuration of the reinforcement learning device 120, and outputs the calculation result and the control signal to each device and the like.
  • the processor 221 controls each component included in the reinforcement learning device 120 by executing an OS (Operating System), an application, or the like.
  • the processor 221 may refer to one or more electronic circuits arranged on one chip, or may refer to one or more electronic circuits arranged on two or more chips or devices. When a plurality of electronic circuits are used, each electronic circuit may communicate by wire or wirelessly.
  • the main storage device 222 is a storage device that stores instructions executed by the processor 221 and various data, and various data stored in the main storage device 222 is read out by the processor 221.
  • the auxiliary storage device 223 is a storage device other than the main storage device 222. It should be noted that these storage devices mean arbitrary electronic components capable of storing various data, and may be semiconductor memories.
  • the semiconductor memory may be either a volatile memory or a non-volatile memory.
  • the storage device for storing various data in the reinforcement learning device 120 may be realized by the main storage device 222 or the auxiliary storage device 223, or may be realized by the built-in memory built in the processor 221.
  • A plurality of processors 221 may be connected (coupled) to one main storage device 222, or a single processor 221 may be connected.
  • A plurality of main storage devices 222 may be connected (coupled) to one processor 221.
  • The configuration may also be one in which a processor is connected (coupled) to at least one main storage device 222. Further, this configuration may be realized by the main storage devices 222 and processors 221 included in a plurality of reinforcement learning devices 120.
  • The main storage device 222 may also be integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache).
  • the network interface 224 is an interface for connecting to the communication network 240 wirelessly or by wire.
  • the network interface 224 may exchange various data with the drive control device 115 and other external devices 230 connected via the communication network 240.
  • The communication network 240 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or a combination thereof, as long as information is exchanged over it between the computer and the drive control device 115 or another external device 230.
  • An example of a WAN is the Internet, an example of a LAN is IEEE 802.11 or Ethernet, and an example of a PAN is Bluetooth (registered trademark) or NFC (Near Field Communication).
  • the device interface 225 is an interface such as USB that directly connects to the external device 250.
  • the external device 250 is a device connected to a computer.
  • the external device 250 may be an input device as an example.
  • the input device is, for example, a device such as a camera, a microphone, a motion capture, various sensors, a keyboard, a mouse, or a touch panel, and gives acquired information to a computer. Further, it may be a device having an input unit such as a personal computer, a tablet terminal, or a smartphone, a memory, and a processor.
  • the external device 250 may be an output device as an example.
  • The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or a speaker or the like that outputs audio. It may also be a device having an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.
  • the external device 250 may be a storage device (memory).
  • For example, the external device 250 may be a network storage or the like, or a storage such as an HDD.
  • the external device 250 may be a device having some functions of the components of the reinforcement learning device 120. That is, the computer may transmit or receive a part or all of the processing result of the external device 250.
  • FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
  • the reinforcement learning device 120 has an update unit 310, a state input unit 320, and a reinforcement learning model 330.
  • The update unit 310 has a reward calculation unit 311 and updates the model parameters of the reinforcement learning model 330. Specifically, the update unit 310 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111. The reward calculation unit 311 then calculates the reward based on the determination result acquired by the update unit 310. The update unit 310 then updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (the information indicating the change in state, the reward, and so on).
  • the determination of whether or not the gripping operation for the object to be gripped is successful may be automatically performed based on, for example, a captured image.
  • the user of the reinforcement learning system 100 may determine whether or not the gripping operation for the object to be gripped is successful.
  • the update unit 310 may calculate the reward based on information other than the determination result of whether or not the gripping operation is successful.
  • For example, the update unit 310 may calculate the reward based on various information such as the operation time or the number of operations required for the gripping operation to succeed, or the magnitude (energy efficiency) of the motion of the entire manipulator 110 during the gripping operation, as illustrated in the sketch below.
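  • As a hedged illustration of the reward calculation described above, the sketch below combines the grip-success determination with optional penalty terms for operation time, number of operations, and energy consumption; the specific terms and weights are assumptions, not values given in the disclosure.

```python
def compute_reward(grip_succeeded: bool,
                   elapsed_time_s: float = 0.0,
                   num_operations: int = 0,
                   energy_used_j: float = 0.0) -> float:
    """Illustrative reward: a success bonus minus assumed penalties for the
    operation time, the number of operations, and the energy used."""
    reward = 1.0 if grip_succeeded else 0.0
    reward -= 0.01 * elapsed_time_s    # penalize long gripping operations
    reward -= 0.01 * num_operations    # penalize many operations
    reward -= 0.001 * energy_used_j    # reward energy efficiency
    return reward
```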
  • the state input unit 320 acquires the captured image transmitted from the drive control device 115 and the target object image input by the user, and notifies the reinforcement learning model 330.
  • In the reinforcement learning model 330, the model parameters are updated by the update unit 310. The reinforcement learning model 330 with updated model parameters receives the captured image and the target object image notified from the state input unit 320 as inputs, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • The reinforcement learning model 330 outputs, as the information indicating the state after the operation of the gripping mechanism unit 111, for example, information indicating the position and posture of the gripping mechanism unit 111 after the operation and information indicating the opening and closing of the gripping mechanism unit 111 after the operation.
  • 3b in FIG. 3 shows another functional configuration example.
  • In this case, the state input unit 320 of the reinforcement learning device 120 is configured to acquire, in addition to the captured image and the target object image, information indicating the state of the gripping mechanism unit 111 before the operation (that is, the current state), and to notify the reinforcement learning model 330 of it.
  • The information indicating the state before the operation (the current state) of the gripping mechanism unit 111 referred to here includes, for example, information indicating the position and posture of the gripping mechanism unit 111 before the operation (currently) and information indicating the opening and closing of the gripping mechanism unit 111 before the operation (currently).
  • In this case, the reinforcement learning model 330 receives as inputs the captured image, the target object image, and the information indicating the state before the operation (the current state) of the gripping mechanism unit 111 notified from the state input unit 320, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
  • the reinforcement learning process shown in FIG. 4 is only an example, and a model for which reinforcement learning has been completed may be generated by executing the reinforcement learning process by another model generation method.
  • In step S401, the state input unit 320 of the reinforcement learning device 120 acquires the target object image.
  • In step S402, the state input unit 320 of the reinforcement learning device 120 acquires a captured image.
  • In step S403, if the state input unit 320 of the reinforcement learning device 120 is configured to acquire information indicating the state of the gripping mechanism unit 111 before the operation (the current state), it acquires that information.
  • In step S404, the reinforcement learning model 330 of the reinforcement learning device 120 receives as inputs the target object image and the captured image (and, where so configured, the information indicating the state of the gripping mechanism unit 111 before the operation), and outputs information indicating the state after the operation of the gripping mechanism unit 111. Here, the reinforcement learning model 330 is assumed to be configured to comprehensively output various information as the information indicating the state after the operation of the gripping mechanism unit 111.
  • As a result, the motion of the gripping mechanism unit 111 during the reinforcement learning process includes both the optimum motion selected from the set of possible motions and motions randomly selected from the set of possible motions.
  • In step S405, the reinforcement learning device 120 transmits the information indicating the post-operation state of the gripping mechanism unit 111 output by the reinforcement learning model 330 to the drive control device 115.
  • In step S406, the update unit 310 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • In step S407, the update unit 310 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and the reward calculation unit 311 of the reinforcement learning device 120 calculates the reward based on the acquired determination result.
  • In step S408, the update unit 310 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (the information indicating the change in state, the reward, and so on).
  • In step S409, the state input unit 320 of the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
  • If it is determined in step S409 not to switch to a different target object image (NO in step S409), the process returns to step S402.
  • On the other hand, if it is determined in step S409 to switch to a different target object image (YES in step S409), the process proceeds to step S410.
  • In step S410, the update unit 310 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied.
  • the end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, and one example thereof is a target success probability of a gripping operation for a predetermined object.
  • If it is determined in step S410 that the end condition of the reinforcement learning process is not satisfied (NO in step S410), the process returns to step S401. On the other hand, if it is determined that the end condition is satisfied (YES in step S410), the reinforcement learning process ends.
  • The reinforcement learning model 330 after the reinforcement learning process is completed is applied, as a reinforcement-learned model, to a device (object manipulation device) that outputs information for controlling the operation of the gripping mechanism unit 111 to the drive control device 115.
  • The reinforcement-learned model applied to the object manipulation device executes the processing of steps S401 to S405 in FIG. 4 (that is, it does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like). Further, in step S404, it outputs the optimum information as the information indicating the state after the operation of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 does not comprehensively perform various operations but performs the optimum operation selected from the set of possible operations.
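  • For orientation only, the reinforcement learning process of FIG. 4 (steps S401 to S410) can be paraphrased as the loop sketched below. The object names (env, model, update_unit) and their methods are hypothetical placeholders introduced for illustration; they are not part of the disclosure.

```python
def reinforcement_learning_process(env, model, update_unit, end_condition):
    """Illustrative paraphrase of the flow in FIG. 4 (steps S401 to S410)."""
    while True:
        target_object_image = env.get_target_object_image()          # S401
        while True:
            captured_image = env.get_captured_image()                # S402
            state_before = env.get_gripper_state()                   # S403 (optional)
            post_operation_state = model.infer(                      # S404
                target_object_image, captured_image, state_before)
            env.send_to_drive_control_device(post_operation_state)   # S405
            state_change = env.get_state_change()                    # S406
            grip_succeeded = env.judge_grip_success()                 # S407
            reward = update_unit.compute_reward(grip_succeeded)
            update_unit.update_model_parameters(model, state_change, reward)  # S408
            if env.switch_target_object():                           # S409
                break
        if end_condition():                                          # S410
            return model
```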
  • FIGS. 5 and 6 are first and second diagrams showing an execution example of the reinforcement learning process.
  • the reinforcement learning device 120 recognizes the object 511 included in the target object image 510 as the object to be gripped.
  • The user can specify any object included in the object group 130 as the object to be gripped.
  • the arrow 500 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 at the time when the object 511 is recognized as the object to be gripped. Further, the captured image 521 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 500.
  • In this state, the object 511 to be gripped is shielded by another object 512, and the image pickup device 112 cannot photograph the object 511. That is, in this state, the object 511 cannot be gripped.
  • the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 in order to change the position and posture of the gripping mechanism unit 111 so that the object 511 to be gripped can be photographed.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state after the operation of the gripping mechanism unit 111.
  • the arrow 501 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 after the change, which is changed by controlling the operation of the gripping mechanism portion 111. Further, the captured image 522 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 501.
  • As a result, the object 511 to be gripped can be photographed by photographing the object group 130 from the lateral direction (the X-axis direction).
  • Next, the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 in order to further change the position and posture of the gripping mechanism unit 111 so that the object 511 to be gripped can be gripped.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state after the operation of the gripping mechanism unit 111.
  • the arrow 601 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 after the change, which is changed by controlling the operation of the manipulator 110. Further, the captured image 611 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 601.
  • As a result, the gripping mechanism unit 111 is brought closer to the object 511 to be gripped, so that the object 511 can be gripped.
  • the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 so that the gripping mechanism unit 111 grips the object 511 to be gripped.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state of the gripping mechanism unit 111 after the operation.
  • The arrow 602 indicates the position and posture (shooting position and shooting direction) of the image pickup device 112 after the change caused by controlling the operation of the gripping mechanism unit 111 (the state in which the object 511 has been gripped and lifted to a predetermined height). Further, the captured image 612 shows a captured image of the object 511 taken under the position and posture indicated by the arrow 602.
  • As is clear from the above description, the reinforcement learning system 100 according to the first embodiment inputs the captured image taken by the image pickup device, whose position and posture change as the position and posture of the gripping mechanism unit change, and the target object image indicating the object to be gripped by the gripping mechanism unit, into the reinforcement learning model that outputs information indicating the state after the operation of the gripping mechanism unit.
  • It then updates the model parameters of the reinforcement learning model based on the operation result for the object to be gripped (the determination result of whether or not the gripping operation by the end effector was successful) when the operation of the gripping mechanism unit is controlled based on the information indicating the state after the operation of the gripping mechanism unit.
  • This makes it possible to control the operation so that the gripping mechanism unit can photograph the object to be gripped in the course of reinforcement learning.
  • As a result, according to the first embodiment, it is possible to provide a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of the gripping operation on the specified type of object regardless of its placement state.
  • FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
  • the reinforcement learning device 120 according to the second embodiment has an update unit 710 and a reinforcement learning model 720.
  • For the reinforcement learning model 720, for example, a neural network may be used.
  • the update unit 710 has a reward calculation unit 711 and a parameter update unit 712, and updates the model parameters of the reinforcement learning model 720.
  • Specifically, the update unit 710 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • the reward calculation unit 711 calculates the reward based on the determination result of whether or not the gripping operation for the object to be gripped is successful. Since the method for determining whether or not the gripping operation for the object to be gripped has been successful and the method for calculating the reward have already been described in the first embodiment, the description thereof will be omitted here.
  • the parameter update unit 712 updates each model parameter of the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 included in the reinforcement learning model 720.
  • The parameter update unit 712 updates the model parameters based on the information indicating the change in state acquired by the update unit 710, the reward (immediate reward) calculated by the reward calculation unit 711, and the predicted value of the expected value (Q value) of the discounted cumulative reward calculated by the expected value calculation unit 724, which will be described later.
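  • Although the disclosure does not fix a particular update rule, updating the model parameters from the state change, the immediate reward, and the predicted Q value of the discounted cumulative reward is consistent with a standard temporal-difference update. The PyTorch sketch below is one such assumed implementation; the transition structure, the discount factor gamma, and the helper max_q_over_candidates (a search over candidate next motions, cf. the adjustment unit 725) are all assumptions.

```python
import torch
import torch.nn.functional as F

def update_model_parameters(q_model, optimizer, transition,
                            max_q_over_candidates, gamma=0.99):
    """Illustrative TD update for Q(s, a, g), the expected discounted
    cumulative reward (details are assumptions, not the disclosed method)."""
    (captured_image, target_image, state, action,
     reward, next_captured_image, next_state, done) = transition
    q_pred = q_model(captured_image, target_image, state, action)
    with torch.no_grad():
        # Hypothetical helper: maximum Q value over candidate next motions.
        q_next = max_q_over_candidates(q_model, next_captured_image,
                                       target_image, next_state)
        target = reward + gamma * q_next * (1.0 - done)
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```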
  • In the reinforcement learning model 720, the model parameters are updated by the update unit 710. The reinforcement learning model 720 with updated model parameters receives as inputs the captured image, the target object image, and information indicating the state (s) of the gripping mechanism unit 111 before the operation, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • the reinforcement learning model 720 has an image analysis unit 721, a state and motion input unit 722, an addition unit 723, an expected value calculation unit 724, and an adjustment unit 725.
  • The image analysis unit 721 executes processing on the captured image transmitted from the drive control device 115 and the target object image (g) input by the user, and outputs the execution result to the addition unit 723.
  • the image analysis unit 721 is configured by using, for example, a neural network. More specifically, the image analysis unit 721 is composed of, for example, a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
  • The state and motion input unit 722 executes processing on the information indicating the state (s) of the gripping mechanism unit 111 before the operation and the information indicating the motion (a) of the gripping mechanism unit 111, and outputs the execution result to the addition unit 723.
  • The state and motion input unit 722 is configured by using, for example, a neural network. More specifically, the state and motion input unit 722 is composed of a first linear layer, a second linear layer, a shape conversion layer, and the like. Further, in order to search for the maximum Q value calculated by the expected value calculation unit 724 (described later), information indicating the motion (a) of the gripping mechanism unit 111, as adjusted by the adjustment unit 725, is input to the state and motion input unit 722 a predetermined number of times (for example, 20 times).
  • The addition unit 723 adds the execution result output from the image analysis unit 721 and the execution result output from the state and motion input unit 722, and inputs the sum to the expected value calculation unit 724.
  • The expected value calculation unit 724 executes processing on the sum, produced by the addition unit 723, of the execution result of the image analysis unit 721 and the execution result of the state and motion input unit 722, and calculates the Q value (Q(s, a, g)).
  • The expected value calculation unit 724 calculates as many Q values as there are pieces of information indicating the motion (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725.
  • the expected value calculation unit 724 is configured by using, for example, a neural network. More specifically, the expected value calculation unit 724 is composed of a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
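  • Purely as a sketch of the structure just described (image analysis unit 721, state and motion input unit 722, addition unit 723, and expected value calculation unit 724), the following assumes concrete layer and feature-map sizes; the actual dimensions are not specified in the disclosure.

```python
import torch
import torch.nn as nn

class QModel(nn.Module):
    """Sketch of the reinforcement learning model 720: computes Q(s, a, g)
    from the captured/target images (g), the pre-operation state (s), and a
    candidate motion (a). All sizes are illustrative assumptions."""

    def __init__(self, state_dim=8, action_dim=8):
        super().__init__()
        # Image analysis unit 721: convolution and MaxPooling layers.
        self.image_analysis = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # State and motion input unit 722: linear layers whose output is
        # shape-converted so that it can be added to the image feature map.
        self.state_and_motion = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 64),
        )
        # Expected value calculation unit 724: convolution layers followed by
        # a scalar Q-value output.
        self.expected_value = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, captured_image, target_object_image, state, action):
        g = torch.cat([captured_image, target_object_image], dim=1)
        feature_map = self.image_analysis(g)
        sa = self.state_and_motion(torch.cat([state, action], dim=-1))
        # Addition unit 723: broadcast the state/motion embedding over the map.
        combined = feature_map + sa[:, :, None, None]
        return self.expected_value(combined)
```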
  • the adjustment unit 725 adjusts the information indicating the operation (a) of the gripping mechanism unit 111 every time the Q value is calculated by the expected value calculation unit 724, and inputs the information to the state and operation input unit 722.
  • the adjusting unit 725 adjusts the information indicating the operation (a) of the gripping mechanism unit 111 a predetermined number of times (for example, 20 times), and extracts the maximum Q value from the Q values calculated during that period.
  • The adjustment unit 725 specifies information indicating one motion (a) from the set of possible motions of the gripping mechanism unit 111, for example based on the ε-greedy method.
  • Specifically, it may specify the information indicating the motion (a) corresponding to the maximum Q value, or it may specify information indicating a randomly selected motion (a).
  • The adjustment unit 725 then derives information indicating the state after the operation of the gripping mechanism unit 111 based on the information indicating the specified motion (a) of the gripping mechanism unit 111 and the information indicating the state (s) of the gripping mechanism unit 111 before the operation, and transmits it to the drive control device 115.
  • As a result, the motion of the gripping mechanism unit 111 during the reinforcement learning process includes both the optimum motion selected from the set of possible motions (the motion that maximizes the Q value) and motions randomly selected from the set of possible motions.
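  • As a hedged illustration of the adjustment unit 725 and the ε-greedy selection described above, the sketch below evaluates the Q value for a fixed number of candidate motions (for example, 20) and either takes the motion with the maximum Q value or a randomly selected one; the function sample_candidate_motion is a hypothetical helper that draws one motion from the set of possible motions of the gripping mechanism unit.

```python
import random
import torch

def select_motion(q_model, captured_image, target_object_image, state,
                  sample_candidate_motion, num_candidates=20, epsilon=0.1):
    """Illustrative ε-greedy selection over candidate motions (assumptions)."""
    if random.random() < epsilon:
        return sample_candidate_motion()       # exploration: random motion
    best_motion, best_q = None, float("-inf")
    for _ in range(num_candidates):            # e.g. 20 adjustments of (a)
        motion = sample_candidate_motion()
        with torch.no_grad():
            q = q_model(captured_image, target_object_image,
                        state, motion).item()
        if q > best_q:
            best_motion, best_q = motion, q
    return best_motion                         # exploitation: maximum-Q motion
```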
  • the functional configuration shown in FIG. 7 is merely an example, and the reinforcement learning model 720 may be configured by another functional configuration.
  • In the present embodiment, the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 are each configured using a neural network, but the reinforcement learning model 720 as a whole may instead be configured using a single neural network.
  • The above description covers the functions during the reinforcement learning process; the functions after the reinforcement learning process is completed are the same as in the first embodiment. That is, after the reinforcement learning process is completed, the update unit 710 does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like. Further, the adjustment unit 725 outputs, as the information indicating the state after the operation of the gripping mechanism unit 111, the information derived based on the optimum information (the information indicating the motion (a) that maximizes the Q value). As a result, according to the reinforcement-learned model, it is possible to acquire a behavior rule that maximizes the expected value (Q value) of the discounted cumulative reward.
  • FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
  • the reinforcement learning process shown in FIG. 8 is merely an example, and a model for which reinforcement learning has been completed may be generated by executing the reinforcement learning process by another model generation method.
  • In step S801, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a target object image.
  • In step S802, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a captured image.
  • In step S803, the reinforcement learning model 720 of the reinforcement learning device 120 acquires information indicating the state (s) of the gripping mechanism unit 111 before the operation (the current state).
  • In steps S804 to S807, information indicating one motion (a) is specified from the set of possible motions, for example based on the ε-greedy method, and information indicating the state after the operation of the gripping mechanism unit 111 is comprehensively output.
  • When specifying the information indicating the motion (a) corresponding to the maximum Q value, steps S804 to S806 are executed and the process then proceeds to step S807. When specifying information indicating a motion (a) randomly selected from the set of possible motions, the process proceeds directly to step S807.
  • In step S804, the reinforcement learning model 720 of the reinforcement learning device 120 calculates the Q value.
  • In step S805, the reinforcement learning model 720 of the reinforcement learning device 120 determines whether or not the Q value has been calculated a predetermined number of times. If it is determined in step S805 that the Q value has not yet been calculated the predetermined number of times (NO in step S805), the process proceeds to step S806.
  • In step S806, the reinforcement learning model 720 of the reinforcement learning device 120 adjusts the information indicating the motion (a) of the gripping mechanism unit 111, and the process returns to step S804.
  • On the other hand, if it is determined in step S805 that the Q value has been calculated the predetermined number of times (YES in step S805), the process proceeds to step S807.
  • In step S807, when steps S804 to S806 have been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies the information indicating the motion (a) corresponding to the maximum Q value, derives information indicating the state after the operation of the gripping mechanism unit 111, and transmits it to the drive control device 115. When steps S804 to S806 have not been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies information indicating a randomly selected motion (a), derives information indicating the state after the operation of the gripping mechanism unit 111, and transmits it to the drive control device 115.
  • In step S808, the update unit 710 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • In step S809, the update unit 710 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation on the object to be gripped was successful, and calculates the immediate reward. Further, the update unit 710 of the reinforcement learning device 120 acquires the predicted value of the expected value (Q value) of the discounted cumulative reward calculated by the expected value calculation unit 724.
  • In step S810, the update unit 710 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 720 using the acquired information indicating the change in state, the calculated immediate reward, and the acquired predicted value (Q value) of the discounted cumulative reward.
  • In step S811, the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
  • If it is determined in step S811 not to switch to a different target object image (NO in step S811), the process returns to step S802.
  • On the other hand, if it is determined in step S811 to switch to a different target object image (YES in step S811), the process proceeds to step S812.
  • In step S812, the update unit 710 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied.
  • the end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, and one example thereof is a target success probability of a gripping operation for a predetermined object.
  • If it is determined in step S812 that the end condition of the reinforcement learning process is not satisfied (NO in step S812), the process returns to step S801.
  • On the other hand, if it is determined in step S812 that the end condition of the reinforcement learning process is satisfied (YES in step S812), the reinforcement learning process ends.
  • the reinforcement learning model 720 after the reinforcement learning process is completed is applied to the object operation device as the reinforcement learning completed model.
  • The reinforcement-learned model applied to the object manipulation device executes the processing of steps S801 to S807 in FIG. 8 (that is, it does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like). Further, in step S807, it outputs the optimum information as the information indicating the state after the operation of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 does not comprehensively perform various operations but performs the optimum operation selected from the set of possible operations (the operation that maximizes the Q value).
  • the predetermined operation performed on the specified type of object is not limited to the gripping operation, and may be any other operation. That is, the end effector attached to the tip end portion of the main body portion 113 of the manipulator 110 is not limited to the gripping mechanism portion 111, and may be any other operation mechanism portion.
  • The arbitrary operations referred to here include, for example, a pressing operation of pushing the specified type of object, a suction operation of sucking the specified type of object, and an attraction operation of attracting the specified type of object with an electromagnet.
  • In the first and second embodiments, the image pickup device is attached to the tip portion of the manipulator, but the attachment position of the image pickup device is not limited to the tip portion of the manipulator. It may be any position as long as the position and posture of the image pickup device change according to the change in the position and posture of the gripping mechanism unit.
  • the gripping mechanism unit and the image pickup device may be attached to different manipulators, for example, and the above-mentioned reinforcement learning model can be applied even in that case.
  • the reinforcement learning model in this case may be configured to output information for controlling at least one of the position and the posture of the image pickup device in addition to the information for controlling the operation of the gripping mechanism unit.
  • In the first and second embodiments, the information indicating the state before the operation of the gripping mechanism unit that is input to the reinforcement learning model has been described as including information indicating the position and posture of the gripping mechanism unit and information indicating its opening and closing. However, the information indicating the state before the operation of the gripping mechanism unit is not limited to these, and other information may be input.
  • In the first and second embodiments, the manipulator 110 and the reinforcement learning device 120 are configured as separate bodies, but the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) may be configured integrally. Alternatively, the drive control device 115 and the reinforcement learning device 120 (or the object manipulation device) may be configured integrally.
  • In the first and second embodiments, the reinforcement learning process has been described as being performed while the operation of the gripping mechanism unit 111 is actually controlled based on the information, output from the reinforcement learning device 120, indicating the state after the operation of the gripping mechanism unit 111. However, it is not necessary to actually control the operation of the gripping mechanism unit 111, and the reinforcement learning process may be performed using a simulator simulating the actual environment. In this case, the image pickup device may also change its position and posture and perform shooting on a simulator simulating the actual environment. Further, the predetermined operation on the object to be operated and the generation of the operation result may also be performed on a simulator simulating the actual environment.
  • the reinforcement learning device 120 may perform reinforcement learning processing in the case where the manipulator 110 to which the end effector is not attached to the tip portion operates the object to be operated by the main body 113. In this case, the reinforcement learning device 120 may output information for controlling the operation of the tip portion of the main body 113 of the manipulator 110.
  • In the first and second embodiments, both the position and posture of the tip portion of the main body 113 of the manipulator 110 are changed, but the configuration may be such that at least one of the position and the posture changes. That is, the gripping mechanism unit 111 may be configured so that at least one of its position and posture changes. Further, the image pickup device 112 may be configured so that at least one of its position and posture changes with the change of at least one of the position and posture of the gripping mechanism unit 111. In this case, the reinforcement learning device 120 may output, as the information for controlling the operation of the gripping mechanism unit 111, information for controlling at least one of the position and posture of the gripping mechanism unit 111 and information for controlling the opening and closing of the gripping mechanism unit 111.
  • In the present specification (including the claims), when the expression "at least one of a, b, and c" or "at least one of a, b, or c" (including similar expressions) is used, it includes any of a, b, c, a-b, a-c, b-c, or a-b-c. It may also include multiple instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes adding an element other than the listed elements (a, b, and c), such as having d as in a-b-c-d.
  • When the terms "connected" and "coupled" are used in the present specification (including the claims), they are intended as non-limiting terms that include direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and so on. The terms should be interpreted as appropriate according to the context in which they are used, but connection/coupling forms that are not intentionally or naturally excluded should not be interpreted as excluded from these terms.
  • In the present specification (including the claims), the expression that an element A is "configured to" perform an operation B may include that the physical structure of the element A has a configuration capable of executing the operation B, and that a permanent or temporary setting (setting/configuration) of the element A is configured/set to actually execute the operation B.
  • For example, when the element A is a general-purpose processor, it suffices that the processor has a hardware configuration capable of executing the operation B and is configured to actually execute the operation B by a permanent or temporary program (instructions).
  • When the element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices that the circuit structure of the processor is implemented so as to actually execute the operation B, regardless of whether or not control instructions and data are actually attached.
  • In the present specification (including the claims), when a plurality of pieces of hardware perform predetermined processes, the pieces of hardware may cooperate to perform the predetermined processes, or some of the hardware may perform all of the predetermined processes. Further, some of the hardware may perform part of the predetermined processes and other hardware may perform the rest.
  • In the present specification (including the claims), when an expression such as "one or more pieces of hardware perform a first process and the one or more pieces of hardware perform a second process" is used, the hardware that performs the first process and the hardware that performs the second process may be the same or different. That is, the hardware that performs the first process and the hardware that performs the second process need only be included in the one or more pieces of hardware.
  • the hardware may include an electronic circuit, a device including the electronic circuit, or the like.
  • When a plurality of storage devices (memories) store data, each storage device among the plurality of storage devices may store only part of the data or may store the whole of the data.
100: Reinforcement learning system, 110: Manipulator, 111: Gripping mechanism unit, 112: Imaging device, 113: Main body unit, 115: Drive control device, 120: Reinforcement learning device, 310: Update unit, 311: Reward calculation unit, 320: State input unit, 330: Reinforcement learning model, 510: Target object image, 511: Object, 521, 522: Captured image, 611, 612: Captured image, 710: Update unit, 711: Reward calculation unit, 712: Parameter update unit, 720: Reinforcement learning model, 721: Image analysis unit, 722: State and action input unit, 723: Addition unit, 724: Expected value calculation unit, 725: Adjustment unit

Abstract

Provided are a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the probability of success of a prescribed manipulation on an object. The reinforcement learning device has at least one memory and at least one processor, and the at least one processor is configured to be capable of: inputting information relating to a captured image captured by an imaging device at least one of whose position and orientation changes, and information relating to a target object image indicating an object to be manipulated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and updating a parameter of the training model on the basis of the result of manipulation of the object in a case where the operation of the end effector is controlled on the basis of the information output by the training model.

Description

Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program

The present disclosure relates to a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program.

A reinforcement learning system is known that performs reinforcement learning of the operation of an end effector, using as input an image captured by a fixed camera, so that a predetermined operation (for example, a gripping operation by the end effector) succeeds for a specified type of object among a plurality of types of objects placed in a predetermined area.

According to such a reinforcement learning system, if an object of the specified type is placed in a position where it can be photographed, the success probability of the predetermined operation can be increased by repeating reinforcement learning. On the other hand, if the object of the specified type is not placed in a position where it can be photographed, reinforcement learning cannot proceed and the success probability of the predetermined operation cannot be increased.

The present disclosure provides a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of a predetermined operation on an object.
The reinforcement learning device according to one aspect of the present disclosure has, for example, the following configuration. That is, the reinforcement learning device has at least one memory and at least one processor, and the at least one processor is configured to be capable of executing: inputting information on a captured image taken by an imaging device at least one of whose position and posture changes, and information on a target object image indicating an operation target object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and updating a parameter of the training model based on an operation result for the object in a case where the operation of the end effector is controlled based on the information output by the training model.
FIG. 1 is a diagram showing an example of the system configuration of the reinforcement learning system.
FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
FIG. 5 is a first diagram showing an execution example of the reinforcement learning process.
FIG. 6 is a second diagram showing an execution example of the reinforcement learning process.
FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
Hereinafter, each embodiment will be described with reference to the attached drawings. In the present specification and drawings, components having substantially the same functional configuration are denoted by the same reference numerals, and duplicate description thereof is omitted.
[First Embodiment]
<System configuration of the reinforcement learning system>
First, the system configuration of the reinforcement learning system will be described. FIG. 1 is a diagram showing an example of the system configuration of the reinforcement learning system. As shown in FIG. 1, the reinforcement learning system 100 includes a manipulator 110 and a reinforcement learning device 120.
The manipulator 110 is, for example, a device that performs a predetermined operation on an object of a specified type (the operation target object indicated by a target object image) from among an object group 130 in which a plurality of types of objects are placed in a mixed state.

The main body 113 of the manipulator 110 has a plurality of arms connected via a plurality of joints, and is configured such that the position and posture of the tip portion of the main body 113 of the manipulator 110 are controlled by controlling each joint angle.
A gripping mechanism unit 111 (an example of an end effector) that performs a predetermined operation (a gripping operation in the present embodiment) on the object of the specified type is attached to the tip portion of the main body 113 of the manipulator 110. The gripping operation on the object of the specified type is performed by controlling the opening and closing of the gripping mechanism unit 111.

Further, an imaging device 112 is attached to the tip portion of the main body 113 of the manipulator 110. That is, the imaging device 112 is configured such that its position and posture change with changes in the position and posture of the gripping mechanism unit 111. The imaging device 112 outputs captured images including R-value, G-value, and B-value images at a predetermined frame period. Alternatively, the imaging device 112 may output, at a predetermined frame period, captured images including distance information to each position on the object surface in addition to the R-value, G-value, and B-value images, or may output distance images including distance information to each position on the object surface. The captured image taken by the imaging device 112 may also be a moving image. In the following, for simplicity of description, the imaging device 112 is described, as an example, as outputting captured images including R-value, G-value, and B-value images at a predetermined frame period.
Further, a drive control device 115 that "controls the operation of the gripping mechanism unit 111" (controls the position and posture of the gripping mechanism unit 111 and the opening and closing of the gripping mechanism unit 111) is built into the support base 114 that supports the main body 113 of the manipulator 110.

The drive control device 115 acquires the captured images taken by the imaging device 112 and transmits them to the reinforcement learning device 120. The drive control device 115 also acquires sensor signals detected by various sensors (not shown) arranged in the gripping mechanism unit 111 and the main body 113 of the manipulator 110, and transmits them to the reinforcement learning device 120.
Further, in response to transmitting the captured image and the sensor signals, the drive control device 115 acquires, from the reinforcement learning device 120, information for controlling the operation of the gripping mechanism unit 111. The information for controlling the operation of the gripping mechanism unit 111 referred to here may include any command relating to the operation of the gripping mechanism unit 111, for example:
- information indicating the post-operation state of the gripping mechanism unit 111 (a target value), and
- specific operation amounts and control amounts for controlling the position and posture of the gripping mechanism unit 111 and the opening and closing of the gripping mechanism unit 111.
The information for controlling the operation of the gripping mechanism unit 111 may also include information for controlling the operation of the manipulator 110. In the following, the drive control device 115 is described as acquiring, as an example of the information for controlling the operation of the gripping mechanism unit 111, information indicating the post-operation state of the gripping mechanism unit 111 (a simple data-structure sketch of such information is given below).
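As a purely illustrative sketch (not part of the present disclosure), information indicating the post-operation target state of the gripping mechanism unit 111 could be represented in Python roughly as follows; all field names and example values are hypothetical assumptions.

from dataclasses import dataclass
from typing import Tuple

@dataclass
class GripperTargetState:
    # Hypothetical post-operation target state of the gripping mechanism unit:
    # position in meters, posture as roll/pitch/yaw in radians, and whether
    # the gripper should be open after the operation.
    position_xyz: Tuple[float, float, float]
    posture_rpy: Tuple[float, float, float]
    gripper_open: bool

# Example: move the gripper above the object group with the fingers open.
command = GripperTargetState(
    position_xyz=(0.40, 0.10, 0.30),
    posture_rpy=(0.0, 1.57, 0.0),
    gripper_open=True,
)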
Further, upon acquiring the information indicating the post-operation state of the gripping mechanism unit 111, the drive control device 115 controls, based on the various sensor signals (information indicating the pre-operation state of the gripping mechanism unit 111):
- the various actuators (not shown) in the gripping mechanism unit 111 of the manipulator 110, and
- the various actuators (not shown) in the main body 113 of the manipulator 110.
As a result, the position and posture of the gripping mechanism unit 111 and the opening and closing of the gripping mechanism unit 111 are controlled.
The reinforcement learning device 120 has a reinforcement learning model (an example of a training model) that receives as input the captured image transmitted from the drive control device 115 and a target object image indicating the gripping target object to be gripped by the gripping mechanism unit 111, and outputs information indicating the post-operation state of the gripping mechanism unit 111. For the reinforcement learning model, for example, a neural network may be used.

When inputting the information on the captured image transmitted from the drive control device 115 into the reinforcement learning model, a feature amount extracted from the captured image may be input instead of the captured image itself. The feature amount extracted from the captured image is, for example, a feature amount output from an intermediate layer when the captured image is input into a neural network.

The information on the target object image input into the reinforcement learning model may be a captured image including R-value, G-value, and B-value images, a captured image including R-value, G-value, and B-value images together with distance information to each position on the object surface, or a distance image including distance information to each position on the object surface. The target object image may also be a moving image. Further, instead of the target object image itself, a feature amount extracted from the target object image (for example, a feature amount output from an intermediate layer when the target object image is input into a neural network) may be input into the reinforcement learning model. In the following, for simplicity of description, a captured image including R-value, G-value, and B-value images is assumed to be input as an example of the target object image.
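For illustration only, such a training model could be organized as in the following PyTorch-style sketch, which receives the captured image and the target object image and outputs a post-operation pose and an open/close value for the gripping mechanism unit; the layer sizes, names, and output encoding are assumptions introduced here and are not taken from the present disclosure.

import torch
import torch.nn as nn

class PolicyModelSketch(nn.Module):
    # Hypothetical model: (captured image, target object image) ->
    # (post-operation pose, gripper open/close probability).
    def __init__(self):
        super().__init__()
        # Shared image encoder applied to both input images.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head mapping the concatenated features to 6 pose values
        # (x, y, z, roll, pitch, yaw) and 1 open/close logit.
        self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 7))

    def forward(self, captured_img, target_img):
        features = torch.cat(
            [self.encoder(captured_img), self.encoder(target_img)], dim=1)
        out = self.head(features)
        return out[:, :6], torch.sigmoid(out[:, 6:])

# Example call with dummy 128x128 RGB images.
model = PolicyModelSketch()
pose, grip_open = model(torch.zeros(1, 3, 128, 128), torch.zeros(1, 3, 128, 128))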
Further, when the operation of the gripping mechanism unit 111 is controlled based on the information indicating the post-operation state of the gripping mechanism unit 111 output by the reinforcement learning model, the reinforcement learning device 120 acquires the operation result for the gripping target object (for example, a determination result as to whether the gripping operation succeeded). The reinforcement learning device 120 then updates the model parameters of the reinforcement learning model based on the acquired operation result.
As described above, in order to increase the success probability of the gripping operation when gripping an object of a designated type from among the object group 130 in which a plurality of types of objects are placed in a mixed state, the reinforcement learning system 100 performs reinforcement learning using information on captured images taken by the imaging device 112, whose position and posture change with changes in the position and posture of the gripping mechanism unit 111.

As a result, even when, for example, the gripping target object is not placed in a position where it can be photographed, the gripping mechanism unit 111 can be operated so that the gripping target object becomes photographable in the course of reinforcement learning. That is, according to the present embodiment, it is possible to provide a reinforcement learning system 100 capable of increasing the success probability of the gripping operation regardless of the placement state of the gripping target object.
In the present embodiment, as indicated by reference numeral 140, the vertical direction of the drawing sheet of FIG. 1 is defined as the Z-axis direction, the horizontal direction of the drawing sheet of FIG. 1 as the Y-axis direction, and the depth direction of the drawing sheet of FIG. 1 as the X-axis direction.
<Hardware configuration of each device constituting the reinforcement learning system>
Next, the hardware configuration of the manipulator 110 (here, the mechanical system is omitted and only the hardware configuration of the control system is shown) and the hardware configuration of the reinforcement learning device 120, which constitute the reinforcement learning system 100, will be described with reference to FIG. 2. FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
(1) Hardware configuration of the manipulator
As shown in FIG. 2, the manipulator 110 has a sensor group 211 and an actuator group 212 in addition to the imaging device 112 and the drive control device 115.
The sensor group 211 includes n sensors. In the present embodiment, the n sensors include at least:
- sensors for calculating the position and posture of the gripping mechanism unit 111 (sensors that measure each joint angle of the main body 113), and
- a sensor that detects the opening and closing of the gripping mechanism unit 111.
The actuator group 212 includes m actuators. In the present embodiment, the m actuators include at least:
- actuators for controlling the position and posture of the gripping mechanism unit 111 (actuators for controlling each joint angle of the main body 113), and
- an actuator for controlling the opening and closing of the gripping mechanism unit 111.
The drive control device 115 has a sensor signal processing device 201, an actuator drive device 202, and a controller 203. The sensor signal processing device 201 receives the sensor signals transmitted from the sensor group 211 and notifies the controller 203 of the sensor signal data. The actuator drive device 202 acquires control signal data from the controller 203 and transmits control signals to the actuator group 212.

The controller 203 acquires the captured images transmitted from the imaging device 112 and transmits them to the reinforcement learning device 120. The controller 203 also transmits the sensor signal data notified from the sensor signal processing device 201 to the reinforcement learning device 120.

Further, in response to transmitting the captured image and the sensor signal data, the controller 203 acquires, from the reinforcement learning device 120, information indicating the post-operation state of the gripping mechanism unit 111. Upon acquiring the information indicating the post-operation state of the gripping mechanism unit 111, the controller 203 generates, based on the sensor signal data, control signal data for operating the actuator group 212 and notifies the actuator drive device 202 of it.
(2) Hardware configuration of the reinforcement learning device
Next, the hardware configuration of the reinforcement learning device 120 will be described. As shown in FIG. 2, the reinforcement learning device 120 has, as its components, a processor 221, a main storage device (memory) 222, an auxiliary storage device 223, a network interface 224, and a device interface 225. The reinforcement learning device 120 is realized as a computer in which these components are connected via a bus 226.
In the example of FIG. 2, the reinforcement learning device 120 is shown as including one of each component, but it may include a plurality of the same component. Further, although one reinforcement learning device 120 is shown in the example of FIG. 2, the reinforcement learning program may be installed on a plurality of reinforcement learning devices, and each of the plurality of reinforcement learning devices may be configured to execute the same or a different part of the processing of the reinforcement learning program. In this case, the reinforcement learning devices may take the form of distributed computing in which the whole processing is executed by the devices communicating with one another via the network interface 224 or the like. That is, the reinforcement learning device 120 may be configured as a system that realizes its functions by one or more computers executing instructions stored in one or more storage devices. Alternatively, the various data transmitted from the drive control device 115 may be processed by one or more reinforcement learning devices provided on a cloud, and the processing results may be transmitted to the drive control device 115.

The various operations of the reinforcement learning device 120 may be executed in parallel using one or more processors, or using a plurality of reinforcement learning devices that communicate via the communication network 240. The various operations may also be distributed to a plurality of arithmetic cores in the processor 221 and executed in parallel. Further, some or all of the processes, means, and the like of the present disclosure may be executed by an external device 230 (at least one of a processor and a storage device) provided on a cloud capable of communicating with the reinforcement learning device 120 via the communication network 240. In this way, the reinforcement learning device 120 may take the form of parallel computing by one or more computers.
The processor 221 may be an electronic circuit (a processing circuit, processing circuitry, a CPU, a GPU, an FPGA, an ASIC, or the like). The processor 221 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 221 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 221 may also include an arithmetic function based on quantum computing.
The processor 221 performs various operations based on various data and instructions input from the devices and the like of the internal configuration of the reinforcement learning device 120, and outputs operation results and control signals to the devices and the like. The processor 221 controls the components included in the reinforcement learning device 120 by executing an OS (Operating System), applications, and the like.

The processor 221 may refer to one or more electronic circuits arranged on one chip, or to one or more electronic circuits arranged on two or more chips or devices. When a plurality of electronic circuits are used, the electronic circuits may communicate by wire or wirelessly.
The main storage device 222 is a storage device that stores instructions executed by the processor 221, various data, and the like, and the various data stored in the main storage device 222 are read out by the processor 221. The auxiliary storage device 223 is a storage device other than the main storage device 222. These storage devices mean arbitrary electronic components capable of storing various data, and may be semiconductor memories. A semiconductor memory may be either a volatile memory or a non-volatile memory. The storage device for storing various data in the reinforcement learning device 120 may be realized by the main storage device 222 or the auxiliary storage device 223, or may be realized by a built-in memory incorporated in the processor 221.

A plurality of processors 221 may be connected (coupled) to one main storage device 222, or a single processor 221 may be connected. Alternatively, a plurality of main storage devices 222 may be connected (coupled) to one processor 221. When the reinforcement learning device 120 is composed of at least one main storage device 222 and a plurality of processors 221 connected (coupled) to the at least one main storage device 222, a configuration in which at least one of the plurality of processors 221 is connected (coupled) to the at least one main storage device 222 may be included. This configuration may also be realized by the main storage devices 222 and the processors 221 included in a plurality of reinforcement learning devices 120. Further, a configuration in which the main storage device 222 is integrated with the processor (for example, a cache memory including an L1 cache and an L2 cache) may be included.
The network interface 224 is an interface for connecting to the communication network 240 wirelessly or by wire. For the network interface 224, an appropriate interface such as one conforming to an existing communication standard is used. Via the network interface 224, various data may be exchanged with the drive control device 115 and other external devices 230 connected through the communication network 240. The communication network 240 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or a combination thereof, as long as information is exchanged between the computer and the drive control device 115 or other external devices 230. Examples of a WAN include the Internet, examples of a LAN include IEEE 802.11 and Ethernet, and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).

The device interface 225 is an interface such as USB that directly connects to an external device 250.
The external device 250 is a device connected to the computer. As one example, the external device 250 may be an input device. The input device is, for example, a device such as a camera, a microphone, a motion-capture device, various sensors, a keyboard, a mouse, or a touch panel, and provides acquired information to the computer. It may also be a device including an input unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.

As another example, the external device 250 may be an output device. The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or a speaker or the like that outputs sound. It may also be a device including an output unit, a memory, and a processor, such as a personal computer, a tablet terminal, or a smartphone.

The external device 250 may also be a storage device (memory). For example, the external device 250 may be a network storage or the like, or a storage such as an HDD.

The external device 250 may also be a device having some of the functions of the components of the reinforcement learning device 120. That is, the computer may transmit or receive part or all of the processing results of the external device 250.
<Functional configuration of the reinforcement learning device>
Next, as the functional configuration of the reinforcement learning device 120, two functional configuration examples will be described here. FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device. As shown in 3a of FIG. 3, the reinforcement learning device 120 has an update unit 310, a state input unit 320, and a reinforcement learning model 330.
The update unit 310 has a reward calculation unit 311 and updates the model parameters of the reinforcement learning model 330. Specifically, the update unit 310 acquires a determination result as to whether the gripping operation on the gripping target object succeeded, and information indicating the change in state caused by the operation of the gripping mechanism unit 111 being controlled. The reward calculation unit 311 calculates a reward based on the determination result acquired by the update unit 310. The update unit 310 then updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (information indicating changes in state, rewards, and the like).

The determination as to whether the gripping operation on the gripping target object succeeded may be made automatically, for example, based on a captured image. Alternatively, a user of the reinforcement learning system 100 may make this determination.

The reward calculation method described above is only an example, and the update unit 310 may calculate the reward based on information other than the determination result as to whether the gripping operation succeeded. For example, the update unit 310 may calculate the reward based on various information such as the operation time or the number of operations required for the gripping operation to succeed, or the magnitude of the motion of the entire manipulator 110 during the gripping operation (energy efficiency).
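A minimal sketch of such a reward calculation is shown below; the weighting of the success flag, the number of operations, and the energy measure is a hypothetical choice and is not specified in the present disclosure.

def compute_reward(grasp_succeeded: bool, num_operations: int, energy_used: float) -> float:
    # Hypothetical reward: a bonus for a successful grasp, minus small
    # penalties for the number of operations and the energy consumed.
    reward = 1.0 if grasp_succeeded else 0.0
    reward -= 0.01 * num_operations
    reward -= 0.001 * energy_used
    return reward

# Example: a grasp that succeeded after 8 operations.
print(compute_reward(True, num_operations=8, energy_used=12.5))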
The state input unit 320 acquires the captured image transmitted from the drive control device 115 and the target object image input by the user, and notifies the reinforcement learning model 330 of them.
The model parameters of the reinforcement learning model 330 are updated by the update unit 310. After the model parameters are updated, the reinforcement learning model 330 receives as input the captured image and the target object image notified from the state input unit 320, and outputs information indicating the post-operation state of the gripping mechanism unit 111. In the present embodiment, the reinforcement learning model 330 outputs, as the information indicating the post-operation state of the gripping mechanism unit 111, for example:
- information indicating the post-operation position and posture of the gripping mechanism unit 111, and
- information indicating the post-operation opening and closing of the gripping mechanism unit 111.
On the other hand, 3b of FIG. 3 shows another functional configuration example. As shown in 3b of FIG. 3, the state input unit 320 of the reinforcement learning device 120 is configured to acquire, in addition to the captured image and the target object image, information indicating the pre-operation (current) state of the gripping mechanism unit 111, and to notify the reinforcement learning model 330 of it. The information indicating the pre-operation (current) state of the gripping mechanism unit 111 referred to here includes, for example:
- information indicating the pre-operation (current) position and posture of the gripping mechanism unit 111, and
- information indicating the pre-operation (current) opening and closing of the gripping mechanism unit 111.
In this case, the reinforcement learning model 330 receives as input the captured image, the target object image, and the information indicating the pre-operation (current) state of the gripping mechanism unit 111 notified from the state input unit 320, and outputs information indicating the post-operation state of the gripping mechanism unit 111.
<Flow of the reinforcement learning process>
Next, the flow of the reinforcement learning process performed by the reinforcement learning device 120 will be described. FIG. 4 is a first flowchart showing the flow of the reinforcement learning process. The flow of the reinforcement learning process will be described below with reference to FIG. 4. The reinforcement learning process shown in FIG. 4 is merely an example, and a reinforcement-learned model may be generated by executing the reinforcement learning process according to another model generation method.
In step S401, the state input unit 320 of the reinforcement learning device 120 acquires the target object image.

In step S402, the state input unit 320 of the reinforcement learning device 120 acquires a captured image.

In step S403, if the state input unit 320 of the reinforcement learning device 120 is configured to acquire information indicating the pre-operation (current) state of the gripping mechanism unit 111, it acquires the information indicating the pre-operation (current) state of the gripping mechanism unit 111.

In step S404, the reinforcement learning model 330 of the reinforcement learning device 120 receives the target object image and the captured image (and the information indicating the pre-operation state of the gripping mechanism unit 111) as input, and outputs information indicating the post-operation state of the gripping mechanism unit 111. Here, the reinforcement learning model 330 is assumed to be configured to output various information comprehensively as the information indicating the post-operation state of the gripping mechanism unit 111. As a result, the operations of the gripping mechanism unit 111 during the reinforcement learning process include both the optimum operation selected from the set of possible operations and operations selected at random from the set of possible operations.
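One common way to mix the optimum operation with randomly selected operations during training is an epsilon-greedy rule, sketched below for illustration; the exploration rate and the toy actions are assumptions and are not specified in the present disclosure.

import random

def select_action(candidate_actions, value_fn, epsilon=0.1):
    # With probability epsilon, explore by picking a random action from the
    # set of possible actions; otherwise pick the action whose estimated
    # value is highest (the optimum action under the current model).
    if random.random() < epsilon:
        return random.choice(candidate_actions)
    return max(candidate_actions, key=value_fn)

# Example with toy actions and a toy value function.
actions = ["move_up", "move_left", "close_gripper"]
print(select_action(actions, value_fn=len, epsilon=0.1))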
In step S405, the reinforcement learning device 120 transmits the information indicating the post-operation state of the gripping mechanism unit 111 output by the reinforcement learning model 330 to the drive control device 115.

In step S406, the update unit 310 of the reinforcement learning device 120 acquires information indicating the change in state caused by the operation of the gripping mechanism unit 111 being controlled.

In step S407, the update unit 310 of the reinforcement learning device 120 acquires a determination result as to whether the gripping operation on the gripping target object succeeded, and the reward calculation unit 311 of the reinforcement learning device 120 calculates a reward based on the acquired determination result.

In step S408, the update unit 310 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (information indicating changes in state, rewards, and the like).
In step S409, the state input unit 320 of the reinforcement learning device 120 determines whether to switch from the current target object image to a different target object image.

If it is determined in step S409 not to switch to a different target object image (NO in step S409), the process returns to step S402.

On the other hand, if it is determined in step S409 to switch to a different target object image (YES in step S409), the process proceeds to step S410.
In step S410, the update unit 310 of the reinforcement learning device 120 determines whether the termination condition of the reinforcement learning process is satisfied. The termination condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, such as a target success probability of the gripping operation for a predetermined object.

If it is determined in step S410 that the termination condition of the reinforcement learning process is not satisfied (NO in step S410), the process returns to step S401.

On the other hand, if it is determined in step S410 that the termination condition of the reinforcement learning process is satisfied (YES in step S410), the reinforcement learning process is terminated. The reinforcement learning model 330 after the reinforcement learning process is completed is applied, as a reinforcement-learned model, to a device that outputs information for controlling the operation of the gripping mechanism unit 111 to the drive control device 115 (referred to as an object manipulation device).

The reinforcement-learned model applied to the object manipulation device executes the processes of steps S401 to S405 of FIG. 4 (that is, it does not acquire information indicating changes in state, calculate rewards, update model parameters, or the like). In step S404, it is configured to output the optimum information as the information indicating the post-operation state of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 performs the optimum operation selected from the set of possible operations instead of comprehensively performing various operations.
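Restated as a Python-style sketch, the flow of FIG. 4 might look as follows; env, model, and update_unit and all of their method names are placeholders introduced here for illustration only.

def reinforcement_learning_loop(env, model, update_unit):
    # Hypothetical loop corresponding to steps S401 to S410 of FIG. 4.
    while True:
        target_img = env.get_target_image()                    # S401
        while True:
            captured_img = env.get_captured_image()            # S402
            state = env.get_gripper_state()                    # S403 (optional input)
            action = model.act(captured_img, target_img, state)     # S404
            env.send_command(action)                            # S405
            change = env.observe_state_change()                 # S406
            success = env.grasp_succeeded()                     # S407
            reward = update_unit.compute_reward(success)        # S407
            update_unit.update_parameters(model, change, reward)    # S408
            if env.switch_target_image():                       # S409
                break
        if env.termination_condition_met():                     # S410
            break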
<Execution example of the reinforcement learning process>
Next, an execution example of the reinforcement learning process by the reinforcement learning system 100 will be described. FIGS. 5 and 6 are first and second diagrams showing an execution example of the reinforcement learning process. When the target object image 510 shown in 5a of FIG. 5 is input by the user, the reinforcement learning device 120 recognizes the object 511 included in the target object image 510 as the gripping target object.
By adopting a configuration in which the type of the gripping target object is specified by inputting the target object image 510 in this way, the reinforcement learning device 120 allows the user to specify any object included in the object group 130 as the gripping target object.

In 5b of FIG. 5, the arrow 500 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 at the time when the object 511 is recognized as the gripping target object. The captured image 521 shows the captured image obtained when the object group 130 is photographed under the position and posture indicated by the arrow 500.

As shown in the captured image 521, when the object group 130 is photographed from above (the Z-axis direction), the gripping target object 511 is occluded by another object 512, and the imaging device 112 cannot photograph the object 511. That is, in this state, the object 511 cannot be gripped.
Therefore, the reinforcement learning device 120 outputs information indicating the post-operation state of the gripping mechanism unit 111 so as to change the position and posture of the gripping mechanism unit 111 such that the gripping target object 511 becomes photographable. The drive control device 115 then controls the operation of the gripping mechanism unit 111 based on the information indicating the post-operation state of the gripping mechanism unit 111.

In 5c of FIG. 5, the arrow 501 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 after the change caused by controlling the operation of the gripping mechanism unit 111. The captured image 522 shows the captured image obtained when the object group 130 is photographed under the position and posture indicated by the arrow 501.

As shown in the captured image 522, by photographing the object group 130 from the lateral direction (the X-axis direction), the gripping target object 511 has become photographable.

Therefore, the reinforcement learning device 120 outputs information indicating the post-operation state of the gripping mechanism unit 111 so as to further change the position and posture of the gripping mechanism unit 111 such that the gripping target object 511 becomes graspable. The drive control device 115 then controls the operation of the gripping mechanism unit 111 based on the information indicating the post-operation state of the gripping mechanism unit 111.
In 6a of FIG. 6, the arrow 601 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 after the change caused by controlling the operation of the manipulator 110. The captured image 611 shows the captured image obtained when the object group 130 is photographed under the position and posture indicated by the arrow 601.

As shown in the captured image 611, by approaching the gripping target object 511, the gripping target object 511 has become graspable.

Therefore, the reinforcement learning device 120 outputs information indicating the post-operation state of the gripping mechanism unit 111 so as to cause the gripping mechanism unit 111 to grip the gripping target object 511. The drive control device 115 then controls the operation of the gripping mechanism unit 111 based on the information indicating the post-operation state of the gripping mechanism unit 111.

In 6b of FIG. 6, the arrow 602 indicates the position and posture (imaging position and imaging direction) of the imaging device 112 after the change caused by controlling the operation of the gripping mechanism unit 111 (a state in which the object 511 has been gripped and lifted to a predetermined height). The captured image 612 shows the captured image obtained when the object 511 is photographed under the position and posture indicated by the arrow 602.
In this way, by configuring the position and posture of the imaging device 112 to change with changes in the position and posture of the gripping mechanism unit 111 and performing reinforcement learning using the captured images taken by the imaging device:
- evaluation from a long-term perspective, such as how the appearance from the imaging device changes, becomes possible; and
- in the course of trying the operation of gripping the gripping target object, the operation of searching for the gripping target object can also be tried.
That is, in the course of reinforcement learning, the operation of the gripping mechanism unit can be controlled so that the gripping target object becomes photographable. As a result, the success probability of the gripping operation can be increased regardless of the placement state of the gripping target object.
<Summary>
As is clear from the above description, the reinforcement learning system 100 according to the first embodiment:
- inputs a captured image taken by an imaging device whose position and posture change with changes in the position and posture of the gripping mechanism unit, and a target object image indicating the gripping target object to be gripped by the gripping mechanism unit, into a reinforcement learning model that outputs information indicating the post-operation state of the gripping mechanism unit; and
- updates the model parameters of the reinforcement learning model based on the operation result for the gripping target object (a determination result as to whether the gripping operation by the end effector succeeded) in a case where the operation of the gripping mechanism unit is controlled based on the information indicating the post-operation state of the gripping mechanism unit.
As a result, according to the reinforcement learning system 100, even when the gripping target object is placed so as to be occluded, the operation of the gripping mechanism unit can be controlled so that the gripping target object becomes photographable in the course of reinforcement learning.
That is, according to the first embodiment, it is possible to provide a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of the gripping operation for a specified type of object regardless of the placement state.
[Second Embodiment]
In the second embodiment, a case where reinforcement learning is performed by Q-learning will be described. The second embodiment will be described below, focusing on the differences from the first embodiment.
<Functional configuration of the reinforcement learning device>
First, a functional configuration example of the reinforcement learning device 120 according to the second embodiment will be described. FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device. As shown in FIG. 7, the reinforcement learning device 120 according to the second embodiment has an update unit 710 and a reinforcement learning model 720. For the reinforcement learning model 720, for example, a neural network may be used.
The update unit 710 has a reward calculation unit 711 and a parameter update unit 712, and updates the model parameters of the reinforcement learning model 720.

Specifically, the update unit 710 acquires a determination result as to whether the gripping operation on the gripping target object succeeded, and information indicating the change in state caused by the operation of the gripping mechanism unit 111 being controlled.

The reward calculation unit 711 calculates a reward based on the determination result as to whether the gripping operation on the gripping target object succeeded. The method for determining whether the gripping operation succeeded and the method for calculating the reward have already been described in the first embodiment, so their description is omitted here.
The parameter update unit 712 updates the model parameters of the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 included in the reinforcement learning model 720. The parameter update unit 712 updates the model parameters based on:
- the information indicating the change in state acquired by the update unit 710,
- the reward (immediate reward) calculated by the reward calculation unit 711, and
- the predicted value of the expected discounted cumulative reward (Q value) calculated by the expected value calculation unit 724, which will be described later.
A sketch of this update step is given after this list.
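As a hedged sketch only (not the patent's actual update rule), a one-step Q-learning update driven by these three inputs could look like the following; the discount factor, the optimizer, and the batch layout are assumptions introduced for illustration.

import torch
import torch.nn.functional as F

def q_learning_update(q_model, optimizer, batch, gamma=0.99):
    """One-step Q-learning update of the model parameters.

    batch is assumed to hold the captured image, the target object image,
    the state s, the motion a, the immediate reward r, and the predicted
    maximum Q value for the next state (max over a' of Q(s', a', g)).
    """
    q_pred = q_model(batch["image"], batch["goal_image"],
                     batch["state"], batch["action"])               # Q(s, a, g)
    with torch.no_grad():
        td_target = batch["reward"] + gamma * batch["next_max_q"]   # r + gamma * max Q(s', ., g)
    loss = F.mse_loss(q_pred, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()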
The model parameters of the reinforcement learning model 720 are updated by the update unit 710. After the model parameters have been updated, the reinforcement learning model 720 takes as input the captured image, the target object image, and information indicating the state (s) of the gripping mechanism unit 111 before its operation, and outputs information indicating the state of the gripping mechanism unit 111 after its operation.
Specifically, as shown in FIG. 7, the reinforcement learning model 720 has an image analysis unit 721, a state and motion input unit 722, an addition unit 723, an expected value calculation unit 724, and an adjustment unit 725.
The image analysis unit 721 processes the captured image transmitted from the drive control device 115 and the target object (g) image input by the user, and outputs the result to the addition unit 723. The image analysis unit 721 is configured using, for example, a neural network. More specifically, the image analysis unit 721 is composed of, for example, a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
The state and motion input unit 722 processes information indicating the state (s) of the gripping mechanism unit 111 before its operation and information indicating the motion (a) of the gripping mechanism unit 111, and outputs the result to the addition unit 723. The state and motion input unit 722 is configured using, for example, a neural network. More specifically, the state and motion input unit 722 is composed of a first linear layer, a second linear layer, a shape conversion layer, and the like. In order to search for the maximum Q value calculated by the expected value calculation unit 724 (described later), information indicating the motion (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725 is input to the state and motion input unit 722 a predetermined number of times (for example, 20 times).
The addition unit 723 adds the output of the image analysis unit 721 and the output of the state and motion input unit 722, and inputs the sum to the expected value calculation unit 724.
The expected value calculation unit 724 takes as input the sum, produced by the addition unit 723, of the output of the image analysis unit 721 and the output of the state and motion input unit 722, and calculates the Q value (Q(s, a, g)). The expected value calculation unit 724 calculates as many Q values as there are pieces of information indicating the motion (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725. The expected value calculation unit 724 is configured using, for example, a neural network. More specifically, the expected value calculation unit 724 is composed of a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
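A minimal sketch of how these three sub-networks could be wired together is shown below, assuming PyTorch, concrete layer sizes, a 7-dimensional state vector, and a 7-dimensional motion vector, none of which are specified by the patent; it is an illustration, not the patented implementation.

import torch
import torch.nn as nn

class QModel(nn.Module):
    """Sketch of the reinforcement learning model 720: image analysis unit
    721 (conv + MaxPooling), state and motion input unit 722 (linear layers),
    addition unit 723, and expected value calculation unit 724 (Q head).
    All layer sizes are illustrative assumptions."""

    def __init__(self):
        super().__init__()
        # Image analysis unit 721: the captured image and the target object
        # image are stacked on the channel axis and passed through
        # conv + MaxPooling blocks.
        self.image_analysis = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # State and motion input unit 722: linear layers over the
        # concatenated state (s) and motion (a) vectors.
        self.state_action = nn.Sequential(
            nn.Linear(7 + 7, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        # Expected value calculation unit 724: further conv + MaxPooling
        # layers and a final head producing Q(s, a, g).
        self.q_head = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.LazyLinear(1),
        )

    def forward(self, image, goal_image, state, action):
        img_feat = self.image_analysis(torch.cat([image, goal_image], dim=1))
        sa_feat = self.state_action(torch.cat([state, action], dim=-1))
        # Addition unit 723: broadcast the state-and-motion feature over the
        # spatial feature map and add the two results.
        sa_map = sa_feat[:, :, None, None].expand_as(img_feat)
        return self.q_head(img_feat + sa_map).squeeze(-1)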
The adjustment unit 725 adjusts the information indicating the motion (a) of the gripping mechanism unit 111 each time a Q value is calculated by the expected value calculation unit 724, and inputs it to the state and motion input unit 722. The adjustment unit 725 adjusts the information indicating the motion (a) of the gripping mechanism unit 111 a predetermined number of times (for example, 20 times), and extracts the maximum Q value from the Q values calculated in the process. The adjustment unit 725 specifies information indicating one motion (a) from the set of possible motions of the gripping mechanism unit 111, for example, based on the ε-greedy method.
According to the ε-greedy method, the specified information sometimes indicates the motion (a) corresponding to the maximum Q value and sometimes indicates a randomly selected motion (a).
Further, the adjustment unit 725 derives information indicating the state of the gripping mechanism unit 111 after its operation, based on the specified information indicating the motion (a) of the gripping mechanism unit 111 and the information indicating the state (s) of the gripping mechanism unit 111 before its operation, and transmits it to the drive control device 115.
As described above, by using the ε-greedy method, the reinforcement learning device 120 according to the second embodiment can comprehensively output various information as the information indicating the state of the gripping mechanism unit 111 after its operation. As a result, the motions of the gripping mechanism unit 111 during the reinforcement learning process include both the optimum motion selected from the set of possible motions (the motion that maximizes the Q value) and motions selected at random from the set of possible motions.
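As a hedged sketch of this ε-greedy selection (the candidate sampler and the value of ε are assumptions; the patent only gives 20 adjustments as an example):

import random
import torch

def select_action(q_model, image, goal_image, state,
                  sample_action, n_candidates=20, epsilon=0.1):
    """ε-greedy selection over candidate motions of the gripping mechanism.

    sample_action() is assumed to draw one motion (a) from the set of
    possible motions; q_model is assumed to return Q(s, a, g).
    """
    candidates = [sample_action() for _ in range(n_candidates)]
    if random.random() < epsilon:
        return random.choice(candidates)             # randomly selected motion
    with torch.no_grad():
        q_values = [q_model(image, goal_image, state, a) for a in candidates]
    best = max(range(n_candidates), key=lambda i: float(q_values[i]))
    return candidates[best]                           # motion with the maximum Q value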
The functional configuration shown in FIG. 7 is merely an example of a configuration of the reinforcement learning model 720 that realizes such functions, and the reinforcement learning model 720 may be configured with another functional configuration. For example, in the above description, the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 are each configured using a neural network, but the entire reinforcement learning model 720 may be configured using a single neural network.
The above description covers the functions during the reinforcement learning process; the functions after the reinforcement learning process is completed are the same as in the first embodiment. That is, after the reinforcement learning process is completed, the update unit 710 no longer acquires information indicating the change in state, calculates the reward, or updates the model parameters. In addition, the adjustment unit 725 outputs the optimum information as the information indicating the state of the gripping mechanism unit 111 after its operation (the post-operation state derived from the information indicating the motion (a) that maximizes the Q value). As a result, the reinforcement-learned model acquires a policy that maximizes the expected discounted cumulative reward (Q value).
<Flow of reinforcement learning process>
Next, the flow of the reinforcement learning process by the reinforcement learning device 120 according to the second embodiment will be described. FIG. 8 is a second flowchart showing the flow of the reinforcement learning process. The flow is described below with reference to FIG. 8. The reinforcement learning process shown in FIG. 8 is merely an example, and a reinforcement-learned model may be generated by executing the reinforcement learning process with another model generation method.
In step S801, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a target object image.
In step S802, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a captured image.
In step S803, the reinforcement learning model 720 of the reinforcement learning device 120 acquires information indicating the current state (s) of the gripping mechanism unit 111 before its operation.
In steps S804 to S807, the reinforcement learning model 720 specifies information indicating one motion (a) from the set of possible motions, for example based on the ε-greedy method, and comprehensively outputs information indicating the state of the gripping mechanism unit 111 after its operation.
Specifically, when specifying the information indicating the motion (a) corresponding to the maximum Q value from the set of possible motions, steps S804 to S806 are executed before proceeding to step S807. When specifying information indicating a randomly selected motion (a) from the set of possible motions, the process proceeds directly to step S807.
In step S804, the reinforcement learning model 720 of the reinforcement learning device 120 calculates the Q value.
In step S805, the reinforcement learning model 720 of the reinforcement learning device 120 determines whether or not the Q value has been calculated a predetermined number of times. If it is determined in step S805 that the Q value has not yet been calculated the predetermined number of times (NO in step S805), the process proceeds to step S806.
In step S806, the reinforcement learning model 720 of the reinforcement learning device 120 adjusts the information indicating the motion (a) of the gripping mechanism unit 111, and the process returns to step S804.
On the other hand, if it is determined in step S805 that the Q value has been calculated the predetermined number of times (YES in step S805), the process proceeds to step S807.
In step S807, if steps S804 to S806 have been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies the information indicating the motion (a) corresponding to the maximum Q value, derives information indicating the state of the gripping mechanism unit 111 after its operation, and transmits it to the drive control device 115. If steps S804 to S806 have not been executed, the reinforcement learning model 720 specifies information indicating a randomly selected motion (a), derives information indicating the state of the gripping mechanism unit 111 after its operation, and transmits it to the drive control device 115.
In step S808, the update unit 710 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
In step S809, the update unit 710 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation on the object to be gripped succeeded, and calculates the immediate reward. The update unit 710 of the reinforcement learning device 120 also acquires the predicted value of the expected discounted cumulative reward (Q value) calculated by the expected value calculation unit 724.
In step S810, the update unit 710 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 720 using the acquired information indicating the change in state, the calculated immediate reward, and the acquired predicted value of the expected discounted cumulative reward (Q value).
In step S811, the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
If it is determined in step S811 not to switch to a different target object image (NO in step S811), the process returns to step S802.
On the other hand, if it is determined in step S811 to switch to a different target object image (YES in step S811), the process proceeds to step S812.
In step S812, the update unit 710 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied. The end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100; one example is a target success probability of the gripping operation on a predetermined object.
If it is determined in step S812 that the end condition of the reinforcement learning process is not satisfied (NO in step S812), the process returns to step S801.
On the other hand, if it is determined in step S812 that the end condition of the reinforcement learning process is satisfied (YES in step S812), the reinforcement learning process is terminated. The reinforcement learning model 720 after the reinforcement learning process is completed is applied to the object manipulation device as a reinforcement-learned model.
The reinforcement-learned model applied to the object manipulation device executes the processes of steps S801 to S807 in FIG. 8 (that is, it does not acquire information indicating the change in state, calculate the reward, or update the model parameters). In step S807, it is configured so that the optimum information is output as the information indicating the state of the gripping mechanism unit 111 after its operation. In other words, unlike during the reinforcement learning process, the gripping mechanism unit 111 performs the optimum motion selected from the set of possible motions (the motion that maximizes the Q value), instead of comprehensively performing various motions.
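Putting steps S801 to S812 together, a hedged sketch of the outer loop might look as follows; the environment interface (get_goal_image, get_image, get_state, execute, grasp_succeeded, should_switch_goal) and the helper callables are assumptions introduced only to make the flow concrete, not part of the disclosure.

def reinforcement_learning_loop(env, q_model, optimizer,
                                select_action, compute_reward,
                                update_parameters, end_condition):
    """Sketch of the flow of FIG. 8 (steps S801 to S812)."""
    while not end_condition():                                       # S812
        goal_image = env.get_goal_image()                            # S801
        switch_goal = False
        while not switch_goal:
            image = env.get_image()                                  # S802
            state = env.get_state()                                  # S803
            action = select_action(q_model, image, goal_image, state)  # S804-S807
            next_state = env.execute(state, action)                  # drive control of unit 111
            reward = compute_reward(env.grasp_succeeded())           # S808-S809
            update_parameters(q_model, optimizer,                    # S810
                              (image, goal_image, state, action,
                               reward, next_state))
            switch_goal = env.should_switch_goal()                   # S811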
<Summary>
As is clear from the above description, the reinforcement learning system 100 according to the second embodiment provides the same effects as the first embodiment.
[Third Embodiment]
In the first and second embodiments described above, a case where a gripping operation is performed on a specified type of object has been described. However, the predetermined operation performed on the specified type of object is not limited to a gripping operation and may be any other operation. That is, the end effector attached to the tip of the main body 113 of the manipulator 110 is not limited to the gripping mechanism unit 111 and may be any other operation mechanism unit. The arbitrary operations referred to here include, for example, a pressing operation that pushes a specified type of object, a suction operation that picks up a specified type of object by vacuum, and an attraction operation that picks up a specified type of object with an electromagnet or the like.
In the first and second embodiments described above, the image pickup device is attached to the tip of the manipulator, but the attachment position of the image pickup device is not limited to the tip of the manipulator. Any other position may be used as long as the position and posture of the image pickup device change in accordance with changes in the position and posture of the gripping mechanism unit.
The gripping mechanism unit and the image pickup device may, for example, be attached to different manipulators, and the reinforcement learning model described above is also applicable in that case. The reinforcement learning model in this case may be configured to output, in addition to the information for controlling the operation of the gripping mechanism unit, information for controlling at least one of the position and posture of the image pickup device.
In the first and second embodiments, the information indicating the state of the gripping mechanism unit before its operation that is input to the reinforcement learning model was described as including information indicating the position and posture of the gripping mechanism unit and information indicating its opening and closing. However, the information indicating the state of the gripping mechanism unit before its operation is not limited to these, and other information may be input.
In the first and second embodiments, the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) are configured as separate bodies, but the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) may be configured as a single body. Alternatively, the drive control device 115 and the reinforcement learning device 120 (or the object manipulation device) may be configured as a single body.
In the first and second embodiments, the reinforcement learning process was described as being performed by actually controlling the operation of the gripping mechanism unit 111 based on the information, output by the reinforcement learning device 120, indicating the state of the gripping mechanism unit 111 after its operation. However, it is not necessary to actually control the operation of the gripping mechanism unit 111; the reinforcement learning process may be performed using a simulator that mimics the real environment. In this case, the image pickup device may also be configured to change its position and posture and to capture images on a simulator that mimics the real environment. Likewise, the predetermined operation on the object to be operated and the generation of the operation result may be performed on a simulator that mimics the real environment.
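As an illustration of this point only (the class and method names below are assumptions, not part of the disclosure), the loop sketched earlier needs nothing more than an environment object with a fixed interface, so a simulator that mimics the real environment can be swapped in for the real manipulator:

class SimulatedGraspEnv:
    """Sketch of a simulator-backed environment exposing the same interface
    as a real-manipulator environment."""

    def __init__(self, simulator):
        self.sim = simulator  # physics simulator mimicking the real environment

    def get_goal_image(self):
        return self.sim.render_goal_object()

    def get_image(self):
        # The simulated camera pose follows the simulated gripping mechanism.
        return self.sim.render_camera()

    def get_state(self):
        # Position/posture and open/close state of the simulated gripper.
        return self.sim.gripper_state()

    def execute(self, state, action):
        # Move the simulated gripping mechanism and return its new state.
        return self.sim.step(state, action)

    def grasp_succeeded(self):
        # Operation result generated in simulation.
        return self.sim.check_grasp()

    def should_switch_goal(self):
        return self.sim.episode_finished()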
In the first and second embodiments, the reinforcement learning device 120 was described as performing the reinforcement learning process for the case where an end effector is attached to the tip of the main body 113 of the manipulator 110. However, the reinforcement learning device 120 may also perform the reinforcement learning process for the case where a manipulator 110 with no end effector attached to its tip operates the object to be operated with the main body 113. In this case, the reinforcement learning device 120 may output information for controlling the operation of the tip of the main body 113 of the manipulator 110.
In the first and second embodiments, the tip of the main body 113 of the manipulator 110 was described as being configured so that its position and posture change, but it may be configured so that at least one of its position and posture changes. That is, the gripping mechanism unit 111 may be configured so that at least one of its position and posture changes. The image pickup device 112 may also be configured so that at least one of its position and posture changes in accordance with a change in at least one of the position and posture of the gripping mechanism unit 111. In this case, the reinforcement learning device 120 may output, as the information for controlling the operation of the gripping mechanism unit 111, information for controlling at least one of the position and posture of the gripping mechanism unit 111 and information for controlling the opening and closing of the gripping mechanism unit 111.
[Other Embodiments]
In this specification (including the claims), when the expression "at least one (of) a, b and c" or "at least one (of) a, b or c" (or a similar expression) is used, it includes any of a, b, c, a-b, a-c, b-c, or a-b-c. It may also include multiple instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes the addition of elements other than the listed elements (a, b and c), such as a-b-c-d having d.
In this specification (including the claims), when expressions such as "with data as input", "based on data", "according to data", or "in response to data" (or similar expressions) are used, unless otherwise noted, they include the case where the various data themselves are used as input and the case where data obtained by performing some processing on the various data (for example, noise-added data, normalized data, or intermediate representations of the various data) are used as input. When it is stated that some result is obtained "based on", "according to", or "in response to" data, this includes the case where the result is obtained based only on that data, and may also include the case where the result is obtained under the influence of other data, factors, conditions, and/or states. When it is stated that "data is output", unless otherwise noted, this includes the case where the various data themselves are used as output and the case where data obtained by performing some processing on the various data (for example, noise-added data, normalized data, or intermediate representations of the various data) are used as output.
In this specification (including the claims), when the terms "connected" and "coupled" are used, they are intended as non-limiting terms that include any of direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and the like. The terms should be interpreted appropriately according to the context in which they are used, but any connection/coupling form that is not intentionally or naturally excluded should be interpreted as being included in the terms in a non-limiting manner.
In this specification (including the claims), when the expression "A configured to B" is used, it may include that the physical structure of element A has a configuration capable of executing operation B, and that a permanent or temporary setting/configuration of element A is configured/set to actually execute operation B. For example, when element A is a general-purpose processor, it suffices that the processor has a hardware configuration capable of executing operation B and is configured to actually execute operation B by a permanent or temporary setting of programs (instructions). When element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices that the circuit structure of the processor is implemented so as to actually execute operation B, regardless of whether control instructions and data are actually attached.
In this specification (including the claims), when terms meaning inclusion or possession (for example, "comprising/including" and "having") are used, they are intended as open-ended terms, including the case of containing or possessing objects other than the object indicated by the object of the term. When the object of these terms meaning inclusion or possession is an expression that does not specify a quantity or that suggests the singular (an expression with "a" or "an" as an article), the expression should be interpreted as not being limited to a specific number.
In this specification (including the claims), even if an expression such as "one or more" or "at least one" is used in one place and an expression that does not specify a quantity or that suggests the singular (an expression with "a" or "an" as an article) is used in another place, the latter expression is not intended to mean "one". In general, an expression that does not specify a quantity or that suggests the singular (an expression with "a" or "an" as an article) should be interpreted as not necessarily being limited to a specific number.
In this specification, when it is stated that a specific effect (advantage/result) is obtained for a specific configuration of a certain embodiment, it should be understood, unless there is a particular reason otherwise, that the effect is also obtained for one or more other embodiments having that configuration. However, it should be understood that the presence or absence of the effect generally depends on various factors, conditions, and/or states, and that the effect is not always obtained by the configuration. The effect is merely obtained by the configuration described in the embodiments when various factors, conditions, and/or states are satisfied, and in an invention according to a claim defining that configuration or a similar configuration, the effect is not necessarily obtained.
In this specification (including the claims), when a plurality of pieces of hardware perform a predetermined process, the pieces of hardware may cooperate to perform the predetermined process, or some of the hardware may perform all of the predetermined process. Some of the hardware may perform a part of the predetermined process and other hardware may perform the rest. When an expression such as "one or more pieces of hardware perform a first process and the one or more pieces of hardware perform a second process" is used in this specification (including the claims), the hardware that performs the first process and the hardware that performs the second process may be the same or different. That is, it suffices that the hardware that performs the first process and the hardware that performs the second process are included in the one or more pieces of hardware. The hardware may include an electronic circuit, a device including an electronic circuit, or the like.
In this specification (including the claims), when a plurality of storage devices (memories) store data, an individual storage device (memory) among the plurality of storage devices (memories) may store only a part of the data or may store the whole of the data.
Although the embodiments of the present disclosure have been described in detail above, the present disclosure is not limited to the individual embodiments described above. Various additions, changes, replacements, partial deletions, and the like are possible without departing from the conceptual idea and spirit of the invention derived from the contents defined in the claims and their equivalents. For example, in all of the embodiments described above, the numerical values used in the description are shown as examples and are not limited thereto. The order of the operations in the embodiments is likewise shown as an example and is not limited thereto.
This application claims priority based on Japanese Patent Application No. 2020-119349 filed on July 10, 2020, the entire contents of which are incorporated herein by reference.
100: Reinforcement learning system
110: Manipulator
111: Gripping mechanism unit
112: Image pickup device
113: Main body unit
115: Drive control device
120: Reinforcement learning device
310: Update unit
311: Reward calculation unit
320: State input unit
330: Reinforcement learning model
510: Target object image
511: Object
521, 522: Captured image
611, 612: Captured image
710: Update unit
711: Reward calculation unit
712: Parameter update unit
720: Reinforcement learning model
721: Image analysis unit
722: State and motion input unit
723: Addition unit
724: Expected value calculation unit
725: Adjustment unit

Claims (13)

1.  A reinforcement learning device comprising:
     at least one memory; and
     at least one processor,
     wherein the at least one processor is configured to be able to execute:
      inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling an operation of the end effector; and
      updating parameters of the training model based on an operation result for the object obtained when the operation of the end effector is controlled based on the information output by the training model.
2.  The reinforcement learning device according to claim 1, wherein said at least one of the position and the posture of the image pickup device changes in accordance with at least one of a position and a posture of the end effector.
3.  The reinforcement learning device according to claim 2, wherein the image pickup device is attached to the end effector.
4.  The reinforcement learning device according to claim 1, wherein said at least one of the position and the posture of the image pickup device is controlled based on an output from the training model.
5.  The reinforcement learning device according to claim 1, wherein the end effector is a gripping mechanism unit that grips the object, and
     the at least one processor updates the parameters of the training model based on a determination result of whether or not a gripping operation on the object by the gripping mechanism unit has succeeded.
6.  The reinforcement learning device according to claim 5, wherein the at least one processor inputs, into the training model, information about at least one of a position and a posture of the gripping mechanism unit before its operation and information about opening and closing of the gripping mechanism unit before its operation.
7.  The reinforcement learning device according to claim 6, wherein the training model outputs information about at least one of a position and a posture of the gripping mechanism unit after its operation and information about opening and closing of the gripping mechanism unit after its operation.
8.  The reinforcement learning device according to claim 1, wherein a predetermined operation on the object by the end effector and a change in said at least one of the position and the posture of the image pickup device are executed on a simulator.
9.  A reinforcement learning system comprising:
     the reinforcement learning device according to any one of claims 1 to 8; and
     a manipulator to which the end effector and the image pickup device are attached.
10.  An object manipulation device comprising:
      at least one memory that stores a training model whose parameters have been updated by reinforcement learning; and
      at least one processor,
      wherein the at least one processor is configured to be able to execute:
       inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into the training model; and
       controlling an operation of the end effector based on information output by the training model.
11.  The object manipulation device according to claim 10, further comprising:
      the end effector; and
      the image pickup device.
12.  A model generation method executed by at least one processor, the method comprising:
      inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling an operation of the end effector; and
      updating parameters of the training model based on an operation result for the object obtained when the operation of the end effector is controlled based on the information output by the training model.
13.  A reinforcement learning program for causing at least one computer to execute:
      inputting information about a captured image captured by an image pickup device, at least one of a position and a posture of which changes, and information about a target object image indicating an object to be operated by an end effector, into a training model that outputs information for controlling an operation of the end effector; and
      updating parameters of the training model based on an operation result for the object obtained when the operation of the end effector is controlled based on the information output by the training model.
PCT/JP2021/025392 2020-07-10 2021-07-06 Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program WO2022009859A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020119349A JP2023145809A (en) 2020-07-10 2020-07-10 Reinforcement learning device, reinforcement learning system, object operation device, model generation method and reinforcement learning program
JP2020-119349 2020-07-10

Publications (1)

Publication Number Publication Date
WO2022009859A1 true WO2022009859A1 (en) 2022-01-13

Family

ID=79553121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/025392 WO2022009859A1 (en) 2020-07-10 2021-07-06 Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program

Country Status (2)

Country Link
JP (1) JP2023145809A (en)
WO (1) WO2022009859A1 (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017030135A (en) * 2015-07-31 2017-02-09 ファナック株式会社 Machine learning apparatus, robot system, and machine learning method for learning workpiece take-out motion
JP2019171540A (en) * 2018-03-29 2019-10-10 ファナック株式会社 Machine learning device, robot control device using machine learning device, robot vision system, and machine learning method

Also Published As

Publication number Publication date
JP2023145809A (en) 2023-10-12

Similar Documents

Publication Publication Date Title
JP4032793B2 (en) Charging system, charging control method, robot apparatus, charging control program, and recording medium
JP6963041B2 (en) Local feature model updates based on modifications to robot actions
US9387589B2 (en) Visual debugging of robotic tasks
Ott et al. A humanoid two-arm system for dexterous manipulation
JP3855812B2 (en) Distance measuring method, apparatus thereof, program thereof, recording medium thereof, and robot apparatus mounted with distance measuring apparatus
JP6931457B2 (en) Motion generation method, motion generator, system and computer program
CN110815258B (en) Robot teleoperation system and method based on electromagnetic force feedback and augmented reality
JP2009157948A (en) Robot apparatus, face recognition method, and face recognition apparatus
JP2021000678A (en) Control system and control method
CN114080583A (en) Visual teaching and repetitive motion manipulation system
US20220331962A1 (en) Determining environment-conditioned action sequences for robotic tasks
JP7458741B2 (en) Robot control device and its control method and program
JP2022543926A (en) System and Design of Derivative-Free Model Learning for Robotic Systems
JP6811465B2 (en) Learning device, learning method, learning program, automatic control device, automatic control method and automatic control program
CN114641375A (en) Dynamic programming controller
WO2019230399A1 (en) Robot control device, system, information processing method, and program
JP2022061022A (en) Technique of assembling force and torque guidance robot
WO2022134702A1 (en) Action learning method and apparatus, storage medium, and electronic device
JP2003266349A (en) Position recognition method, device thereof, program thereof, recording medium thereof, and robot device provided with position recognition device
WO2022009859A1 (en) Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program
JP2004298975A (en) Robot device and obstacle searching method
JP5659787B2 (en) Operation environment model construction system and operation environment model construction method
JP2021091022A (en) Robot control device, learned model, robot control method, and program
JP2003271958A (en) Method and processor for processing image, program therefor, recording medium therefor, and robot system of type mounted with image processor
JP2020017206A (en) Information processing apparatus, action determination method, and program

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21838554

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21838554

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP