WO2022009859A1 - Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program - Google Patents

Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program

Info

Publication number
WO2022009859A1
Authority
WO
WIPO (PCT)
Prior art keywords
reinforcement learning
end effector
information
gripping mechanism
training model
Prior art date
Application number
PCT/JP2021/025392
Other languages
English (en)
Japanese (ja)
Inventor
康博 藤田 (Yasuhiro Fujita)
Original Assignee
株式会社Preferred Networks
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社Preferred Networks (Preferred Networks, Inc.)
Publication of WO2022009859A1

Classifications

    • B: PERFORMING OPERATIONS; TRANSPORTING
    • B25: HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J: MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J13/00: Controls for manipulators
    • B25J13/08: Controls for manipulators by means of sensing devices, e.g. viewing or touching devices

Definitions

  • This disclosure relates to a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program.
  • A reinforcement learning system is known that takes as input an image captured by a fixed camera and performs reinforcement learning of the operation of an end effector so that a predetermined operation (for example, a gripping operation by the end effector) succeeds for a specified type of object among a plurality of types of objects placed in a predetermined area.
  • In such a system, the success probability of the predetermined operation can be increased by repeating reinforcement learning.
  • However, when the specified object cannot be captured by the fixed camera (for example, because it is hidden by other objects), reinforcement learning cannot proceed and the success probability of the predetermined operation cannot be increased.
  • In view of this, the present disclosure provides a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of a predetermined operation on an object.
  • The reinforcement learning device has, for example, the following configuration: it includes at least one memory and at least one processor, and the at least one processor inputs information about a captured image taken by an image pickup device, at least one of whose position and posture changes, together with information about a target object image indicating the object to be operated by an end effector, into a training model that outputs information for controlling the operation of the end effector, and updates the parameters of the training model based on the operation result for the object when the operation of the end effector is controlled based on the information output by the training model.
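  • As a purely illustrative sketch (not part of the disclosure), the following Python code mirrors this configuration: a training model maps the captured image and the target object image to information for controlling the end effector, and its parameters are updated from the operation result. All names (TrainingModel, step, environment.apply) are hypothetical assumptions.

```python
# Hypothetical sketch of the claimed configuration; names are assumptions,
# not the implementation disclosed in this application.
from typing import Any


class TrainingModel:
    """Maps (captured image, target object image) to information for
    controlling the operation of the end effector."""

    def __init__(self) -> None:
        self.parameters: dict = {}

    def predict(self, captured_image: Any, target_object_image: Any) -> Any:
        ...  # returns information for controlling the end effector

    def update(self, operation_result: Any) -> None:
        ...  # updates the model parameters from the operation result


def step(model: TrainingModel, captured_image: Any,
         target_object_image: Any, environment: Any) -> None:
    control_info = model.predict(captured_image, target_object_image)
    operation_result = environment.apply(control_info)  # end effector is operated
    model.update(operation_result)
```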
  • FIG. 1 is a diagram showing an example of a system configuration of a reinforcement learning system.
  • FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
  • FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
  • FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
  • FIG. 5 is a first diagram showing an execution example of the reinforcement learning process.
  • FIG. 6 is a second diagram showing an execution example of the reinforcement learning process.
  • FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
  • FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
  • First, the system configuration of the reinforcement learning system will be described with reference to FIG. 1.
  • the reinforcement learning system 100 includes a manipulator 110 and a reinforcement learning device 120.
  • The manipulator 110 is a device that performs a predetermined operation on a specified type of object (the object to be operated, indicated by a target object image) from an object group 130 in which a plurality of types of objects are placed in a mixed state.
  • The main body 113 of the manipulator 110 has a plurality of arms connected via a plurality of joints, and is configured so that the position and posture of the tip portion of the main body 113 are controlled by controlling each joint angle.
  • a gripping mechanism portion 111 (an example of an end effector) that performs a predetermined operation (a gripping operation in this embodiment) for an object of a specified type is attached to the tip portion of the main body portion 113 of the manipulator 110.
  • the gripping operation for the specified type of object is performed by controlling the opening and closing of the gripping mechanism unit 111.
  • an image pickup device 112 is attached to the tip portion of the main body 113 of the manipulator 110. That is, the image pickup device 112 is configured so that the position and the posture change with the change of the position and the posture of the gripping mechanism portion 111.
  • The image pickup device 112 outputs a captured image consisting of R, G, and B channel images at a predetermined frame period.
  • Alternatively, the image pickup device 112 may output, at the predetermined frame period, a captured image that includes distance information to each position on the object surface in addition to the R, G, and B channel images.
  • The image pickup device 112 may also output a distance image containing distance information to each position on the object surface at the predetermined frame period.
  • The captured image captured by the image pickup device 112 may be a moving image.
  • In the following description, the image pickup device 112 is assumed to output a captured image consisting of R, G, and B channel images at the predetermined frame period.
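  • As a small, assumed illustration of these image formats (not from the disclosure), an R, G, B captured image and optional per-pixel distance information can be represented as NumPy arrays and stacked into a single observation:

```python
from typing import Optional

import numpy as np


def make_observation(rgb: np.ndarray, depth: Optional[np.ndarray] = None) -> np.ndarray:
    """Stack an H x W x 3 RGB image with optional H x W distance information
    into a single H x W x C observation (C = 3 or 4)."""
    if depth is None:
        return rgb
    return np.concatenate([rgb, depth[..., None]], axis=-1)


rgb = np.zeros((64, 64, 3), dtype=np.float32)   # R, G, B channel images
depth = np.ones((64, 64), dtype=np.float32)     # distance to each surface position
obs = make_observation(rgb, depth)              # shape: (64, 64, 4)
```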
  • The support base 114 that supports the main body 113 of the manipulator 110 has a built-in drive control device 115 that controls the operation of the gripping mechanism unit 111 (that is, controls the position and posture of the gripping mechanism unit 111 and the opening/closing of the gripping mechanism unit 111).
  • The drive control device 115 acquires the captured image captured by the image pickup device 112 and transmits it to the reinforcement learning device 120. Further, the drive control device 115 acquires sensor signals detected by various sensors (not shown) arranged in the gripping mechanism unit 111 and the main body 113 of the manipulator 110, and transmits them to the reinforcement learning device 120.
  • the drive control device 115 acquires information for controlling the operation of the gripping mechanism unit 111 from the reinforcement learning device 120 in response to the transmission of the captured image and the sensor signal.
  • The information for controlling the operation of the gripping mechanism unit 111 referred to here may include any command regarding the operation of the gripping mechanism unit 111, for example information (a target value) indicating the state of the gripping mechanism unit 111 after operation, or specific operation amounts and control amounts for controlling the position and posture of the gripping mechanism unit 111 and the opening/closing of the gripping mechanism unit 111.
  • the information for controlling the operation of the gripping mechanism unit 111 may include information for controlling the operation of the manipulator 110.
  • the drive control device 115 will be described as acquiring information indicating a state after the operation of the gripping mechanism unit 111 as an example of information for controlling the operation of the gripping mechanism unit 111.
  • When the drive control device 115 acquires the information indicating the state after the operation of the gripping mechanism unit 111, it controls the position and posture of the gripping mechanism unit 111 and the opening/closing of the gripping mechanism unit 111 based on the various sensor signals (information indicating the state before the operation of the gripping mechanism unit 111).
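  • As one hypothetical illustration of what such "information indicating the state after the operation" could look like in code, the sketch below defines a target-state structure with a position, a posture, and a gripper open/close flag; the field names and units are assumptions, not the disclosed format.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GripperTargetState:
    """Hypothetical 'state after operation' command for the gripping mechanism."""
    position_xyz: Tuple[float, float, float]   # target position [m]
    posture_rpy: Tuple[float, float, float]    # target posture (roll, pitch, yaw) [rad]
    gripper_open: bool                         # True = open, False = closed


# The drive control device would move the mechanism from its sensed (pre-operation)
# state toward this target and open or close the gripper accordingly.
command = GripperTargetState(position_xyz=(0.4, 0.0, 0.2),
                             posture_rpy=(0.0, 1.57, 0.0),
                             gripper_open=True)
```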
  • The reinforcement learning device 120 has a reinforcement learning model (an example of a training model) that takes as inputs the captured image transmitted from the drive control device 115 and the target object image showing the object to be gripped by the gripping mechanism unit 111, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • For the reinforcement learning model (an example of a training model), for example, a neural network may be used.
  • the feature amount extracted from the captured image may be input instead of inputting the captured image itself.
  • the feature amount extracted from the captured image is, for example, a feature amount output from the intermediate layer by inputting the captured image into the neural network.
  • The information regarding the target object image input to the reinforcement learning model may be a captured image consisting of R, G, and B channel images, or a captured image that includes, in addition to the R, G, and B channel images, distance information to each position on the object surface.
  • the target object image may be a distance image including distance information to each position on the object surface.
  • the target object image may be a moving image.
  • Alternatively, a feature amount extracted from the target object image (for example, the feature amount output from an intermediate layer when the target object image is input to a neural network) may be input.
  • In the following description, it is assumed that a captured image consisting of R, G, and B channel images is input as the target object image.
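  • The following PyTorch sketch (an assumption about tooling and layer sizes, not the disclosed network) illustrates how a feature amount can be taken from an intermediate layer of a small convolutional encoder and used in place of the raw captured image or target object image.

```python
import torch
import torch.nn as nn

# A small convolutional encoder; the layer sizes are arbitrary assumptions.
encoder = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
)

captured_image = torch.zeros(1, 3, 64, 64)   # R, G, B channels
intermediate = encoder[:3](captured_image)   # output of the first conv/pool block
features = encoder(captured_image)           # output of the full encoder

# Either `intermediate` or `features` could be supplied to the reinforcement
# learning model instead of the raw image.
print(intermediate.shape, features.shape)    # (1, 16, 32, 32) (1, 32, 16, 16)
```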
  • The reinforcement learning device 120 acquires the operation result for the object to be gripped (for example, the determination result of whether or not the gripping operation was successful).
  • the reinforcement learning device 120 updates the model parameters of the reinforcement learning model based on the acquired operation result.
  • As described above, in the present embodiment, reinforcement learning is performed using information about the captured image captured by the image pickup device 112, whose position and posture change with the change in the position and posture of the gripping mechanism unit 111.
  • As a result, the gripping mechanism unit 111 can be operated in the process of reinforcement learning so that the object to be gripped can be photographed. That is, according to the present embodiment, it is possible to provide a reinforcement learning system 100 capable of increasing the success probability of the gripping operation regardless of the placement state of the object to be gripped.
  • the vertical direction of the paper surface in FIG. 1 is defined as the Z-axis direction
  • the horizontal direction of the paper surface in FIG. 1 is defined as the Y-axis direction
  • the depth direction of the paper surface in FIG. 1 is defined as the X-axis direction.
  • Next, the hardware configuration of the manipulator 110 (here, the mechanical system is omitted and only the control-system hardware configuration is shown) and the hardware configuration of the reinforcement learning device 120 constituting the reinforcement learning system 100 will be described with reference to FIG. 2.
  • FIG. 2 is a diagram showing an example of the hardware configuration of each device constituting the reinforcement learning system.
  • the manipulator 110 has a sensor group 211 and an actuator group 212 in addition to the image pickup device 112 and the drive control device 115.
  • the sensor group 211 includes n sensors.
  • The n sensors include at least a sensor for calculating the position and posture of the gripping mechanism unit 111 (a sensor for measuring each joint angle of the main body 113) and a sensor for detecting the opening and closing of the gripping mechanism unit 111.
  • the actuator group 212 includes m actuators.
  • The m actuators include at least an actuator for controlling the position and posture of the gripping mechanism unit 111 (an actuator for controlling each joint angle of the main body 113) and an actuator for controlling the opening and closing of the gripping mechanism unit 111.
  • the drive control device 115 includes a sensor signal processing device 201, an actuator drive device 202, and a controller 203.
  • the sensor signal processing device 201 receives the sensor signal transmitted from the sensor group 211 and notifies the controller 203 of the sensor signal data.
  • the actuator drive device 202 acquires the control signal data from the controller 203 and transmits the control signal to the actuator group 212.
  • the controller 203 acquires the captured image transmitted from the image pickup device 112 and transmits it to the reinforcement learning device 120. Further, the controller 203 transmits the sensor signal data notified from the sensor signal processing device 201 to the reinforcement learning device 120.
  • the controller 203 acquires information indicating the state after the operation of the gripping mechanism unit 111 from the reinforcement learning device 120 in response to the transmission of the captured image and the sensor signal data. Further, when the controller 203 acquires the information indicating the state after the operation of the gripping mechanism unit 111, the controller 203 generates the control signal data for operating the actuator group 212 based on the sensor signal data and notifies the actuator drive device 202.
  • The reinforcement learning device 120 has, as components, a processor 221, a main storage device (memory) 222, an auxiliary storage device 223, a network interface 224, and a device interface 225.
  • the reinforcement learning device 120 is realized as a computer in which these components are connected via a bus 226.
  • In the example of FIG. 2, the reinforcement learning device 120 is shown as including one of each component, but it may include a plurality of the same components. Further, although a single reinforcement learning device 120 is shown in the example of FIG. 2, the reinforcement learning program may be installed in a plurality of reinforcement learning devices, each of which executes the same or a different part of the processing of the reinforcement learning program. In this case, the reinforcement learning devices may take the form of distributed computing in which the entire processing is executed by communicating with each other via the network interface 224 or the like. That is, the reinforcement learning device 120 may be configured as a system that realizes its functions by having one or a plurality of computers execute instructions stored in one or a plurality of storage devices. Further, the various data transmitted from the drive control device 115 may be processed by one or a plurality of reinforcement learning devices provided on the cloud, and the processing results may be transmitted back to the drive control device 115.
  • The various operations of the reinforcement learning device 120 may be executed in parallel by using one or a plurality of processors, or by using a plurality of reinforcement learning devices communicating via the communication network 240. The various operations may also be distributed to a plurality of arithmetic cores in the processor 221 and executed in parallel. Further, some or all of the processes, means, and the like of the present disclosure may be executed by an external device 230 (at least one of a processor and a storage device) provided on the cloud and capable of communicating with the reinforcement learning device 120 via the communication network 240. In this way, the reinforcement learning device 120 may take the form of parallel computing by one or a plurality of computers.
  • The processor 221 may be an electronic circuit (processing circuitry, e.g., a CPU, GPU, FPGA, or ASIC). The processor 221 may also be a semiconductor device or the like including a dedicated processing circuit. The processor 221 is not limited to an electronic circuit using electronic logic elements, and may be realized by an optical circuit using optical logic elements. The processor 221 may also include arithmetic functions based on quantum computing.
  • the processor 221 performs various calculations based on various data and instructions input from each device and the like of the internal configuration of the reinforcement learning device 120, and outputs the calculation result and the control signal to each device and the like.
  • the processor 221 controls each component included in the reinforcement learning device 120 by executing an OS (Operating System), an application, or the like.
  • the processor 221 may refer to one or more electronic circuits arranged on one chip, or may refer to one or more electronic circuits arranged on two or more chips or devices. When a plurality of electronic circuits are used, each electronic circuit may communicate by wire or wirelessly.
  • the main storage device 222 is a storage device that stores instructions executed by the processor 221 and various data, and various data stored in the main storage device 222 is read out by the processor 221.
  • the auxiliary storage device 223 is a storage device other than the main storage device 222. It should be noted that these storage devices mean arbitrary electronic components capable of storing various data, and may be semiconductor memories.
  • the semiconductor memory may be either a volatile memory or a non-volatile memory.
  • the storage device for storing various data in the reinforcement learning device 120 may be realized by the main storage device 222 or the auxiliary storage device 223, or may be realized by the built-in memory built in the processor 221.
  • a plurality of processors 221 may be connected (combined) to one main storage device 222, or a single processor 221 may be connected.
  • a plurality of main storage devices 222 may be connected (combined) to one processor 221.
  • the processor may include a configuration in which it is connected (coupled) to at least one main memory device 222. Further, this configuration may be realized by the main storage device 222 and the processor 221 included in the plurality of reinforcement learning devices 120.
  • the main storage device 222 may include a configuration in which the processor is integrated (for example, a cache memory including an L1 cache and an L2 cache).
  • the network interface 224 is an interface for connecting to the communication network 240 wirelessly or by wire.
  • the network interface 224 may exchange various data with the drive control device 115 and other external devices 230 connected via the communication network 240.
  • The communication network 240 may be any of a WAN (Wide Area Network), a LAN (Local Area Network), a PAN (Personal Area Network), or a combination thereof, and any network may be used as long as information is exchanged between the computer and the drive control device 115 or other external devices 230.
  • An example of a WAN is the Internet, examples of a LAN include IEEE 802.11 and Ethernet, and examples of a PAN include Bluetooth (registered trademark) and NFC (Near Field Communication).
  • the device interface 225 is an interface such as USB that directly connects to the external device 250.
  • the external device 250 is a device connected to a computer.
  • the external device 250 may be an input device as an example.
  • The input device is, for example, a camera, a microphone, a motion-capture device, various sensors, a keyboard, a mouse, or a touch panel, and provides the acquired information to the computer. It may also be a device having an input unit, a memory, and a processor, such as a personal computer, tablet terminal, or smartphone.
  • the external device 250 may be an output device as an example.
  • The output device may be, for example, a display device such as an LCD (Liquid Crystal Display), a CRT (Cathode Ray Tube), a PDP (Plasma Display Panel), or an organic EL (Electro Luminescence) panel, or may be a speaker or the like that outputs audio. It may also be a device having an output unit, a memory, and a processor, such as a personal computer, tablet terminal, or smartphone.
  • the external device 250 may be a storage device (memory).
  • the external device 250 may be a network storage or the like, and the external device 250 may be a storage such as an HDD.
  • the external device 250 may be a device having some functions of the components of the reinforcement learning device 120. That is, the computer may transmit or receive a part or all of the processing result of the external device 250.
  • FIG. 3 is a first diagram showing an example of the functional configuration of the reinforcement learning device.
  • the reinforcement learning device 120 has an update unit 310, a state input unit 320, and a reinforcement learning model 330.
  • The update unit 310 has a reward calculation unit 311 and updates the model parameters of the reinforcement learning model 330. Specifically, the update unit 310 acquires the determination result of whether or not the gripping operation for the object to be gripped was successful, and information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111. The reward calculation unit 311 then calculates the reward based on the determination result acquired by the update unit 310, and the update unit 310 updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (the information indicating the change in state, the reward, and the like).
  • the determination of whether or not the gripping operation for the object to be gripped is successful may be automatically performed based on, for example, a captured image.
  • the user of the reinforcement learning system 100 may determine whether or not the gripping operation for the object to be gripped is successful.
  • the update unit 310 may calculate the reward based on information other than the determination result of whether or not the gripping operation is successful.
  • For example, the update unit 310 may calculate the reward based on various information such as the operation time and the number of operations required for the gripping operation to succeed, or the magnitude of the motion of the entire manipulator 110 during the gripping operation (energy efficiency).
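  • A minimal sketch of such a reward calculation is shown below, assuming an episode record with a success flag, the elapsed time, the number of operations, and an energy measure; the weights and field names are illustrative assumptions only, not values from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    grasp_succeeded: bool      # determination result of the gripping operation
    elapsed_time_s: float      # operation time until success or failure
    num_actions: int           # number of operations performed
    total_energy_j: float      # magnitude of the whole manipulator's motion


def compute_reward(result: EpisodeResult,
                   time_weight: float = 0.01,
                   action_weight: float = 0.05,
                   energy_weight: float = 0.001) -> float:
    """Success gives a base reward; time, action count and energy use reduce it."""
    reward = 1.0 if result.grasp_succeeded else 0.0
    reward -= time_weight * result.elapsed_time_s
    reward -= action_weight * result.num_actions
    reward -= energy_weight * result.total_energy_j
    return reward


print(compute_reward(EpisodeResult(True, 12.0, 6, 35.0)))
```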
  • the state input unit 320 acquires the captured image transmitted from the drive control device 115 and the target object image input by the user, and notifies the reinforcement learning model 330.
  • The model parameters of the reinforcement learning model 330 are updated by the update unit 310. The reinforcement learning model 330 with updated model parameters receives the captured image and the target object image notified from the state input unit 320 as inputs, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • As the information indicating the state after the operation of the gripping mechanism unit 111, the reinforcement learning model 330 outputs, for example, information indicating the position and posture of the gripping mechanism unit 111 after operation and information indicating the opening and closing of the gripping mechanism unit 111 after operation.
  • 3b in FIG. 3 shows another functional configuration example.
  • In this case, the state input unit 320 of the reinforcement learning device 120 is configured to acquire, in addition to the captured image and the target object image, information indicating the state before the operation (the current state) of the gripping mechanism unit 111, and to notify the reinforcement learning model 330 of it.
  • The information indicating the state before the operation (the current state) of the gripping mechanism unit 111 referred to here includes, for example, information indicating the position and posture of the gripping mechanism unit 111 before operation (currently) and information indicating the opening and closing of the gripping mechanism unit 111 before operation (currently).
  • The reinforcement learning model 330 receives as inputs the captured image, the target object image, and the information indicating the state before the operation (the current state) of the gripping mechanism unit 111 notified from the state input unit 320, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
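  • The following PyTorch sketch shows one possible shape of such a model: a convolutional branch for the captured image and the target object image, a linear branch for the current state, and a head that outputs a target position, posture, and open/close value. The layer sizes and the 7-dimensional encoding are assumptions, not the disclosed architecture.

```python
import torch
import torch.nn as nn


class PostOperationStateModel(nn.Module):
    """Sketch: (captured image, target object image, current state) ->
    information on the post-operation state of the gripping mechanism."""

    def __init__(self, state_dim: int = 7):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
        )
        self.state_encoder = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.head = nn.Linear(32 * 16 * 16 + 64, 7)  # x, y, z, roll, pitch, yaw, open/close

    def forward(self, captured, target, state):
        x = self.image_encoder(torch.cat([captured, target], dim=1))
        s = self.state_encoder(state)
        return self.head(torch.cat([x, s], dim=1))


model = PostOperationStateModel()
out = model(torch.zeros(1, 3, 64, 64), torch.zeros(1, 3, 64, 64), torch.zeros(1, 7))
print(out.shape)  # (1, 7)
```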
  • FIG. 4 is a first flowchart showing the flow of the reinforcement learning process.
  • the reinforcement learning process shown in FIG. 4 is only an example, and a model for which reinforcement learning has been completed may be generated by executing the reinforcement learning process by another model generation method.
  • In step S401, the state input unit 320 of the reinforcement learning device 120 acquires the target object image.
  • In step S402, the state input unit 320 of the reinforcement learning device 120 acquires a captured image.
  • In step S403, when the state input unit 320 of the reinforcement learning device 120 is configured to acquire information indicating the state before the operation (the current state) of the gripping mechanism unit 111, it acquires that information.
  • In step S404, the reinforcement learning model 330 of the reinforcement learning device 120 receives the target object image, the captured image (and, where configured, the information indicating the state before the operation of the gripping mechanism unit 111) as inputs, and outputs information indicating the state after the operation of the gripping mechanism unit 111. During the reinforcement learning process, the reinforcement learning model 330 is configured to output various information comprehensively as the information indicating the state after the operation of the gripping mechanism unit 111.
  • As a result, the motion of the gripping mechanism unit 111 during the reinforcement learning process includes both the optimum motion selected from the set of possible motions and motions randomly selected from the set of possible motions.
  • In step S405, the reinforcement learning device 120 transmits the information indicating the post-operation state of the gripping mechanism unit 111 output by the reinforcement learning model 330 to the drive control device 115.
  • In step S406, the update unit 310 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • In step S407, the update unit 310 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation for the object to be gripped was successful, and the reward calculation unit 311 of the reinforcement learning device 120 calculates the reward based on the acquired determination result.
  • In step S408, the update unit 310 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 330 based on the various information acquired or calculated so far (the information indicating the change in state, the reward, and the like).
  • In step S409, the state input unit 320 of the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
  • If it is determined in step S409 not to switch to a different target object image (NO in step S409), the process returns to step S402.
  • If it is determined in step S409 to switch to a different target object image (YES in step S409), the process proceeds to step S410.
  • In step S410, the update unit 310 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied.
  • the end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, and one example thereof is a target success probability of a gripping operation for a predetermined object.
  • If it is determined in step S410 that the end condition of the reinforcement learning process is not satisfied (NO in step S410), the process returns to step S401. If it is determined that the end condition is satisfied (YES in step S410), the reinforcement learning process ends.
  • The reinforcement learning model 330 after the reinforcement learning process is completed is applied, as a reinforcement-learned model, to a device (object manipulation device) that outputs information for controlling the operation of the gripping mechanism unit 111 to the drive control device 115.
  • The reinforcement-learned model applied to the object manipulation device executes only the processes of steps S401 to S405 in FIG. 4 (that is, it does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like). Further, in step S404, the optimum information is output as the information indicating the state after the operation of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 performs the optimum operation selected from the set of possible operations instead of performing various operations comprehensively.
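  • Purely as an illustration of the flow of steps S401 to S410 (not the disclosed implementation), the sketch below wires these steps into an outer loop over target object images and an inner act-observe-update loop; env, model, and their methods are assumed placeholders.

```python
def reinforcement_learning_process(env, model, num_target_switches=100, steps_per_target=50):
    """Rough analogue of FIG. 4: S401 acquire target image, S402 acquire captured
    image, S403 acquire current state, S404 infer the post-operation state,
    S405 send it to the drive control device, S406-S408 observe and update,
    S409 switch the target object image, S410 check the end condition."""
    for _ in range(num_target_switches):                     # outer loop: S401 / S409
        target_image = env.sample_target_object_image()      # S401
        for _ in range(steps_per_target):
            captured = env.get_captured_image()               # S402
            current_state = env.get_gripper_state()           # S403 (optional input)
            target_state = model.predict(captured, target_image, current_state)  # S404
            change, result = env.apply(target_state)          # S405, then S406 state change
            reward = env.compute_reward(result)               # S407 reward from the result
            model.update(change, reward)                      # S408 parameter update
        if env.end_condition_satisfied():                     # S410
            break
    return model
```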
  • FIGS. 5 and 6 are first and second diagrams showing an execution example of the reinforcement learning process.
  • the reinforcement learning device 120 recognizes the object 511 included in the target object image 510 as the object to be gripped.
  • The user can specify any object included in the object group 130 as the object to be gripped.
  • the arrow 500 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 at the time when the object 511 is recognized as the object to be gripped. Further, the captured image 521 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 500.
  • Here, the object 511 to be grasped is shielded by another object 512, and the image pickup device 112 cannot photograph the object 511. That is, in this state, the object 511 cannot be grasped.
  • the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 in order to change the position and posture of the gripping mechanism unit 111 so that the object 511 to be gripped can be photographed.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state after the operation of the gripping mechanism unit 111.
  • the arrow 501 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 after the change, which is changed by controlling the operation of the gripping mechanism portion 111. Further, the captured image 522 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 501.
  • By photographing the object group 130 from the lateral direction (X-axis direction), the object 511 to be grasped can be photographed.
  • The reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 in order to further change the position and posture of the gripping mechanism unit 111 so that the object 511 to be gripped can be gripped.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state after the operation of the gripping mechanism unit 111.
  • the arrow 601 indicates the position and posture (shooting position and shooting direction) of the image pickup apparatus 112 after the change, which is changed by controlling the operation of the manipulator 110. Further, the captured image 611 shows a captured image when the object group 130 is captured under the position and posture indicated by the arrow 601.
  • By bringing the gripping mechanism unit 111 closer to the object 511 to be gripped, the object 511 can now be gripped.
  • the reinforcement learning device 120 outputs information indicating the state after the operation of the gripping mechanism unit 111 so that the gripping mechanism unit 111 grips the object 511 to be gripped.
  • the drive control device 115 controls the operation of the gripping mechanism unit 111 based on the information indicating the state of the gripping mechanism unit 111 after the operation.
  • The arrow 602 indicates the position and posture (shooting position and shooting direction) of the image pickup device 112 after being changed by controlling the operation of the gripping mechanism unit 111 (the state in which the gripped object 511 has been lifted to a predetermined height). Further, the captured image 612 shows the captured image when the object 511 is captured under the position and posture indicated by the arrow 602.
  • As described above, the reinforcement learning system 100 according to the first embodiment inputs the image captured by the image pickup device, whose position and posture change as the position and posture of the gripping mechanism unit change, together with the target object image showing the object to be gripped by the gripping mechanism unit, into the reinforcement learning model that outputs information indicating the state after the operation of the gripping mechanism unit.
  • The reinforcement learning system 100 then updates the model parameters of the reinforcement learning model based on the operation result for the object to be gripped (the determination result of whether or not the gripping operation by the end effector was successful) when the operation of the gripping mechanism unit is controlled based on the information indicating the state after the operation of the gripping mechanism unit.
  • As a result, the operation of the gripping mechanism unit can be controlled in the process of reinforcement learning so that the image pickup device can photograph the object to be gripped.
  • Therefore, according to the first embodiment, it is possible to provide a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program capable of increasing the success probability of the gripping operation for a specified type of object regardless of its placement state.
  • FIG. 7 is a second diagram showing an example of the functional configuration of the reinforcement learning device.
  • the reinforcement learning device 120 according to the second embodiment has an update unit 710 and a reinforcement learning model 720.
  • For the reinforcement learning model 720, for example, a neural network may be used.
  • the update unit 710 has a reward calculation unit 711 and a parameter update unit 712, and updates the model parameters of the reinforcement learning model 720.
  • Specifically, the update unit 710 acquires the determination result of whether or not the gripping operation for the object to be gripped was successful, and information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • the reward calculation unit 711 calculates the reward based on the determination result of whether or not the gripping operation for the object to be gripped is successful. Since the method for determining whether or not the gripping operation for the object to be gripped has been successful and the method for calculating the reward have already been described in the first embodiment, the description thereof will be omitted here.
  • the parameter update unit 712 updates each model parameter of the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 included in the reinforcement learning model 720.
  • The parameter update unit 712 updates the model parameters based on: the information indicating the change in state acquired by the update unit 710; the reward (immediate reward) calculated by the reward calculation unit 711; and the predicted value of the expected value of the discounted cumulative reward (Q value) calculated by the expected value calculation unit 724, which will be described later.
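  • The update described here corresponds to a standard temporal-difference (Q-learning style) update, in which the immediate reward plus the discounted predicted Q value forms the target for the current Q value. The sketch below shows such an update in PyTorch; the discount factor, the max_over_actions helper, and the batch layout are assumptions, and this is not necessarily the exact loss used in the disclosure.

```python
import torch
import torch.nn.functional as F


def q_learning_update(q_net, optimizer, batch, gamma: float = 0.99):
    """One TD update: target = r + gamma * max_a' Q(s', a', g) (0 if terminal)."""
    q_sa = q_net(batch["state"], batch["action"], batch["goal"])
    with torch.no_grad():
        # max_over_actions is an assumed helper that searches candidate motions.
        next_q = q_net.max_over_actions(batch["next_state"], batch["goal"])
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_q
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```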
  • The model parameters of the reinforcement learning model 720 are updated by the update unit 710. The reinforcement learning model 720 with updated model parameters receives as inputs the captured image, the target object image, and the information indicating the state (s) before the operation of the gripping mechanism unit 111, and outputs information indicating the state after the operation of the gripping mechanism unit 111.
  • the reinforcement learning model 720 has an image analysis unit 721, a state and motion input unit 722, an addition unit 723, an expected value calculation unit 724, and an adjustment unit 725.
  • The image analysis unit 721 executes processing by acquiring the captured image transmitted from the drive control device 115 and the target object image (g) input by the user, and outputs the execution result to the addition unit 723.
  • the image analysis unit 721 is configured by using, for example, a neural network. More specifically, the image analysis unit 721 is composed of, for example, a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
  • The state and motion input unit 722 executes processing by acquiring the information indicating the state (s) before the operation of the gripping mechanism unit 111 and the information indicating the motion (a) of the gripping mechanism unit 111, and outputs the execution result to the addition unit 723.
  • The state and motion input unit 722 is configured using, for example, a neural network. More specifically, the state and motion input unit 722 is composed of a first linear layer, a second linear layer, a shape conversion layer, and the like. Further, in order to search for the maximum Q value calculated by the expected value calculation unit 724 (described later), the information indicating the motion (a) of the gripping mechanism unit 111, adjusted by the adjustment unit 725, is input to the state and motion input unit 722 a predetermined number of times (for example, 20 times).
  • The addition unit 723 adds the execution result output from the image analysis unit 721 and the execution result output from the state and motion input unit 722, and inputs the sum to the expected value calculation unit 724.
  • The expected value calculation unit 724 takes as input the sum of the execution result of the image analysis unit 721 and the execution result of the state and motion input unit 722 computed by the addition unit 723, executes processing, and calculates the Q value (Q(s, a, g)).
  • The expected value calculation unit 724 calculates as many Q values as there are pieces of information indicating the motion (a) of the gripping mechanism unit 111 adjusted by the adjustment unit 725.
  • the expected value calculation unit 724 is configured by using, for example, a neural network. More specifically, the expected value calculation unit 724 is composed of a first convolution layer, a first MaxPooling layer, a second convolution layer, a second MaxPooling layer, and the like.
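  • To make the structure of the units 721 to 724 concrete, the PyTorch sketch below mirrors the description: a convolution/MaxPooling branch for the images (721), linear layers plus a shape conversion for the state and motion (722), element-wise addition (723), and a further convolution/MaxPooling stage producing the Q value (724). All layer widths and image sizes are assumptions, not the disclosed network.

```python
import torch
import torch.nn as nn


class QModelSketch(nn.Module):
    def __init__(self, state_dim: int = 7, action_dim: int = 7):
        super().__init__()
        # Image analysis unit 721: conv -> pool -> conv -> pool (captured + target images).
        self.image_analysis = nn.Sequential(
            nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # State and motion input unit 722: linear -> linear -> shape conversion.
        self.state_action = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, 32 * 16 * 16), nn.ReLU(),
        )
        # Expected value calculation unit 724: conv -> pool -> Q value.
        self.expected_value = nn.Sequential(
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(), nn.Linear(32 * 8 * 8, 1),
        )

    def forward(self, captured, target, state, action):
        img = self.image_analysis(torch.cat([captured, target], dim=1))   # 721
        sa = self.state_action(torch.cat([state, action], dim=1))
        sa = sa.view(-1, 32, 16, 16)                                      # shape conversion (722)
        return self.expected_value(img + sa)                              # 723 addition, 724 Q(s, a, g)


q = QModelSketch()(torch.zeros(1, 3, 64, 64), torch.zeros(1, 3, 64, 64),
                   torch.zeros(1, 7), torch.zeros(1, 7))
print(q.shape)  # (1, 1)
```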
  • the adjustment unit 725 adjusts the information indicating the operation (a) of the gripping mechanism unit 111 every time the Q value is calculated by the expected value calculation unit 724, and inputs the information to the state and operation input unit 722.
  • the adjusting unit 725 adjusts the information indicating the operation (a) of the gripping mechanism unit 111 a predetermined number of times (for example, 20 times), and extracts the maximum Q value from the Q values calculated during that period.
  • the adjusting unit 725 specifies information indicating one of the operations (a) from the set of possible operations of the gripping mechanism unit 111, for example, based on the ⁇ -greedy method.
  • Specifically, the information indicating the motion (a) corresponding to the maximum Q value may be specified, or information indicating a randomly selected motion (a) may be specified.
  • Based on the specified information indicating the motion (a) of the gripping mechanism unit 111 and the information indicating the state (s) before the operation of the gripping mechanism unit 111, the adjustment unit 725 derives information indicating the state after the operation of the gripping mechanism unit 111 and transmits it to the drive control device 115.
  • As a result, the motion of the gripping mechanism unit 111 during the reinforcement learning process includes both the optimum motion selected from the set of possible motions (the motion that maximizes the Q value) and motions randomly selected from the set of possible motions.
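  • The search performed by the adjustment unit can be sketched as follows: sample a fixed number of candidate motions (for example, 20), evaluate the Q value of each, and either take the maximizer or, with probability ε, a random candidate. The q_fn and sample_candidate callables are placeholders; this is an assumed reading of the ε-greedy procedure described above, not the disclosed code.

```python
import random


def select_action(q_fn, state, goal, sample_candidate, num_candidates: int = 20,
                  epsilon: float = 0.1):
    """q_fn(state, action, goal) -> float; sample_candidate() -> candidate motion."""
    candidates = [sample_candidate() for _ in range(num_candidates)]
    if random.random() < epsilon:
        return random.choice(candidates)            # exploratory, randomly selected motion
    q_values = [q_fn(state, a, goal) for a in candidates]
    best = max(range(num_candidates), key=lambda i: q_values[i])
    return candidates[best]                          # motion that maximizes the Q value
```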
  • the functional configuration shown in FIG. 7 is merely an example, and the reinforcement learning model 720 may be configured by another functional configuration.
  • In the above description, the image analysis unit 721, the state and motion input unit 722, and the expected value calculation unit 724 are each configured using a neural network, but the entire reinforcement learning model 720 may instead be configured as a single neural network.
  • Although the description here concerns the functions during the reinforcement learning process, the functions after the reinforcement learning process is completed are the same as in the first embodiment. That is, after the reinforcement learning process is completed, the update unit 710 does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like. Further, the adjustment unit 725 outputs, as the information indicating the state after the operation of the gripping mechanism unit 111, the information derived based on the optimum information (the information indicating the motion (a) that maximizes the Q value). As a result, according to the reinforcement-learned model, it is possible to acquire a behavioral rule that maximizes the expected value of the discounted cumulative reward (Q value).
  • FIG. 8 is a second flowchart showing the flow of the reinforcement learning process.
  • the reinforcement learning process shown in FIG. 8 is merely an example, and a model for which reinforcement learning has been completed may be generated by executing the reinforcement learning process by another model generation method.
  • In step S801, the reinforcement learning model 720 of the reinforcement learning device 120 acquires the target object image.
  • In step S802, the reinforcement learning model 720 of the reinforcement learning device 120 acquires a captured image.
  • In step S803, the reinforcement learning model 720 of the reinforcement learning device 120 acquires information indicating the state (s) before the operation (the current state) of the gripping mechanism unit 111.
  • In steps S804 to S807, based on, for example, the ε-greedy method, information indicating one motion (a) is specified from the set of possible motions, and information indicating the state after the operation of the gripping mechanism unit 111 is output comprehensively.
  • Specifically, in the case of specifying the information indicating the motion (a) corresponding to the maximum Q value, steps S804 to S806 are executed and then the process proceeds to step S807. In the case of specifying information indicating a motion (a) randomly selected from the set of possible motions, the process proceeds directly to step S807.
  • In step S804, the reinforcement learning model 720 of the reinforcement learning device 120 calculates the Q value.
  • In step S805, the reinforcement learning model 720 of the reinforcement learning device 120 determines whether or not the Q value has been calculated a predetermined number of times. If it is determined in step S805 that the Q value has not been calculated the predetermined number of times (NO in step S805), the process proceeds to step S806.
  • In step S806, the reinforcement learning model 720 of the reinforcement learning device 120 adjusts the information indicating the motion (a) of the gripping mechanism unit 111, and the process returns to step S804.
  • If it is determined in step S805 that the Q value has been calculated the predetermined number of times (YES in step S805), the process proceeds to step S807.
  • In step S807, when steps S804 to S806 have been executed, the reinforcement learning model 720 of the reinforcement learning device 120 identifies the information indicating the motion (a) corresponding to the maximum Q value, derives the information indicating the state after the operation of the gripping mechanism unit 111, and transmits it to the drive control device 115. When steps S804 to S806 have not been executed, the reinforcement learning model 720 of the reinforcement learning device 120 specifies information indicating a randomly selected motion (a), derives the information indicating the state after the operation of the gripping mechanism unit 111, and transmits it to the drive control device 115.
  • In step S808, the update unit 710 of the reinforcement learning device 120 acquires information indicating the change in state caused by controlling the operation of the gripping mechanism unit 111.
  • In step S809, the update unit 710 of the reinforcement learning device 120 acquires the determination result of whether or not the gripping operation for the object to be gripped was successful and calculates the immediate reward. Further, the update unit 710 of the reinforcement learning device 120 acquires the predicted value of the expected value of the discounted cumulative reward (Q value) calculated by the expected value calculation unit 724.
  • In step S810, the update unit 710 of the reinforcement learning device 120 updates the model parameters of the reinforcement learning model 720 using the acquired information indicating the change in state, the calculated immediate reward, and the acquired predicted value of the discounted cumulative reward (Q value).
  • In step S811, the reinforcement learning device 120 determines whether or not to switch from the current target object image to a different target object image.
  • If it is determined in step S811 not to switch to a different target object image (NO in step S811), the process returns to step S802.
  • If it is determined in step S811 to switch to a different target object image (YES in step S811), the process proceeds to step S812.
  • In step S812, the update unit 710 of the reinforcement learning device 120 determines whether or not the end condition of the reinforcement learning process is satisfied.
  • the end condition of the reinforcement learning process is, for example, a condition defined by the user of the reinforcement learning system 100, and one example thereof is a target success probability of a gripping operation for a predetermined object.
  • If it is determined in step S812 that the end condition of the reinforcement learning process is not satisfied (NO in step S812), the process returns to step S801.
  • On the other hand, if it is determined in step S812 that the end condition of the reinforcement learning process is satisfied (YES in step S812), the reinforcement learning process ends.
  • the reinforcement learning model 720 after the reinforcement learning process is completed is applied to the object operation device as the reinforcement learning completed model.
  • The reinforcement-learned model applied to the object manipulation device executes only the processes of steps S801 to S807 in FIG. 8 (that is, it does not acquire information indicating the change in state, calculate the reward, update the model parameters, or the like). Further, in step S807, the optimum information is output as the information indicating the state after the operation of the gripping mechanism unit 111. That is, unlike during the reinforcement learning process, the gripping mechanism unit 111 does not perform various operations comprehensively, but instead performs the optimum operation selected from the set of possible operations (the operation that maximizes the Q value).
  • the predetermined operation performed on the specified type of object is not limited to the gripping operation, and may be any other operation. That is, the end effector attached to the tip end portion of the main body portion 113 of the manipulator 110 is not limited to the gripping mechanism portion 111, and may be any other operation mechanism portion.
  • The arbitrary operation referred to here includes, for example, a pressing operation for pushing a specified type of object, a suction operation for sucking a specified type of object, and an attraction operation for attracting a specified type of object with an electromagnet.
  • the image pickup device is attached to the tip portion of the manipulator, but the attachment position of the image pickup device is not limited to the tip portion of the manipulator. Any position may be used as long as the position and posture of the image pickup apparatus change according to the change in the position and posture of the gripping mechanism portion.
  • the gripping mechanism unit and the image pickup device may be attached to different manipulators, for example, and the above-mentioned reinforcement learning model can be applied even in that case.
  • the reinforcement learning model in this case may be configured to output information for controlling at least one of the position and the posture of the image pickup device in addition to the information for controlling the operation of the gripping mechanism unit.
  • In the above embodiments, the information indicating the state before the operation of the gripping mechanism unit, which is input to the reinforcement learning model, has been described as including information indicating the position and posture of the gripping mechanism unit and information indicating the opening and closing of the gripping mechanism unit. However, the information indicating the state before the operation of the gripping mechanism unit is not limited to these, and other information may be input.
  • In the above embodiments, the manipulator 110 and the reinforcement learning device 120 are configured as separate bodies, but the manipulator 110 and the reinforcement learning device 120 (or the object manipulation device) may be configured integrally. Alternatively, the drive control device 115 and the reinforcement learning device 120 (or the object manipulation device) may be configured integrally.
  • In the above embodiments, the reinforcement learning process has been described as being performed while actually controlling the operation of the gripping mechanism unit 111 based on the information indicating the state after the operation of the gripping mechanism unit 111 output from the reinforcement learning device 120. However, it is not necessary to actually control the operation of the gripping mechanism unit 111, and the reinforcement learning process may be performed using a simulator that simulates the actual environment. In this case, the image pickup device may also change its position and posture and perform imaging on the simulator. Further, the predetermined operation on the object to be operated and the generation of the operation result may be performed on the simulator.
  • the reinforcement learning device 120 may perform reinforcement learning processing in the case where the manipulator 110 to which the end effector is not attached to the tip portion operates the object to be operated by the main body 113. In this case, the reinforcement learning device 120 may output information for controlling the operation of the tip portion of the main body 113 of the manipulator 110.
  • In the above embodiments, both the position and posture of the tip portion of the main body 113 of the manipulator 110 are changed, but the manipulator may be configured to change at least one of the position and the posture. That is, the gripping mechanism unit 111 may be configured so that at least one of its position and posture changes. Further, the image pickup device 112 may be configured so that at least one of its position and posture changes with the change of at least one of the position and posture of the gripping mechanism unit 111. In this case, the reinforcement learning device 120 may output, as the information for controlling the operation of the gripping mechanism unit 111, information for controlling at least one of the position and posture of the gripping mechanism unit 111 and information for controlling the opening and closing of the gripping mechanism unit 111.
  • In the present specification (including the claims), when the expression "at least one of a, b, and c" or "at least one of a, b, or c" (including similar expressions) is used, it encompasses any of a, b, c, a-b, a-c, b-c, or a-b-c. It may also include a plurality of instances of any element, such as a-a, a-b-b, or a-a-b-b-c-c. It further includes adding elements other than the listed elements (a, b, and c), such as a-b-c-d.
  • When the terms "connected" and "coupled" are used in the present specification (including the claims), they are intended as non-limiting terms that include direct connection/coupling, indirect connection/coupling, electrical connection/coupling, communicative connection/coupling, operative connection/coupling, physical connection/coupling, and the like. The terms should be interpreted as appropriate according to the context in which they are used, but connection/coupling forms that are not intentionally or naturally excluded should not be interpreted as excluded from these terms.
  • In the present specification (including the claims), when an expression such as "element A configured to execute operation B" is used, it may include that the physical structure of element A has a configuration capable of executing operation B, and that a permanent or temporary setting (setting/configuration) of element A is set (configured/set) to actually execute operation B.
  • For example, when the element A is a general-purpose processor, it suffices if the processor has a hardware configuration capable of executing the operation B and is configured to actually execute the operation B by setting a permanent or temporary program (instructions).
  • When the element A is a dedicated processor, a dedicated arithmetic circuit, or the like, it suffices if the circuit structure of the processor is implemented so as to actually execute the operation B, regardless of whether control instructions and data are actually attached.
  • When a plurality of pieces of hardware perform predetermined processing, each piece of hardware may cooperate to perform the predetermined processing, or some of the hardware may perform all of the predetermined processing. Further, some hardware may perform a part of the predetermined processing, and other hardware may perform the rest of the predetermined processing.
  • In the present specification (including the claims), when expressions such as "one or a plurality of pieces of hardware perform a first process and the one or plurality of pieces of hardware perform a second process" are used, the hardware that performs the first process and the hardware that performs the second process may be the same or different. That is, the hardware that performs the first process and the hardware that performs the second process may each be included in the one or plurality of pieces of hardware.
  • the hardware may include an electronic circuit, a device including the electronic circuit, or the like.
  • When a plurality of storage devices (memories) store data, each storage device (memory) among the plurality of storage devices (memories) may store only a part of the data or may store the whole of the data.
  • 100: Reinforcement learning system, 110: Manipulator, 111: Gripping mechanism unit, 112: Image pickup device, 113: Main body, 115: Drive control device, 120: Reinforcement learning device, 310: Update unit, 311: Reward calculation unit, 320: State input unit, 330: Reinforcement learning model, 510: Target object image, 511: Object, 521, 522: Captured images, 611, 612: Captured images, 710: Update unit, 711: Reward calculation unit, 712: Parameter update unit, 720: Reinforcement learning model, 721: Image analysis unit, 722: State and motion input unit, 723: Addition unit, 724: Expected value calculation unit, 725: Adjustment unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a reinforcement learning device, a reinforcement learning system, an object manipulation device, a model generation method, and a reinforcement learning program with which the probability of success of a prescribed operation on an object can be increased. This reinforcement learning device has at least one memory and at least one processor, the at least one processor being configured so as to be able to: input information relating to a captured image captured by an imaging device at least one of whose position and orientation changes, and information relating to a target object image indicating an object to be manipulated by an end effector, into a training model that outputs information for controlling the operation of the end effector; and update a parameter of the training model based on the result of manipulating the object in a case where the operation of the end effector is controlled based on the information output by the training model.
PCT/JP2021/025392 2020-07-10 2021-07-06 Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program WO2022009859A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2020119349A JP2023145809A (ja) 2020-07-10 2020-07-10 Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program
JP2020-119349 2020-07-10

Publications (1)

Publication Number Publication Date
WO2022009859A1 true WO2022009859A1 (fr) 2022-01-13

Family

ID=79553121

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/025392 WO2022009859A1 (fr) Reinforcement learning device, reinforcement learning system, object manipulation device, model generation method, and reinforcement learning program

Country Status (2)

Country Link
JP (1) JP2023145809A (fr)
WO (1) WO2022009859A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017030135A (ja) * 2015-07-31 2017-02-09 ファナック株式会社 ワークの取り出し動作を学習する機械学習装置、ロボットシステムおよび機械学習方法
JP2019171540A (ja) * 2018-03-29 2019-10-10 ファナック株式会社 機械学習装置、機械学習装置を用いたロボット制御装置及びロボットビジョンシステム、並びに機械学習方法

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2017030135A (ja) * 2015-07-31 2017-02-09 ファナック株式会社 ワークの取り出し動作を学習する機械学習装置、ロボットシステムおよび機械学習方法
JP2019171540A (ja) * 2018-03-29 2019-10-10 ファナック株式会社 機械学習装置、機械学習装置を用いたロボット制御装置及びロボットビジョンシステム、並びに機械学習方法

Also Published As

Publication number Publication date
JP2023145809A (ja) 2023-10-12


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21838554

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21838554

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP