KR102590411B1 - Control policy for robot agents

Info

Publication number: KR102590411B1
Application number: KR1020197010314A
Authority: KR
Inventors: Chelsea Breanna Finn; Sergey Vladimir Levine
Original assignee: Google LLC
Priority date: 2016-09-15
Filing date: 2017-09-15
Publication date: 2023-10-17
Also published as: CN109844771A; WO2018053246A1; EP3497628A1; US11853876B2; JP6952767B2; US20240078429A1; CN109844771B; US20190251437A1; KR20190044683A; JP2019530925A

Abstract

A method includes receiving, for each of one or more objects, data identifying a respective target location to which a robotic agent interacting with a real-world environment should move the object, and causing the robotic agent to move the one or more objects to the one or more target locations by repeatedly performing operations. The operations include: receiving a current image of a current state of the real-world environment; determining, from the current image, a next sequence of actions to be performed by the robotic agent using a next image prediction neural network that predicts future images based on a current action and an action to be performed by the robotic agent; and instructing the robotic agent to perform the next sequence of actions.

Description

Control policy for robot agents

This specification relates to selecting actions to be performed by a robotic agent.

Robotic agents interact with an environment by receiving data characterizing a state of the environment and, in response, performing actions to carry out a robotic task. Some robotic agents use neural networks to select the action to be performed in response to receiving a given observation.

Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. Some neural networks are deep neural networks that include one or more hidden layers in addition to an output layer. The output of each hidden layer is used as input to the next layer in the network, that is, the next hidden layer or the output layer. Each layer of the network generates an output from a received input in accordance with the current values of a respective set of parameters.

This specification describes how a system implemented as computer programs on one or more computers in one or more locations can select actions to be performed by a robotic agent.

In general, one innovative aspect may be embodied in a method that includes receiving, for each of one or more objects, data identifying a respective target location to which a robotic agent interacting with a real-world environment should move the object. The method further includes causing the robotic agent to move the one or more objects to the one or more target locations by repeatedly performing the following operations: receiving a current image of a current state of the real-world environment; determining, from the current image, a next sequence of actions to be performed by the robotic agent using a next image prediction neural network that predicts future images based on a current action and an action to be performed by the robotic agent, where the next sequence is the sequence, out of a plurality of candidate sequences, that, if performed by the robotic agent starting from when the environment is in the current state, would be most likely to result in the one or more objects being moved to the respective target locations; and instructing the robotic agent to perform the next sequence of actions. The current image may be an image captured by a camera of the robotic agent.

Instructing the robotic agent to perform the next sequence of actions includes instructing the robotic agent to interrupt a current sequence of actions being performed by the robotic agent and to begin performing the next sequence of actions.

The method may further include providing, for presentation to a user, a user interface that allows the user to specify the objects to be moved and the target locations.

The next image prediction neural network may be a recurrent neural network that has been trained to receive as input at least a current image and an input action, and to process the input to generate a next image that is an image of a predicted next state of the environment if the robotic agent performs the input action when the environment is in the current state. As part of generating the next image, the recurrent neural network generates a flow map that identifies, for each of a plurality of pixels in the next image, a respective predicted likelihood of the pixel having moved from each of a plurality of pixels in the current image.

Determining the next sequence of actions may include determining, using flow maps generated by the next image prediction neural network, a respective likelihood for each of the candidate sequences that performance of the actions in the candidate sequence by the robotic agent would result in the objects being moved to the target locations.

Determining the next sequence of actions may include determining one or more pixels in the current image that depict the one or more objects as currently located in the environment.

Determining the respective likelihood for a given candidate sequence may include recursively feeding, as input to the neural network, the actions in the sequence and the next images generated by the neural network for those actions.

The method may further include sampling the candidate sequences from a distribution over possible action sequences. Sampling the candidate sequences may include performing multiple iterations of sampling using a cross-entropy technique.

Another innovative aspect may be embodied in a system of one or more computers and one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the method described above.

Another innovative aspect may be embodied in one or more non-transitory storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method described above.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages. By combining a learned predictive model with a model predictive control (MPC) based controller to select the actions to be performed by a robotic agent, the system described in this specification allows the robotic agent to effectively move objects to target locations using only entirely unlabeled training data, that is, without spending additional computational resources to label training images and without requiring training images to be labeled by a user. Moreover, the techniques described here do not require a calibrated camera, an instrumented training setup, or precise sensing and actuation by the robotic agent. The described techniques also enable the robotic agent to handle novel objects that were not seen during training of the next image prediction neural network, so they generalize better across different tasks and objects.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1 is a block diagram of an example robot manipulation system.
FIG. 2 is a flow diagram of an example process by which a robotic agent moves an object to a target location.
FIG. 3 is a flow diagram of an example process for determining the next sequence of actions to be performed by a robotic agent to move an object to a target location.
Like reference numbers and designations in the various drawings indicate like elements.

FIG. 1 is a block diagram of an example robot manipulation system 100. The system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations, in which the systems, components, and techniques described below can be implemented.

In general, the system 100 is configured to select actions to be performed by a robotic agent 102 that interacts with a real-world environment 104, so that the agent 102 moves one or more objects from respective initial locations in the environment 104 to respective final locations in the environment 104.

In particular, the system 100 includes a controller 106 configured to receive data 112 identifying, for each of one or more objects in the real-world environment 104, a respective target location, e.g., target location 114, to which the robotic agent 102 should move the object.

In some implementations, the system 100 may provide, for presentation to a user, a user interface that allows the user to specify the one or more objects to be moved and the target locations. For example, the user may specify a source pixel in an initial image that depicts an object as it is currently located in the environment 104, and then specify where the source pixel should be moved, i.e., another pixel in the initial image. The initial image may be captured by a camera of the robotic agent 102 and characterizes an initial state of the environment 104.

For example, the initial image shows a cup placed on a rectangular tray. The user can specify a source pixel that belongs to the cup in the image and then specify a target location in the image to which the source pixel should be moved, e.g., a location near one of the corners of the rectangular tray. Given this goal specification, the system 100 can control the robotic agent to move the cup to the specified corner of the rectangular tray. In some other implementations, instead of receiving the data identifying the objects to be moved and the target locations from a user, the system may receive the data from one or more other systems, e.g., one or more other robotic systems.

The controller 106 then uses a learned predictive model 108 to control the robotic agent 102 to move the object toward the target location 114.

The learned predictive model 108 includes a next image prediction neural network 110. The next image prediction neural network 110 has been trained to receive as input a current image of a current state of the environment 104 and an input action. The input action may be, for example, pushing an object toward the target location or rotating the object. The neural network 110 then processes the received input to generate a next image that is an image of a predicted next state of the environment if the robotic agent performs the input action when the environment is in the current state. As part of generating the next image, the neural network 110 generates a flow map that identifies, for each of a plurality of pixels in the next image, a respective predicted likelihood of the pixel having moved from each of a plurality of pixels in the current image.

An example of the next image prediction neural network 110 is a recurrent neural network that includes a stack of convolutional long short-term memory (LSTM) neural network layers, one or more convolutional neural network layers, and a set of normalized convolutional kernels. Convolutional LSTM neural network layers are similar to regular LSTM neural network layers, but their gates are implemented by convolutions instead of fully connected neural network layers. Convolutional LSTM neural network layers are described in more detail in X. Shi et al., "Convolutional LSTM network: A machine learning approach for precipitation nowcasting," Neural Information Processing Systems (NIPS), 2015. The above example of the neural network 110 is described in more detail in C. Finn, I. Goodfellow, and S. Levine, "Unsupervised learning for physical interaction through video prediction," Neural Information Processing Systems (NIPS), 2016.
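For orientation, the convolutional LSTM layer is commonly written in the following form, as in the cited Shi et al. formulation. Here * denotes convolution and \circ the element-wise product; X_t is the layer input, H_t the hidden state, and C_t the cell state. Whether network 110 keeps the peephole terms (W_{ci}, W_{cf}, W_{co}) is not stated in this specification, so they should be read as an assumption carried over from that formulation.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + W_{ci} \circ C_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + W_{cf} \circ C_{t-1} + b_f\right) \\
C_t &= f_t \circ C_{t-1} + i_t \circ \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + W_{co} \circ C_t + b_o\right) \\
H_t &= o_t \circ \tanh\left(C_t\right)
\end{aligned}
```

Replacing each convolution with a matrix product recovers the regular (fully connected) LSTM layer, which is the difference the paragraph above describes.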

At a given time step, given the target location 114, the controller 106 causes the robotic agent 102 to move the object to the target location. Causing the robotic agent 102 to move the object to the target location includes repeatedly performing the process 200, described below with reference to FIG. 2, until the object reaches the target location.

For example, the controller 106 receives a current image 116 characterizing the current state of the environment 104 and then uses the learned predictive model 108 to determine a next sequence of actions 118 to be performed by the robotic agent 102 to move the object to the target location 114. Determining the next sequence of actions using the learned predictive model 108 is described in more detail below with reference to FIG. 3. The controller 106 then instructs the robotic agent 102 to perform the next sequence of actions 118. If the robotic agent is performing a current sequence of actions, the controller 106 causes the robotic agent 102 to interrupt the current sequence and begin performing the next sequence of actions 118. After the agent 102 has performed one or more actions in the next sequence of actions 118, the controller 106 may receive a new image characterizing a new state of the environment. The controller 106 updates the current image 116 with the new image and then determines another next sequence of actions. The controller 106 may repeatedly determine next sequences of actions until a predetermined number of next sequences have been determined or until the robotic agent 102 has successfully moved the object to the target location 114.

FIG. 2 is a flow diagram of an example process 200 for causing a robotic agent to move an object to a target location in a real-world environment. For convenience, the process 200 will be described as being performed by a system of one or more computers located in one or more locations. For example, a robot manipulation system, e.g., the robot manipulation system 100 of FIG. 1, or a component of a robot manipulation system, e.g., the controller 106 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 200.

The system receives a current image of a current state of the real-world environment, e.g., as captured by a camera of the robotic agent (step 202). If the current image is an initial image I_t that depicts the object as initially located in the real-world environment (i.e., it depicts the initial state of the real-world environment), the goal is to move a single designated pixel d_t = (x_d, y_d) in the initial image I_t to a target location g = (x_g, y_g) in the initial image I_t. If the current image is not the initial image, e.g., if it is a next image I_{t+1} depicting a next state of the real-world environment, the system can update the designated pixel from d_t to d_{t+1}, e.g., by using optical flow computed from the image observations I_{t:t+1} and d_t. Provided the goal does not change, the target location g = (x_g, y_g) remains the same in the next images predicted by the system.
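The pixel update from d_t to d_{t+1} can be pictured with a minimal sketch. It assumes a dense optical-flow field has already been computed between I_t and I_{t+1} by some external estimator; the array layout, the function name `update_designated_pixel`, and the rounding and clamping choices are illustrative assumptions, not details taken from this specification.

```python
import numpy as np

def update_designated_pixel(d_t, flow):
    """Shift the designated pixel d_t = (x, y) by the flow vector at that pixel.

    Assumption: `flow` has shape (height, width, 2) and flow[y, x] = (dx, dy)
    is the estimated displacement of pixel (x, y) between I_t and I_{t+1}.
    """
    x, y = d_t
    height, width = flow.shape[:2]
    dx, dy = flow[y, x]
    # Round to the nearest pixel and clamp so the result stays inside the image.
    new_x = int(np.clip(np.rint(x + dx), 0, width - 1))
    new_y = int(np.clip(np.rint(y + dy), 0, height - 1))
    return new_x, new_y
```

The target location g is left untouched by this update; only the designated pixel follows the object.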

The system uses the learned predictive model to determine, from the current image, a next sequence of actions to be performed by the robotic agent to move the object to the target location (step 204). The next sequence is the sequence, out of a plurality of candidate sequences, that, if performed by the robotic agent starting from when the environment is in the current state, would be most likely to result in the object being moved to the target location.

For example, the system uses the learned predictive model to determine the next sequence of H future actions that, if performed by the robotic agent, would be most likely to move the designated pixel d_t to the target location g.

Determining the next sequence of actions is described in more detail below with reference to FIG. 3.

The system then instructs the robotic agent to perform the next sequence of actions (step 206). If the robotic agent is performing a current sequence of actions, the system may cause the robotic agent to interrupt the current sequence and begin performing the next sequence of actions instead. After the agent has performed the first action in the next sequence of actions, the system receives a new image characterizing a new state of the environment. In some implementations, the system may receive a new image only after the robotic agent has performed a predetermined number (> 1) of actions in the next sequence of actions.

For example, the system instructs the agent to perform the determined next sequence of actions. After the agent has performed the first action in the sequence, the system receives a new image I_{t+1}. The system then updates the designated pixel from d_t to d_{t+1}, e.g., by using optical flow computed from the image observations I_{t:t+1} and d_t. The system then uses the learned predictive model to determine a new next sequence of actions to be performed by the robotic agent, based on the new designated source pixel d_{t+1} and the target location g. The system then instructs the robotic agent to interrupt the next sequence of actions being performed and to begin performing one or more actions in the new next sequence of actions. After the robotic agent has performed the first action in the new next sequence of actions, the system can receive a new image of the new state of the real-world environment and update the designated pixel. In some implementations, the system may receive a new image and update the designated pixel only after the robotic agent has performed several actions in the new next sequence of actions.

The system can repeatedly perform steps 202-206 until the robotic agent has successfully moved the object to the target location.
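The repetition of steps 202-206 amounts to a replanning loop in the style of model predictive control. The following is a minimal sketch only: the four callables, the `max_replans` cap, and the exact-match stopping test are hypothetical placeholders introduced for illustration, and the sketch replans after every single executed action, which is only one of the variants described above.

```python
def run_control_loop(get_image, plan, execute_first_action, track_pixel,
                     d, g, max_replans=50):
    """Replan after every executed action until pixel d reaches the goal g.

    All four callables are hypothetical placeholders:
      get_image()                   -> current image (step 202)
      plan(image, d, g)             -> next sequence of actions (step 204)
      execute_first_action(actions) -> runs only the first action (step 206)
      track_pixel(old, new, d)      -> updated designated pixel
    Returns the final designated pixel and the number of replans used.
    """
    image = get_image()
    for replans in range(max_replans):
        if d == g:
            return d, replans
        actions = plan(image, d, g)
        # Only the first action runs; the rest of the plan is discarded
        # once a fresh image arrives and a new plan is computed.
        execute_first_action(actions)
        new_image = get_image()
        d = track_pixel(image, new_image, d)
        image = new_image
    return d, max_replans
```

Discarding the remainder of each plan corresponds to the interruption of the sequence being performed; a variant that executes several actions before replanning would change only the execution step.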

FIG. 3 is a flow diagram of an example process 300 for determining the next sequence of actions to be performed by a robotic agent to move an object to a target location.

For convenience, the process 300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a robot manipulation system, e.g., the robot manipulation system 100 of FIG. 1, appropriately programmed in accordance with this specification, can perform the process 300.

In general, the process 300 involves multiple iterations of sampling candidate action sequences using a cross-entropy technique. A cross-entropy technique is an iterative procedure in which each iteration can be broken down into two phases: (a) generating a random data sample according to a specified mechanism, and (b) updating the parameters of the specified mechanism based on this data in order to produce a better sample in the next iteration.
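The two phases can be illustrated with a generic sketch of a cross-entropy search over action sequences. Several choices here are assumptions made for illustration rather than details stated above: actions are continuous vectors in [-1, 1], the "better sample" update refits an independent Gaussian to the highest-scoring ("elite") samples, and `score` is a hypothetical callable standing in for the likelihood computed with the learned predictive model.

```python
import numpy as np

def cross_entropy_search(score, horizon, action_dim, iterations=3,
                         num_samples=100, num_elite=10, seed=0):
    """Return the best-scoring action sequence found by a cross-entropy search."""
    rng = np.random.default_rng(seed)
    shape = (num_samples, horizon, action_dim)
    # Phase (a), first iteration: sample from a uniform distribution.
    samples = rng.uniform(-1.0, 1.0, size=shape)
    for j in range(iterations):
        scores = np.array([score(s) for s in samples])
        # Keep the highest-scoring samples, sorted from worst to best.
        elite = samples[np.argsort(scores)[-num_elite:]]
        if j < iterations - 1:
            # Phase (b): refit the sampling distribution to the elite set,
            # then phase (a) again with the updated parameters.
            mean = elite.mean(axis=0)
            std = elite.std(axis=0) + 1e-6
            samples = rng.normal(mean, std, size=shape)
    return elite[-1]
```

With a toy score such as the negative squared distance to a fixed target sequence, the returned sequence moves toward that target as the number of iterations grows.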

First, at a given time step t, the system initializes a distribution over a set of possible action sequences that could be performed by the robotic agent to move the object to the target location (step 302). The initialized distribution may be, e.g., a uniform distribution. Each of the possible action sequences has the same length, e.g., each sequence has H actions. The system also specifies a number of iterations J (J > 1) to be used for sampling candidate action sequences. That is, J is the number of times the system repeats steps 304-308, as described below. J may be specified by a user of the system.

The system then samples M candidate action sequences from the distribution (step 304). For example, in the first iteration, the system samples M action sequences of length H from the initialized uniform distribution.

Next, the system uses the learned predictive model to determine, for each of the M candidate sequences, a respective probability that performance of the actions in the candidate sequence by the robotic agent would result in the object being moved to the target location, i.e., would result in the designated pixel d_t in the current image I_t being moved to the target location g = (x_g, y_g) (step 306). The learned predictive model includes a next image prediction neural network that is a recurrent neural network. In particular, for each of the M candidate sequences, the system first feeds the first action in the candidate sequence and the current image characterizing the current state of the environment as input to the next image prediction neural network. The next image prediction neural network then processes the input to generate a first next image, which is an image of the predicted next state of the environment if the robotic agent performs the first action when the environment is in the current state. As part of generating the first next image, the next image prediction neural network generates a first flow map that identifies, for each of a plurality of pixels in the first next image, a respective predicted probability (e.g., likelihood) of the pixel having moved from each of a plurality of pixels in the current image.

The system then feeds, as input to the next image prediction neural network, the second action that follows the first action in the sequence and the first next image generated by the next image prediction neural network. The next image prediction neural network then processes the input to generate a second next image, which is an image of the predicted next state of the environment if the robotic agent performs the second action. While generating the second next image, the next image prediction neural network generates a second flow map that identifies, for each of a plurality of pixels in the second next image, a respective probability of the pixel having moved from each of a plurality of pixels in the current image.
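The pattern of feeding each predicted image back in with the following action can be sketched as a simple rollout. The predictor below is a hypothetical toy that just shifts the image and is not the trained network; flow maps are omitted, and only the recursion over predicted images is shown.

```python
import numpy as np

def toy_predict_next_image(image, action):
    """Hypothetical stand-in for the trained network: shift the image by (dx, dy)."""
    dx, dy = action
    return np.roll(image, shift=(dy, dx), axis=(0, 1))

def rollout(predict_next_image, current_image, actions):
    """Feed each action, with the most recent predicted image, back into the model."""
    predicted = []
    image = current_image
    for action in actions:
        # The prediction for one step becomes the input image for the next step.
        image = predict_next_image(image, action)
        predicted.append(image)
    return predicted
```

Because every step consumes the previous prediction rather than a real camera image, prediction errors can accumulate over the sequence, which is one reason to replan as new images arrive.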

The system recursively feeds the subsequent actions in the sequence, together with the next images generated by the neural network for those actions, as input to the next image prediction neural network, and the neural network repeatedly processes the input to generate next images and corresponding flow maps until all of the actions in the action sequence have been processed. The flow map generated at each step provides the probability that each pixel of the next image came from each pixel of the current image. After the last action of the sequence has been processed, the system uses the next image prediction neural network to determine a final flow map that identifies, for each of a plurality of pixels in the final next image I_{t+H-1}, a respective probability of the pixel having moved to the final next image I_{t+H-1} from each of a plurality of pixels in the current image I_t. That is, the final flow map provides the probability that each pixel of the final next image I_{t+H-1} came from each pixel of the current image I_t. Then, based on the final flow map and the target location g = (x_g, y_g), which remains the same in every next image generated by the next image prediction neural network, the system determines the probability that the designated pixel d_t in the current image I_t will be moved to the target location g in the final next image if the agent performs the actions in the candidate sequence.
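The recursive rollout and the final probability lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `predict_next` is a hypothetical stand-in for the next image prediction neural network, each per-step flow map is assumed to be a matrix over flattened pixel indices whose columns sum to 1, and chaining the per-step maps by matrix multiplication is an assumption made here to obtain a map relative to the current image. The `shift_right_predictor` is a toy that ignores the action.

```python
import numpy as np


def rollout_goal_probability(predict_next, current_image, actions,
                             designated_pixel, goal):
    """Roll one candidate action sequence through a one-step predictor.

    predict_next(image, action) -> (next_image, step_flow) is hypothetical.
    step_flow[dst, src] is assumed to be the probability that flattened
    pixel `dst` of the next image came from flattened pixel `src` of the
    input image. Pixels are given as (row, column).
    """
    height, width = current_image.shape[:2]
    n = height * width
    cumulative = np.eye(n)  # flow from the current image to itself
    image = current_image
    for action in actions:
        # Feed the previously predicted image back in with the next action.
        image, step_flow = predict_next(image, action)
        # Assumption: compose step maps so `cumulative` stays relative
        # to the original current image.
        cumulative = step_flow @ cumulative
    src = designated_pixel[0] * width + designated_pixel[1]
    dst = goal[0] * width + goal[1]
    return image, float(cumulative[dst, src])


def shift_right_predictor(image, action):
    """Toy predictor: every pixel moves one column right (with wraparound)."""
    height, width = image.shape[:2]
    n = height * width
    flow = np.zeros((n, n))
    for r in range(height):
        for c in range(width):
            flow[r * width + (c + 1) % width, r * width + c] = 1.0
    return np.roll(image, 1, axis=1), flow
```

With the toy predictor, a designated pixel at (0, 0) reaches column 2 of the same row after two actions, so the probability for goal (0, 2) is 1 and the probability for any other goal is 0. A learned network would instead spread probability mass over several destination pixels.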

After the system has determined a respective probability for each of the M candidate sequences, the system selects, from the M candidate action sequences, the K candidate action sequences with the highest probabilities (step 308).

The system then determines whether the number of iterations of sampling and selecting the K candidate action sequences (i.e., steps 304 to 308) has reached a specified number J of iterations (step 310).

If the number of iterations has not reached J, the system fits a new distribution to the selected K candidate action sequences (step 312). That is, the system determines a new distribution that fits the selected K candidate action sequences. The new distribution may be, for example, a multivariate Gaussian distribution.

The system then repeats steps 304 to 312, i.e., resampling a new set of M action sequences from the new distribution, determining a respective probability for each of the M candidate sequences, selecting K candidate action sequences from the M candidate sequences, and refitting a new distribution, until the system has completed J iterations.

At the end of the final iteration, the system selects, from among the K candidate action sequences, the action sequence with the highest probability that performance of the actions in the sequence will result in the object being moved toward the target location (step 314).
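The sample, score, select, and refit loop of steps 304 to 314 can be sketched as a cross-entropy style search. This is a rough sketch under assumptions rather than the system's actual code: the names `cem_plan`, `score_fn`, and `toy_probability` are hypothetical, the initial distribution is assumed to be a standard normal (this passage does not say how step 304 initializes it), and a small diagonal term is added to the fitted covariance purely for numerical stability. In the described system, `score_fn` would be the probability obtained from the final flow map.

```python
import numpy as np


def cem_plan(score_fn, horizon, action_dim, m=100, k=10, j=3, seed=0):
    """Return the best of the K elite sequences after J iterations."""
    rng = np.random.default_rng(seed)
    dim = horizon * action_dim
    mean = np.zeros(dim)  # assumed initial distribution: standard normal
    cov = np.eye(dim)
    for iteration in range(j):
        # Step 304: sample M candidate sequences (flattened).
        samples = rng.multivariate_normal(mean, cov, size=m)
        # Step 306: score each candidate.
        scores = np.array(
            [score_fn(s.reshape(horizon, action_dim)) for s in samples]
        )
        # Step 308: keep the K highest-scoring candidates (best is last).
        elite_idx = np.argsort(scores)[-k:]
        elites = samples[elite_idx]
        # Steps 310-312: refit a multivariate Gaussian unless this was
        # the final iteration.
        if iteration < j - 1:
            mean = elites.mean(axis=0)
            cov = np.cov(elites, rowvar=False) + 1e-6 * np.eye(dim)
    # Step 314: pick the highest-scoring elite of the final iteration.
    best = elites[-1].reshape(horizon, action_dim)
    return best, float(scores[elite_idx[-1]])


def toy_probability(actions):
    """Toy score in (0, 1]: high when summed actions approach a target."""
    target = np.array([1.0, -1.0])
    return float(np.exp(-np.sum((actions.sum(axis=0) - target) ** 2)))


plan, score = cem_plan(toy_probability, horizon=2, action_dim=2)
```

Here `k` must be at least 2 so that a covariance can be estimated from the elites. Refitting only to the top K samples concentrates later samples around action sequences that scored well in earlier iterations.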

The system then causes the robotic agent to perform one or more actions from the selected action sequence in order to move the object toward the target location (step 316). The system may instruct the robotic agent to interrupt the current sequence of actions being performed and to begin performing the selected action sequence.
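The repeated cycle of receiving an image, planning, and executing part of the selected sequence can be sketched as a simple replanning loop. This is only an illustrative outline: `get_image`, `plan`, and `execute` are hypothetical callables standing in for the camera, the planner of steps 304 to 314, and the robot interface, and the fixed number of replanning rounds is an assumption made to keep the example finite.

```python
def control_loop(get_image, plan, execute, num_replans, actions_per_plan=1):
    """Replan from a fresh image, then execute a prefix of the new plan."""
    executed = []
    for _ in range(num_replans):
        image = get_image()      # current image of the environment
        sequence = plan(image)   # selected action sequence for this image
        # Executing only a prefix before replanning means any remaining
        # actions of the previous sequence are abandoned.
        for action in sequence[:actions_per_plan]:
            execute(action)
            executed.append(action)
    return executed
```

Executing only the first few actions before replanning lets the controller correct for prediction errors using the newest image, at the cost of running the planner more often.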

In view of the above description, it will be appreciated that the systems and methods described herein may provide one or more advantages. For example, the system allows a robotic agent to effectively move objects to target locations using a neural network that was trained only on entirely unlabeled training data. In addition, the system allows objects to be moved by the robotic agent using a simpler setup that does not require calibrated cameras, an instrumented training setup, or precise sensing and actuation by the robotic agent. The described techniques also allow the robotic agent to handle novel objects that were not seen during training of the next image prediction neural network, which broadens the potential usefulness of the system.

This specification uses the term "configured" in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation causes the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by a data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, or in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or additionally, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to a suitable receiver apparatus for execution by a data processing apparatus.

The term "data processing apparatus" refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including, for example, a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit). In addition to hardware, the apparatus can optionally include code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, subprograms, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application specific integrated circuit), and the apparatus can also be implemented as such special purpose logic circuitry.

Computers suitable for the execution of a computer program can be based on, for example, general purpose or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a GPS receiver, or a portable storage device (e.g., a universal serial bus (USB) flash drive).

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, for example, semiconductor memory devices such as EPROM, EEPROM, and flash memory devices; magnetic disks such as internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback, and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user, for example, by sending web pages to a web browser on the user's client device in response to requests received from the web browser. A computer can also interact with a user by sending text messages or other forms of messages to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using a machine learning framework, e.g., the "TensorFlow" framework, the "Microsoft Cognitive Toolkit" framework, the "Apache Singa" framework, or the "Apache MXNet" framework.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., a data server; a middleware component, e.g., an application server; a front-end component, e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification; or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), e.g., the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying drawings do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

Claims

A method comprising:
receiving, by one or more computers, data identifying, for each of one or more objects, a respective target location to which a robotic agent interacting with a real-world environment should move the object; and
causing, by the one or more computers, the robotic agent to move the one or more objects to the one or more target locations by repeatedly performing operations,
the operations comprising:
receiving a current image of a current state of the real-world environment;
determining, from the current image, a next sequence of actions to be performed by the robotic agent using a next image prediction neural network that predicts future images based on a current action and an action to be performed by the robotic agent, wherein the next sequence is the candidate sequence, among a plurality of candidate sequences, that, if performed by the robotic agent starting when the environment is in the current state, would most likely result in the one or more objects being moved to the respective target locations, and wherein determining the next sequence of actions comprises:
for each of the plurality of candidate sequences:
recursively feeding, as input to the next image prediction neural network, subsequent actions of the candidate sequence and the next images generated by the next image prediction neural network for those actions, each of the next images being associated with a flow map that provides, for each pixel in the next image, the probability of the pixel having moved from each pixel in the current image,
determining, using the next image prediction neural network, a final flow map for a final next image of the candidate sequence, the final next image being an image of the predicted final state of the environment if the robotic agent performs all of the actions in the candidate sequence, and
determining, based on the final flow map and the target locations, a probability that performance of all of the actions of the candidate sequence by the robotic agent will result in the one or more objects being moved to the respective target locations in the final next image; and
selecting, from the plurality of candidate sequences, the candidate sequence with the highest probability as the next sequence of actions; and
instructing the robotic agent to perform the next sequence of actions, wherein the next image prediction neural network is configured to receive as input at least a current image and an input action, and to process the input to generate a next image that is an image of a predicted next state of the environment if the robotic agent performs the input action when the environment is in the current state.

The method of claim 1, wherein the current image is an image captured by a camera of the robotic agent.

The method of claim 1, further comprising providing, for presentation to a user, a user interface that allows the user to specify the objects to be moved and the target locations.

The method of claim 1, wherein instructing the robotic agent to perform the next sequence of actions comprises instructing the robotic agent to interrupt a current sequence of actions being performed by the robotic agent and to begin performing the next sequence of actions.

The method of claim 1, wherein the next image prediction neural network is a recurrent neural network, and
wherein, as part of generating the next images, the recurrent neural network generates, for each of the next images, a flow map that identifies, for each of a plurality of pixels in the next image, a respective probability of the pixel having moved from each of a plurality of pixels in the current image.

delete

The method of claim 1, wherein determining the next sequence of actions comprises determining one or more pixels of the current image that depict the one or more objects as currently located in the environment.

delete

The method of claim 1, further comprising sampling the candidate sequences from a distribution over possible action sequences.

The method of claim 9, wherein sampling the candidate sequences comprises performing multiple iterations of sampling using a cross-entropy technique.

A system comprising:
one or more computers; and
one or more storage devices storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 5, 7, 9, and 10.

One or more computer-readable storage media encoded with instructions that, when executed by one or more computers, cause the one or more computers to perform the operations of the method of any one of claims 1 to 5, 7, 9, and 10.

A computer program stored on a computer-readable storage medium, the computer program comprising computer-readable instructions that, when executed by one or more computers, cause the one or more computers to perform the method of any one of claims 1 to 5, 7, 9, and 10.