JP2021197151A

JP2021197151A - Object-to-robot pose estimation from single rgb image

Info

Publication number: JP2021197151A
Application number: JP2021018845A
Authority: JP
Inventors: Tremblay Jonathan; Walter Tyree Stephen; Thomas Birchfield Stanley
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2020-06-15
Filing date: 2021-02-09
Publication date: 2021-12-27

Abstract

PROBLEM TO BE SOLVED: To provide methods and systems for performing object-to-object pose estimation from a single image.

SOLUTION: A method comprises: identifying an image of a first object and a target object, the image captured by a camera external to the first object and the target object; processing the image, using a first neural network, to estimate a first pose of the target object with respect to the camera; processing the image, using a second neural network, to estimate a second pose of the first object with respect to the camera; and calculating a third pose of the first object with respect to the target object, using the first pose and the second pose.

SELECTED DRAWING: Figure 1

Description

This application is a continuation-in-part of U.S. Patent Application No. 16/405,662 (Ref: 741851/18-RE-0161-US02), entitled "DETECTING AND ESTIMATING THE POSE OF AN OBJECT USING A NEURAL NETWORK MODEL," filed May 7, 2019, which claims priority to U.S. Provisional Patent Application No. 62/672,767 (Ref: 18-RE-0161US01), entitled "DETECTION AND POSE ESTIMATION OF HOUSEHOLD OBJECTS FOR HUMAN-ROBOT INTERACTION," filed May 17, 2018, both of which are incorporated herein by reference in their entirety.

This application is also a continuation-in-part of U.S. Patent Application No. 16/657,220 (Ref: 1R2674.006901/19-SE-0341US01), entitled "POSE DETERMINATION USING ONE OR MORE NEURAL NETWORKS," filed October 18, 2019, which is incorporated herein by reference in its entirety.

The present disclosure relates to systems and methods for pose estimation.

Pose estimation generally refers to computer vision techniques that determine the Euclidean position and orientation of an object, usually with respect to a particular camera. Pose estimation has many uses, but is particularly useful in the context of robotic manipulation systems. To date, robotic manipulation systems have required a camera mounted on the robot itself (i.e., an in-hand camera) to capture images of an object, and/or a camera external to the robot to capture images of the object. In both cases, the camera must be calibrated with respect to the robot before the pose of the captured object can be estimated relative to the robot.

Calibration of an in-hand camera needs to be performed only once to determine the position of the camera relative to the robot. However, because the camera is rigidly mounted on the robot, it has a limited field of view, which prevents it from seeing the surrounding context and from being easily adjusted when the object moves out of the camera's field of view. An external camera can have a wide field of view and can optionally supplement an in-hand camera, but it must be recalibrated every time it is moved, however slightly. Calibration, moreover, is typically a tedious, delicate, and error-prone offline process. Although these problems are described specifically with respect to robotic manipulation systems, note that the same limitations apply to any pose estimation system operable to estimate the pose of one object relative to another object (which may or may not be moving).

There is a need to address these issues and/or other issues associated with the prior art.

A method, computer-readable medium, and system are disclosed for object-to-object pose estimation from an image. In use, an image of a first object and a target object is identified, the image being captured by a camera external to the first object and the target object. In addition, the image is processed using a first neural network to estimate a first pose of the target object with respect to the camera. Further, the image is processed using a second neural network to estimate a second pose of the first object with respect to the camera. Still further, a third pose of the first object with respect to the target object is calculated using the first pose and the second pose.

A diagram of a method for object-to-object pose estimation from an image, according to one embodiment.
A diagram of a system for object-to-object pose estimation from an image, according to one embodiment.
A block diagram associated with the first neural network of FIG. 2, according to one embodiment.
A block diagram associated with the second neural network of FIG. 2, according to one embodiment.
A diagram of a robotic grasping system controlled using a robot-to-object pose estimation system, according to one embodiment.
A diagram illustrating inference and/or training logic, according to at least one embodiment.
A diagram illustrating inference and/or training logic, according to at least one embodiment.
A diagram illustrating training and deployment of a neural network, according to at least one embodiment.
A diagram illustrating an exemplary data center system, according to at least one embodiment.

FIG. 1 illustrates a method 100 for object-to-object pose estimation from an image, according to one embodiment. The method 100 may be performed by any computing system, which may include one or more computing devices, one or more computer processors, non-transitory memory, circuitry, and the like. In one embodiment, the non-transitory memory may store instructions executable by the one or more computing devices and/or the one or more computer processors to perform the method 100. In another embodiment, the circuitry may be configured to perform the method 100.

As shown in operation 102, an image of a first object and a target object is identified, the image being captured by a camera external to the first object and the target object. In the context of the present description, the target object and the first object are physically separate objects (e.g., located in some proximity to one another). In one embodiment, the first object may be a robotic grasping system having a robotic arm for grasping objects. In furtherance of this embodiment, the target object may be a known object to be grasped by the robotic grasping system. In another embodiment, the first object may be another autonomous object, such as an autonomous vehicle, that interacts with (or makes decisions relating to) a target object such as another vehicle. Of course, however, the first object and the target object may be any objects for which the pose between the two is desired (e.g., for a computer vision application).

As mentioned above, a camera external to the first object and the target object captures the image of the first object and the target object. With respect to the present description, the camera is external to the first object and the target object by virtue of being positioned independently of them. For example, the camera may be mounted on neither the first object nor the target object. In one embodiment, the camera may be mounted on a tripod or other rigid surface to capture the image of the first object and the target object.

The image of the target object may be a red-green-blue (RGB) image in one embodiment, or a grayscale image in another embodiment. The image may or may not include depth. In another embodiment, the image may be a single image. In a further embodiment, the image may be a combination of images obtained from various types of sensors, such as a six-channel image that may include red, green, blue, infrared, ultraviolet, and radar. Furthermore, the camera may be a monocular RGB camera. Of course, however, the camera may capture any number of channels of any wavelength of light (visible or invisible to humans), or even of non-light wavelengths. For example, the camera may use infrared, ultraviolet, microwave, radar, sonar, or other technologies to capture the image.

In addition, as shown in operation 104, the image is processed using a first neural network to estimate a first pose of the target object with respect to the camera. The first neural network may be trained to output two-dimensional (2D) image locations (x, y coordinates) of keypoints of the target object. These 2D image locations may then be used, together with the 3D coordinates of the target object model, to estimate the pose of the target object with respect to the camera. In one embodiment, a perspective-n-point (PnP) algorithm may be used to calculate the pose of the target object with respect to the camera. In another embodiment, the first pose of the target object with respect to the camera may include a three-dimensional (3D) rotation and translation of the target object with respect to the camera.
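The relationship between the 3D model keypoints, the pose, and the 2D image locations can be sketched as a pinhole projection. This is the forward model that a PnP solver inverts: given the 2D locations and the 3D model coordinates, PnP searches for the rotation and translation that reproduce them. The sketch below is illustrative only; the function name, the pinhole intrinsics `fx`, `fy`, `cx`, `cy`, and the absence of lens distortion are assumptions not taken from the disclosure.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def project_keypoints(
    model_points: Sequence[Vec3],
    rotation: Sequence[Sequence[float]],  # 3x3, object frame -> camera frame
    translation: Vec3,                    # object origin expressed in the camera frame
    fx: float, fy: float, cx: float, cy: float,  # assumed pinhole intrinsics
) -> List[Tuple[float, float]]:
    """Project 3D model keypoints into the image (hypothetical helper)."""
    pixels = []
    for p in model_points:
        # Rigid transform into the camera frame: X_cam = R * X_obj + t
        xc, yc, zc = (
            sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
            for i in range(3)
        )
        if zc <= 0:
            raise ValueError("point is behind the camera")
        # Perspective division followed by the intrinsic mapping to pixels
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels
```

A PnP solver would take the detected 2D keypoints and the same `model_points` as input and return the `rotation` and `translation` for which this projection best matches the detections.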

By way of example, the first neural network may be that disclosed in U.S. Patent Application No. 16/405,662 (Ref: 741851/18-RE-0161-US02), entitled "DETECTING AND ESTIMATING THE POSE OF AN OBJECT USING A NEURAL NETWORK MODEL," filed May 7, 2019, which is incorporated herein by reference in its entirety. Further details relating to an embodiment of the first neural network are described below with reference to FIG. 3.

Further, as shown in operation 106, the image is processed using a second neural network to estimate a second pose of the first object with respect to the camera. The second neural network may be trained to output two-dimensional (2D) image locations (x, y coordinates) of keypoints of the first object. These 2D image locations may then be used, together with the 3D coordinates of the first object model, to estimate the pose of the first object with respect to the camera. In one embodiment, a perspective-n-point (PnP) algorithm may be used to calculate the pose of the first object with respect to the camera. In another embodiment, the second pose of the first object with respect to the camera may include a 3D rotation and translation of the first object with respect to the camera.

As an option, the second neural network may also perform online calibration of the camera. This online calibration allows the camera to be moved during runtime, including, for example, during operation of the robotic grasping system or other autonomous system.

By way of example, the second neural network may be that disclosed in U.S. Patent Application No. 16/657,220 (Ref: 1R2674.006901/19-SE-0341US01), entitled "POSE DETERMINATION USING ONE OR MORE NEURAL NETWORKS," filed October 18, 2019, which is incorporated herein by reference in its entirety. Further details relating to an embodiment of the second neural network are described below with reference to FIG. 4.

Still further, as shown in operation 108, a third pose of the first object with respect to the target object is calculated using the first pose and the second pose. Accordingly, a first-object-to-target-object pose estimate can be calculated. In one embodiment, the third pose may be the pose of the target object with respect to the coordinate frame of the first object. In another embodiment, the third pose may be calculated by multiplying the inverse of the output of one of the neural networks by the output of the other neural network, for example the inverse of the first pose multiplied by the second pose, or the inverse of the second pose multiplied by the first pose.
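The inverse-then-multiply computation can be sketched with 4x4 homogeneous transforms. The sketch assumes each pose is represented as a rigid transform that maps points from the object's own frame into the camera frame; that representation and all function names are illustrative assumptions, not something the disclosure prescribes.

```python
from typing import List

Matrix = List[List[float]]


def invert_rigid(t: Matrix) -> Matrix:
    """Invert a 4x4 rigid transform [R | p] as [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    inv_p = [-sum(r_t[i][j] * p[j] for j in range(3)) for i in range(3)]
    return [r_t[i] + [inv_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def matmul4(a: Matrix, b: Matrix) -> Matrix:
    """Plain 4x4 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def target_in_first_object_frame(cam_T_first: Matrix, cam_T_target: Matrix) -> Matrix:
    """Third pose: inverse of the second pose multiplied by the first pose."""
    return matmul4(invert_rigid(cam_T_first), cam_T_target)
```

Here `cam_T_first` plays the role of the second pose (first object with respect to the camera) and `cam_T_target` the role of the first pose (target object with respect to the camera). Swapping the arguments yields the opposite direction, the pose of the first object in the target object's frame.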

In the context where the first object is a robotic grasping system, the robotic grasping system may further be caused to grasp the target object (i.e., the known object) using the third pose. For example, the third pose may be provided to the robotic grasping system to enable it to locate and grasp the target object. Of course, in other embodiments, the third pose may be used for other purposes, including, for example, controlling an autonomous vehicle or causing the autonomous vehicle to make decisions based on the relative location of the target object.

As an option, the first pose and/or the second pose may be refined prior to calculating the third pose. For example, the first pose may be refined by iteratively matching the image against a synthetic projection of the model under the first pose, and then adjusting the parameters of the first pose based on the result of the iterative matching. Similarly, the second pose may be refined by iteratively matching the image against a synthetic projection of the model under the second pose, and then adjusting the parameters of the second pose based on the result of the iterative matching.

To this end, the pose between the first object and the target object can be estimated using two neural networks. In particular, one neural network can estimate the first pose of the target object with respect to the camera, and the second neural network can estimate the second pose of the first object with respect to the camera. The outputs of the neural networks (i.e., the first pose and the second pose) may then be used as the basis for determining the pose between the first object and the target object.

More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may optionally be incorporated with or without the exclusion of other features described.

FIG. 2 illustrates a system 200 for object-to-object pose estimation from an image, according to one embodiment. For example, the system 200 may implement the method 100 of FIG. 1. In one embodiment, the system 200 may be located in the cloud, and thus remotely from the objects. In another embodiment, the system 200 may be located on a computing device that is local to the objects.

As shown, the system 200 includes a first module 201 that includes a first neural network 202, a second module 203 that includes a second neural network 204, and a processor 206. An image is input to the first neural network 202 and the second neural network 204. The image is of a first object and a target object, and is captured by a camera external to both the first object and the target object. The camera may be external to the system 200. However, the image may be provided to the system 200 by the camera over a network, through shared memory, or in any other manner.

The first neural network 202 receives the image as input and processes it to estimate a first pose of the target object with respect to the camera (i.e., the target-object-to-camera pose). The first neural network 202 may be a deep neural network that estimates the 6-DoF pose (e.g., 3D rotation and translation) of a known object with respect to the camera. The network 202 may consist of a multi-stage convolutional network that transforms an input RGB image into a set of belief maps, one per keypoint. In one embodiment, n = 9 keypoints are used to represent the vertices of a bounding cuboid together with the centroid. In addition to the belief maps, the network 202 may output n-1 affinity maps, one for each non-centroid keypoint. Each map is a 2D field of unit vectors pointing toward the nearest object centroid. The maps may be used by a post-processing step to individuate objects, allowing the system to handle multiple instances of each object. The pose may be determined by the first module 201, which applies a PnP algorithm to the keypoints detected as peaks in the belief maps.
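Detecting keypoints as peaks in the belief maps can be sketched as follows. This is a simplified illustration that assumes at most one instance of the object, so the global maximum of each map is taken as the keypoint; handling several instances would require local maxima grouped with the affinity maps, which is not shown. The function names and the `threshold` value are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def find_peak(
    belief_map: Sequence[Sequence[float]], threshold: float = 0.1
) -> Optional[Tuple[int, int]]:
    """Return (x, y) of the strongest response, or None if nothing exceeds threshold."""
    best_value, best_xy = threshold, None
    for y, row in enumerate(belief_map):
        for x, value in enumerate(row):
            if value > best_value:
                best_value, best_xy = value, (x, y)
    return best_xy


def keypoints_from_belief_maps(
    belief_maps: Sequence[Sequence[Sequence[float]]], threshold: float = 0.1
) -> List[Optional[Tuple[int, int]]]:
    """One 2D keypoint (or None) per belief map, in map order."""
    return [find_peak(m, threshold) for m in belief_maps]
```

Keypoints returned as `None` (for example, because of occlusion) would simply be left out of the 2D-3D correspondences passed to the PnP step.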

In one embodiment, the input to the network 202 may consist of a 533x400 image processed by a VGG-based feature extractor, resulting in a 50x50x512 block of features. These features may be processed by a series of six stages, each containing seven convolutional layers, that output and refine the belief maps described above.

The second neural network 204 receives the image as input and processes it to estimate a second pose of the first object with respect to the camera (i.e., the first-object-to-camera pose). In one embodiment, the second neural network 204 may be a deep neural network that estimates the 6-DoF pose of a robot with respect to the camera. The network 204 may consist of an encoder-decoder that transforms an input RGB image into a set of belief maps, one per keypoint. Because there is only one robot in a given scene, affinity fields may not be needed.

In one embodiment, keypoints located at the joints of the robot may be defined so as to achieve pose stability when the arm is largely outside the camera's field of view, which occurs when the camera views the scene from a close distance. In one embodiment, the input to the network 204 is a 400x400 image, downsampled and center-cropped from the original 640x480. The layers of the network 204 may be as follows: the encoder follows the same layer structure as VGG, while the decoder is built from four upsampling layers followed by two convolution layers, yielding belief maps of the keypoints. The second module 203 applies a PnP algorithm to the keypoints detected as peaks in the belief maps.
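The preprocessing arithmetic can be made concrete with a short sketch. It assumes the 640x480 frame is center-cropped to its largest square (480x480) and resized to 400x400; downsampling first and cropping afterwards gives the same mapping up to rounding. The helper names are hypothetical, and the second function shows one reason the mapping matters: keypoints detected in network coordinates have to be mapped back to original pixels before they are used with the original camera intrinsics.

```python
from typing import Tuple


def center_crop_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Largest centered square as (left, top, right, bottom) in original pixels."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def to_original_pixels(
    x_net: float, y_net: float, width: int = 640, height: int = 480, net_size: int = 400
) -> Tuple[float, float]:
    """Map a keypoint from network-input coordinates back to the original image."""
    left, top, right, _ = center_crop_box(width, height)
    scale = (right - left) / net_size  # 480 / 400 = 1.2 for the sizes above
    return x_net * scale + left, y_net * scale + top
```

For a 640x480 frame the crop box is columns 80 to 560 over the full height, so the center of the network input maps back to the center of the original image.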

As an option, the first pose and/or the second pose may be refined to reduce estimation errors caused by limited training data and/or limited network capacity. This refinement may be performed by adjusting the pose parameters through iterative matching of the input image against a synthetic projection of the model under the current pose.

Because the first neural network 202 and the second neural network 204 do not require each other's output, they may, if desired, operate in parallel. The respective outputs of the first neural network 202 and the second neural network 204 are provided to the processor 206. The processor 206 uses the first pose and the second pose to calculate the third pose of the first object with respect to the target object. The third pose may be calculated using Equation 1, which is described according to one embodiment.

$$ {}^{R}_{O}T = \left({}^{C}_{R}T\right)^{-1} \, {}^{C}_{O}T \qquad (1) $$

where ${}^{R}_{O}T$ is the pose of the object in the robot frame, ${}^{C}_{R}T$ is the pose of the robot in the camera frame (calculated by the second network 204), and ${}^{C}_{O}T$ is the pose of the object in the camera frame (calculated by the first network 202).

In an exemplary embodiment in which the first object is a robotic grasping system and the target object is a known object, the processor 206 may then use the third pose to cause the robotic grasping system to grasp the target object. The processor 206 may communicate with the robotic grasping system over a network such as the Internet (e.g., when the system 200 is located remotely from the objects) or Ethernet (registered trademark) (e.g., when the system 200 is located locally to the objects).

Because the first neural network 202 only estimates the pose of the target object with respect to the camera, only the first neural network 202 (and not necessarily the second neural network 204) may be trained on the target object. In this way, the target object may be "known" to the first neural network 202, or in other words may be a "known object." Of course, the first neural network 202 may also be trained on other categories or types of target objects.

Similarly, because the second neural network 204 only estimates the pose of the first object with respect to the camera, only the second neural network 204 (and not necessarily the first neural network 202) may be trained on the first object. In other words, the first object may be "known" to the second neural network 204. Note, however, that the second neural network 204 may also be trained on other objects (e.g., other robots or other autonomous objects).

To this end, in contrast to end-to-end learning, the modular approach presented in the present system 200 allows the system to be reused without retraining all of the networks. For example, to apply the system 200 to a new robot, only the second network 204 needs to be retrained. Similarly, to apply the system 200 to a new object, only the first network 202 needs to be trained on that object, independently of the robot and of other objects. Moreover, this modular approach can facilitate testing and refinement of the individual components to ensure accuracy and reliability.

FIG. 3 is a block diagram associated with the first neural network 202 of FIG. 2, according to one embodiment. Of course, the block diagram is described as one possible embodiment associated with the first neural network 202 of FIG. 2. This embodiment is described in more detail in U.S. Patent Application No. 16/405,662 (Ref: 741851/18-RE-0161-US02), entitled "DETECTING AND ESTIMATING THE POSE OF AN OBJECT USING A NEURAL NETWORK MODEL," filed May 7, 2019, which is incorporated herein by reference in its entirety.

In the embodiment shown, the pose estimation system 300 includes a keypoint module 310, a set of multi-stage modules 305, and a pose unit 320. Although the pose estimation system 300 is described in the context of processing units, one or more of the keypoint module 310, the set of multi-stage modules 305, and the pose unit 320 may be implemented by a program, custom circuitry, or a combination of custom circuitry and a program. For example, the keypoint module 310 may be implemented by a GPU, a central processing unit (CPU), or any processor capable of processing an image to generate keypoint data.

The pose estimation system 300 receives an image captured by a single camera. The image may include one or more objects for detection. In one embodiment, the image includes color data for each pixel without any depth data. The pose estimation system 300 first detects keypoints associated with an object, and then estimates the 2D projections of the vertices that define a bounding volume enclosing the object associated with the keypoints. The keypoints may include the centroid of the object and the vertices of the bounding volume enclosing the object. The keypoints are not explicitly visible in the image, but are instead inferred by the pose estimation system 300. In other words, the object of interest is visible in the image, except for any portions of the object that may be occluded, whereas the keypoints associated with the object of interest are not explicitly visible in the image. The 2D locations of the keypoints are estimated by the pose estimation system 300 using only the image data. The pose unit 320 recovers the 3D pose of the object using the estimated 2D locations, the camera intrinsic parameters, and the dimensions of the object.
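The 3D side of the correspondences used by the pose unit can be derived from the object dimensions alone. The sketch below builds the nine model keypoints (eight bounding-cuboid vertices plus the centroid) in an object frame assumed to be centered on the cuboid. The function name, the axis convention, and the keypoint ordering are hypothetical; in practice the ordering has to match the order of the belief maps produced by the network.

```python
from itertools import product
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def cuboid_keypoints(width: float, height: float, depth: float) -> List[Vec3]:
    """Eight cuboid vertices followed by the centroid, in the object frame."""
    half = (width / 2.0, height / 2.0, depth / 2.0)
    vertices = [
        (sx * half[0], sy * half[1], sz * half[2])
        for sx, sy, sz in product((-1.0, 1.0), repeat=3)
    ]
    # The centroid sits at the origin of the (assumed) centered object frame.
    return vertices + [(0.0, 0.0, 0.0)]
```

Pairing these nine 3D points with the nine estimated 2D locations and the camera intrinsics gives the inputs a PnP solver needs to recover the 3D pose.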

The key point module 310 receives an image containing an object and outputs image features. In one embodiment, the key point module 310 includes multiple layers of a convolutional neural network (i.e., the first network 202). In one embodiment, the key point module 310 includes the first ten layers of a Visual Geometry Group (VGG-19) neural network pre-trained on the ImageNet training database, followed by two 3×3 convolution layers that reduce the feature dimension from 512 to 256 and then from 256 to 128. The key point module 310 outputs features for three channels, one per channel (e.g., RGB).

画像特徴は、多段モジュール３０５のセットへ入力される。一実施例では、多段モジュール３０５のセットは、物体の重心を検出するように構成された第１の多段モジュール３０５、及び物体を取り囲むバウンディング・ボリュームの頂点を検出するように構成された追加的な多段モジュール３０５を並列に含む。一実施例では、多段モジュール３０５のセットは、画像特徴を複数の経路で処理して、順次バウンディング・ボリュームの重心及び頂点を検出するために使用される単一の多段モジュール３０５を含む。一実施例では、多段モジュール３０５は、重心を検出することなく頂点を検出するように構成される。 Image features are input to the set of multi-stage modules 305. In one embodiment, the set of multi-stage modules 305 is a first multi-stage module 305 configured to detect the center of gravity of the object, and an additional configured to detect the vertices of the bounding volume surrounding the object. Multi-stage modules 305 are included in parallel. In one embodiment, a set of multi-stage modules 305 includes a single multi-stage module 305 used to process image features in multiple paths to detect the centroids and vertices of sequential bounding volumes. In one embodiment, the multi-stage module 305 is configured to detect vertices without detecting the center of gravity.

Each multi-stage module 305 includes T stages of belief map units 315. In one embodiment, the number of stages is six (e.g., T = 6). Belief map unit 315-1 is the first stage, belief map unit 315-2 is the second stage, and so on. The image features extracted by the key point module 310 are passed to each of the belief map units 315 in the multi-stage module 305. In one embodiment, the key point module 310 and the multi-stage module 305 comprise a feedforward neural network (i.e., the first neural network 202) that takes as input an RGB image of size w × h × 3 and branches to produce two different outputs, such as belief maps. In one embodiment, w = 640 and h = 480. The stages of belief map units 315 operate serially, with each stage (belief map unit 315) taking into account not only the image features but also the output of the immediately preceding stage.

各多段モジュール３０５内のビリーフ・マップ・ユニット３１５の段は、画像内の物体に関連付けられる単一の２Ｄロケーションの推定のためのビリーフ・マップを生成する。第１のビリーフ・マップは、物体の重心についての確率値を含み、追加的なビリーフ・マップは、物体を取り囲むバウンディング・ボリュームの頂点についての確率値を含む。 The stages of the belief map unit 315 in each multi-stage module 305 generate a belief map for estimating a single 2D location associated with an object in the image. The first belief map contains the probability values for the center of gravity of the object, and the additional belief map contains the probability values for the vertices of the bounding volume surrounding the object.

In one embodiment, the 2D locations of the detected vertices are the 2D coordinates of the vertices of a 3D bounding volume that encloses each object, projected into image space for the scene. Representing each object by a 3D bounding box defines an abstract representation of the object that is sufficient for pose estimation yet independent of the details of the object's shape. When the bounding volume is a 3D bounding box, nine multi-stage modules 305 can be used to generate belief maps for the centroid and the eight vertices in parallel. The pose unit 325 estimates the 2D coordinates of the 3D bounding box vertices projected into image space, and then infers the object's location and pose in 3D space from perspective-n-point (PnP), using either a conventional computer vision algorithm or another neural network. PnP estimates the pose of an object using a set of n locations in 3D space and the projections of those n locations in image space. In one embodiment, the pose estimation system 300 estimates, in real time, the 3D pose of a known object in clutter from a single RGB image.
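The relationship between the nine 3D key points (centroid plus eight bounding box vertices) and their 2D image locations can be illustrated with a minimal sketch. This is not the patented method itself: it assumes a simple pinhole camera model, and the function names, box dimensions, and intrinsic values (`fx`, `fy`, `cx`, `cy`) are hypothetical values chosen for illustration. A PnP solver would perform the inverse step, recovering the rotation and translation from the 2D and 3D point sets.

```python
from itertools import product

def box_keypoints(dims):
    """Return the centroid followed by the 8 corners of a box centred on the object origin."""
    dx, dy, dz = (d / 2.0 for d in dims)
    corners = [(sx * dx, sy * dy, sz * dz) for sx, sy, sz in product((-1, 1), repeat=3)]
    return [(0.0, 0.0, 0.0)] + corners

def project(points, rotation, translation, fx, fy, cx, cy):
    """Project object-frame 3D points to pixel coordinates with an assumed pinhole model."""
    pixels = []
    for p in points:
        # Camera-frame point: R @ p + t
        xc, yc, zc = (
            sum(rotation[r][c] * p[c] for c in range(3)) + translation[r]
            for r in range(3)
        )
        pixels.append((fx * xc / zc + cx, fy * yc / zc + cy))
    return pixels

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# Hypothetical object: a 0.2 x 0.1 x 0.3 m box placed 1 m in front of the camera.
keypoints_3d = box_keypoints((0.2, 0.1, 0.3))
keypoints_2d = project(keypoints_3d, IDENTITY, (0.0, 0.0, 1.0),
                       fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

The n = 9 correspondences between `keypoints_3d` and `keypoints_2d` are the kind of input a PnP algorithm consumes.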

一実施例では、ビリーフ・マップ・ユニット３１５の段は、それぞれコンボリューショナル・ニューラル・ネットワーク（ＣＮＮ：ＣｏｎｖｏｌｕｔｉｏｎａｌＮｅｕｒａｌＮｅｔｗｏｒｋ）の段である。各段がＣＮＮである場合、各段は、データがニューラル・ネットワークを通過するにつれ、益々増大した有効受容野を活用する。この性質は、後方の段で益々増大した量のコンテキストを組み込むことにより、ビリーフ・マップ・ユニット３１５の段が曖昧さを解決できるようにしている。 In one embodiment, the stages of the belief map unit 315 are each stage of a convolutional neural network (CNN). If each stage is a CNN, each stage utilizes an increasingly increased receptive field as the data passes through the neural network. This property allows the stage of the belief map unit 315 to resolve the ambiguity by incorporating an increasing amount of context in the rear stage.

In one embodiment, the stages of belief map units 315 receive the 128-dimensional features extracted by the key point module 310. In one embodiment, belief map unit 315-1 includes three 3×3×128 layers and one 1×1×512 layer. In one embodiment, belief map unit 315-2 is a 1×1×9 layer. In one embodiment, belief map units 315-3 through 315-T are identical to the first stage, except that each receives a 153-dimensional input (128 + 16 + 9 = 153) and includes five 7×7×128 layers and one 1×1×128 layer before the 1×1×128 layer or the 1×1×16 layer. In one embodiment, each of the belief map units 315 is of size w/8 by h/8, with rectified linear unit (ReLU) activation functions interleaved throughout.
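The channel and size arithmetic in this paragraph can be checked with a short sketch. The constant names below are hypothetical; the role of the 16-channel output is not described in this section, so it is labeled only as an auxiliary output, and reading the 9 channels as one map per key point (centroid plus eight vertices) is an assumption.

```python
FEATURE_CHANNELS = 128  # image features from the key point module
AUX_CHANNELS = 16       # the 16-channel output named in the text (role not described here)
BELIEF_CHANNELS = 9     # assumed: one belief map per key point (centroid + 8 vertices)

def later_stage_input_channels():
    """Input depth of a later stage: image features concatenated with the previous outputs."""
    return FEATURE_CHANNELS + AUX_CHANNELS + BELIEF_CHANNELS

def belief_map_size(w, h, downsample=8):
    """Spatial size of a belief map for a w x h input, assuming integer downsampling by 8."""
    return w // downsample, h // downsample
```

For the w = 640, h = 480 case mentioned earlier, this gives belief maps of 80 × 60.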

図４は、一実施例による、図２の第２のニューラル・ネットワーク２０４の使用に関連付けられるブロック図である。もちろん、ブロック図は、図２の第２のニューラル・ネットワーク２０４に関連付けられる１つの可能な実施例として説明される。第２のニューラル・ネットワーク２０４に関するこの実施例は、２０１９年１０月１８日に出願した「ＰＯＳＥＤＥＴＥＲＭＩＮＡＴＩＯＮＵＳＩＮＧＯＮＥＯＲＭＯＲＥＮＥＵＲＡＬＮＥＴＷＯＲＫＳ」と題する米国特許出願第１６／６５７，２２０（参考：１Ｒ２６７４．００６９０１／１９−ＳＥ−０３４１ＵＳ０１）により詳細に説明されており、その全体が参照によって本明細書に組み込まれる。 FIG. 4 is a block diagram associated with the use of the second neural network 204 of FIG. 2 according to one embodiment. Of course, the block diagram is described as one possible embodiment associated with the second neural network 204 of FIG. This example relating to the second neural network 204 is a US patent application 16 / 657,220 entitled "POSE DETERMINATION USING ONE OR MORE NEURAL NETWORKS" filed on October 18, 2019 (Reference: 1R2674.7006901 /). 19-SE-0341US01), which is incorporated herein by reference in its entirety.

示されるように、ロボットの捕捉された画像４０２は、入力として訓練済みニューラル・ネットワーク２０４に与えられる。ロボットが本実施例のコンテキストで説明されるが、本明細書において説明されるブロック図は他の物体（たとえば、他の自律システム）に等しく適用できることに留意されたい。一選択肢として、処理の前に、解像度、色深度、又はコントラストを調節するためなど、この画像の何らかの事前処理又は拡張が実施されてもよい。少なくとも一実施例において、異なるロボットは異なる形状、サイズ、構成、運動学、及び特徴を有する可能性があるため、ネットワーク２０４は、特にロボット２０４のタイプについて訓練することができる。 As shown, the captured image 402 of the robot is given to the trained neural network 204 as input. Although robots are described in the context of this embodiment, it should be noted that the block diagrams described herein are equally applicable to other objects (eg, other autonomous systems). Alternatively, some pre-processing or enhancement of this image may be performed prior to processing, such as to adjust resolution, color depth, or contrast. In at least one embodiment, the network 204 can be trained specifically for the type of robot 204, as different robots can have different shapes, sizes, configurations, kinematics, and features.

使用の際、ニューラル・ネットワーク２０４は、入力画像４０２を分析して、推論のセットとしてビリーフ・マップ４０６のセットを出力することができる。一選択肢として、特徴ポイントを位置特定するために他の次元決定推論を生成することができる。たとえば、ニューラル・ネットワーク２０４は、識別されるロボット特徴ごとに１つのビリーフ・マップ４０６を推論することができる。 In use, the neural network 204 can analyze the input image 402 and output a set of belief maps 406 as a set of inferences. As an option, other dimensional inferences can be generated to locate feature points. For example, the neural network 204 can infer one belief map 406 for each robot feature identified.

少なくとも一実施例において、訓練に使用されるロボットのモデルは、追跡される具体的な特徴を識別することができる。これらの特徴は、訓練プロセスを通じて学習され得る。別の実施例では、特徴は、ロボットの姿勢をこれらの特徴から決定することができるように、そのロボットの様々な可動部分又はコンポーネントに位置特定され得る。さらなる実施例では、特徴は、ロボットのそれぞれの姿勢が１つ且つ１つだけの特徴の設定に対応するように、また特徴のそれぞれの設定が１つ且つ１つだけのロボット姿勢に対応するように、選択され得る。この一意性は、カメラからロボットへの姿勢が、捕捉された画像データに表現されるように特徴の一意な配向に基づいて決定され得るようにできる。 In at least one embodiment, the robot model used for training can identify specific features to be tracked. These features can be learned through the training process. In another embodiment, the features may be located on various moving parts or components of the robot so that the posture of the robot can be determined from these features. In a further embodiment, the features are such that each posture of the robot corresponds to one and only one feature setting, and each feature setting corresponds to one and only one robot pose. Can be selected. This uniqueness can allow the camera-to-robot attitude to be determined based on the unique orientation of the features as represented in the captured image data.

For network 204, an auto-encoder network can detect the key points. In at least one embodiment, the neural network 204 takes as input an RGB image of size w × h × 3 and outputs n belief maps 406 of the form w × h × n. In at least one embodiment, RGBD or stereoscopic images can be taken as input in the same way. Optionally, w = 640 and h = 480. In at least one embodiment, the output for each key point is a 2D belief map in which each pixel value represents the likelihood that the key point projects onto that pixel.

In one embodiment, the encoder of network 204 includes the convolutional layers of VGG-19 pre-trained on ImageNet. In another embodiment, a ResNet-based encoder can be used. In a further embodiment, the decoder, or upsampling component, of network 204 consists of four 2D transposed convolutional layers, each followed by a regular 3×3 convolutional layer and a ReLU activation function. Still further, the output head can consist of three convolutional layers (3×3, stride = 1, padding = 1) with ReLU activations and with 64, 32, and n channels, respectively. In at least one embodiment, there may be no activation layer after the final convolutional layer. In another embodiment, the encoder network can be trained using an L2 loss function that compares the output belief maps with ground-truth belief maps, where the ground-truth belief maps are generated using σ = 2 pixels to produce the peaks. In at least one embodiment, the use of a stereoscopic image pair can make it possible to fuse the poses estimated from those images, or a point cloud can be computed and the pose determined using a process such as Procrustes analysis or iterative closest point (ICP).
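The ground-truth belief map and L2 loss can be sketched as follows. This is an illustrative reading only: it assumes each peak is an unnormalized 2D Gaussian with σ = 2 pixels centered on the key point, and that the loss is a plain sum of squared differences; the source does not state the normalization or reduction, and the function names are hypothetical.

```python
import math

def gaussian_belief_map(width, height, center, sigma=2.0):
    """Build a height x width map with a Gaussian peak (value 1.0) at `center` = (x, y)."""
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2)) for x in range(width)]
        for y in range(height)
    ]

def l2_loss(predicted, target):
    """Sum of squared per-pixel differences between two maps of equal size."""
    return sum(
        (p - t) ** 2
        for prow, trow in zip(predicted, target)
        for p, t in zip(prow, trow)
    )

# Hypothetical 16 x 12 map with a key point at pixel (5, 7).
ground_truth = gaussian_belief_map(16, 12, (5, 7))
blank = [[0.0] * 16 for _ in range(12)]
```

A prediction identical to `ground_truth` gives zero loss, while the `blank` prediction is penalized for missing the peak.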

In at least one embodiment, the belief maps 406 can be provided as input to a peak extraction component 408, or service, which can determine a set of coordinates in two dimensions that represent the positions of the relevant robot features. Optionally, key point coordinates can be computed for each belief map as a weighted average of the values near thresholded peaks, after first applying Gaussian smoothing to the belief map to reduce the effects of noise. This weighted average can allow for sub-pixel precision. In at least one embodiment, these two-dimensional coordinates (or pixel locations) can be provided as input to a pose determination module, such as a perspective-n-point (PnP) module 414. The pose determination module can also accept as input camera intrinsics data 410, such as calibration information for the camera that can be used to account for image artifacts due to lens asymmetries, focal length, principal point, or other such factors.
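The smoothing, thresholding, and weighted-average steps can be sketched in plain Python. This is a simplified illustration under stated assumptions: a 3×3 binomial kernel stands in for Gaussian smoothing, only the single strongest peak is extracted, and the `threshold` and `radius` values are hypothetical.

```python
def smooth(bmap):
    """Apply a 3x3 binomial blur (a small Gaussian approximation) with clamped edges."""
    h, w = len(bmap), len(bmap[0])
    kernel = ((1, 2, 1), (2, 4, 2), (1, 2, 1))
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += kernel[dy + 1][dx + 1] * bmap[yy][xx]
            out[y][x] = acc / 16.0
    return out

def extract_peak(bmap, threshold=0.1, radius=2):
    """Return the sub-pixel (x, y) of the strongest peak, or None if it is below threshold."""
    sm = smooth(bmap)
    h, w = len(sm), len(sm[0])
    peak, px, py = max((sm[y][x], x, y) for y in range(h) for x in range(w))
    if peak < threshold:
        return None
    total = sum_x = sum_y = 0.0
    # Weighted average of pixel coordinates in a window around the peak.
    for y in range(max(py - radius, 0), min(py + radius + 1, h)):
        for x in range(max(px - radius, 0), min(px + radius + 1, w)):
            value = sm[y][x]
            total += value
            sum_x += value * x
            sum_y += value * y
    return sum_x / total, sum_y / total

# Hypothetical 10 x 8 map whose true peak lies between pixels x = 4 and x = 5.
demo_map = [[0.0] * 10 for _ in range(8)]
demo_map[3][4] = 1.0
demo_map[3][5] = 1.0
```

Because the response is split evenly between two neighboring pixels, the weighted average lands halfway between them, which is the sub-pixel behavior the paragraph describes.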

In at least one embodiment, the pose determination module can also receive as input information about the forward kinematics 412 of this type of robot, for determining possible poses. The kinematics can be used to narrow the search space to only those feature locations that are possible given the physical configuration or limitations of this type of robot. This information can be analyzed using a PnP algorithm to output the determined camera-to-robot pose. In at least one embodiment, perspective-n-point is used to recover the camera extrinsics, assuming that the joint configuration of the robot manipulator is known. Because the base coordinates or other features of the robot can be accurately located in camera space, or the camera coordinate system, this pose information can be used to determine the relative distance and orientation between the camera and the robot. To this end, the neural network 204 can be trained to infer the positions of these features, such as by inferring a set of belief maps 406.

In at least one embodiment, the relative position and orientation information can be used to ensure that the camera coordinate space, from the camera's point of view, is aligned with the robot coordinate space of the robot in both dimension and alignment. In at least one embodiment, an inaccurate orientation or position of the camera 502 with respect to the robot 504 can result in incorrect coordinates being given to the robot 504 for performing an action, because those coordinates may be correct in the camera coordinate system but not in the robot's coordinate system. For this purpose, the relative position and orientation of the camera 502 can be determined with respect to the robot 504. In at least one embodiment, the relative position of the camera 502 may be sufficient, while orientation information may be useful depending on factors such as the camera intrinsics, in which case asymmetric image properties can affect accuracy if not properly accounted for. This relative position and orientation can be calibrated online (e.g., while the robot is running).

図５は、一実施例による、ロボットから物体への姿勢推定システムを使用して制御されるロボティック把持システム５００を図示している。ロボティック把持システム５００は、前述の実施例のコンテキストで実装することができる。 FIG. 5 illustrates a robotic gripping system 500 controlled using a robot-to-object posture estimation system according to an embodiment. The robotic gripping system 500 can be implemented in the context of the embodiments described above.

示されるように、カメラ５０２を使用して、可能性としては、ロボット５０４などの自律物体の動画フレームの形態で、画像を捕捉することができる。カメラ５０２は、ロボット５０４がカメラ５０２の視野５１０内に入るように、また完全なビュー表現ではない場合はロボット５０４の少なくとも部分的な表現を含む場合がある画像を、カメラ５０２が捕捉できるように、位置付けることができる、又は外部に搭載することができる。捕捉された画像を使用して、特定のタスクを実行するためにロボット５０４に命令を与えるよう支援することができる。 As shown, the camera 502 can be used to capture an image, potentially in the form of a moving image frame of an autonomous object such as a robot 504. The camera 502 allows the robot 504 to be within the field of view 510 of the camera 502 and to allow the camera 502 to capture images that may include at least a partial representation of the robot 504 if it is not a complete view representation. , Can be positioned, or can be mounted externally. The captured images can be used to assist the robot 504 in commanding it to perform a particular task.

In at least one embodiment, the captured image can be analyzed to determine the location of an object 512 relative to the robot 504, with which the robot 504 is to interact in some way, such as by picking up or modifying the object 512. In particular, the system 200 of FIG. 2 can be used to determine the robot-to-object pose in order to provide accurate instructions to the robot 504 or to a control system for the robot 504. The robot-to-object pose can also be used for other purposes, such as helping to navigate the robot 504 or providing current information about the state of the robot 504. In at least one embodiment, accurate position and orientation data between the robot 504 and the object 512 enables the robot 504 to operate robustly in unstructured, dynamic environments while performing tasks such as object grasping and manipulation, human-robot interaction, and collision detection and avoidance. It is therefore desirable to determine at least one of the position or orientation of the robot 504 with respect to the camera 502 and the position or orientation of the object 512 with respect to the camera 502, and then to determine the robot-to-object pose.
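One common way to combine the two camera-relative poses into a robot-to-object pose is rigid transform composition. The sketch below assumes each pose is a 4×4 homogeneous transform stored as nested lists and uses the standard identity robot_T_object = inverse(cam_T_robot) · cam_T_object; the source states only that the third pose is calculated from the first two, so this formula and all names and numeric values here are illustrative assumptions.

```python
def invert(T):
    """Invert a rigid transform [R | t]; the inverse is [R^T | -R^T t]."""
    R_t = [[T[c][r] for c in range(3)] for r in range(3)]
    t = [-sum(R_t[r][c] * T[c][3] for c in range(3)) for r in range(3)]
    return [R_t[r] + [t[r]] for r in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def compose(A, B):
    """Matrix product of two 4x4 transforms."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)] for r in range(4)]

def robot_to_object(cam_T_robot, cam_T_object):
    """Pose of the object expressed in the robot frame."""
    return compose(invert(cam_T_robot), cam_T_object)

# Hypothetical poses: robot rotated 90 degrees about z at (1, 0, 2) in the camera frame;
# object unrotated at (1, 0.5, 2) in the camera frame.
cam_T_robot = [[0.0, -1.0, 0.0, 1.0],
               [1.0,  0.0, 0.0, 0.0],
               [0.0,  0.0, 1.0, 2.0],
               [0.0,  0.0, 0.0, 1.0]]
cam_T_object = [[1.0, 0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0, 0.5],
                [0.0, 0.0, 1.0, 2.0],
                [0.0, 0.0, 0.0, 1.0]]
robot_T_object = robot_to_object(cam_T_robot, cam_T_object)
```

In this example the object sits 0.5 units along the robot's own x axis, even though the offset is along y in the camera frame.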

As described above, an image showing the current orientation of the robot 504 with respect to the camera 502 and the current orientation of the object 512 with respect to the camera 502 can be captured by the camera 502. Regarding the orientation of the robot 504, the robot 504 can have various articulated limbs 508 or components that allow it to assume various configurations, or "poses". In at least one embodiment, different poses of the robot 504 can result in different representations in the images captured by the camera 502. In at least one embodiment, a single image captured by the camera 502 can be analyzed to determine features of the robot 504 that can be used to determine the pose of the robot 504. The features can correspond to joints or locations at which the robot can move or adjust its position or orientation. Furthermore, because the dimensions and kinematics of the robot 504 are known, determining the pose of the robot 504 from the viewpoint of the camera 502 enables an accurate determination of the camera-to-robot distance and orientation.

Machine Learning
Deep neural networks (DNNs), including deep learning models, developed on processors have been used for diverse use cases, from self-driving cars to faster drug development, and from automatic image captioning in online image databases to smart real-time language translation in video chat applications. Deep learning is a technique that models the neural learning process of the human brain, continually learning, continually getting smarter, and delivering more accurate results more quickly over time. A child is initially taught by an adult to correctly identify and classify various shapes, and is eventually able to identify shapes without any coaching. Similarly, a deep learning or neural learning system needs to be trained in object recognition and classification in order to become smarter and more efficient at identifying basic objects, occluded objects, and so on, while also assigning context to objects.

最も簡単なレベルでは、人間の脳のニューロンは、受信した様々な入力を見て、重要度のレベルをこれらの入力のそれぞれに割り当て、出力は他のニューロンに渡されて作用する。人工ニューロン又はパーセプトロンは、最も基本的なニューラル・ネットワークのモデルである。一実例では、パーセプトロンは、パーセプトロンが認識して分類するよう訓練される物体の様々な特徴を表現する１つ又は複数の入力を受信することができ、これらの特徴のそれぞれは、物体の形状を定義することにおいて、その特徴の重要度に基づいて一定の重みを割り当てられる。 At the simplest level, neurons in the human brain look at the various inputs they receive, assign a level of importance to each of these inputs, and the output is passed to other neurons to act. Artificial neurons or perceptrons are the most basic models of neural networks. In one embodiment, the perceptron can receive one or more inputs that represent various features of the object that the perceptron is trained to recognize and classify, each of which features the shape of the object. In defining, a certain weight is assigned based on the importance of the feature.
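The weighted-input behavior of a perceptron can be sketched in a few lines. The step activation, the bias term, and the example weights below are assumptions chosen for illustration; the text does not specify them.

```python
def perceptron(inputs, weights, bias=0.0):
    """Weight each input by its importance, sum, and fire (1) if the total is positive."""
    activation = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if activation > 0 else 0

# Hypothetical two-feature example: the first feature alone is enough to fire.
fires = perceptron([1.0, 0.0], [0.6, 0.9], bias=-0.5)
silent = perceptron([0.0, 0.0], [0.6, 0.9], bias=-0.5)
```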

ディープ・ニューラル・ネットワーク（ＤＮＮ）モデルは、膨大な量の入力データで訓練することができる多くの接続されたノード（たとえば、パーセプトロン、ボルツマン・マシン、半径ベースの関数、コンボリューショナル層など）の複数の層を含み、複雑な問題を高い精度で迅速に解決する。一実例では、ＤＮＮモデルの第１の層は、自動車の入力画像を様々なセクションに分解し、線や角度などの基本的なパターンを探す。第２の層は、線を組み立て、ホイール、フロントガラス、及びミラーなどの、より高次のパターンを探す。次の層は、車両のタイプを識別し、最後のいくつかの層は、入力画像用のラベルを生成し、具体的な自動車ブランドのモデルを特定する。 Deep Neural Networks (DNN) models are of many connected nodes that can be trained with vast amounts of input data (eg, perceptrons, Boltzmann machines, radius-based functions, convolutional layers, etc.). It includes multiple layers and solves complex problems quickly with high accuracy. In one example, the first layer of the DNN model decomposes the input image of the car into various sections, looking for basic patterns such as lines and angles. The second layer assembles the lines and looks for higher order patterns such as wheels, windshields, and mirrors. The next layer identifies the type of vehicle, and the last few layers generate labels for the input image and identify the specific car brand model.

いったん、ＤＮＮが訓練されると、ＤＮＮを展開して、推論として知られるプロセスにおいて、物体又はパターンを識別及び分類するために使用することができる。推論（ＤＮＮが所与の入力から有用な情報を抽出するプロセス）の例としては、ＡＴＭ機に預け入れられた小切手の手書きの数字を識別すること、写真の友人の画像を識別すること、５０００万以上のユーザに映画のおすすめを提供すること、無人自動車において様々なタイプの自動車、歩行者、及び道路危険物を識別して分類すること、又はリアルタイムに人間によるスピーチを翻訳することが挙げられる。 Once the DNN is trained, the DNN can be expanded and used to identify and classify objects or patterns in a process known as inference. Examples of inference (the process by which DNN extracts useful information from a given input) are identifying the handwritten numbers on a check deposited on an ATM machine, identifying the image of a friend in a photo, 50 million. Providing movie recommendations to these users, identifying and classifying various types of vehicles, pedestrians, and road hazards in unmanned vehicles, or translating human speech in real time.

訓練の間、データは、入力に対応するラベルを示す予測が生成されるまで、ＤＮＮを順伝播フェーズで通過する。ニューラル・ネットワークが入力を正確にラベル付けしない場合、正しいラベルと予測されたラベルとの間の誤差が分析され、逆伝播フェーズの間、ＤＮＮがその入力及び訓練データ・セット中の他の入力を正確にラベル付けするまで、特徴ごとに重みが調節される。複雑なニューラル・ネットワークを訓練することは、浮動小数点の乗算及び加算を含む、大量の並列コンピューティング・パフォーマンスを必要とする。推論することは、訓練することよりは計算集約的ではなく、訓練済みのニューラル・ネットワークが、画像の分類、スピーチの翻訳、及び新しい情報の一般的な推論のための、以前に見たことがない新しい入力に適用される、レイテンシに敏感なプロセスである。 During training, the data passes through the DNN in the forward propagation phase until a prediction indicating the label corresponding to the input is generated. If the neural network does not label the input correctly, the error between the correct label and the predicted label is analyzed and DNN takes that input and other inputs in the training data set during the backpropagation phase. The weights are adjusted for each feature until they are labeled correctly. Training complex neural networks requires a large amount of parallel computing performance, including floating-point multiplication and addition. Inferring is less computationally intensive than training, and what trained neural networks have seen before for image classification, speech translation, and general inference of new information. It is a latency-sensitive process that applies to no new inputs.
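The forward and backward phases can be illustrated on the smallest possible model, a single linear neuron trained by gradient descent on a squared error. This is a toy sketch, not the training procedure of any network in this document; the learning rate, epoch count, and data are hypothetical.

```python
def train(samples, lr=0.1, epochs=500):
    """Fit y = w * x + b by repeating forward and backward passes over the samples."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            prediction = w * x + b   # forward propagation
            error = prediction - y   # difference between predicted and correct value
            w -= lr * error * x      # backward pass: gradient of 0.5 * error**2 w.r.t. w
            b -= lr * error          # gradient of 0.5 * error**2 w.r.t. b
    return w, b

# Hypothetical training set drawn from y = 2x + 1.
learned_w, learned_b = train([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

The weights are adjusted until the predictions match the targets, which mirrors the adjust-until-correct loop described above.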

Inference and Training Logic
As described above, a deep learning or neural learning system needs to be trained to generate inferences from input data. Details regarding inference and/or training logic 615 for a deep learning or neural learning system are provided below in conjunction with FIG. 6A and/or FIG. 6B.

In at least one embodiment, the inference and/or training logic 615 may include, without limitation, data storage 601 to store forward and/or output weights and/or input/output data corresponding to neurons or layers of a neural network trained and/or used for inference in aspects of one or more embodiments. In at least one embodiment, the data storage 601 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during forward propagation of input/output data and/or weight parameters during training and/or inference using aspects of one or more embodiments. In at least one embodiment, any portion of the data storage 601 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory.

少なくとも一実施例では、データ・ストレージ６０１の任意の部分は、１つ若しくは複数のプロセッサ、又は他のハードウェア論理デバイス若しくは回路の内部にあっても外部にあってもよい。少なくとも一実施例では、データ・ストレージ６０１は、キャッシュ・メモリ、ダイナミック・ランダム・アドレス可能メモリ（「ＤＲＡＭ」：ｄｙｎａｍｉｃｒａｎｄｏｍｌｙａｄｄｒｅｓｓａｂｌｅｍｅｍｏｒｙ）、スタティック・ランダム・アドレス可能メモリ（「ＳＲＡＭ」：ｓｔａｔｉｃｒａｎｄｏｍｌｙａｄｄｒｅｓｓａｂｌｅｍｅｍｏｒｙ）、不揮発性メモリ（たとえば、フラッシュ・メモリ）、又は他のストレージであってもよい。少なくとも一実施例では、データ・ストレージ６０１が、たとえばプロセッサの内部にあるか外部にあるかの選択、又はＤＲＡＭ、ＳＲＡＭ、フラッシュ、若しくは何らか他のタイプのストレージから構成されるかの選択は、オン・チップ対オフ・チップで利用可能なストレージ、実行される訓練及び／又は推論の機能のレイテンシ要件、ニューラル・ネットワークの推論及び／又は訓練で使用されるデータのバッチ・サイズ、又はこれらの要因の何からの組合せに応じて決められてもよい。 In at least one embodiment, any portion of the data storage 601 may be inside or outside one or more processors, or other hardware logic devices or circuits. In at least one embodiment, the data storage 601 includes a cache memory, a dynamic random addressable memory (“DRAM”: dynamic random addressable memory), and a static random addressable memory (“SRAM”: static random addressable memory). ), Non-volatile memory (eg, flash memory), or other storage. In at least one embodiment, the choice of whether the data storage 601 is, for example, inside or outside the processor, or whether it consists of DRAM, SRAM, flash, or any other type of storage. Storage available on-chip vs. off-chip, latency requirements for training and / or inference functions performed, batch size of data used in inference and / or training in neural networks, or factors thereof. It may be decided according to the combination of the above.

In at least one embodiment, the inference and/or training logic 615 may include, without limitation, data storage 605 to store backward and/or output weights and/or input/output data corresponding to neurons or layers of a neural network trained and/or used for inference in aspects of one or more embodiments. In at least one embodiment, the data storage 605 stores weight parameters and/or input/output data of each layer of a neural network trained or used in conjunction with one or more embodiments during backward propagation of input/output data and/or weight parameters during training and/or inference using aspects of one or more embodiments. In at least one embodiment, any portion of the data storage 605 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory. In at least one embodiment, any portion of the data storage 605 may be internal or external to one or more processors or other hardware logic devices or circuits. In at least one embodiment, the data storage 605 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, the choice of whether the data storage 605 is internal or external to a processor, for example, or whether it consists of DRAM, SRAM, flash, or some other type of storage, may depend on the storage available on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of the data used in inference and/or training of the neural network, or some combination of these factors.

In at least one embodiment, data storage 601 and data storage 605 may be separate storage structures. In at least one embodiment, data storage 601 and data storage 605 may be the same storage structure. In at least one embodiment, data storage 601 and data storage 605 may be partially the same storage structure and partially separate storage structures. In at least one embodiment, any portion of data storage 601 and data storage 605 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory.

In at least one embodiment, inference and/or training logic 615 may include, without limitation, one or more arithmetic logic units ("ALUs") 610 to perform logical and/or mathematical operations based, at least in part, on, or indicated by, training and/or inference code, the result of which may be activations (e.g., output values from layers or neurons within a neural network) stored in activation storage 620 that are functions of input/output and/or weight parameter data stored in data storage 601 and/or data storage 605. In at least one embodiment, activations stored in activation storage 620 are generated according to linear algebraic and/or matrix-based mathematics performed by ALUs 610 in response to executing instructions or other code, wherein weight values stored in data storage 605 and/or data storage 601 are used as operands along with other values, such as bias values, gradient information, momentum values, or other parameters or hyperparameters, any or all of which may be stored in data storage 605 or data storage 601 or another storage on-chip or off-chip. In at least one embodiment, ALUs 610 are included within one or more processors or other hardware logic devices or circuits, whereas in another embodiment, ALUs 610 may be external to a processor or other hardware logic device or circuit that uses them (e.g., a co-processor). In at least one embodiment, ALUs 610 may be included within a processor's execution units, or otherwise within a bank of ALUs accessible by a processor's execution units, either within the same processor or distributed among different processors of different types (e.g., central processing units, graphics processing units, fixed function units, etc.). In at least one embodiment, data storage 601, data storage 605, and activation storage 620 may be on the same processor or other hardware logic device or circuit, whereas in another embodiment, they may be in different processors or other hardware logic devices or circuits, or in some combination of same and different processors or other hardware logic devices or circuits. In at least one embodiment, any portion of activation storage 620 may be included with other on-chip or off-chip data storage, including a processor's L1, L2, or L3 cache or system memory. Furthermore, inference and/or training code may be stored with other code accessible to a processor or other hardware logic or circuit, and may be fetched and/or processed using a processor's fetch, decode, scheduling, execution, retirement, and/or other logical circuits.
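As a purely illustrative sketch, and not as a description of ALUs 610 or of any particular hardware, the way an activation may be produced as a function of stored weights, bias values, and input data can be expressed in Python. The function name `dense_forward`, the choice of a ReLU nonlinearity, and all numeric values are assumptions introduced for this example only.

```python
def dense_forward(weights, bias, inputs):
    """Return ReLU(W @ x + b) for one layer, using plain Python lists.

    `weights` and `bias` stand in for operands held in data storage;
    the returned list stands in for values written to activation storage.
    ReLU is an assumed nonlinearity, chosen only for illustration.
    """
    outputs = []
    for row, b in zip(weights, bias):
        # Linear-algebraic step: dot product of one weight row with the input.
        pre_activation = sum(w * x for w, x in zip(row, inputs)) + b
        # Nonlinear step: clamp negative values to zero.
        outputs.append(max(0.0, pre_activation))
    return outputs


# Hypothetical operand values.
weight_storage = [[1.0, -2.0], [0.5, 0.5]]
bias_storage = [0.0, 1.0]
activation_storage = dense_forward(weight_storage, bias_storage, [2.0, 1.0])
print(activation_storage)
```

In this sketch the weights and bias values are the operands and the returned list plays the role of the stored activations; a real implementation would typically perform the same arithmetic as batched matrix operations.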

In at least one embodiment, activation storage 620 may be cache memory, DRAM, SRAM, non-volatile memory (e.g., flash memory), or other storage. In at least one embodiment, activation storage 620 may be completely or partially internal or external to one or more processors or other logical circuits. In at least one embodiment, the choice of whether activation storage 620 is internal or external to a processor, for example, or composed of DRAM, SRAM, flash, or some other storage type may depend on the storage available on-chip versus off-chip, the latency requirements of the training and/or inference functions being performed, the batch size of data used in inference and/or training of a neural network, or some combination of these factors. In at least one embodiment, inference and/or training logic 615 illustrated in FIG. 6A may be used in conjunction with an application-specific integrated circuit ("ASIC"), such as a TensorFlow® Processing Unit from Google, an inference processing unit ("IPU") from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, inference and/or training logic 615 illustrated in FIG. 6A may be used in conjunction with central processing unit ("CPU") hardware, graphics processing unit ("GPU") hardware, or other hardware, such as field programmable gate arrays ("FPGAs").

FIG. 6B illustrates inference and/or training logic 615, according to at least one embodiment. In at least one embodiment, inference and/or training logic 615 may include, without limitation, hardware logic in which computational resources are dedicated to, or otherwise used exclusively in conjunction with, weight values or other information corresponding to one or more layers of neurons within a neural network. In at least one embodiment, inference and/or training logic 615 illustrated in FIG. 6B may be used in conjunction with an application-specific integrated circuit (ASIC), such as a TensorFlow® Processing Unit from Google, an inference processing unit (IPU) from Graphcore™, or a Nervana® (e.g., "Lake Crest") processor from Intel Corp. In at least one embodiment, inference and/or training logic 615 illustrated in FIG. 6B may be used in conjunction with central processing unit (CPU) hardware, graphics processing unit (GPU) hardware, or other hardware, such as field programmable gate arrays (FPGAs). In at least one embodiment, inference and/or training logic 615 includes, without limitation, data storage 601 and data storage 605, which may be used to store weight values and/or other information, including bias values, gradient information, momentum values, and/or other parameter or hyperparameter information. In at least one embodiment illustrated in FIG. 6B, each of data storage 601 and data storage 605 is associated with a dedicated computational resource, such as computational hardware 602 and computational hardware 606, respectively. In at least one embodiment, each of computational hardware 602 and computational hardware 606 comprises one or more ALUs that perform mathematical functions, such as linear algebraic functions, only on information stored in data storage 601 and data storage 605, respectively, the result of which is stored in activation storage 620.

In at least one embodiment, each of data storage 601 and 605 and the corresponding computational hardware 602 and 606, respectively, corresponds to a different layer of a neural network, such that the activation resulting from one "storage/computational pair 601/602" of data storage 601 and computational hardware 602 is provided as an input to the next "storage/computational pair 605/606" of data storage 605 and computational hardware 606, in order to mirror the conceptual organization of a neural network. In at least one embodiment, each of storage/computational pairs 601/602 and 605/606 may correspond to more than one neural network layer. In at least one embodiment, additional storage/computational pairs (not shown), subsequent to or in parallel with storage/computational pairs 601/602 and 605/606, may be included in inference and/or training logic 615.
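The chaining of storage/computational pairs described above, in which the activation of one pair becomes the input of the next, may be sketched in software as follows. This is a conceptual model only: the class `StorageComputePair`, the function `run_pipeline`, the ReLU nonlinearity, and the numeric values are hypothetical names and choices made for illustration, and they do not describe actual hardware.

```python
from dataclasses import dataclass


@dataclass
class StorageComputePair:
    """Hypothetical software stand-in for one storage/computational pair."""
    weights: list  # plays the role of a data storage (e.g., 601 or 605)
    bias: list

    def compute(self, inputs):
        # Plays the role of dedicated computational hardware (e.g., 602 or 606):
        # it operates only on the weights held by this pair.
        return [
            max(0.0, sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(self.weights, self.bias)
        ]


def run_pipeline(pairs, inputs):
    """Feed the activation of each pair into the next pair, in order."""
    activations = inputs
    for pair in pairs:
        activations = pair.compute(activations)
    return activations


# Two pairs corresponding to two consecutive layers (values are assumptions).
pair_601_602 = StorageComputePair([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0])
pair_605_606 = StorageComputePair([[2.0, 0.5]], [1.0])
result = run_pipeline([pair_601_602, pair_605_606], [3.0, 1.0])
print(result)
```

Additional pairs would simply be appended to the list passed to `run_pipeline`; pairs operating in parallel would require a different combining step that is not shown here.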

Neural Network Training and Deployment
FIG. 7 illustrates another embodiment for training and deployment of a deep neural network. In at least one embodiment, untrained neural network 706 is trained using training data set 702. In at least one embodiment, training framework 704 is a PyTorch framework, whereas in other embodiments, training framework 704 is a TensorFlow, Boost, Caffe, Microsoft Cognitive Toolkit/CNTK, MXNet, Chainer, Keras, Deeplearning4j, or other training framework. In at least one embodiment, training framework 704 trains untrained neural network 706 and enables it to be trained using processing resources described herein to generate trained neural network 708. In at least one embodiment, weights may be chosen randomly or by pre-training using a deep belief network. In at least one embodiment, training may be performed in a supervised, partially supervised, or unsupervised manner.

In at least one embodiment, untrained neural network 706 is trained using supervised learning, wherein training data set 702 includes an input paired with a desired output for that input, or wherein training data set 702 includes input having a known output and the output of neural network 706 is manually graded. In at least one embodiment, untrained neural network 706 is trained in a supervised manner, processes inputs from training data set 702, and compares the resulting outputs against a set of expected or desired outputs. In at least one embodiment, errors are then propagated back through untrained neural network 706. In at least one embodiment, training framework 704 adjusts the weights that control untrained neural network 706. In at least one embodiment, training framework 704 includes tools to monitor how well untrained neural network 706 is converging toward a model, such as trained neural network 708, suitable for generating correct answers, such as in result 714, based on known input data, such as new data 712. In at least one embodiment, training framework 704 trains untrained neural network 706 repeatedly while adjusting weights to refine the output of untrained neural network 706 using a loss function and an adjustment algorithm, such as stochastic gradient descent. In at least one embodiment, training framework 704 trains untrained neural network 706 until untrained neural network 706 achieves a desired accuracy. In at least one embodiment, trained neural network 708 can then be deployed to implement any number of machine learning operations.
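The supervised loop described above (process inputs, compare outputs against desired outputs, propagate the error back, adjust weights, and repeat until a desired accuracy is reached) may be sketched with a deliberately small model. The example below is a minimal sketch, assuming a single-weight linear model in place of a neural network, a squared-error loss, and per-sample stochastic gradient descent; the function name `train_linear_sgd`, the learning rate, the stopping threshold, and the data are all assumptions introduced for illustration.

```python
def train_linear_sgd(dataset, learning_rate=0.05, target_loss=1e-6, max_epochs=10_000):
    """Fit y = weight * x + bias by per-sample stochastic gradient descent."""
    weight, bias = 0.0, 0.0  # initial weights (could also be chosen randomly)
    loss = float("inf")
    for _ in range(max_epochs):
        for x, y_true in dataset:
            # Forward pass, then compare the output against the desired output.
            error = (weight * x + bias) - y_true
            # Gradient of the squared error with respect to each parameter.
            weight -= learning_rate * 2.0 * error * x
            bias -= learning_rate * 2.0 * error
        # Monitor convergence with the mean squared error over the data set.
        loss = sum(((weight * x + bias) - y) ** 2 for x, y in dataset) / len(dataset)
        if loss < target_loss:  # stop once the desired accuracy is reached
            break
    return weight, bias, loss


# Inputs paired with desired outputs, generated from y = 2x + 1.
training_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
weight, bias, final_loss = train_linear_sgd(training_data)
print(weight, bias, final_loss)
```

A training framework would apply the same pattern to every layer of a network, using backpropagation to obtain the per-weight gradients that are written out by hand here.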

In at least one embodiment, untrained neural network 706 is trained using unsupervised learning, wherein untrained neural network 706 attempts to train itself using unlabeled data. In at least one embodiment, training data set 702 for unsupervised learning includes input data without any associated output data or "ground truth" data. In at least one embodiment, untrained neural network 706 can learn groupings within training data set 702 and can determine how individual inputs are related to training data set 702. In at least one embodiment, unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 708 capable of performing operations useful in reducing the dimensionality of new data 712. In at least one embodiment, unsupervised training can also be used to perform anomaly detection, which allows identification of data points in new data set 712 that deviate from the normal patterns of new data set 712.
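The idea of anomaly detection without labels may be illustrated by a simple statistical stand-in. The sketch below is not a self-organizing map or a neural network; it assumes that "normal" behavior can be summarized by the mean and standard deviation of unlabeled reference values, and it flags new values that fall far from that summary. The function name `find_anomalies`, the threshold of three standard deviations, and the sample values are assumptions.

```python
from statistics import mean, stdev


def find_anomalies(reference_values, new_values, threshold=3.0):
    """Return new values lying more than `threshold` standard deviations
    from the mean of the unlabeled reference values."""
    center = mean(reference_values)
    spread = stdev(reference_values)
    return [v for v in new_values if abs(v - center) > threshold * spread]


# Unlabeled reference data: no "ground truth" output accompanies these inputs.
reference_values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
anomalies = find_anomalies(reference_values, [10.1, 9.7, 14.5])
print(anomalies)
```

A learned model would replace the mean and standard deviation with a richer description of normal patterns, but the decision rule (flag what deviates from the learned pattern) has the same shape.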

In at least one embodiment, semi-supervised learning may be used, which is a technique in which training data set 702 includes a mix of labeled and unlabeled data. In at least one embodiment, training framework 704 may be used to perform incremental learning, such as through transfer learning techniques. In at least one embodiment, incremental learning enables trained neural network 708 to adapt to new data 712 without forgetting the knowledge instilled within the network during initial training.

Data Center
FIG. 8 illustrates an example data center 800, in which at least one embodiment may be used. In at least one embodiment, data center 800 includes a data center infrastructure layer 810, a framework layer 820, a software layer 830, and an application layer 840.

In at least one embodiment, as shown in FIG. 8, data center infrastructure layer 810 may include a resource orchestrator 812, grouped computing resources 814, and node computing resources ("node C.R.s") 816(1)-816(N), where "N" represents any positive integer. In at least one embodiment, node C.R.s 816(1)-816(N) may include, but are not limited to, any number of central processing units ("CPUs") or other processors (including accelerators, field programmable gate arrays (FPGAs), graphics processors, etc.), memory devices (e.g., dynamic read-only memory), storage devices (e.g., solid-state or disk drives), network input/output ("NW I/O") devices, network switches, virtual machines ("VMs"), power modules, and cooling modules. In at least one embodiment, one or more node C.R.s from among node C.R.s 816(1)-816(N) may be a server having one or more of the above-mentioned computing resources.

In at least one embodiment, grouped computing resources 814 may include separate groupings of node C.R.s housed within one or more racks (not shown), or many racks housed in data centers at various geographical locations (also not shown). Separate groupings of node C.R.s within grouped computing resources 814 may include grouped compute, network, memory, or storage resources that may be configured or allocated to support one or more workloads. In at least one embodiment, several node C.R.s including CPUs or processors may be grouped within one or more racks to provide compute resources to support one or more workloads. In at least one embodiment, one or more racks may also include any number of power modules, cooling modules, and network switches, in any combination.

In at least one embodiment, resource orchestrator 822 may configure or otherwise control one or more node C.R.s 816(1)-816(N) and/or grouped computing resources 814. In at least one embodiment, resource orchestrator 822 may include a software design infrastructure ("SDI") management entity for data center 800. In at least one embodiment, the resource orchestrator may include hardware, software, or some combination thereof.

In at least one embodiment, as shown in FIG. 8, framework layer 820 includes a job scheduler 832, a configuration manager 834, a resource manager 836, and a distributed file system 838. In at least one embodiment, framework layer 820 may include a framework to support software 832 of software layer 830 and/or one or more applications 842 of application layer 840. In at least one embodiment, software 832 or applications 842 may respectively include web-based service software or applications, such as those provided by Amazon Web Services, Google Cloud, and Microsoft Azure. In at least one embodiment, framework layer 820 may be, but is not limited to, a type of free and open-source software web application framework, such as Apache Spark® (hereinafter "Spark"), that may use distributed file system 838 for large-scale data processing (e.g., "big data"). In at least one embodiment, job scheduler 832 may include a Spark driver to facilitate scheduling of workloads supported by various layers of data center 800. In at least one embodiment, configuration manager 834 may be capable of configuring different layers, such as software layer 830 and framework layer 820, including Spark and distributed file system 838, for supporting large-scale data processing. In at least one embodiment, resource manager 836 may be capable of managing clustered or grouped computing resources mapped to or allocated for support of distributed file system 838 and job scheduler 832. In at least one embodiment, clustered or grouped computing resources may include grouped computing resources 814 at data center infrastructure layer 810. In at least one embodiment, resource manager 836 may coordinate with resource orchestrator 812 to manage these mapped or allocated computing resources.

In at least one embodiment, software 832 included in software layer 830 may include software used by at least portions of node C.R.s 816(1)-816(N), grouped computing resources 814, and/or distributed file system 838 of framework layer 820. The one or more types of software may include, but are not limited to, Internet web page search software, e-mail virus scan software, database software, and streaming video content software.

In at least one embodiment, applications 842 included in application layer 840 may include one or more types of applications used by at least portions of node C.R.s 816(1)-816(N), grouped computing resources 814, and/or distributed file system 838 of framework layer 820. The one or more types of applications may include, but are not limited to, any number of genomics applications, cognitive compute applications, and machine learning applications, including training or inferencing software, machine learning framework software (e.g., PyTorch, TensorFlow, Caffe, etc.), or other machine learning applications used in conjunction with one or more embodiments.

In at least one embodiment, any of configuration manager 834, resource manager 836, and resource orchestrator 812 may implement any number and type of self-modifying actions based on any amount and type of data acquired in any technically feasible fashion. In at least one embodiment, self-modifying actions may relieve a data center operator of data center 800 from making possibly bad configuration decisions and may help avoid underutilized and/or poorly performing portions of a data center.

In at least one embodiment, data center 800 may include tools, services, software, or other resources to train one or more machine learning models, or to predict or infer information using one or more machine learning models, according to one or more embodiments described herein. For example, in at least one embodiment, a machine learning model may be trained by calculating weight parameters according to a neural network architecture using the software and computing resources described above with respect to data center 800. In at least one embodiment, trained machine learning models corresponding to one or more neural networks may be used to infer or predict information using the resources described above with respect to data center 800, by using weight parameters calculated through one or more training techniques described herein.

In at least one embodiment, a data center may use CPUs, application-specific integrated circuits (ASICs), GPUs, FPGAs, or other hardware to perform training and/or inferencing using the above-described resources. Moreover, one or more of the software and/or hardware resources described above may be configured as a service that allows users to train models or perform inferencing of information, such as image recognition, speech recognition, or other artificial intelligence services.

Inference and/or training logic 615 is used to perform inferencing and/or training operations associated with one or more embodiments. In at least one embodiment, inference and/or training logic 615 may be used in the system of FIG. 8 for inferencing or predicting operations based, at least in part, on weight parameters calculated using neural network training operations, neural network functions and/or architectures, or neural network use cases described herein.

As described herein, a method, computer-readable medium, and system are disclosed for estimating an object-to-object pose using an externally captured image of the objects. In accordance with FIGS. 1-4, embodiments may provide neural networks usable for performing inferencing operations and for providing inferenced data, where the neural networks are stored (partially or wholly) in one or both of data storage 601 and 605 in inference and/or training logic 615, as depicted in FIGS. 6A and 6B. Training and deployment of the neural networks may be performed as depicted in FIG. 7 and described herein. Distribution of the neural networks may be performed using one or more servers in data center 800, as depicted in FIG. 8 and described herein.
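One common way to obtain an object-to-object pose from two poses that are each expressed with respect to the same camera, assumed here for illustration rather than taken from this description, is to represent each pose as a 4x4 homogeneous rigid transform and to compose the inverse of one with the other. In the sketch below, `cam_T_first` and `cam_T_target` denote the poses of the first object and the target object in the camera frame, and `relative_pose` returns the pose of the target object in the coordinate frame of the first object; inverting the result gives the opposite direction. All function names and numeric values are hypothetical.

```python
def make_pose(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a 3-vector."""
    rows = [list(rotation[i]) + [translation[i]] for i in range(3)]
    return rows + [[0.0, 0.0, 0.0, 1.0]]


def invert_pose(pose):
    """Invert a rigid transform: rotation R^T, translation -R^T t."""
    rot_t = [[pose[j][i] for j in range(3)] for i in range(3)]
    trans = [-sum(rot_t[i][k] * pose[k][3] for k in range(3)) for i in range(3)]
    return make_pose(rot_t, trans)


def compose(a, b):
    """Matrix product of two 4x4 transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def relative_pose(cam_T_first, cam_T_target):
    """Pose of the target object in the first object's coordinate frame."""
    return compose(invert_pose(cam_T_first), cam_T_target)


# Assumed example: the first object is rotated 90 degrees about the camera z-axis
# and sits 2 units in front of the camera; the target is unrotated at (1, 0, 2).
cam_T_first = make_pose([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
                        [0.0, 0.0, 2.0])
cam_T_target = make_pose([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
                         [1.0, 0.0, 2.0])
first_T_target = relative_pose(cam_T_first, cam_T_target)
print([row[3] for row in first_T_target[:3]])  # translation of target in first-object frame
```

Because the camera frame cancels out in the composition, the result does not depend on where the external camera is placed, provided both poses are estimated from the same image.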

Claims

A method comprising:
identifying an image of a first object and a target object, the image captured by a camera external to the first object and the target object;
processing the image, using a first neural network, to estimate a first pose of the target object with respect to the camera;
processing the image, using a second neural network, to estimate a second pose of the first object with respect to the camera; and
calculating a third pose of the first object with respect to the target object, using the first pose and the second pose.

The method according to claim 1, wherein the image is a red-green-blue (RGB) image or a grayscale image.

The method of claim 1, wherein the camera captures one of a wavelength of light or a wavelength of non-light.

The method of claim 1, wherein the first object is a robotic gripping system.

The method of claim 4, wherein the target object is a known object to be gripped by the robotic gripping system.

The method of claim 5, further comprising causing the robotic gripping system to grip the known object using the third pose.

The method of claim 1, wherein the first pose of the target object with respect to the camera comprises a three-dimensional (3D) rotation and translation of the target object with respect to the camera.

The method of claim 1, wherein the second pose of the first object with respect to the camera comprises a 3D rotation and translation of the first object with respect to the camera.

The method of claim 1, wherein the second neural network performs online calibration of the camera.

The method of claim 1, wherein the third pose is a pose of the target object with respect to a coordinate frame of the first object.

The method of claim 1, wherein only the first neural network is trained on the target object.

The method of claim 1, wherein only the second neural network is trained on the first object.

The method of claim 1, further comprising:
refining the first pose; and
refining the second pose.

The method of claim 13, wherein refining the first pose is performed by iteratively matching the image with a synthetic projection of a model according to the first pose, and adjusting parameters of the first pose based on results of the iterative matching.

The method of claim 13, wherein refining the second pose is performed by iteratively matching the image with a synthetic projection of a model according to the second pose, and adjusting parameters of the second pose based on results of the iterative matching.

A non-transitory computer-readable medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform a method comprising:
identifying an image of a first object and a target object, wherein the image is captured by a camera external to the first object and the target object;
processing the image, using a first neural network, to estimate a first pose of the target object with respect to the camera;
processing the image, using a second neural network, to estimate a second pose of the first object with respect to the camera; and
calculating a third pose of the first object with respect to the target object, using the first pose and the second pose.

A system comprising:
a first neural network that receives as input an image of a first object and a target object and processes the image to estimate a first pose of the target object with respect to a camera, wherein the image is captured by the camera, which is external to the first object and the target object;
a second neural network that receives the image as input and processes the image to estimate a second pose of the first object with respect to the camera; and
a processor that calculates a third pose of the first object with respect to the target object, using the first pose and the second pose.

The system of claim 17, wherein the camera is external to the system.

The system of claim 17, wherein the first object is a robotic gripping system and the target object is a known object.

The system of claim 19, wherein the processor causes the robotic gripping system to grip the target object using the third pose.