KR102478980B1

KR102478980B1 - 3D contrastive learning apparatus and method for unsupervised 6D pose estimation

Info

Publication number: KR102478980B1
Application number: KR1020200169346A
Authority: KR
Inventors: 이정호
Original assignee: 주식회사 플라잎
Priority date: 2020-12-07
Filing date: 2020-12-07
Publication date: 2022-12-19
Also published as: KR20220080341A

Abstract

A 6D pose estimation apparatus estimates an object pose using unsupervised learning and includes: a feature extraction unit comprising an appearance feature extraction unit, which extracts appearance features by inputting RGB data of an image into a first deep learning model, and a geometric feature extraction unit, which extracts geometric features by inputting depth data of the image into a second deep learning model generated based on a network pre-trained on a user-defined pretext task; a combining unit that combines the extracted appearance features and geometric features to generate a feature map for the image; and an object pose estimation unit that estimates a 6D pose of an object corresponding to the image based on the generated feature map. At each training iteration, the network receives n pieces of input data and is pre-trained by augmenting the n pieces of input data.

Description

3D contrastive learning apparatus and method for unsupervised 6D pose estimation {3D CONTRASTIVE LEARNING APPARATUS AND METHOD FOR UNSUPERVISED 6D POSE ESTIMATION}

The present invention relates to a 6D pose estimation apparatus and method.

In general, object pose estimation relies on a deep learning model trained with labeled training data. That is, the pose of an object can be estimated by inputting 3D data of the object into the trained deep learning model.

High-quality data is very important for training a deep learning model for 6D (x, y, z, roll, pitch, yaw) pose estimation. Because arbitrary data has to be generated in order to build a large dataset, performance in a real environment may degrade. In particular, building a dataset for handling objects with six degrees of freedom (6DOF) from 3D data is very demanding and takes far more time and cost than in 2D.

Unsupervised learning is a type of machine learning that belongs to the class of problems concerned with discovering how data is structured. Unlike supervised learning or reinforcement learning, it is a learning method that finds regularities in the inputs using training data that contains only input values.

Self-supervised learning is a type of unsupervised learning in which a network is trained on a user-defined pretext task using unlabeled data and is then transferred, by transfer learning, to the downstream task that actually needs to be solved.

Korean Registered Patent Publication No. 1994316 (registered June 24, 2019)

The present invention is intended to solve the above-described problems of the prior art, and aims to provide a 6D pose estimation apparatus that can resolve the performance degradation that occurs when the pose of an object is estimated using a deep learning model trained on arbitrary data.

It also aims to provide a 6D pose estimation apparatus capable of efficiently estimating an object pose using only unlabeled data or data containing a small number of labels.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means of achieving the above technical object, one embodiment of the present invention may provide a 6D pose estimation apparatus for estimating an object pose using unsupervised learning, the apparatus including: a feature extraction unit comprising an appearance feature extraction unit that extracts appearance features by inputting RGB data of an image into a first deep learning model, and a geometric feature extraction unit that extracts geometric features by inputting depth data of the image into a second deep learning model generated based on a network trained on a user-defined pretext task; a combining unit that combines the extracted appearance features and geometric features to generate a feature map for the image; and an object pose estimation unit that estimates a 6D pose of an object corresponding to the image based on the generated feature map, wherein the network receives n pieces of input data at each training iteration and is pre-trained by augmenting the n pieces of input data.

Another embodiment of the present invention may provide a 6D pose estimation method for estimating an object pose using unsupervised learning, the method including: a feature extraction step comprising extracting appearance features by inputting RGB data of an image into a first deep learning model, and extracting geometric features by inputting depth data of the image into a second deep learning model generated based on a network pre-trained on a user-defined pretext task; generating a feature map for the image by combining the extracted appearance features and geometric features; and estimating a 6D pose of an object corresponding to the image based on the generated feature map, wherein the network receives n pieces of input data at each training iteration and is pre-trained by augmenting the n pieces of input data.

The above-described means for solving the problems are merely illustrative and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, there may be additional embodiments described in the drawings and the detailed description.

According to any one of the above-described means for solving the problems, it is possible to provide a 6D pose estimation apparatus capable of efficiently estimating the pose of an object using only unlabeled data or data containing a small number of labels.

In addition, the estimated 6D information of the object can be used to perform not only robotic Pick & Place tasks but also complex tasks including assembly.

FIG. 1 is a block diagram of a 6D pose estimation apparatus according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining the configuration of a 6D pose estimation apparatus according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram for explaining the augmentation unit of the pretext task execution unit according to an embodiment of the present invention.
FIG. 4 is a flowchart of a 6D pose estimation method according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the present invention clearly, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed therebetween. In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that this does not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by an apparatus connected to that terminal or device. Likewise, some of the operations or functions described as being performed by an apparatus may be performed by a terminal or device connected to that apparatus.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a 6D pose estimation apparatus according to an embodiment of the present invention. Referring to FIG. 1, the 6D pose estimation apparatus 100 may include a feature extraction unit 110, a combining unit 120, an object pose estimation unit 130, and a pretext task execution unit 140.

The feature extraction unit 110 may include an appearance feature extraction unit 111 and a geometric feature extraction unit 112. The object pose estimation unit 130 may include a semantic segmentation module 131, a keypoint detection module 132, and a center voting module 133. The pretext task execution unit 140 may include a sampling unit 141, an augmentation unit 142, and a training unit 143. These components 110 to 140 are, however, merely examples of components that can be controlled by the 6D pose estimation apparatus 100.

The 6D pose estimation apparatus 100 according to an embodiment of the present invention can efficiently estimate the pose of an object using only unlabeled data or data containing a small number of labels.

In addition, the 6D pose estimation apparatus 100 according to an embodiment of the present invention can use the estimated 6D information of the object to perform not only robotic Pick & Place tasks but also complex tasks including assembly.

FIG. 2 is an exemplary diagram for explaining the configuration of a 6D pose estimation apparatus according to an embodiment of the present invention. Referring to FIG. 2, the feature extraction unit 110 may include the appearance feature extraction unit 111 and the geometric feature extraction unit 112.

The appearance feature extraction unit 111 according to an embodiment of the present invention may extract appearance features by inputting RGB data 111a of an image into a first deep learning model 111b. Here, the first deep learning model 111b may be a convolutional neural network (CNN). For example, the appearance feature extraction unit 111 may receive the RGB data 111a and extract appearance features of the image through the CNN 111b.

The geometric feature extraction unit 112 may extract geometric features by inputting depth data 112a of the image into a second deep learning model 112b generated based on a network pre-trained on a user-defined pretext task. Here, the second deep learning model 112b may be a PointNet model. For example, the geometric feature extraction unit 112 may receive the depth data 112a and extract geometric features of the image through the second deep learning model 112b.

Here, the network 143a may be pre-trained by receiving n pieces of input data at each training iteration and augmenting those n pieces of input data. For example, the network 143a may be pre-trained with n or more pieces of input data obtained by augmenting n pieces of input data sampled from a dataset of N items. Here, the n pieces of input data may be batch data.

The geometric feature extraction unit 112 according to an embodiment of the present invention may obtain the weights of the second deep learning model 112b by transfer learning from the weights of the separate network 143a. Here, the network 143a may be trained with unlabeled data through self-supervised learning, which is a type of unsupervised learning. For example, the network 143a may be trained on a user-defined pretext task. Because using a pretext task for self-supervised learning improves the model's understanding of the 3D data itself, performance in real environments can be improved when estimating 6D poses.

The geometric feature extraction unit 112 may extract geometric features of the image using the second deep learning model 112b, which is obtained by transfer learning from the network trained on the pretext task.

The pretext task execution unit 140 according to an embodiment of the present invention may perform the pretext task based on the network 143a. For example, the pretext task execution unit 140 repeatedly trains the network 143a using unlabeled data, that is, the user-defined pretext task (input), so that the network 143a comes to understand the data and extract meaningful features (output) from it.

The sampling unit 141 of the pretext task execution unit 140 may sample, from a dataset of N items, n pieces of input data consisting of one piece of positive data and negative data other than the positive data, and input them into the network.

For example, the sampling unit 141 may extract n pieces of input data from the dataset of N items. Among the extracted n pieces of input data, the sampling unit 141 may designate one piece as positive data and the remaining n-1 pieces as negative data. Here, the single piece of positive data is selected at random from among the n pieces of input data.
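The sampling step described above can be sketched as follows. This is a minimal illustration only: the function name `sample_batch`, the use of NumPy, and the placeholder dataset are assumptions made for the example and do not appear in the specification.

```python
import numpy as np

def sample_batch(dataset, n, rng):
    """Draw n distinct items from the N-item dataset and mark one as positive.

    Returns the indices of the sampled items and a boolean mask in which
    exactly one entry (chosen at random) is True; the other n-1 are negatives.
    """
    idx = rng.choice(len(dataset), size=n, replace=False)
    is_positive = np.zeros(n, dtype=bool)
    is_positive[int(rng.integers(n))] = True
    return idx, is_positive

rng = np.random.default_rng(0)
# Hypothetical dataset: N = 100 placeholder point clouds of 64 points each.
dataset = [np.zeros((64, 3)) for _ in range(100)]
batch_idx, is_positive = sample_batch(dataset, n=8, rng=rng)
```

Sampling without replacement keeps the n-1 negatives distinct from the positive, which matches the description of the negatives as data "other than the positive data".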

The augmentation unit 142 according to an embodiment of the present invention may augment the n pieces of input data into n or more pieces by applying, to each piece a preset number of times, any one of rotation, cropping, noise addition, resizing, sampling, and distortion. For example, the augmentation unit 142 may rotate, crop, add noise to, resize, sample from, or distort each of the n pieces of input data twice. In this case, the n or more pieces of input data may number, for example, 2n.

FIG. 3 is an exemplary diagram for explaining the augmentation unit of the pretext task execution unit according to an embodiment of the present invention. Referring to FIG. 3, the augmentation unit 142 may augment the extracted n pieces of input data into 2n pieces by randomly rotating, cropping, adding noise to, resizing, sampling from, or distorting them.
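A minimal sketch of turning n inputs into 2n augmented views is given below. It assumes each input is a point cloud stored as a NumPy array of shape (P, 3) and implements only two of the listed operations (rotation and cropping). The helper names, the choice of rotation axis, and the crop ratio are hypothetical.

```python
import numpy as np

def random_rotation_z(points, rng):
    """Rotate the cloud by a random angle about the z axis (axis choice is an assumption)."""
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return points @ rot.T

def random_crop(points, rng, keep_ratio=0.8):
    """Keep the keep_ratio fraction of points nearest to a randomly chosen seed point."""
    seed = points[rng.integers(len(points))]
    dist = np.linalg.norm(points - seed, axis=1)
    keep = np.argsort(dist)[: max(1, int(len(points) * keep_ratio))]
    return points[keep]

def augment_batch(batch, rng, views_per_item=2):
    """Apply one randomly chosen operation to each item, views_per_item times."""
    ops = [random_rotation_z, random_crop]
    views, source = [], []
    for i, pts in enumerate(batch):
        for _ in range(views_per_item):
            op = ops[rng.integers(len(ops))]
            views.append(op(pts, rng))
            source.append(i)  # remembers which original item each view came from
    return views, source

rng = np.random.default_rng(0)
batch = [rng.normal(size=(100, 3)) for _ in range(4)]  # n = 4
views, source = augment_batch(batch, rng)              # 2n = 8 views
```

The `source` list records which views share an origin, which is the information a contrastive objective needs to tell the two views of the positive item apart from the negatives.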

For example, referring to FIG. 3(a), the augmentation unit 142 may rotate the input data to generate a pair of input data that are left-right symmetric, and referring to FIG. 3(b), the augmentation unit 142 may generate a pair of input data in which part of the input data has been cropped. Using input data rotated or cropped by the augmentation unit 142 can improve the object pose estimation unit 130's understanding of the object.

The augmentation unit 142 may also augment the input data by adding noise to it. If only noise-free input data from a virtual environment is used, the object pose estimation unit 130 may understand objects less well in real images, which do contain noise.

Therefore, the augmentation unit 142 may augment the n pieces of input data into n or more pieces by randomly adding noise to them. For example, the augmentation unit 142 may add to or subtract from the input data a random value within the range of the depth standard deviation.
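One way to read the noise step above is sketched below. The sketch assumes the input is a depth map in which zero marks missing depth, and that "within the range of the depth standard deviation" means a uniform perturbation of at most one standard deviation of the valid depth values. Both are interpretations rather than details given in the specification.

```python
import numpy as np

def add_depth_noise(depth, rng):
    """Perturb each valid depth value by a random amount within +/- one
    standard deviation of the valid depths; invalid (zero) pixels are untouched."""
    valid = depth > 0                      # assumption: 0 encodes missing depth
    if not valid.any():
        return depth.copy()
    sigma = depth[valid].std()
    noise = rng.uniform(-sigma, sigma, size=depth.shape)
    return np.where(valid, depth + noise, depth)

rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 1.5, size=(4, 4))
depth[0, 0] = 0.0                          # one missing pixel
noisy = add_depth_noise(depth, rng)
```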

In addition, the augmentation unit 142 may augment the n pieces of input data into n or more pieces by randomly resizing them, sampling from them, or distorting them. This allows the object pose estimation unit 130 to better understand the object.

Referring again to FIG. 2, the training unit 143 according to an embodiment of the present invention may train the network 143a so that, among the latent vectors of the n or more pieces of input data that have passed through the network 143a, and taking the latent vector of one piece of positive data as a reference, a high score is assigned to the latent vector of the other piece of positive data and a low score is assigned to the latent vectors of the negative data.

That is, the training unit 143 may train the network 143a by assigning scores to the input data. For example, the training unit 143 may take the latent vector of one piece of positive data as a reference among the latent vectors of the n or more pieces of input data that have passed through the network 143a. Among the latent vectors of the input data other than the reference, the training unit 143 may train the network to assign a high score to the latent vector of positive data and low scores to the latent vectors of negative data.
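The scoring scheme above resembles the contrastive objectives commonly used in self-supervised learning. The sketch below uses cosine similarity as the score and an InfoNCE-style cross-entropy as the loss. The specification names neither of these, so the score function, the loss, and the `temperature` parameter are all assumptions chosen for illustration.

```python
import numpy as np

def contrastive_loss(anchor, candidates, positive_index, temperature=0.1):
    """Score every candidate latent vector against the anchor (the reference
    positive) and return a loss that is small when the other positive view
    scores higher than all negatives.

    anchor:      (D,)   latent vector of the reference positive view
    candidates:  (M, D) latent vectors of the other positive view and the negatives
    """
    a = anchor / np.linalg.norm(anchor)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = c @ a                                   # cosine similarity in [-1, 1]
    logits = scores / temperature
    shifted = logits - logits.max()                  # for numerical stability
    log_prob = shifted - np.log(np.exp(shifted).sum())
    return -log_prob[positive_index], scores

anchor = np.array([1.0, 0.0])
candidates = np.array([[0.9, 0.1],    # other view of the positive
                       [-1.0, 0.0],   # negative
                       [0.0, 1.0]])   # negative
loss_correct, scores = contrastive_loss(anchor, candidates, positive_index=0)
loss_wrong, _ = contrastive_loss(anchor, candidates, positive_index=1)
```

Minimizing this loss with respect to the network weights raises the score of the positive pair and lowers the scores of the negatives, which is the behavior the passage describes.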

Referring again to FIG. 2, the feature extraction unit 110 may pass to the combining unit 120 the appearance features of the image extracted by the appearance feature extraction unit 111 and the geometric features of the image extracted by the geometric feature extraction unit 112.

The combining unit 120 according to an embodiment of the present invention may generate a feature map for the image by combining the extracted appearance features and geometric features. For example, the combining unit 120 may combine the appearance features extracted by the appearance feature extraction unit 111 with the geometric features extracted by the geometric feature extraction unit 112 to generate a feature map that holds a combined feature for each coordinate (point) of the object recognized in the image.
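A per-point combination of the two feature sets could look like the sketch below. It assumes the appearance and geometric features have already been aligned so that row i of each array describes the same point, and it uses channel-wise concatenation. The specification only says the features are "combined", so concatenation is an assumption.

```python
import numpy as np

def fuse_features(appearance, geometry):
    """Concatenate per-point appearance and geometric features.

    appearance: (P, Da) features sampled from the RGB branch at each point
    geometry:   (P, Dg) features from the depth / point-cloud branch
    returns:    (P, Da + Dg) combined feature map
    """
    if appearance.shape[0] != geometry.shape[0]:
        raise ValueError("both branches must describe the same P points")
    return np.concatenate([appearance, geometry], axis=1)

rng = np.random.default_rng(0)
appearance = rng.normal(size=(5, 32))
geometry = rng.normal(size=(5, 64))
feature_map = fuse_features(appearance, geometry)
```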

The object pose estimation unit 130 according to an embodiment of the present invention may estimate a 6D pose of an object corresponding to the image based on the generated feature map. For example, the object pose estimation unit 130 may estimate the 6D pose of the object (136) based on the feature map that holds a combined feature for each coordinate of the object recognized in the image.

The object pose estimation unit 130 may calculate the 6D pose of the object based on the semantic segmentation module 131, the keypoint detection module 132, and the center voting module 133.

For example, the object pose estimation unit 130 receives the feature map and first separates the one or more objects contained in the image through the semantic segmentation module 131. It then detects 3D keypoints on the surface of each separated object through the keypoint detection module 132, and finally detects the center point of the object through the center voting module 133 to estimate the 6D pose of the object (136).

When the image contains a plurality of objects, the semantic segmentation module 131 according to an embodiment of the present invention may distinguish each object based on the feature map.

The keypoint detection module 132 may detect 3D keypoints on the surface of each separated object. For example, the keypoint detection module 132 may predict, for each visible point on the object, the Euclidean translation offset from that visible point to a target keypoint, and then detect the 3D keypoints on the surface of the separated object by having the points vote for the target keypoint and clustering the votes.
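The voting step can be illustrated as follows. Each visible point adds its predicted offset to its own position to cast a vote for each keypoint, and the votes are then reduced to a single estimate. The specification says the votes are clustered but does not name the algorithm; the sketch substitutes a per-coordinate median as a simple, outlier-tolerant stand-in, which is an assumption.

```python
import numpy as np

def vote_keypoints(points, offsets):
    """Aggregate per-point votes into keypoint estimates.

    points:  (P, 3)    visible points on the object surface
    offsets: (P, K, 3) predicted offset from each point to each of K keypoints
    returns: (K, 3)    estimated keypoint positions
    """
    votes = points[:, None, :] + offsets   # (P, K, 3): one vote per point per keypoint
    return np.median(votes, axis=0)        # stand-in for the clustering step

rng = np.random.default_rng(0)
true_keypoints = np.array([[0.1, 0.2, 0.3], [-0.4, 0.0, 0.5]])
points = rng.normal(size=(50, 3))
# Ideal offsets for illustration; a trained network would predict noisy ones.
offsets = true_keypoints[None, :, :] - points[:, None, :]
estimated = vote_keypoints(points, offsets)
```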

The center voting module 133 may detect the center point of the object. For example, the center voting module 133 may extend the center point of the object from 2D to 3D.

Instance segmentation 134 may extract global features and local features from the image based on the objects, the 3D keypoints of the objects, and the center points detected by the semantic segmentation module 131, the keypoint detection module 132, and the center voting module 133. To predict the offsets to the keypoints, instance segmentation 134 may use learned size information to distinguish objects that are similar in shape but different in size.

The object pose estimation unit 130 may perform least-squares fitting 135 based on the features extracted through the semantic segmentation module 131, the keypoint detection module 132, the center voting module 133, and instance segmentation 134. For example, given M keypoints detected in the camera coordinate system and the corresponding points in the object coordinate system, the object pose estimation unit 130 may estimate the 6D pose of the object (136) by calculating the pose parameters (R, t) of the object model that minimize the squared loss.
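For M corresponding point pairs, the least-squares problem above has a well-known closed-form solution based on the singular value decomposition (often called the Kabsch or Umeyama method). The sketch below shows that standard solution. The specification does not state which solver is used, so this should be read as one common choice rather than the patented procedure. The function name `fit_pose` is hypothetical.

```python
import numpy as np

def fit_pose(object_pts, camera_pts):
    """Return (R, t) minimizing sum_i || R @ p_i + t - q_i ||^2.

    object_pts: (M, 3) keypoints p_i in the object coordinate system
    camera_pts: (M, 3) detected keypoints q_i in the camera coordinate system
    """
    mu_p = object_pts.mean(axis=0)
    mu_q = camera_pts.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (object_pts - mu_p).T @ (camera_pts - mu_q)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_q - R @ mu_p
    return R, t

# Round-trip check with a known pose.
angle = 0.7
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.2, -0.1, 1.5])
rng = np.random.default_rng(0)
object_pts = rng.normal(size=(8, 3))
camera_pts = object_pts @ R_true.T + t_true
R_est, t_est = fit_pose(object_pts, camera_pts)
```

At least three non-collinear correspondences are needed for the rotation to be uniquely determined.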

Meanwhile, when the data is a 3D point cloud, the object pose estimation unit 130 may voxelize the data, in which case the pretext task execution unit 140 may also additionally voxelize its input data.
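Voxelization of a point cloud can be sketched as below: points are binned into a regular grid and each occupied cell is represented by the centroid of its points. The grid size and the choice of centroid as the per-voxel representative are assumptions, since the specification does not describe the voxelization scheme.

```python
import numpy as np

def voxelize(points, voxel_size):
    """Return one centroid per occupied voxel of edge length voxel_size.

    points:  (P, 3) point cloud
    returns: (V, 3) centroids, ordered by voxel index
    """
    keys = np.floor(points / voxel_size).astype(np.int64)
    uniq, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)          # flatten across NumPy versions
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, points)       # accumulate points per voxel
    counts = np.bincount(inverse, minlength=len(uniq))
    return sums / counts[:, None]

points = np.array([[0.1, 0.1, 0.1],
                   [0.2, 0.2, 0.2],
                   [1.5, 0.1, 0.1]])
centroids = voxelize(points, voxel_size=1.0)
```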

By improving its understanding of data containing objects through the pretext task execution unit 140, which uses self-supervised learning, the 6D pose estimation apparatus 100 according to the present invention enables the object pose estimation unit 130 to estimate the 6D pose of an object in an image more accurately.

In addition, the 6D pose estimation apparatus 100 may use the estimated 6D pose information of the object to enable a robot to perform not only Pick & Place tasks but also a variety of complex tasks including assembly. For example, the 6D pose estimation apparatus 100 may determine the position and angle of the object using the 6D information of the object.
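To make "position and angle" concrete, the sketch below converts pose parameters (R, t) into the six values (x, y, z, roll, pitch, yaw). It assumes the common Z-Y-X convention, R = Rz(yaw) Ry(pitch) Rx(roll). The specification does not state a convention, so this choice and the function names are illustrative.

```python
import numpy as np

def rotation_from_rpy(roll, pitch, yaw):
    """Build R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def pose_to_6d(R, t):
    """Return [x, y, z, roll, pitch, yaw] for a rotation matrix R and translation t.

    Valid away from gimbal lock (|pitch| close to 90 degrees), which this
    sketch does not handle.
    """
    pitch = np.arcsin(-np.clip(R[2, 0], -1.0, 1.0))
    roll = np.arctan2(R[2, 1], R[2, 2])
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return np.array([t[0], t[1], t[2], roll, pitch, yaw])

R = rotation_from_rpy(0.1, -0.2, 0.3)
pose6d = pose_to_6d(R, np.array([0.5, 0.0, 1.2]))
```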

FIG. 4 is a flowchart of a 6D pose estimation method according to an embodiment of the present invention. The method shown in FIG. 4 includes steps processed in time series by the 6D pose estimation apparatus 100 according to the embodiments shown in FIGS. 1 to 3. Therefore, even where omitted below, the description given above of the 6D pose estimation apparatus 100 according to the embodiments shown in FIGS. 1 to 3 also applies to the method of estimating an object pose.

In step S410, the 6D pose estimation apparatus 100 may extract appearance features by inputting RGB data of an image to a first deep learning model.

In step S420, the 6D pose estimation apparatus 100 may extract geometric features by inputting depth data of the image to a second deep learning model that is based on a network trained on a pretext task.

In step S430, the 6D pose estimation apparatus 100 may generate a feature map for the image by combining the extracted appearance features and geometric features.

In step S440, the 6D pose estimation apparatus 100 may estimate the 6D pose of an object corresponding to the image based on the generated feature map.

In the above description, steps S410 to S440 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed.
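The data flow of steps S410 to S440 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the three models are passed in as placeholder callables, per-point concatenation is assumed as the way features are "combined", and the names `fuse_features` and `estimate_pose` are hypothetical.

```python
def fuse_features(appearance, geometric):
    """S430 (assumed form): concatenate per-point appearance and geometric vectors."""
    if len(appearance) != len(geometric):
        raise ValueError("both feature sets must cover the same points")
    return [list(a) + list(g) for a, g in zip(appearance, geometric)]


def estimate_pose(rgb, depth, appearance_model, geometry_model, pose_head):
    """Run the four steps in order; each model is a caller-supplied callable."""
    appearance = appearance_model(rgb)                   # S410: appearance features
    geometric = geometry_model(depth)                    # S420: geometric features
    feature_map = fuse_features(appearance, geometric)   # S430: combined feature map
    return pose_head(feature_map)                        # S440: 6D pose estimate
```

Passing the models as arguments keeps the sketch independent of any particular network; in the claimed arrangement the first would be a CNN and the second a PointNet model whose weights come from the pre-trained network.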

The foregoing description of the present invention is illustrative, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: 6D pose estimation apparatus
110: feature extraction unit
120: combining unit
130: object pose estimation unit
140: pretext task performing unit

Claims

A 6D pose estimation apparatus for estimating an object pose using unsupervised learning, comprising:
a feature extraction unit including an appearance feature extraction unit that extracts appearance features by inputting RGB data of an image to a first deep learning model, and a geometric feature extraction unit that extracts geometric features by inputting depth data of the image to a second deep learning model generated based on a network pre-trained on a pretext task;
a combining unit that generates a feature map for the image by combining the extracted appearance features and geometric features; and
an object pose estimation unit that estimates a 6D pose of an object corresponding to the image based on the generated feature map,
wherein the network is pre-trained by receiving n input data at each training iteration and increasing the n input data,
the apparatus further comprising a pretext task performing unit that performs the pretext task based on the network,
wherein the pretext task performing unit includes:
a sampling unit that samples, from N data sets, the n input data including one positive data and negative data other than the positive data, and inputs the sampled data to the network;
an increasing unit that increases the input data to n or more by performing any one of rotation, cropping, noise addition, resizing, sampling, and distortion on each of the n input data a predetermined number of times; and
a learning unit that trains the network so that, based on the latent vector of any one data among the latent vectors of the n or more input data, a high score is given to the latent vector of the other positive data and a low score is given to the latent vectors of the negative data.

The 6D pose estimation apparatus of claim 1,
wherein the first deep learning model is a convolutional neural network (CNN) and the second deep learning model is a PointNet model.

(Deleted)

The 6D pose estimation apparatus of claim 1,
wherein the geometric feature extraction unit sets the weights of the second deep learning model by transfer learning from the weights of the network.

The 6D pose estimation apparatus of claim 1,
wherein the object pose estimation unit
calculates a pose for the object based on a semantic segmentation module, a keypoint detection module, and a center voting module.

The 6D pose estimation apparatus of claim 7,
wherein the semantic segmentation module,
when a plurality of objects are present in the image, distinguishes each object based on the feature map.

The 6D pose estimation apparatus of claim 8,
wherein the keypoint detection module
detects 3D keypoints on the surface of each distinguished object.

The 6D pose estimation apparatus of claim 9,
wherein the center voting module
detects the center point of the object.

A 6D pose estimation method for estimating an object pose using unsupervised learning, comprising:
a feature extraction step including extracting appearance features by inputting RGB data of an image to a first deep learning model,
and extracting geometric features by inputting depth data of the image to a second deep learning model generated based on a network pre-trained on a pretext task;
generating a feature map for the image by combining the extracted appearance features and geometric features; and
estimating a 6D pose of an object corresponding to the image based on the generated feature map,
wherein the second deep learning model is pre-trained by receiving n input data at each training iteration and increasing the n input data,
the method further comprising performing the pretext task based on the network,
wherein performing the pretext task includes:
sampling, from N data sets, the n input data including one positive data and negative data other than the positive data, and inputting the sampled data to the network;
increasing the input data to n or more by performing any one of rotation, cropping, noise addition, resizing, sampling, and distortion on each of the n input data a predetermined number of times; and
training the network so that, based on the latent vector of any one positive data among the latent vectors of the n or more input data that have passed through the network, a high score is given to the latent vector of the other positive data and a low score is given to the latent vectors of the negative data.

The 6D pose estimation method of claim 11,
wherein the first deep learning model is a convolutional neural network (CNN) and the second deep learning model is a PointNet model.

(Deleted)

The 6D pose estimation method of claim 11,
wherein extracting the geometric features includes setting the weights of the second deep learning model by transfer learning from the weights of the network.

The 6D pose estimation method of claim 11,
wherein estimating the 6D pose of the object includes
calculating a pose for the object based on a semantic segmentation module, a keypoint detection module, and a center voting module.

The 6D pose estimation method of claim 17,
wherein the semantic segmentation module,
when a plurality of objects are present in the image, distinguishes each object based on the feature map.

The 6D pose estimation method of claim 18,
wherein the keypoint detection module
detects 3D keypoints on the surface of each distinguished object.

The 6D pose estimation method of claim 19,
wherein the center voting module
detects the center point of the object.