KR102288566B1 - Method and system for configuring deep neural network

Info

Publication number: KR102288566B1
Application number: KR1020190139519A
Authority: KR
Inventors: 전세영; 박동원; 서용혁
Original assignee: 울산과학기술원
Priority date: 2019-11-04
Filing date: 2019-11-04
Publication date: 2021-08-11
Also published as: KR20210053649A

Abstract

Disclosed are a method and an apparatus for configuring a deep neural network. According to an embodiment of the present invention, the method for configuring a deep neural network includes: configuring a convolutional neural network (CNN) that includes a plurality of convolutional layers for processing various images; and disposing a Rotation Ensemble Module (REM) for rotation-invariant robot grasp inference at the end of the CNN.

Description

DEEP NEURAL NETWORK CONFIGURATION METHOD AND SYSTEM

The present invention relates to a method and an apparatus for configuring a deep neural network that is rotation-invariant for robot grasp detection. A Rotation Ensemble Module (REM), which applies anchor box rotation to a classification (Cls) approach, is placed at the end of a CNN so that robot grasp decisions can be made accurately and quickly.

Data-driven robot grasp detection for novel objects has been studied extensively in recent years. Prior-art methods for robot grasp detection include, for example, a machine learning method that ranks, among all candidates, the most graspable image patch, including its orientation and gripper distance (Saxena et al.), and a method that identifies the grasp region with improved accuracy in a 5D robotic environment (Jiang et al.).

In particular, models following the two-stage, classification-based approach used a sparse auto-encoder (SAE), an early deep learning model, to rank the most graspable candidate image patches from a sliding window with multi-modal information (color, depth, and surface normals), and thereby detected robot grasps.

In models following the single-stage, regression-based approach, an AlexNet-based robot grasp detection method was proposed that improved detection accuracy while keeping computation time short.

In models following the multibox-based approach, a multibox-based robot grasp detection method was proposed that divides the entire input image into an S × S grid.

In models following the hybrid approach, high accuracy in robot grasp prediction was achieved with state-of-the-art computation time by predicting the robot's gripping force and estimating the robot grasp parameters from a high-resolution grasp probability map.

Robot grasp inference using deep neural networks has advanced considerably in recent years, and computer vision and robot vision techniques that account for rotation invariance were developed extensively in the past.

In general, rotation invariance is very important in robot grasp inference, but because it is technically difficult to account for rotation invariance in deep neural networks, little research has been done so far.

Therefore, there is a pressing need for a technique that improves the robot grasp recognition rate by combining deep neural networks with rotation invariance.

An object of an embodiment of the present invention is to provide a method and an apparatus for configuring a deep neural network that achieve a high recognition rate and a fast operating speed for robot grasping by applying a REM, which uses anchor box rotation, to the end of a CNN.

Another object of an embodiment of the present invention is to improve the performance of grasp information inference by giving state-of-the-art deep-neural-network-based robot grasp inference rotation invariance.

Another object of an embodiment of the present invention is to implement rotation invariance simply by adding a rotation ensemble module, without changing the existing deep neural network structure, so that the desired goal is achieved while the size and computational cost of the deep neural network remain unchanged.

According to an embodiment of the present invention, a method for configuring a deep neural network may include: configuring a convolutional neural network (CNN) that includes a plurality of convolutional layers for processing various images; and disposing a Rotation Ensemble Module (REM) for rotation-invariant robot grasp inference at the end of the CNN.

In addition, according to an embodiment of the present invention, an apparatus for configuring a deep neural network may include: a configuration unit that configures a convolutional neural network (CNN) including a plurality of convolutional layers for processing various images; and an arrangement unit that disposes a REM for rotation-invariant robot grasp inference at the end of the CNN.

According to an embodiment of the present invention, it is possible to provide a method and an apparatus for configuring a deep neural network that achieve a high recognition rate and a fast operating speed for robot grasping by applying a REM, which uses anchor box rotation, to the end of a CNN.

In addition, according to an embodiment of the present invention, the performance of grasp information inference can be improved by giving state-of-the-art deep-neural-network-based robot grasp inference rotation invariance.

In addition, according to an embodiment of the present invention, rotation invariance is implemented simply by adding a rotation ensemble module, without changing the existing deep neural network structure, so that the desired goal can be achieved while the size and computational cost of the deep neural network remain unchanged.

FIG. 1 is a block diagram illustrating the configuration of an apparatus for configuring a deep neural network according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example configuration of a convolutional neural network (CNN).
FIG. 3 is a diagram illustrating the structure of a convolutional neural network (CNN) according to the present invention.
FIG. 4 is a diagram illustrating a 5D grasp representation according to the present invention.
FIG. 5 is a diagram illustrating the structure of a CNN according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating learning-based vision in relation to robot calibration according to the present invention.
FIG. 7 is a diagram illustrating the calibration error in x and y of the robot coordinate system as the number of training samples increases, according to the present invention.
FIG. 8 is a flowchart illustrating a method for configuring a deep neural network according to an embodiment of the present invention.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of the patent application is not restricted or limited by these embodiments. All modifications, equivalents, and substitutes of the embodiments should be understood to fall within the scope of the rights.

The terms used in the embodiments are used for description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to exclude in advance the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in this application.

In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure, and duplicate descriptions of them are omitted. In describing an embodiment, a detailed description of related known technology is omitted where it is judged that it could unnecessarily obscure the gist of the embodiment.

FIG. 1 is a block diagram illustrating the configuration of an apparatus for configuring a deep neural network according to an embodiment of the present invention.

Referring to FIG. 1, a deep neural network configuration apparatus 100 according to an embodiment of the present invention may include a configuration unit 110 and an arrangement unit 120. Depending on the embodiment, the deep neural network configuration apparatus 100 may additionally include a processing unit 130.

First, the configuration unit 110 configures a convolutional neural network (CNN) that includes a plurality of convolutional layers for processing various images.

A convolutional neural network (CNN) may be the most widely used algorithm in deep learning, a type of machine learning in which a model directly classifies images, video, text, sound, and the like. A CNN is useful for finding patterns in images in order to recognize objects, faces, scenes, and so on; it learns directly from data, uses patterns to classify images, and can extract features automatically. CNNs are widely used in fields that require object recognition and computer vision, such as autonomous vehicles and face recognition applications.

That is, the configuration unit 110 may serve to configure a CNN that includes at least a convolutional layer and that identifies and classifies features of images and the like. In configuring the CNN, the configuration unit 110 may include a ReLU layer, a pooling layer, and the like in addition to the convolutional layer.

FIG. 2 is a diagram illustrating an example configuration of a convolutional neural network (CNN).

In FIG. 2, the convolutional layer may pass the input image through a set of convolution filters, each of which activates a particular feature of the image.

The ReLU (Rectified Linear Unit) layer may map negative values to 0 and keep positive values, so that only activated features are passed to the next layer, which enables faster and more effective training.

The pooling layer may simplify the output by performing non-linear downsampling, which reduces the number of parameters the network has to learn.

Together, the convolutional layer, the ReLU layer, and the pooling layer act as one filter and may constitute the FEATURE LEARNING stage.

As shown in FIG. 2, each filter is applied to each training image at a different resolution, and the output of each filter may be used as the input to the next layer.

The result output by the FEATURE LEARNING stage goes through the classification process in the CLASSIFICATION stage, which identifies the input image.
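As a rough illustration of the convolution, ReLU, and pooling steps described for FIG. 2, the following minimal NumPy sketch chains the three operations on a single-channel image. The function names (`conv2d`, `relu`, `max_pool`, `feature_block`), the single filter, and the 2 × 2 pooling window are assumptions made for this example and are not taken from the specification.

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one filter over a single-channel image (no padding, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Map negative values to 0 and keep positive values."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-linear downsampling: keep the maximum of each size x size window."""
    h = x.shape[0] // size * size
    w = x.shape[1] // size * size
    x = x[:h, :w]
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

def feature_block(image, kernel):
    """One convolution -> ReLU -> pooling stage (hypothetical helper)."""
    return max_pool(relu(conv2d(image, kernel)))
```

A 6 × 6 input with a 3 × 3 filter yields a 4 × 4 convolution output and a 2 × 2 pooled output, which shows how the spatial size shrinks from stage to stage.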

The arrangement unit 120 disposes a REM for rotation-invariant robot grasp inference at the end of the CNN.

The REM (Rotation Ensemble Module) may compute the validity of an image candidate by comparing, in an ensemble filter, the feature points of the image selected for every frame with the feature points of patches that have passed through a variance filter. In addition, after computing a reference angle of the image, the REM may compensate the feature points for the difference between that reference angle and the rotation change rate (rotation angle) of the patches input to the ensemble filter. The REM may then determine reliable patches as image candidates by comparing the compensated feature points with the feature points of the image.

That is, the arrangement unit 120 may serve to dispose, at the end of the CNN, a REM that rotates the image at various angles, extracts features, and derives the optimal feature by combining them.

To determine where the end of the CNN is, the deep neural network configuration apparatus 100 of the present invention may further include a processing unit 130.

To determine the end of the CNN, the processing unit 130 may identify, among the plurality of convolutional layers, the two layers with the smallest processable image sizes, and determine the position between the two identified layers as the end of the CNN.

As shown in FIG. 2 above, the CNN includes a plurality of convolutional layers whose region sizes decrease sequentially.

Taking this structure of the CNN into account, the processing unit 130 may determine the position between the convolutional layer with the smallest region size and the convolutional layer with the next smallest region size as the end of the CNN.
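The selection rule above can be sketched as a small helper that takes the region size of each convolutional layer, in network order, and returns the pair of layer indices between which the REM would be placed. The function name `find_network_end` and the example size list are hypothetical; the sketch also assumes the layer sizes are distinct.

```python
def find_network_end(region_sizes):
    """Return the indices (in network order) of the two convolutional layers
    with the smallest region sizes; the REM would sit between them.

    Assumption for this sketch: region_sizes holds one distinct integer per layer.
    """
    if len(region_sizes) < 2:
        raise ValueError("at least two convolutional layers are required")
    by_size = sorted(range(len(region_sizes)), key=lambda i: region_sizes[i])
    first, second = sorted(by_size[:2])
    return first, second

# Hypothetical layer sizes that shrink through the network.
sizes = [360, 180, 90, 45, 23, 11]
print(find_network_end(sizes))
```

With these sizes the two smallest layers are the last two, so the end of the network is the position between indices 4 and 5.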

In another embodiment, the processing unit 130 may identify, among the plurality of convolutional layers, the two layers associated with a skip connection, and determine the position between the two identified layers as the end of the CNN.

Here, a 'skip connection' may be a connection that enables fine adjustment of the image processing in the two layers it connects. For example, in the CNN of FIG. 5, described later, a convolutional layer of size '23x23' and a convolutional layer of size '11x11' are connected by a skip connection. In the case of FIG. 5, the processing unit 130 may determine the position between the '23x23' convolutional layer and the '11x11' convolutional layer as the end of the CNN.

The present invention can be realized by having the arrangement unit 120 dispose the REM at the end of the CNN determined by the processing unit 130.

Depending on the embodiment, the configuration unit 110 may configure the REM by assigning indices that specify how the image is rotated during image processing.

That is, the configuration unit 110 may configure the REM to include n first ensemble modules, each of which is assigned an index that determines the angle by which the image is rotated. Each first ensemble module may be a means for rotating the image according to the index assigned to it.

In assigning the indices, the configuration unit 110 may assign a different index to each first ensemble module such that the difference in angle between neighboring first ensemble modules is constant. That is, the configuration unit 110 may assign the indices so that the rotation angles are evenly spaced across the plurality of first ensemble modules (for example, at 45-degree intervals).

For example, when the image is to be processed based on four cases in which it is rotated by 0, 45, 90, and 135 degrees, the configuration unit 110 may configure the REM with four first ensemble modules, each assigned one of the four angles as its index.
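The evenly spaced indices can be generated with a one-line helper. This is only a sketch: the function name `rotation_indices` is hypothetical, and the 180-degree span is an assumption inferred from the 0, 45, 90, 135 degree example rather than a value stated in the specification.

```python
def rotation_indices(n, span_deg=180.0):
    """Return n rotation angles (in degrees) with a constant step between
    neighbours, starting at 0. span_deg is an assumed total angular range."""
    if n < 1:
        raise ValueError("n must be at least 1")
    step = span_deg / n
    return [k * step for k in range(n)]

print(rotation_indices(4))  # [0.0, 45.0, 90.0, 135.0]
```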

Thereafter, the configuration unit 110 may derive a first feature by combining, in parallel, the n features extracted by the n first ensemble modules. That is, the configuration unit 110 may determine, as the first feature, a balanced feature of the image computed by convolving the features from the plurality of first ensemble modules.

In the example above, by configuring a REM that determines as the first feature the feature obtained by convolving the four features from the four first ensemble modules with indices of 0, 45, 90, and 135 degrees, the configuration unit 110 can output a balanced feature of these first ensemble modules.

In another embodiment, the configuration unit 110 may estimate the robot grasp accuracy for each of the n features extracted by the n first ensemble modules and derive, as the first feature, the feature with the highest estimated accuracy. That is, for each feature output from a first ensemble module, the configuration unit 110 may predict the accuracy the robot would achieve when grasping, select the feature with the highest predicted accuracy, and determine it as the first feature. In the example above, the configuration unit 110 may configure a REM that estimates the robot grasp accuracy for the four features (rotated images) from the four first ensemble modules with indices of 0, 45, 90, and 135 degrees and, if the 45-degree module yields the highest accuracy, determines the feature from that module as the first feature.

In addition, the configuration unit 110 may configure the REM to further include a second ensemble module to which no index is assigned, and may output a result for the robot grasp by combining, through multiplication, the first feature with a second feature extracted by the second ensemble module.

That is, the configuration unit 110 computes the final feature by convolving the second feature, from the second ensemble module to which no index is assigned, with the first feature determined earlier from the first ensemble modules.
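The data flow described above (rotate by each index, combine the branches in parallel, then combine the result with an unindexed branch by multiplication) can be sketched in NumPy as follows. This is a simplified illustration, not the patented implementation: nearest-neighbour rotation, averaging as the parallel combination, and an identity second branch are all stand-ins chosen for this example, and the names `rotate_nearest` and `rem_forward` are hypothetical.

```python
import numpy as np

def rotate_nearest(feature, angle_deg):
    """Rotate a 2D feature map about its centre with nearest-neighbour
    sampling; samples that fall outside the map are filled with 0."""
    h, w = feature.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, look up its source pixel.
    src_x = np.cos(theta) * (xs - cx) + np.sin(theta) * (ys - cy) + cx
    src_y = -np.sin(theta) * (xs - cx) + np.cos(theta) * (ys - cy) + cy
    src_x = np.rint(src_x).astype(int)
    src_y = np.rint(src_y).astype(int)
    valid = (src_x >= 0) & (src_x < w) & (src_y >= 0) & (src_y < h)
    out = np.zeros_like(feature)
    out[valid] = feature[src_y[valid], src_x[valid]]
    return out

def rem_forward(feature, angles=(0, 45, 90, 135)):
    """Sketch of the REM data flow under the stated assumptions."""
    # First ensemble modules: one rotated branch per angle index.
    branches = [rotate_nearest(feature, a) for a in angles]
    # Assumed parallel combination: element-wise average of the branches.
    first_feature = np.mean(branches, axis=0)
    # Assumed second (unindexed) module: the unrotated feature itself.
    second_feature = feature
    # Combine the two features by element-wise multiplication.
    return first_feature * second_feature
```

In a trained network each branch and the combination step would carry learned weights; the sketch only shows how the indexed and unindexed paths meet.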

FIG. 3 is a diagram illustrating the structure of a convolutional neural network (CNN) according to the present invention.

FIG. 3(a) illustrates a rotation ensemble module combined with a deep neural network for robot grasp information inference.

As shown in FIG. 3(a), the deep neural network configuration apparatus 100 disposes a REM (Rotation Ensemble Module) for rotation-invariant robot grasp inference at the end of the CNN.

The end of the CNN may be the position between the two layers identified, among the plurality of convolutional layers, as having the smallest processable image sizes.

Alternatively, the end of the CNN may be the position between the two layers identified, among the plurality of convolutional layers, as being associated with a skip connection.

FIG. 3(b) is a structural diagram of the rotation ensemble module, showing the detailed configuration of the REM.

By placing the REM at the end of the CNN, the deep neural network configuration apparatus 100 enables fast and accurate image-based robot grasp decisions.

The deep neural network configuration apparatus 100 may configure the REM to include four first ensemble modules assigned rotation-angle indices, for example 0, 45, 90, and 135 degrees, and may derive the first feature by combining the features output from the first ensemble modules.

The deep neural network configuration apparatus 100 may then combine the derived first feature, by multiplication, with the second feature from the second ensemble module to which no index is assigned, and pass the final result to the next layer.

By configuring a REM that processes the image while rotating it at various angles (e.g., 0, 45, 90, and 135 degrees) and combining it as part of an existing deep neural network, the deep neural network configuration apparatus 100 allows the REM to be used not only for robot grasp information inference but also for various deep-neural-network-based computer vision tasks that require rotation invariance (e.g., face detection).

The deep neural network configuration apparatus 100 may predict 5D robot grasp representations for multiple objects from a given color image (RGB) and, where available, a depth image (RGB-D). The 5D robot grasp representation consists of the position $(x, y)$, the angle $\theta$, the gripper opening width $w$, and the parallel gripper plate size $h$, and may be written as follows.

$$\{x, y, \theta, w, h\}$$

Here, MultiGrasp parameterizes the representation by additionally considering the probability $z$ for each grid cell and $c = \cos\theta$, $s = \sin\theta$.

$$\{t^x, t^y, \theta, t^w, t^h, t^z\}$$

$$\text{where } x = \sigma(t^x) + c_x,\quad y = \sigma(t^y) + c_y,\quad w = p_w \exp(t^w),\quad h = p_h \exp(t^h),\quad z = \sigma(t^z)$$

$\sigma(\cdot)$ is the sigmoid function, $\exp(\cdot)$ is the exponential function, $p_h$ and $p_w$ are the predefined height and width of the anchor box, respectively, and $(c_x, c_y)$ may refer to the top-left corner of each grid cell.
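To make the parameterization concrete, the following sketch decodes raw network outputs into a 5D grasp plus its probability using the relations given above. The function name `decode_grasp`, the dictionary output, and the example values are hypothetical; only the five decoding formulas come from the text.

```python
import math

def sigmoid(t):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def decode_grasp(t_x, t_y, t_w, t_h, t_z, theta, c_x, c_y, p_w, p_h):
    """Decode raw outputs for one anchor box in grid cell (c_x, c_y).

    x, y : grasp centre, offset from the cell's top-left corner
    w, h : opening width and plate size, scaled from the anchor box (p_w, p_h)
    z    : grasp probability for this cell and anchor
    """
    return {
        "x": sigmoid(t_x) + c_x,
        "y": sigmoid(t_y) + c_y,
        "theta": theta,
        "w": p_w * math.exp(t_w),
        "h": p_h * math.exp(t_h),
        "z": sigmoid(t_z),
    }

# With all raw outputs at 0, the grasp sits at the cell centre with the anchor's size.
print(decode_grasp(0.0, 0.0, 0.0, 0.0, 0.0, theta=45.0, c_x=2, c_y=2, p_w=1.5, p_h=0.5))
```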

FIG. 4 is a diagram illustrating a 5D grasp representation according to the present invention.

FIG. 4(a) illustrates a 5D grasp representation with position (x, y), orientation θ, gripper opening width w, and plate size h.

FIG. 4(b) illustrates, for grid cell (2, 2), the 5D grasp representation when the predefined anchor box (dotted box) is rotated by θ (solid box).

In FIG. 4, x, y, w, and h are normalized so that the size of each grid cell is 1 × 1. Finally, unlike in MultiGrasp, the angle θ is modeled as a discrete value rather than a continuous value.

Instead of predicting $(x, y)$ in image coordinates, the deep neural network configuration apparatus 100 estimates the robot grasp position as an $(x, y)$ offset from the top-left corner $(c_x, c_y)$ of each grid cell. For an S × S grid, the apparatus can estimate the $(x, y)$ offset with $(c_x, c_y) \in \{(c_x, c_y) \mid c_x, c_y \in \{0, 1, \ldots, S-1\}\}$.

따라서 주어진 (cx, cy) 에 대해 (x, y)의 범위는, 'c_x < x < c_x+1, c_y < y < c_y+1'의 시그모이드 함수를 사용하여, 재매개변수화(re-parametrization)를 한다.Thus for a given (cx, cy) the range of (x, y) is remediated using the sigmoid function _{'c x} < x < c _x +1, c _y < y < c _{y +1'} Re-parametrization.

앵커 박스 접근법은 객체 검출에도 유용하기 때문에, 로봇 파지 감지에 적용할 수 있다. 딥 뉴럴 네트워크 구성 장치(100)는, 앵커 박스를 이용한 재 모델링으로 인하여 w, h를 다양한 크기의 기대치와 관련된 추정치 t^w, t^h로 변환한 다음, 모든 앵커 박스 후보 중에서 최상의 파악 표현을 분류할 수 있다.Since the anchor box approach is also useful for object detection, it can be applied to robot grip detection. ^{The deep neural network construction apparatus 100 converts w, h into estimates t w} , t ^h related to expectations of various sizes due to remodeling using an anchor box, and then classifies the best grasp expression among all anchor box candidates. can

FIG. 5 is a diagram illustrating the structure of a CNN according to an embodiment of the present invention.

FIG. 5 illustrates CNN structures for image classification that use Alexnet, Darknet-19, Resnet-50, and the like.

The pre-trained networks are modified to output robot grasp parameters, and the deep neural network configuration apparatus 100 builds the CNN from a plurality of convolutional layers so that images of any size can be processed.

In addition, the deep neural network configuration apparatus 100 may include convolutional layers that process images of up to 360 × 360 for grasp detection without resizing, and may add a skip connection layer toward the end of the network so that fine-grained features can be used. For example, for Darknet-19 as shown in FIG. 5, the deep neural network configuration apparatus 100 may add a skip connection layer between the last 3 × 3 × 512 layer and the last convolutional layer. Depending on the embodiment, the deep neural network configuration apparatus 100 may add a similar skip connection layer for Resnet-50 between the convolutional layer immediately before the last max pooling layer and the detection layer.

FIG. 6 is a diagram for explaining learning-based vision-robot calibration according to the present invention.

FIG. 6 illustrates a learning-based, fully automatic vision-robot calibration method.

In (1) of FIG. 6, a known small object (e.g., a round one) is placed at a known position.

In (2) of FIG. 6, the robot grasps the object at the known position.

In (3) of FIG. 6, the robot moves the object to an arbitrary position and places it there.

In (4) of FIG. 6, the robot returns to its original position.

In (5) of FIG. 6, the vision system predicts where the object is located.

The robot may repeat the above procedure to collect many samples.

Thereafter, as in (6) of FIG. 6, the deep neural network configuration apparatus 100 may map between the 5D grasp representation in vision coordinates and in robot coordinates using linear or nonlinear regression, or using a simple nonlinear neural network.
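The linear-regression option in step (6) can be illustrated with a small least-squares fit. The sketch below is an assumption-laden example rather than the calibration procedure of the invention: it maps only the (x, y) position instead of the full 5D representation, it assumes an affine model, and the helper names `solve_linear`, `fit_affine`, and `apply_affine` as well as the sample data are hypothetical.

```python
def solve_linear(a, b):
    """Solve a small dense system a @ x = b by Gauss-Jordan elimination.

    Uses partial pivoting; assumes the system is non-singular.
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_affine(vision_xy, robot_xy):
    """Least-squares fit of robot_axis = a*vx + b*vy + c for each robot axis."""
    rows = [[vx, vy, 1.0] for vx, vy in vision_xy]
    # Normal equations: (R^T R) theta = R^T target
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for axis in range(2):
        rhs = [sum(r[i] * t[axis] for r, t in zip(rows, robot_xy)) for i in range(3)]
        params.append(solve_linear(gram, rhs))
    return params


def apply_affine(params, vx, vy):
    """Map a vision-frame point into the robot frame with the fitted parameters."""
    return tuple(p[0] * vx + p[1] * vy + p[2] for p in params)


# Synthetic samples (hypothetical): robot = (2*vx + 10, -3*vy + 5).
vision = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
robot = [(2 * vx + 10, -3 * vy + 5) for vx, vy in vision]
params = fit_affine(vision, robot)
print(apply_affine(params, 4.0, 4.0))  # close to (18, -7)
```

Each repetition of steps (1) to (5) contributes one (vision, robot) pair, so collecting more samples simply adds rows to the fit.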

FIG. 7 is a diagram for explaining the calibration error in x and y of the robot coordinate system as the number of training samples increases, according to the present invention.

FIG. 7 shows that the cross error decreases as the number of samples increases.

For example, with 40 or more samples the cross error is 1.5 or less. The deep neural network configuration apparatus 100 may therefore use 40 or more samples to raise the calibration precision, which supports high-accuracy calibration when high-resolution images are used.

Hereinafter, the workflow of the deep neural network configuration apparatus 100 according to embodiments of the present invention is described in detail with reference to FIG. 8.

FIG. 8 is a flowchart illustrating a method for configuring a deep neural network according to an embodiment of the present invention.

The method for configuring a deep neural network according to this embodiment may be performed by the deep neural network configuration apparatus 100.

First, the deep neural network configuration apparatus 100 configures a convolutional neural network (CNN) including a plurality of convolutional layers for processing various images (810).

Step 810 may be a process of configuring a CNN that includes at least a convolutional layer and that identifies and classifies features of images and the like. In configuring the CNN, the deep neural network configuration apparatus 100 may include a ReLU layer, a pooling layer, and the like in addition to the convolutional layers.

In addition, the deep neural network configuration apparatus 100 places a REM for rotation-invariant robot grasp inference at the end of the CNN (820).

Step 820 may be a process of placing, at the end of the CNN, a Rotation Ensemble Module (REM) that extracts features by rotating the image at various angles and derives an optimal feature by combining them.

In determining the end of the CNN, the deep neural network configuration apparatus 100 may identify, among the plurality of convolutional layers, the two layers with the smallest processable image sizes, and may determine the position between the two identified layers as the end of the CNN.

The CNN is composed of a plurality of convolutional layers whose region sizes decrease sequentially.

Taking this structure into account, the deep neural network configuration apparatus 100 may determine the position between the convolutional layer with the smallest region size and the convolutional layer with the next smallest region size as the end of the CNN.
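The selection rule just described amounts to ranking the convolutional layers by region size and taking the two smallest. A minimal sketch follows; the function name `find_network_end` and the list of sizes are hypothetical (only the values 23 and 11 echo the sizes cited for FIG. 5, the rest are made up for illustration).

```python
def find_network_end(layer_sizes):
    """Return the indices of the two layers with the smallest region sizes.

    The REM would be placed between the two returned layers. Indices are
    returned in network order (earlier layer first).
    """
    if len(layer_sizes) < 2:
        raise ValueError("need at least two convolutional layers")
    ranked = sorted(range(len(layer_sizes)), key=lambda i: layer_sizes[i])
    return tuple(sorted(ranked[:2]))


# Hypothetical region sizes of successive convolutional layers.
sizes = [360, 180, 90, 45, 23, 11]
print(find_network_end(sizes))  # (4, 5)
```

Because the sizes decrease sequentially, the two smallest layers are normally the last two, so the rule places the REM just before the final convolutional layer.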

In another embodiment, the deep neural network configuration apparatus 100 may identify, among the plurality of convolutional layers, two layers associated with a skip connection, and may determine the position between the two identified layers as the end of the CNN.

Here, a 'skip connection' may be a connection that enables fine adjustment in the image processing of the two layers it connects. For example, the CNN of FIG. 5 illustrates a skip connection that links a convolutional layer of size '23x23' with a convolutional layer of size '11x11'. In the case of FIG. 5, the deep neural network configuration apparatus 100 may determine the position between the '23x23' convolutional layer and the '11x11' convolutional layer as the end of the CNN.

Depending on the embodiment, the deep neural network configuration apparatus 100 may configure the REM by assigning indices used to rotate the image during image processing.

That is, the deep neural network configuration apparatus 100 may configure the REM to include n first ensemble modules, each assigned an index that determines the angle by which the image is rotated. Each first ensemble module may be a means for rotating the image according to the index assigned to it.

In assigning the indices, the deep neural network configuration apparatus 100 may assign a different index to each first ensemble module such that the angle difference between neighboring first ensemble modules is constant. In other words, the indices may be assigned so that the rotation angles are evenly spaced across the first ensemble modules (for example, at 45-degree intervals).

For example, to process an image based on four cases in which the image is rotated by 0, 45, 90, and 135 degrees, the deep neural network configuration apparatus 100 may configure the REM with four first ensemble modules, each assigned one of the four angles as its index.
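The evenly spaced index assignment can be written down directly. In the sketch below, the function name `rotation_indices` is hypothetical, and the 180-degree span is an assumption chosen because it reproduces the 0, 45, 90, and 135 degree example for n = 4; the text itself only requires a constant angle difference between neighboring modules.

```python
def rotation_indices(n, span_degrees=180.0):
    """Return n rotation angles with a constant step between neighbors.

    span_degrees is an assumed total range; the step is span_degrees / n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    step = span_degrees / n
    return [k * step for k in range(n)]


print(rotation_indices(4))  # [0.0, 45.0, 90.0, 135.0]
```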

Thereafter, the deep neural network configuration apparatus 100 may derive a first feature by combining, in parallel, the n features extracted by the n first ensemble modules. That is, the deep neural network configuration apparatus 100 may determine as the first feature a balanced feature of the image, computed by convolving the features from the plurality of first ensemble modules.

In the above example, by configuring a REM that determines as the first feature the feature obtained by convolving the four features from the four first ensemble modules with indices of 0, 45, 90, and 135 degrees, the deep neural network configuration apparatus 100 can output a balanced feature of these first ensemble modules.

In another embodiment, the deep neural network configuration apparatus 100 may estimate, for the n features extracted by the n first ensemble modules, the accuracy of a robot grasp, and may derive as the first feature the feature with the highest estimated accuracy among the n features. That is, for each feature output from a first ensemble module, the deep neural network configuration apparatus 100 predicts the accuracy that would result if the robot grasped on that basis, selects the feature with the highest predicted accuracy, and determines it to be the first feature. In the above example, the deep neural network configuration apparatus 100 may configure a REM that estimates the robot grasp accuracy for the four features (rotated images) from the four first ensemble modules with indices of 0, 45, 90, and 135 degrees, and that determines as the first feature the feature from the first ensemble module with the 45-degree index, which has the highest accuracy.
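The selection in this embodiment is an argmax over the estimated accuracies. The sketch below assumes the accuracies have already been estimated by some means the text does not specify; the function name `select_first_feature`, the placeholder feature labels, and the accuracy values are hypothetical.

```python
def select_first_feature(features, estimated_accuracy):
    """Pick the feature whose estimated grasp accuracy is highest.

    Returns (position, feature) so the caller can recover the rotation index.
    """
    if not features or len(features) != len(estimated_accuracy):
        raise ValueError("features and accuracies must be non-empty and aligned")
    best = max(range(len(features)), key=lambda i: estimated_accuracy[i])
    return best, features[best]


angles = [0, 45, 90, 135]                      # rotation indices from the example
features = ["f_0", "f_45", "f_90", "f_135"]    # placeholders for feature maps
accuracy = [0.61, 0.87, 0.70, 0.55]            # hypothetical estimates
best, first_feature = select_first_feature(features, accuracy)
print(angles[best], first_feature)  # 45 f_45
```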

In addition, the deep neural network configuration apparatus 100 may configure the REM to further include a second ensemble module to which no index is assigned, and may output a result for the robot grasp by combining, through a product, the first feature and a second feature extracted by the second ensemble module.

That is, the deep neural network configuration apparatus 100 computes the final feature by convolving the second feature, which comes from the second ensemble module with no assigned index, with the first feature determined earlier from the first ensemble modules.
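One plausible reading of combining the two features "through a product" is an element-wise multiplication of two feature maps of equal shape. The sketch below shows only that reading, which is an assumption; the text does not fix the exact operation, and the function name `multiply_features` and the sample values are hypothetical.

```python
def multiply_features(first, second):
    """Element-wise product of two 2D feature maps given as nested lists."""
    same_rows = len(first) == len(second)
    if not same_rows or any(len(a) != len(b) for a, b in zip(first, second)):
        raise ValueError("feature maps must have the same shape")
    return [[a * b for a, b in zip(ra, rb)] for ra, rb in zip(first, second)]


first = [[0.5, 1.0], [2.0, 0.0]]    # hypothetical first feature
second = [[2.0, 3.0], [0.5, 4.0]]   # hypothetical second feature
print(multiply_features(first, second))  # [[1.0, 3.0], [1.0, 0.0]]
```

Under this reading the first feature acts as a per-location weight on the second feature, so locations that the rotation ensemble scores low are suppressed in the output.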

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, those of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims below.

100: deep neural network configuration apparatus
110: configuration unit 120: placement unit
130: processing unit

Claims

A method for configuring a deep neural network, the method comprising:
configuring a convolutional neural network (CNN) including a plurality of convolutional layers for processing various images;
configuring a Rotation Ensemble Module (REM) for rotation-invariant robot grasp inference, the REM including n first ensemble modules (where n is a natural number greater than or equal to 4), each assigned an index that determines an angle by which an image is rotated;
placing the REM at the end of the CNN;
estimating, for the n features extracted by the n first ensemble modules, the accuracy of a robot grasp; and
deriving, as a first feature, the feature with the highest estimated accuracy among the n features.

The method of claim 1, further comprising:
identifying, among the plurality of convolutional layers, the two layers with the smallest processable image sizes; and
determining the position between the two identified layers as the end of the CNN.

The method of claim 1, further comprising:
identifying, among the plurality of convolutional layers, two layers associated with a skip connection; and
determining the position between the two identified layers as the end of the CNN.

(deleted)

The method of claim 1, further comprising:
configuring the REM to further include a second ensemble module to which no index is assigned; and
outputting a result for the robot grasp by combining, through a product, the first feature and a second feature extracted by the second ensemble module.

The method of claim 1, further comprising:
assigning a different index to each of the first ensemble modules such that the angle difference between neighboring first ensemble modules is constant.

An apparatus for configuring a deep neural network, the apparatus comprising:
a configuration unit that configures a convolutional neural network (CNN) including a plurality of convolutional layers for processing various images, and that configures a REM for rotation-invariant robot grasp inference, the REM including n first ensemble modules (where n is a natural number greater than or equal to 4), each assigned an index that determines an angle by which an image is rotated; and
a placement unit that places the REM at the end of the CNN,
wherein the configuration unit estimates, for the n features extracted by the n first ensemble modules, the accuracy of a robot grasp, and derives, as a first feature, the feature with the highest estimated accuracy among the n features.

The apparatus of claim 8, further comprising:
a processing unit that identifies, among the plurality of convolutional layers, the two layers with the smallest processable image sizes, and determines the position between the two identified layers as the end of the CNN.

The apparatus of claim 8, further comprising:
a processing unit that identifies, among the plurality of convolutional layers, two layers associated with a skip connection, and determines the position between the two identified layers as the end of the CNN.

(deleted)

The apparatus of claim 8,
wherein the configuration unit configures the REM to further include a second ensemble module to which no index is assigned, and outputs a result for the robot grasp by combining, through a product, the first feature and a second feature extracted by the second ensemble module.

The apparatus of claim 8,
wherein the configuration unit assigns a different index to each of the first ensemble modules such that the angle difference between neighboring first ensemble modules is constant.