KR20180022366A

KR20180022366A - Integrated learning apparatus for lighting/object/hands recognition/tracking in augmented/virtual reality and method therefor

Info

Publication number: KR20180022366A
Application number: KR1020160107723A
Authority: KR
Inventors: 우운택; 유정민; 박갑용; 전익범; 박진우
Original assignee: 한국과학기술원
Priority date: 2016-08-24
Filing date: 2016-08-24
Publication date: 2018-03-06
Also published as: KR101891884B1

Abstract

Disclosed are an integrated learning apparatus and method for recognizing and tracking light sources, objects, and hand motions in augmented/virtual reality. According to an embodiment of the present invention, the integrated learning apparatus comprises: a detection unit that detects regions of interest from input image data; an estimation and recognition unit that estimates and recognizes, through learning, at least one of an object and a posture in each detected region of interest and provides a result value for each detected region of interest; and a refinement unit that integrates the result values for the regions of interest and corrects the result value for each region of interest through learning, taking into account the correlation between the integrated result values.

Description

The present invention relates to an integrated learning apparatus and method for lighting/object/hand recognition/tracking in augmented/virtual reality.

The present invention relates to integrated learning technology for augmented/virtual reality and, more particularly, to an integrated learning apparatus and method that can improve performance by organically linking element technologies such as light source estimation, object recognition and tracking, and hand gesture recognition and tracking in augmented/virtual reality.

With the spread of smartphones and the introduction of various eyeglass-type display devices such as Google Glass and Microsoft HoloLens, interest in and expectations for augmented reality, alongside virtual reality, are steadily growing. In augmented reality applications, real-time interaction between the user's hand (a bare hand or a hand holding a real object) and an augmented 3D virtual object depends on three core technologies: 1) light source estimation for realistic augmentation of virtual objects, 2) real object recognition/tracking, and 3) user hand gesture tracking. The existing core element technologies have the following problems.

1) They require well-designed feature extraction techniques for image data and the design of descriptors that describe the feature information.

2) They require a variety of training data generation techniques, and performance degrades because of data sampling errors during training.

3) There is no solution that organically integrates the context information of each element technology to improve overall system performance.

The present invention proposes an integrated learning apparatus and method that organically link light source estimation for realistic augmentation, real object recognition/tracking, and user hand gesture recognition/tracking, all of which are essential for supporting real-time interaction between a user and augmented 3D virtual content in an augmented/virtual reality environment.

Embodiments of the present invention provide an integrated learning apparatus and method that can improve performance by organically linking element technologies such as light source estimation, object recognition and tracking, and hand gesture recognition and tracking in augmented/virtual reality.

An integrated learning apparatus according to an embodiment of the present invention includes: a detection unit that detects regions of interest from input image data; an estimation and recognition unit that estimates and recognizes, through learning, at least one of an object and a posture in each detected region of interest and provides a result value for each detected region of interest; and a refinement unit that integrates the result values for the regions of interest and corrects the result value for each region of interest through learning, taking into account the correlation between the integrated result values.

The estimation and recognition unit may include a renderer unit that integrates the result values for the detected regions of interest and, through rendering that takes into account the correlation between the integrated result values, generates a predicted image for each result value so that it corresponds to a predefined ground truth image.

The refinement unit may include a plurality of fully connected layers and may correct the result value for each region of interest by applying a weight value to that result value at the input stage of the fully connected layers, taking into account the correlation between the integrated result values.

The refinement unit may also correct the result value for each region of interest by applying a weight value to the loss function of that result value at the output stage of the fully connected layers, taking into account the correlation between the integrated result values.

The refinement unit may also correct the result value for each region of interest by applying a weight value from a predefined hashing table to that result value at a preset fully connected layer among the plurality of fully connected layers, taking into account the correlation between the integrated result values.

Furthermore, the integrated learning apparatus according to an embodiment of the present invention may further include a DB generation unit that generates synthetic images through integrated rendering using predetermined rendering parameters, generates a first image through integrated rendering using the result values output from the refinement unit, and generates a training database using the generated synthetic images and first images.

The DB generation unit may compare the first image with the ground truth image corresponding to the first image and add the first image to the training database as training data only when the comparison result value is lower than a preset threshold.

An integrated learning method according to an embodiment of the present invention includes: detecting regions of interest from input image data; estimating and recognizing, through learning, at least one of an object and a posture in each detected region of interest and providing a result value for each detected region of interest; and integrating the result values for the regions of interest and correcting the result value for each region of interest through learning, taking into account the correlation between the integrated result values.

The step of providing a result value for each detected region of interest may include integrating the result values for the detected regions of interest and, through rendering that takes into account the correlation between the integrated result values, generating a predicted image for each result value so that it corresponds to a predefined ground truth image.

The correcting step may correct the result value for each region of interest by applying a weight value to that result value at the input stage of a plurality of fully connected layers, taking into account the correlation between the integrated result values.

The correcting step may also correct the result value for each region of interest by applying a weight value to the loss function of that result value at the output stage of the plurality of fully connected layers, taking into account the correlation between the integrated result values.

The correcting step may also correct the result value for each region of interest by applying a weight value from a predefined hashing table to that result value at a preset fully connected layer among the plurality of fully connected layers, taking into account the correlation between the integrated result values.

Furthermore, the integrated learning method according to an embodiment of the present invention may further include generating synthetic images through integrated rendering using predetermined rendering parameters, generating a first image through integrated rendering using the result values output in the correcting step, and generating a training database using the generated synthetic images and first images.

The step of generating the training database may compare the first image with the ground truth image corresponding to the first image and add the first image to the training database as training data only when the comparison result value is lower than a preset threshold.

According to embodiments of the present invention, performance can be improved by organically linking element technologies such as light source estimation, object recognition and tracking, and hand gesture recognition and tracking in augmented/virtual reality.

Specifically, according to embodiments of the present invention, using the intermediate results of the element technologies in a mutually complementary manner improves both the performance of each individual technology and the overall system performance.

FIG. 1 is a conceptual diagram of an integrated learning apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of an integrated learning apparatus according to an embodiment of the present invention.
FIG. 3 is an example diagram for explaining the estimation and recognition unit shown in FIG. 2.
FIG. 4 is an example diagram for explaining the refinement unit shown in FIG. 2.
FIG. 5 shows embodiments for explaining the operation of the refinement unit shown in FIG. 2.
FIG. 6 is an example diagram for explaining the process of generating a training database in the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The present invention is not, however, restricted or limited by these embodiments. Like reference numerals in the drawings denote like elements.

In augmented/virtual reality applications, the element technologies that support real-time interaction between the user's hand (a bare hand or a hand holding a real object) and an augmented 3D virtual object, namely light source estimation, object recognition/tracking, and posture estimation/tracking of a bare hand or a hand holding an object, have been studied mainly with a focus on improving the performance of each one individually from RGB-D image input. Because information is not shared organically among the element technologies, the achievable performance improvement is limited.

The gist of the embodiments of the present invention is to improve both the performance of each individual technology and the overall system performance by using the intermediate results of the element technologies in a mutually complementary manner.

For example, in object recognition and hand posture estimation, performance degrades when the hand and the real object held in it occlude each other. With the technique of the present invention, the hand posture estimate can be used as auxiliary information when the object is heavily occluded, and information about the held object can be used as auxiliary information when the hand is heavily occluded, so that each task improves the other.
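
The complementary use of the two estimates can be illustrated with a small sketch. This is not the patented method, which learns the relationship inside a network; it assumes a fixed linear blend, and the names `fuse_object_scores`, `direct`, `from_hand`, and `object_visibility` are hypothetical.

```python
from typing import Dict


def fuse_object_scores(
    direct: Dict[str, float],
    from_hand: Dict[str, float],
    object_visibility: float,
) -> Dict[str, float]:
    """Blend two per-class score tables for the held object.

    direct:            scores from the object's own appearance
    from_hand:         scores implied by the estimated hand posture
    object_visibility: assumed fraction of the object that is visible, in [0, 1]
    """
    if not 0.0 <= object_visibility <= 1.0:
        raise ValueError("object_visibility must be in [0, 1]")
    classes = set(direct) | set(from_hand)
    # The more the object is occluded, the more the hand-based scores count.
    return {
        c: object_visibility * direct.get(c, 0.0)
        + (1.0 - object_visibility) * from_hand.get(c, 0.0)
        for c in classes
    }
```

With a heavily occluded object (low `object_visibility`), the class favored by the hand posture dominates the fused scores; swapping the roles of hand and object gives the symmetric case.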

FIG. 1 is a conceptual diagram of an integrated learning apparatus according to an embodiment of the present invention. As shown in FIG. 1, the integrated learning apparatus receives a color (RGB) image and a depth image from an RGB-D camera as input, and detects and tracks the regions of the real object of interest and of the hand in the input images.

The present invention (a unified perception model CNN) performs interaction between the tracked hand (a bare hand or a hand holding an object) and a virtual object rendered using the light source estimate, and reflects the result of the interaction in the live camera image.

In addition, the present invention can generate and update the training database used for learning, and can improve learning performance by taking into account the interactions, for example the degree of correlation, among element technologies such as light source estimation, object recognition/tracking, and posture estimation/tracking of a bare hand or a hand holding an object.

The apparatus and method according to the present invention are described in detail below with reference to FIGS. 2 to 7.

FIG. 2 is a block diagram showing the configuration of an integrated learning apparatus according to an embodiment of the present invention.

As shown in FIG. 2, the integrated learning apparatus (unified visual perception model) according to the present invention includes a detection unit (detection phase), an estimation and recognition unit (estimation and recognition phase), and a refinement unit (refinement phase).
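
The three-phase structure can be sketched as a simple function composition. This is a minimal sketch rather than the patented implementation: the names `Region`, `run_pipeline`, `detect`, `estimate`, and `refine` are hypothetical, the box layout is an assumption, and the learned CNN stages are passed in as plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical labels for the three region types named in the text.
OBJECT, HAND_WITH_OBJECT, BARE_HAND = "object", "hand_with_object", "bare_hand"


@dataclass
class Region:
    kind: str    # one of OBJECT, HAND_WITH_OBJECT, BARE_HAND
    box: tuple   # (x, y, width, height) in pixels; this layout is an assumption


def run_pipeline(
    frame: dict,
    detect: Callable[[dict], List[Region]],
    estimate: Callable[[dict, Region], Dict[str, float]],
    refine: Callable[[List[Dict[str, float]]], List[Dict[str, float]]],
) -> List[Dict[str, float]]:
    """Run detection, then estimation/recognition, then refinement."""
    regions = detect(frame)                           # detection phase
    results = [estimate(frame, r) for r in regions]   # one result per region
    return refine(results)                            # joint correction of all results
```

Passing the stages as callables keeps the sketch independent of any particular network library; the refinement stage receives all per-region results at once because it corrects them jointly.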

The detection unit detects an object region of interest (ROI) and a hand region from the input images of the real object, that is, the RGB image and the depth image. Using learning, for example a convolutional neural network (CNN), on the detected object ROI and hand, it then detects the object region corresponding to the object, the region of a hand holding the object, and the region of a bare hand not holding an object.

Here, both the object ROI and the hand region may be regions of interest for learning. CNN training is well known to those skilled in the art, so a detailed description is omitted.

The estimation and recognition unit applies CNN learning to each region detected by the detection unit to estimate the light source, recognize the object, estimate the object/hand posture, and estimate the bare-hand posture.

Specifically, the estimation and recognition unit provides the refinement unit with: the light source estimation result and the object recognition result (object ID), estimated using CNN learning on the object region detected by the detection unit; the object/hand posture estimation result, estimated using CNN learning on the detected region of the hand holding the object; and the bare-hand posture result, estimated using CNN learning on the detected bare-hand region.
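
The mapping from region type to the results handed to the refinement unit can be written down as a small table. This is only an illustrative sketch; the key and value strings and the helper `outputs_for` are hypothetical names, not identifiers from the patent.

```python
from typing import Dict, List

# Results produced for each detected region type, as listed in the text.
OUTPUTS_BY_REGION: Dict[str, List[str]] = {
    "object": ["light_source", "object_id"],
    "hand_with_object": ["object_posture", "hand_posture"],
    "bare_hand": ["bare_hand_posture"],
}


def outputs_for(region_kinds: List[str]) -> List[str]:
    """List, in order and without duplicates, the results produced for the regions."""
    seen: List[str] = []
    for kind in region_kinds:
        for name in OUTPUTS_BY_REGION[kind]:
            if name not in seen:
                seen.append(name)
    return seen
```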

As in the example shown in FIG. 3, the estimation and recognition unit may further include a renderer unit (renderer phase) for training the estimation and recognition of each detected region.

That is, the estimation and recognition unit may be configured without a renderer unit, as in FIG. 2, or may be configured to include both the estimation and recognition unit and the renderer unit, as shown in FIG. 3.

The additional renderer unit is a component for training the estimation and recognition CNN of the estimation and recognition unit. It organically integrates the intermediate result values estimated and recognized by the estimation and recognition unit and then derives a predicted image that resembles the known ground truth image.

The ground truth image may be a given image that corresponds to the correct answer, such as the object ID and the hand posture associated with that object ID.

Specifically, the renderer unit receives as input the intermediate result values derived by the estimation and recognition unit, for example the light source estimate, object ID, object posture, posture of the hand holding the object, and bare-hand posture. It integrates these intermediate result values based on the correlations among them and thereby produces a predicted image for each input value.

The renderer unit addresses the following problem: when the intermediate result values are predicted separately rather than integrated, the object may be occluded by the hand or the hand by the object, which makes it difficult to produce accurate predicted images of the object and the hand. When the hand is occluded by the object, the posture of the hand holding the object can be determined from the detected object; conversely, when the object is occluded by the hand, the object held in the hand can be estimated from the hand posture. In other words, because the posture of the holding hand depends on the object, and the object that can be held depends on the hand posture, the renderer unit can use this relationship to estimate or recognize an occluded object or an occluded hand posture more clearly.

In this way, the renderer unit integrates the light source estimate, object ID, object posture, posture of the hand holding the object, and bare-hand posture that make up the derived intermediate result values, and can therefore take into account how plausible they are together. As a result, the image corresponding to each intermediate result value can be derived so that it is as similar as possible to the ground truth image.

That is, even when the object and the posture of the hand holding it are not clearly derived by the estimation and recognition unit, reflecting the correlation between the object and the posture of the hand holding it makes the predicted images for the derived intermediate result values, such as the object, the light source estimate, and the hand posture, as similar as possible to the ground truth image.

Because the renderer unit does not render the intermediate result values individually but integrates them to produce a predicted image as similar as possible to the ground truth image, it can make the predicted images of the object and of the hand holding it as similar as possible to the ground truth image even when the object is unclear because it is occluded by the hand, or the posture of the holding hand is unclear because it is occluded by the object.

The renderer unit can train the CNN so as to reduce the difference between the ground truth image and the predicted image. Such CNN training is obvious to those skilled in the art, so its description is omitted.
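
The training signal described here is some measure of the difference between the predicted image and the ground truth image. The text names no specific metric, so the sketch below assumes a mean squared per-pixel difference over flattened images; `image_difference` is a hypothetical name.

```python
from typing import Sequence


def image_difference(predicted: Sequence[float], ground_truth: Sequence[float]) -> float:
    """Mean squared per-pixel difference between two flattened images.

    The choice of metric is an assumption; the source only says that training
    reduces the difference between the two images.
    """
    if len(predicted) != len(ground_truth) or len(predicted) == 0:
        raise ValueError("images must be non-empty and the same size")
    return sum((p - g) ** 2 for p, g in zip(predicted, ground_truth)) / len(predicted)
```

A value of zero means the predicted image matches the ground truth exactly; training would adjust the network so that this value decreases.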

Referring again to FIG. 2, the estimation and recognition unit uses the renderer unit of FIG. 3 to derive predicted images for the intermediate result values that are as similar as possible to the ground truth image, and provides the intermediate result values derived in this way to the refinement unit.

The refinement unit integrates the intermediate result values derived for each region and applies weight values based on the correlations among them, and thereby provides the corrected learning result value of each element technology to the rendering and tracking unit (rendering/tracking phase).

As shown in FIG. 4, the refinement unit organically integrates the individually estimated/recognized result vectors and passes the integrated result vectors through a refinement deep neural network (RDNN) to improve the performance of the individual estimation/recognition values.
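
The integrate-then-refine step can be sketched as concatenating the result vectors and passing them through a stack of fully connected layers. This is a minimal sketch in plain Python, not the RDNN of the patent: the ReLU activation between layers, the layer sizes, and the names `fully_connected` and `refine` are assumptions.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def fully_connected(x: Vector, weights: Matrix, bias: Vector, relu: bool) -> Vector:
    """One dense layer: y = W x + b, optionally followed by ReLU."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def refine(result_vectors: List[Vector], layers: List[Tuple[Matrix, Vector]]) -> Vector:
    """Concatenate the per-task result vectors and run them through the layers."""
    x = [value for vector in result_vectors for value in vector]  # integration
    last = len(layers) - 1
    for index, (weights, bias) in enumerate(layers):
        # ReLU on hidden layers only (an assumption); the final layer is linear.
        x = fully_connected(x, weights, bias, relu=index < last)
    return x
```

The output vector would then be split back into the corrected per-task values; how it is split depends on the layout chosen for the concatenation.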

As shown in FIG. 5, the RDNN may include a plurality of fully connected layers (FCLs), and the result of the refinement unit can be improved by controlling the weight values at the input stage, at the output stage, or at an intermediate stage of the refinement unit.

As shown in FIG. 5A, the refinement unit may, based on the correlations among the element technologies, group the correlated result values among the result values (N1 to N4) fed to the input stage and assign a weight value to each group, and may assign weight values individually to independent result values.
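
The grouping at the input stage can be sketched as follows. The sketch assumes that a group's weight simply scales every result vector in that group; the function name `weight_inputs` and the example weights are hypothetical, and an independent result is expressed as a group of one.

```python
from typing import Dict, List


def weight_inputs(
    values: Dict[str, List[float]],
    groups: List[List[str]],
    group_weights: List[float],
) -> Dict[str, List[float]]:
    """Scale each result vector by the weight of the group it belongs to."""
    if len(groups) != len(group_weights):
        raise ValueError("one weight per group is required")
    weighted: Dict[str, List[float]] = {}
    for names, weight in zip(groups, group_weights):
        for name in names:
            if name in weighted:
                raise ValueError(f"{name} appears in more than one group")
            weighted[name] = [weight * x for x in values[name]]
    if set(weighted) != set(values):
        raise ValueError("every result value must belong to exactly one group")
    return weighted
```

For instance, with `groups=[["N1", "N2"], ["N3"], ["N4"]]` the correlated results N1 and N2 share one weight while N3 and N4 are weighted on their own.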

As shown in FIG. 5B, the refinement unit may assign weight values to the loss functions at its output stage, taking into account the correlations among the loss functions of the element technologies. As in FIG. 5A, the loss functions of correlated element technologies may be grouped and given the same weight value.
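
One way to read the output-stage variant is as a weighted sum of per-task losses in which correlated tasks share a weight. The sketch below assumes exactly that form; `total_loss`, the task names, and the weights in the usage note are hypothetical.

```python
from typing import Dict, List


def total_loss(
    losses: Dict[str, float],
    groups: List[List[str]],
    group_weights: List[float],
) -> float:
    """Weighted sum of per-task losses; tasks in the same group share one weight."""
    if len(groups) != len(group_weights):
        raise ValueError("one weight per group is required")
    return sum(
        weight * sum(losses[name] for name in names)
        for names, weight in zip(groups, group_weights)
    )
```

For example, grouping the object and hand-posture losses under a weight of 2.0 and leaving the light-source loss at 1.0 makes errors in the two correlated tasks count twice as much in training.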

As shown in FIG. 5C, at an intermediate stage the refinement unit may determine the weight value of each element technology from the weight values stored for each element technology in a predefined hashing table, taking into account the correlations among the element technologies. Applying the determined weight values improves the result of the refinement unit, in which the result values of the element technologies are integrated.

The weight values in FIG. 5 may vary with the object, the posture of the hand holding the object, the light source, and so on, and the weight values stored in the hashing table may be determined in advance, for example through experiments.
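
A table lookup of this kind can be sketched with a Python dictionary, which is itself a hash table. Everything specific here is an assumption made for illustration: the key (object ID and hand posture class), the task names, the placeholder weight values, and the fallback to default weights when no entry exists.

```python
from typing import Dict, Tuple

Weights = Dict[str, float]

# Fallback used when the table has no entry for the observed combination.
DEFAULT_WEIGHTS: Weights = {"object": 1.0, "hand_posture": 1.0, "light": 1.0}

# Hypothetical table; in the text such values are fixed in advance by experiment.
WEIGHT_TABLE: Dict[Tuple[str, str], Weights] = {
    ("cup", "grasp"): {"object": 1.5, "hand_posture": 1.5, "light": 1.0},
}


def lookup_weights(object_id: str, hand_posture: str) -> Weights:
    """Return a copy of the stored weights for this combination, or the defaults."""
    return dict(WEIGHT_TABLE.get((object_id, hand_posture), DEFAULT_WEIGHTS))


def apply_weights(activations: Dict[str, float], weights: Weights) -> Dict[str, float]:
    """Scale each task's intermediate value by its weight (1.0 if none is given)."""
    return {name: value * weights.get(name, 1.0) for name, value in activations.items()}
```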

That is, the refinement unit applies the same or different weight values to the integrated result values, taking into account the correlation, degree of correlation, or plausibility of each, and can thereby learn the result value for each region so that it is as close as possible to the ground truth.

Here, the refinement unit may assign a high weight value to the result values of element technologies with high plausibility, and may assign a predefined weight value, or the weight value corresponding to the element technology, to element technologies with low plausibility.

The rendering and tracking unit (rendering/tracking phase) renders a virtual object and tracks the object and the hand based on the learning result values of the element technologies corrected by the refinement unit. By augmenting a preselected virtual object and tracking and recognizing the hand holding a real object and the bare hand, it can realize virtual reality and augmented reality.

As shown in FIG. 6, the present invention can use an integrated renderer that takes rendering parameters as input to generate virtual images, or synthesized images, corresponding to the ground truth images used for learning, and can generate a training database (training set) containing the synthesized images generated in this way. That is, the various training data (ground truth images) in the training database can be generated by the integrated renderer that takes rendering parameters as input.
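
Generating synthetic training data from rendering parameters can be sketched as enumerating parameter combinations and rendering each one. This is an assumed approach for illustration: the source does not say how parameters are sampled, the parameter names in the usage note are hypothetical, and the renderer is passed in as a stub callable.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple


def parameter_grid(options: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Every combination of the listed rendering parameter values."""
    names = list(options)
    return [dict(zip(names, combo)) for combo in product(*(options[n] for n in names))]


def build_training_set(
    options: Dict[str, List[Any]],
    render: Callable[[Dict[str, Any]], Any],
) -> List[Tuple[Dict[str, Any], Any]]:
    """Render one synthetic image per combination and keep its parameters as the label."""
    return [(params, render(params)) for params in parameter_grid(options)]
```

Calling `parameter_grid({"light": ["left", "right"], "object_id": ["cup"], "hand_posture": ["grasp", "pinch"]})` yields four combinations; because each image is rendered from known parameters, its labels come for free.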

In addition to the integrated renderer that takes rendering parameters as input, the present invention can use an integrated renderer that takes as input the result values of the element technologies output by the refinement unit described above to generate real images corresponding to the ground truth images used for learning. Adding the real images generated in this way to the training database as ground truth images updates the database, so that the final training database is generated or updated.

Here, the integrated renderer renders the result values in an integrated manner, where 'integrated' means that the real object, light source, object posture, and hand posture are rendered together in a single pass. When the result of integrated rendering of the light source, the object and the posture of the hand holding it, and the bare-hand posture derived by the present invention is compared with, for example, the ground truth image and the comparison value is lower than a specific threshold, the integrated rendering result, that is, the real image, can be used as training data in the training database.
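
The threshold test for admitting rendered images into the training database can be sketched as a filter. The comparison metric is not specified in the source, so a mean absolute per-pixel difference over flattened images is assumed; `mean_abs_difference` and `select_training_images` are hypothetical names.

```python
from typing import List, Sequence, Tuple

Image = Sequence[float]


def mean_abs_difference(a: Image, b: Image) -> float:
    """Mean absolute per-pixel difference (an assumed comparison metric)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("images must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_training_images(
    pairs: List[Tuple[Image, Image]],
    threshold: float,
) -> List[Image]:
    """Keep each rendered image whose difference from its ground truth is below the threshold."""
    return [
        rendered
        for rendered, truth in pairs
        if mean_abs_difference(rendered, truth) < threshold
    ]
```

Only images that already agree closely with their ground truth are admitted, which guards against adding poorly estimated results to the training data.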

As described above, the apparatus and method according to the present invention organically link element technologies in augmented/virtual reality, such as light source, object, and hand motion recognition and tracking, based on the correlation among the result values of the respective element technologies, and can thereby improve the learning or recognition performance of each element technology.

Accordingly, the method according to the present invention includes a detection step, an estimation recognition step, and a correction step. The detection step, performed by the detection unit described above, detects an object region and a hand region from an RGB image and a depth image. The estimation recognition step performs light source estimation, object recognition, object/hand pose estimation, bare-hand pose estimation, and the like through learning on each detected region. The correction step integrates the result values of the estimation recognition step, that is, the result values of the respective element technologies, and corrects each of them in consideration of the correlation among the element technologies, thereby improving the learning result of each element technology.
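
The data flow through the three steps can be sketched as follows. This is a structural illustration only: `detect_regions`, `estimate`, and `correct` are hypothetical names, the detector and estimators return fixed placeholder values instead of learned outputs, and the correction is modeled as a residual linear layer over the concatenated results, which is one simple way to let each result depend on the others and is an assumption rather than the learned correction of the invention.

```python
from typing import Dict, List

Vector = List[float]


def detect_regions(rgb, depth) -> Dict[str, tuple]:
    """Placeholder detector: names the object and hand regions."""
    return {"object": (rgb, depth), "hand": (rgb, depth)}


def estimate(regions: Dict[str, tuple]) -> Dict[str, Vector]:
    """Placeholder estimators: one small result vector per element
    technology (values are made up)."""
    return {
        "light": [0.2, 0.9],
        "object_pose": [0.1, 0.0, 0.3],
        "hand_pose": [0.5, 0.4],
    }


def correct(results: Dict[str, Vector],
            weights: List[Vector]) -> Dict[str, Vector]:
    """Integrate all results into one vector x and return x + W x,
    split back per element technology."""
    names = list(results)
    x = [v for name in names for v in results[name]]
    y = [xi + sum(w * xj for w, xj in zip(row, x))
         for xi, row in zip(x, weights)]
    corrected, start = {}, 0
    for name in names:
        length = len(results[name])
        corrected[name] = y[start:start + length]
        start += length
    return corrected


def run_pipeline(rgb, depth, weights: List[Vector]) -> Dict[str, Vector]:
    """Detection, then estimation/recognition, then correction."""
    regions = detect_regions(rgb, depth)
    results = estimate(regions)
    return correct(results, weights)
```

With an all-zero weight matrix the correction leaves every result unchanged; a non-zero entry in row i, column j lets result j shift result i, which is how a cross-technology dependency would enter in this sketch.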

Here, as described with reference to FIG. 3, the estimation recognition step integrates the intermediate result values of the respective element technologies and performs rendering using the integrated intermediate result values, so that the predicted image is rendered to be as similar as possible to the ground-truth image. As a result, the object and the pose of the hand holding it can be estimated and recognized more accurately even when the object occludes the hand or the hand occludes the object.
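
One way to see why rendering the object and the hand together helps with occlusion is a per-pixel depth test, sketched here. The description does not specify how the renderer resolves occlusion, so this depth-based compositing, the function name `composite`, and the label values are assumptions made for illustration.

```python
from typing import List, Optional

BACKGROUND, OBJECT, HAND = 0, 1, 2  # hypothetical pixel labels

DepthMap = List[List[Optional[float]]]


def composite(object_depth: DepthMap, hand_depth: DepthMap) -> List[List[int]]:
    """Label each pixel with the surface nearest to the camera.

    A depth of None means that surface does not cover the pixel.
    Smaller depth values are closer to the camera.
    """
    labels = []
    for obj_row, hand_row in zip(object_depth, hand_depth):
        row = []
        for o, h in zip(obj_row, hand_row):
            if o is None and h is None:
                row.append(BACKGROUND)
            elif h is None or (o is not None and o <= h):
                row.append(OBJECT)
            else:
                row.append(HAND)
        labels.append(row)
    return labels
```

Because the object and hand are drawn into the same image, a pixel where the hand lies in front of the object is labeled as hand, so the predicted image contains the same occlusion pattern that a camera would observe.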

Of course, it is apparent to those skilled in the art from the foregoing detailed description that each step of the method may perform all of the operations of the corresponding component described above for the apparatus.

The system or apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the systems, devices, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as a single device, but those of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as a parallel processor, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described above with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

1. An integrated learning device comprising:
a detection unit that detects a region of interest from input image data;
an estimation recognition unit that estimates and recognizes, through learning, at least one of an object and a pose in the detected region of interest and provides a result value for each detected region of interest; and
a correction unit that integrates the result values for the respective regions of interest and corrects, through learning, the result value for each region of interest in consideration of the correlation between the integrated result values.

2. The integrated learning device of claim 1,
wherein the estimation recognition unit comprises
a renderer that integrates the result values for the detected regions of interest and generates, through rendering based on the correlation between the integrated result values, a predicted image corresponding to a predetermined ground-truth image for each of the result values.

3. The integrated learning device of claim 1,
wherein the correction unit
corrects the result value for each region of interest by reflecting a weight value in the result value for each region of interest, in consideration of the correlation between the integrated result values, at the input of a plurality of fully connected layers.

4. The integrated learning device of claim 1,
wherein the correction unit
corrects the result value for each region of interest by reflecting a weight value in a loss function of the result value for each region of interest, in consideration of the correlation between the integrated result values, at the output of a plurality of fully connected layers.

5. The integrated learning device of claim 1,
wherein the correction unit
corrects the result value for each region of interest by reflecting a weight value of a predefined hash table in the result value for each region of interest, in consideration of the correlation between the integrated result values, at a predetermined fully connected layer among a plurality of fully connected layers.

6. The integrated learning device of claim 1,
further comprising a DB generation unit that generates synthesized images through integrated rendering using predetermined rendering parameters, generates a first image through integrated rendering using the result values output from the correction unit, and generates a learning database using the synthesized images and the first image.

7. The integrated learning device of claim 6,
wherein the DB generation unit
compares the first image with a ground-truth image corresponding to the first image, and generates the first image as learning data of the learning database only when the comparison result value is lower than a preset threshold.

8. An integrated learning method comprising:
detecting a region of interest from input image data;
estimating and recognizing, through learning, at least one of an object and a pose in the detected region of interest and providing a result value for each detected region of interest; and
integrating the result values for the respective regions of interest and correcting, through learning, the result value for each region of interest in consideration of the correlation between the integrated result values.

9. The integrated learning method of claim 8,
wherein the providing of a result value for each detected region of interest
comprises integrating the result values for the detected regions of interest and generating, through rendering based on the correlation between the integrated result values, a predicted image corresponding to a predetermined ground-truth image for each of the result values.

10. The integrated learning method of claim 8,
wherein the correcting
corrects the result value for each region of interest by reflecting a weight value in the result value for each region of interest, in consideration of the correlation between the integrated result values, at the input of a plurality of fully connected layers.

11. The integrated learning method of claim 8,
wherein the correcting
corrects the result value for each region of interest by reflecting a weight value in a loss function of the result value for each region of interest, in consideration of the correlation between the integrated result values, at the output of a plurality of fully connected layers.

12. The integrated learning method of claim 8,
wherein the correcting
corrects the result value for each region of interest by reflecting a weight value of a predefined hash table in the result value for each region of interest, in consideration of the correlation between the integrated result values, at a predetermined fully connected layer among a plurality of fully connected layers.

13. The integrated learning method of claim 8,
further comprising generating synthesized images through integrated rendering using predetermined rendering parameters, generating a first image through integrated rendering using the result values output in the correcting, and generating a learning database using the synthesized images and the first image.

14. The integrated learning method of claim 13,
wherein the generating of the learning database
compares the first image with a ground-truth image corresponding to the first image, and generates the first image as learning data of the learning database only when the comparison result value is lower than a preset threshold.