KR20220104531A

KR20220104531A - Method for estimating depth based on deep learning using averaging layer and loss function and apparatus using the same

Info

Publication number: KR20220104531A
Application number: KR1020210006971A
Authority: KR
Inventors: 조용주; 석주명; 서정일
Original assignee: 한국전자통신연구원
Priority date: 2021-01-18
Filing date: 2021-01-18
Publication date: 2022-07-26

Abstract

Disclosed are a deep learning network-based depth estimation method using an averaging layer and a loss function, and an apparatus using the same. The depth estimation method according to one embodiment of the present invention comprises the steps of: generating a deep learning network-based depth estimation model in which a concatenate layer in a UNET module of a recurrent multi-view stereo network (RMVSNET) is replaced with an averaging layer; generating a loss function for a value resulting from image conversion of a sample image, based on a depth prediction value of the sample image and a ground truth depth value; and estimating a depth of an input image by applying the loss function to the depth estimation model and inputting the input image to the depth estimation model to which the loss function is applied. According to the present invention, replacing the concatenate layer in RMVSNET with an averaging layer halves the amount of data in the preceding and following layers, and thus memory efficiency can be improved.

Description

Deep learning network-based depth estimation method using average layer and loss function and apparatus using the same

The present invention relates to a depth estimation technique based on a deep learning network using an averaging layer and a loss function, and in particular to a deep learning-based depth estimation technique that reduces memory usage by using an averaging layer in the structure of the Recurrent Multi-View Stereo Network (RMVSNet), the state of the art in the depth estimation field, and improves performance through a new loss function.

In next-generation immersive media (computer vision) fields such as 3D image reconstruction and point clouds, depth estimation is widely used as a core element technology.

Depth estimation, which has generally relied on hand-crafted methods based on image processing, is being actively researched and developed as deep learning technology evolves. Deep learning-based techniques still lag behind hand-crafted methods in accuracy, but they are far superior in the computation speed of image depth estimation.

However, because a deep learning network comprising many layers is trained on a large amount of data, performance characteristics such as memory utilization and accuracy can differ greatly depending on how the network is configured.

In particular, RMVSNet (Recurrent Multi-View Stereo Network), a state-of-the-art technology in the depth estimation field, is used to estimate depth information as an element technology for 3D image reconstruction. However, it requires lengthy training backed by a large amount of memory, and its performance is somewhat inferior to that of hand-crafted methods.

Korean Patent Application Publication No. 10-2017-0028749, published on March 14, 2017 (Title: Learning-based depth information extraction method and apparatus)

An object of the present invention is to provide a method for efficient memory use and more precise performance improvement in the Recurrent Multi-View Stereo Network (RMVSNet) structure, which is a state-of-the-art technology in the depth estimation field.

Another object of the present invention is to improve memory efficiency by replacing the concatenate layer in RMVSNet with an averaging layer, thereby halving the amount of data in the preceding and following layers.

Another object of the present invention is to compute the loss more precisely by applying the loss function of RMVSNet to an image conversion that uses ground truth depth information, and thereby to improve depth estimation performance.

Another object of the present invention is to provide a depth estimation technique that is far faster to compute than hand-crafted methods and has significantly lower software complexity, so that the technique can be used in a wider range of applications.

Another object of the present invention is to provide a more efficient depth estimation technique that can serve as an element technology for improving the generation of next-generation immersive media such as 3D image reconstruction or point clouds.

To achieve the above objects, a depth estimation method according to the present invention includes: generating a deep learning network-based depth estimation model in which the concatenate layer in the UNET module of RMVSNET (Recurrent Multi-View Stereo Network) is replaced with an averaging layer; generating a loss function for a value resulting from image conversion of a sample image, based on a depth prediction value of the sample image and a ground truth depth value; and estimating the depth of an input image by applying the loss function to the depth estimation model and inputting the input image to the depth estimation model to which the loss function is applied.

Here, the averaging layer may provide an average value of the data delivered through two routes in the UNET module.

Here, the two routes may correspond to a first route delivered from the previous layer in the UNET module and a second route delivered from a layer connected by a skip connection.

Here, generating the loss function may include: calculating a corrected depth prediction value by applying a camera projection matrix to the depth prediction value; calculating a corrected ground truth depth value by applying the camera projection matrix to the ground truth depth value; and generating the loss function so as to correspond to the difference between the corrected depth prediction value and the corrected ground truth depth value.

Here, the camera projection matrix may be generated using a camera intrinsic matrix, a camera rotation matrix, and a camera translation matrix, each obtained using ground truth of predefined three-dimensional coordinates.

In addition, a depth estimation apparatus according to an embodiment of the present invention includes: a processor that generates a deep learning network-based depth estimation model in which the concatenate layer in the UNET module of RMVSNET (Recurrent Multi-View Stereo Network) is replaced with an averaging layer, generates a loss function for a value resulting from image conversion of a sample image based on a depth prediction value of the sample image and a ground truth depth value, applies the loss function to the depth estimation model, and estimates the depth of an input image by inputting the input image to the depth estimation model to which the loss function is applied; and a memory that stores the depth estimation model.

Here, the two routes may correspond to a first route delivered from the previous layer in the UNET module and a second route delivered from a layer connected by a skip connection.

Here, the processor may calculate a corrected depth prediction value by applying a camera projection matrix to the depth prediction value, calculate a corrected ground truth depth value by applying the camera projection matrix to the ground truth depth value, and generate the loss function so as to correspond to the difference between the corrected depth prediction value and the corrected ground truth depth value.

According to the present invention, it is possible to provide a method for efficient memory use and more precise performance improvement in the Recurrent Multi-View Stereo Network (RMVSNet) structure, which is a state-of-the-art technology in the depth estimation field.

In addition, the present invention can improve memory efficiency by replacing the concatenate layer in RMVSNet with an averaging layer, thereby halving the amount of data in the preceding and following layers.

In addition, the present invention can compute the loss more precisely by applying the loss function of RMVSNet to an image conversion that uses ground truth depth information, and can thereby improve depth estimation performance.

In addition, the present invention provides a depth estimation technique that is far faster to compute than hand-crafted methods and has significantly lower software complexity, so that the technique can be used in a wider range of applications.

In addition, the present invention can provide a more efficient depth estimation technique that can serve as an element technology for improving the generation of next-generation immersive media such as 3D image reconstruction or point clouds.

FIG. 1 is a flowchart illustrating a depth estimation method according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of an RMVSNet structure.
FIGS. 3 to 5 are diagrams showing the RMVSNet structure of FIG. 2 in detail.
FIG. 6 is a diagram illustrating the UNET structure of an RMVSNET according to an embodiment of the present invention.
FIGS. 7 and 8 are diagrams showing depth information estimated for the same image by the conventional RMVSNet and by the RMVSNet according to the present invention, respectively.
FIG. 9 is a block diagram illustrating a depth estimation apparatus according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating a computer system according to an embodiment of the present invention.

The present invention is described in detail below with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of well-known functions and configurations that could unnecessarily obscure the gist of the present invention, are omitted. The embodiments of the present invention are provided to explain the present invention more completely to those of ordinary skill in the art. Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart illustrating a depth estimation method according to an embodiment of the present invention.

Referring to FIG. 1, the depth estimation method according to an embodiment of the present invention generates a deep learning network-based depth estimation model in which the concatenate layer in the UNET module of RMVSNET (Recurrent Multi-View Stereo Network) is replaced with an averaging layer (S110).

Here, the RMVSNET may generally be configured as shown in FIG. 2, and the detailed structure of each part shown in FIG. 2 may be configured as shown in FIGS. 3 to 5.

Referring to FIG. 3, the UNET 210 of a typical RMVSNET includes concatenate layers 211 to 214, each of which combines data from the previous layer with data from a layer connected by a skip connection. For example, the concatenate layer 211 shown in FIG. 3 may correspond to a structure that combines 64 and 64 to output 128.

Noting that the data delivered through the two routes have similar characteristics, the present invention proposes a structure that uses the average of the data delivered through the two routes instead of combining them, and therefore handles half as much data as the UNET 210 of the existing RMVSNET.

That is, as shown in FIG. 6, the network according to an embodiment of the present invention may correspond to a structure in which the concatenate layers 211 to 214 shown in FIG. 2 are replaced with averaging layers 611 to 614.

Because the averaging layer can halve the amount of data in the preceding and following layers, it can improve memory efficiency compared with a network that uses concatenate layers.

Here, the averaging layer may provide an average value of the data delivered through two routes in the UNET module.

Here, the two routes may correspond to a first route delivered from the previous layer in the UNET module and a second route delivered from a layer connected by a skip connection.

For example, the averaging layer 611 of FIG. 6 takes the average of the values delivered through the two routes and is therefore configured as 64, which halves the amount of data compared with using the concatenate layer 211 of FIG. 2.
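The difference between the two merge strategies can be illustrated with a minimal sketch. It assumes the 64 and 128 in the text are channel counts and models a feature map as a plain list of channels; the function names `concatenate` and `average` and the toy values are hypothetical and are not taken from the patent or from any RMVSNet implementation.

```python
# Feature maps are modeled as lists of channels; each channel is a flat list of floats.
# This representation and all names below are assumptions made for illustration.

def concatenate(prev, skip):
    """Concatenate-style merge: channel counts add up (64 + 64 -> 128)."""
    return prev + skip


def average(prev, skip):
    """Averaging-style merge: element-wise mean, channel count unchanged (64)."""
    if len(prev) != len(skip):
        raise ValueError("both routes must carry the same number of channels")
    return [[(a + b) / 2.0 for a, b in zip(cp, cs)] for cp, cs in zip(prev, skip)]


channels, pixels = 64, 4
prev_route = [[1.0] * pixels for _ in range(channels)]  # first route: previous layer
skip_route = [[3.0] * pixels for _ in range(channels)]  # second route: skip connection

merged_concat = concatenate(prev_route, skip_route)
merged_avg = average(prev_route, skip_route)
print(len(merged_concat), len(merged_avg))  # prints "128 64"
```

The averaged output keeps the shape of a single route, so the layer that consumes it receives half as many channels as it would after concatenation.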

In addition, the depth estimation method according to an embodiment of the present invention generates a loss function for a value resulting from image conversion of a sample image, based on a depth prediction value of the sample image and a ground truth depth value (S120).

Here, a corrected depth prediction value may be calculated by applying a camera projection matrix to the depth prediction value.

Here, a corrected ground truth depth value may be calculated by applying the camera projection matrix to the ground truth depth value.

Here, the loss function may be generated so as to correspond to the difference between the corrected depth prediction value and the corrected ground truth depth value.

The loss function of a typical RMVSNet may correspond to the difference between the predicted depth value of an image and the ground truth value, as in [Equation 1].

[Equation 1]

Here, d_Pred may correspond to the predicted depth, and d_GT may correspond to the ground truth depth.

In the present invention, however, a loss function corresponding to [Equation 2] is proposed for more precise performance improvement.

[Equation 2]

Here, X_pred and X_GT are values calculated according to [Equation 3], and may correspond to the values obtained by applying the camera projection matrix (P) to the predicted depth and the ground truth depth used in [Equation 1], respectively.

[Equation 3]

Here, d_p(x,y),GT may correspond to the ground truth depth at p(x,y), and d_p(x,y),Pred may correspond to the predicted depth at p(x,y).

For example, the camera projection matrix (P) may be generated as in [Equation 4].

[Equation 4]

Here, K may correspond to the camera intrinsic matrix, R to the camera rotation matrix, and t to the camera translation matrix.
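As an illustration of how a projection matrix can be built from K, R, and t, the sketch below uses the standard pinhole composition P = K [R | t]. This composition is an assumption: the body of [Equation 4] is not reproduced in this text, and the way the patent applies P to depth values in [Equation 3] is not shown here either. The helper names and the camera parameters are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def projection_matrix(K, R, t):
    """Compose a 3x4 matrix P = K [R | t] from a 3x3 K, a 3x3 R and a length-3 t.

    Assumption: standard pinhole convention, not confirmed by the patent text.
    """
    Rt = [R[i] + [t[i]] for i in range(3)]  # append t as a fourth column
    return matmul(K, Rt)


def project(P, point):
    """Project a 3D point with P and return (x, y, depth along the optical axis)."""
    X = [[point[0]], [point[1]], [point[2]], [1.0]]  # homogeneous column vector
    u, v, w = (row[0] for row in matmul(P, X))
    return u / w, v / w, w


# Hypothetical camera: focal length 100 px, principal point (50, 40), identity pose.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 40.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

P = projection_matrix(K, R, t)
x, y, depth = project(P, [1.0, 2.0, 4.0])
print(x, y, depth)  # prints "75.0 90.0 4.0"
```

With an identity pose the third homogeneous coordinate equals the point's depth, which is why the same matrix links pixel positions and depth values.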

In addition, the depth estimation method according to an embodiment of the present invention applies the loss function to the depth estimation model and estimates the depth of an input image by inputting the input image to the depth estimation model to which the loss function is applied (S130).

For example, the training errors of the conventional RMVSNet and of the network to which the proposed depth estimation method is applied are shown in [Table 1]; the proposed method yields the lower Mean Absolute Error (MAE).

[Table 1]
Architecture | Mean Absolute Error (MAE)
RMVSNet | 5.266
Proposed Net | 4.648

Here, the MAE values in [Table 1] may be calculated according to [Equation 5].

[Equation 5]

Here, d_i,pred may correspond to the predicted depth image, d_i,GT may correspond to the ground truth depth image, and N may correspond to the number of images.
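A mean absolute error over N depth images can be sketched as follows. The body of [Equation 5] is not reproduced in this text, so the sketch assumes the usual definition: the absolute difference between each predicted and ground truth depth image, averaged over pixels and then over the N images. The function name and the toy depth values are hypothetical.

```python
def mean_absolute_error(pred_images, gt_images):
    """Average absolute depth error over N image pairs.

    Each image is a flat list of per-pixel depths. Averaging over pixels
    within an image is an assumption of this sketch.
    """
    if len(pred_images) != len(gt_images) or not pred_images:
        raise ValueError("need the same non-zero number of predicted and ground truth images")
    total = 0.0
    for pred, gt in zip(pred_images, gt_images):
        if len(pred) != len(gt) or not pred:
            raise ValueError("each image pair must have the same non-zero number of pixels")
        total += sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)
    return total / len(pred_images)


# Two toy 4-pixel depth images (N = 2).
pred = [[1.0, 2.0, 3.0, 4.0], [10.0, 10.0, 10.0, 10.0]]
gt = [[1.0, 3.0, 3.0, 6.0], [8.0, 10.0, 12.0, 10.0]]
mae = mean_absolute_error(pred, gt)
print(mae)  # prints "0.875"
```

The first image contributes an error of 0.75 and the second 1.0, so the mean over the two images is 0.875.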

When depth information for the same image is estimated using the conventional RMVSNet and the network to which the proposed depth estimation method is applied, the estimated depth information differs, as shown in FIGS. 7 and 8.

That is, for the same image, the depth information estimated using the network to which the proposed depth estimation method is applied is more precise than the depth information estimated using the conventional RMVSNet.

Through this deep learning network-based depth estimation method using an averaging layer and a loss function, it is possible to provide a method for efficient memory use and more precise performance improvement in the RMVSNet (Recurrent Multi-View Stereo Network) structure, which is a state-of-the-art technology in the depth estimation field.

In addition, replacing the concatenate layer in RMVSNet with an averaging layer halves the amount of data in the preceding and following layers, which improves memory efficiency.

In addition, the loss can be computed more precisely by applying the loss function of RMVSNet to an image conversion that uses ground truth depth information, and depth estimation performance can be improved on that basis.

In addition, by providing a depth estimation technique that is far faster to compute than hand-crafted methods and has significantly lower software complexity, the technique can be used in a wider range of applications.

In addition, a more efficient depth estimation technique can be provided that can serve as an element technology for improving the generation of next-generation immersive media such as 3D image reconstruction or point clouds.

FIG. 9 is a block diagram illustrating a depth estimation apparatus according to an embodiment of the present invention.

Referring to FIG. 9, the depth estimation apparatus according to an embodiment of the present invention includes a communication unit 910, a processor 920, and a memory 930.

The communication unit 910 may transmit and receive the information required for depth estimation over a communication network. Here, the network provides a path for transferring data between devices, and the term covers both networks in current use and networks that may be developed in the future.

For example, the network may be an IP network that provides large-capacity data transmission and reception services and uninterrupted data services through the Internet Protocol (IP), or an All-IP network, which is an IP network structure that integrates different networks based on IP. It may be formed by combining one or more of a wired network, a WiBro (Wireless Broadband) network, a 3G mobile communication network including WCDMA, a 3.5G mobile communication network including HSDPA (High Speed Downlink Packet Access) and LTE networks, a 4G mobile communication network including LTE Advanced, a satellite communication network, and a Wi-Fi network.

In addition, the network may be any one of, or a combination of two or more of, a wired/wireless local area network that provides communication among various information devices within a limited area, a mobile communication network that provides communication between mobile bodies and between a mobile body and the outside of the mobile body, a satellite communication network that provides communication between earth stations using satellites, and a wired/wireless communication network. The transmission standard of the network is not limited to existing transmission standards and may include any transmission standard developed in the future.

The processor 920 generates a deep learning network-based depth estimation model in which the concatenate layer in the UNET module of RMVSNET (Recurrent Multi-View Stereo Network) is replaced with an averaging layer.

In addition, the processor 920 generates a loss function for a value resulting from image conversion of a sample image, based on a depth prediction value of the sample image and a ground truth depth value.

[Equation 1]

[Equation 2]

[Equation 3]

Here, the camera projection matrix may be generated using a camera intrinsic matrix, a camera rotation matrix, and a camera translation matrix, each obtained using ground truth of predefined three-dimensional coordinates.

[Equation 4]

In addition, the processor 920 applies the loss function to the depth estimation model and estimates the depth of an input image by inputting the input image to the depth estimation model to which the loss function is applied.

For example, the training errors of the conventional RMVSNet and of the network to which the proposed depth estimation method is applied are as shown in [Table 1] above; the proposed method yields the lower Mean Absolute Error (MAE).

[Equation 5]

The memory 930 stores the depth estimation model.

In addition, the memory 930 stores the various information generated in the depth estimation apparatus according to an embodiment of the present invention, as described above.

실시예에 따라, 메모리(930)는 깊이 추정 장치와 독립적으로 구성되어 깊이 추정을 위한 기능을 지원할 수 있다. 이 때, 메모리(930)는 별도의 대용량 스토리지로 동작할 수 있고, 동작 수행을 위한 제어 기능을 포함할 수도 있다.According to an embodiment, the memory 930 may be configured independently of the depth estimation apparatus to support a function for depth estimation. In this case, the memory 930 may operate as a separate mass storage and may include a control function for performing an operation.

한편, 깊이 추정 장치는 메모리가 탑재되어 그 장치 내에서 정보를 저장할 수 있다. 일 구현예의 경우, 메모리는 컴퓨터로 판독 가능한 매체이다. 일 구현 예에서, 메모리는 휘발성 메모리 유닛일 수 있으며, 다른 구현예의 경우, 메모리는 비휘발성 메모리 유닛일 수도 있다. 일 구현예의 경우, 저장장치는 컴퓨터로 판독 가능한 매체이다. 다양한 서로 다른 구현 예에서, 저장장치는 예컨대 하드디스크 장치, 광학디스크 장치, 혹은 어떤 다른 대용량 저장장치를 포함할 수도 있다.On the other hand, the depth estimation apparatus may have a memory mounted therein to store information in the apparatus. In one implementation, the memory is a computer-readable medium. In one implementation, the memory may be a volatile memory unit, and in another implementation, the memory may be a non-volatile memory unit. In one embodiment, the storage device is a computer-readable medium. In various different implementations, the storage device may include, for example, a hard disk device, an optical disk device, or some other mass storage device.

By using such a deep learning network-based depth estimation apparatus using an averaging layer and a loss function, it is possible to provide a way to use memory efficiently and to improve performance more precisely in the RMVSNet (Recurrent Multi-View Stereo Network) structure, which is state-of-the-art technology in the depth estimation field.

In addition, a more efficient depth estimation technique can be provided so that it can be utilized as an element technology capable of improving the performance of generating next-generation immersive media, such as 3D image reconstruction or point clouds.
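The memory effect of replacing a concatenate layer with an averaging layer can be illustrated with a small sketch. It assumes that the two feature maps being merged have identical shapes and a channels-last layout. The function names `concat_skip` and `average_skip` and the tensor sizes are hypothetical, and this is not the actual UNET module of the embodiment.

```python
import numpy as np

def concat_skip(encoder_feat, decoder_feat):
    """Concatenate merge: stacks the two feature maps along the channel axis."""
    return np.concatenate([encoder_feat, decoder_feat], axis=-1)

def average_skip(encoder_feat, decoder_feat):
    """Averaging merge: element-wise mean, so the channel count is unchanged."""
    return (encoder_feat + decoder_feat) / 2.0

# Illustrative feature maps: (batch, height, width, channels), sizes assumed.
rng = np.random.default_rng(0)
enc = rng.standard_normal((1, 32, 40, 16)).astype(np.float32)
dec = rng.standard_normal((1, 32, 40, 16)).astype(np.float32)

merged_concat = concat_skip(enc, dec)   # channel axis doubles to 32
merged_avg = average_skip(enc, dec)     # channel axis stays at 16

print("concatenate:", merged_concat.shape, merged_concat.nbytes, "bytes")
print("averaging:  ", merged_avg.shape, merged_avg.nbytes, "bytes")
```

In this sketch the averaged output holds half as many values as the concatenated output, so the layer that consumes it also sees half as many input channels.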

FIG. 10 is a diagram illustrating a computer system according to an embodiment of the present invention.

Referring to FIG. 10, an embodiment of the present invention may be implemented in a computer system, such as a computer-readable recording medium. As shown in FIG. 10, the computer system 1000 may include one or more processors 1010, a memory 1030, a user input device 1040, a user output device 1050, and a storage 1060, which communicate with each other via a bus 1020. In addition, the computer system 1000 may further include a network interface 1070 connected to a network 1080. The processor 1010 may be a central processing unit or a semiconductor device that executes processing instructions stored in the memory 1030 or the storage 1060. The memory 1030 and the storage 1060 may be various types of volatile or non-volatile storage media. For example, the memory may include a ROM 1031 or a RAM 1032.

Accordingly, an embodiment of the present invention may be implemented as a computer-implemented method or as a non-transitory computer-readable medium in which computer-executable instructions are recorded. When the computer-readable instructions are executed by a processor, they may perform a method according to at least one aspect of the present invention.

As described above, the deep learning network-based depth estimation method using an averaging layer and a loss function according to the present invention, and the apparatus using the same, are not limited to the configurations and methods of the embodiments described above; rather, all or part of the embodiments may be selectively combined so that various modifications can be made.

210, 610: UNET 211~214: concatenate layer
611~614: averaging layer 910: communication unit
920, 1010: processor 930, 1030: memory
1000: computer system 1020: bus
1031: ROM 1032: RAM
1040: user input device 1050: user output device
1060: storage 1070: network interface
1080: network

Claims

A depth estimation method comprising:
generating a deep learning network-based depth estimation model in which a concatenate layer in a UNET module of a Recurrent Multi-View Stereo Network (RMVSNet) is replaced with an averaging layer;
generating a loss function for a value resulting from image conversion of a sample image based on a depth prediction value of the sample image and a ground truth depth value; and
estimating a depth of an input image by applying the loss function to the depth estimation model and inputting the input image to the depth estimation model to which the loss function is applied.