KR20230066752A

KR20230066752A - Method and apparatus for collecting video information

Info

Publication number: KR20230066752A
Application number: KR1020210152077A
Authority: KR
Inventors: 장훈석
Original assignee: 한국전자기술연구원
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2023-05-16

Abstract

The present invention relates to a method and apparatus for collecting image information. A method according to an embodiment of the present invention is an image information collection method performed by an electronic device, comprising: generating, for a plurality of first training images having different frame positions in an image sequence, n pieces of training position data per first training image (where n is a natural number of 2 or more), each with added noise that affects the position; training, according to a machine learning technique, a deep learning neural network having an input layer that receives the training position data and an output layer that outputs original position data for the original position of the corresponding first training image; generating, for a plurality of target images having different frame positions in an image sequence, n pieces of target position data per target image, each with added noise that affects the position; inputting each piece of target position data into a first model of the trained neural network to extract estimated position data for the estimated positions of the plurality of target images, and correcting the position of each target image using the extracted estimated position data; and performing depth estimation or 3D shape reconstruction using each corrected target image.

Description

Method and apparatus for collecting video information

The present invention relates to a method and apparatus for collecting image information, and more particularly to a method and apparatus that, when collecting image information in a cattle barn monitoring system or the like, remove the jitter noise contained in the images so as to collect image information suited to optimal depth estimation and similar tasks.

In computer graphics applications such as virtual reality, games, and animation, skilled designers build 3D models by hand, which is time-consuming and yields quality that varies widely with the designer's skill. 3D shape reconstruction, a technology that recovers information about an object's 3D shape by analyzing images, is used as an alternative.

Among 3D shape reconstruction techniques, depth estimation based on depth from focus (DFF) is frequently used because it is simple to implement and precise. To perform DFF-based depth estimation, the frame position (that is, the order) of each image in the image sequence must be known accurately. However, jitter noise readily occurs while a camera is capturing images, and it can adversely affect the determination of the frame position of each image in the sequence.

To remove such jitter noise, methods using a Kalman filter and a Bayes filter have been developed. However, these methods remove jitter noise only under the assumption that it is Gaussian, which makes them difficult to apply to real images.

Another approach to removing jitter noise uses a modified Kalman filter and an adaptive neural network filter. However, because this approach uses only the variance of the jitter noise, it cannot effectively remove jitter noise containing outliers that deviate far from the mean.

Yet another approach to removing jitter noise uses a maximum correntropy criterion Kalman filter. However, numerical instability and computational complexity in the covariance matrix and Kalman gain for the image frame positions prevent this approach from removing jitter noise effectively, which lowers the accuracy of depth estimation through image sensors in unmanned monitoring systems and the like.

To solve the problems of the prior art described above, an object of the present invention is to provide a technique that effectively removes the jitter noise contained in images when collecting image information in a cattle barn monitoring system or the like, and thereby collects image information with improved accuracy for depth estimation and similar tasks.

However, the problems to be solved by the present invention are not limited to those mentioned above, and other problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

To solve the above problems, an image information collection method according to an embodiment of the present invention is performed by an electronic device and comprises: generating, for a plurality of first training images having different frame positions in an image sequence, n pieces of training position data per first training image (where n is a natural number of 2 or more), each with added noise that affects the position; training, according to a machine learning technique, a deep learning neural network having an input layer that receives the training position data and an output layer that outputs original position data for the original position of the corresponding first training image; generating, for a plurality of target images having different frame positions in an image sequence, n pieces of target position data per target image, each with added noise that affects the position; inputting each piece of target position data into a first model of the trained neural network to extract estimated position data for the estimated positions of the plurality of target images, and correcting the position of each target image using the extracted estimated position data; and performing depth estimation or 3D shape reconstruction using each corrected target image.
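The data-generation step can be sketched as follows. This is only an illustration under stated assumptions: the function name `make_noisy_positions`, the Laplace noise distribution, and the `scale` value are hypothetical choices and do not come from the specification, which only requires n noisy position values per frame.

```python
import numpy as np

def make_noisy_positions(true_positions, n, rng, scale=0.2):
    """Return an array of shape (num_frames, n): n jittered copies of each frame position.

    The Laplace distribution and `scale` are illustrative assumptions,
    chosen only because the noise need not be Gaussian.
    """
    true_positions = np.asarray(true_positions, dtype=float)
    noise = rng.laplace(loc=0.0, scale=scale, size=(true_positions.size, n))
    return true_positions[:, None] + noise

rng = np.random.default_rng(0)
true_positions = np.arange(1, 11)        # frame positions 1..10 in the sequence
inputs = make_noisy_positions(true_positions, n=5, rng=rng)  # network inputs
targets = true_positions.astype(float)   # original positions the network should output
```

Each row of `inputs` pairs with one entry of `targets`, which mirrors the input layer and output layer described above.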

The deep learning neural network may be a long short-term memory (LSTM) neural network.
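For orientation, one step of a standard LSTM cell can be written in a few lines of NumPy. This is a generic textbook cell, not the network of the specification: the hidden size, the random weights, and the linear readout `w_out` are assumptions, and the weights are untrained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell.

    W: (4*H, D), U: (4*H, H), b: (4*H,), with the gates stacked in the
    order [input, forget, candidate, output].
    """
    H = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate
    f = sigmoid(z[H:2 * H])        # forget gate
    g = np.tanh(z[2 * H:3 * H])    # candidate cell state
    o = sigmoid(z[3 * H:4 * H])    # output gate
    c_t = f * c_prev + i * g
    h_t = o * np.tanh(c_t)
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 1, 8   # one position value per step, 8 hidden units (assumed sizes)
W = rng.normal(scale=0.1, size=(4 * H, D))
U = rng.normal(scale=0.1, size=(4 * H, H))
b = np.zeros(4 * H)
w_out = rng.normal(scale=0.1, size=H)   # hypothetical linear readout

noisy = np.array([3.1, 2.8, 3.3, 2.9, 3.0])   # n = 5 noisy observations of one frame position
h, c = np.zeros(H), np.zeros(H)
for value in noisy:
    h, c = lstm_cell(np.array([value]), h, c, W, U, b)
estimate = float(w_out @ h)   # untrained, so not yet a meaningful position
```

After training, the readout of the final hidden state would play the role of the estimated position data.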

The step of performing depth estimation or 3D shape reconstruction may include deriving a focus value of each target image based on the corrected position of each target image, and performing the depth estimation or 3D shape reconstruction using the derived focus values.
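In its simplest form, depth from focus assigns to each pixel the position of the frame in which that pixel is sharpest. The sketch below assumes a precomputed stack of focus values and a hypothetical helper name `depth_from_focus`; it shows why correct frame positions matter, since the depth map is read directly from them.

```python
import numpy as np

def depth_from_focus(focus_stack, positions):
    """focus_stack: (num_frames, H, W) focus values; positions: (num_frames,) corrected frame positions.

    Returns an (H, W) map holding, per pixel, the position of the frame
    with the largest focus value.
    """
    focus_stack = np.asarray(focus_stack, dtype=float)
    positions = np.asarray(positions, dtype=float)
    best_frame = np.argmax(focus_stack, axis=0)
    return positions[best_frame]

positions = np.array([0.0, 1.0, 2.0, 3.0])
stack = np.zeros((4, 2, 2))
stack[1, 0, 0] = 5.0   # pixel (0, 0) is sharpest in frame 1
stack[3, 0, 1] = 5.0   # pixel (0, 1) is sharpest in frame 3
stack[0, 1, 0] = 5.0   # pixel (1, 0) is sharpest in frame 0
stack[2, 1, 1] = 5.0   # pixel (1, 1) is sharpest in frame 2
depth = depth_from_focus(stack, positions)
```

If jitter swaps or shifts entries of `positions`, every pixel that peaks in an affected frame receives the wrong depth.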

The step of performing depth estimation or 3D shape reconstruction may use a second model trained by Gaussian process regression, a machine learning technique, based on the focus value of each pixel derived from a first focus measure operation on a plurality of pixels in a second training image and on a fitting function of the focus curve estimated from those focus values.

The step of performing depth estimation or 3D shape reconstruction may include: performing a second focus measure operation on a plurality of pixels in the corrected target image; inputting the focus values of the plurality of pixels derived from the second focus measure operation into the second model to fit a plurality of focus curves; extracting, from the fitted focus curves, the pixel position in the corrected target image having the maximum focus value; and performing the depth estimation or 3D shape reconstruction based on the extracted pixel position.
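The curve-fitting and peak-extraction steps can be illustrated with a small zero-mean Gaussian process regression written in NumPy. Everything numeric here is an assumption made for the sketch: the synthetic focus values, the kernel hyperparameters `length` and `sigma_f`, the small `noise` term added for numerical stability, and the choice of the position along the sequence as the x-axis of the focus curve.

```python
import numpy as np

def se_kernel(a, b, length=1.5, sigma_f=1.0):
    """Squared exponential kernel between two 1-D arrays of positions."""
    d = a[:, None] - b[None, :]
    return sigma_f ** 2 * np.exp(-0.5 * (d / length) ** 2)

def gp_fit_curve(x_train, y_train, x_query, noise=1e-3):
    """Posterior mean of a zero-mean GP evaluated at x_query."""
    K = se_kernel(x_train, x_train) + noise * np.eye(x_train.size)
    K_s = se_kernel(x_query, x_train)
    return K_s @ np.linalg.solve(K, y_train)

# Synthetic focus values sampled at integer positions, peaking at 6.3.
position = np.arange(0.0, 11.0)
focus = np.exp(-0.5 * ((position - 6.3) / 1.5) ** 2)

# Fit the focus curve on a fine grid and read off the position of its maximum.
query = np.linspace(0.0, 10.0, 1001)
curve = gp_fit_curve(position, focus, query)
best_position = query[np.argmax(curve)]
```

Because the fitted curve is continuous, the maximum can fall between sampled positions, which is what allows a usable estimate from a small number of focus values.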

The first and second focus measure operations may use the sum of modified Laplacian (SML).

The SML may use the following formula.

(where I(x, y) is the gray-level brightness at pixel (x, y), and W is the size of the image window)
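The equation itself did not survive in this text. For reference, the form of the sum of modified Laplacian commonly given in the shape-from-focus literature is shown here; it is assumed, not confirmed, to correspond to the omitted formula. The unit step between neighboring pixels and the reading of W as a W × W window centered on (x_0, y_0) are assumptions.

```latex
\begin{aligned}
\mathrm{ML}(x, y) &= \lvert 2I(x, y) - I(x-1, y) - I(x+1, y) \rvert
                   + \lvert 2I(x, y) - I(x, y-1) - I(x, y+1) \rvert \\
\mathrm{SML}(x_0, y_0) &= \sum_{(x, y) \in W(x_0, y_0)} \mathrm{ML}(x, y)
\end{aligned}
```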

The method of training the second model may include: selecting the squared exponential kernel k(i, i′) as the kernel function of the probability distribution over the fitting function;

(where i and i′ are the inputs of the squared exponential kernel function, and ‖·‖ is the Euclidean distance)
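The kernel equation is missing from this text. The usual definition of the squared exponential kernel is given here as a reference; the signal variance σ_f² and length scale ℓ are standard hyperparameters that the surviving text does not name, so their presence is an assumption.

```latex
k(i, i') = \sigma_f^{2} \exp\!\left( -\frac{\lVert i - i' \rVert^{2}}{2\ell^{2}} \right)
```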

setting, for the initial probability distribution over the fitting function, the mean to 0 and the kernel function to k(x₀, x₀′) (where x₀ and x₀′ are sets of values on the x-axis, the axis for pixel position in the target data); and setting, for the updated probability distribution over the fitting function, the mean to m_g(x₀) and the kernel function to k_g(x₀, x₀′).

(where x_t is the set of values on the x-axis, the axis for pixel position values in the training data, and x_t′ is the set of values on the y-axis, the axis for focus values in the training data)
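The expressions for m_g and k_g are not reproduced in this text. Under the standard noise-free Gaussian process regression result, and using the symbols defined above (x_t for training inputs, x_t′ for training focus values), the updated mean and kernel would take the following form. This is the textbook posterior and is assumed, not confirmed, to match the omitted equations.

```latex
\begin{aligned}
m_g(x_0) &= k(x_0, x_t)\, k(x_t, x_t)^{-1}\, x_t' \\
k_g(x_0, x_0') &= k(x_0, x_0') - k(x_0, x_t)\, k(x_t, x_t)^{-1}\, k(x_t, x_0')
\end{aligned}
```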

The step of performing the second focus measure operation may include extracting a plurality of pixels from the corrected target image by random sampling and performing the second focus measure operation on the extracted pixels.
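Random pixel sampling followed by a focus measure can be sketched as below. The modified Laplacian uses the common literature form with a unit step; the edge-replicating border handling, the 3 × 3 window, the sample size of 10, and the helper names `modified_laplacian` and `sml_at` are all assumptions made for illustration.

```python
import numpy as np

def modified_laplacian(img):
    """|2I - left - right| + |2I - up - down| with step 1; borders use edge replication."""
    p = np.pad(np.asarray(img, dtype=float), 1, mode="edge")
    c = p[1:-1, 1:-1]
    ml_x = np.abs(2 * c - p[1:-1, :-2] - p[1:-1, 2:])
    ml_y = np.abs(2 * c - p[:-2, 1:-1] - p[2:, 1:-1])
    return ml_x + ml_y

def sml_at(img, pixels, half_window=1):
    """Sum the modified Laplacian over a (2*half_window+1)^2 window around each (row, col)."""
    ml = np.pad(modified_laplacian(img), half_window, mode="constant")
    k = 2 * half_window + 1
    return np.array([ml[r:r + k, c:c + k].sum() for r, c in pixels])

rng = np.random.default_rng(0)
image = np.zeros((8, 8))
image[:, 4:] = 1.0   # vertical step edge between columns 3 and 4

# Randomly sample 10 distinct pixels and evaluate the focus measure only there.
flat = rng.choice(image.size, size=10, replace=False)
pixels = np.column_stack(np.unravel_index(flat, image.shape))
focus_values = sml_at(image, pixels)
```

Evaluating the focus measure only at sampled pixels keeps the amount of data passed to the second model small.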

The noise may have a non-Gaussian distribution.

An apparatus according to an embodiment of the present invention includes a control unit that controls depth estimation or 3D shape reconstruction using information stored in a memory.

The control unit may: control generation, for a plurality of first training images having different frame positions in an image sequence, of n pieces of training position data per first training image (where n is a natural number of 2 or more), each with added noise that affects the position; control training, according to a machine learning technique, of a deep learning neural network having an input layer that receives the training position data and an output layer that outputs original position data for the original position of the corresponding first training image; control generation, for a plurality of target images having different frame positions in an image sequence, of n pieces of target position data per target image, each with added noise that affects the position; input each piece of target position data into the first model of the trained neural network to extract estimated position data for the estimated positions of the plurality of target images, and control correction of the position of each target image using the extracted estimated position data; and control performance of depth estimation or 3D shape reconstruction using each corrected target image.

The present invention configured as described above has the advantage that, when collecting image information in a cattle barn monitoring system or the like, it effectively removes the jitter noise contained in the images and can thus collect image information resulting from optimal depth estimation or optimal 3D shape reconstruction.

In addition, by applying long short-term memory, one of the deep learning techniques, the present invention corrects the frame positions of target images altered by jitter noise more accurately, which improves the accuracy of depth estimation or 3D shape reconstruction.

In addition, by using Gaussian process regression for depth estimation or 3D shape reconstruction, the present invention can collect image information resulting from optimal depth estimation or optimal 3D shape reconstruction from only a small amount of data.

The effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a block diagram of an image information collection apparatus 100 according to an embodiment of the present invention.
FIG. 2 is a block diagram of the control unit 150 for controlling image frame position correction in the image information collection apparatus 100 according to an embodiment of the present invention.
FIG. 3 is a block diagram of the control unit 150 for controlling depth estimation by the DFF technique in the image information collection apparatus 100 according to an embodiment of the present invention.
FIG. 4 is a conceptual diagram of depth estimation by the depth from focus (DFF) technique.
FIG. 5 is a flowchart of an image information collection method according to an embodiment of the present invention.
FIG. 6 is a detailed flowchart of S210.
FIG. 7 is a detailed flowchart of S220.
FIG. 8 is a conceptual diagram of a technique for minimizing jitter noise in image frame positions using the first model according to the present invention.
FIG. 9 shows an example of the structure of a long short-term memory neural network.
FIG. 10 shows an example of the structure of the cell at time t in the long short-term memory neural network of FIG. 9.
FIG. 11 is a detailed flowchart of S230.
FIG. 12 is a detailed flowchart of S240.
FIG. 13 shows an example of the principle by which a depth estimate or a 3D reconstructed image is formed by depth from focus (DFF) or shape from focus (SFF).
FIG. 14 is a graph comparing results of the prior art and the present invention.

The above objects, means, and resulting effects of the present invention will become clearer from the following detailed description taken with the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains can readily practice its technical idea. In describing the present invention, detailed descriptions of related known technology are omitted where they would unnecessarily obscure the gist of the invention.

The terms used in this specification are for describing embodiments and are not intended to limit the present invention. In this specification, singular forms also include plural forms as the case may be, unless the text specifically states otherwise. Terms such as "include", "comprise", "be provided with", or "have" do not exclude the presence or addition of one or more elements other than those mentioned.

In this specification, terms such as "or" and "at least one" may denote one of the words listed together or a combination of two or more of them. For example, "A or B" and "at least one of A and B" may include only one of A and B, or both A and B.

In this specification, information presented after "for example" and similar phrases, such as cited characteristics, variables, or values, may not match exactly, and effects such as variations due to tolerances, measurement errors, limits of measurement accuracy, and other commonly known factors should not be taken to limit the embodiments of the present invention.

In this specification, when an element is described as being "connected" or "coupled" to another element, it may be directly connected or coupled to that element, but it should be understood that other elements may be present in between. When an element is described as being "directly connected" or "directly coupled" to another element, it should be understood that no other element is present in between.

In this specification, when an element is described as being "on" or "in contact with" another element, it may directly touch or be connected to that element, but it should be understood that another element may be present in between. When an element is described as being "directly on" or "in direct contact with" another element, it may be understood that no other element is present in between. Other expressions describing relationships between elements, such as "between" and "directly between", are to be interpreted in the same way.

In this specification, terms such as "first" and "second" may be used to describe various elements, but the elements are not limited by these terms. These terms should not be interpreted as limiting the order of the elements and may be used to distinguish one element from another. For example, a "first element" may be named a "second element", and similarly a "second element" may be named a "first element".

Unless otherwise defined, all terms used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive sense unless expressly and specifically defined.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an image information collection apparatus 100 according to an embodiment of the present invention.

The image information collection apparatus 100 according to an embodiment of the present invention performs depth estimation on a plurality of target images having different frame positions in an image sequence and collects the image information resulting from that depth estimation (or 3D shape reconstruction). It may be an electronic device capable of computing, or a computing network.

Here, depth estimation or 3D shape reconstruction is an image processing technique that derives depth information (for example, a depth image) or a reconstructed 3D shape for a target area from a plurality of target images, based on the depth from focus (DFF) technique or the shape from focus (SFF) technique. Depth estimation or 3D shape reconstruction by the DFF or SFF technique may be performed based on Gaussian process regression.

In particular, the image information collection apparatus 100 may be an apparatus applied to various monitoring systems, such as a cattle barn monitoring system. That is, to manage the target area of a monitoring system, the image information collection apparatus 100 may apply the DFF or SFF technique to images of the target area (that is, target images) acquired by an image capture device (for example, a camera) to perform depth estimation or 3D shape reconstruction.

For example, the electronic device may be a desktop personal computer (PC), laptop PC, tablet PC, netbook computer, workstation, personal digital assistant (PDA), smartphone, smartpad, or mobile phone, or a device implemented separately for collecting image information, but is not limited to these.

As shown in FIG. 1, the image information collection apparatus 100 may include an input unit 110, a communication unit 120, a display 130, a memory 140, and a control unit 150.

The input unit 110 generates input data in response to user input and may include various input means. For example, the input unit 110 may include a keyboard, keypad, dome switch, touch panel, touch key, touchpad, mouse, or menu button, but is not limited to these.

The communication unit 120 is the component that communicates with other devices. For example, the communication unit 120 may perform wireless communication such as 5G (5th generation communication), LTE-A (long term evolution-advanced), LTE (long term evolution), Bluetooth, BLE (Bluetooth low energy), NFC (near field communication), or Wi-Fi, or wired communication such as cable communication, but is not limited to these.

For instance, the communication unit 120 may receive from another device image information, a deep learning based model for correcting the frame position of each image in an image sequence (hereinafter the "first model"), and a Gaussian process regression model for fitting a focus curve to the focus value of each pixel of an image (hereinafter the "second model"), and may transmit depth estimation results and the like to another device. The first or second model is received from another device when, in the image information collection method described later, that model is trained on the other device.

The display 130 shows various image data on a screen and may consist of a non-emissive panel or an emissive panel. For example, the display 130 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic LED (OLED) display, a micro electro mechanical systems (MEMS) display, or an electronic paper display, but is not limited to these. The display 130 may also be combined with the input unit 110 and implemented as a touch screen or the like.

The memory 140 stores various information needed for the operation of the image information collection apparatus 100. The stored information may include image information, the first model, the second model, and program information related to the image information collection method described later, but is not limited to these. By type, the memory 140 may include, for example, a hard disk type, magnetic media type, compact disc read only memory (CD-ROM), optical media type, magneto-optical media type, multimedia card micro type, flash memory type, read only memory (ROM) type, or random access memory (RAM) type, but is not limited to these. By use or location, the memory 140 may be a cache, a buffer, main memory, auxiliary memory, or a separately provided storage system, but is not limited to these.

The control unit 150 may perform various control operations of the image information collection device 100. That is, the control unit 150 may control the execution of the image information collection method described later and may control the operation of the remaining components of the image information collection device 100, namely the input unit 110, the communication unit 120, the display 130, and the memory 140. For example, the control unit 150 may include, but is not limited to, a processor, which is hardware, or a process, which is software executed on that processor.

FIG. 2 is a block diagram of the control unit 150 for controlling frame position correction of images in the image information collection device 100 according to an embodiment of the present invention. FIG. 3 is a block diagram of the control unit 150 for controlling depth estimation according to the DFF technique in the image information collection device 100 according to an embodiment of the present invention.

The control unit 150 controls the execution of the image information collection method according to an embodiment of the present invention and, as shown in FIGS. 2 and 3, may include a data generation unit 151, a first learning unit 152, a first application unit 153, a focus measure operation unit 154, a second learning unit 155, a second application unit 156, an extraction unit 157, and a collection unit 158. For example, each of these units may be a hardware component of the control unit 150 or a process, that is, software executed by the control unit 150, but is not limited thereto.

Meanwhile, the first and second models may be models trained according to a machine learning technique. In particular, the first model may be a model trained with first training data using a long short-term memory (LSTM) neural network, which is a deep learning neural network, and the second model may be a model trained with second training data according to the machine learning technique of Gaussian process regression.

For instance, the first and second training data may each include pairs of input data and output data (a data set). The first and second models include a plurality of layers, and a hidden layer holds a function describing the relationship between the input data of the input layer and the output data of the output layer. When input data is fed to the input layer of the first or second model, output data according to that function may be produced at the output layer.

That is, the first and second models express the relationship between input data and output data through multiple layers, and this stack of layers is also referred to as a “neural network”. Each layer in the neural network consists of at least one filter, and each filter has a matrix of weights; that is, each element of the filter's matrix may correspond to a weight value. The training and use of the first and second models are described in detail later.

FIG. 4 is a conceptual diagram of depth estimation according to the depth from focus (DFF) technique.

Referring to FIG. 4, depth estimation according to the DFF technique estimates depth information of a target region using a plurality of 2D images of that region with different degrees of focus. Each 2D image may be acquired according to an image sequence from an image capture device that photographs the target region, and the frame position of each acquired 2D image is determined by that image sequence. A focus value may be derived for each 2D image acquired at its determined frame position, and depth information of the target region may be estimated from the derived focus values according to the DFF technique.
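The DFF principle described above can be sketched in a few lines. This is a minimal illustration rather than the procedure of the invention: it assumes a precomputed stack of per-pixel focus values (`focus_stack`) and a vector of frame positions (`frame_positions`), both hypothetical names, and assigns to each pixel the position of the frame in which its focus value is largest.

```python
import numpy as np

def depth_from_focus(focus_stack, frame_positions):
    """Pick, per pixel, the frame position with the largest focus value.

    focus_stack     : array of shape (num_frames, H, W), assumed precomputed
    frame_positions : array of shape (num_frames,), position of each frame
    """
    focus_stack = np.asarray(focus_stack, dtype=float)
    frame_positions = np.asarray(frame_positions, dtype=float)
    # Index of the best-focused frame for every pixel, shape (H, W).
    best_frame = np.argmax(focus_stack, axis=0)
    # Map frame indices to positions; this is the coarse depth map.
    return frame_positions[best_frame]
```

Because the depth map is read directly from `frame_positions`, any jitter in those positions propagates into the estimated depth, which is why the frame positions are corrected first.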

However, when each 2D image is acquired, jitter noise may occur along the optical axis due to mechanical vibration of the image capture device. When such jitter noise occurs, the frame positions of the 2D images, which are determined by their order in the image sequence, are shifted, and the depth information estimated from these 2D images inevitably becomes less accurate. To solve this problem, the present invention removes the jitter noise using the first model and thereby corrects the frame positions of the 2D images. As a result, the depth information estimated from the corrected 2D images can be more accurate.

Hereinafter, the image information collection method according to the present invention is described in more detail.

FIG. 5 is a flowchart of an image information collection method according to an embodiment of the present invention. FIG. 6 is a detailed flowchart of S210, and FIG. 7 is a detailed flowchart of S220.

Referring to FIG. 5, the image information collection method according to an embodiment of the present invention includes training the first model (S210), correcting the frame positions of target images using the trained first model (S220), training the second model (S230), and performing depth estimation using the trained second model (S240). Referring to FIG. 6, S210 may include S211 and S212. Referring to FIG. 7, S220 may include S221 and S222.

FIG. 8 is a conceptual diagram of a technique for minimizing jitter noise in the positions of image frames using the first model according to the present invention. FIG. 9 shows an example of the structure of an LSTM neural network, and FIG. 10 shows an example of the structure of the cell at time t included in the LSTM neural network of FIG. 9.

First, in S211, the data generation unit 151 of the control unit 150 generates information in which noise is added to the frame positions of a plurality of first training images (hereinafter referred to as “training position data”). Each first training image may be an image frame having a different frame position in an image sequence of a specific region. The noise simulates jitter noise and may have a non-Gaussian distribution, which makes the simulated noise more realistic.

For each first training image having a different frame position, the data generation unit 151 generates n pieces of training position data (where n is a natural number of 2 or more), each with random noise of a non-Gaussian distribution that affects the frame position added.

For example, n pieces of training position data are generated for the first training image at the first frame position of the image sequence, and n pieces of training position data are generated for the first training image at the second frame position. In this way, n pieces of training position data may be generated for every first training image.

Depending on the added noise, each of the n pieces of training position data generated for a given first training image may have a position value equal to or different from the information on the original frame position of that first training image (hereinafter referred to as “original position data”).

For example, the n pieces of training position data generated for the first training image at the m-th frame position of the image sequence (where m is a natural number of 1 or more) may be equal to or different from the value of the original position data of that image (that is, the value associated with 'm-th'), depending on the added noise. The larger the added noise, the further the value of the training position data deviates from the 'm-th' value. If the added noise is 0, the training position data has the same value as the original position data.

That is, the n noise values applied to the n pieces of training position data generated for the first training image at the m-th frame position are generated randomly and may follow a non-Gaussian distribution in order to simulate actual jitter noise.

Thereafter, the data generation unit 151 may generate first training data using the training position data generated for each first training image and the original position data of that first training image. The first training data includes pairs of input data and output data. The input data of the first training data is the data fed to the input layer when the first model is trained and includes the noise-added training position data. The output data of the first training data is the data supplied to the output layer as the training target when the first model is trained and includes the original position data.

For example, the first training data for the first training image at the m-th frame position may include the n generated pieces of training position data as input data and the value of the original position data of that image (that is, the 'm-th' value) as output data. Such first training data may be generated for the first training images at all frame positions.
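The construction of the first training data can be sketched as follows. The Laplace distribution is used here only as one example of a non-Gaussian distribution; the text does not name a specific distribution, and the function name, the `scale` parameter, and the use of the frame index m itself as the original position value are assumptions made for illustration.

```python
import numpy as np

def make_training_pairs(num_frames, n, scale=0.3, seed=0):
    """Build (input, output) pairs for the first model.

    Returns
    -------
    noisy    : (num_frames, n) array, n noisy positions per frame (input data)
    original : (num_frames,) array, original frame positions (output data)
    """
    rng = np.random.default_rng(seed)
    # Assumption: the m-th frame has original position value m (1-based).
    original = np.arange(1, num_frames + 1, dtype=float)
    # Assumption: Laplace noise stands in for the non-Gaussian jitter noise.
    noise = rng.laplace(loc=0.0, scale=scale, size=(num_frames, n))
    # Each row holds the n noisy observations of one frame position.
    noisy = original[:, None] + noise
    return noisy, original
```

With `scale=0.0` every noisy value equals the original position, which matches the zero-noise case described in the text.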

Next, in S212, the first learning unit 152 trains the first model according to a machine learning technique using the training position data generated in S211 (specifically, the first training data).

The first model is a deep learning neural network and may be an LSTM neural network. That is, the first model may include an input layer that receives the training position data, an output layer that outputs the original position data of the corresponding first training image, and a hidden layer holding the function between the input layer and the output layer.

Referring to FIGS. 9 and 10, the LSTM neural network adds an input gate, a forget gate, and an output gate to the LSTM cells of the hidden layer so that unnecessary memory is erased and what must be remembered is retained. That is, the LSTM layer of the network uses four components, the input gate (i), the forget gate (f), the cell candidate (g), and the output gate (o), to forget, update, and output the cell state c_t and the hidden state (output value) h_t at each time step.

The equations related to the LSTM cell may be defined as shown in Table 1 below.

| Element | Expression |
|---|---|
| $\sigma_c$ | State activation function (hyperbolic tangent function) |
| $\sigma_g$ | Gate activation function (sigmoid function) |
| $i_t$ | $\sigma_g(W_i x_t + R_i h_{t-1} + b_i)$ |
| $f_t$ | $\sigma_g(W_f x_t + R_f h_{t-1} + b_f)$ |
| $g_t$ | $\sigma_c(W_g x_t + R_g h_{t-1} + b_g)$ |
| $o_t$ | $\sigma_g(W_o x_t + R_o h_{t-1} + b_o)$ |
| $c_t$ | $f_t \times c_{t-1} + i_t \times g_t$ |
| $h_t$ | $o_t \times \sigma_c(c_t)$ |

In Table 1, the input weights (W), the recurrent weights (R), and the bias (b) are the learnable weights of each LSTM cell. Each is the concatenation of the corresponding quantities of the four components and may be defined as follows.

$$W = \begin{bmatrix} W_i \\ W_f \\ W_g \\ W_o \end{bmatrix}, \quad R = \begin{bmatrix} R_i \\ R_f \\ R_g \\ R_o \end{bmatrix}, \quad b = \begin{bmatrix} b_i \\ b_f \\ b_g \\ b_o \end{bmatrix}$$
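The cell equations of Table 1 can be written out as a short NumPy sketch. It assumes that W, R, and b are stacked in the order i, f, g, o, and that the products in the c_t and h_t rows are element-wise; the function names and array shapes are illustrative and not taken from the source.

```python
import numpy as np

def sigmoid(z):
    """Gate activation function (sigma_g in Table 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, R, b):
    """One time step of an LSTM cell following Table 1.

    W : (4H, D) input weights, R : (4H, H) recurrent weights, b : (4H,) bias,
    each assumed to be stacked in the order i, f, g, o.
    """
    H = h_prev.shape[0]
    z = W @ x_t + R @ h_prev + b
    i_t = sigmoid(z[0:H])            # input gate
    f_t = sigmoid(z[H:2 * H])        # forget gate
    g_t = np.tanh(z[2 * H:3 * H])    # cell candidate (sigma_c = tanh)
    o_t = sigmoid(z[3 * H:4 * H])    # output gate
    c_t = f_t * c_prev + i_t * g_t   # forget old state, add new candidate
    h_t = o_t * np.tanh(c_t)         # hidden state / output value
    return h_t, c_t

def run_lstm(xs, W, R, b):
    """Feed a sequence (e.g. n noisy position values) one item per step."""
    H = R.shape[1]
    h, c = np.zeros(H), np.zeros(H)
    for x_t in xs:
        h, c = lstm_cell_step(np.atleast_1d(x_t), h, c, W, R, b)
    return h
```

In a complete model the final hidden state would still pass through an output layer to yield a single position estimate; that layer and the training loop are omitted here.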

Such an LSTM neural network can solve the long-term dependency problem of a recurrent neural network (RNN), in which the influence of early information diminishes when the sequence is sufficiently long. That is, compared with a plain RNN, the LSTM neural network performs well on long input sequences. Accordingly, the LSTM neural network may be advantageous for learning from the n pieces of training position data with a non-Gaussian distribution, for which influence across distant time steps is needed.

For example, when training with the first training data for the first training image at the m-th frame position, the first learning unit 152 feeds the n pieces of training position data generated for that image to the input layer of the first model as input data. That is, referring to FIG. 9, the first learning unit 152 feeds the n pieces of training position data, one each, to the n input regions of the input layer that correspond to the n cells of the LSTM layer. The first learning unit 152 also supplies the value of the original position data of the first training image at the m-th frame position (that is, the 'm-th' value) to the output layer of the first model as output data. In this way, training may be performed for all first training images at every frame position.

However, S210 may be performed by another device. In that case, the first model resulting from S210 performed by the other device may be received by the image information collection device 100 of the present invention through the communication unit 120 and used in S220 described later. Conversely, the first model resulting from S210 performed by the image information collection device 100 may be transmitted to another device through the communication unit 120. In that case, the image information collection device 100 may operate as a server that trains the LSTM neural network and transmits the resulting first model.

Next, in S221, the data generation unit 151 of the control unit 150 generates information in which noise is added to the frame positions of a plurality of target images (hereinafter referred to as “target position data”). Each target image may be an image frame having a different frame position in the image sequence of the target region. The noise may be of the same kind as that used in S211; that is, it simulates jitter noise and may have a non-Gaussian distribution. S221 may also be performed before S212.

For each target image having a different frame position, the data generation unit 151 generates n pieces of target position data (where n is a natural number of 2 or more), each with random noise of a non-Gaussian distribution that affects the frame position added.

For example, n pieces of target position data are generated for the target image at the first frame position, and n pieces of target position data are generated for the target image at the second frame position. In this way, n pieces of target position data may be generated for every target image.

Depending on the added noise, each of the n pieces of target position data generated for a given target image may have a position value equal to or different from the information on the original frame position of that target image (that is, the original position data).

For example, the n pieces of target position data generated for the target image at the m-th frame position may be equal to or different from the value of the original position data of that image (that is, the 'm-th' value), depending on the added noise. The larger the added noise, the further the value of the target position data deviates from the 'm-th' value. If the added noise is 0, the target position data has the same value as the original position data.

That is, the n noise values applied to the n pieces of target position data generated for the target image at the m-th frame position are generated randomly and may follow a non-Gaussian distribution in order to simulate actual jitter noise.

Next, in S222, the first application unit 153 of the control unit 150 applies each piece of target position data to the first model trained in S212. That is, the first application unit 153 feeds the target position data to the input layer of the first model and extracts the data on the estimated positions of the plurality of target images output from the output layer of the first model (hereinafter referred to as “estimated position data”). The first application unit 153 then corrects the position of each target image using the extracted estimated position data.

For example, when applying the target position data for the target image at the m-th frame position to the first model, the first application unit 153 feeds the n pieces of target position data generated for that image to the input layer of the first model as input data. That is, the first application unit 153 feeds the n pieces of target position data, one each, to the n input regions of the input layer that correspond to the n cells of the LSTM layer. As a result, estimated position data whose value is similar or equal to the value of the original position data of the target image at the m-th frame position (that is, the 'm-th' value) is output from the output layer of the first model. The first application unit 153 then corrects the frame position of the target image at the m-th frame position to the estimated position data output from the output layer of the first model. In this way, the first application unit 153 may correct the frame positions of the target images at all frame positions.

FIG. 11 is a detailed flowchart of S230, and FIG. 12 is a detailed flowchart of S240.

Next, S230 and S240 are performed. In S230, the second model is trained based on Gaussian process regression. In S240, depth estimation or 3D shape reconstruction according to the depth from focus (DFF) or shape from focus (SFF) technique is performed based on the second model trained in S230, using the target images corrected in S222. Referring to FIG. 11, S230 may include S231 and S232. Referring to FIG. 12, S240 may include S241 to S244.

In S231, the focus measure operation unit 154 of the control unit 150 performs a focus measure operation (hereinafter referred to as the “first focus measure operation”) on a plurality of pixels selected from a second training image. Here, the focus measure operation refers to calculating a focus value for each of the selected pixels of the second training image. The selected pixels may be randomly sampled from the second training image and may be pixels that are not adjacent to one another, being separated by at least one pixel.
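One simple way to obtain randomly sampled, mutually non-adjacent pixels is sketched below. The text does not specify the sampling scheme; drawing from a lattice of every second row and column is an assumption chosen because it guarantees at least one pixel between any two selected pixels. The function name and parameters are hypothetical.

```python
import numpy as np

def sample_spaced_pixels(height, width, count, seed=0):
    """Randomly pick `count` pixels such that no two are neighbours.

    Candidates lie on a stride-2 lattice, so two distinct candidates always
    differ by at least 2 in the row or the column coordinate.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.meshgrid(np.arange(0, height, 2),
                         np.arange(0, width, 2), indexing="ij")
    candidates = np.stack([ys.ravel(), xs.ravel()], axis=1)
    if count > len(candidates):
        raise ValueError("not enough spaced candidates for the requested count")
    chosen = rng.choice(len(candidates), size=count, replace=False)
    return candidates[chosen]  # array of (row, column) pairs
```

The lattice restricts sampling to a quarter of the pixels; a rejection-sampling scheme over all pixels would be an alternative if that restriction is undesirable.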

For example, the focus measure operation may use, but is not limited to, the Laplacian technique, the sum of modified Laplacian (SML) technique, the Tenengrad focus measure, or the gray level variance technique. The Laplacian technique, however, is based on spatial frequency and works properly only when the object surface has rich texture. To compensate for this texture problem, it may be preferable to use the SML technique, which can measure focus values even within a small window.

The SML technique may calculate the focus value using the following equation.

Here, SML(x, y) denotes the result of the SML operation (that is, the focus value) for the pixel at coordinates (x, y) of the image, and ML(x, y) denotes the result of the ML operation for that pixel. I(x, y) denotes the gray-level brightness of the pixel at coordinates (x, y), and W denotes the size of the image window.
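As a sketch of the SML focus measure, the code below assumes the commonly used formulation with a unit pixel step and no threshold, which may differ in detail from the equation of the source:

$$ML(x, y) = |2I(x, y) - I(x-1, y) - I(x+1, y)| + |2I(x, y) - I(x, y-1) - I(x, y+1)|$$

$$SML(x, y) = \sum_{i=x-W}^{x+W} \sum_{j=y-W}^{y+W} ML(i, j)$$

Treating W as a half-window width, replicating the image border, and the function names are all assumptions made for illustration.

```python
import numpy as np

def modified_laplacian(image):
    """ML(x, y): sum of absolute second differences along both axes."""
    I = np.pad(np.asarray(image, dtype=float), 1, mode="edge")
    c = I[1:-1, 1:-1]
    ml_h = np.abs(2 * c - I[1:-1, :-2] - I[1:-1, 2:])  # horizontal term
    ml_v = np.abs(2 * c - I[:-2, 1:-1] - I[2:, 1:-1])  # vertical term
    return ml_h + ml_v

def sum_modified_laplacian(image, half_window=1):
    """SML(x, y): sum of ML over a (2W+1) x (2W+1) window around each pixel."""
    ml = modified_laplacian(image)
    k = half_window
    padded = np.pad(ml, k, mode="constant")  # zeros outside the image
    h, w = ml.shape
    sml = np.zeros_like(ml)
    for dy in range(2 * k + 1):
        for dx in range(2 * k + 1):
            sml += padded[dy:dy + h, dx:dx + w]
    return sml
```

A uniform image yields a focus value of zero everywhere, while sharp local intensity changes produce large values, which is the behaviour a focus measure needs.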

Next, in S232, the second learning unit 155 of the control unit 150 trains the second model by the machine learning technique of Gaussian process regression using the focus value of each pixel calculated in S231. That is, the second learning unit 155 may perform Gaussian process regression based on the focus value of each pixel obtained from the first focus measure operation and a fitting function of the focus curve estimated from those focus values.

Here, the input data may include information on the focus values of an image, and the output data may include information on the pixel position in that image corresponding to each focus value. Accordingly, the second model is a model trained by Gaussian process regression to output the corresponding pixel position when it receives the focus values of an image.

For the SFF described later, the second training data may include the focus values obtained through the first focus measure operation as input data, and the regression function may be defined as the focus curve estimated from the training data (a curve in a coordinate system whose x-axis is the focus position and whose y-axis is the focus value). Accordingly, the hidden layer of the second model holds a function describing the relationship between focus value (input data) and pixel position (output data) in an image. This function may reflect the characteristics of the multiple focus curves estimated from the focus values.

Specifically, S232 may include selecting the squared exponential kernel k(i, i') as the kernel function of the probability distribution over the fitting function.

Here, the squared exponential kernel k(i, i') can be expressed by the following equation.
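The equation itself is not reproduced in this text. As a hedged reconstruction, the squared exponential kernel is commonly written in the following form; the length scale ℓ is an assumed hyperparameter that the source does not name, and the original may fix or scale it differently.

```latex
k(i, i') = \exp\!\left( -\frac{\lVert i - i' \rVert^{2}}{2\ell^{2}} \right)
```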

Here, i and i' denote the inputs of the squared exponential kernel function, and ‖·‖ denotes the Euclidean distance.

In addition, S232 may include, for the initial probability distribution over the fitting function, setting the mean to 0 and the kernel function to k(x₀, x₀'). Here, x₀ and x₀' in k(x₀, x₀') denote sets of values on the axis for pixel position (that is, the x-axis) in the target data.

In addition, S232 may include, for the updated probability distribution over the fitting function, setting the mean to m_g(x₀) and the kernel function to k_g(x₀, x₀'). Here, m_g(x₀) and k_g(x₀, x₀') can be expressed by the following equations.
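The equations are not reproduced in this text. Under a zero-mean prior with kernel k, the standard noise-free Gaussian process posterior would take the following form, where x_t and x_t' are the training positions and training focus values; this is offered as a reconstruction from the surrounding definitions, not as the source's exact expression.

```latex
\begin{aligned}
m_g(x_0) &= k(x_0, x_t)\, k(x_t, x_t)^{-1}\, x_t' \\
k_g(x_0, x_0') &= k(x_0, x_0') - k(x_0, x_t)\, k(x_t, x_t)^{-1}\, k(x_t, x_0')
\end{aligned}
```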

Here, x_t denotes the set of values on the axis for pixel position (that is, the x-axis) in the second training data, and x_t' denotes the set of values on the axis for focus value (that is, the y-axis) in the second training data.

However, S230 described above may be performed in another device. In that case, the second model produced by S230 in the other device may be received by the image information collection device 100 of the present invention through the communication unit 120 and used in S240, described later. Conversely, the second model obtained by performing S230 in the image information collection device 100 may be transmitted to another device through the communication unit 120. In that case, the image information collection device 100 may operate as a server that performs Gaussian process regression learning and transmits the resulting second model.

Next, in S241, the focus measurement operation unit 154 of the control unit 150 performs a focus measurement operation (hereinafter, the "second focus measurement operation") on a plurality of pixels selected from the corrected target image. This operation is the same as that described for S231, except that it is performed on pixels selected from the target image rather than from a training image. The selected pixels may be randomly sampled from the target image, and may be pixels that are not adjacent to one another, that is, pixels separated from one another by at least one intervening pixel.
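A minimal Python sketch of this step, assuming NumPy and the commonly cited definition of the Sum of Modified Laplacian (the SML focus measure named elsewhere in this document). The function names, the 1-pixel step, the window size, the threshold, and the stride-2 sampling grid are illustrative assumptions and are not taken from the source.

```python
import numpy as np

def modified_laplacian(img, step=1):
    """Per-pixel |2I - I_left - I_right| + |2I - I_up - I_down| (edges replicated)."""
    img = np.asarray(img, dtype=float)
    p = np.pad(img, step, mode="edge")
    c = p[step:-step, step:-step]
    horiz = np.abs(2 * c - p[step:-step, :-2 * step] - p[step:-step, 2 * step:])
    vert = np.abs(2 * c - p[:-2 * step, step:-step] - p[2 * step:, step:-step])
    return horiz + vert

def sml_at(img, pixels, half_window=1, threshold=0.0):
    """Sum the modified Laplacian over a square window around each (row, col)."""
    ml = modified_laplacian(img)
    ml = np.where(ml >= threshold, ml, 0.0)  # drop responses below the threshold
    h, w = ml.shape
    values = []
    for r, c in pixels:
        r0, r1 = max(r - half_window, 0), min(r + half_window + 1, h)
        c0, c1 = max(c - half_window, 0), min(c + half_window + 1, w)
        values.append(float(ml[r0:r1, c0:c1].sum()))
    return values

def sample_spaced_pixels(shape, count, rng):
    """Randomly pick pixels from a stride-2 grid so that no two picks are adjacent."""
    grid = [(int(r), int(c))
            for r in range(0, shape[0], 2)
            for c in range(0, shape[1], 2)]
    idx = rng.choice(len(grid), size=min(count, len(grid)), replace=False)
    return [grid[i] for i in idx]

# Usage on a synthetic frame (random values, for illustration only).
rng = np.random.default_rng(0)
frame = rng.random((16, 16))
picked = sample_spaced_pixels(frame.shape, 8, rng)
focus_values = sml_at(frame, picked)
```

Sampling from a stride-2 grid is only one simple way to guarantee at least one intervening pixel between any two selected pixels; other spacing schemes would satisfy the same condition.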

Next, in S242, the second application unit 156 of the control unit 150 applies the focus values of the plurality of pixels calculated in S241 to the second model trained in S232. That is, the second application unit 156 inputs the focus values of the plurality of pixels derived from the second focus measurement operation into the second model to fit a plurality of focus curves.

Next, in S243, the extraction unit 157 of the control unit 150 extracts, from the plurality of fitted focus curves, the pixel position in the corrected target image that has the maximum focus value. For example, in FIG. 7, on the fitted focus curve shown as the red graph, pixel position 10 has the maximum focus value (about 8.6). In this case, the extraction unit 157 may extract pixel position 10 as the position having the maximum focus value.
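The fitting and peak extraction of S242 and S243 can be sketched as follows, assuming NumPy, a zero-mean Gaussian process with a squared exponential kernel, and a focus curve whose x-axis is position and whose y-axis is focus value. The function names, the length scale, the jitter term, and the synthetic samples are assumptions made for illustration; this is not the source's implementation.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0):
    """Squared exponential kernel matrix between 1-D position arrays a and b."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(1, -1)
    return np.exp(-0.5 * ((a - b) / length_scale) ** 2)

def gp_fit_curve(x_train, y_train, x_query, length_scale=1.0, jitter=1e-6):
    """Posterior mean of a zero-mean GP at x_query, given sparse focus samples."""
    x_train = np.asarray(x_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # jitter: small diagonal term added for numerical stability (an assumption)
    k_tt = sq_exp_kernel(x_train, x_train, length_scale) + jitter * np.eye(len(x_train))
    k_qt = sq_exp_kernel(x_query, x_train, length_scale)
    return k_qt @ np.linalg.solve(k_tt, y_train)

def peak_position(x_query, fitted):
    """Return (position, value) at the maximum of the fitted focus curve."""
    i = int(np.argmax(fitted))
    return float(x_query[i]), float(fitted[i])

# Synthetic focus samples (made up for illustration): a bump centred at position 10.
x_train = np.arange(0.0, 21.0, 2.0)
y_train = 8.6 * np.exp(-((x_train - 10.0) ** 2) / (2 * 4.0 ** 2))
x_query = np.linspace(0.0, 20.0, 201)
fitted = gp_fit_curve(x_train, y_train, x_query, length_scale=2.0)
print(peak_position(x_query, fitted))
```

Evaluating the posterior mean on a grid finer than the sampled positions is what lets the peak be located from only a subset of the data.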

Next, in S244, the collection unit 158 of the control unit 150 collects the corresponding image information by performing depth estimation according to the DFF technique or 3D shape restoration according to the SFF technique, based on the extracted pixel position. That is, using the maximum focus values obtained through S241 to S243 for the plurality of corrected target images, the collection unit 158 may perform DFF-based depth estimation or SFF-based 3D shape restoration on the corrected target images.

Here, DFF and SFF find the lens position at which the image is in focus and obtain the distance to the in-focus part from the lens formula. That is, to evaluate the degree of focus, DFF and SFF find the position at which the focus value is maximized on a plane perpendicular to the optical axis of the lens, and can thereby measure the depth (distance) or three-dimensional shape of an object.

FIG. 13 shows an example of the principle by which a depth estimate or a 3D reconstructed image is formed by depth from focus (DFF) or shape from focus (SFF).

That is, in FIG. 13, the focused image P' of a light source P located at a distance u from the lens L is formed at a distance v from the lens L. The luminous intensity of the focused image is proportional to that of the object, and the object distance is related to the position of the focused image by the following equation.
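The equation is not reproduced in this text. With f, u, and v as used in this passage, the relation referred to is the thin-lens formula, shown here together with its rearrangement for the object distance u (the rearrangement is added for illustration).

```latex
\frac{1}{f} = \frac{1}{u} + \frac{1}{v}
\qquad\Longrightarrow\qquad
u = \frac{f\,v}{v - f}
```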

Here, f is the focal length, u is the distance from the lens plane to the object, and v is the distance to the focused image. That is, given the luminous intensity and position of the focused image, the lens formula determines the luminous intensity and position of the object, so that depth information or three-dimensional information about the object can be obtained.

In summary, for DFF or SFF, in S241 the focus measure operator SML is applied to pixels randomly sampled from the 2D images. Then, in S242, the focus data of the pixels along the optical axis are fitted using Gaussian process regression. Then, in S243, the pixel position having the maximum value on each fitted focus curve is obtained, and DFF or SFF is applied to these positions to obtain a depth map of the object in the image.
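As a rough illustration of the final step, the sketch below converts per-pixel best-focus frame indices into object distances with the lens formula. It assumes that each frame index corresponds to a known image distance v; that mapping, the function names, and the numbers are hypothetical and do not come from the source.

```python
def object_distance(focal_length, image_distance):
    """Solve 1/f = 1/u + 1/v for the object distance u."""
    if image_distance <= focal_length:
        raise ValueError("image distance must exceed the focal length")
    return focal_length * image_distance / (image_distance - focal_length)

def depth_map(best_frame, image_distances, focal_length):
    """Map each pixel's best-focus frame index to an object distance."""
    return [[object_distance(focal_length, image_distances[k]) for k in row]
            for row in best_frame]

# Hypothetical values: f = 50 mm, two frames focused at v = 60 mm and v = 75 mm.
print(depth_map([[0, 1]], [60.0, 75.0], 50.0))  # [[300.0, 150.0]]
```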

FIG. 14 shows a graph comparing results of the prior art and the present invention.

Meanwhile, in FIG. 14, the dotted graph corresponds to the prior art, which performs SML on every pixel of the image, and the red graph corresponds to the focus curve obtained from only some pixels of the image using Gaussian process regression according to the image information collection method of the present invention.

Referring to FIG. 14, the present invention can obtain the pixel position having the maximum focus value in an image more accurately even from a small amount of data, and can use it to collect image information through optimal depth estimation or optimal 3D shape restoration.

That is, by using Gaussian process regression, the present invention has the advantage of collecting image information through optimal depth estimation or optimal 3D shape restoration from only a small amount of data. In addition, when image information is collected in, for example, a cattle barn monitoring system, the present invention has the advantage of effectively removing the jitter noise contained in the images so that image information can be collected through optimal depth estimation or optimal 3D shape restoration. Furthermore, by applying long short-term memory (LSTM), one of the deep learning techniques, the present invention more accurately corrects the frame position of a target image that has been shifted by jitter noise, and thereby has the advantage of improving the accuracy of depth estimation or 3D shape restoration.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the present invention. Therefore, the scope of the present invention should not be limited to the described embodiments, but should be defined by the claims below and their equivalents.

100: image acquisition device 110: input unit
120: communication unit 130: display
140: memory 150: control unit
151: data generation unit 152: first learning unit
153: first application unit 154: focus measurement operation unit
155: second learning unit 156: second application unit
157: extraction unit 158: collection unit

Claims

A method for collecting image information performed by an electronic device, the method comprising:
generating, for a plurality of first training images having different frame positions in an image sequence, n pieces of training position data (where n is a natural number of 2 or more) for each first training image, to which noise affecting the position is added;
training, according to a machine learning technique, a deep learning neural network having an input layer to which the training position data are respectively input and an output layer that outputs original position data for the original position of the corresponding first training image;
generating, for a plurality of target images having different frame positions in an image sequence, n pieces of target position data for each target image, to which noise affecting the position is added;
inputting each piece of target position data to a first model of the trained neural network to extract estimated position data for the estimated positions of the plurality of target images, and correcting the position of each target image using the extracted estimated position data; and
performing depth estimation or 3D shape restoration using each corrected target image.

The method of claim 1, wherein the deep learning neural network is a long short-term memory (LSTM) neural network.

The method of claim 1, wherein the performing of depth estimation or 3D shape restoration comprises deriving a focus value of each target image based on the corrected position of each target image, and performing the depth estimation or 3D shape restoration using the derived focus value.

The method of claim 1, wherein the performing of depth estimation or 3D shape restoration comprises performing the depth estimation or 3D shape restoration using a second model trained by Gaussian process regression machine learning based on the focus value of each pixel derived from a first focus measurement operation on a plurality of pixels in a second training image and on a fitting function of a focus curve estimated from those focus values.

The method of claim 4, wherein the performing of depth estimation or 3D shape restoration comprises:
performing a second focus measurement operation on a plurality of pixels in the corrected target image;
fitting a plurality of focus curves by inputting, to the second model, the focus values of the plurality of pixels derived from the second focus measurement operation;
extracting, from the plurality of fitted focus curves, a pixel position in the corrected target image having a maximum focus value; and
performing the depth estimation or 3D shape restoration based on the extracted pixel position.

The method of claim 6, wherein the first and second focus measurement operations use a Sum of Modified Laplacian (SML).

The method of claim 6, wherein the SML uses the following formula.

(where I(x, y) is the gray-level brightness at pixel (x, y), and W is the image window size)

The method of claim 4, wherein training of the second model comprises:
selecting a squared exponential kernel k(i, i') as the kernel function of the probability distribution over the fitting function;

(where i and i' are the inputs of the squared exponential kernel function, and ‖·‖ is the Euclidean distance)
for the initial probability distribution over the fitting function, setting the mean to 0 and the kernel function to k(x₀, x₀') (where x₀ and x₀' are sets of values on the x-axis, the axis for pixel position in the target data); and
for the updated probability distribution over the fitting function, setting the mean to m_g(x₀) and the kernel function to k_g(x₀, x₀').

(where x_t is the set of x-axis values, the axis for pixel position in the training data, and x_t' is the set of y-axis values, the axis for focus value in the training data)

The method of claim 5, wherein the performing of the second focus measurement operation comprises randomly sampling a plurality of pixels from the corrected target image and performing the second focus measurement operation on the sampled pixels.

The method of claim 1, wherein the noise has a non-Gaussian distribution.

An apparatus comprising:
a memory; and
a control unit that controls depth estimation or 3D shape restoration using information stored in the memory,
wherein the control unit:
controls generation, for a plurality of first training images having different frame positions in an image sequence, of n pieces of training position data (where n is a natural number of 2 or more) for each first training image, to which noise affecting the position is added;
controls training, according to a machine learning technique, of a deep learning neural network having an input layer to which the training position data are respectively input and an output layer that outputs original position data for the original position of the corresponding first training image;
controls generation, for a plurality of target images having different frame positions in an image sequence, of n pieces of target position data for each target image, to which noise affecting the position is added;
controls input of each piece of target position data to a first model of the trained neural network, extraction of estimated position data for the estimated positions of the plurality of target images, and correction of the position of each target image using the extracted estimated position data; and
controls depth estimation or 3D shape restoration using each corrected target image.