KR102433632B1

KR102433632B1 - Apparatus And Method For Inpainting Depth of LiDAR

Info

Publication number: KR102433632B1
Application number: KR1020200141887A
Authority: KR
Inventors: 이상윤; 이준협; 황상원; 김우진; 이도균; 우성민
Original assignee: 연세대학교 산학협력단
Priority date: 2020-10-29
Filing date: 2020-10-29
Publication date: 2022-08-18
Also published as: KR20220057060A

Abstract

The present invention provides an apparatus and method for inpainting a LiDAR depth image. The apparatus includes: an input data acquisition unit that, when a three-dimensional point cloud image acquired using LiDAR is applied, converts it into a two-dimensional image in a predetermined manner to obtain input data; a background filling network, implemented as a pre-trained artificial neural network, that repeatedly encodes the input data in a trained manner to extract a plurality of feature maps, decodes the encoding feature map (the finally encoded feature map among them) in a trained manner to obtain a binary mask representing a dynamic object region, obtains an attention map by replacing, according to a probabilistic structural similarity, the region corresponding to the binary mask in the corresponding feature map with the region of a previously obtained filling feature map, and repeatedly obtains a filling feature map by decoding the obtained attention map in a trained manner; and a 3D map acquisition unit that receives the finally obtained filling feature map as a filled image and converts it into a three-dimensional image in a predetermined manner. The dynamic object region is thereby replaced with a natural background, so that a high-quality 3D map usable for autonomous driving can be obtained.

Description

Apparatus And Method For Inpainting Depth of LiDAR

The present invention relates to an apparatus and method for inpainting a LiDAR depth image.

Autonomous vehicles are equipped with various sensors such as visual cameras, radar, LiDAR, and GNSS, and the use of these sensors to build precise 3D maps for autonomous driving has been studied in many research fields.

Building an accurate 3D map of the driving scene is essential for fully autonomous driving. However, moving objects and changing environments make it difficult to build a static, accurate 3D map. The typical existing approach to acquiring a 3D high-density (HD) map is to mount high-performance 3D measuring equipment and traverse the same space several times, acquiring 3D point cloud images composed of 3D point clouds. Because the acquired 3D images usually contain various dynamic objects (vehicles, pedestrians, and so on), the surrounding dynamic objects must be removed to build a static 3D map. The empty space left in each region from which a dynamic object was removed must then be filled by mapping information captured when no dynamic object was present.

However, in regions where dynamic objects are frequently present, dynamic objects are captured again and again even across many repeated scans. As a result, the dynamic objects often cannot be removed properly, and even when they are removed, the resulting empty space often cannot be filled.

FIG. 1 shows an example of a 3D point cloud image obtained by a conventional method.

As indicated by the rectangular boxes and ellipses in (a) to (c) of FIG. 1, in a 3D point cloud image obtained by the conventional method, different dynamic objects keep occupying the same positions in areas such as downtown roads where many vehicles move frequently, even when the scene is scanned repeatedly. These objects are detected cumulatively across scans, so they are not recognized as dynamic objects and are not removed. As a result, dynamic objects are erroneously recognized as background and included in the 3D map.

FIG. 2 shows an example of feature points extracted from a 3D point cloud image using a conventional artificial neural network.

In FIG. 2, (a) shows a 3D point cloud image, and (b) and (c) show feature points extracted from the 3D point cloud image of (a) using a conventional artificial neural network and using the handcrafted ISS (intrinsic shape signatures) method, respectively. As (b) and (c) of FIG. 2 show, whether feature points are extracted by an artificial neural network or by a handcrafted method, the background and dynamic objects are not easily distinguished, so feature points are detected on dynamic objects. Consequently, matching 3D point clouds with such techniques introduces errors, and dynamic objects cannot be removed accurately.

In addition, removing dynamic objects manually and matching the removed regions with background information is time-consuming, and the images must be acquired again whenever buildings or structures in the city change.

Korean Patent Application Publication No. 10-2018-0134563 (published December 19, 2018)

An object of the present invention is to provide a LiDAR depth image inpainting apparatus and method that can accurately detect dynamic object regions in a 3D point cloud image acquired using LiDAR and fill the detected regions with background, thereby obtaining a 3D map from which dynamic objects have been removed.

Another object of the present invention is to provide a LiDAR depth image inpainting apparatus and method that fills the detected dynamic object regions with background reflecting the structural characteristics of the surroundings, so that the regions blend in without visible discontinuity and a natural 3D map is obtained.

To achieve the above objects, a LiDAR depth image inpainting apparatus according to an embodiment of the present invention includes: an input data acquisition unit that, when a three-dimensional point cloud image acquired using LiDAR is applied, converts it into a two-dimensional image in a predetermined manner to obtain input data; a background filling network, implemented as a pre-trained artificial neural network, that repeatedly encodes the input data in a trained manner to extract a plurality of feature maps, decodes the encoding feature map, which is the finally encoded feature map among the plurality of feature maps, in a trained manner to obtain a binary mask representing a dynamic object region, obtains an attention map by replacing, according to a probabilistic structural similarity, the region corresponding to the binary mask in the corresponding feature map among the plurality of feature maps with the region of a previously obtained filling feature map, and repeatedly obtains a filling feature map by decoding the obtained attention map in a trained manner; and a 3D map acquisition unit that receives the finally obtained filling feature map as a filled image and converts it into a three-dimensional image in a predetermined manner.

The background filling network may include: an encoding network including a plurality of convolutional layers, each of which receives the input data or the feature map output from the preceding convolutional layer, encodes it, and outputs a feature map; an object detection network that receives the encoding feature map and decodes it in a trained manner to obtain the binary mask, which indicates the region of the input data in which a dynamic object designated by training is located; a filling network that receives the encoding feature map or a previously obtained attention map and decodes it in a trained manner to obtain a filling feature map; and a fusion block unit having a plurality of fusion blocks, each of which obtains a fusion feature map by replacing the region corresponding to the binary mask in the corresponding feature map among the plurality of feature maps with the region of the previously obtained filling feature map, calculates the probabilistic structural similarity between the corresponding feature map and the obtained fusion feature map, and weights the fusion feature map by that similarity in a predetermined manner to obtain the attention map.

The object detection network may obtain a plurality of scaled binary masks by scaling the binary mask to the size of each of the plurality of feature maps obtained in the encoding network.
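The mask scaling described above can be illustrated with a minimal sketch. It assumes nearest-neighbor sampling, which the text does not specify, and the function name `scale_mask`, the example mask, and the feature-map sizes are all hypothetical.

```python
def scale_mask(mask, out_h, out_w):
    """Resize a binary mask (list of rows of 0/1) by nearest-neighbor sampling.

    Nearest-neighbor is an assumption; the source only says the mask is
    scaled to the size of each feature map.
    """
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


# Hypothetical full-resolution mask (4 x 8); 1 marks a dynamic object.
mask = [
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

# Hypothetical (height, width) of each encoder feature map.
feature_sizes = [(4, 8), (2, 4), (1, 2)]

# One scaled binary mask per feature map.
scaled_masks = [scale_mask(mask, h, w) for h, w in feature_sizes]
```

Nearest-neighbor sampling keeps every scaled mask strictly binary, which is convenient when the mask is later used to select regions rather than to blend them.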

Each of the plurality of fusion blocks may include: a fusion unit that receives the corresponding feature map among the plurality of feature maps, the corresponding scaled binary mask among the plurality of scaled binary masks, and the filling feature map previously obtained in the filling network, and obtains a fusion feature map by replacing the region of the corresponding feature map designated by the corresponding scaled binary mask with the region of the previously obtained filling feature map; a structural similarity calculation unit that determines the structural similarity between the corresponding feature map and the fusion feature map according to the structural similarity evaluation of SSIM (structural similarity index measurement) and calculates the probabilistic structural similarity by additionally reflecting a probability distribution; and an attention map acquisition unit that obtains the attention map by weighting the fusion feature map by the calculated probabilistic structural similarity.

The fusion unit may obtain the fusion feature map (F_f^n) according to the equation [equation image not reproduced in the source text], where F_e^n is the feature map, m^n is the scaled binary mask, and F_i^n is the filling feature map.
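Since the equation for the fusion feature map is not reproduced in this text, the following is only a sketch that is consistent with the verbal description of the fusion unit. It assumes that the scaled binary mask m^n equals 1 inside the dynamic object region and 0 elsewhere, and that ⊙ denotes element-wise multiplication; both are assumptions rather than statements from the source.

```latex
F_f^{n} = \left(1 - m^{n}\right) \odot F_e^{n} + m^{n} \odot F_i^{n}
```

Under these assumptions, the fusion feature map keeps the encoder feature map F_e^n wherever the mask is 0 and takes the filling feature map F_i^n wherever the mask is 1.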

The structural similarity calculation unit may calculate the structural similarity (S) according to the equation [equation image not reproduced in the source text], where k is the feature map (F_e^n), q is the fusion feature map (F_f^n), σ is the variance, and c_3 is a predetermined constant, and may calculate the probabilistic structural similarity according to the equation [equation image not reproduced in the source text].
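For orientation, the sketch below computes the structure comparison term of standard SSIM, (σ_kq + c_3) / (σ_k σ_q + c_3), which uses the same symbols k, q, σ, and c_3 as the definitions above. Whether the patent's equation for S has exactly this form cannot be confirmed from this text, and the probabilistic extension is not reproduced here, so it is not attempted. The function name and the default value of `c3` are hypothetical.

```python
import math


def structure_similarity(k, q, c3=1e-4):
    """Structure term of standard SSIM for two equally sized maps (flat lists).

    k stands for the encoder feature map and q for the fusion feature map.
    c3 is a small stabilizing constant; its value here is an assumption.
    """
    n = len(k)
    mean_k = sum(k) / n
    mean_q = sum(q) / n
    # Population covariance between the two maps.
    cov = sum((a - mean_k) * (b - mean_q) for a, b in zip(k, q)) / n
    # Population standard deviations of each map.
    std_k = math.sqrt(sum((a - mean_k) ** 2 for a in k) / n)
    std_q = math.sqrt(sum((b - mean_q) ** 2 for b in q) / n)
    return (cov + c3) / (std_k * std_q + c3)


same = structure_similarity([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
opposite = structure_similarity([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
```

In this sketch, identical maps give a value close to 1 and maps with opposite structure give a negative value, which is the behavior a similarity-based weighting would rely on.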

The attention map acquisition unit may obtain the attention map (C^n) according to the equation [equation image not reproduced in the source text], where λ_atn is the attention weight.

The input data acquisition unit may obtain five two-dimensional range images from the 3D point cloud image, one each for the depth value, x-axis value, y-axis value, z-axis value, and intensity value, and may obtain the input data by concatenating the five two-dimensional range images along the channel axis.

To achieve the above objects, a LiDAR depth image inpainting method according to another embodiment of the present invention includes: when a three-dimensional point cloud image acquired using LiDAR is applied, converting it into a two-dimensional image in a predetermined manner to obtain input data; obtaining a plurality of feature maps by using a pre-trained artificial neural network to repeatedly encode the input data in a trained manner and extract a feature map; using a pre-trained artificial neural network to decode the encoding feature map, which is the finally encoded feature map among the plurality of feature maps, in a trained manner to obtain a binary mask representing a dynamic object region; using a pre-trained artificial neural network to obtain an attention map by replacing, according to a probabilistic structural similarity, the region corresponding to the binary mask in the corresponding feature map among the plurality of feature maps with the region of a previously obtained filling feature map, and decoding the obtained attention map in a trained manner to obtain a filling feature map; and receiving the finally obtained filling feature map as a filled image and converting it into a three-dimensional image in a predetermined manner.

Accordingly, the LiDAR depth image inpainting apparatus and method according to embodiments of the present invention accurately detect dynamic object regions in a 3D point cloud image acquired using LiDAR and fill the detected regions with background in which the structural characteristics of the surroundings are reflected on the basis of a probability distribution. The dynamic object regions are thereby replaced with a natural background, so that a high-quality 3D map usable for autonomous driving can be obtained. In addition, because training can be performed with a generative adversarial network technique, training is possible without a training dataset, and the resulting 3D map can be used as a training dataset for other artificial neural networks.

FIG. 1 shows an example of a 3D point cloud image obtained by a conventional method.
FIG. 2 shows an example of feature points extracted from a 3D point cloud image using a conventional artificial neural network.
FIG. 3 shows a schematic structure of a LiDAR depth image inpainting apparatus according to an embodiment of the present invention.
FIG. 4 shows an example of the input data obtained by the input data acquisition unit of FIG. 3.
FIG. 5 is a diagram illustrating an example of the detailed structure of the background filling network (object compensation network) of FIG. 3.
FIG. 6 shows an example of the configuration of the fusion block of FIG. 5.
FIG. 7 shows an example of a 3D map image obtained using the LiDAR depth image inpainting apparatus according to this embodiment.
FIG. 8 shows a LiDAR depth image inpainting method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the invention, and to the content described in them.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described. Parts irrelevant to the description are omitted for clarity, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification denote a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 3 shows a schematic structure of a LiDAR depth image inpainting apparatus according to an embodiment of the present invention, and FIG. 4 shows an example of the input data obtained by the input data acquisition unit of FIG. 3.

Referring to FIG. 3, the LiDAR depth image inpainting apparatus according to this embodiment may include a point cloud image acquisition unit (100), an input data acquisition unit (200), a background filling network (300), and a 3D map acquisition unit (400).

The point cloud image acquisition unit (100) acquires a 3D point cloud image that was captured without regard to dynamic objects. It may be implemented as a LiDAR that acquires the 3D point cloud image directly, as a communication module that receives a 3D point cloud image acquired by a LiDAR, or as a storage module in which a previously acquired 3D point cloud image is stored. The LiDAR may acquire the 3D point cloud image by scanning its surroundings while rotating through a predetermined rotation range within a predetermined height range. For example, if the LiDAR samples 2,048 times during a 360-degree rotation along each of 64 scan lines within the predetermined height range, it obtains a 3D point cloud image composed of 2,048 × 64 sampled pixels. The acquired 3D point cloud image may include, for each point, position information along the x, y, and z axes, depth information, and intensity information.

The input data acquisition unit (200) converts the 3D point cloud image acquired by the point cloud image acquisition unit (100) into a two-dimensional range image. For each point of the 3D point cloud image, the input data acquisition unit (200) computes the vertical angle (θ) and the horizontal angle (φ) in the spherical coordinate system according to Equations 1 and 2 and projects the point onto a two-dimensional plane, thereby obtaining the two-dimensional range image.
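Equations 1 and 2 are not reproduced in this text, so the following is only a sketch of a commonly used spherical range-image projection, not necessarily the patent's exact formulas. It assumes θ = asin(z / r) and φ = atan2(y, x) with r the distance from the sensor. The function name, the vertical field-of-view limits `fov_up_deg` and `fov_down_deg`, and the pixel mapping are hypothetical.

```python
import math


def project_point(x, y, z, width=2048, height=64,
                  fov_up_deg=2.0, fov_down_deg=-24.8):
    """Map one 3D point to (row, col, depth) in a range image.

    Assumes the point is not at the sensor origin. The field-of-view
    limits are placeholder values, not taken from the source.
    """
    depth = math.sqrt(x * x + y * y + z * z)
    theta = math.asin(z / depth)   # vertical angle
    phi = math.atan2(y, x)         # horizontal angle
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    # The horizontal angle in [-pi, pi] is spread over the image width.
    col = int((phi + math.pi) / (2 * math.pi) * width) % width
    # The vertical angle is spread over the scan lines, top row first.
    row = int((fov_up - theta) / (fov_up - fov_down) * height)
    row = min(max(row, 0), height - 1)
    return row, col, depth


row, col, depth = project_point(10.0, 0.0, 0.0)
```

Each projected pixel can then store the depth together with the original x, y, z, and intensity values of the point that falls on it.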

The input data acquisition unit (200) may normalize the range image to match the LiDAR scan specification. That is, from a LiDAR sensor that scans the full 360 degrees, the two-dimensional range image may be obtained with a size of 2,048 × 64 pixels. However, even when the LiDAR provides a 3D point cloud image covering the full 360 degrees, the input data acquisition unit (200) may extract only a predetermined angular range (90 degrees in this example) as the input data. In that case, the input data may have, for example, 512 × 64 pixels.

In addition, as shown in FIG. 4, the input data acquisition unit (200) may obtain the two-dimensional range image as five separate images, one each for the depth value, x-axis value, y-axis value, z-axis value, and intensity value. In FIG. 4, (a) to (e) show the two-dimensional range images for the depth value, x-axis value, y-axis value, z-axis value, and intensity value, respectively. The input data acquisition unit (200) may then obtain the input data by concatenating the five range images along the channel axis. The input data may therefore be obtained with a size of, for example, 512 × 64 × 5. The obtained input data is applied to the background filling network (300).
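The sector extraction and channel concatenation above can be sketched in plain Python as follows. The helper names are hypothetical, the images are placeholder zeros, and the choice of which 90-degree sector to keep (here the first 512 columns) is an assumption.

```python
def stack_channels(depth, x, y, z, intensity):
    """Combine five H x W images into one H x W x 5 nested list."""
    channels = (depth, x, y, z, intensity)
    height, width = len(depth), len(depth[0])
    return [
        [[ch[r][c] for ch in channels] for c in range(width)]
        for r in range(height)
    ]


def crop_columns(image, start, count):
    """Keep `count` consecutive columns starting at `start`."""
    return [row[start:start + count] for row in image]


full_width, height = 2048, 64
# A 90-degree sector of a 360-degree scan covers a quarter of the columns.
sector_width = full_width * 90 // 360

# Placeholder range image; real data would come from the projection step.
blank = [[0.0] * full_width for _ in range(height)]
stacked = stack_channels(blank, blank, blank, blank, blank)
sector = crop_columns(stacked, 0, sector_width)
```

The result is indexed as rows × columns × channels (64 × 512 × 5), which holds the same data as the 512 × 64 × 5 (width × height × channels) size quoted in the text.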

The background filling network (300), implemented as an artificial neural network, receives the input data, detects dynamic objects in it in a pre-trained manner, erases the detected dynamic object regions in a manner trained on the basis of structural similarity with the background, and fills them with the surrounding background to obtain a background-filled image.

The background filling network (300) may include an encoding network (310), an object detection network (320), a fusion block unit (330), and a filling network (340). The encoding network (310), the object detection network (320), and the filling network (340) may each be implemented as a pre-trained artificial neural network.

FIG. 5 is a diagram illustrating an example of the detailed structure of the background filling network (object compensation network) of FIG. 3.

Describing the background filling network (300) of FIG. 3 with reference to FIG. 5, the encoding network (310) first receives the input data and, in a trained manner, estimates the features of the input data to obtain the encoding feature map. As shown in FIG. 5, the encoding network (310) may comprise a plurality of sequentially connected convolutional layers. FIG. 5 illustrates the case in which the encoding network (310) consists of five convolutional layers, but the number of convolutional layers and the size of each may be adjusted in various ways. As in FIG. 5, however, the first convolutional layer may be a 5 × 5 convolutional layer while the remaining convolutional layers are 3 × 3 convolutional layers. That is, the convolutional layers of the encoding network (310) other than the first preferably all have the same size.

Each of the five sequentially connected convolutional layers receives the input data or the feature map (F_e^(n-1)) output from the preceding convolutional layer (where n is a natural number), estimates features to obtain a feature map (F_e^n), and passes the obtained feature map (F_e^n) to the next convolutional layer. The feature map output from the last convolutional layer is called the encoding feature map (F_e). The encoding network (310) thus extracts the features of the input data through a plurality of convolutional layers to obtain the encoding feature map (F_e).
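The data flow through the encoding network can be sketched as a simple loop that keeps every intermediate feature map, since the fusion blocks use them later. The stand-in layer `halve_width` below is hypothetical and only mimics the shrinking spatial size; it is not a trained convolutional layer.

```python
def encode(input_data, layers):
    """Run the layers in order and keep every intermediate feature map."""
    feature_maps = []
    x = input_data
    for layer in layers:
        x = layer(x)
        feature_maps.append(x)
    # The output of the last layer is the encoding feature map F_e.
    return feature_maps, feature_maps[-1]


def halve_width(rows):
    """Hypothetical stand-in for a trained layer: average column pairs."""
    return [
        [(row[i] + row[i + 1]) / 2 for i in range(0, len(row) - 1, 2)]
        for row in rows
    ]


layers = [halve_width] * 5   # five layers, as in the FIG. 5 example
data = [[float(c) for c in range(64)] for _ in range(2)]
feature_maps, encoding_feature_map = encode(data, layers)
```

Returning the full list as well as the final map reflects the description: the object detection network consumes only F_e, while the fusion blocks consume the earlier feature maps F_e^n.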

The object detection network (320) receives the encoding feature map (F_e) obtained in the encoding network (310) and, in a pre-trained manner, estimates the dynamic object region in the applied feature map (F_e) to extract a binary mask (m) representing that region. Patterns of dynamic objects are learned by the object detection network (320) in advance, so that it extracts the regions of the applied feature map that are estimated to contain dynamic objects. The extracted region is used by the fusion block unit (330), described later, to set the region from which dynamic objects are to be erased in the applied feature map (F_e^n).

The object detection network (320) may likewise comprise a plurality of sequentially connected convolutional layers, as shown in FIG. 5, and the number of layers may be adjusted according to the number of convolutional layers in the encoding network (310), the number of types of dynamic objects to be detected, and so on. However, the two-dimensional size of the feature map (F_e^n) output by each convolutional layer gradually decreases as it passes through the convolutional layers of the encoding network (310) and the object detection network (320). The object detection network (320) may therefore further include a plurality of upsampling layers to increase the two-dimensional size of the feature map (F_e^n). Each upsampling layer is inserted after a predetermined number of convolutional layers and expands the two-dimensional size of the feature map (F_e^n) applied from the preceding layer, in a predetermined or trained manner. For example, the upsampling layers may expand the two-dimensional size by combining the values of the applied feature map along the channel axis in a predetermined manner.
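One common way to enlarge a feature map by recombining values along the channel axis is a depth-to-space (pixel shuffle) rearrangement. The sketch below shows that technique as an illustration only; the text does not state that the upsampling layers work this way, and the function name and block size are hypothetical.

```python
def depth_to_space(feature, block=2):
    """Rearrange an H x W x (C*block*block) map into (H*block) x (W*block) x C."""
    h, w, c_in = len(feature), len(feature[0]), len(feature[0][0])
    c_out = c_in // (block * block)
    out = [[[0.0] * c_out for _ in range(w * block)] for _ in range(h * block)]
    for r in range(h):
        for col in range(w):
            for ch in range(c_in):
                # Split the input channel index into an output channel and
                # a position inside the block x block output patch.
                c, offset = divmod(ch, block * block)
                dr, dc = divmod(offset, block)
                out[r * block + dr][col * block + dc][c] = feature[r][col][ch]
    return out


feature = [[[1.0, 2.0, 3.0, 4.0]]]     # 1 x 1 pixel with 4 channels
upsampled = depth_to_space(feature)    # 2 x 2 pixels with 1 channel
```

This trades channel depth for spatial resolution without interpolation, so every output value is an unmodified input value.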

Meanwhile, the fusion block unit 330 receives the plurality of feature maps (F_e^n) output from the convolutional layers of the encoding network 310, the binary mask applied from the object detection network 320, and the filling feature maps (F_i^n) output from the convolutional layers of the filling network 340, and combines them in a predetermined manner to obtain an attention map (C^n).

As shown in FIG. 4, the fusion block unit 330 may include a plurality of fusion blocks (FB1 to FB3), each of which may be disposed between convolutional layers of the filling network 340.

That is, although FIG. 3 shows the fusion block unit 330 and the filling network 340 separately for convenience of explanation, the fusion blocks (FB1 to FB3) of the fusion block unit 330 may be combined with the filling network 340, as shown in FIG. 4.

Each of the fusion blocks (FB1 to FB3) receives the corresponding feature map (F_e^n) among the feature maps (F_e^1 to F_e^3), the corresponding binary mask (m^n), and the corresponding filling feature map (F_i^n) among the filling feature maps (F_i^1 to F_i^3), and combines them to obtain a fusion feature map (F_f^n). It then obtains the corresponding attention map (C^n) according to the structural similarity between the received feature map (F_e^n) and the obtained fusion feature map (F_f^n).

Here, the corresponding binary mask (m^n) is the binary mask (m) extracted by the object detection network 320, scaled down to match the fusion block (FBn). As described above, the binary mask (m) may be extracted at a two-dimensional size larger than that of the encoding feature map (F_e), so its size also does not match the feature map (F_e^n) applied to the fusion block (FBn). If the fusion block (FBn) masked the feature map (F_e^n) with the unscaled binary mask (m), it would mask regions other than the dynamic object region and cause errors.

Accordingly, the object detection network 320 may scale down the finally obtained binary mask (m) to the size of each of the feature maps (F_e^n) to obtain a plurality of scaled binary masks. Each fusion block (FBn) of the fusion block unit 330 may then receive the corresponding scaled binary mask (m^n) among them and use it to mask the feature map (F_e^n).
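The scale-down step can be sketched as follows. The patent does not state how the mask is reduced, so this minimal example assumes max pooling by an integer factor (a cell is marked dynamic if any pixel it covers is dynamic); `scale_down_mask` and the factors used are hypothetical.

```python
def scale_down_mask(mask, factor):
    """Downscale a binary mask [H][W] by an integer factor using max pooling.

    Assumption: a cell of the scaled mask is 1 if any pixel it covers is 1,
    so dynamic-object pixels are not lost; the patent does not specify this.
    """
    h, w = len(mask), len(mask[0])
    if h % factor or w % factor:
        raise ValueError("mask size must be divisible by factor")
    return [
        [
            max(
                mask[y * factor + dy][x * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            )
            for x in range(w // factor)
        ]
        for y in range(h // factor)
    ]


m = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
# One scaled mask per feature-map resolution (the factors are illustrative).
scaled_masks = {f: scale_down_mask(m, f) for f in (1, 2, 4)}
```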

FIG. 6 shows an example of the configuration of the fusion block of FIG. 5.

Referring to FIG. 6, each of the fusion blocks (FB1 to FB3) may include a feature fusion unit 331, a structural similarity calculation unit 332, and an attention map acquisition unit 333.

In the n-th fusion block (FBn), the feature fusion unit 331 receives the feature map (F_e^n) from the corresponding convolutional layer of the encoding network 310, the corresponding scaled binary mask (m^n) among the scaled binary masks from the object detection network 320, and the filling feature map (F_i^n) from the corresponding convolutional layer of the filling network 340. It then obtains the fusion feature map (F_f^n) from these three inputs according to Equation 3.

According to Equation 3, the feature fusion unit 331 keeps the received feature map (F_e^n) in regions outside the scaled binary mask (m^n) and fills regions inside the mask with the corresponding regions of the filling feature map (F_i^n) to obtain the fusion feature map (F_f^n). That is, it synthesizes the fusion feature map (F_f^n) by replacing the region of the feature map (F_e^n) designated by the scaled binary mask (m^n) with the filling feature map (F_i^n).
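Equation 3 itself is not reproduced in this text. From the description, a natural reading is the elementwise blend F_f^n = (1 - m^n) * F_e^n + m^n * F_i^n; the sketch below assumes that form, with feature maps as nested lists and a hypothetical function name `fuse`.

```python
def fuse(f_e, f_i, m):
    """Return F_f = (1 - m) * F_e + m * F_i, element by element.

    f_e, f_i: feature maps as [C][H][W] nested lists of equal shape.
    m: scaled binary mask as [H][W] with values 0 or 1, shared by all channels.
    """
    return [
        [
            [
                (1 - m[y][x]) * f_e[c][y][x] + m[y][x] * f_i[c][y][x]
                for x in range(len(m[0]))
            ]
            for y in range(len(m))
        ]
        for c in range(len(f_e))
    ]


f_e = [[[1.0, 2.0], [3.0, 4.0]]]   # encoder feature map, one channel
f_i = [[[9.0, 9.0], [9.0, 9.0]]]   # filling feature map
mask = [[0, 1], [0, 0]]            # only the top-right cell is dynamic
f_f = fuse(f_e, f_i, mask)
```

Only the masked cell takes its value from the filling feature map; every other cell keeps the encoder value.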

The obtained fusion feature map (F_f^n) is then passed to the structural similarity calculation unit 332 and the attention map acquisition unit 333.

The structural similarity calculation unit 332 receives the feature map (F_e^n) that was applied to the feature fusion unit 331 and the fusion feature map (F_f^n) obtained by it, and determines the similarity between the two from a structural point of view.

The structural similarity calculation unit 332 may be implemented as a Structural Attention Module (SAM). The SAM uses the structural component of SSIM (structural similarity index measurement), a measure used mainly for image quality assessment, to determine the structural similarity between the feature map (F_e^n) and the fusion feature map (F_f^n).

As shown in Equation 4, SSIM is a technique proposed to compare the similarity between two images (k, q) in terms of luminance (l), contrast (c), and structure (s).

(Here, α, β, and γ are weights.)

The luminance (l), contrast (c), and structure (s) similarities of Equation 4 may each be calculated according to Equation 5.

(Here, k and q are the images being compared, μ is the mean, and σ is the variance; c_1, c_2, and c_3 are predetermined constants.)
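Equations 4 and 5 are not reproduced in this text. For reference, the standard SSIM definition, which the symbols above appear to follow, is given below; treating it as the exact form of Equations 4 and 5 is an assumption. In the standard definition, σ_k and σ_q are standard deviations and σ_kq is the covariance of k and q.

```latex
\mathrm{SSIM}(k, q) = l(k, q)^{\alpha} \, c(k, q)^{\beta} \, s(k, q)^{\gamma}

l(k, q) = \frac{2\mu_k \mu_q + c_1}{\mu_k^2 + \mu_q^2 + c_1}, \quad
c(k, q) = \frac{2\sigma_k \sigma_q + c_2}{\sigma_k^2 + \sigma_q^2 + c_2}, \quad
s(k, q) = \frac{\sigma_{kq} + c_3}{\sigma_k \sigma_q + c_3}
```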

In this embodiment, however, the structural similarity calculation unit 332 evaluates the similarity between the two images using only the structural component (s) among the three SSIM components (l, c, s) of Equation 5, as described above. The two images compared are the input and output of the feature fusion unit 331, namely the feature map (k = F_e^n) and the fusion feature map (q = F_f^n). That is, the structural similarity calculation unit 332 determines the structural similarity between the input image and the output image of the feature fusion unit 331.
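As a minimal sketch of the structural component alone, the following computes s = (σ_kq + c_3) / (σ_k σ_q + c_3) over two flattened patches, assuming the standard SSIM definition. The function name `structure_component` and the default value of `c3` are illustrative choices, not values from the patent.

```python
import math


def structure_component(k, q, c3=1e-4):
    """SSIM structure term s = (cov_kq + c3) / (std_k * std_q + c3).

    k, q: equally sized flat lists of values (e.g. one flattened window).
    c3: small stabilizing constant; the default here is an arbitrary choice.
    """
    n = len(k)
    mu_k = sum(k) / n
    mu_q = sum(q) / n
    var_k = sum((a - mu_k) ** 2 for a in k) / n
    var_q = sum((b - mu_q) ** 2 for b in q) / n
    cov = sum((a - mu_k) * (b - mu_q) for a, b in zip(k, q)) / n
    return (cov + c3) / (math.sqrt(var_k * var_q) + c3)


# A scaled copy has the same structure; a reversed copy has the opposite one.
same = structure_component([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
opposite = structure_component([1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0])
```

Because the term is a stabilized correlation coefficient, it ignores differences in mean and scale, which is why it suits a comparison of structure between F_e^n and F_f^n.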

In addition, to take a probabilistic distribution of the structural similarity into account, the structural similarity calculation unit 332 applies a softmax function as in Equation 6 to determine the probabilistic structural similarity (S^n).
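Equation 6 is not reproduced here, so the axis over which the softmax normalizes is unknown. The sketch below simply assumes a flat list of per-region structural similarity scores and turns them into a distribution; `softmax` and the sample scores are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list of similarity scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-region structural similarities for one fusion block.
s_raw = [0.9, 0.1, 0.5]
s_prob = softmax(s_raw)
```

The outputs are positive, sum to one, and preserve the ordering of the raw scores.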

The attention map acquisition unit 333 obtains the attention map (C^n) according to Equation 7 from the fusion feature map (F_f^n) obtained by the feature fusion unit 331 and the probabilistic structural similarity (S^n) obtained by the structural similarity calculation unit 332.

According to Equation 7, the attention map acquisition unit 333 obtains the attention map (C^n) by weighting the fusion feature map (F_f^n) with the probabilistic structural similarity (S^n) scaled by the attention weight (λ_atn).

Although not shown, the attention map acquisition unit 333 may further include a deconvolution layer to expand the two-dimensional size of the attention map (C^n) before outputting it.

That is, the higher the probabilistic structural similarity (S^n), the more strongly the fusion feature map (F_f^n), obtained by fusing the feature map (F_e^n) and the filling feature map (F_i^n), is emphasized in the attention map (C^n).
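The weighting can be illustrated as below. Equation 7 is not reproduced in this text, so the sketch assumes the simplest reading, an elementwise product C^n = λ_atn * S^n * F_f^n; the actual equation may differ, and `attention_map` is a hypothetical name.

```python
def attention_map(f_f, s, lambda_atn):
    """Weight a fusion feature map by lambda_atn * S, element by element.

    f_f: fusion feature map as [C][H][W]; s: similarity map as [H][W].
    Assumption: Equation 7 is a plain elementwise product.
    """
    return [
        [
            [lambda_atn * s[y][x] * value for x, value in enumerate(row)]
            for y, row in enumerate(channel)
        ]
        for channel in f_f
    ]


fusion = [[[1.0, 2.0], [3.0, 4.0]]]
similarity = [[0.5, 0.25], [0.125, 0.125]]
c_map = attention_map(fusion, similarity, lambda_atn=2.0)
```

Cells with a higher similarity keep more of their fusion feature value, matching the emphasis described above.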

The filling network 340 includes a plurality of convolutional layers. The first convolutional layer receives the encoding feature map (F_e) and decodes it in a learned manner to obtain the initial filling feature map (F_i^1). Each remaining convolutional layer receives the attention map (C^(n-1)) from the preceding corresponding fusion block (FBn-1) of the fusion block unit 330, decodes it according to a pre-learned method so that the region where the feature map (F_e^(n-1)) and the filling feature map (F_i^(n-1)) were fused changes smoothly, and outputs the filling feature map (F_i^n).

The filling network 340 may further include an upsampling layer, disposed between the first convolutional layer, to which the encoding feature map (F_e) is applied, and the first fusion block (FB1), that expands the two-dimensional size of the initial filling feature map (F_i^1) to the size of the feature map (F_e^1).

The 3D map acquisition unit 400 receives the filling image, which is the filling feature map (F_i^n) finally output from the filling network 340, and converts it into three dimensions in a predetermined manner to obtain a 3D map image. For example, the 3D map acquisition unit 400 may perform the 3D conversion as the inverse of the two-dimensional image acquisition method of the input data acquisition unit 200.

To use the LiDAR depth image filling apparatus described above, the background filling network 300, which is implemented as an artificial neural network, must be trained in advance. Accordingly, the apparatus according to this embodiment may further include a learning unit (not shown) that is coupled during training of the background filling network 300.

The learning unit may perform training using training data in which the dynamic object regions are set in advance. The learning unit may first obtain the mask loss (L_m) for the object detection network 320 of the background filling network 300 according to Equation 8.

(Here, m_gt is the ground truth for the dynamic object region, ‖·‖₁ is the L1 norm, and λ_m is the mask loss weight.)
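Equation 8 is not reproduced here, but the symbols suggest a weighted L1 distance between the predicted mask and its ground truth, L_m = λ_m * ‖m - m_gt‖₁. The sketch below assumes that form and takes the norm as a plain sum; `mask_loss` is a hypothetical name.

```python
def mask_loss(m_pred, m_gt, lambda_m=1.0):
    """L_m = lambda_m * sum(|m_pred - m_gt|) over all mask pixels.

    Assumption: the L1 norm is taken as a plain sum (not a mean).
    """
    return lambda_m * sum(
        abs(p - g)
        for row_p, row_g in zip(m_pred, m_gt)
        for p, g in zip(row_p, row_g)
    )


m_pred = [[0.0, 1.0], [1.0, 0.0]]   # predicted mask, one false positive
m_gt = [[0.0, 1.0], [0.0, 0.0]]     # ground-truth dynamic object region
loss = mask_loss(m_pred, m_gt, lambda_m=0.5)
```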

The learning unit may then obtain the filling loss (L_i) for the fusion block unit 330 and the filling network 340 according to Equation 9.

(Here, the filling image is the final image, output by the filling network 340, in which the dynamic objects have been filled with background, and I_gt is the background ground truth.)

If a large amount of training data is available, the learning unit may calculate the total loss (L_total) as the sum of the mask loss (L_m) of Equation 8 and the filling loss (L_i) of Equation 9, and train the background filling network 300 by backpropagating it. However, while training data in which the dynamic objects are labeled in a 3D image containing them is relatively easy to collect, it is not practical to acquire a large training data set of 3D images in which the dynamic objects have been removed and filled with background.

Accordingly, the learning unit of this embodiment may apply a generative adversarial network (GAN) technique so that training is possible even without such a training data set.

To apply the GAN technique, the learning unit uses the background filling network 300 as the generator and additionally provides a separate discriminator. The discriminator is trained together with the generator to determine whether an obtained filling image is a real image or a fake image, while the generator, the background filling network 300, is trained so that the discriminator recognizes the filling images it produces as real.

The learning unit calculates the generation loss (L_G) of the generator, the background filling network 300, according to Equation 10, based on the discriminator's output (D(·)) for the filling image, and calculates the adversarial loss (L_D) according to Equation 11.

Here, E denotes the expectation over the data distribution.

As a result, the learning unit may perform training without training data by taking the sum of the generation loss (L_G) and the adversarial loss (L_D) of Equations 10 and 11 as the total loss (L_total), or it may take the sum of the mask loss (L_m) of Equation 8 together with the generation loss (L_G) and the adversarial loss (L_D) of Equations 10 and 11 as the total loss (L_total).
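Equations 10 and 11 are not reproduced in this text. As an illustration only, the sketch below uses the widely used cross-entropy form of the GAN losses, with expectations replaced by batch means; whether the patent uses exactly this form is an assumption, and all function names and sample values are hypothetical.

```python
import math


def mean(values):
    return sum(values) / len(values)


def generation_loss(d_fake):
    """L_G: low when the discriminator scores filled images as real (near 1)."""
    return -mean([math.log(p) for p in d_fake])


def adversarial_loss(d_real, d_fake):
    """L_D: low when real images score near 1 and filled images near 0."""
    return -mean([math.log(p) for p in d_real]) - mean(
        [math.log(1.0 - p) for p in d_fake]
    )


# Hypothetical discriminator outputs in (0, 1).
d_real = [0.9, 0.8]
d_fake = [0.2, 0.3]
l_g = generation_loss(d_fake)
l_d = adversarial_loss(d_real, d_fake)
l_total = l_g + l_d   # a mask loss term could be added here as well
```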

When training is performed with the total loss (L_total) taken as the sum of the mask loss (L_m) of Equation 8 and the generation loss (L_G) and adversarial loss (L_D) of Equations 10 and 11, the trained LiDAR depth image filling apparatus according to this embodiment may also be used as a training data acquisition apparatus for training other apparatuses.

FIG. 7 shows an example of a 3D map image obtained using the LiDAR depth image filling apparatus according to this embodiment.

In FIG. 7, (a) shows the input image of the LiDAR depth image filling apparatus, a 3D map image containing dynamic objects; (b) shows the 3D map image obtained from the input image according to this embodiment; and (c) shows the 3D map image obtained when the fusion block (FBn) of the fusion block unit 330 fuses the feature map (F_e^n) and the filling feature map (F_i^n) without considering structural similarity. In (b) and (c), the purple line represents the ground truth of the background that replaces the dynamic object, and the green line represents the background of the obtained 3D map image.

As shown in FIG. 7(a), the input image contains large regions where the background is hidden by dynamic objects, whereas in (b) and (c) the dynamic object regions have been filled. In (c), however, there is a large difference between the purple ground-truth line and the obtained green line, whereas in this embodiment, which considers the probabilistic structural similarity, the result closely follows the ground truth. That is, the dynamic objects are replaced with background that closely matches the ground truth.

FIG. 8 shows a LiDAR depth image filling method according to an embodiment of the present invention.

The LiDAR depth image filling method of FIG. 8 is described with reference to FIGS. 3 to 6. First, a 3D point cloud image, which is a depth image acquired by LiDAR without regard to dynamic objects, is obtained (S11). The obtained 3D point cloud image is then converted into two-dimensional range images in a predetermined manner to obtain input data (S12). Here, the 3D point cloud image may be converted into five two-dimensional range images, one each for the depth, x-axis, y-axis, z-axis, and intensity values, and the five range images may be combined along the channel axis to obtain the input data.
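Step S12 can be sketched as follows. The patent only says the conversion is done "in a predetermined manner", so this example assumes a spherical projection, a common choice for rotating LiDAR; the function name, image size, and field-of-view values are all hypothetical.

```python
import math


def to_range_image(points, height, width, fov_up_deg, fov_down_deg):
    """Project (x, y, z, intensity) points into a 5-channel range image.

    Channels: depth, x, y, z, intensity, stacked as [5][height][width].
    Assumption: spherical projection with the given vertical field of view;
    cells without a point stay 0, and the nearest point wins a shared cell.
    """
    fov_up = math.radians(fov_up_deg)
    fov_down = math.radians(fov_down_deg)
    image = [[[0.0] * width for _ in range(height)] for _ in range(5)]
    for x, y, z, intensity in points:
        depth = math.sqrt(x * x + y * y + z * z)
        if depth == 0.0:
            continue
        yaw = math.atan2(y, x)
        pitch = math.asin(z / depth)
        if not fov_down <= pitch <= fov_up:
            continue
        col = min(width - 1, int(0.5 * (1.0 - yaw / math.pi) * width))
        row = min(height - 1, int((fov_up - pitch) / (fov_up - fov_down) * height))
        if image[0][row][col] == 0.0 or depth < image[0][row][col]:
            for ch, value in enumerate((depth, x, y, z, intensity)):
                image[ch][row][col] = value
    return image


cloud = [(10.0, 0.0, 0.0, 0.7)]   # one point straight ahead
img = to_range_image(cloud, height=4, width=8, fov_up_deg=15.0, fov_down_deg=-15.0)
```

Stacking the five images along the first axis gives the channel-wise combination described above; inverting the same angle-to-pixel mapping would recover 3D points from a filled image.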

The input image is then fed to the encoding network 310, which is implemented as a pre-trained artificial neural network, and encoded according to the learned method to obtain the encoding feature map (F_e) (S13). The encoding network consists of a plurality of sequentially connected convolutional layers; each convolutional layer extracts a feature map (F_e^n) stage by stage from the input image, and the final convolutional layer outputs its extracted feature map as the encoding feature map (F_e).

Once the encoding feature map (F_e) is obtained, the object detection network 320, implemented as a pre-trained artificial neural network, receives it and decodes it in a learned manner to obtain the binary mask (m) representing the region of the input image corresponding to dynamic objects (S14). The object detection network 320 may also include a plurality of convolutional layers to perform sequential decoding, and may further include a plurality of upsampling layers, disposed at predetermined intervals, to expand the two-dimensional size of the feature map reduced by the convolution operations.

The obtained binary mask (m) may then be scaled down to the size of each of the feature maps (F_e^n) to obtain a plurality of scaled binary masks.

Once the scaled binary masks are obtained, the first convolutional layer among the convolutional layers of the filling network 340, which is implemented as a pre-trained artificial neural network, decodes the encoding feature map (F_e) according to the learned method to obtain the initial filling feature map (F_i^1) (S15).

Once the initial filling feature map (F_i^1) is obtained, the dynamic object region is masked in the feature map (F_e^1) corresponding to the initial filling feature map (F_i^1) among the feature maps (F_e^n), according to the corresponding scaled binary mask (m^1) among the scaled binary masks, and the masked region is filled according to Equation 3 with the background, that is, the corresponding region of the initial filling feature map (F_i^1), to obtain the fusion feature map (F_f^1) (S16).

The probabilistic structural similarity (S^1) between the feature map (F_e^1) and the fusion feature map (F_f^1) is then calculated according to Equation 6 (S17). Thereafter, the fusion feature map (F_f^1) is weighted by the calculated probabilistic structural similarity (S^1) according to Equation 7 to obtain the attention map (C^1) (S18).

Once the attention map (C^1) is obtained, the next convolutional layer among the convolutional layers of the filling network 340 receives the attention map (C^1) and decodes it according to the learned method to obtain the filling feature map (F_i^2) (S19). It is then determined whether the number of background fills is smaller than a predetermined number (S20), where the predetermined number corresponds to the number of convolutional layers of the filling network 340. If the predetermined number has not been reached, the region of the corresponding feature map (F_e^2) designated by the scaled binary mask (m^2) is filled again using the obtained filling feature map (F_i^2) (S16).

If it is determined that the predetermined number has been reached, the obtained filling feature map (F_i^n) is taken as the filling image and converted into three dimensions in a predetermined manner to obtain a 3D map image (S21).
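The control flow of steps S15 to S20 can be sketched as a loop. This is only an outline of the iteration order under the assumption that the layers, the fusion step, and the attention step are supplied as callables; every name below is hypothetical, and scalars stand in for feature maps in the toy run.

```python
def fill_background(f_e_maps, masks, f_e_final, layers, fuse, attend):
    """Run the S15 to S20 loop and return the final filling feature map.

    f_e_maps, masks: per-block feature maps F_e^n and scaled masks m^n.
    layers: filling-network layers as callables; layers[0] decodes F_e.
    fuse(f_e, f_i, m) -> F_f and attend(f_e, f_f) -> C are supplied by the
    caller. All names here are hypothetical.
    """
    f_i = layers[0](f_e_final)                       # S15: initial F_i^1
    for n in range(1, len(layers)):                  # S20: one pass per layer
        f_f = fuse(f_e_maps[n - 1], f_i, masks[n - 1])   # S16: fusion map
        c = attend(f_e_maps[n - 1], f_f)             # S17 onward: attention map
        f_i = layers[n](c)                           # S19: next filling map
    return f_i                                       # handed to S21


# Toy run with scalars standing in for feature maps.
result = fill_background(
    f_e_maps=[1.0, 2.0],
    masks=[1, 0],
    f_e_final=5.0,
    layers=[lambda v: v + 1.0, lambda v: v * 2.0, lambda v: v - 1.0],
    fuse=lambda f_e, f_i, m: (1 - m) * f_e + m * f_i,
    attend=lambda f_e, f_f: f_f,
)
```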

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disk, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible.

Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

100: point cloud image acquisition unit 200: input data acquisition unit
300: background filling network 310: encoding network
320: object detection network 330: fusion block unit
FB: fusion block 331: feature fusion unit
332: structural similarity calculation unit 333: attention map acquisition unit
340: filling network 400: 3D map acquisition unit

Claims

1. A LiDAR depth image filling apparatus comprising:
an input data acquisition unit configured to, when a three-dimensional point cloud image obtained using LiDAR is applied, convert it into a two-dimensional image in a predetermined manner to obtain input data;
a background filling network implemented as a pre-trained artificial neural network and configured to repeatedly encode the input data according to a learned method to extract a plurality of feature maps, decode the encoding feature map, which is the finally encoded feature map among the plurality of feature maps, according to a learned method to obtain a binary mask representing a dynamic object region, obtain an attention map by replacing, according to a probabilistic structural similarity, the region corresponding to the binary mask in a corresponding feature map among the plurality of feature maps with the region of a previously obtained filling feature map, and repeatedly obtain a filling feature map by decoding the obtained attention map in a pre-learned manner; and
a 3D map acquisition unit configured to receive the finally obtained filling feature map as a filling image and convert it into a three-dimensional image in a predetermined manner,
wherein the background filling network comprises:
an encoding network including a plurality of convolutional layers, each of which receives the input data or the feature map output from the preceding convolutional layer and encodes it to output a feature map;
an object detection network that receives the encoding feature map and decodes it according to a learned method to obtain the binary mask indicating the region of the input data in which a dynamic object designated by learning is located;
a filling network that obtains a filling feature map by decoding the encoding feature map or the previously obtained attention map in a learned manner; and
a fusion block unit including a plurality of fusion blocks, each of which obtains a fusion feature map by replacing the region corresponding to the binary mask in a corresponding feature map among the plurality of feature maps with the region of a previously obtained filling feature map, calculates a probabilistic structural similarity between the corresponding feature map and the obtained fusion feature map, and weights the fusion feature map with it in a predetermined manner to obtain the attention map,
wherein the attention map (C^n) is obtained according to the equation

(where λ_atn is the attention weight, F_f^n is the fusion feature map, and S^n is the probabilistic structural similarity).

2. (Deleted)

3. The apparatus of claim 1, wherein the object detection network obtains a plurality of scaled binary masks by scaling the binary mask to the size of each of the plurality of feature maps obtained from the encoding network.

4. The apparatus of claim 3, wherein each of the plurality of fusion blocks comprises:
a feature fusion unit that receives the corresponding feature map among the plurality of feature maps, the corresponding scaled binary mask among the plurality of scaled binary masks, and the filling feature map previously obtained from the filling network, and obtains the fusion feature map by replacing the region of the corresponding feature map designated by the corresponding scaled binary mask with the region of the previously obtained filling feature map;
a structural similarity calculation unit that determines the structural similarity between the corresponding feature map and the fusion feature map according to the structural similarity evaluation method of SSIM (structural similarity index measurement), and calculates the probabilistic structural similarity by additionally reflecting a probabilistic distribution; and
an attention map acquisition unit that obtains the attention map by weighting the fusion feature map with the calculated probabilistic structural similarity.

The apparatus of claim 4, wherein the fusion unit
obtains the fusion feature map (F_fⁿ) according to the equation

(where F_eⁿ is the feature map,

is the scaled binary mask, and F_iⁿ is the filling feature map).
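
The equation itself is not reproduced in this text. As a hedged sketch of the replacement the claim describes, the code below assumes the common masked blend F_f = (1 - M) * F_e + M * F_i, where M is 1 inside the dynamic object region. The function name `fuse` and the (C, H, W) array layout are assumptions, not taken from the patent.

```python
import numpy as np


def fuse(feature_map, scaled_mask, fill_map):
    """Replace the masked region of an encoder feature map with the fill feature map.

    feature_map, fill_map: arrays of shape (C, H, W).
    scaled_mask: binary array of shape (H, W), 1 where a dynamic object lies.
    Assumed form: F_f = (1 - M) * F_e + M * F_i (not confirmed by the source).
    """
    m = scaled_mask.astype(feature_map.dtype)[None, :, :]  # broadcast over channels
    return (1.0 - m) * feature_map + m * fill_map
```

Outside the mask the encoder features pass through unchanged, so only the dynamic object region is taken from the previously obtained filling feature map.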

The apparatus of claim 5, wherein the structural similarity calculation unit
calculates the structural similarity (S) according to the equation

(where k is the feature map (F_eⁿ), q is the fusion feature map (F_fⁿ), σ is the variance, and c₃ is a predetermined constant),
and calculates the probabilistic structural similarity according to the equation

.
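
Neither equation is reproduced in this text. For orientation only, the sketch below computes the structure term of the standard SSIM definition, (σ_kq + c₃) / (σ_k σ_q + c₃), which uses the same symbols the claim names. Whether the patent uses exactly this form is an assumption, and the probabilistic extension is not sketched because its definition is not recoverable here. The function name, the global (non-windowed) statistics, and the default value of `c3` are illustrative choices.

```python
import numpy as np


def structure_similarity(k, q, c3=1e-4):
    """Structure term of standard SSIM between two feature maps.

    Assumption: statistics are taken over the whole array rather than over
    local windows, and c3 is an arbitrary small stabilising constant.
    """
    k = np.asarray(k, dtype=np.float64).ravel()
    q = np.asarray(q, dtype=np.float64).ravel()
    cov = np.mean((k - k.mean()) * (q - q.mean()))  # covariance of k and q
    return (cov + c3) / (k.std() * q.std() + c3)
```

Identical inputs give a value of 1, and inputs whose variations are opposite in sign give a negative value, which is the behaviour a similarity-based weighting would rely on.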

delete

The apparatus of claim 1, wherein the input data acquisition unit
obtains five two-dimensional range images from the 3D point cloud image according to the depth value, the x-axis value, the y-axis value, the z-axis value, and the intensity value, respectively, and concatenates the five two-dimensional range images along the channel axis to obtain the input data.
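
The conversion in this claim is described only as "predetermined". A minimal sketch, assuming the spherical projection commonly used for rotating LiDAR sensors, is shown below. The function name `to_range_image`, the image size, and the vertical field-of-view defaults are all assumptions rather than values from the patent.

```python
import numpy as np


def to_range_image(points, h=64, w=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
    """Project an (N, 4) array of x, y, z, intensity into a (5, h, w) image.

    Channels follow the order in the claim: depth, x, y, z, intensity.
    Assumption: spherical projection; pixels hit by no point stay zero.
    """
    points = np.asarray(points, dtype=np.float64)
    x, y, z, intensity = points.T
    depth = np.sqrt(x**2 + y**2 + z**2)
    valid = depth > 0  # drop points at the sensor origin
    x, y, z, intensity, depth = (a[valid] for a in (x, y, z, intensity, depth))

    yaw = np.arctan2(y, x)         # horizontal angle
    pitch = np.arcsin(z / depth)   # vertical angle
    fov_up, fov_down = np.radians(fov_up_deg), np.radians(fov_down_deg)

    col = 0.5 * (1.0 - yaw / np.pi) * w
    row = (fov_up - pitch) / (fov_up - fov_down) * h
    col = np.clip(np.floor(col), 0, w - 1).astype(np.int64)
    row = np.clip(np.floor(row), 0, h - 1).astype(np.int64)

    image = np.zeros((5, h, w), dtype=np.float32)
    # If several points fall into one pixel, which one is kept is not specified here.
    image[:, row, col] = np.stack([depth, x, y, z, intensity])
    return image
```

Stacking the five quantities on the first axis corresponds to combining the five range images along the channel axis.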

The apparatus of claim 4, further comprising
a learning unit that trains the background filling network, which is implemented as an artificial neural network,
wherein the learning unit
applies, as input data to the background filling network, a training data set in which the dynamic object region is set in advance and filled with background, calculates a total loss as the sum of a mask loss, caused by the difference between the binary mask output from the object detection network and the preset dynamic object region, and a filling loss, caused by the difference between the filling feature map output from the filling network and the background, and performs training by backpropagating the calculated total loss.
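
The claim fixes only that the total loss is the sum of a mask loss and a filling loss. The sketch below assumes binary cross-entropy for the mask and mean absolute error for the filling result; both choices and all function names are illustrative, not taken from the patent.

```python
import numpy as np


def mask_loss(pred_prob, target_mask, eps=1e-7):
    """Binary cross-entropy between predicted mask probabilities and the preset mask (assumed form)."""
    p = np.clip(pred_prob, eps, 1.0 - eps)
    return float(np.mean(-(target_mask * np.log(p) + (1.0 - target_mask) * np.log(1.0 - p))))


def fill_loss(filled, background):
    """Mean absolute error between the filled output and the background target (assumed form)."""
    return float(np.mean(np.abs(filled - background)))


def total_loss(pred_prob, target_mask, filled, background):
    """Total loss as the plain sum of the two terms, as stated in the claim."""
    return mask_loss(pred_prob, target_mask) + fill_loss(filled, background)
```

In an actual training loop the same sum would be computed in an automatic differentiation framework so that it can be backpropagated; NumPy is used here only to show the arithmetic.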

The apparatus of claim 4, further comprising
a learning unit that trains the background filling network, which is implemented as an artificial neural network,
wherein the learning unit
further comprises a discriminator trained to determine whether the filling feature map output from the background filling network is a real image or a fake image, and trains the background filling network, as a generator, together with the discriminator in the manner of a generative adversarial network.

A LiDAR depth image filling method comprising:
obtaining input data by converting a 3D point cloud image acquired using a LiDAR into a 2D image in a predetermined manner when the 3D point cloud image is applied;
obtaining a plurality of feature maps by repeatedly encoding the input data in a learned manner using a pre-trained artificial neural network to extract a feature map at each repetition;
obtaining a binary mask representing a dynamic object region by decoding, in a learned manner using a pre-trained artificial neural network, an encoding feature map that is the finally encoded feature map among the plurality of feature maps;
obtaining, using a pre-trained artificial neural network, an interest map by replacing, according to a probabilistic structural similarity, the region corresponding to the binary mask in a corresponding feature map among the plurality of feature maps with the region of a previously obtained filling feature map, and obtaining a filling feature map by decoding the obtained interest map in a pre-learned manner; and
receiving the finally obtained filling feature map as a filling image and converting it into a three-dimensional image in a predetermined manner,
wherein the obtaining of the filling feature map comprises:
obtaining the filling feature map by decoding the encoding feature map or a previously obtained interest map in a learned manner;
obtaining a fusion feature map by replacing the region corresponding to the binary mask in a corresponding feature map among the plurality of feature maps with the region of a previously obtained filling feature map; and
obtaining the interest map by calculating the probabilistic structural similarity between the corresponding feature map and the obtained fusion feature map and weighting the fusion feature map in a predetermined manner,
wherein the interest map (Cⁿ) is obtained according to the equation

(where λ_atn is the interest weight, F_fⁿ is the fusion feature map, and Sⁿ is the probabilistic structural similarity).

The method of claim 11, wherein the obtaining of the plurality of feature maps comprises
outputting the plurality of feature maps using an encoding network, which is an artificial neural network including a plurality of convolutional layers, each of which receives the input data or a feature map output from a preceding convolutional layer and encodes it.

The method of claim 12, wherein the obtaining of the binary mask comprises:
decoding the encoding feature map in a learned manner to obtain the binary mask representing the region of the input data in which a dynamic object designated by learning is located; and
obtaining a plurality of scaled binary masks by scaling the binary mask to a size corresponding to each of the plurality of feature maps.

delete

The method of claim 11, wherein the obtaining of the interest map comprises:
determining the structural similarity between the corresponding feature map and the fusion feature map according to the structural similarity index measure (SSIM), and calculating the probabilistic structural similarity by additionally reflecting a probability distribution in the structural similarity; and
obtaining the interest map by weighting the fusion feature map with the calculated probabilistic structural similarity.

The method of claim 15, wherein the obtaining of the fusion feature map comprises
obtaining the fusion feature map (F_fⁿ) according to the equation

(where F_eⁿ is the feature map,

is the scaled binary mask, and F_iⁿ is the filling feature map).

The method of claim 16, wherein the calculating of the probabilistic structural similarity comprises:
calculating the structural similarity (S) according to the equation

(where k is the feature map (F_eⁿ), q is the fusion feature map (F_fⁿ), σ is the variance, and c₃ is a predetermined constant); and
calculating the probabilistic structural similarity according to the equation

.

delete

The method of claim 11, further comprising
a learning step of training the artificial neural network,
wherein the learning step comprises:
calculating a mask loss caused by the difference between the binary mask, output in the obtaining of the binary mask when a training data set in which the dynamic object region is set in advance and filled with background is applied as input data to the artificial neural network, and the preset dynamic object region;
calculating a filling loss caused by the difference between the filling feature map output in the obtaining of the filling feature map and the background; and
calculating a total loss as the sum of the mask loss and the filling loss, and backpropagating the calculated total loss.

The method of claim 11, further comprising
a learning step of training the artificial neural network,
wherein the learning step comprises:
determining whether the filling feature map is a real image or a fake image;
calculating each of a generation loss and an adversarial loss in the manner of a generative adversarial network based on the result of the determination on the filling feature map; and
backpropagating the sum of the calculated generation loss and adversarial loss as a total loss.
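
The claim names a generation loss and an adversarial loss without defining them. The sketch below assumes a conventional setup: mean absolute error as the generation loss, and binary cross-entropy on the discriminator's output as the adversarial loss. Every function name and loss form here is an assumption made for illustration.

```python
import numpy as np


def bce(prob, label, eps=1e-7):
    """Binary cross-entropy of discriminator probabilities against a constant label."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))))


def discriminator_loss(d_real, d_fake):
    """Discriminator objective: real samples scored as 1, filled samples as 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)


def generator_total_loss(filled, target, d_fake):
    """Sum of the generation loss (assumed L1) and the adversarial loss, as in the claim."""
    generation = float(np.mean(np.abs(filled - target)))
    adversarial = bce(d_fake, 1.0)  # the generator wants its output scored as real
    return generation + adversarial
```

Here `d_real` and `d_fake` stand for the discriminator's probabilities on real images and on filled outputs; in practice both networks would be updated alternately in a framework that supports backpropagation.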