
KR20190131207A - Robust camera and lidar sensor fusion method and system

Info

Publication number: KR20190131207A
Application number: KR1020180055784A
Authority: KR
Inventors: 최준원; 김재겸
Original assignee: 한양대학교 산학협력단
Priority date: 2018-05-16
Filing date: 2018-05-16
Publication date: 2019-11-26
Also published as: KR102108953B1

Abstract

Disclosed are a deep learning-based camera and LiDAR sensor fusion recognition method and system that are robust to sensor quality degradation. According to one embodiment, a recognition method performed by a recognition system may comprise the steps of: inputting different data into respective deep neural networks to obtain respective feature maps; fusing the obtained feature maps through a fusion network; and detecting an object based on the new feature map fused through the fusion network.

Description

Deep learning-based camera and LiDAR sensor fusion recognition method and system robust against deterioration of sensor quality {ROBUST CAMERA AND LIDAR SENSOR FUSION METHOD AND SYSTEM}

The following description relates to a deep learning technique for recognizing objects by fusing different types of data.

Object recognition from image information has long been an active research area in computer vision, with many different approaches. A typical approach extracts features from an image that help to recognize the objects in it and then determines the position and type of each object with a classifier. More recently, advances in deep learning based on deep neural network structures have led to methods that learn features from large amounts of image data and detect objects with very high accuracy. As an example of the conventional approach, features such as the line structure or shape in an image are identified and compared with known templates to recognize or detect an object; representative feature extraction techniques include the SIFT (scale-invariant feature transform) and HOG (histogram of oriented gradients) descriptors. SIFT selects easily identifiable feature points in an image, such as corner points, and then extracts an oriented feature vector from a local patch centered on each feature point. The feature expresses the direction of the surrounding brightness change and how abrupt that change is. HOG divides the target region into cells of a fixed size, computes for each cell a histogram of the orientations of the edge pixels, and concatenates the histogram bin values into a single vector. Because HOG uses edge orientation information, it can be viewed as a kind of edge-based template matching. Because it also uses the silhouette of an object, it is well suited to identifying objects such as people and cars, whose internal patterns are not complicated but whose contours are distinctive. These techniques use only information about edges or shapes that is known in advance, so recognition performance drops sharply when illumination changes, shape distortion, noise, or occlusion cause the feature values to deviate from the expected distribution.
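The HOG description above (cells, per-cell orientation histograms, concatenation) can be sketched in a few lines. This is a simplified illustration, not the full descriptor: it assumes central-difference gradients, unsigned orientations in 9 bins, and magnitude-weighted votes, and it omits the bin interpolation and block normalization used in practical HOG implementations. The function names `cell_hog` and `hog_descriptor` are hypothetical.

```python
import math


def cell_hog(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell.

    `cell` is a 2D list of grayscale values. Border pixels are skipped so that
    central differences are always defined (a simplification).
    """
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue  # flat region: no edge to vote for
            # Unsigned orientation: fold the angle into [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += mag
    return hist


def hog_descriptor(image, cell_size=8, n_bins=9):
    """Concatenate the per-cell histograms of all complete cells into one vector."""
    desc = []
    for top in range(0, len(image) - cell_size + 1, cell_size):
        for left in range(0, len(image[0]) - cell_size + 1, cell_size):
            cell = [row[left:left + cell_size] for row in image[top:top + cell_size]]
            desc.extend(cell_hog(cell, n_bins))
    return desc
```

A cell containing a single vertical step edge puts all of its votes in the first bin (horizontal gradient), which is why the descriptor behaves like an edge-based template: the vector changes whenever lighting or occlusion moves or removes the edges.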

Deep learning, which has been studied extensively in recent years, learns the representation of the data, that is, the feature extraction itself, directly from a vast amount of data, and therefore has the advantage of maintaining good performance across varied environments and changes. In particular, a convolutional neural network (CNN) hierarchically extracts local features of a two-dimensional image and thereby effectively obtains the high-level semantic information needed for image recognition. With recent advances in parallel computing and the availability of infrastructure for large-scale data, deep learning is now used as a highly effective recognition technology.

Meanwhile, sensor fusion refers to techniques that combine and use data acquired from various sensors. Because the performance and reliability of recognition can be improved by using several kinds of sensors with different characteristics, sensor fusion is an important technology in autonomous driving and image recognition. Several techniques are available for fusing sensor data with different distributions; in general they are classified as early-stage, intermediate-stage, and late-stage fusion. Early-stage fusion combines the data from the sensors in advance and processes them together, whereas late-stage fusion processes each data stream completely, obtains a final recognition result for each, and then combines the results. Intermediate-stage fusion lies between these two. Early-stage fusion has difficulty achieving good fusion performance because the data are heterogeneous, and late-stage fusion cannot make good use of the high-level semantic correlation between the data. With the recent emergence of deep learning-based sensor fusion, intermediate-stage fusion, which passes each sensor signal through a separate CNN, combines the outputs midway, and processes the combined features with a further CNN, has shown good performance. Because these various sensors yield measurement data with different distributions for the same environment, the question of how to fuse such different forms of sensor data to optimize performance is becoming important. Fundamentally, rather than processing each sensor's data separately and merging the results, a sensor fusion technique is needed that combines the data effectively by reflecting the performance characteristics of each sensor.
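The difference between the three fusion stages is mainly a difference in where the data streams meet. The toy sketch below only illustrates that data flow; `extract` and `head` are hypothetical stand-ins for a feature extractor and a detection head, and the scaling factors are arbitrary values chosen for illustration.

```python
def extract(values, weight):
    """Stand-in for a per-sensor feature extractor (here: a simple scaling)."""
    return [weight * v for v in values]


def head(features):
    """Stand-in for a detection head that maps features to one score."""
    return sum(features) / len(features)


def early_fusion(cam, lidar):
    # Raw inputs are combined first and processed by a single shared network.
    return head(extract(cam + lidar, 1.0))


def intermediate_fusion(cam, lidar):
    # Each sensor has its own extractor; features are combined before one head.
    return head(extract(cam, 0.5) + extract(lidar, 2.0))


def late_fusion(cam, lidar):
    # Each sensor is processed end to end; only the final scores are combined.
    return 0.5 * (head(extract(cam, 0.5)) + head(extract(lidar, 2.0)))
```

In the intermediate variant the shared head sees features from both sensors at once, which is what allows it to exploit correlations that late fusion discards.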

Recently, deep learning has been applied to sensor fusion-based recognition in the field of autonomous driving. In particular, research on improving the reliability of recognition by fusing the camera and LiDAR sensors mounted on a vehicle is actively under way. Existing techniques that apply deep learning to camera and LiDAR fusion include, first, a method that creates a top-view image from the LiDAR points and fuses it with the camera image inside the network using a fully connected layer, and second, a method that converts the top-view image created from the LiDAR points into a front-view image and fuses it.

However, object recognition through sensor fusion can suffer from the following problem. In techniques that extract a feature map from each sensor's data with a CNN and then fuse the camera and LiDAR at the network level, the network coefficients are fixed once training on the various data is complete. Such a network performs well when the data from both sensors are intact, but when the quality of one sensor's data degrades, the network cannot handle the situation well and performance drops. This can be a cause of critical failures in autonomous driving environments, where sensor data degradation occurs frequently.

References: KR10-2018-0003535 (2018.01.09.), KR10-1714233 (2017.03.02.), KR10-2016-0096460 (2016.08.16.)

Provided are a method and system for detecting an object based on a new feature map formed by fusing, through a fusion network, the respective feature maps obtained by inputting different data into deep neural networks.

In addition, a deep learning-based fusion recognition technique that is robust to sensor quality degradation can be provided. In other words, a sensor fusion method and system can be provided that combine data effectively by reflecting the performance characteristics of different types of sensors. Specifically, a method and system can be provided that improve object recognition performance through a deep neural network that fuses camera image data with LiDAR point data, and that maintain a high level of object recognition performance even when part of the camera or LiDAR sensor data is degraded.

A recognition method performed by a recognition system may include: inputting different data into respective deep neural networks to obtain respective feature maps; fusing the obtained feature maps through a fusion network; and detecting an object based on the new feature map fused through the fusion network.

The step of inputting the different data into the respective deep neural networks to obtain the respective feature maps may include, when the different data are LiDAR data and camera data, passing a two-dimensional three-channel image, obtained by performing a preprocessing process on the LiDAR data, and the camera image from the camera through respective CNNs.

The step of inputting the different data into the respective deep neural networks to obtain the respective feature maps may include performing a preprocessing process that maps the 3D point information acquired from the LiDAR onto the 2D image of the camera, in which the position information of each 3D point is multiplied by a matrix that transforms LiDAR coordinates into camera coordinates to generate coordinates in the 2D image.
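The mapping from a 3D LiDAR point to 2D image coordinates can be sketched as a single homogeneous matrix multiplication followed by a perspective division. The sketch assumes a 3x4 matrix `P` that already combines the LiDAR-to-camera transform with the camera intrinsics; the text only states that a coordinate transformation matrix is applied, so this particular factorization, the name `project_lidar_to_image`, and the rule for discarding points behind the camera are assumptions.

```python
def project_lidar_to_image(points, P):
    """Project 3D LiDAR points (x, y, z) to pixel coordinates (u, v).

    `P` is a 3x4 matrix (list of three rows of four numbers), assumed to
    combine the LiDAR-to-camera transform with the camera intrinsics.
    """
    pixels = []
    for x, y, z in points:
        hom = (x, y, z, 1.0)  # homogeneous coordinates
        u_h, v_h, w = (sum(row[i] * hom[i] for i in range(4)) for row in P)
        if w <= 0.0:
            continue  # point is behind the image plane, so it is not visible
        pixels.append((u_h / w, v_h / w))  # perspective division
    return pixels
```

For example, with a focal length of 100 pixels and a principal point at (50, 40), the point (1, 2, 10) lands at pixel (60, 60), while a point with negative depth is dropped.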

The step of fusing the obtained feature maps through the fusion network may include determining the quality of each feature map, assigning a weight to at least one of the feature maps accordingly, and then fusing the feature maps through the fusion network.

The step of fusing the obtained feature maps through the fusion network may include, when the obtained feature maps are a camera feature map and a LiDAR feature map, passing the camera feature map and the LiDAR feature map through the fusion network so that the feature maps are fused in parallel to generate a new feature map.

The step of fusing the obtained feature maps through the fusion network may include performing a first fusion of the camera feature map and the LiDAR feature map in parallel, and passing the first fused feature map through deep neural networks having a plurality of 3x3 kernels and through a plurality of sigmoid functions to determine the robustness of the LiDAR or the camera.

The step of fusing the obtained feature maps through the fusion network may include passing the first fused feature map through the deep neural networks having the plurality of 3x3 kernels and through the sigmoid functions, thereby outputting, for each pixel, a value between 0 and 1 as the reliability of the camera data and of the LiDAR data.

The step of fusing the obtained feature maps through the fusion network may include performing a second fusion, again in parallel, of the camera feature map multiplied by the camera reliability and the LiDAR feature map multiplied by the LiDAR reliability, and performing a third fusion by passing the second fused feature map through a deep neural network with 1x1 kernels, thereby deriving a new feature map from the camera feature map and the LiDAR feature map.

The step of detecting an object based on the new feature map fused through the fusion network may include generating data in which the quality of part of the different data is deliberately degraded, and controlling training so that the degraded data is learned.
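One way to generate such degraded training data is to corrupt one of the two inputs at random before each training step. The sketch below is only an illustration under stated assumptions: the four degradation modes mirror the kinds of degradation named elsewhere in this description (failure, noise, occlusion, brightness change), but the concrete parameters, the probability `p_degrade`, and the names `degrade` and `make_training_pair` are hypothetical and not taken from the source.

```python
import random

MODES = ("blank", "noise", "occlusion", "brightness")


def degrade(image, mode, rng):
    """Return a degraded copy of a 2D image (list of rows, values in 0..1)."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]  # never modify the caller's data
    if mode == "blank":  # simulated sensor failure
        out = [[0.0] * w for _ in range(h)]
    elif mode == "noise":  # additive Gaussian noise, clipped to 0..1
        out = [[min(1.0, max(0.0, v + rng.gauss(0.0, 0.2))) for v in row]
               for row in out]
    elif mode == "occlusion":  # zero out a randomly placed rectangle
        bh, bw = max(1, h // 2), max(1, w // 2)
        top, left = rng.randrange(h - bh + 1), rng.randrange(w - bw + 1)
        for y in range(top, top + bh):
            for x in range(left, left + bw):
                out[y][x] = 0.0
    elif mode == "brightness":  # global gain change, clipped to 1.0
        gain = rng.uniform(0.2, 1.8)
        out = [[min(1.0, v * gain) for v in row] for row in out]
    else:
        raise ValueError(f"unknown degradation mode: {mode}")
    return out


def make_training_pair(camera, lidar, rng, p_degrade=0.5):
    """With probability p_degrade, degrade exactly one of the two inputs."""
    if rng.random() < p_degrade:
        mode = rng.choice(MODES)
        if rng.random() < 0.5:
            camera = degrade(camera, mode, rng)
        else:
            lidar = degrade(lidar, mode, rng)
    return camera, lidar
```

Degrading only one modality per sample keeps the other one informative, which is the situation the fusion network is expected to handle.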

The step of detecting an object based on the new feature map fused through the fusion network may include training the deep neural networks and the sensor fusion network simultaneously, and processing the new feature map to determine the location and type of the object at the same time.

A recognition system may include: a feature map acquisition unit that inputs different data into respective deep neural networks to obtain a first feature map and a second feature map; a fusion unit that fuses the obtained first and second feature maps through a fusion network; and a detection unit that detects an object based on the new feature map fused through the fusion network.

When first data and second data are input as the different data, the feature map acquisition unit may pass a two-dimensional three-channel image, obtained by performing a preprocessing process on the first data, and an image of the second data through respective CNNs.

The fusion unit may determine the quality of the first feature map and the second feature map, assign a weight to at least one of the feature maps accordingly, and then fuse the feature maps through the fusion network.

The fusion unit may perform a first fusion of the first feature map and the second feature map in parallel, and pass the first fused feature map through deep neural networks having a plurality of 3x3 kernels and through sigmoid functions, thereby outputting, for each pixel, a value between 0 and 1 as the reliability of the camera data and of the LiDAR data.

The fusion unit may perform a second fusion, again in parallel, of the camera feature map multiplied by the camera reliability and the LiDAR feature map multiplied by the LiDAR reliability, and perform a third fusion by passing the second fused feature map through a deep neural network with 1x1 kernels, thereby deriving a new feature map from the camera feature map and the LiDAR feature map.

The recognition system according to an embodiment recognizes an object based on a new feature map produced by fusing, through a fusion network, the feature maps obtained by inputting different data into respective deep neural networks, and can thereby improve object recognition and detection performance.

In addition, because the fusion network assigns weights to the different feature maps before combining them, the contribution of a degraded feature map can be adjusted, which enables optimal sensor fusion. In other words, this solves the problem that arises when a conventional network trained without the fusion network (GFU) is applied: if sensor quality degradation such as a brightness change, occlusion, noise, or failure occurs in the data of either sensor, the combined feature map is affected and overall recognition performance drops.

In addition, by applying deep learning to the information contained in the LiDAR and camera data, the excellent classification and generalization performance of deep learning can be exploited as is.

FIG. 1 is a diagram illustrating the overall operation of a recognition system according to an embodiment.
FIG. 2 is a block diagram illustrating the configuration of a recognition system according to an embodiment.
FIG. 3 is a flowchart illustrating an object recognition method of a recognition system according to an embodiment.
FIG. 4 is a diagram illustrating the structure of a deep learning-based object recognition algorithm in a recognition system according to an embodiment.
FIG. 5 is a diagram illustrating the structure of a GFU for fusing different data in a recognition system according to an embodiment.
FIG. 6 shows an example in which the quality of a camera image is degraded in order to evaluate the performance of a recognition system according to an embodiment.
FIG. 7 is a table showing the performance of a recognition system according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the overall operation of a recognition system according to an embodiment.

A deep learning-based object recognition algorithm (100) is proposed that fuses a camera and a LiDAR sensor to recognize objects and understand the environment, for autonomous driving or for Internet-of-Things environments such as smart homes. The deep learning-based object recognition algorithm (100) may be run by a recognition system. Here, objects (for example, physical objects or things) may be recognized by applying the deep learning-based object recognition algorithm to data having different characteristics or structures. The following embodiments describe, as an example, a method of recognizing an object using camera data and data from a LiDAR sensor (hereinafter, 'LiDAR').

The following embodiments describe a sensor fusion deep learning technique that simultaneously uses, as inputs to deep neural networks, two-dimensional point image information obtained from the LiDAR and image information obtained through the camera. In addition, even when the data of one of the two sensors, LiDAR or camera, is degraded, good performance is maintained based on the data of the other sensor.

As the recognition system acquires data from the camera (110) and the LiDAR (120), it may perform data preprocessing and recognize objects based on the deep learning-based object recognition algorithm (100). The camera data acquired from the camera (110) is passed to the deep learning-based object recognition algorithm (100) to recognize objects. Here, the camera data is a two-dimensional image and may consist of an RGB image (111). The 3D point data (3D point cloud) (121) acquired from the LiDAR (120) is converted into a 2D image (front-view image) (122) and likewise passed to the deep learning-based object recognition algorithm (100). Specifically, the data preprocessing may be performed on the camera data, the LiDAR data, or both. The preprocessing of the data acquired from the LiDAR is described here. For example, the 3D spatial information acquired from the LiDAR may be converted into a 2D image containing at least one of depth, height, and reflectance information and rendered as a three-channel image. The preprocessed two-dimensional three-channel image and the RGB image (111) are then used as the inputs of two separate convolutional neural networks (CNNs). The feature maps obtained from the two CNNs are combined, and the combined feature map is used to extract the final information for object detection. When the feature maps from the two CNNs are combined, a sensor fusion technique is applied that assigns appropriate weight values to the feature maps obtained from the LiDAR and camera sensor signals, so that robust performance is obtained even when the quality of one sensor signal degrades under adverse conditions. These weight values are computed automatically from the feature maps, and in the training stage the fusion network that computes the weights is trained jointly with the two CNNs.
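The conversion of LiDAR points into a two-dimensional three-channel image can be sketched as a rasterization step that writes depth, height, and reflectance into the pixel each point projects to. The sketch assumes that the pixel coordinates of each point are already known, that empty pixels are filled with zeros, and that the nearest point wins when several points fall on one pixel; none of these details are specified in the text, and the name `rasterize_front_view` is hypothetical. Channel normalization is omitted.

```python
def rasterize_front_view(samples, height, width):
    """Build a sparse H x W x 3 image with (depth, height, reflectance) channels.

    `samples` is an iterable of (u, v, depth, z_height, reflectance) tuples,
    where (u, v) are the pixel coordinates of an already projected point.
    """
    image = [[[0.0, 0.0, 0.0] for _ in range(width)] for _ in range(height)]
    nearest = {}  # (row, col) -> smallest depth written so far
    for u, v, depth, z_height, reflectance in samples:
        col, row = int(round(u)), int(round(v))
        if not (0 <= row < height and 0 <= col < width):
            continue  # point falls outside the camera image
        if depth >= nearest.get((row, col), float("inf")):
            continue  # a closer point already occupies this pixel
        nearest[(row, col)] = depth
        image[row][col] = [depth, z_height, reflectance]
    return image
```

The result has the same layout as an RGB image, so it can be fed to a CNN with the same input shape as the camera branch.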

By training on the camera data and the LiDAR data with the deep learning-based object recognition algorithm (100), the recognition system can produce a recognition result (130), that is, detect objects. Accordingly, compared with existing LiDAR and camera sensor fusion techniques, good recognition performance can be obtained not only when all data are intact but also when the data of a particular sensor is degraded.

FIG. 2 is a block diagram illustrating the configuration of a recognition system according to an embodiment, and FIG. 3 is a flowchart illustrating an object recognition method of the recognition system according to an embodiment.

The recognition system (200) may include a feature map acquisition unit (210), a fusion unit (220), and a detection unit (230). These components may be representations of different functions performed by a processor according to control instructions provided by program code stored in the recognition system (200). The components may control the recognition system (200) to perform steps 310 to 330 of the object recognition method of FIG. 3. The components may be implemented to execute instructions according to the code of an operating system and the code of at least one program contained in a memory.

The processor of the recognition system (200) may load program code stored in a program file for the object recognition method into memory. For example, when the program is executed in the recognition system (200), the processor may control the recognition system to load the program code from the program file into memory under the control of the operating system. Here, the processor and the feature map acquisition unit (210), fusion unit (220), and detection unit (230) included in the processor may each be a different functional representation of the processor that executes the instructions of the corresponding part of the program code loaded in memory in order to carry out steps 310 to 330.

In step 310, the feature map acquisition unit (210) may input different data into respective deep neural networks to obtain respective feature maps. When data from a first sensor and a second sensor are input as the different data, the feature map acquisition unit (210) may pass a two-dimensional three-channel image, obtained by performing a preprocessing process on the first sensor's data, and an image from the second sensor through respective CNNs. For example, when the different data are LiDAR data and camera data, the feature map acquisition unit (210) may pass a two-dimensional three-channel image, obtained by performing a preprocessing process on the LiDAR data, and the camera image from the camera through respective CNNs. Here, the feature map acquisition unit (210) may perform a preprocessing process that maps the 3D point information acquired from the LiDAR onto the 2D image of the camera, in which the position information of each 3D point is multiplied by a matrix that transforms LiDAR coordinates into camera coordinates to generate coordinates in the 2D image.

In operation 320, the fusion unit 220 may fuse the obtained feature maps through a fusion network. The fusion unit 220 may assess the quality of each feature map, weight one or more of the feature maps accordingly, and then fuse them through the fusion network. When the obtained feature maps are a camera feature map and a lidar feature map, the fusion unit 220 may pass both through the fusion network and combine them in parallel (concatenate them) to generate a new feature map. Specifically, the fusion unit 220 may first concatenate the camera feature map and the lidar feature map (first fusion) and pass the result through deep neural network layers with 3x3 kernels, each followed by a sigmoid function, to judge how reliable the lidar or the camera is. The output is a reliability value between 0 and 1 for each pixel, for the camera data and for the lidar data. The fusion unit 220 may then multiply the camera feature map by the camera reliability and the lidar feature map by the lidar reliability, concatenate the two products again (second fusion), and pass the result through a deep neural network layer with a 1x1 kernel (third fusion) to derive a new feature map from the camera and lidar feature maps.
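The three fusion stages can be written compactly. The notation is an assumption introduced here for illustration and does not appear in the source: $F_C$ and $F_L$ are the camera and lidar feature maps, $[\,\cdot\,;\,\cdot\,]$ is channel-wise concatenation, $\sigma$ is the sigmoid function, and $\odot$ is element-wise multiplication with the per-pixel weight broadcast across channels. A separate 3x3 convolution per sensor is also an assumption.

```latex
\begin{align*}
F_1 &= [\,F_C \,;\, F_L\,] && \text{(first fusion)} \\
w_C &= \sigma\big(\mathrm{Conv}^{C}_{3\times 3}(F_1)\big), \quad
w_L = \sigma\big(\mathrm{Conv}^{L}_{3\times 3}(F_1)\big) && \text{(per-pixel reliability between 0 and 1)} \\
F_2 &= [\, w_C \odot F_C \,;\, w_L \odot F_L \,] && \text{(second fusion)} \\
F_{\mathrm{new}} &= \mathrm{Conv}_{1\times 1}(F_2) && \text{(third fusion)}
\end{align*}
```

Written this way, a degraded sensor is handled by driving its weight toward 0, so that its feature map contributes little to $F_{\mathrm{new}}$.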

In operation 330, the detector 230 may detect an object based on the new feature map produced by the fusion network. Because the deep neural networks and the sensor fusion network are trained jointly, the detector 230 may process the new feature map to determine the location and the class of an object at the same time. The detector 230 may also generate data in which the quality of part of the input data is deliberately degraded, and control training so that the networks learn from the degraded data.

A deep learning-based object recognition algorithm has the advantage that it processes the information in the lidar and camera data with deep learning techniques and therefore benefits directly from the strong classification and generalization performance of deep learning.

FIG. 4 is a diagram illustrating the structure of the deep learning-based object recognition algorithm of the recognition system according to an embodiment.

FIG. 4 shows the structure 400 of the deep learning-based object recognition algorithm. The camera and lidar sensor data are first passed through separate deep neural networks (for example, CNNs), the quality of each is assessed, and the two are then fused to recognize objects. In other words, the camera's RGB image (hereinafter, the camera image) 111 and the lidar image 122 are input to their respective deep neural networks. To enable sensor fusion, the lidar data are preprocessed first: the 3D point information acquired by the lidar is mapped onto the camera's 2D image. The coordinates p = (x, y, z) of each 3D point are multiplied by a matrix that converts lidar coordinates to camera coordinates to obtain the point's 2D image coordinates. The image coordinates correspond to a pixel position in the image, and the pixel value is obtained from the point's depth, height, and reflectance. For example, a very near point is converted to a brightness close to 255 and a very distant point to a brightness close to 0. Because the points cannot fill the entire 2D space, they appear sparsely in the image; pixels with no point are filled with 0. When the camera and lidar coordinates are already registered, the lidar data can be converted in this way to obtain an additional two-dimensional, three-channel image.

The three-channel lidar image 122 produced by this preprocessing and the camera image 111 are then passed through separate deep neural networks (for example, CNNs). As shown in FIG. 4, at the input side the lidar three-channel image and the camera image each pass through an independent CNN, and the feature maps obtained at network nodes after a certain point are fused by the fusion network (GFU) 410 into a new feature map. Processing this feature map determines the location and class of objects simultaneously (420). The object may be detected with, for example, the Single Shot Detector (SSD) method, or with various other methods.
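The construction of the sparse three-channel lidar image can be sketched as below, taking pixel coordinates and depths that are assumed to have been computed already. The function name, the 80 m depth range, the height range of -3 m to 3 m, the reflectance range of 0 to 1, and the rule that the nearest point wins when several points fall on one pixel are illustrative assumptions; the source states only that pixel values come from depth, height, and reflectance, that near points are bright and far points dark, and that empty pixels are 0.

```python
import numpy as np

def rasterize_lidar(uv, depth, height, reflectance, image_hw, max_depth=80.0):
    """Build a sparse H x W x 3 uint8 image from projected lidar points.

    Channel 0: depth (near -> 255, far -> 0), channel 1: height,
    channel 2: reflectance. All value ranges are assumptions.
    """
    h, w = image_hw
    image = np.zeros((h, w, 3), dtype=np.uint8)   # pixels without a point stay 0
    cols = np.round(uv[:, 0]).astype(int)
    rows = np.round(uv[:, 1]).astype(int)
    inside = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)

    depth_val = 255.0 * (1.0 - np.clip(depth / max_depth, 0.0, 1.0))
    height_val = 255.0 * np.clip((height + 3.0) / 6.0, 0.0, 1.0)
    refl_val = 255.0 * np.clip(reflectance, 0.0, 1.0)
    values = np.stack([depth_val, height_val, refl_val], axis=1).astype(np.uint8)

    # Visit points from far to near so that the nearest point is written last.
    for i in np.argsort(-depth):
        if inside[i]:
            image[rows[i], cols[i]] = values[i]
    return image
```

The explicit loop keeps the overwrite order deterministic; a vectorized assignment with repeated indices would not guarantee which point ends up in a shared pixel.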

FIG. 5 illustrates the structure of the GFU that fuses the camera feature map 510 and the lidar feature map 520. The quality of each feature map is assessed, one or more of the feature maps are weighted accordingly, and the feature maps are fused by passing them through the fusion network (GFU 410). Specifically, the camera feature map 510 and the lidar feature map 520 are first combined in parallel (concatenated). The concatenated feature map (the first fused feature map) is then passed through CNN layers with 3x3 kernels, each followed by a sigmoid function. This step judges which of the camera and lidar data is more reliable: the output of the 3x3 CNN and the sigmoid function is a per-pixel reliability value between 0 and 1. These reliabilities are multiplied with the camera and lidar feature maps, and the weighted feature maps are concatenated again (second fusion). The result then passes through a CNN layer with a 1x1 kernel (third fusion), which yields the new feature map that fuses the camera and lidar feature maps. Finally, the GFU coefficients, the CNN for each sensor, and the coefficients of the network that extracts detection results from the fused feature map are trained simultaneously. Objects are then detected based on the new feature map.
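The forward pass of the GFU described above can be sketched in NumPy. This is a hedged illustration rather than the network of the embodiment: the source does not give channel counts, so one single-channel gate per sensor is assumed, and the helper `conv2d_same`, the `params` dictionary, and its key names are hypothetical. A real system would use a deep learning framework and learned weights.

```python
import numpy as np

def conv2d_same(x, kernels, bias):
    """Stride-1, zero-padded convolution as used in CNNs.

    x: C_in x H x W, kernels: C_out x C_in x k x k, bias: C_out.
    """
    c_out, c_in, k, _ = kernels.shape
    pad = k // 2
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c_out, h, w))
    for dy in range(k):
        for dx in range(k):
            window = padded[:, dy:dy + h, dx:dx + w]   # C_in x H x W
            out += np.einsum("oc,chw->ohw", kernels[:, :, dy, dx], window)
    return out + bias[:, None, None]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(f_cam, f_lidar, params):
    """Fuse two C x H x W feature maps with per-pixel reliability gates."""
    first = np.concatenate([f_cam, f_lidar], axis=0)   # first fusion
    # 3x3 convolution + sigmoid: one reliability map in (0, 1) per sensor.
    w_cam = sigmoid(conv2d_same(first, params["k_cam"], params["b_cam"]))
    w_lidar = sigmoid(conv2d_same(first, params["k_lidar"], params["b_lidar"]))
    # Weight each feature map and concatenate again (second fusion).
    second = np.concatenate([w_cam * f_cam, w_lidar * f_lidar], axis=0)
    # 1x1 convolution mixes the weighted channels (third fusion).
    fused = conv2d_same(second, params["k_out"], params["b_out"])
    return fused, w_cam, w_lidar
```

Here `k_cam` and `k_lidar` would have shape 1 x 2C x 3 x 3 and `k_out` shape C_out x 2C x 1 x 1. Returning the two gate maps alongside the fused result makes it easy to inspect which sensor the network trusts at each pixel.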

Learning to assign suitable weights when the camera or lidar sensor quality degrades requires a data augmentation technique that applies a quality degradation to one of the two sensors. FIG. 6 shows examples in which the camera image was degraded in order to evaluate the deep learning-based object recognition algorithm proposed in the embodiment: FIG. 6(a) is image data with altered brightness, FIG. 6(b) is partially occluded image data, FIG. 6(c) is image data with added noise, and FIG. 6(d) is the original image data. Additional training data for robustness are generated by manipulations such as partially occluding the image, changing its brightness, or removing the data of one of the two sensors, and the network is trained on them. Training on data whose quality has been partly degraded lets the network adapt to the image quality, adjusting the weights it assigns.
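The degradations named above (brightness change, partial occlusion, noise, and removal of one sensor's data) can be sketched as a small augmentation function. The function name, the mode strings, the gain range, the patch size, and the noise level are assumptions chosen for illustration; the source does not specify any of these parameters.

```python
import numpy as np

def degrade(image, mode, rng):
    """Return a degraded copy of an H x W x 3 uint8 image.

    `rng` is a numpy.random.Generator. All magnitudes are illustrative.
    """
    out = image.astype(np.float64)             # astype copies, the input is untouched
    h, w = out.shape[:2]
    if mode == "brightness":
        out *= rng.uniform(0.3, 1.7)           # assumed gain range
    elif mode == "occlusion":
        bh, bw = h // 2, w // 2                # assumed patch size
        top = rng.integers(0, h - bh + 1)
        left = rng.integers(0, w - bw + 1)
        out[top:top + bh, left:left + bw] = 0.0
    elif mode == "noise":
        out += rng.normal(0.0, 25.0, size=out.shape)   # assumed standard deviation
    elif mode == "blank":
        out[:] = 0.0                           # the sensor's data are removed entirely
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.clip(out, 0, 255).astype(np.uint8)
```

During training, one of the two inputs (camera image or lidar image) could be passed through `degrade` with a randomly chosen mode while the other is left intact, so that the gates see examples where exactly one sensor is unreliable.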

When fusing sensors through deep learning, the recognition system according to an embodiment combines the feature maps obtained from the CNNs into a new feature map and applies further CNN layers to the new feature map to perform sensor fusion-based object detection. Because the fusion network (GFU) combines the feature maps with appropriate weights, the contribution of a degraded feature map is adjusted, which leads to optimal sensor fusion. The fusion exploits the respective strengths of the lidar and camera data and remains effective even when the data of one sensor are degraded.

FIG. 7 is a table illustrating the performance of the recognition system according to an embodiment.

The deep learning technique for fusing camera and lidar data can improve object recognition and detection performance. As an example, using the KITTI benchmark dataset, the lidar and camera coordinates were registered, the 3D point data of the lidar sensor were converted to a 2D image, and the lidar 2D image and the camera image were each input to a CNN. The fusion network (GFU) then assessed the quality of the sensor data and fused them to recognize objects. FIG. 7 compares the performance of the base network with and without the GFU. The comparison covers cases in which the camera image is blank, partially occluded, partially brightened, or corrupted by noise, as well as cases in which the lidar image is blank or partially occluded. The network with the GFU performed better in every case, and the gap was particularly large when noise was added.

The recognition system according to an embodiment may be applied to autonomous driving, where lidar and camera sensors are expected to be used together, and also to various artificial intelligence fields that recognize environments or objects.

Representative fields of application include autonomous vehicles, robotics, and vehicle monitoring systems. First, object recognition is the most basic technology required in an autonomous vehicle: it determines where surrounding objects and pedestrians are and whether there are hazards, which enables safe driving and helps prevent accidents. It is also used for driving assistance, to read road information needed while driving, such as traffic lights and signs. Second, in robotics, object recognition supports the robot's vision much as the eye does for a person; it lets the robot identify its surroundings and the objects in them and perform its intended function on that basis. Third, object recognition is applicable to vehicle monitoring systems. A parking management system, for example, can recognize the type and license plate number of a vehicle entering the parking lot and check which spaces are currently free to assist parking.

One way to apply this technology is to acquire data in real time with camera and lidar sensors and run the object recognition algorithm on an embedded system equipped with a graphics processing unit (GPU) or similar hardware. To do so, data covering various environments are collected in advance and used to train the deep neural network. The trained network is stored as optimized network coefficients and deployed on the embedded system, where it runs the object recognition algorithm on test data arriving in real time and outputs the results.

Camera- and lidar-based object recognition algorithms can currently be applied to autonomous driving and mobile robots, and are expected to go beyond recognition to more complex functions such as object tracking and future prediction. Examples include automated surveillance, traffic monitoring, vehicle search, and traffic light control that takes road traffic conditions into account. Object recognition also plays an important role in understanding the surrounding environment and can be applied to IoT-enabled home appliances. Deep learning-based object recognition is thus a representative artificial intelligence technology that serves as a foundation for future technologies.

The apparatus described above may be implemented with hardware components, software components, and/or a combination of hardware and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that the processing device may include multiple processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or command the processing device independently or collectively. The software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to it. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in a different order from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A recognition method performed by a recognition system, the method comprising:
inputting different data into respective deep neural networks to obtain respective feature maps;
fusing the obtained feature maps through a fusion network; and
detecting an object based on a new feature map fused through the fusion network.

The method of claim 1,
wherein obtaining the respective feature maps by inputting the different data into the respective deep neural networks comprises:
when the different data are data from a lidar and a camera, passing a two-dimensional, three-channel image obtained by preprocessing the lidar data and a camera image from the camera through respective CNNs.

The method of claim 2,
wherein obtaining the respective feature maps by inputting the different data into the respective deep neural networks comprises:
performing the preprocessing by mapping 3D point information acquired from the lidar onto a 2D image of the camera, wherein position information of the 3D point information is multiplied by a matrix that converts lidar coordinates to camera coordinates to generate coordinates in the 2D image.

The method of claim 1,
wherein fusing the obtained feature maps through the fusion network comprises:
assessing the quality of each feature map, weighting one or more of the feature maps accordingly, and then fusing the feature maps through the fusion network.

The method of claim 4,
wherein fusing the obtained feature maps through the fusion network comprises:
when the obtained feature maps are a feature map of a camera and a feature map of a lidar, passing the camera feature map and the lidar feature map through the fusion network and fusing them in parallel to generate a new feature map.

The method of claim 5,
wherein fusing the obtained feature maps through the fusion network comprises:
performing a first fusion of the camera feature map and the lidar feature map in parallel, and passing the first fused feature map through a deep neural network having a plurality of 3x3 kernels and a plurality of sigmoid functions to determine the robustness of the lidar or the camera.

The method of claim 6,
wherein fusing the obtained feature maps through the fusion network comprises:
passing the first fused feature map through the deep neural network having the plurality of 3x3 kernels and a sigmoid function to output, for each pixel, a value between 0 and 1 as a reliability of the data of the camera and of the lidar.

The method of claim 7,
wherein fusing the obtained feature maps through the fusion network comprises:
performing a second fusion, in parallel, of the camera feature map multiplied by the reliability of the camera and the lidar feature map multiplied by the reliability of the lidar, and performing a third fusion by passing the second fused feature map through a deep neural network having a 1x1 kernel to derive a new feature map from the camera feature map and the lidar feature map.

The method of claim 1,
wherein detecting the object based on the new feature map fused through the fusion network comprises:
generating data in which the quality of part of the different data is degraded, and controlling training so that the degraded data are learned.

The method of claim 1,
wherein detecting the object based on the new feature map fused through the fusion network comprises:
training the deep neural networks and the sensor fusion network simultaneously, and processing the new feature map to determine the location and the class of the object at the same time.

A recognition system comprising:
a feature map acquisition unit configured to input different data into respective deep neural networks to obtain a first feature map and a second feature map;
a fusion unit configured to fuse the obtained first and second feature maps through a fusion network; and
a detector configured to detect an object based on a new feature map fused through the fusion network.

The system of claim 11,
wherein the feature map acquisition unit,
when first data and second data are input as the different data, passes a two-dimensional, three-channel image obtained by preprocessing the first data and an image of the second data through respective CNNs.

The system of claim 11,
wherein the fusion unit
assesses the quality of the first feature map and the second feature map, weights at least one of the feature maps accordingly, and then fuses the feature maps through the fusion network.

The system of claim 11,
wherein the fusion unit
performs a first fusion, in parallel, of feature maps of a camera and a lidar as the first and second feature maps, and passes the first fused feature map through a deep neural network having a plurality of 3x3 kernels and a sigmoid function to output, for each pixel, a value between 0 and 1 as a reliability of the data of the camera and of the lidar.

The system of claim 14,
wherein the fusion unit
performs a second fusion, in parallel, of the camera feature map multiplied by the reliability of the camera and the lidar feature map multiplied by the reliability of the lidar, and performs a third fusion by passing the second fused feature map through a deep neural network having a 1x1 kernel to derive a new feature map from the camera feature map and the lidar feature map.