KR102108951B1

KR102108951B1 - Deep learning-based object detection method and system utilizing global context feature of image

Info

Publication number: KR102108951B1
Application number: KR1020180055783A
Authority: KR
Inventors: 최준원; 유진혁
Original assignee: 한양대학교 산학협력단
Priority date: 2018-05-16
Filing date: 2018-05-16
Publication date: 2020-05-11
Also published as: KR20190131206A

Abstract

An object detection method according to an embodiment may include: obtaining a feature map by passing image information through a first deep neural network; passing the obtained feature map through a second deep neural network and a third deep neural network, respectively; fusing, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network; and detecting an object from the image information based on the fused feature map.

Description

DEEP LEARNING-BASED OBJECT DETECTION METHOD AND SYSTEM UTILIZING GLOBAL CONTEXT FEATURE OF IMAGE

The following description relates to a technique for detecting an object using global context information of an image.

Object detection refers to a technique for identifying the location and class of a target object in a camera image. Object detection from image information has long been studied in the field of computer vision. With recent advances in deep learning based on deep neural network structures, methods have been proposed that learn high-dimensional features from large volumes of image data and detect objects with very high accuracy. These techniques are actively applied in fields such as autonomous driving, robotics, medicine, and security. As pedestrian safety systems are increasingly becoming mandatory in vehicles, the techniques are also used in driver assistance systems to recognize the position and state of pedestrians and avoid danger. For robots used in factories, image-based object recognition serves as an important technique for letting the robot learn the changes and characteristics of the factory environment on its own, ultimately maximizing its intelligence and processing performance. In medical imaging, deep learning-based object recognition is used to assist expert diagnosis and reduce routine work for physicians. Thus, along with the development of recognition technology in the imaging field, the need for object detection based on an understanding of the environment and situation is growing.

For example, the predominant approach has been to carefully design techniques for extracting the key features of data, obtain feature values from the image, and recognize or detect objects with a statistical classifier. For instance, features such as the structure or shape of lines in an image are extracted and compared with known templates to recognize or detect an object. Representative feature extraction techniques include the scale-invariant feature transform (SIFT) and the histogram of oriented gradients (HOG) descriptor. SIFT selects easily identifiable feature points in the image, such as corner points, and then extracts an oriented feature vector for the local patch centered on each feature point; it uses features that express the direction of the surrounding brightness change and how abrupt that change is. HOG divides the target region into cells of a fixed size, computes for each cell a histogram of the orientations of edge pixels, and uses the vector formed by concatenating the histogram bin values; because it uses edge orientation information, it performs edge-based template matching. In addition, since HOG uses silhouette information of an object, it is suitable for identifying objects such as people and cars that have distinctive contour information without complicated internal patterns. Because these techniques rely only on previously known information about an object's edges or shape, recognition performance drops sharply when illumination changes, shape distortion, noise, or occlusion in the image data cause the feature values to deviate from the expected distribution.
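As an illustration of the HOG idea described above (it is background, not part of the claimed method), the following minimal Python sketch computes a gradient-orientation histogram for a single cell. The function name `cell_histogram`, the 9-bin setting, and the tiny test image are assumptions chosen for illustration; a full HOG descriptor also normalizes over blocks of cells, which is omitted here.

```python
import math

def cell_histogram(cell, bins=9):
    """Orientation histogram of one cell (a list of rows of grayscale values).

    Gradients use central differences on interior pixels; each pixel votes
    its gradient magnitude into an unsigned-orientation bin over [0, 180).
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, len(cell) - 1):
        for x in range(1, len(cell[0]) - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % bins] += magnitude
    return hist

# Hypothetical 4x4 cell with a vertical edge: intensity changes only along x,
# so every vote lands in the first orientation bin.
edge = [[0, 0, 10, 10]] * 4
hist = cell_histogram(edge)
print(hist)
```

Concatenating such histograms over all cells yields the descriptor vector that the text says is matched against edge-based templates.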

To overcome the limitations of such feature-based object detection, deep learning uses a deep neural network, a more general learning model, to learn the representation of the data, that is, the feature values, from a vast amount of data. Because the representation is learned from many kinds of data, good performance can be achieved despite various changes and adverse factors in the image information. One conventional deep learning-based object detection method is R-CNN. R-CNN takes input from a camera sensor, first passes it through a preprocessing network that proposes regions of the image that may contain objects, and then passes the information for each region through an object classification network that takes it as the input of a deep neural network. Fast R-CNN, which addresses the shortcomings of R-CNN, reduces the computation of R-CNN by computing feature values with a CNN and performing object detection using both a region proposal network, which extracts information about object candidate regions at the feature map of an intermediate node, and a convolutional network, which determines the object class. More recently, deep learning-based object detection techniques called Single Shot Detector (SSD) and You Only Look Once (YOLO) have been proposed, which merge these two networks into one so that locating object regions and classifying objects are handled in a single integrated step.

However, conventional deep learning-based object detection extracts a feature map from an image using a convolutional neural network (CNN) and, using the local information in the feature map that corresponds to a specific object, finds the bounding box marking the object's location and the object's class information. Because it determines the region and class of an object by matching the object's features against the feature values obtained from the CNN, it is limited to using only the local information corresponding to that object. When a human detects objects, by contrast, global information such as the surrounding environment or background is a great help. If the background is a road, the object is likely to be traffic-related; if the surroundings are indoors, it is likely to be an object commonly seen indoors. Since conventional deep learning-based object detection does not use such global context information, it fails to make full use of the available surrounding information.

References: KR 10-2016-0099289 (2016.08.22), KR 10-1836742 (2018.03.02), KR 10-2015-0050134 (2015.05.08)

The present disclosure proposes a deep learning-based object detection method and system that uses global information about an object's surrounding environment together with local information about the portion of the image where the object is present.

An object detection method performed by an object detection system may include: obtaining a feature map by passing image information through a first deep neural network; passing the obtained feature map through a second deep neural network and a third deep neural network, respectively; fusing, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network; and detecting an object from the image information based on the fused feature map.

The passing of the obtained feature map through the second deep neural network and the third deep neural network may include passing the obtained feature map through each of the second deep neural network, which obtains a feature map representing local information, and the third deep neural network, which obtains a feature map representing global information.

The second deep neural network may be configured as a six-layer CNN structure to obtain feature values representing local information, with the feature maps of the six layers extracted from the CNN structure to detect an object. The third deep neural network may be configured as a five-layer CNN structure to obtain a feature map representing global information, with the feature map of one of the five layers used to detect an object.

The feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, respectively. The fusing of these feature maps through the fusion network may include: obtaining an object information value including the type and location of an object by passing the feature map of the local information through a convolution filter at each position; obtaining an object information value including the type and location of the object by flattening the feature values of the feature map of the global information, concatenating them with an embedding vector, and passing the result through a fully connected layer; and combining the object information value for the local information with the object information value for the global information.

The feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, respectively. The fusing of these feature maps through the fusion network may include using a masking technique that multiplies the feature map of the global information by a matrix that sets to zero the feature values at the positions corresponding to the feature map of the local information, so that the feature map of the local information provides information about the object and the feature map of the global information provides environment information other than the object.

The fusing, through the fusion network, of the feature values obtained by passing through the second deep neural network and the third deep neural network may include: obtaining an object information value including the type and location of an object by passing the feature map of the local information through a convolution filter at each position; and combining the object information value for the local information with a value obtained by flattening the feature values of the masked feature map of the global information and passing them through a fully connected layer.

The feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, respectively. The detecting of the object from the image information based on the fused feature map may include combining the object information value for the local information with the object information value for the global information and passing the result through a fully connected layer to detect the object from the image information.

An object detection system may include: a feature map acquisition unit that obtains a feature map by passing image information through a first deep neural network; an input unit that passes the obtained feature map through a second deep neural network and a third deep neural network, respectively; a fusion unit that fuses, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network; and a detection unit that detects an object from the image information based on the fused feature map.

The input unit may pass the obtained feature map through each of the second deep neural network, which obtains a feature map representing local information, and the third deep neural network, which obtains a feature map representing global information.

The feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, respectively. The fusion unit may obtain an object information value including the type and location of an object by passing the feature map of the local information through a convolution filter at each position, obtain an object information value including the type and location of the object by flattening the feature values of the feature map of the global information, concatenating them with an embedding vector, and passing the result through a fully connected layer, and combine the object information value for the local information with the object information value for the global information.

The feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, respectively. The fusion unit may use a masking technique that multiplies the feature map of the global information by a matrix that sets to zero the feature values at the positions corresponding to the feature map of the local information, so that the feature map of the local information provides information about the object and the feature map of the global information provides environment information other than the object.

The fusion unit may obtain an object information value including the type and location of an object by passing the feature map of the local information through a convolution filter at each position, and may combine the object information value for the local information with a value obtained by flattening the feature values of the masked feature map of the global information and passing them through a fully connected layer.

The feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, respectively. The detection unit may combine the object information value for the local information with the object information value for the global information and pass the result through a fully connected layer to detect the object from the image information.

The object detection system according to an embodiment can improve detection performance by using local information and global information of the image information together.

The object detection system according to an embodiment separately configures a network for obtaining local information about where an object is located and a network for obtaining global information about the environment surrounding the object, and fuses the feature maps obtained from these networks through a fusion network. This enables efficient object recognition based on an understanding of the object's surrounding environment.

FIG. 1 is a diagram illustrating the structure of a deep neural network that performs object detection in an object detection system according to an embodiment.
FIG. 2 is a diagram illustrating the structure of a deep neural network that fuses local information and global information using an embedding vector in an object detection system according to an embodiment.
FIG. 3 is a diagram illustrating the structure of a deep neural network that fuses local information and global information using a masking technique in an object detection system according to an embodiment.
FIG. 4 is a table showing the object detection performance of an object detection system according to an embodiment.
FIG. 5 is a block diagram illustrating the structure of an object detection system according to an embodiment.
FIG. 6 is a flowchart illustrating a method of recognizing an object in an object detection system according to an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating the structure of a deep neural network that performs object detection in an object detection system according to an embodiment.

When performing object detection using a deep neural network, the object detection system can improve image-based detection performance by fusing local information about the object with global information that includes the surrounding environment. To this end, the structure of a deep neural network (for example, a CNN) may be used to extract the local and global information, and part of the Single Shot Detector (SSD) method may be used as the structure for detecting objects from the feature maps.

FIG. 1 shows the structure of a deep neural network that detects objects by fusing global information and local information. Image information from a camera may be input to a deep neural network (for example, a CNN). In the description below, the deep neural network that first receives the image information is referred to, for convenience, as the first deep neural network. The output produced as the image information passes through the first deep neural network may be split and passed through a second deep neural network and a third deep neural network. For example, RGB image information (110) from the camera may be passed through a VGG-16 CNN (101). The feature values produced by the VGG-16 CNN (101) may then be passed separately through a local CNN (102) and a global CNN (103) to obtain a feature map of local information and a feature map of global information, respectively. In the local CNN, the feature maps of all six layers, (a), (b), (c), (d), (e), and (f), are extracted to detect objects; in the global CNN, the feature map (g) of the fourth of its five layers is used. Objects may be detected (120) by fusing these feature maps through the fusion network (200). Methods for detecting objects using the local and global feature maps together are described with reference to FIGS. 2 and 3.
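The branching described for FIG. 1 can be traced with a small shape-only sketch. This is a hedged illustration, not the patented implementation: the stand-in layer `downsample`, the halving of resolution per layer, and the 38 x 38 x 512 backbone output are all assumptions, since the description does not state layer shapes. Only the branch structure (six local maps, the fourth of five global layers) comes from the text.

```python
def downsample(shape):
    """Stand-in for one CNN layer (assumption): halves height and width, rounding up."""
    h, w, c = shape
    return ((h + 1) // 2, (w + 1) // 2, c)

def run_branch(shape, num_layers):
    """Apply num_layers stand-in layers and return every intermediate shape."""
    outputs = []
    for _ in range(num_layers):
        shape = downsample(shape)
        outputs.append(shape)
    return outputs

def forward(backbone_shape):
    # Second network (local CNN 102): all six feature maps (a)-(f) are used.
    local_maps = run_branch(backbone_shape, num_layers=6)
    # Third network (global CNN 103): only the 4th of five layers, (g), is used.
    global_map = run_branch(backbone_shape, num_layers=5)[3]
    return local_maps, global_map

# Hypothetical output shape of the first network (VGG-16 CNN 101).
local_maps, global_map = forward((38, 38, 512))
print(len(local_maps), global_map)
```

Both branches read the same backbone output, which is what lets the fusion network combine per-position local predictions with a single image-level context feature.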

FIG. 2 is a diagram illustrating the structure of a fusion network (deep neural network) that fuses local information and global information using an embedding vector in an object detection system according to an embodiment. As one example, FIG. 2 shows a method of fusing the two different kinds of feature maps by using an embedding vector for the position information of the local information.

As in the SSD method, the feature map of the local information may be passed through a convolution filter at each position to obtain an object information value, including the type and location of the object, corresponding to the local information. For the feature map of the global information, its values may be flattened, concatenated with the embedding vector, and passed through a fully connected layer to obtain an object information value including the type and location of the object. The object information value obtained from the local feature map and the object information value (for example, a score value) obtained from the global feature map are then concatenated and passed through another fully connected layer to obtain values (for example, score values) for the object's type and coordinates, from which the final detection is made. Here, an embedding vector is a technique for representing position coordinate information as a vector of a given dimension; in the embodiment it may be used to activate the global information that corresponds to the position of the convolution filter on the local feature map. The embedding vector may also be learned jointly during network training, and it may be looked up using the center coordinates at which the convolution filter is located on the local feature map.
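The embedding-vector fusion of FIG. 2 can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the sizes `NUM_OUT`, `EMB_DIM`, and `GRID`, the helper names, and the random weights are all hypothetical, and in the described system the embedding table and layer weights would be learned during training rather than drawn at random.

```python
import random

random.seed(0)

def dense(inputs, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def init_dense(n_in, n_out):
    """Random stand-in weights (assumption; real weights are trained)."""
    weights = [[random.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

def flatten(feature_map):
    return [v for row in feature_map for v in row]

NUM_OUT = 6  # hypothetical: 2 class scores + 4 box coordinates
EMB_DIM = 3  # hypothetical embedding dimension
GRID = 4     # hypothetical: local feature map has GRID x GRID positions

# One embedding vector per convolution-filter center coordinate.
embedding = {(i, j): [random.uniform(-1, 1) for _ in range(EMB_DIM)]
             for i in range(GRID) for j in range(GRID)}

global_map = [[random.random() for _ in range(2)] for _ in range(2)]
fc_global = init_dense(4 + EMB_DIM, NUM_OUT)
fc_final = init_dense(2 * NUM_OUT, NUM_OUT)

def fuse(local_scores, center):
    # Flatten the global map and concatenate the position embedding.
    global_in = flatten(global_map) + embedding[center]
    global_scores = dense(global_in, *fc_global)
    # Concatenate local and global values, then apply the final layer.
    return dense(local_scores + global_scores, *fc_final)

local_scores = [0.5] * NUM_OUT  # stand-in for the conv filter output at (1, 2)
fused = fuse(local_scores, (1, 2))
print(len(fused))
```

Because the embedding is indexed by the filter's center coordinates, the same global feature map yields a different global score for each position, which is how the position of the local prediction conditions the use of the global context.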

FIG. 3 is a diagram illustrating the structure of a fusion network (deep neural network) that fuses local information and global information using a masking technique in an object detection system according to an embodiment.

Referring to FIG. 3, so that the feature map of the global information contributes only information about the environment and not object information, local and global information may be fused using a masking technique that multiplies the global feature map by a matrix that sets to zero the feature values at the positions corresponding to the local information. When an object is detected using the local and global feature maps together, the local feature map supplies only information about the object and the global feature map supplies only environment information other than the object. The masking technique therefore has the effect of removing object-related information from the global feature map while retaining only the environment information.

As in the SSD method, the feature map of the local information may be passed through a convolution filter at each position to obtain an object information value, including the type and location of the object, corresponding to the local information.

The global feature map is multiplied by the matrix that sets to zero the feature values at the positions corresponding to the local information, and its values are then flattened and passed through a fully connected layer. The resulting value is concatenated with the object information value (for example, a score value) obtained from the local feature map and passed through another fully connected layer to perform the final detection of the object.
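The masking step can be illustrated with a short Python sketch. It assumes that the multiplication is element-wise and that the local region is given as a rectangle in global feature map coordinates; the description specifies neither, and the names `build_mask` and `apply_mask` are hypothetical.

```python
def build_mask(height, width, box):
    """Matrix of ones with zeros inside box = (top, left, bottom, right).

    The bottom and right bounds are exclusive.
    """
    top, left, bottom, right = box
    return [[0.0 if top <= y < bottom and left <= x < right else 1.0
             for x in range(width)]
            for y in range(height)]

def apply_mask(feature_map, mask):
    """Element-wise product of a feature map and a mask (assumed form)."""
    return [[v * m for v, m in zip(f_row, m_row)]
            for f_row, m_row in zip(feature_map, mask)]

# Toy single-channel global feature map.
global_map = [[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0]]

# Suppose the local (object) region covers rows 0-1 and columns 1-2.
mask = build_mask(3, 3, box=(0, 1, 2, 3))
masked = apply_mask(global_map, mask)
print(masked)  # [[1.0, 0.0, 0.0], [4.0, 0.0, 0.0], [7.0, 8.0, 9.0]]
```

The masked map keeps only the values outside the object region, so flattening it and passing it through the fully connected layer feeds the detector environment information alone.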

FIG. 4 is a table illustrating the object detection performance of the object detection system according to an embodiment.

The table compares the performance of the object detection technique proposed in the embodiment with that of a conventional object detection algorithm. Detector performance is expressed as mean average precision (mAP), and the PASCAL VOC dataset was used for training. As the table shows, the proposed technique, which uses global context information, achieves detection performance that is on average 1% or more better than the existing Single Shot Detector (SSD) method, which does not use global information.

Conventional deep learning-based object recognition performs object detection using only local information in the feature maps of a deep neural network. In contrast, the method proposed in the embodiment uses, at the network nodes after the intermediate stage, a network trained to acquire local information together with a network trained to acquire global information, so that object detection also draws on information about the surrounding background.

In addition, a conventional deep learning-based object detection network is trained by supervised learning, using data labeled for each object, to produce detections that match the labels. The global information used in the proposed technique, by contrast, is data without label information, but the data corresponding to the global information exploits its interaction with the local information data. Accordingly, when the local information of the feature map is used, an embedding vector for the position of the convolution filter is used with it, which can maximize the interaction between local and global information. Compared with an object detection method that uses only local information, using this additional information about the environment can improve object recognition and detection performance.

The object detection system according to an embodiment can improve detection performance by using local and global information together when performing object detection with deep learning. In a deep neural network that takes camera sensor data as input, the proposed method separately configures a network for acquiring local information about where an object is and a network for acquiring global information about the environment surrounding the object. By effectively fusing the two kinds of information, it can perform efficient object recognition based on an understanding of the object's surroundings.

The object detection system according to an embodiment is expected to be used in smart home and autonomous driving environments, where the environment surrounding an object is to be understood at the same time as the object is detected from camera sensor data. The proposed method can offer a solution that effectively achieves this understanding of objects and their surroundings. Beyond smart homes and autonomous driving, it is applicable to various artificial intelligence technologies that recognize environments or objects.

FIG. 5 is a block diagram illustrating the structure of an object detection system according to an embodiment, and FIG. 6 is a flowchart illustrating a method of recognizing an object in the object detection system according to an embodiment.

The object detection system 500 may include a feature map acquisition unit 510, an input unit 520, a fusion unit 530, and a detection unit 540. These components may be representations of different functions performed by a processor according to control instructions provided by program code stored in the object detection system 500. The components may control the object detection system 500 to perform steps 610 to 640 of the object detection method of FIG. 6. Here, the components may be implemented to execute instructions according to the code of an operating system and the code of at least one program held in memory.

The processor of the object detection system 500 may load into memory the program code stored in a file of a program for the object detection method. For example, when the program is executed in the object detection system 500, the processor may control the object detection system to load the program code from the program file into memory under the control of the operating system. Here, the processor and each of the feature map acquisition unit 510, the input unit 520, the fusion unit 530, and the detection unit 540 included in the processor may be different functional representations of the processor that execute the instructions of the corresponding portion of the loaded program code in order to perform the subsequent steps 610 to 640.

In step 610, the feature map acquisition unit 510 may acquire a feature map by passing image information through a first deep neural network.

In step 620, the input unit 520 may pass the acquired feature map through each of a second deep neural network and a third deep neural network. The second deep neural network obtains a feature map representing local information, and the third deep neural network obtains a feature map representing global information. Here, the second deep neural network may have a six-layer CNN structure for obtaining feature values representing local information and may detect an object by extracting the feature maps of the six layers from the CNN structure. The third deep neural network may have a five-layer CNN structure for obtaining a feature map representing global information and may detect an object using the feature map of one of the five layers in the CNN structure.

In step 630, the fusion unit 530 may fuse, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network. When those feature maps are a feature map of local information and a feature map of global information, the fusion unit 530 may obtain an object information value including the type and location of the object by passing the local feature map through a convolution filter at each position, obtain an object information value including the type and location of the object by flattening the feature values of the global feature map, concatenating them with an embedding vector, and passing the result through a fully connected layer, and then combine the object information value for the local information with the object information value for the global information. The fusion unit 530 may use a masking technique that multiplies the global feature map by a matrix that sets to 0 the feature values at the positions corresponding to the local feature map, so that information about the object is taken from the local feature map and environment information other than the object is taken from the global feature map. The fusion unit 530 may obtain an object information value including the type and location of the object by passing the local feature map through a convolution filter at each position, flatten the feature values of the global feature map to which the masking technique has been applied, pass them through a fully connected layer, and combine the resulting values with the object information value for the local information.
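The embedding-vector variant of the fusion step can be sketched as follows. This is an illustrative sketch only: the text does not specify how the embedding vector is produced, so a lookup table indexed by filter position is assumed here, and the function name `global_object_info`, the table, and all sizes and weights are hypothetical.

```python
import numpy as np


def global_object_info(global_map, position, embedding_table, weights, bias):
    """Flatten the global map, append a position embedding, apply one dense layer."""
    flat = global_map.reshape(-1)
    embedding = embedding_table[position]  # assumption: one vector per filter position
    joint = np.concatenate([flat, embedding])
    return joint @ weights + bias


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(1)
num_positions, embed_dim, out_dim = 6, 8, 5
global_map = rng.normal(size=(4, 3, 3))
embedding_table = rng.normal(size=(num_positions, embed_dim))
weights = rng.normal(size=(4 * 3 * 3 + embed_dim, out_dim))
bias = np.zeros(out_dim)

local_info = rng.normal(size=out_dim)  # stands in for the convolution-filter output
global_info = global_object_info(global_map, 2, embedding_table, weights, bias)
combined = np.concatenate([local_info, global_info])  # combined object information
```

Because the embedding depends on the filter position, the same global feature map yields a different global object information value for each position, which is one way to let the global branch interact with a specific local prediction.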

In step 640, the detection unit 540 may detect an object from the image information based on the fused feature map. The detection unit 540 may combine the object information value for the local information with the object information value for the global information, pass the combined values through a fully connected layer, and detect the object from the image information based on the resulting object information value.
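The data flow of steps 610 to 640 can be summarized with the sketch below. It only shows how the four units hand data to one another. The five callables are toy stand-ins chosen so the example runs (block averaging, flattening, a global mean, concatenation, and a threshold), not the trained networks of the embodiment, and the name `run_detection` is hypothetical.

```python
import numpy as np


def run_detection(image, backbone, local_net, global_net, fusion_net, detector):
    """Chain the four units in the order of steps 610 to 640."""
    shared = backbone(image)                   # step 610: first network
    local_map = local_net(shared)              # step 620: second network (local)
    global_map = global_net(shared)            # step 620: third network (global)
    fused = fusion_net(local_map, global_map)  # step 630: fusion
    return detector(fused)                     # step 640: detection


# Toy stand-ins, for illustration only.
backbone = lambda img: img.reshape(2, 2, 2, 2).mean(axis=(1, 3))  # 2x2 block average
local_net = lambda fm: fm.reshape(-1)                             # per-cell values
global_net = lambda fm: np.array([fm.mean()])                     # one global value
fusion_net = lambda loc, glob: np.concatenate([loc, glob])
detector = lambda fused: fused[:-1] > fused[-1]  # cells above the global mean

image = np.arange(16, dtype=float).reshape(4, 4)
detections = run_detection(image, backbone, local_net, global_net, fusion_net, detector)
```

Both branches read the same shared feature map, so the first network runs only once per image.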

The object detection technique using global context information according to an embodiment is applicable to various fields. Representative applications include smart home cameras, autonomous vehicles, and robotics systems. First, object detection is the most important function of a smart home camera: based on information about the types and locations of objects in the indoor environment where the smart home system is deployed, it allows the system to make suggestions or perform functions suited to its purpose. Second, in an autonomous vehicle, object detection determines where other vehicles, objects, and pedestrians are located and whether hazards are present, which enables safe driving and helps prevent accidents. In robotics, object detection acquires data through a camera sensor mounted on the robot and recognizes the position, size, and type of objects or obstacles that impede the robot's movement, providing information essential for carrying out its task. One way to apply the technique is to acquire data in real time with a camera sensor and run the object recognition algorithm on an embedded system equipped with a graphics processing unit (GPU) or the like. To this end, data covering a variety of environments is collected in advance and used to train the deep neural network structure. The trained deep neural network is stored as optimized network coefficients, which are deployed to the embedded system so that the object recognition algorithm can be run on test data received in real time to obtain results.

Deep learning-based object detection using global information can currently be applied to smart home cameras, autonomous driving, mobile robots, and the like. Building on this technique, it is expected to go beyond recognition and perform more complex functions in the future, such as tracking objects, identifying relationships between objects, and predicting future events through an understanding of the environment. For example, in a smart home environment, an understanding of the environment derived from global information makes it possible, using only a camera sensor, to predict and prevent dangerous situations and to predict the future position of a moving object from the relationships between objects. In autonomous driving environments, it can be used for advanced tasks such as automated surveillance and traffic monitoring. An object detection algorithm using global information is thus a foundation for future technologies and can be regarded as one of the representative artificial intelligence techniques.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that the processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, computer storage medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like.

Although the embodiments have been described with reference to limited embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or components of the described systems, structures, devices, circuits, and the like are combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

An object detection method performed by an object detection system, the method comprising:
acquiring, by the object detection system, a feature map by passing image information through a first deep neural network;
passing, by the object detection system, the acquired feature map through each of a second deep neural network and a third deep neural network;
fusing, by the object detection system, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network; and
detecting, by the object detection system, an object from the image information based on the fused feature map,
wherein the feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, and
wherein the fusing comprises:
obtaining an object information value including the type and location of the object by passing the feature map of the local information through a convolution filter at each position; obtaining an object information value including the type and location of the object by flattening the feature values of the feature map of the global information, concatenating them with an embedding vector, and passing the result through a fully connected layer; and combining the object information value for the local information with the object information value for the global information.

The method of claim 1,
wherein passing the acquired feature map through each of the second deep neural network and the third deep neural network comprises
passing the acquired feature map through each of the second deep neural network, which obtains a feature map representing local information, and the third deep neural network, which obtains a feature map representing global information.

The method of claim 1,
wherein the second deep neural network has a six-layer CNN structure for obtaining feature values representing local information and detects an object by extracting the feature maps of the six layers from the CNN structure, and
wherein the third deep neural network has a five-layer CNN structure for obtaining a feature map representing global information and detects an object using the feature map of one of the five layers in the CNN structure.

(Deleted)

An object detection method performed by an object detection system, the method comprising:
acquiring, by the object detection system, a feature map by passing image information through a first deep neural network;
passing, by the object detection system, the acquired feature map through each of a second deep neural network and a third deep neural network;
fusing, by the object detection system, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network; and
detecting, by the object detection system, an object from the image information based on the fused feature map,
wherein the feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, and
wherein the fusing comprises
using a masking technique that multiplies the feature map of the global information by a matrix that sets to 0 the feature values at the positions corresponding to the feature map of the local information, so that information about the object is taken from the feature map of the local information and environment information other than the object is taken from the feature map of the global information.

The method of claim 5,
wherein the fusing comprises:
obtaining an object information value including the type and location of the object by passing the feature map of the local information through a convolution filter at each position; flattening the feature values of the feature map of the global information to which the masking technique has been applied and passing them through a fully connected layer; and combining the resulting values with the object information value for the local information.

The method of claim 1 or 5,
wherein the feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, and
wherein detecting the object from the image information based on the fused feature map comprises
combining the object information value for the local information with the object information value for the global information and passing the combined values through a fully connected layer to detect the object from the image information.

An object detection system comprising:
a feature map acquisition unit that acquires a feature map by passing image information through a first deep neural network;
an input unit that passes the acquired feature map through each of a second deep neural network and a third deep neural network;
a fusion unit that fuses, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network; and
a detection unit that detects an object from the image information based on the fused feature map,
wherein the feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, and
wherein the fusion unit
obtains an object information value including the type and location of the object by passing the feature map of the local information through a convolution filter at each position, obtains an object information value including the type and location of the object by flattening the feature values of the feature map of the global information, concatenating them with an embedding vector, and passing the result through a fully connected layer, and combines the object information value for the local information with the object information value for the global information.

The system of claim 8,
wherein the input unit
passes the acquired feature map through each of the second deep neural network, which obtains a feature map representing local information, and the third deep neural network, which obtains a feature map representing global information.

The system of claim 8,
wherein the second deep neural network has a six-layer CNN structure for obtaining feature values representing local information and detects an object by extracting the feature maps of the six layers from the CNN structure, and
wherein the third deep neural network has a five-layer CNN structure for obtaining a feature map representing global information and detects an object using the feature map of one of the five layers in the CNN structure.

(Deleted)

An object detection system comprising:
a feature map acquisition unit that acquires a feature map by passing image information through a first deep neural network;
an input unit that passes the acquired feature map through each of a second deep neural network and a third deep neural network;
a fusion unit that fuses, through a fusion network, the feature maps obtained by passing through the second deep neural network and the third deep neural network; and
a detection unit that detects an object from the image information based on the fused feature map,
wherein the feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, and
wherein the fusion unit
uses a masking technique that multiplies the feature map of the global information by a matrix that sets to 0 the feature values at the positions corresponding to the feature map of the local information, so that information about the object is taken from the feature map of the local information and environment information other than the object is taken from the feature map of the global information.

The system of claim 12,
wherein the fusion unit
obtains an object information value including the type and location of the object by passing the feature map of the local information through a convolution filter at each position, flattens the feature values of the feature map of the global information to which the masking technique has been applied and passes them through a fully connected layer, and combines the resulting values with the object information value for the local information.

The system of claim 8 or 12,
wherein the feature maps obtained by passing through the second deep neural network and the third deep neural network are a feature map of local information and a feature map of global information, and
wherein the detection unit
combines the object information value for the local information with the object information value for the global information and passes the combined values through a fully connected layer to detect the object from the image information.