KR102270009B1

KR102270009B1 - Method for detecting moving object and estimating distance thereof based on artificial intelligence algorithm of multi channel images

Info

Publication number: KR102270009B1
Application number: KR1020200034465A
Authority: KR
Inventors: 박현주
Original assignee: 주식회사 인포웍스; 박현주
Priority date: 2020-03-20
Filing date: 2020-03-20
Publication date: 2021-06-28

Abstract

An object of the present invention is to extract features for estimating distance from image information of stereo vision based on a convolution neural network (CNN) algorithm and estimate the distance of a moving object with a relatively small amount of computation. The present invention relates to a method for detecting a moving object and estimating a distance. A method for detecting a moving object and estimating a distance according to an embodiment of the present invention includes the steps of: extracting global features; synthesizing a cost block; extracting a disparity map; extracting a bounding box; cropping a disparity map; and estimating a distance.

Description

METHOD FOR DETECTING MOVING OBJECT AND ESTIMATING DISTANCE THEREOF BASED ON ARTIFICIAL INTELLIGENCE ALGORITHM OF MULTI CHANNEL IMAGES

The present invention relates to a method for detecting a moving object and estimating its distance. More specifically, it relates to a multi-channel image artificial intelligence algorithm-based method that applies an AI algorithm to stereo image information in order to detect a specific type of moving object and estimate that object's distance with a relatively small amount of computation.

Conventional algorithms for finding the distance of a moving object generally combine an object recognition algorithm with a stereo vision algorithm: they measure the pixel similarity between the left and right images over the entire image, build a depth map, and then locate the object in that depth map.

Here, an object recognition algorithm is a computer vision technique that identifies objects in an image or video. The main approaches are as follows.

First, deep learning models such as convolutional neural networks (CNNs) are used to learn an object's distinguishing features automatically. For example, a CNN can analyze thousands of training images and learn the features that separate cats from dogs. Training a deep network from scratch means gathering a very large labeled data set and designing a network architecture that learns the features and completes the model. This can yield excellent results, but it requires a large amount of training data and requires the layers and weights of the CNN to be configured.

The alternative is to use a pre-trained deep learning model. Most deep learning applications use transfer learning, a process that fine-tunes a pre-trained model. In this approach, new data containing previously unknown classes is fed into an existing network such as AlexNet or GoogLeNet. Because the model has already been trained on thousands or millions of images, this takes less time and produces results quickly.

Second, as an approach different from deep learning, common examples of machine learning for object recognition include HOG feature extraction with a support vector machine (SVM) model, Bag-of-Words (BoW) models using features such as SURF and MSER, and the Viola-Jones algorithm, the first object detection framework to offer competitive detection rates in real time, which can recognize a variety of objects including faces and upper bodies.

To perform object recognition with a standard machine learning approach, images (or videos) are collected and the key features are selected from each image. For example, a feature extraction algorithm extracts edge or corner features that can be used to distinguish between classes in the data. These features are then added to a machine learning model, which sorts them into distinct categories and uses this information when analyzing and classifying new objects. Various machine learning algorithms and feature extraction methods can be combined to build an accurate object recognition model. Using machine learning for object recognition in this way allows the best combination of features and classifiers to be chosen for training, and accurate results can be obtained even with minimal data.

Stereo vision, meanwhile, is a technique that extracts the disparity between the images of two cameras and computes actual distance information from it. It is being actively studied because it can be used in many fields.

However, both object recognition algorithms and stereo vision algorithms still have the drawback of requiring a large amount of computation. In addition, most existing stereo vision algorithms focus on the matching process, so research on post-processing is relatively lacking.

Korean Registered Patent Publication No. 10-1391095
Korean Registered Patent Publication No. 10-2003387

The present invention was devised to solve the problems of the prior art described above. Its object is to provide a multi-channel image AI algorithm-based method for detecting a moving object and estimating its distance, in which features suitable for distance estimation are extracted from stereo vision image information using a convolutional neural network (CNN) algorithm, so that the distance of a moving object can be estimated with a relatively small amount of computation.

To achieve the above object, the present invention comprises: a global feature extraction step of extracting global features from the left and right stereo vision images of an object through a convolution operation performed by a U-Net feature extractor; a cost block synthesis step of convolving the extracted global features with a cost block based on distance information; a disparity map extraction step of extracting a disparity map through 3D convolution and a Soft ArgMax operation, in order to extract three-dimensional features from the global features into which the cost block has been synthesized; a bounding box extraction step of repeatedly applying convolution and pooling to the global features extracted by the U-Net feature extractor and concatenating the results at the end to extract the bounding box of the detected object; a disparity map cropping step of cropping each disparity map to the bounding box coordinates extracted from the left and right stereo vision images; and a distance estimation step of computing the difference between the cropped disparity maps and estimating the distance of the object through a convolution operation.

The method further includes, before the global feature extraction step, a training data generation step of generating training data, wherein the training data includes bounding box information of the object and the distance value at the center point of that box.

In addition, the loss function L for measuring the position and distance accuracy of the detected object follows Equation 1 below.

Equation 1: $L = L_{loc}(x, l, g) + L_{dist}(d, g_{dist})$

(Here, $L_{loc}$ is the loss function for the localization accuracy of the detected object, $L_{dist}$ is the loss function for the accuracy of the object's distance, $l$ is the predicted bounding box, $g$ is the ground truth box coordinates, $d$ is the predicted distance of the object, and $g_{dist}$ is the ground truth distance.)

In Equation 1, $x$ is 1 if the overlap ratio between the predicted bounding box and the ground truth is 0.5 or more, and 0 otherwise.

In addition, $L_{dist}$ is given by Equation 2 below.

Equation 2: $L_{dist}(d, g_{dist}) = |d - g_{dist}|$

The multi-channel image AI algorithm-based method for detecting a moving object and estimating its distance according to the present invention, as described above, provides the following effects.

1. To estimate the distance of a moving object, it presents a method for estimating moving object distance from stereo vision images and a neural network training method for coping with a variety of environmental changes.

2. Using the trained AI algorithm, moving objects can be estimated in a way that is robust to environmental changes.

3. By applying an AI algorithm to stereo vision images, moving objects can be detected and their distances estimated with a relatively small amount of computation.

FIG. 1 is a flowchart illustrating the multi-channel image AI algorithm-based method for detecting a moving object and estimating its distance according to the present invention.
FIG. 2 is a diagram showing an example of the U-Net structure used in the present invention.
FIG. 3 is a diagram showing an example of a system for carrying out the multi-channel image AI algorithm-based method for detecting a moving object and estimating its distance according to the present invention.

The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concept of a term in order to describe the invention in the best way.

Accordingly, the embodiments described in this specification and the configurations shown in the drawings are only the most preferred embodiment of the present invention and do not represent all of its technical ideas. It should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

Hereinafter, the multi-channel image AI algorithm-based method for detecting a moving object and estimating its distance according to the present invention is described with reference to the drawings.

FIG. 1 is a flowchart illustrating the method according to the present invention, and FIG. 2 is a diagram showing an example of the U-Net structure used in the present invention.

The present invention basically comprises a global feature extraction step, a cost block synthesis step, a disparity map extraction step, a bounding box extraction step, a disparity map cropping step, and a distance estimation step.

More specifically, the present invention comprises: a global feature extraction step of extracting global features from the left and right stereo vision images of an object through a convolution operation performed by a U-Net feature extractor; a cost block synthesis step of convolving the extracted global features with a cost block based on distance information; a disparity map extraction step of extracting a disparity map through 3D convolution and a Soft ArgMax operation, in order to extract three-dimensional features from the global features into which the cost block has been synthesized; a bounding box extraction step of repeatedly applying convolution and pooling to the global features extracted by the U-Net feature extractor and concatenating the results at the end to extract the bounding box of the detected object; a disparity map cropping step of cropping each disparity map to the bounding box coordinates extracted from the left and right stereo vision images; and a distance estimation step of computing the difference between the cropped disparity maps and estimating the distance of the object through a convolution operation.

In the global feature extraction step, the U-Net feature extractor extracts global features from the left and right stereo vision images. U-Net has a structure that reduces the overlap ratio in order to improve on the slow speed of earlier networks.

For example, a sliding window approach re-verifies, in the next patch, a region that was already verified in the previous patch (a patch being the unit of image recognition). U-Net instead skips regions that have already been verified and starts fresh verification from the next patch, which makes it faster.

As the patch size grows, a wider area of the image must be recognized at once. This is excellent for recognizing context (the relationship between neighboring pixels) but penalizes localization. U-Net, by contrast, is accurate in both context and localization.

Referring to FIG. 2, taking the center of the symmetric U-shaped network as the reference, the left side is called the contracting path and the right side the expansive path. The contracting path helps capture the context of the image, while the expansive path upsamples the feature map and combines it with the context captured in the contracting path's feature maps to achieve more accurate localization. (In the figure, each blue box is a multi-channel feature map, the number of channels is written above the box, the xy size (width and height) is written at the lower left of the box, and white boxes are copies of feature maps.)

In the cost block synthesis step, the cost block is convolved with the global features so that distance information is reflected in them.

Here, the cost block is a block representing the similarity between the pixels of the two images. Its size is defined as H x W x M, where H and W are the height and width of the image and M is the maximum number of disparities.
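The H x W x M layout can be illustrated with a minimal sketch. The source does not state how the similarity is computed, so the dot product between feature vectors, the function name `build_cost_block`, and the convention that a left pixel at column x is compared with the right pixel at column x - m are all assumptions made for illustration.

```python
def build_cost_block(left, right, max_disp):
    """Return an H x W x M nested list of left/right similarity scores.

    left, right: H x W x C nested lists of per-pixel feature vectors.
    Similarity is a dot product (an assumption; the source does not fix it).
    """
    height, width = len(left), len(left[0])
    block = [[[0.0] * max_disp for _ in range(width)] for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for m in range(max_disp):
                xr = x - m  # candidate matching column in the right image
                if xr < 0:
                    continue  # no candidate pixel; leave the similarity at 0
                block[y][x][m] = sum(
                    a * b for a, b in zip(left[y][x], right[y][xr])
                )
    return block


# Toy 1 x 4 image with one-hot features; the right view is shifted by 1 pixel.
f0, f1, f2, f3 = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
left = [[f0, f1, f2, f3]]
right = [[f1, f2, f3, [0, 0, 0, 0]]]
block = build_cost_block(left, right, max_disp=3)
print(block[0][2])  # the similarity at column 2 peaks at disparity index 1
```

A trained network would operate on learned feature maps rather than one-hot vectors, but the indexing of the block by row, column, and candidate disparity is the same.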

In the disparity map extraction step, the disparity map is extracted by sequentially applying 3D convolution and the Soft ArgMax operation to the global features that have been convolved with the cost block.

For reference, 2D convolution takes a two-dimensional or three-dimensional input and produces a two-dimensional output, whereas 3D convolution can take a three-dimensional input and produce a three-dimensional output, so it is used to extract three-dimensional spatio-temporal features. 3D convolution is well known in the art, so a detailed description is omitted.

For reference, the Soft ArgMax operation first obtains normalized probabilities for the inputs. Its output is the expected index, that is, the sum of each index multiplied by its probability. This expected value is unreliable when the distribution has several modes. Multiplying the input by a constant raises the maximum and suppresses the other values, so the operation can output the index of the maximum while remaining differentiable (without a derivative, the gradient could not be computed during training).
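The behavior described above can be sketched in a few lines of plain Python. The function name `soft_argmax` and the parameter name `beta` (the constant multiplied into the input) are illustrative choices, not names from the source.

```python
import math


def soft_argmax(scores, beta=1.0):
    """Differentiable stand-in for argmax over a 1-D list of scores.

    beta is a hypothetical name for the sharpening constant: larger values
    push the softmax toward a one-hot distribution at the maximum.
    """
    peak = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - peak)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Expected index under the softmax distribution.
    return sum(i * p for i, p in enumerate(probs))


unimodal = [0.0, 0.0, 0.0, 4.0]
print(soft_argmax(unimodal, beta=1.0))   # pulled below 3 by the other entries
print(soft_argmax(unimodal, beta=50.0))  # very close to 3, the true argmax

bimodal = [5.0, 0.0, 0.0, 0.0, 5.0]
print(soft_argmax(bimodal, beta=50.0))   # lands between the two modes
```

The bimodal case shows the weakness mentioned in the text: with two equal peaks at indices 0 and 4, the expected index falls at 2, where there is no peak at all.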

In the bounding box extraction step, convolution and pooling are applied repeatedly to the extracted global features, and the results are finally concatenated to extract the bounding box of the detected object. The bounding box serves to identify the coordinate region of the object.

In the disparity map cropping step, each disparity map of the left and right stereo vision images is cropped to the coordinates of the bounding box obtained in the bounding box extraction step.
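As a minimal sketch of the cropping operation, the following assumes the disparity map is a 2-D nested list and the box is given as pixel indices `(x_min, y_min, x_max, y_max)` with exclusive upper bounds. The source does not specify the coordinate format, and the function name `crop_disparity` is hypothetical.

```python
def crop_disparity(disparity, box):
    """Cut the region inside box = (x_min, y_min, x_max, y_max) out of a 2-D map.

    Coordinates are pixel indices; x_max and y_max are exclusive, and the box
    is clamped to the map so a slightly oversized box does not fail.
    """
    height, width = len(disparity), len(disparity[0])
    x_min, y_min, x_max, y_max = box
    x_min, y_min = max(0, x_min), max(0, y_min)
    x_max, y_max = min(width, x_max), min(height, y_max)
    return [row[x_min:x_max] for row in disparity[y_min:y_max]]


# Toy 4 x 5 map whose value encodes its position as 10 * row + column.
disparity = [[10 * y + x for x in range(5)] for y in range(4)]
print(crop_disparity(disparity, (1, 1, 4, 3)))  # [[11, 12, 13], [21, 22, 23]]
```

The same box would be applied to the left and right disparity maps so that the two crops cover the same object region before their difference is taken.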

In the distance estimation step, the difference between the disparity maps cropped in the disparity map cropping step is computed, and the distance of the object is then estimated through a convolution operation. For example, when the present invention is applied to heavy equipment at a construction site, it can detect people or objects around the equipment and display their distance, and it can generate a warning signal when they come within a certain radius, helping to prevent injuries and damage to objects as far as possible.

FIG. 3 is a diagram showing an example of a system for carrying out the multi-channel image AI algorithm-based method for detecting a moving object and estimating its distance according to the present invention.

The method for detecting a moving object and estimating its distance according to the present invention is described below with reference to FIG. 3.

1. Disparity map extraction process

1-1) The U-Net feature extractor extracts global features (L/R features) from the left and right stereo vision images, respectively.

1-2) The cost block is convolved with the global features so that distance information is reflected in them.

1-3) 3D convolution and the Soft ArgMax operation are applied in sequence to the global features processed with the cost block, and the disparity map is extracted.

2. Bounding box extraction process

2-1) The U-Net feature extractor extracts global features (L/R features) from the left and right stereo vision images, respectively.

2-2) Convolution and pooling are applied repeatedly to the global features, and the results are concatenated at the end to extract the bounding box of the detected object.

3. Distance estimation process

3-1) Each disparity map is cropped to the bounding box coordinates extracted from the left and right stereo vision images.

3-2) The difference between the cropped disparity maps is computed, and the distance of the object is estimated through a convolution operation.

A deep neural network algorithm requires training data to be prepared before the actual system runs. The present invention therefore further includes a training data generation step, performed before the global feature extraction step, that generates the training data. The training data includes bounding box information of the object (e.g., bounding box coordinates) and the distance value at the center point of that box.

To raise training accuracy, the training data generation step configures the loss function L, which measures the accuracy of the position and distance of the detected object, as in Equation 1, and training proceeds in the direction that minimizes it.

Equation 1: $L = L_{loc}(x, l, g) + L_{dist}(d, g_{dist})$

In Equation 1, $x$ is 1 if the overlap ratio between the predicted bounding box and the ground truth is 0.5 or more, and 0 otherwise.

In addition, $L_{dist}$ is given by Equation 2.

Equation 2: $L_{dist}(d, g_{dist}) = |d - g_{dist}|$
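The indicator x and the distance term can be sketched as follows. The source only says "overlap ratio"; interpreting it as intersection over union (IoU) is an assumption, as are the function names and the `(x_min, y_min, x_max, y_max)` box format. The localization term is passed in as a number because its definition is given separately.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_indicator(pred_box, gt_box, threshold=0.5):
    """x in Equation 1: 1 if the overlap is at least the threshold, else 0."""
    return 1 if iou(pred_box, gt_box) >= threshold else 0


def distance_loss(d, g_dist):
    """L_dist in Equation 2: absolute error of the predicted distance."""
    return abs(d - g_dist)


def total_loss(loc_loss, d, g_dist):
    """L in Equation 1, with the localization term supplied by the caller."""
    return loc_loss + distance_loss(d, g_dist)


print(match_indicator((0, 0, 10, 10), (0, 0, 10, 8)))    # overlap 0.8
print(match_indicator((0, 0, 10, 10), (5, 5, 15, 15)))   # overlap about 0.14
print(total_loss(0.25, d=12.5, g_dist=10.0))
```

Keeping the two terms as separate functions mirrors the additive structure of Equation 1 and makes it easy to inspect how much of the loss comes from position error and how much from distance error.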

For reference, $L_{loc}$ can be obtained from Equations 3 to 7 below.

Equations 3 to 7: (equation images not reproduced in this text)

(Here, the smooth L1 loss function in Equation 3 is similar to the L1 loss, but it treats a prediction as correct when the error is sufficiently small. In Equations 4 to 7, $g$ is the ground truth, $d$ is the default box, and $\hat{g}$ is the box coordinates obtained through regression.)
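The symbols defined for Equations 3 to 7 (a smooth L1 term, a ground truth box $g$, a default box $d$, regressed coordinates $\hat{g}$, and an overlap indicator $x$ with a 0.5 threshold) match the localization loss of the SSD object detector. Assuming the patent follows that formulation, which the surviving text does not confirm, the five equations would read roughly as below. The indices $i$ (default box), $j$ (ground truth box), $k$ (class), and the coordinate set $\{cx, cy, w, h\}$ come from the SSD formulation, not from the source. Note that $d_i$ here is a default box, not the predicted distance $d$ of Equations 1 and 2.

```latex
% Assumed reconstruction based on the SSD localization loss; not confirmed by the source.
\begin{align}
L_{loc}(x, l, g) &= \sum_{i \in Pos} \; \sum_{m \in \{cx,\, cy,\, w,\, h\}}
    x_{ij}^{k} \, \mathrm{smooth}_{L1}\!\left(l_i^{m} - \hat{g}_j^{m}\right) \tag{3} \\
\hat{g}_j^{cx} &= \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}} \tag{4} \\
\hat{g}_j^{cy} &= \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}} \tag{5} \\
\hat{g}_j^{w} &= \log\frac{g_j^{w}}{d_i^{w}} \tag{6} \\
\hat{g}_j^{h} &= \log\frac{g_j^{h}}{d_i^{h}} \tag{7}
\end{align}
% Standard smooth L1 definition used in Equation 3:
\[
\mathrm{smooth}_{L1}(z) =
\begin{cases}
0.5\, z^{2} & \text{if } |z| < 1 \\
|z| - 0.5 & \text{otherwise}
\end{cases}
\]
```

Under this reading, the quadratic region of the smooth L1 function is what makes small errors contribute very little to the loss, in line with the description of it as a function that accepts sufficiently small errors.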

The present invention has been described above with reference to the accompanying drawings, focusing on specific shapes and directions. However, those skilled in the art can make various modifications and changes to it, and such modifications and changes should be construed as falling within the scope of the present invention.

Claims

1. A multi-channel image AI algorithm-based method for detecting a moving object and estimating its distance, comprising:
a global feature extraction step of extracting global features from left and right stereo vision images of an object through a convolution operation performed by a U-Net feature extractor;
a cost block synthesis step of convolving the extracted global features with a cost block based on distance information;
a disparity map extraction step of extracting a disparity map through 3D convolution and a Soft ArgMax operation in order to extract three-dimensional features from the global features into which the cost block has been synthesized;
a bounding box extraction step of extracting a bounding box of a detected object by repeatedly applying convolution and pooling to the global features extracted by the U-Net feature extractor and concatenating the results at the end;
a disparity map cropping step of cropping each disparity map to the bounding box coordinates extracted from the left and right stereo vision images; and
a distance estimation step of computing the difference between the cropped disparity maps and estimating the distance of the object through a convolution operation.

2. The method according to claim 1,
further comprising a training data generation step of generating training data before the global feature extraction step.

3. The method according to claim 2,
wherein the training data includes bounding box information of the object and the distance value at the center point of that box.

4. The method according to claim 3,
wherein the loss function L for measuring the position and distance accuracy of the detected object follows Equation 1 and Equations 3 to 7 below.
Equation 1: $L = L_{loc}(x, l, g) + L_{dist}(d, g_{dist})$
Equations 3 to 7: (equation images not reproduced in this text)
(Here, $L_{loc}$ is the loss function for the localization accuracy of the detected object, $L_{dist}$ is the loss function for the accuracy of the object's distance, $l$ is the predicted bounding box, $g$ is the ground truth box coordinates, $d$ is the predicted distance of the object, and $g_{dist}$ is the ground truth distance.)
(Here, Equation 3 uses the smooth L1 loss function, and in Equations 4 to 7, $g$ is the ground truth, $d$ is the default box, and $\hat{g}$ is the box coordinates obtained through regression.)
(Here, $x$ in Equation 1 is 1 if the overlap ratio between the predicted bounding box and the ground truth is 0.5 or more, and 0 otherwise.)

delete

5. The method according to claim 4,
wherein $L_{dist}$ is given by Equation 2 below.
Equation 2: $L_{dist}(d, g_{dist}) = |d - g_{dist}|$