KR102655999B1

KR102655999B1 - Apparatus and method for detecting distance values per instance object using a stereo camera

Info

Publication number: KR102655999B1
Application number: KR1020180128012A
Authority: KR
Inventors: 이동열
Original assignee: 현대모비스 주식회사
Priority date: 2018-10-25
Filing date: 2018-10-25
Publication date: 2024-04-11
Also published as: KR20200054344A

Abstract

An apparatus and method for detecting a distance value for each instance object using a stereo camera are disclosed. The apparatus includes: a stereo camera image receiver that receives left and right images from a stereo camera; a data extraction unit that classifies instance objects in each of the left and right images received by the stereo camera image receiver using a preset object classification algorithm, and extracts the data to be used by deep learning that estimates a distance value for each instance object; and a deep learning unit that, using a preset deep learning network algorithm, concatenates related items of the data extracted by the data extraction unit, compresses them into a single image form, and generates a two-channel array containing class information for each instance object and the distance value between each instance object and the own vehicle.

Description

Apparatus and method for detecting distance values for each instance object using a stereo camera {APPARATUS AND METHOD FOR DETECTING DISTANCE VALUES PER INSTANCE OBJECT USING A STEREO CAMERA}

The present invention relates to an apparatus and method for detecting a distance value for each instance object using a stereo camera and, more specifically, to an apparatus and method that detect the distance value between the own vehicle and surrounding instance objects by applying deep learning to data detectable in the images of the stereo camera.

With the recent rapid development of science and technology, digital image processing has been put to practical use in many application fields, such as space observation, medicine, photo interpretation, design, various kinds of pattern recognition, and vehicle operation. In particular, services that combine image or video recognition technology with AI have recently emerged, and because image recognition based on deep learning can be linked to diverse fields such as medicine, self-driving cars, and commerce, its development is accelerating.

In other words, image recognition means analyzing an image signal input from a camera to segment objects and extract features. In this embodiment in particular, image recognition is performed during vehicle driving to obtain the distance values between the own vehicle and the instance objects around it.

There are various ways to extract a specific object from the background through image recognition. As shown in FIG. 1, the relative distance between the own vehicle and the instance objects in an image can be obtained using a depth map and bounding box (bbox) information. More specifically, the relative distance for each bounding box can be calculated either by using the smallest depth value within the bounding box or by computing the correlation between the left image and the right image.

Meanwhile, an image contains various objects that need to be detected. A group of objects that share the same properties is defined as a class, and the set of classes may vary with the application. Depending on how a class is defined, regions of that class may be treated as the same object or as different objects. For example, for the sky class, two or more mutually isolated sky regions may appear in an image, and these regions can be treated as the same object. When regions belong to different objects but their boundaries are ambiguous and hard to separate, only the class is defined and the objects are not distinguished. By contrast, for classes such as cars, pedestrians, and bicycles, the boundaries are clear and recognizing each object separately can be important information, so each object must be distinguished. Each object distinguished in this way is called an instance object.

In the prior art shown in FIG. 1, the distance value between the own vehicle and an instance object is calculated using all of the pixel information in the bounding box. As a result, the information is contaminated by pixels unrelated to the instance object, and the distance cannot be calculated accurately. In addition, because a depth map is computed in order to calculate the relative distance for each bounding box, the amount of computation increases considerably and performance deteriorates.
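The contamination problem described above can be illustrated with a minimal sketch. The depth values, the mask, and the function names `min_depth_in_bbox` and `min_depth_in_mask` are invented for illustration and do not come from the source; the mask-restricted variant is shown only to make the contamination visible and is not the method of the invention, which avoids computing a depth map altogether.

```python
# Hypothetical 4x4 depth map in metres (values invented for illustration).
depth_map = [
    [30.0, 30.0, 30.0, 30.0],
    [30.0, 12.0, 12.0, 12.0],
    [30.0, 12.0, 12.0,  4.0],  # 4.0 m: an unrelated foreground pixel inside the box
    [30.0, 30.0, 30.0, 30.0],
]

# Instance mask of the target object (1 = pixel belongs to the object).
instance_mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]

# Tight bounding box of the mask as inclusive (row_min, col_min, row_max, col_max).
bbox = (1, 1, 2, 3)


def min_depth_in_bbox(depth, box):
    """Prior-art style estimate: smallest depth over every pixel in the box."""
    r0, c0, r1, c1 = box
    return min(depth[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1))


def min_depth_in_mask(depth, mask, box):
    """Same estimate restricted to pixels that belong to the instance."""
    r0, c0, r1, c1 = box
    return min(
        depth[r][c]
        for r in range(r0, r1 + 1)
        for c in range(c0, c1 + 1)
        if mask[r][c]
    )


print(min_depth_in_bbox(depth_map, bbox))                 # picks up the unrelated pixel
print(min_depth_in_mask(depth_map, instance_mask, bbox))  # uses object pixels only
```

In this toy case the box-wide minimum is taken from a pixel that does not belong to the object, which is the kind of error the per-instance mask is meant to avoid.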

Background art for the present invention includes Korean Patent Application Publication No. 10-2014-0142554 (published December 12, 2014), titled "Apparatus, method, and computer-readable recording medium for detecting a moving object using depth information."

According to one aspect, the present invention was devised to remedy the problems described above, and its object is to provide an apparatus and method for detecting a distance value for each instance object using a stereo camera, which detect the distance value between the own vehicle and surrounding instance objects by applying deep learning to data detectable in the images of the stereo camera.

An apparatus for detecting a distance value for each instance object using a stereo camera according to one aspect of the present invention includes: a stereo camera image receiver that receives left and right images from a stereo camera; a data extraction unit that classifies instance objects in each of the left and right images received by the stereo camera image receiver using a preset object classification algorithm, and extracts the data to be used by deep learning that estimates a distance value for each instance object; and a deep learning unit that, using a preset deep learning network algorithm, concatenates related items of the data extracted by the data extraction unit, compresses them into a single image form, and generates a two-channel array containing class information for each instance object and the distance value between each instance object and the own vehicle.

In the present invention, the data extraction unit includes: a mask segmentation generator that detects, in each of the left and right images received by the stereo camera image receiver, bounding boxes and a per-instance mask segmentation indicating, for each pixel, whether it belongs to a specific instance object; and a region extraction unit that divides the per-instance mask segmentation detected by the mask segmentation generator into regions, calculates the origin of each bounding box, and extracts class information for each instance object.

In the present invention, the deep learning unit includes: a combination structuring unit that, for each of the categories of mask segmentation region, bounding box origin, and per-object class information for the left and right images, lists the corresponding instance objects, structures them in array form, concatenates them, and converts them into input data for deep learning; an encoding unit that encodes the input data converted by the combination structuring unit, separately for each of the categories of mask segmentation region, bounding box origin, and per-object class information, by feeding it into a preset deep learning network algorithm; an image compression unit that concatenates the per-category data encoded by the encoding unit and compresses it into a single image form; and an array generator that feeds the data of the image compressed by the image compression unit into a preset deep learning network algorithm again to generate a two-channel array containing class information for each instance object and the distance value between each instance object and the own vehicle.

In the present invention, the data extraction unit detects bounding boxes in each of the left and right images received by the stereo camera image receiver using a preset object classification algorithm, extracts the feature map at the location of each detected bounding box by aligning it with the RoIAlign method, and, from the extracted feature map, classifies the class of the object while simultaneously detecting the mask of the object.

In the present invention, the preset object classification algorithm is a region-based deep learning algorithm including Mask R-CNN (mask region-based convolutional neural network).

In the present invention, the preset deep learning network algorithm is a deep learning network algorithm based on a convolutional neural network (CNN).

A method for detecting a distance value for each instance object using a stereo camera according to another aspect of the present invention includes: a step in which a stereo camera image receiver receives left and right images from a stereo camera; a step in which a data extraction unit classifies instance objects in each of the left and right images received by the stereo camera image receiver using a preset object classification algorithm, and extracts the data to be used by deep learning that estimates a distance value for each instance object; and a step in which a deep learning unit, using a preset deep learning network algorithm, concatenates related items of the data extracted by the data extraction unit, compresses them into a single image form, and generates a two-channel array containing class information for each instance object and the distance value between each instance object and the own vehicle.

In the step of extracting the data for the deep learning that estimates a distance value for each instance object, the data extraction unit detects, in each of the left and right images received by the stereo camera image receiver, bounding boxes and a per-instance mask segmentation indicating, for each pixel, whether it belongs to a specific instance object, divides the detected per-instance mask segmentation into regions, calculates the origin of each bounding box, and extracts class information for each instance object.

In the step of generating the two-channel array containing class information for each instance object and the distance value between each instance object and the own vehicle, the deep learning unit: lists the instance objects corresponding to each of the categories of mask segmentation region, bounding box origin, and per-object class information for the left and right images, structures them in array form, concatenates them, and converts them into input data for deep learning; encodes the converted input data, separately for each of those categories, by feeding it into a preset deep learning network algorithm; concatenates the encoded per-category data and compresses it into a single image form; and feeds the data of the compressed image into a preset deep learning network algorithm again to generate the two-channel array containing class information for each instance object and the distance value between each instance object and the own vehicle.

In the step of extracting the data for the deep learning that estimates a distance value for each instance object, the data extraction unit detects bounding boxes in each of the left and right images received by the stereo camera image receiver using a preset object classification algorithm, extracts the feature map at the location of each detected bounding box by aligning it with the RoIAlign method, and, from the extracted feature map, classifies the class of the object while simultaneously detecting the mask of the object.

The apparatus and method for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention detect the distance value between the own vehicle and surrounding instance objects by applying deep learning to data detectable in the stereo camera images, which can improve accuracy. Because no depth map is computed, the amount of computation and the required hardware storage capacity are reduced and speed is increased, which improves performance.

In addition, the apparatus and method according to an embodiment of the present invention can be implemented easily with only a simple array-based data structure and a CNN algorithm, which improves convenience.

In addition, with the apparatus and method according to an embodiment of the present invention, the reduced computational load sharply lowers the number of FLOPs required and allows a smaller computing chip, so the target board can be made smaller and lighter; the reduced part count and weight in turn lower cost.

FIG. 1 is a diagram for explaining a conventional method of calculating the distance to each object using a stereo camera.
FIG. 2 is a block diagram of an apparatus for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention.
FIG. 3 is a flowchart for explaining a method for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining data extraction in the apparatus for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention.
FIGS. 5a and 5b are diagrams for explaining the deep learning process of the apparatus for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention.

Hereinafter, an apparatus and method for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention will be described with reference to the accompanying drawings. In the drawings, the thickness of lines and the sizes of components may be exaggerated for clarity and convenience of description.

In addition, the terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or customary practice of users and operators. These terms should therefore be defined on the basis of the content of this specification as a whole.

FIG. 2 is a block diagram of an apparatus for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention, FIG. 4 is a diagram for explaining data extraction in that apparatus, and FIGS. 5a and 5b are diagrams for explaining its deep learning process. The apparatus is described below with reference to these figures.

As shown in FIG. 2, the apparatus for detecting a distance value for each instance object using a stereo camera according to an embodiment of the present invention includes a stereo camera image receiver (10), a data extraction unit (20), and a deep learning unit (30).

First, according to one embodiment, the present invention identifies objects around the own vehicle (for example, ahead of it) and calculates the distance to each object in an environment that uses a stereo camera.

Accordingly, the stereo camera image receiver (10) receives the left image and the right image captured by the stereo camera, and can transmit the received left and right images to the data extraction unit (20) and to other in-vehicle ECUs and the like.

The data extraction unit (20) includes a mask segmentation generator (22) and a region extraction unit (24). It classifies instance objects in each of the left and right images received by the stereo camera image receiver using a preset object classification algorithm, and extracts the data to be used by deep learning that estimates a distance value for each instance object.

In other words, the data extraction unit (20) can detect bounding boxes in each of the left and right images received by the stereo camera image receiver (10) using a preset object classification algorithm, extract the feature map at the location of each detected bounding box by aligning it with the RoIAlign method, and, from the extracted feature map, classify the class of the object while simultaneously detecting the mask of the object.

Here, a bounding box is a rectangular box that fully contains an instance object in the image. It is generally defined by the position coordinates of its origin or of the two extreme points of its diagonal, and the origin and the diagonal extremes can easily be derived from each other.
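The relationship between the origin and the diagonal extremes can be illustrated with a short sketch. It assumes, purely for illustration, that the origin is the center of the box and that the box width and height are carried along with it; the source does not state which point serves as the origin, and the names `BBox`, `origin_from_corners`, and `corners_from_origin` are hypothetical.

```python
from typing import NamedTuple, Tuple


class BBox(NamedTuple):
    """Axis-aligned box given by the two extreme points of its diagonal."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def origin_from_corners(box: BBox) -> Tuple[float, float, float, float]:
    """Return (cx, cy, width, height), taking the box center as the origin (assumption)."""
    width = box.x_max - box.x_min
    height = box.y_max - box.y_min
    return (box.x_min + width / 2, box.y_min + height / 2, width, height)


def corners_from_origin(cx: float, cy: float, width: float, height: float) -> BBox:
    """Inverse conversion: recover the diagonal extremes from the origin and size."""
    return BBox(cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


box = BBox(10.0, 20.0, 50.0, 60.0)
origin = origin_from_corners(box)
print(origin)
print(corners_from_origin(*origin) == box)
```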

As shown in FIG. 4, the mask segmentation generator (22) detects, in each of the left and right images received by the stereo camera image receiver (10), bounding boxes and a per-instance mask segmentation indicating, for each pixel, whether it belongs to a specific instance object. In FIG. 4, (a) is the region resulting from mask segmentation, and (b) is the class information of the detected object and the coordinates of the bounding box origin in the image.

Meanwhile, mapping each pixel of an image to a predefined class is called segmentation, and the image formed by grouping the segmented pixels that belong to the same instance or the same class is called a mask segmentation. More precisely, grouping by instance gives an instance mask segmentation and grouping by class gives a class mask segmentation. Because this embodiment is concerned with recognizing instance objects, only instance mask segmentation applies, and it is referred to simply as mask segmentation.
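The difference between a class mask and an instance mask can be sketched on a toy label map. The 3x4 maps, the class names, and the instance ids below are invented for illustration; id 0 is assumed to mean "no instance".

```python
# Per-pixel class labels (segmentation result) for a toy 3x4 image.
class_map = [
    ["sky", "sky", "sky",  "sky"],
    ["car", "car", "sky",  "car"],
    ["car", "car", "road", "car"],
]

# Per-pixel instance ids; 0 means the pixel belongs to no instance object.
instance_map = [
    [0, 0, 0, 0],
    [1, 1, 0, 2],
    [1, 1, 0, 2],
]


def class_mask(labels, name):
    """Binary mask of every pixel of one class (class mask segmentation)."""
    return [[1 if label == name else 0 for label in row] for row in labels]


def instance_mask(ids, instance_id):
    """Binary mask of the pixels of one instance (instance mask segmentation)."""
    return [[1 if value == instance_id else 0 for value in row] for row in ids]


print(class_mask(class_map, "car"))    # both cars merged into one mask
print(instance_mask(instance_map, 1))  # only the first car
print(instance_mask(instance_map, 2))  # only the second car
```

The class mask merges both cars, whereas the instance masks keep them apart, which is what a per-object distance estimate requires.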

The region extraction unit (24) divides the per-instance mask segmentation detected by the mask segmentation generator (22) into regions, calculates the origin of each bounding box, and extracts class information for each instance object.

Meanwhile, in this embodiment, the preset object classification algorithm is a region-based deep learning algorithm including Mask R-CNN (mask region-based convolutional neural network).

Here, Mask R-CNN is one of the algorithms that recognize instance objects in the image plane. It proposes a large number of bounding boxes (proposal-based) and, within each proposed bounding box, uses a deep learning algorithm based on a convolutional neural network (CNN) to mark the most prominent instance object pixel by pixel (pixel-wise). If the result is correct, all pixels in the image that belong to the instance object will have been found.

In addition, for each proposed bounding box (also called an RoI), Mask R-CNN extracts the feature map at that location using the RoIAlign method and, from the extracted feature map, classifies the class of the object while simultaneously obtaining the mask of the object, so that multiple objects can each be recognized as separate instances. That is, Mask R-CNN is an algorithm characterized by RoIAlign and by decoupled per-class masks, and it need not perform object detection and semantic segmentation as separate tasks. RoIAlign aligns the feature map and uses values corrected by an interpolation method such as bilinear interpolation. In other words, instead of rounding coordinates, RoIAlign uses bilinear interpolation so that the RoI region of the feature map is aligned accurately.
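The bilinear interpolation that RoIAlign relies on can be sketched as follows. This is only the sampling step at one fractional location, not a full RoIAlign implementation; the function name `bilinear_sample` and the 2x2 feature map are hypothetical.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a fractional (y, x) location without rounding."""
    height, width = len(feature), len(feature[0])
    # Clamp the sampling point to the valid range of the map.
    y = min(max(y, 0.0), height - 1)
    x = min(max(x, 0.0), width - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, height - 1), min(x0 + 1, width - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


feature_map = [
    [1.0, 2.0],
    [3.0, 4.0],
]
print(bilinear_sample(feature_map, 0.5, 0.5))   # centre of the four cells
print(bilinear_sample(feature_map, 0.25, 0.0))  # a quarter of the way down column 0
```

Snapping the same coordinates to the nearest cell would return one of the four stored values and lose the sub-pixel position, which is the misalignment that interpolation avoids.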

In other words, when Mask R-CNN predicts the mask of an object in an RoI, it can predict it independently for each class. That is, if there are k classes, k masks are predicted, and training can use only the loss of the mask channel that corresponds to the ground-truth class. Conventional segmentation networks assume that all classes are mutually exclusive when predicting which class a pixel belongs to, so all classes compete for each pixel.

In Mask R-CNN, however, predicting the class of the object in an RoI is decoupled from predicting the mask of that object. As in other networks, class prediction assumes mutual exclusion, so the classes compete with one another; but when the mask is predicted, k masks are predicted for the object and the loss for each is computed independently, so the k masks need not compete with one another.
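The independence of the k mask channels can be sketched with a per-pixel sigmoid and a binary cross-entropy loss evaluated only on the ground-truth class channel. The logits, the 2x2 mask size, and the function names are invented for illustration; this is a simplified sketch of the idea, not the training code of any particular implementation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(p, target, eps=1e-12):
    """Per-pixel binary cross-entropy between probability p and a 0/1 target."""
    return -(target * math.log(p + eps) + (1 - target) * math.log(1 - p + eps))


def mask_loss(mask_logits, gt_mask, gt_class):
    """Average loss over the pixels of the ground-truth class channel only.

    mask_logits holds k masks (one 2-D list per class); the other k - 1
    channels are ignored, so the masks do not compete with each other.
    """
    logits = mask_logits[gt_class]
    losses = [
        binary_cross_entropy(sigmoid(z), t)
        for row_z, row_t in zip(logits, gt_mask)
        for z, t in zip(row_z, row_t)
    ]
    return sum(losses) / len(losses)


gt_mask = [[1, 0], [1, 1]]
logits_a = [
    [[0.0, 0.0], [0.0, 0.0]],    # class 0 channel
    [[2.0, -2.0], [1.5, 3.0]],   # class 1 channel (ground-truth class)
    [[0.0, 0.0], [0.0, 0.0]],    # class 2 channel
]
# Same class 1 channel, very different predictions for classes 0 and 2.
logits_b = [
    [[9.0, 9.0], [9.0, 9.0]],
    [[2.0, -2.0], [1.5, 3.0]],
    [[-9.0, 9.0], [-9.0, 9.0]],
]
print(mask_loss(logits_a, gt_mask, gt_class=1) == mask_loss(logits_b, gt_mask, gt_class=1))
```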

Next, the deep learning unit (30) includes a combination structuring unit (32), an encoding unit (34), an image compression unit (36), and an array generator (38). As shown in FIGS. 5a and 5b, it uses a preset deep learning network algorithm to concatenate related items of the data extracted by the data extraction unit, compress them into a single image form, and generate a two-channel array containing class information for each instance object and the distance value between each instance object and the own vehicle.
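One way to picture the two-channel output is as one row per instance object holding a class channel and a distance channel. The layout, the class ids, the class names, and the distance values below are assumptions made for illustration; the source states only that the array has two channels carrying class information and the distance value.

```python
# Hypothetical mapping from class id to class name.
CLASS_NAMES = {0: "car", 1: "pedestrian", 2: "bicycle"}

# Hypothetical network output: one row per instance object,
# channel 0 = class id, channel 1 = distance to the own vehicle in metres.
output_array = [
    [0, 23.5],
    [1, 8.2],
    [0, 41.0],
]


def decode(array):
    """Turn the two-channel rows into (class name, distance) pairs."""
    return [(CLASS_NAMES[int(class_id)], float(distance)) for class_id, distance in array]


def nearest(array):
    """Return the instance object closest to the own vehicle."""
    return min(decode(array), key=lambda item: item[1])


print(decode(output_array))
print(nearest(output_array))
```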

First, the combination structuring unit 32 lists the instance objects corresponding to each category (the mask segmentation regions, the bounding box origins, and the per-object class information) for each of the left and right images, structures them into arrays, concatenates them, and converts the result into input data for deep learning.

In this embodiment, concatenation is the term used in deep learning for stacking multi-dimensional (n-dimensional) or multi-channel arrays directly along one particular channel (axis). Consequently, concatenation is possible only when the arrays have identical sizes along every channel except that one. For example, concatenating a 5*5*3 array with a 5*5*10 array stacks them along the third channel to give a 5*5*13 array, whereas a 5*5*3 array and a 5*3*10 array cannot be concatenated because their sizes differ along two or more channels.
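The shape rule above can be checked with NumPy. This is only an illustration that reuses the sizes from the example in the text; the variable names are arbitrary.

```python
import numpy as np

a = np.zeros((5, 5, 3))
b = np.zeros((5, 5, 10))

# The sizes match on every axis except the third, so stacking along it works.
stacked = np.concatenate([a, b], axis=2)
print(stacked.shape)  # (5, 5, 13)

# A 5*3*10 array differs from the 5*5*3 array on two axes, so this fails.
c = np.zeros((5, 3, 10))
try:
    np.concatenate([a, c], axis=2)
    mismatch_ok = True
except ValueError:
    mismatch_ok = False
print(mismatch_ok)  # False
```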

The encoding unit 34 then encodes the input data converted by the combination structuring unit 32 by feeding it, separately for each category (the mask segmentation regions, the bounding box origins, and the per-object class information), into a preset deep learning network algorithm.

The image compression unit 36 can then concatenate the per-category data encoded by the encoding unit 34 and compress it into a single image form.

Finally, the array generation unit 38 can feed the data of the image compressed by the image compression unit 36 into a preset deep learning network algorithm again to generate a two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle. That is, the array generation unit 38 ultimately detects the class information for each instance object and the distance value between that instance object and the host vehicle.
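One possible layout of such a two-channel array is sketched below. The text does not specify the layout, so the shape (number of instances, 2), the use of integer class indices, the unit of metres, and all values are assumptions made purely for illustration.

```python
import numpy as np

# Hypothetical network outputs for three detected instance objects.
class_ids = np.array([0, 2, 1])               # assumed class indices
distances_m = np.array([12.5, 30.0, 7.25])    # assumed distances to the host vehicle

# Channel 0 holds the class information, channel 1 the distance value.
out = np.stack([class_ids.astype(np.float64), distances_m], axis=-1)
print(out.shape)  # (3, 2)
```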

In this embodiment, the preset deep learning network algorithm is a deep learning network algorithm based on a convolutional neural network (CNN).

CNN stands for convolutional neural network. A CNN is defined through basic deep learning building blocks such as convolution filters, pooling, and relu/tanh/absolute/sigmoid functions, and there can be tens of thousands of variants depending on how these blocks are combined. In a CNN, pooling merges data that carry related information from the image and reduces the size of the array; because the features are compressed in this way, pooling is sometimes also called encoding.
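The size reduction caused by pooling can be seen in a small NumPy sketch. The 2*2 max pooling with stride 2 used here is one common choice and is an assumption; the text does not state which pooling the embodiment uses.

```python
import numpy as np

def max_pool_2x2(x):
    """2*2 max pooling with stride 2 on a single-channel (H, W) array."""
    h, w = x.shape
    assert h % 2 == 0 and w % 2 == 0, "this sketch assumes even height and width"
    # Group the pixels into 2*2 blocks and keep the maximum of each block.
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature = np.arange(16, dtype=np.float64).reshape(4, 4)
pooled = max_pool_2x2(feature)
print(pooled.shape)  # (2, 2)
print(pooled)        # [[ 5.  7.] [13. 15.]]
```

Each output value summarizes a 2*2 neighbourhood, so the 4*4 input is compressed to 2*2.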

The description above concerns the forward pass of the apparatus for detecting distance values per instance object using a stereo camera; a backward pass may be run to train the CNN.

That is, in this embodiment, the weights of the CNN can be trained by taking ground-truth distances between real objects and the host vehicle and computing their difference from the distance information predicted above as the loss. The ground truth can be obtained, separately from the stereo camera, through a lidar sensor (not shown) or a high-precision distance sensor (not shown). An L2 loss or the like may be used for the loss computation. In the backward pass, training may include Mask R-CNN, or may cover only the CNN network that follows Mask R-CNN. If Mask R-CNN is not trained, its weights can be excluded from the update during training.
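The training idea above can be sketched with a toy model. Everything in the sketch is an assumption for illustration: a linear model stands in for the CNN, the parameter frozen_scale stands in for the Mask R-CNN weights that are left untrained, head_w stands in for the trainable network after Mask R-CNN, and the ground-truth distances are synthetic rather than lidar measurements.

```python
import numpy as np

def l2_loss(pred, target):
    """Mean squared (L2) difference between predicted and ground-truth distances."""
    return float(np.mean((pred - target) ** 2))

rng = np.random.default_rng(0)
features = rng.normal(size=(32, 4))             # hypothetical per-object features
gt_distance = features @ np.array([1.0, -2.0, 0.5, 3.0])  # synthetic ground truth

params = {"frozen_scale": np.ones(4), "head_w": np.zeros(4)}
trainable = {"head_w"}                          # frozen_scale is excluded from updates

def predict(p, x):
    return (x * p["frozen_scale"]) @ p["head_w"]

initial_loss = l2_loss(predict(params, features), gt_distance)

lr = 0.1
for _ in range(200):
    err = predict(params, features) - gt_distance
    # Gradients of the L2 loss with respect to both parameter groups.
    grads = {
        "head_w": 2.0 * (features * params["frozen_scale"]).T @ err / len(err),
        "frozen_scale": 2.0 * (features * params["head_w"]).T @ err / len(err),
    }
    # Only the parameters listed as trainable are updated.
    for name in trainable:
        params[name] -= lr * grads[name]

final_loss = l2_loss(predict(params, features), gt_distance)
print(final_loss < initial_loss)
```

The gradient for frozen_scale is computed but never applied, which mirrors excluding the untrained weights from the update.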

FIG. 3 is a flowchart illustrating a method for detecting distance values per instance object using a stereo camera according to an embodiment of the present invention. The method is described below with reference to it.

First, the stereo camera image receiving unit 10 receives left and right images from the stereo camera (S10).

Next, the data extraction unit 20 detects, in each of the left and right images received by the stereo camera image receiving unit 10, a bounding box and a per-instance mask segmentation that indicates, pixel by pixel, whether the pixel belongs to a specific instance object (S20).

The data extraction unit 20 then divides the per-instance mask segmentation detected in step S20 into regions, calculates the origin of each bounding box, and extracts class information for each instance object (S30).
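A minimal sketch of an origin calculation follows. The text does not define the origin, so taking it to be the top-left corner of the box is an assumption, and deriving the box from a binary mask (rather than taking it from the detector) is done only to keep the example self-contained. The function name bbox_from_mask is hypothetical.

```python
import numpy as np

def bbox_from_mask(mask):
    """Return (x, y, width, height) of the tightest box around a binary mask.

    (x, y) is the top-left corner, which this sketch assumes to be the origin.
    Returns None when the mask contains no foreground pixel.
    """
    ys, xs = np.nonzero(mask)
    if ys.size == 0:
        return None
    x_min, x_max = int(xs.min()), int(xs.max())
    y_min, y_max = int(ys.min()), int(ys.max())
    return (x_min, y_min, x_max - x_min + 1, y_max - y_min + 1)

# Hypothetical 6*8 instance mask covering rows 2..4 and columns 3..6.
mask = np.zeros((6, 8), dtype=bool)
mask[2:5, 3:7] = True

box = bbox_from_mask(mask)
origin = box[:2]
print(box)     # (3, 2, 4, 3)
print(origin)  # (3, 2)
```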

That is, the data extraction unit 20 can classify instance objects in each of the left and right images received by the stereo camera image receiving unit 10 using a preset object classification algorithm, and can extract the data used by deep learning to obtain a distance value for each instance object.

Here, Mask R-CNN is one of the algorithms for recognizing instance objects in the image plane. It proposes a large number of bounding boxes (proposal-based) and, within each proposed bounding box, uses a CNN (convolutional neural network)-based deep learning algorithm to represent the most salient instance object pixel-wise.

Next, the deep learning unit 30 lists the instance objects corresponding to each category (the mask segmentation regions, the bounding box origins, and the per-object class information) for each of the left and right images, structures them into arrays, concatenates them, and converts the result into input data for deep learning (S40).

The deep learning unit 30 then encodes the input data converted in step S40 by feeding it, separately for each category (the mask segmentation regions, the bounding box origins, and the per-object class information), into a preset deep learning network algorithm (S50).

Next, the deep learning unit 30 concatenates the per-category data encoded in step S50 and compresses it into a single image form (S60), then feeds the data of the compressed image into a preset deep learning network algorithm again to generate a two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle (S70).

That is, through a preset deep learning network algorithm, the deep learning unit 30 concatenates related items of the data extracted by the data extraction unit, compresses them into a single image form, and generates a two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle. In this way it ultimately detects the class information for each instance object and the distance value between that instance object and the host vehicle.

CNN stands for convolutional neural network. A CNN is defined through basic deep learning building blocks such as convolution filters, pooling, and relu/tanh/absolute/sigmoid functions, and there can be tens of thousands of variants depending on how these blocks are combined.

As described above, the apparatus and method for detecting distance values per instance object using a stereo camera according to an embodiment of the present invention apply data detectable in the stereo camera images to deep learning in order to detect the distance between the host vehicle and surrounding instance objects, which can improve accuracy. Because no depth map is computed, they can also reduce the amount of computation and the hardware storage required, and can improve performance through higher speed.

The present invention has been described with reference to the embodiments shown in the drawings, but these are merely exemplary, and those of ordinary skill in the art to which the technology belongs will understand that various modifications and other equivalent embodiments are possible.

Therefore, the true technical scope of protection of the present invention should be determined by the claims below.

10: Stereo camera image receiving unit
20: Data extraction unit
22: Mask segmentation generation unit
24: Region extraction unit
30: Deep learning unit
32: Combination structuring unit
34: Encoding unit
36: Image compression unit
38: Array generation unit

Claims

An apparatus for detecting distance values per instance object using a stereo camera, the apparatus comprising:
a stereo camera image receiving unit that receives left and right images from a stereo camera;
a data extraction unit that classifies instance objects in each of the left and right images received by the stereo camera image receiving unit using a preset object classification algorithm and extracts data for deep learning to obtain a distance value for each instance object, the data extraction unit including a mask segmentation generation unit that detects a bounding box and a per-instance mask segmentation in each of the left and right images, and a region extraction unit that divides the detected per-instance mask segmentation into regions, calculates the origin of the bounding box, and extracts class information for each instance object; and
a deep learning unit that, through a preset deep learning network algorithm, concatenates related items of the data extracted by the data extraction unit, compresses them into a single image form, and generates a two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle,
wherein the deep learning unit includes a combination structuring unit that lists the instance objects corresponding to each category of the mask segmentation regions, the bounding box origins, and the class information for each instance object for each of the left and right images, structures them into arrays, concatenates them, and converts the result into input data for deep learning.

(Deleted)

The apparatus according to claim 1,
wherein the deep learning unit further includes:
an encoding unit that encodes the input data converted by the combination structuring unit by feeding it, for each category of the mask segmentation regions, the bounding box origins, and the class information for each object, into a preset deep learning network algorithm;
an image compression unit that concatenates the per-category data encoded by the encoding unit and compresses it into a single image form; and
an array generation unit that feeds the data of the image compressed by the image compression unit into a preset deep learning network algorithm again to generate the two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle.

The apparatus according to claim 1,
wherein the data extraction unit detects a bounding box in each of the left and right images received by the stereo camera image receiving unit using the preset object classification algorithm, extracts the feature map at the position corresponding to the detected bounding box in aligned form using the RoIAlign method, and, from the extracted feature map, classifies the class of the object and simultaneously detects the mask of the object.

The apparatus according to claim 1,
wherein the preset object classification algorithm is a region-based deep learning algorithm including Mask R-CNN (mask region-based convolutional neural network).

The apparatus according to claim 1,
wherein the preset deep learning network algorithm is a deep learning network algorithm based on a convolutional neural network (CNN).

A method for detecting distance values per instance object using a stereo camera, the method comprising:
receiving, by a stereo camera image receiving unit, left and right images from a stereo camera;
extracting, by a data extraction unit, data for deep learning to obtain a distance value for each instance object, by classifying instance objects in each of the left and right images received by the stereo camera image receiving unit using a preset object classification algorithm; and
generating, by a deep learning unit, through a preset deep learning network algorithm, a two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle, by concatenating related items of the data extracted by the data extraction unit and compressing them into a single image form,
wherein, in the extracting of the data, the data extraction unit detects a bounding box and a per-instance mask segmentation in each of the left and right images, divides the detected per-instance mask segmentation into regions, calculates the origin of the bounding box, and extracts class information for each instance object, and
wherein, in the generating of the two-channel array, the deep learning unit lists the instance objects corresponding to each category of the mask segmentation regions, the bounding box origins, and the class information for each instance object for each of the left and right images, structures them into arrays, concatenates them, and converts the result into input data for deep learning.

(Deleted)

The method according to claim 7,
wherein, in the generating of the two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle, the deep learning unit encodes the converted input data by feeding it, for each category of the mask segmentation regions, the bounding box origins, and the class information for each object, into a preset deep learning network algorithm, concatenates the encoded per-category data and compresses it into a single image form, and feeds the data of the compressed image into a preset deep learning network algorithm again to generate the two-channel array containing class information for each instance object and the distance value between the instance object and the host vehicle.

The method according to claim 7,
wherein, in the extracting of the data for deep learning to obtain a distance value for each instance object, the data extraction unit detects a bounding box in each of the left and right images received by the stereo camera image receiving unit using the preset object classification algorithm, extracts the feature map at the position corresponding to the detected bounding box in aligned form using the RoIAlign method, and, from the extracted feature map, classifies the class of the object and simultaneously detects the mask of the object.

The method according to claim 7,
wherein the preset object classification algorithm is a region-based deep learning algorithm including Mask R-CNN (mask region-based convolutional neural network).

The method according to claim 7,
wherein the preset deep learning network algorithm is a deep learning network algorithm based on a convolutional neural network (CNN).