KR102538231B1

KR102538231B1 - Method for 3D analysis of semantic segmentation, and computer program recorded on record-medium for executing method thereof

Info

Publication number: KR102538231B1
Application number: KR1020230030193A
Authority: KR
Inventors: 이윤선; 오승진; 송광호
Original assignee: 주식회사 인피닉
Priority date: 2023-03-07
Filing date: 2023-03-07
Publication date: 2023-05-31

Abstract

The present invention proposes a computer program recorded on a recording medium for executing a semantic segmentation method. The computer program may be combined with a computing device comprising a memory, a transceiver, and a processor that processes instructions resident in the memory. The computer program may be a computer program recorded on a recording medium for executing: a step in which the processor receives point cloud data acquired from a lidar and an image captured at the same time through a camera; a step in which the processor preprocesses the received point cloud data and image; and a step in which the processor independently extracts features from each of the preprocessed point cloud data and image and infers a semantic segmentation by fusing the extracted features.

Description

Method for 3D analysis of semantic segmentation, and computer program recorded on record-medium for executing method thereof

The present invention relates to the processing of data for artificial intelligence (AI) training. More specifically, it relates to a method for 3D analysis of semantic segmentation, which infers a 2D semantic segmentation from fused data obtained through sensor fusion of a camera and a lidar and reconstructs the inference result in 3D, and to a computer program recorded on a recording medium for executing the method.

Artificial intelligence (AI) refers to technology that uses computer programs to artificially implement some or all of human learning, reasoning, and perception abilities. In the context of AI, machine learning refers to training in which the parameters of a model composed of many parameters are optimized using given data. Depending on the form of the training data, machine learning is classified into supervised learning, unsupervised learning, and reinforcement learning.

In general, the design of AI training data proceeds through the stages of data structure design, data collection, data cleansing, data processing, data expansion, and data verification.

To describe each stage in more detail: data structure design is carried out by defining an ontology, a classification scheme, and the like. Data collection is carried out by gathering data through direct capture, web crawling, or associations and professional organizations. Data cleansing is carried out by removing duplicates from the collected data and de-identifying personal information. Data processing is carried out by performing annotation and entering metadata. Data expansion is carried out by performing ontology mapping and supplementing or extending the ontology as needed. Data verification is carried out by using various verification tools to check validity against a set target quality.

Meanwhile, autonomous driving of a vehicle refers to a system in which the vehicle can make its own judgments and drive itself. Autonomous driving can be divided into progressive stages, from no automation to full automation, according to the degree to which the system is involved in driving and the degree to which the driver controls the vehicle. In general, the stages of autonomous driving are divided into the six levels classified by SAE (Society of Automotive Engineers) International. According to these six levels, Level 0 is no automation, Level 1 is driver assistance, Level 2 is partial automation, Level 3 is conditional automation, Level 4 is high automation, and Level 5 is full automation.

Autonomous driving of a vehicle is performed through the mechanisms of perception, localization, path planning, and control. Several companies are currently working to implement perception and path planning, among these mechanisms, using AI.

Recently, however, as collisions involving autonomous vehicles have occurred frequently, demand for improving the safety of autonomous driving has been growing.

In particular, recent advanced driver assistance systems (ADAS) for autonomous driving, which are built around RGB camera sensors and 2D object detection, have difficulty determining the 3D relative distance or structure of surrounding objects.

To solve this problem, active research has recently been conducted on sensor fusion techniques, which fuse multi-modal data acquired from a 3D sensor such as a lidar and a 2D sensor such as a camera, and on semantic segmentation techniques, which extend the unit of object recognition down to the constituent unit of the data.

However, most existing semantic segmentation studies that use camera and lidar sensor fusion convert the point cloud data into 2D planar data in order to combine the 3D lidar point cloud data with the 2D camera image data.

The planar inference result produced in this way has lost the 3D distance information that is the advantage of lidar, so it is difficult for it to provide 3D information about the driving environment.

Korean Registered Patent Publication No. 10-2073873, 'Semantic Segmentation Method and Apparatus' (registered 2020-01-30)

One object of the present invention is to provide a method for 3D analysis of semantic segmentation, which infers a 2D semantic segmentation from fused data containing 3D information obtained through sensor fusion of a camera and a lidar and reconstructs the inference result in 3D.

Another object of the present invention is to provide a computer program recorded on a recording medium for executing the method for 3D analysis of semantic segmentation, which infers a 2D semantic segmentation from fused data containing 3D information obtained through sensor fusion of a camera and a lidar and reconstructs the inference result in 3D.

The technical objects of the present invention are not limited to those mentioned above, and other technical objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the technical objects described above, the present invention proposes a method for 3D analysis of 2D semantic segmentation. The method may include: a step in which a learning data generation device receives point cloud data acquired from a lidar and an image captured at the same time through a camera; a step in which the learning data generation device generates a 2D semantic segmentation feature map from the point cloud data and the image on the basis of a pre-trained (machine-learned) artificial intelligence (AI); and a step in which the learning data generation device analyzes 3D information from the generated feature map.

Specifically, the method further includes a step of pre-training the artificial intelligence before the generating step.

In the pre-training step, the artificial intelligence is pre-trained on the basis of the following items included in a pre-stored data set: point cloud data acquired from a lidar, an image captured through a camera at the same time as the point cloud data, calibration information between the lidar and the camera, and ground-truth data in which a class label is specified for each 3D point of the point cloud data.
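As an illustration only, one sample of such a pre-stored data set could be represented as follows. The class name `TrainingSample`, its field names, and the 3x4 layout of the calibration matrix are assumptions made for this sketch; the source does not specify a storage format.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]   # (x, y, z) in the lidar frame
Pixel = Tuple[int, int, int]         # (R, G, B)


@dataclass
class TrainingSample:
    """One synchronized lidar/camera capture with per-point labels (hypothetical)."""
    points: List[Point]              # lidar point cloud
    image: List[List[Pixel]]         # camera image, indexed as image[row][col]
    calibration: List[List[float]]   # assumed 3x4 lidar-to-image projection matrix
    point_labels: List[int]          # ground-truth class id for each 3D point

    def __post_init__(self) -> None:
        # The ground truth is defined per 3D point, so the counts must match.
        if len(self.points) != len(self.point_labels):
            raise ValueError("each 3D point needs exactly one class label")
        if len(self.calibration) != 3 or any(len(r) != 4 for r in self.calibration):
            raise ValueError("calibration is assumed to be a 3x4 matrix")
```

Validating the one-label-per-point constraint at construction time reflects the statement that class labels are specified in units of 3D points.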

In the pre-training step, the point cloud data and the ground-truth data are projected onto coordinates on the 2D plane of the image, and pixels whose Manhattan distance from the pixel being projected onto is within a preset value are assigned the same class.
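The projection and label-spreading rule above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the patented implementation: the function name `project_labels`, the 3x4 projection matrix `proj` (assumed to combine the lidar-to-camera calibration with the camera intrinsics), and the use of class 0 as background are all assumptions.

```python
def project_labels(points, labels, proj, height, width, radius=1, background=0):
    """Project labelled 3D points into a 2D label map.

    points : list of (x, y, z) lidar coordinates
    labels : list of int class ids, one per point
    proj   : assumed 3x4 projection matrix (three rows of four floats)
    radius : Manhattan-distance threshold within which a label is spread
    """
    label_map = [[background] * width for _ in range(height)]
    for (x, y, z), cls in zip(points, labels):
        hom = (x, y, z, 1.0)
        u_h, v_h, depth = (sum(r[i] * hom[i] for i in range(4)) for r in proj)
        if depth <= 0:  # the point lies behind the camera
            continue
        u, v = int(round(u_h / depth)), int(round(v_h / depth))
        # Give every pixel within the Manhattan radius the same class.
        for dv in range(-radius, radius + 1):
            for du in range(-radius, radius + 1):
                if abs(du) + abs(dv) > radius:
                    continue
                row, col = v + dv, u + du
                if 0 <= row < height and 0 <= col < width:
                    # Where neighbourhoods overlap, the later point wins.
                    label_map[row][col] = cls
    return label_map
```

Spreading each label over a small diamond-shaped neighbourhood densifies the otherwise sparse projected ground truth; how overlapping neighbourhoods are resolved is not stated in the source, so the last-writer-wins rule here is only one possible choice.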

In the pre-training step, the artificial intelligence is pre-trained using at least one of a plurality of loss functions.

In the pre-training step, the artificial intelligence is pre-trained using the loss function expressed by Equation 1 below.

[Equation 1]

(Here, p_pred is the predicted probability based on cross entropy, and p_true is the ground truth.)

In the pre-training step, the artificial intelligence is pre-trained using the loss function expressed by Equation 2 below.

[Equation 2]

(Here, p_pred is the predicted probability based on the Dice coefficient, and p_true is the ground truth.)

In the pre-training step, pixels belonging to the background class are excluded from the calculation of the loss value regardless of the type of loss function.
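The bodies of Equations 1 and 2 are not reproduced in this text, so the following sketch uses the standard textbook forms of a cross-entropy loss and a Dice loss purely as stand-ins; the patent's actual formulas may differ. It mainly illustrates how background pixels can be dropped before either loss is computed. The function name `masked_losses`, the background class id 0, and the smoothing constant `eps` are assumptions.

```python
import math


def masked_losses(probs, targets, background=0, eps=1e-7):
    """Return (cross_entropy, dice_loss) computed over non-background pixels.

    probs   : list of per-pixel class-probability lists (each sums to 1)
    targets : list of int ground-truth class ids, one per pixel
    """
    # Exclude background pixels regardless of which loss is used.
    kept = [(p, t) for p, t in zip(probs, targets) if t != background]
    if not kept:
        return 0.0, 0.0
    # Cross entropy: mean negative log-probability of the true class.
    ce = -sum(math.log(p[t] + eps) for p, t in kept) / len(kept)
    # Dice: overlap between predicted probabilities and one-hot targets.
    intersection = sum(p[t] for p, t in kept)
    pred_sum = sum(sum(p) for p, _ in kept)
    true_sum = float(len(kept))  # one-hot targets contribute 1 per pixel
    dice = (2.0 * intersection + eps) / (pred_sum + true_sum + eps)
    return ce, 1.0 - dice
```

Because the mask is applied once, before either loss, the exclusion holds "regardless of the type of loss function", as the text requires.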

In the analyzing step, pairs are prepared, each consisting of a 3D coordinate of the point cloud data and the corresponding coordinate on the 2D image plane onto which that point is projected.

In the analyzing step, for each of these 2D plane coordinates, a max pooling operation using a 3×3 kernel centered on that coordinate is applied to the generated feature map.

In the analyzing step, a vector estimating the segment for each coordinate is obtained through the max pooling operation.

In the analyzing step, the class with the highest probability in the obtained vector is predicted as the inferred class for the corresponding 3D coordinate of the point cloud data.
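The four analyzing steps above can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: the feature map is taken to hold one score plane per class, indexed as `feature_map[class][row][col]`, the pooling window is clipped at the image border, and the function name `lift_to_3d` is hypothetical.

```python
def lift_to_3d(feature_map, pairs):
    """Assign a class to each 3D point from a 2D per-class score map.

    feature_map : feature_map[c][row][col], score of class c at a pixel
    pairs       : list of ((x, y, z), (row, col)) coordinate pairs
    Returns a list of ((x, y, z), class_id) tuples.
    """
    height, width = len(feature_map[0]), len(feature_map[0][0])
    results = []
    for xyz, (row, col) in pairs:
        pooled = []  # one pooled score per class: the per-point vector
        for channel in feature_map:
            # 3x3 max pooling centred on (row, col), clipped at the border.
            window = [
                channel[r][c]
                for r in range(max(row - 1, 0), min(row + 2, height))
                for c in range(max(col - 1, 0), min(col + 2, width))
            ]
            pooled.append(max(window))
        # The class with the largest pooled score is the point's prediction.
        best = max(range(len(pooled)), key=pooled.__getitem__)
        results.append((xyz, best))
    return results
```

Pooling over a small window rather than reading a single pixel makes the lookup tolerant of small projection errors between the lidar and the camera, which is one plausible motivation for the 3×3 kernel.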

To achieve the technical objects described above, the present invention also proposes a computer program recorded on a recording medium for executing the method for 3D analysis of 2D semantic segmentation. The computer program may be combined with a computing device comprising a memory, a transceiver, and a processor that processes instructions resident in the memory. The computer program may be a computer program recorded on a recording medium for executing: a step in which the processor receives point cloud data acquired from a lidar and an image captured at the same time through a camera; a step in which the processor generates a 2D semantic segmentation feature map from the point cloud data and the image on the basis of a pre-trained (machine-learned) artificial intelligence (AI); and a step in which the processor analyzes 3D information from the generated feature map.

Specific details of other embodiments are included in the detailed description and the drawings.

According to embodiments of the present invention, a 2D semantic segmentation can be inferred from fused data containing 3D information obtained through sensor fusion of a camera and a lidar, and the inference result can be reconstructed in 3D.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the claims by those of ordinary skill in the art to which the present invention belongs.

FIG. 1 is a configuration diagram showing an artificial intelligence learning system according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram for explaining the configuration of a learning data collection device according to an embodiment of the present invention.
FIG. 3 is a logical configuration diagram of a learning data generation device according to an embodiment of the present invention.
FIG. 4 is a hardware configuration diagram of a learning data generation device according to an embodiment of the present invention.
FIG. 5 is a flowchart for explaining a semantic segmentation and 3D analysis method according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram for explaining a semantic segmentation and 3D analysis method according to an embodiment of the present invention.
FIG. 7 is an exemplary diagram for explaining in detail a semantic segmentation method according to an embodiment of the present invention.
FIG. 8 is an exemplary diagram for explaining a 3D analysis method according to an embodiment of the present invention.
FIG. 9 is an exemplary diagram for explaining the performance of a 3D analysis method according to an embodiment of the present invention.
FIG. 10 is a diagram showing a 3D analysis result of a 2D semantic segmentation according to an embodiment of the present invention.

It should be noted that the technical terms used in this specification are used only to describe specific embodiments and are not intended to limit the present invention. Unless specifically defined otherwise in this specification, the technical terms used herein should be interpreted in the sense generally understood by those of ordinary skill in the art to which the present invention belongs, and should not be interpreted in an excessively broad or excessively narrow sense. If a technical term used in this specification is an incorrect term that fails to express the spirit of the present invention accurately, it should be replaced with a technical term that those skilled in the art can understand correctly. General terms used in the present invention should be interpreted as defined in dictionaries or according to the surrounding context, and should not be interpreted in an excessively narrow sense.

Singular expressions used in this specification include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprises" or "has" should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be absent, and additional components or steps may be included.

Terms including ordinal numbers, such as first and second, may be used in this specification to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly a second component may be named a first component.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no intervening component is present.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. Identical or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. In describing the present invention, a detailed description of related known technology is omitted where it is judged that it could obscure the gist of the invention. It should also be noted that the accompanying drawings are intended only to facilitate understanding of the spirit of the present invention and should not be construed as limiting it. The spirit of the present invention should be construed as extending to all modifications, equivalents, and substitutes beyond the accompanying drawings.

Meanwhile, sensor fusion, which fuses multi-modal data acquired from a 3D sensor such as a lidar and a 2D sensor such as a camera, and semantic segmentation, which extends the unit of object recognition down to the constituent unit of the data, have recently been actively studied.

Sensor fusion is a technique for fusing multi-modal data acquired from heterogeneous sensors. In the field of deep learning, sensor fusion is divided into three types according to the point at which the data or information from multiple sources is fused.

Data-level (early) fusion fuses the raw data acquired from the sources to create fused data with a new representation. The fused data is passed to a neural network, which extracts a feature map and uses it to produce the resulting prediction map. Because the network accepts few kinds of input data, this approach has the advantage that the overall structure of the network is relatively simple.

In contrast, feature-level (deep feature level or mid-level) fusion passes the multi-modal data acquired from the sources to the neural network as separate inputs. The multi-modal features extracted from each input through parallel branches are fused into one inside the network to form deep fused features, which are used to produce the resulting prediction map. This approach has the advantage that, in inferring its result, the network can draw on everything from the individual features extracted from each modality to their fused features.

Decision-level (score level or late) fusion also passes the multi-modal data to neural networks as separate inputs, but unlike feature-level fusion it uses an independent network for each source. The prediction maps produced independently by each network are combined into the final prediction map by simple operations such as summation or averaging, or by a separately trained classifier. This approach has the advantage that the inference result is robust to differences in characteristics between the fused sensors and to mutual interference.
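The difference between the three fusion points can be illustrated with a toy example. The functions below are stand-ins, not real networks: `extract` (a scaling "feature extractor"), `predict` (an averaging "prediction head"), and the branch weights 0.5 and 2.0 are all hypothetical. Only the position of the fusion operation in each pipeline is the point of the sketch.

```python
def extract(x, weight):
    """Stand-in feature extractor: scales every input value."""
    return [weight * v for v in x]


def predict(features):
    """Stand-in prediction head: mean of the features."""
    return sum(features) / len(features)


def early_fusion(lidar, image):
    fused_input = lidar + image              # fuse the raw data first
    return predict(extract(fused_input, 0.5))


def mid_fusion(lidar, image):
    f_lidar = extract(lidar, 0.5)            # parallel branches per modality
    f_image = extract(image, 2.0)
    return predict(f_lidar + f_image)        # features fused inside the network


def late_fusion(lidar, image):
    s_lidar = predict(extract(lidar, 0.5))   # fully independent networks
    s_image = predict(extract(image, 2.0))
    return (s_lidar + s_image) / 2           # fuse the predictions by averaging
```

In the early variant a single extractor sees the concatenated input, in the mid variant the concatenation happens on features, and in the late variant only the two final scores are combined.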

Semantic segmentation is one of the core technologies an autonomous vehicle needs to perceive its driving environment; it classifies the objects in a 2D RGB image acquired from a camera pixel by pixel (dense prediction).

In particular, going beyond the conventional use of deep neural networks built mainly from convolutional layers, a growing number of recent studies improve the performance of 2D semantic segmentation by adding transformer modules, which use self-attention to determine the relative importance of features and can thereby extract more expressive deep features.

With a transformer module, expressive deep features can be extracted even when the network uses relatively few convolutional layers, and methods based on these deep features have shown better performance than earlier methods.

However, as mentioned above, the planar inference results produced by these 2D semantic segmentation methods are not sufficient on their own to determine the structure or relative distance of objects in 3D, the actual driving environment of a vehicle, and using multiple cameras to infer 3D information excessively increases the amount of computation.

In addition, most existing semantic segmentation studies that use camera and lidar sensor fusion convert the point cloud data into 2D planar data in order to combine the 3D lidar point cloud data with the 2D camera image data.

The planar inference result produced in this way has lost the 3D distance information that is the advantage of lidar, so it is difficult for it to provide 3D information about the driving environment.

To overcome these limitations, the present invention proposes a semantic segmentation technique and various means for reconstructing in 3D a 2D semantic segmentation result that is based on fused data containing 3D information obtained through sensor fusion of a camera and a lidar.

FIG. 1 is a configuration diagram showing an artificial intelligence learning system according to an embodiment of the present invention.

As shown in FIG. 1, the artificial intelligence learning system according to an embodiment of the present invention may include a learning data collection device (100), a learning data generation device (200), and an artificial intelligence learning device (300).

These components of the artificial intelligence learning system merely represent functionally distinct elements; in an actual physical environment, two or more components may be implemented in an integrated form, or a single component may be implemented as separate parts.

To describe each component, the learning data collection device (100) is a device that collects data in real time from a lidar and a camera installed in a vehicle, in order to gather data for machine learning of an AI that can be used for autonomous driving. However, the device is not limited to these, and the learning data collection device (100) may also include a radar and an ultrasonic sensor. In addition, the sensors that are controlled by the learning data collection device (100) and are installed in the vehicle to acquire, capture, or sense machine learning data are not limited to one of each type; a plurality of sensors of the same type may be provided.

The types of sensors that are controlled by the learning data collection device (100) and are installed in the vehicle to acquire, capture, or sense machine learning data are described in more detail later with reference to FIG. 2.

As the next component, the learning data generation device (200) may receive, from each of a plurality of learning data collection devices (100) via mobile communication, the data collected in real time by that learning data collection device (100), and may perform annotation on the received data.

The learning data generating device 200 may build up, in advance, big data from which AI learning data can be generated, before a request for AI learning data is received from the artificial intelligence learning device 300.

Characteristically, the learning data generating device 200 may receive point cloud data acquired from a lidar and an image captured by a camera at the same time, preprocess the received point cloud data and image, independently extract features from each of the preprocessed point cloud data and image on the basis of a pre-trained artificial intelligence (AI), and fuse the extracted features to generate a semantic segmentation feature map.

In addition, the learning data generating device 200 may receive point cloud data acquired from a lidar and an image captured by a camera at the same time, generate a 2D semantic segmentation feature map from the point cloud data and the image on the basis of a pre-trained AI, and interpret 3D information from the generated feature map.

Any device may serve as the learning data generating device 200 having these characteristics, provided that it can exchange data with the learning data collection device 100 and the artificial intelligence learning device 300 and perform computations on the exchanged data.

For example, the learning data generating device 200 may be any one of stationary computing devices such as a desktop, a workstation, or a server, but is not limited thereto.

The learning data generating device 200 is described in detail below with reference to FIGS. 3 and 4.

As the next component, the artificial intelligence learning device 300 is a device that can be used to develop an artificial intelligence (AI).

Specifically, the artificial intelligence learning device 300 may transmit to the learning data generating device 200 a request value containing the requirements that the AI learning data must satisfy for the AI to achieve its development purpose. The artificial intelligence learning device 300 may receive AI learning data from the learning data generating device 200. The artificial intelligence learning device 300 may then use the received AI learning data to perform machine learning of the AI under development.

Any device may serve as the artificial intelligence learning device 300, provided that it can exchange data with the learning data generating device 200 and perform computations using the exchanged data. For example, the artificial intelligence learning device 300 may be any one of stationary computing devices such as a desktop, a workstation, or a server, but is not limited thereto.

The one or more learning data collection devices 100, the learning data generating device 200, and the artificial intelligence learning device 300 described above may exchange data over a network that combines one or more of a secure line directly connecting the devices, a public wired communication network, and a mobile communication network.

For example, the public wired communication network may include, but is not limited to, Ethernet, x Digital Subscriber Line (xDSL), Hybrid Fiber Coax (HFC), and Fiber To The Home (FTTH). The mobile communication network may include, but is not limited to, Code Division Multiple Access (CDMA), Wideband CDMA (WCDMA), High Speed Packet Access (HSPA), Long Term Evolution (LTE), and 5th generation mobile telecommunication.

FIG. 2 is an exemplary view for explaining sensors according to an embodiment of the present invention.

As shown in FIG. 2, the learning data collection device 100 according to an embodiment of the present invention may control one or more of a radar 20, a lidar 30, a camera 40, and an ultrasonic sensor 50 fixedly installed on a vehicle 10 to collect basic data for machine learning of an artificial intelligence (AI).

Here, the vehicle 10 is a vehicle on which the radar 20, the lidar 30, the camera 40, and the ultrasonic sensor 50 are installed to collect basic data for machine learning of the AI, and it may be distinguished from a vehicle that drives autonomously under the control of the AI.

The radar 20 is fixedly installed on the vehicle 10, emits electromagnetic waves in the driving direction of the vehicle 10, and senses the electromagnetic waves reflected back by objects located in front of the vehicle 10, thereby generating sensing data corresponding to an image of the area in front of the vehicle 10.

In other words, the sensing data is information about the points that reflected the electromagnetic waves emitted in the driving direction of the vehicle by the radar 20 fixedly installed on the vehicle 10. Accordingly, the coordinates of the points included in the sensing data may have values corresponding to the position and shape of an object located in front of the vehicle 10. The sensing data may be 2D information, but is not limited thereto and may also be 3D information.

The lidar 30 is fixedly installed on the vehicle 10, emits laser pulses around the vehicle 10, and senses the light reflected back by objects located around the vehicle 10, thereby generating 3D point cloud data corresponding to a 3D image of the surroundings of the vehicle 10.

In other words, the 3D point cloud data is 3D information about the points that reflected the laser pulses emitted around the vehicle by the lidar 30 fixedly installed on the vehicle 10. Accordingly, the coordinates of the points included in the 3D point cloud data may have values corresponding to the position and shape of objects located around the vehicle 10.

The camera 40 is fixedly installed on the vehicle 10 and may capture 2D images of the surroundings of the vehicle 10. A plurality of cameras 40 may be installed, spaced apart from one another in a direction horizontal to the ground, so that different directions can be captured. For example, FIG. 2 shows a vehicle 10 on which six cameras 40 capable of capturing six different directions are fixedly installed, but it will be apparent to those of ordinary skill in the art to which the present invention pertains that the number of cameras 40 installed on the vehicle 10 may vary.

In other words, the 2D image is an image captured by the camera 40 fixedly installed on the vehicle 10. Accordingly, the 2D image may include color information of objects located in the direction in which the camera 40 faces.

The ultrasonic sensor 50 is fixedly installed on the vehicle 10, emits ultrasonic waves around the vehicle 10, and senses the sound waves reflected back by objects located close to the vehicle 10, thereby generating distance information corresponding to the distance between the ultrasonic sensor 50 installed on the vehicle 10 and the object. In general, a plurality of ultrasonic sensors 50 may be provided and fixedly installed at the front, rear, front sides, and rear sides of the vehicle 10, where contact with objects is most likely.

In other words, the distance information is information about the distance to an object sensed by the ultrasonic sensor 50 fixedly installed on the vehicle 10.

Hereinafter, the configuration of the learning data generating device 200 described above is explained in more detail.

FIG. 3 is a logical configuration diagram of a learning data generating device according to an embodiment of the present invention.

Referring to FIG. 3, the learning data generating device 200 may include a communication unit 205, an input/output unit 210, a pre-learning unit 215, a data preprocessing unit 220, an inference unit 225, a 3D analysis unit 230, and a storage unit 235.

Because these components of the learning data generating device 200 merely represent functionally distinct elements, two or more components may be implemented as a single integrated unit in an actual physical environment, or one component may be implemented as separate units.

Specifically, the communication unit 205 may receive images and point cloud data for machine learning of the AI from the learning data collection device 100.

The communication unit 205 may also receive, from the learning data collection device 100, a calibration matrix for synchronizing the point cloud data and the image.

In addition, the communication unit 205 may transmit the result of the 3D analysis of the semantic segmentation to the artificial intelligence learning device 300.

As the next component, the input/output unit 210 may receive signals from a user through a user interface (UI) or output computed results to the outside.

Specifically, the input/output unit 210 may receive from the user various setting values for generating the semantic segmentation feature map or for performing the 3D analysis of the generated semantic segmentation feature map, and may output the resulting values.

As the next component, the pre-learning unit 215 may pre-train the AI used to infer 2D semantic segmentation.

Specifically, the pre-learning unit 215 may pre-train the AI on a pre-stored data set that includes point cloud data acquired from a lidar, images captured by a camera at the same time as the point cloud data, calibration information between the lidar and the camera, and ground-truth data in which a class label is specified for each 3D point of the point cloud data.

In doing so, the pre-learning unit 215 may project the point cloud data and the ground-truth data onto coordinates on the 2D plane of the image, and may assign the same class to every pixel whose Manhattan distance from the pixel receiving the projection is within a preset value.
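The label-spreading step above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `spread_labels`, the `(u, v, class_id)` input format, the background id `0`, and the rule that a later point overwrites an earlier one are all hypothetical choices that the source does not specify.

```python
def spread_labels(height, width, projected, radius, background=0):
    """Build a dense 2D label map from sparse projected lidar labels.

    projected: list of (u, v, class_id), with u the column and v the row
    of the pixel that a labelled 3D point projects onto (assumed format).
    radius: preset Manhattan-distance threshold.
    """
    label_map = [[background] * width for _ in range(height)]
    for u, v, cls in projected:
        for dv in range(-radius, radius + 1):
            # |du| + |dv| <= radius defines the Manhattan neighbourhood.
            span = radius - abs(dv)
            for du in range(-span, span + 1):
                row, col = v + dv, u + du
                if 0 <= row < height and 0 <= col < width:
                    # Assumption: later points overwrite earlier ones.
                    label_map[row][col] = cls
    return label_map
```

With `radius=1`, a single projected point labels a plus-shaped patch of five pixels, which densifies the otherwise sparse lidar ground truth on the image plane.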

The pre-learning unit 215 may also pre-train the AI using at least one of a plurality of loss functions.

Here, the pre-learning unit 215 may pre-train the AI using the loss function expressed by Equation 1 below.

[Equation 1]

The pre-learning unit 215 may also pre-train the AI using the loss function expressed by Equation 2 below.

[Equation 2]

Here, the pre-learning unit 215 may exclude pixels belonging to the background class from the computation of the loss value, regardless of the type of loss function.
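Excluding background pixels from the loss can be sketched as a simple mask. The example below uses per-pixel cross-entropy purely as a stand-in, since the loss functions of Equations 1 and 2 are not reproduced in this text. The function name, the flat per-pixel input layout, and the background id `0` are assumptions.

```python
import math


def masked_cross_entropy(probs, labels, background=0):
    """Mean cross-entropy over non-background pixels only.

    probs: list of per-pixel class-probability lists (assumed layout).
    labels: list of per-pixel ground-truth class ids.
    """
    total, count = 0.0, 0
    for pixel_probs, label in zip(probs, labels):
        if label == background:
            continue  # background pixels contribute nothing to the loss
        # Clamp to avoid log(0) for a zero predicted probability.
        total += -math.log(max(pixel_probs[label], 1e-12))
        count += 1
    return total / count if count else 0.0
```

Because the projected lidar labels are sparse, most image pixels carry no label; masking them keeps the unlabelled majority from dominating the gradient.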

As the next component, the data preprocessing unit 220 may preprocess the point cloud data and the image received from the learning data collection device 100.

Specifically, the data preprocessing unit 220 may project the point cloud data onto coordinates on a 2D plane having the same size as the image, on the basis of the calibration matrix received together with the point cloud data and the image.

Here, the calibration matrix may be expressed by Equation 3 below.

[Equation 3]

(Here, u and v are the 2D coordinates of pixels in the image; x, y, and z are the 3D coordinates of the point cloud data; f_u and f_v are the focal lengths in pixel units; and u_0 and v_0 are the x and y coordinates at which the point cloud data is located on the image plane.)
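For orientation, the symbols defined above match the standard pinhole camera model, which can be written as below. This is a generic textbook form offered as an assumption, not a reproduction of the patent's Equation 3: the rotation R, translation t (lidar frame to camera frame), and the depth scale z_c are symbols introduced here for illustration and do not appear in the source. In this standard model, u_0 and v_0 play the role of the image-plane offsets.

```latex
z_c
\begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
=
\begin{bmatrix}
f_u & 0   & u_0 \\
0   & f_v & v_0 \\
0   & 0   & 1
\end{bmatrix}
\begin{bmatrix} R & t \end{bmatrix}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
```

Dividing the first two rows of the right-hand side by the third row yields the pixel coordinates (u, v) of the 3D point (x, y, z).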

As the next component, the inference unit 225 may independently extract features from each of the preprocessed point cloud data and image on the basis of a pre-trained artificial intelligence (AI), and may fuse the extracted features to generate a semantic segmentation feature map.

Specifically, the inference unit 225 may infer the semantic segmentation using an encoder that extracts deep features from the projected point cloud data and the image, and a decoder that interprets the deep features to generate a segmentation inference map.

Here, the inference unit 225 generates the deep features with the encoder, which accepts the projected point cloud data and the image in a multi-modal manner and uses a plurality of convolution blocks arranged in parallel to extract local features and structural features from the projected point cloud data and from the image independently of each other.

That is, the inference unit 225 may extract feature maps by downsampling each of the projected point cloud data and the image in stages, to 1/2 and then 1/4 of the input size, through a first convolution block and a second convolution block.

Next, the inference unit 225 may concatenate the downsampled feature maps through a concatenation layer, and may pass the concatenated feature map through a third convolution block to generate a fused feature map having 1/8 the area of the input size.

Next, the inference unit 225 may divide the generated fused feature map into patches of a preset size, construct from each region a patch embedding that represents that region, and add a position embedding to generate an embedding vector representing each patch.
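The patch-splitting step can be sketched as below. This is a simplified illustration under several assumptions: the feature map has a single channel, each patch embedding is just the flattened patch (a trained model would normally apply a learned linear projection), and the position embeddings are supplied by the caller. The name `patch_embeddings` is hypothetical.

```python
def patch_embeddings(feature_map, patch, position_embeddings):
    """Split a 2D feature map into patch x patch tiles and embed each tile.

    feature_map: list of rows of floats (single channel, assumed).
    position_embeddings: one vector of length patch * patch per tile,
    listed in row-major tile order.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch or width % patch:
        raise ValueError("feature map must divide evenly into patches")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the tile row by row (stand-in for a learned projection).
            flat = [feature_map[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            pos = position_embeddings[len(tokens)]
            tokens.append([x + p for x, p in zip(flat, pos)])
    return tokens
```

Adding the position embedding lets the subsequent transformer modules tell patches apart by location, since self-attention alone is order-agnostic.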

Next, the inference unit 225 may convert the generated embedding vectors into deep feature vectors (fusion-based deep features) through a plurality of consecutive transformer modules.

Next, the inference unit 225 may convert the resulting deep feature vectors, through a reshape layer, into a feature map 1/16 the size of the input.

Next, the inference unit 225 may expand the area of the converted feature map by a preset multiple through a plurality of upsampling layers, and may join, through a joining layer, the feature map at each of the upsampling layers with the corresponding feature map from each stage of the encoder.

Here, the inference unit 225 may pass the feature map upsampled at each of the plurality of upsampling layers through a convolution layer that uses a 3*3 kernel.
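To keep track of the resolutions mentioned in the encoder and decoder description, the following sketch traces the spatial size at each stage. It assumes the fractions 1/2, 1/4, 1/8, and 1/16 apply per side (the text describes the 1/8 stage in terms of area, so this is an interpretation), and it assumes four upsampling stages that each double the side length. Neither the number of stages nor the factor is stated in the source.

```python
def trace_scales(height, width, decoder_stages=4, factor=2):
    """Return (stage name, (rows, cols)) pairs along the encoder/decoder path."""
    def at(divisor):
        return (height // divisor, width // divisor)

    trace = [
        ("conv block 1, per modality", at(2)),
        ("conv block 2, per modality", at(4)),
        ("conv block 3, fused", at(8)),
        ("reshape after transformer modules", at(16)),
    ]
    divisor = 16
    for stage in range(1, decoder_stages + 1):
        # Each upsampling stage enlarges the map by the assumed factor.
        divisor = max(divisor // factor, 1)
        trace.append((f"upsampling stage {stage}", at(divisor)))
    return trace
```

Under these assumptions, the first three upsampling stages land on 1/8, 1/4, and 1/2 of the input, which are the resolutions of encoder blocks 3, 2, and 1, so each decoder feature map has an encoder feature map of matching size to join with.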

As the next component, the 3D analysis unit 230 may interpret 3D information from the generated feature map.

Specifically, the 3D analysis unit 230 may prepare pairs consisting of the 3D coordinates of the point cloud data and the corresponding coordinates on the 2D image plane onto which the point cloud data was projected.

For each coordinate on the 2D plane, the 3D analysis unit 230 may then apply to the generated feature map a max pooling operation that uses a 3*3 kernel centered on that coordinate.

Through the max pooling operation, the 3D analysis unit 230 may obtain, for each coordinate, a vector estimating the segment to which that coordinate belongs.

The 3D analysis unit 230 may then take the class with the highest probability in the obtained vector as the class inferred for the corresponding 3D coordinate of the point cloud data.
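The 3D interpretation steps above (coordinate pairs, 3*3 max pooling, highest-probability class) can be sketched as follows. The function name `lift_to_3d`, the nested-list layout of the score map, and the clipping of the window at image borders are assumptions made for this illustration.

```python
def lift_to_3d(score_map, pairs):
    """Assign a class to each 3D point from a 2D per-pixel score map.

    score_map[row][col] is a list of per-class scores (assumed layout).
    pairs: list of ((x, y, z), (u, v)), with u the column and v the row
    of the pixel that the 3D point projects onto.
    """
    height, width = len(score_map), len(score_map[0])
    n_classes = len(score_map[0][0])
    result = []
    for xyz, (u, v) in pairs:
        pooled = [float("-inf")] * n_classes
        # 3x3 window centred on (u, v), clipped at the image border.
        for row in range(max(v - 1, 0), min(v + 2, height)):
            for col in range(max(u - 1, 0), min(u + 2, width)):
                for k in range(n_classes):
                    pooled[k] = max(pooled[k], score_map[row][col][k])
        # The class with the largest pooled score is the prediction.
        best = max(range(n_classes), key=pooled.__getitem__)
        result.append((xyz, best))
    return result
```

Pooling over a small window rather than reading a single pixel tolerates small misalignments between the projected point and the segmentation output.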

As the next component, the storage unit 235 may store data required for the operation of the learning data generating device 200. The storage unit 235 may store data required for designing data for AI learning.

Hereinafter, the hardware for implementing the logical components of the learning data generating device 200 described above is explained in more detail.

FIG. 4 is a hardware configuration diagram of a learning data generating device according to an embodiment of the present invention.

Referring to FIG. 4, the learning data generating device 200 may include a processor 250, a memory 255, a transceiver 260, an input/output device 265, a data bus 270, and a storage 275.

The processor 250 may implement the operations and functions of the learning data generating device 200 on the basis of instructions from the software 280a resident in the memory 255. The software 280a implementing the method according to the present invention may be loaded into the memory 255. The transceiver 260 may exchange data with the learning data collection device 100 and the artificial intelligence learning device 300.

The input/output device 265 may receive data required for the operation of the learning data generating device 200 and output the generated result values. The data bus 270 is connected to the processor 250, the memory 255, the transceiver 260, the input/output device 265, and the storage 275, and may serve as a path for transferring data between these components.

The storage 275 may store application programming interfaces (APIs), library files, resource files, and the like that are required to execute the software 280a implementing the method according to the present invention. The storage 275 may store the software 280b implementing the method according to the present invention. The storage 275 may also store information required to perform the semantic segmentation method and the 3D analysis method. In particular, the storage 275 may include a database 285 that stores programs for performing the semantic segmentation method and the 3D analysis method.

According to an embodiment of the present invention, the software 280a, 280b resident in the memory 255 or stored in the storage 275 may be a computer program recorded on a recording medium in order to execute the steps of: receiving, by the processor 250, point cloud data acquired from a lidar and an image captured by a camera at the same time; preprocessing, by the processor 250, the received point cloud data and image; and independently extracting, by the processor 250, features from each of the preprocessed point cloud data and image, and fusing the extracted features to infer semantic segmentation.

In addition, according to an embodiment of the present invention, the software 280a, 280b resident in the memory 255 or stored in the storage 275 may be a computer program recorded on a recording medium in order to execute the steps of: receiving, by the processor 250, point cloud data acquired from a lidar and an image captured by a camera at the same time; generating, by the processor 250, a 2D semantic segmentation feature map from the point cloud data and the image on the basis of a pre-trained artificial intelligence (AI); and interpreting, by the processor 250, 3D information from the generated feature map.

More specifically, the processor 250 may include an Application-Specific Integrated Circuit (ASIC), another chipset, a logic circuit, and/or a data processing device. The memory 255 may include read-only memory (ROM), random access memory (RAM), flash memory, a memory card, a storage medium, and/or another storage device. The transceiver 260 may include a baseband circuit for processing wired and wireless signals. The input/output device 265 may include input devices such as a keyboard, a mouse, and/or a joystick; image output devices such as a Liquid Crystal Display (LCD), an Organic LED (OLED) display, and/or an Active Matrix OLED (AMOLED) display; and printing devices such as a printer or a plotter.

When the embodiments included in this specification are implemented in software, the method described above may be implemented as modules (procedures, functions, and so on) that perform the functions described above. The modules may reside in the memory 255 and be executed by the processor 250. The memory 255 may be located inside or outside the processor 250 and may be connected to the processor 250 by various well-known means.

Each component shown in FIG. 4 may be implemented by various means, for example hardware, firmware, software, or a combination thereof. In the case of a hardware implementation, an embodiment of the present invention may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, and the like.

In the case of an implementation by firmware or software, an embodiment of the present invention may be implemented in the form of modules, procedures, functions, and the like that perform the functions or operations described above, and may be recorded on a recording medium readable by various computer means. Here, the recording medium may contain program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of the recording medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM (Compact Disk Read Only Memory) and DVD (Digital Video Disk); magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. Such hardware devices may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

FIG. 5 is a flowchart illustrating a semantic segmentation and 3D analysis method according to an embodiment of the present invention.

Referring to FIG. 5, first, in step S100, the learning data generating device may receive point cloud data and an image from the learning data collecting device. At this time, the learning data generating device may additionally receive a calibration matrix together with the point cloud data and the image.

Next, in step S200, the learning data generating device may preprocess the point cloud data and the image received from the learning data collecting device.

Specifically, based on the calibration matrix received together with the point cloud data and the image, the learning data generating device may project the point cloud data onto coordinates on a two-dimensional plane having the same size as the image.

Next, in step S300, the learning data generating device may, based on a pre-trained (machine-learned) artificial intelligence (AI) model, independently extract features from each of the preprocessed point cloud data and the image, and fuse the extracted features to generate a semantic segmentation feature map.

Specifically, the learning data generating device may infer the semantic segmentation through an encoder that extracts deep features from the projected point cloud data and the image, and a decoder that interprets the deep features to generate a segmentation inference map.

At this time, the learning data generating device generates the deep features through the encoder. It accepts the projected point cloud data and the image in a multi-modal manner, and uses a plurality of convolution blocks arranged in a parallel structure to extract local and structural features from the projected point cloud data and the image independently of each other.

That is, the learning data generating device may extract feature maps by downsampling each of the projected point cloud data and the image stepwise to 1/2 and 1/4 of the input size through a first convolution block and a second convolution block.

In addition, the learning data generating device may concatenate the downsampled feature maps through a concatenation layer (concatenate), and generate, from the concatenated feature map, a fused feature map having 1/8 of the input area through a third convolution block.

The learning data generating device may divide the generated fused feature map into patches of a preset size, construct a patch embedding representing each region, and add a position embedding to generate an embedding vector representing each patch.

The learning data generating device may convert the generated embedding vectors into deep feature vectors (fusion based deep feature) through a plurality of consecutive transformer modules.

The learning data generating device may convert the deep feature vectors into a feature map 1/16 the size of the input through a reshape layer.

The learning data generating device may expand the area of the converted feature map by a preset multiple through a plurality of upsampling layers, while joining, through a joining layer, the feature map at each of the upsampling layers with the corresponding feature map from each stage of the encoder.

Here, the learning data generating device may pass the feature map upsampled at each of the plurality of upsampling layers through a convolution layer using a 3×3 kernel.

Then, in step S400, the learning data generating device may interpret 3D information from the generated feature map.

Specifically, the learning data generating device may prepare pairs consisting of the 3D coordinates of the point cloud data and the 2D plane coordinates onto which the point cloud data were projected.

In addition, for each 2D plane coordinate, the learning data generating device may apply to the generated feature map a max pooling operation using a 3×3 kernel centered on that coordinate.

Through the max pooling operation, the learning data generating device may obtain, for each coordinate, a vector estimating its segment.

Then, the learning data generating device may take the class having the highest probability in the obtained vector as the class inferred for the 3D coordinate of the point cloud data.

Hereinafter, the semantic segmentation and 3D analysis method according to an embodiment of the present invention will be described in more detail with reference to the drawings.

FIG. 6 is an exemplary diagram for explaining a semantic segmentation and 3D analysis method according to an embodiment of the present invention, FIG. 7 is an exemplary diagram for explaining in detail a semantic segmentation method according to an embodiment of the present invention, and FIG. 8 is an exemplary diagram for explaining a 3D analysis method according to an embodiment of the present invention.

Referring to FIGS. 6 to 8, in the data source stage, a 2D RGB image of the area in front of the vehicle and 3D point cloud data of the entire area around the vehicle are acquired from the camera and the lidar sensor, respectively. In addition, because the data acquisition areas of the two sensors differ, the calibration matrix used to align them is acquired as well.

Next, in the data pre-processing stage, the data acquired from the lidar are filtered to select those in the area in front of the vehicle, that is, the same area in which the camera acquires its image.

The (N, 3) data of N front-area point cloud coordinates (x, y, z) selected in this way are converted, based on the calibration matrix obtained earlier, into projected coordinates (u, v) on a 2D plane having the same size (H, W) as the RGB image. As a result, the N 3D coordinates (x, y, z) that make up the front-area point cloud data are converted into a projection image of size (H, W, 1), and this projection image and the RGB image of size (H, W, 3) are passed to the next stage, the inference stage.
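
As a non-limiting illustration, the projection described above can be sketched in Python with NumPy. The function name `project_points`, the assumption that the calibration matrix is a single 3×4 matrix `calib` mapping homogeneous lidar coordinates to homogeneous pixel coordinates, and the choice to store depth when no per-point value is supplied are all hypothetical and are not taken from the original disclosure.

```python
import numpy as np

def project_points(points_xyz, calib, height, width, values=None):
    """Project (N, 3) lidar points to pixels and rasterize an (H, W, 1) image.

    calib is a hypothetical 3x4 matrix: [u*d, v*d, d]^T = calib @ [x, y, z, 1]^T.
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])    # (N, 4) homogeneous coords
    cam = homo @ calib.T                               # (N, 3)
    depth = cam[:, 2]
    in_front = depth > 0                               # drop points behind the camera
    safe_depth = np.where(in_front, depth, 1.0)        # avoid dividing by <= 0
    u = np.floor(cam[:, 0] / safe_depth).astype(int)   # column index
    v = np.floor(cam[:, 1] / safe_depth).astype(int)   # row index
    keep = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    if values is None:
        values = depth                                 # assumption: store depth per pixel
    image = np.zeros((height, width, 1), dtype=np.float32)
    # If several points land on one pixel, the last one written wins.
    image[v[keep], u[keep], 0] = values[keep]
    return np.stack([u, v], axis=1)[keep], keep, image
```

The returned mask `keep` identifies which of the N points fall inside the image, so the 3D coordinates and their (u, v) pairs can be retained for the later post-processing stage.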

In the inference (prediction) stage, 2D semantic segmentation is performed with a deep neural network on the two images passed from the previous stage. The deep neural network used here has an encoder-decoder structure comprising an encoder that extracts deep features from the input and a decoder that interprets the deep features to produce a segmentation inference map.

Here, the encoder extracts local and structural features of the input data using the convolution blocks of "ResNet50", and uses transformer modules to convert the extracted features into more expressive deep features.

Unlike existing methods, however, the neural network is configured to accept an input for the projection image (remission input) and an input for the RGB image (RGB input) at the same time, and the modules in the network are arranged in a parallel structure according to the sensor fusion method, which makes it possible to process multi-modal data.

In addition, the decoder is composed of several neural network layers, including convolution layers, so that the feature maps extracted at each stage of the encoder are joined with the deep features to form a result inference map of size (H, W, C), with the same height and width as the input. Here, C is the total number of classes into which objects can be classified on a per-pixel basis.

Accordingly, as shown in FIG. 7, the neural network accepts multi-modal inputs and uses convolution blocks arranged in a parallel structure to extract the structural and local features of each input independently of the other.

As a result, feature maps with 1/2 and 1/4 of the input area are extracted, and the 1/4-area feature maps are attached to each other by a concatenation layer (concatenate) so that they can be used for feature-stage fusion. The concatenated feature map is then turned into a fused feature map with 1/8 of the input area by the third convolution block.

Next, to pass the fused feature map to the transformer modules, the fused feature map is divided into patches of size (2, 2), a patch embedding representing each region is constructed, and a position embedding is added to produce the embedding vectors that represent each patch. These embedding vectors then pass through a transformer block composed of nine consecutive transformer modules and are converted into deep feature vectors (fusion based deep feature) organized around the important features contained in the fused feature map.
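
As a non-limiting illustration, the patch splitting and embedding step can be sketched with NumPy as follows. The function names, the embedding width `dim`, and the random matrices standing in for the learned projection and position embedding are assumptions made for illustration; the original disclosure does not specify them.

```python
import numpy as np

def split_patches(feature_map, patch=2):
    """Split an (h, w, c) feature map into flattened (patch, patch) patches."""
    h, w, c = feature_map.shape
    if h % patch or w % patch:
        raise ValueError("feature map sides must be divisible by the patch size")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh, gw, patch, patch, c)
    blocks = feature_map.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(gh * gw, patch * patch * c), (gh, gw)

def embed_patches(feature_map, dim=8, patch=2, seed=0):
    """Linear patch embedding plus an additive position embedding."""
    tokens, grid = split_patches(feature_map, patch)
    rng = np.random.default_rng(seed)
    # Random stand-ins for weights that would be learned during training.
    projection = rng.normal(scale=0.02, size=(tokens.shape[1], dim))
    position = rng.normal(scale=0.02, size=(tokens.shape[0], dim))
    return tokens @ projection + position, grid
```

Because each (2, 2) patch of the 1/8-scale fused feature map becomes one token, the token grid has half the side length of the fused map, which is consistent with the 1/16-scale feature map that the decoder later recovers through its reshape layer.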

Next, the decoder first converts the deep feature vectors extracted by the encoder into a feature map 1/16 the size of the input through a reshape layer, passes it through a convolution layer using a 3×3 kernel, and then doubles its area through an upsampling layer.

Even though this expanded feature map contains the main features held in the deep feature vectors, the local and structural characteristics lost through vectorization are not easily recovered by convolution and upsampling alone.

Therefore, a joining layer is used to attach to this feature map the same-size feature map extracted earlier in the encoder, so that the feature map carries both the main features of the deep feature vectors and the local and structural characteristics extracted during encoding.

Next, the joined feature map is mixed by a convolution layer using a 3×3 kernel, expanded again by an upsampling layer, and joined with the same-size feature map from the encoder. Repeating this process two more times restores the feature map to 1/2 of the input size.

Finally, the feature map is upsampled once more and a convolution layer using a 3×3 kernel is applied to produce a feature map of the same size as the input, which is then passed to the segmentation head, a convolution layer using a 1×1 kernel with a softmax activation function, to produce the segmentation inference map.
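
As a non-limiting illustration, the order of decoder operations described above can be traced with a short Python sketch that only tracks feature-map sizes. It assumes that each stated fraction refers to the side length of the feature map and that every upsampling layer doubles that side length; the input size (384, 1248) and the step names are hypothetical.

```python
from fractions import Fraction

def trace_decoder(input_hw=(384, 1248)):
    """List (step, height, width) for each decoder step, sizes only."""
    h, w = input_hw
    scale = Fraction(1, 16)                 # output of the reshape layer
    steps = [("reshape", scale)]
    for _ in range(3):                      # 1/16 -> 1/8 -> 1/4 -> 1/2
        steps.append(("conv3x3", scale))
        scale *= 2
        steps.append(("upsample", scale))
        steps.append(("join_encoder_feature_map", scale))
    scale *= 2                              # final upsampling to full size
    steps.append(("upsample", scale))
    steps.append(("conv3x3", scale))
    steps.append(("segmentation_head_conv1x1_softmax", scale))
    return [(name, int(h * s), int(w * s)) for name, s in steps]
```

Under these assumptions the three joins take place at 1/8, 1/4, and 1/2 of the input size, matching the sizes of the encoder feature maps described earlier.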

In the data post-processing stage, the last stage of the process, the point cloud data are used to reinterpret (reconstruct) the 2D segmentation inference map produced by the neural network as 3D information, as shown in FIG. 8.

First, the N 3D coordinates (x, y, z) used to generate the projection image in the data pre-processing stage are paired with the N 2D plane coordinates (u, v) onto which they were projected. Then, for each 2D coordinate (u, v), a max pooling operation using a 3×3 kernel centered on that coordinate is applied to the (H, W, C) segmentation inference map inferred by the neural network.

In this way, the present invention obtains, for each of the N coordinates, a vector estimating its segment, and the class with the highest probability in that vector is taken as the class inferred for the 3D coordinate (x, y, z).

Accordingly, a semantic segmentation inference result can be obtained for each point of the front-area point cloud data separated in the data pre-processing stage, yielding an output matrix of size (N, C) that holds the segment values of each coordinate.
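
As a non-limiting illustration, the post-processing described above can be sketched with NumPy. The function name `lift_to_3d`, the convention that u is the column index and v is the row index, and the handling of image borders by ignoring cells outside the image are assumptions made for illustration.

```python
import numpy as np

def lift_to_3d(seg_map, uv):
    """Assign a class to each projected point.

    seg_map: (H, W, C) float array of per-pixel class probabilities.
    uv:      (N, 2) integer pixel coordinates, u = column, v = row.
    Returns the (N, C) pooled vectors and the (N,) inferred class indices.
    """
    c = seg_map.shape[2]
    # Pad with -inf so windows at the border ignore cells outside the image.
    padded = np.pad(seg_map, ((1, 1), (1, 1), (0, 0)), constant_values=-np.inf)
    pooled = np.empty((uv.shape[0], c), dtype=seg_map.dtype)
    for i, (u, v) in enumerate(uv):
        # Rows v..v+2 of the padded map are rows v-1..v+1 of the original.
        window = padded[v:v + 3, u:u + 3, :]
        pooled[i] = window.reshape(-1, c).max(axis=0)   # 3x3 max pooling per class
    return pooled, pooled.argmax(axis=1)
```

Row i of the pooled matrix corresponds to the i-th (x, y, z) point, so the (N, C) output can be stored alongside the original 3D coordinates.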

Hereinafter, the performance of the semantic segmentation method and the 3D analysis method according to an embodiment of the present invention will be described.

FIG. 9 is an exemplary diagram for explaining the performance of the 3D analysis method according to an embodiment of the present invention.

Experiment

The experiment was conducted in a hardware environment equipped with an "AMD EPYC 7742 64-Core Processor" class CPU (Central Processing Unit), an "A100 80GB" class GPU (Graphics Processing Unit), and 2 TB of memory.

The software environment for implementing the neural network used "Tensorflow 2.80" and "Keras 2.80" set up under "Python 3.9" installed on the "ubuntu 20.04" operating system.

The transformer modules used in the encoder use the GELU (Gaussian Error Linear Unit) activation function, and all convolution layers in the decoder except the segmentation head use the ReLU activation function. Each convolution layer is followed by a batch normalization layer and a dropout layer to minimize overfitting of the neural network.

The neural network was trained with the Adam (Adaptive Moment Estimation) optimizer. The learning rate started at 0.001 and was multiplied by 0.75 each time training failed to improve for 2 or more epochs.

Training was set to run for at most 500 epochs, with early stopping if the loss value did not decrease for 10 or more consecutive epochs.
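
As a non-limiting illustration, the learning rate decay and early stopping rules described above can be simulated in plain Python. How "failing to improve" is measured is not specified in the original disclosure, so this sketch assumes that an epoch counts as an improvement only when its loss is strictly lower than the best loss seen so far; the function and parameter names are hypothetical.

```python
def simulate_schedule(losses, lr=1e-3, factor=0.75, lr_patience=2,
                      stop_patience=10, max_epochs=500):
    """Return (learning rate used at each epoch, epoch of early stop or None)."""
    best = float("inf")
    since_lr_change = 0     # epochs without improvement since the last decay
    since_best = 0          # consecutive epochs without improvement
    history = []
    for epoch, loss in enumerate(losses[:max_epochs]):
        history.append(lr)  # rate in effect during this epoch
        if loss < best:
            best = loss
            since_lr_change = 0
            since_best = 0
            continue
        since_lr_change += 1
        since_best += 1
        if since_lr_change >= lr_patience:
            lr *= factor    # decay by 0.75
            since_lr_change = 0
        if since_best >= stop_patience:
            return history, epoch
    return history, None
```

In Keras, the `ReduceLROnPlateau` and `EarlyStopping` callbacks provide comparable behavior, although the original disclosure does not state which mechanism was used.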

The dataset used to train and evaluate the semantic segmentation method in an embodiment of the present invention is "semantic KITTI", which is widely used in research on 3D object recognition and semantic segmentation in the autonomous driving field.

The dataset consists of 21 sequences in total, and each sequence contains 2D RGB frame images, 3D point clouds, calibration information between the camera and the lidar sensor that acquired the data, and ground truth (GT) data in which a class label is specified for each 3D point.

In this experiment, the 10 sequences numbered 0 through 10, excluding sequence 8, totaling 19,130 frames, are used as training data, and sequence 8, consisting of 4,071 frames, is used as evaluation data.

However, the dataset contains no ground truth data for 2D semantic segmentation. If ground truth data for 2D semantic segmentation are created by projecting the ground truth labels of the point cloud data in the same way the point cloud data themselves are projected, the result is sparse data in which most pixels have a value of 0, as shown in FIG. 9(a).

Because this severely hinders training of the neural network, the sparsity of the ground truth data was reduced, as shown in FIG. 9(b), by assigning the same class to every pixel within a Manhattan distance of 2 of each pixel onto which a ground truth label is projected.
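
As a non-limiting illustration, the densification of the sparse ground truth can be sketched with NumPy. The original disclosure does not say how a pixel is labeled when it lies within range of two differently labeled pixels, so this sketch assumes the nearest labeled pixel wins; the function name and that tie-breaking rule are hypothetical.

```python
import numpy as np

def densify_labels(sparse, radius=2, background=0):
    """Spread each labeled pixel to all pixels within a Manhattan distance."""
    h, w = sparse.shape
    dense = np.full_like(sparse, background)
    nearest = np.full((h, w), radius + 1)      # distance to the closest label so far
    rows, cols = np.nonzero(sparse != background)
    for r, c in zip(rows, cols):
        for dr in range(-radius, radius + 1):
            span = radius - abs(dr)            # keeps |dr| + |dc| <= radius
            for dc in range(-span, span + 1):
                rr, cc = r + dr, c + dc
                dist = abs(dr) + abs(dc)
                if 0 <= rr < h and 0 <= cc < w and dist < nearest[rr, cc]:
                    nearest[rr, cc] = dist
                    dense[rr, cc] = sparse[r, c]
    return dense
```

With a radius of 2, an isolated labeled pixel away from the image border expands to a diamond of 13 pixels.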

In the experiment, the "Focal loss" function and the "Dice loss" function, expressed by Equations 1 and 2 below, respectively, were used as the loss functions for training the neural network.

[Equation 1]

[Equation 2]

As shown in Equation 1, the Focal loss function computes the loss value as a cross-entropy-based weighted sum between the inferred probability (p_pred) and the ground truth (p_true), and aims to improve inference accuracy on a per-pixel basis.

The Dice loss function, on the other hand, computes the loss value from the Dice-coefficient-based similarity between the inferred probability (p_pred) and the ground truth (p_true), as shown in Equation 2, and aims to improve the overall similarity of the inference.

At this time, pixels belonging to the background class are excluded from the loss calculation regardless of the type of loss function, which prevents the neural network from being trained with a bias toward accurately inferring the background area.
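
As a non-limiting illustration, commonly used forms of the two loss functions, with background pixels masked out, can be sketched with NumPy. Equations 1 and 2 are not reproduced in this text, so the exact forms below, the focusing parameter `gamma`, the smoothing term `eps`, and the use of class index 0 for the background are assumptions and may differ from the equations of the original disclosure.

```python
import numpy as np

def focal_loss(p_pred, p_true, gamma=2.0, background=0, eps=1e-7):
    """Focal loss averaged over non-background pixels.

    p_pred: (H, W, C) predicted probabilities; p_true: (H, W, C) one-hot labels.
    """
    mask = p_true.argmax(axis=-1) != background          # (H, W) foreground pixels
    p = np.clip(p_pred, eps, 1.0)
    # Cross-entropy term weighted by (1 - p)^gamma, summed over classes.
    per_pixel = -(p_true * (1.0 - p) ** gamma * np.log(p)).sum(axis=-1)
    return float(per_pixel[mask].mean()) if mask.any() else 0.0

def dice_loss(p_pred, p_true, background=0, eps=1e-7):
    """One minus the Dice coefficient computed over non-background pixels."""
    mask = (p_true.argmax(axis=-1) != background)[..., None]
    intersection = (p_pred * p_true * mask).sum()
    total = (p_pred * mask).sum() + (p_true * mask).sum()
    return float(1.0 - (2.0 * intersection + eps) / (total + eps))
```

Because the mask is derived from the ground truth, predictions at background pixels have no effect on either loss value in this sketch.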

To evaluate the semantic segmentation performance according to an embodiment of the present invention, MIoU (Mean Intersection over Union) according to Equation 4 below, which is widely used in the semantic segmentation field, was used.

[Equation 4]

As shown in Equation 4, MIoU is the average of the IoU values obtained for each image in an image dataset of N images. IoU compares the class inferred for each pixel of an image with the ground truth of that pixel, and is the number of pixels for which the two agree (TP) divided by the total number of pixels considered (TP+FN+FP).

This indicates the rate at which the inferred values match the ground truth (GT) in which class labels are specified, and can thus represent the accuracy of per-pixel object inference.

As with the loss functions, the class corresponding to background pixels is excluded from the calculation of the evaluation value, which prevents the bias problem in which performance appears high even though the neural network fails to infer the foreground area accurately.
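
As a non-limiting illustration, the metric described above can be sketched with NumPy. Equation 4 is not reproduced in this text, so how classes are aggregated within one image is an assumption: this sketch computes TP / (TP + FP + FN) per foreground class over non-background pixels, averages over the classes that occur in the image, and then averages over the images. The function names and the use of class index 0 for the background are hypothetical.

```python
import numpy as np

def image_iou(pred, gt, num_classes, background=0):
    """IoU of one image, averaged over foreground classes that occur in it."""
    valid = gt != background                   # ignore background pixels
    ious = []
    for cls in range(num_classes):
        if cls == background:
            continue
        p = (pred == cls) & valid
        g = (gt == cls) & valid
        union = (p | g).sum()                  # TP + FP + FN
        if union == 0:
            continue                           # class absent from this image
        ious.append((p & g).sum() / union)     # TP / (TP + FP + FN)
    return float(np.mean(ious)) if ious else float("nan")

def mean_iou(preds, gts, num_classes, background=0):
    """Average of the per-image IoU values over a dataset."""
    scores = [image_iou(p, g, num_classes, background) for p, g in zip(preds, gts)]
    return float(np.nanmean(scores))
```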

Ablation study

This section describes the results of the ablation experiments performed to determine the details of the semantic segmentation method according to an embodiment of the present invention, based on performance evaluations of the neural network structure, the composition of the loss function, and so on.

In the first experiment, to determine the optimal sensor fusion for the proposed method, the fusion method applied to the neural network was divided into data-stage fusion, feature-stage fusion, and inference-stage fusion, each was applied in turn, and the performance with and without sensor fusion was compared.

Here, the neural network with data-stage fusion passes a single fused input of size (H, W, 4), formed by attaching the projection image of the point cloud (X_PCD) and the RGB image (X_RGB) to each other through a concatenation layer, to the encoder as shown in Equation 5 below, and passes the encoder output to the decoder to produce a single 2D segmentation inference map (Y_seg).

[Equation 5]

The neural network with inference-stage fusion, on the other hand, applies independent neural networks with different weights ((θ_en1, θ_de1), (θ_en2, θ_de2)) to the two inputs, as shown in Equation 6 below, to produce two 2D segmentation inference maps, and averages them to produce a single 2D segmentation inference map.

[Equation 6]
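
As a non-limiting illustration, the difference between data-stage fusion and inference-stage fusion can be sketched with NumPy. The encoder-decoder networks are represented by arbitrary callables passed in by the caller; the function names and that representation are assumptions made for illustration and do not reproduce Equations 5 and 6.

```python
import numpy as np

def data_stage_fusion(x_pcd, x_rgb, network):
    """Concatenate the inputs first, then run one network.

    x_pcd: (H, W, 1) projection image; x_rgb: (H, W, 3) RGB image.
    network: callable standing in for one encoder-decoder pair.
    """
    fused = np.concatenate([x_pcd, x_rgb], axis=-1)    # (H, W, 4)
    return network(fused)

def inference_stage_fusion(x_pcd, x_rgb, network_pcd, network_rgb):
    """Run one independent network per input and average the two outputs."""
    return (network_pcd(x_pcd) + network_rgb(x_rgb)) / 2.0
```

Feature-stage fusion sits between these two extremes: each input passes through its own early blocks, and the intermediate feature maps are concatenated before the remaining shared blocks.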

For feature-stage fusion, fusion was performed at progressively deeper blocks, starting from the first block of the neural network and moving one block deeper at a time, so that the various cases could be compared. Training was performed with the "Focal loss" function.

[Table 1]

Table 1 shows the results of the sensor fusion experiment for the semantic segmentation method according to an embodiment of the present invention.

As shown in Table 1, the single-modal method that used only RGB images as input, without sensor fusion, recorded an MIoU of 0.1012, lower than every case that used sensor fusion.

Among the cases with sensor fusion, inference-stage fusion recorded the lowest MIoU at 0.1844, data-stage fusion recorded 0.2045, and feature-stage fusion recorded the highest at 0.2095 on average.

Within feature-stage fusion, fusing the outputs of the second convolution block of the encoder gave the highest MIoU, 0.2152, and fusing at later blocks gradually lowered performance.

These results indicate that using sensor fusion produces better results than not using it, and that feature-stage fusion gives the best performance.

In the second experiment, to examine how performance changes with the number of transformer modules used to form the deep feature vectors in the encoder, the number of modules was increased from 6 to 9, 12, 18, and 24 in turn, and the network was trained and its performance observed in each of the five cases.

Here, the rest of the encoder used the feature-stage fusion configuration that performed best in the first experiment, and the "Focal loss" function was again used as the loss function for training.

Table 2 shows the results of the experiment on the detailed encoder configuration for forming deep feature vectors.

As shown in Table 2, performance rose as the number of modules increased, peaking at an MIoU of 0.2240 with 9 modules, and gradually declined as the number of modules increased further.

In particular, the encoder with 9 modules performed about 1% better than the configuration of the existing method, which uses 12 modules, despite being a shallower network. This suggests that sensor fusion makes it possible to produce highly expressive deep features relatively quickly.

In the third experiment, the two loss functions widely used in the semantic segmentation field, the "Focal loss" function of Equation 1 and the "Dice loss" function of Equation 2, were each applied to train the neural network, and the results were evaluated to observe how the choice of loss function affects performance.

In addition, as shown in Equation 7 below, experiments were also conducted in which one of the two loss functions served as the main loss function and the other was added as a regularization term (regularizer), with the aim of obtaining the advantages of both at once: improved per-pixel accuracy and improved overall structural similarity.

[Equation 7]
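
As a non-limiting illustration, combining a main loss with a second loss used as a regularization term can be sketched in Python. Equation 7 is not reproduced in this text, so the additive form and the weighting coefficient `weight` are assumptions; the function name and parameters are hypothetical.

```python
def combined_loss(main_loss, regularizer, p_pred, p_true, weight=1.0):
    """Main loss plus a weighted second loss acting as a regularization term.

    main_loss and regularizer are callables taking (p_pred, p_true),
    for example a Focal loss function and a Dice loss function.
    """
    return main_loss(p_pred, p_true) + weight * regularizer(p_pred, p_true)
```

Swapping the two callables gives the other configuration compared in the experiment, in which the roles of main loss and regularization term are exchanged.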

Table 3 shows the experimental results according to the composition of the loss function used for training.

[Table 3]

Here, the neural network was built with the semantic segmentation method according to an embodiment of the present invention that performed best in the second experiment. As shown in Table 3, the best result was obtained when the "Focal loss" function was used as the main loss function with the "Dice loss" function as the regularization term.

This appears to be because training centered on the "Focal loss" function focuses learning on improving per-pixel accuracy, while the "Dice loss" function used alongside it as a regularization term applies a strong penalty to regions that hinder overall similarity and a weak penalty to regions that help it, allowing the model to be trained more evenly.

FIG. 10 is a diagram showing a 3D analysis result of 2D semantic segmentation according to an embodiment of the present invention.

Referring to FIG. 10, the semantic segmentation method according to an embodiment of the present invention achieved a 2D segmentation MIoU of 31.94%, and on this basis the 3D semantic segmentation information can be reinterpreted as shown in FIG. 10.

[Table 4]

Table 4 shows the semantic segmentation results according to the examples and comparative examples of the present invention.

Accordingly, a segmentation MIoU based on 3D coordinates, which could not be obtained with existing methods, can be obtained as shown in Table 4; the value recorded was 17.12%.
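For reference, MIoU (mean intersection over union) can be computed from predicted and ground-truth class labels as in the sketch below. The same function applies to per-pixel labels (2D) and per-point labels (3D). The function name `mean_iou`, the optional `ignore_index`, and the decision to skip classes absent from both prediction and ground truth are assumptions for illustration; the document does not state how the reported figures handle such cases.

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=None):
    """Mean IoU over classes, for flat or multi-dimensional label arrays."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    if ignore_index is not None:
        # Drop elements whose ground-truth label is the ignored class.
        keep = target != ignore_index
        pred, target = pred[keep], target[keep]
    ious = []
    for c in range(num_classes):
        if c == ignore_index:
            continue
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class appears in neither array; skip it
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")
```

With per-point predictions recovered from the 2D feature map and the per-point class labels of the data set, calling `mean_iou(point_pred, point_labels, num_classes)` would give a 3D-coordinate-based score of the kind reported in Table 4.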

As described above, although preferred embodiments of the present invention have been disclosed in this specification and the drawings, it is obvious to those of ordinary skill in the art to which the present invention pertains that other modifications based on the technical idea of the present invention can be implemented in addition to the embodiments disclosed herein. In addition, although specific terms are used in this specification and the drawings, they are used only in a general sense to explain the technical content of the present invention easily and to aid understanding of the invention, and are not intended to limit the scope of the present invention. Accordingly, the foregoing detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

100: learning data collection device 200: learning data generating device
300: artificial intelligence learning device
205: communication unit 210: input/output unit
215: pre-learning unit 220: data preprocessing unit
225: inference unit 230: 3D analysis unit
235: storage unit

Claims

A three-dimensional interpretation method of semantic segmentation, comprising:
receiving, by a learning data generating device, point cloud data obtained from a lidar and an image captured through a camera at the same time;
generating, by the learning data generating device, a 2D semantic segmentation feature map from the point cloud data and the image based on machine-learned artificial intelligence (AI); and
interpreting, by the learning data generating device, 3D information from the generated feature map,
the method further comprising, prior to the generating step,
pre-learning the artificial intelligence,
wherein, in the pre-learning step,
the artificial intelligence is pre-learned based on point cloud data obtained from a lidar and included in a pre-stored data set, an image captured through a camera at the same time as the point cloud data, calibration information between the lidar and the camera, and correct answer data in which a class label is specified for each 3D point of the point cloud data, and
wherein, in the pre-learning step,
the point cloud data and the correct answer data are projected onto coordinates on the two-dimensional plane of the image, and pixels of the image whose Manhattan distance from a projected pixel is within a preset value are given the same class as that pixel.
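The projection and label-spreading step recited above can be illustrated with the following sketch. It assumes a pinhole camera model with a 3×3 intrinsic matrix `K` and a 4×4 lidar-to-camera extrinsic matrix `T` as the calibration information; these, the function name `project_labels`, the `radius` parameter standing in for the preset value, and the background index 0 are assumptions for illustration and are not specified in the document. Where the neighbourhoods of two points overlap, the later point simply overwrites the earlier one.

```python
import numpy as np

def project_labels(points, labels, K, T, height, width, radius=1, background=0):
    """Project labelled lidar points into the image and spread each label
    to every pixel within Manhattan distance `radius` of the projected pixel.

    points: (N, 3) lidar coordinates, labels: (N,) integer class labels
    K: (3, 3) camera intrinsics, T: (4, 4) lidar-to-camera transform
    """
    homo = np.hstack([points, np.ones((len(points), 1))])
    cam = (T @ homo.T).T[:, :3]
    in_front = cam[:, 2] > 0  # keep only points in front of the camera
    cam, lab = cam[in_front], labels[in_front]
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    label_map = np.full((height, width), background, dtype=labels.dtype)
    for ui, vi, li in zip(u, v, lab):
        for dv in range(-radius, radius + 1):
            span = radius - abs(dv)  # |du| + |dv| <= radius
            for du in range(-span, span + 1):
                y, x = vi + dv, ui + du
                if 0 <= y < height and 0 <= x < width:
                    label_map[y, x] = li
    return label_map
```

Spreading labels in this way densifies the otherwise sparse projected ground truth, so that more image pixels carry a class label during pre-learning.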

The method of claim 1, wherein, in the pre-learning step,
the artificial intelligence is pre-learned through at least one of a plurality of loss functions, with one of the plurality of loss functions added as a regularizer.

The method of claim 2, wherein, in the pre-learning step,
the artificial intelligence is pre-learned through a loss function represented by Equation 1 below.
[Equation 1]

(Here, p_pred is the inference probability based on cross entropy, and p_true is the correct answer.)

The method of claim 2, wherein, in the pre-learning step,
the artificial intelligence is pre-learned through a loss function represented by Equation 2 below.
[Equation 2]

(Here, p_pred is the inference probability based on the Dice coefficient, and p_true is the correct answer.)

The method of claim 2, wherein, in the pre-learning step,
pixels corresponding to the background class are excluded from the calculation of the loss value regardless of the type of the loss function.

A three-dimensional interpretation method of semantic segmentation, comprising:
receiving, by a learning data generating device, point cloud data obtained from a lidar and an image captured through a camera at the same time;
generating, by the learning data generating device, a 2D semantic segmentation feature map from the point cloud data and the image based on machine-learned artificial intelligence (AI); and
interpreting, by the learning data generating device, 3D information from the generated feature map,
wherein, in the interpreting step,
a pair is prepared consisting of the 3D coordinates of the point cloud data and the coordinates on the 2D plane of the image onto which the point cloud data is projected.

The method of claim 6, wherein, in the interpreting step,
a max pooling operation using a 3×3 kernel centered on each 2D plane coordinate is applied to the generated feature map.

The method of claim 7, wherein, in the interpreting step,
a vector estimating the segment for each coordinate is obtained through the max pooling operation.
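The interpreting step of claims 6 to 8 can be sketched as follows: for each lidar point, take the 3×3 window of the 2D feature map centered on the point's projected pixel, and reduce it per channel with a maximum. The function name `pool_point_scores`, the channel-first `(C, H, W)` layout, the use of negative-infinity padding at the image border, and the final `argmax` to pick a class are assumptions for illustration; the document only states that a per-coordinate vector is obtained through the max pooling operation.

```python
import numpy as np

def pool_point_scores(feature_map, uv):
    """Return one score vector per projected point.

    feature_map: (C, H, W) per-class scores from the 2D segmentation network
    uv:          (N, 2) integer pixel coordinates (u = column, v = row),
                 assumed to lie inside the image
    """
    feature_map = np.asarray(feature_map, dtype=float)
    channels = feature_map.shape[0]
    # Pad by one pixel so that windows at the border stay 3x3.
    padded = np.pad(feature_map, ((0, 0), (1, 1), (1, 1)),
                    constant_values=-np.inf)
    scores = np.empty((len(uv), channels))
    for i, (u, v) in enumerate(uv):
        # Pixel (v, u) sits at (v + 1, u + 1) in the padded map,
        # so this slice is the 3x3 window centered on it.
        window = padded[:, v:v + 3, u:u + 3]
        scores[i] = window.max(axis=(1, 2))
    return scores
```

Because each row of the result is tied to one 3D point through the prepared 3D/2D coordinate pairs, `pool_point_scores(feature_map, uv).argmax(axis=1)` would give one class estimate per 3D point under these assumptions.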

A computer program recorded on a recording medium, which, in combination with a computing device comprising
a memory;
a transceiver; and
a processor that processes instructions resident in the memory,
executes the steps of:
receiving, by the processor, point cloud data obtained from a lidar and an image captured through a camera at the same time;
generating, by the processor, a 2D semantic segmentation feature map from the point cloud data and the image based on machine-learned artificial intelligence (AI); and
interpreting, by the processor, 3D information from the generated feature map,
and further executes, prior to the generating step,
a step of pre-learning the artificial intelligence,
wherein, in the pre-learning step,
the artificial intelligence is pre-learned based on point cloud data obtained from a lidar and included in a pre-stored data set, an image captured through a camera at the same time as the point cloud data, calibration information between the lidar and the camera, and correct answer data in which a class label is specified for each 3D point of the point cloud data, and
wherein, in the pre-learning step,
the point cloud data and the correct answer data are projected onto coordinates on the two-dimensional plane of the image, and pixels of the image whose Manhattan distance from a projected pixel is within a preset value are given the same class as that pixel.

A computer program recorded on a recording medium, which, in combination with a computing device comprising
a memory;
a transceiver; and
a processor that processes instructions resident in the memory,
executes the steps of:
receiving, by the processor, point cloud data obtained from a lidar and an image captured through a camera at the same time;
generating, by the processor, a 2D semantic segmentation feature map from the point cloud data and the image based on machine-learned artificial intelligence (AI); and
interpreting, by the processor, 3D information from the generated feature map,
wherein, in the interpreting step,
a pair is prepared consisting of the 3D coordinates of the point cloud data and the coordinates on the 2D plane of the image onto which the point cloud data is projected.