KR102591078B1 - Electronic device for performing 3d object detection using an artificial intelligence model and operating method thereof

Info

Publication number: KR102591078B1
Application number: KR1020230057376A
Authority: KR
Inventors: 장정식
Original assignee: 주식회사 렛서
Priority date: 2023-05-02
Filing date: 2023-05-02
Publication date: 2023-10-19

Abstract

The present disclosure provides an electronic device that performs 3D object detection using an artificial intelligence model, and a method of operating the same. According to an embodiment, the electronic device includes a display, a memory storing one or more instructions, and at least one processor that executes the one or more instructions. The at least one processor may execute the one or more instructions to: obtain a point cloud corresponding to a 3D scene including a plurality of objects; convert the point cloud into a two-dimensional coordinate system; display the converted point cloud on the display; obtain a first user input including at least one point, among the converted point cloud, that corresponds to the plurality of objects; encode the first user input based on the point cloud; and infer, using an artificial intelligence model that takes the point cloud and the encoded first user input as input, 3D bounding boxes corresponding to the plurality of objects in the point cloud.

Description

Electronic device for performing 3D object detection using an artificial intelligence model and operating method thereof

The present disclosure relates to an electronic device that performs 3D object detection using an artificial intelligence model and an operating method thereof, and more specifically, to an electronic device that performs 3D object detection by using an artificial intelligence model that infers 3D bounding boxes based on a point cloud and user input, and an operating method thereof.

3D object detection has been studied for a long time and remains an active research topic in autonomous driving and robotics. A point cloud is one of the representative 3D data types used to represent complex scenes efficiently. Various techniques, including voxelization-based, point-based, and graph-based methods, have been studied for 3D object detection in point clouds.

However, datasets for training 3D object detection models on point clouds are difficult to obtain. In particular, because point cloud data is complex, the annotation process is error-prone and costly. These problems hinder the development of high-performance 3D object detection technology.

An object of the present disclosure is to provide an electronic device that performs 3D object detection using user interaction and an artificial intelligence model, and an operating method thereof, in order to infer 3D bounding boxes corresponding to objects accurately and efficiently.

An electronic device that performs 3D object detection using an artificial intelligence model according to an embodiment includes a display, a memory storing one or more instructions, and at least one processor that executes the one or more instructions stored in the memory. The at least one processor may execute the one or more instructions to: obtain a point cloud corresponding to a 3D scene including a plurality of objects; convert the point cloud into a two-dimensional coordinate system; display the converted point cloud on the display; obtain a first user input including at least one point, among the converted point cloud, that corresponds to the plurality of objects; encode the first user input based on the point cloud; and infer, using an artificial intelligence model that takes the point cloud and the encoded first user input as input, 3D bounding boxes corresponding to the plurality of objects in the point cloud.

According to an embodiment, the at least one processor may further execute the one or more instructions to: display the converted point cloud and the inferred 3D bounding boxes on the display; obtain a second user input including at least one of (i) a point that corresponds to an inferred 3D bounding box but does not correspond to the plurality of objects, and (ii) a point that does not correspond to any inferred 3D bounding box but corresponds to the plurality of objects; encode the second user input based on the point cloud; and infer, using the artificial intelligence model that takes the point cloud, the encoded first user input, and the encoded second user input as input, 3D bounding boxes corresponding to the plurality of objects in the point cloud.

According to an embodiment, the artificial intelligence model may include: a point encoder that extracts features based on the point cloud and the encoded first user input; a centroid aggregation module that predicts center points corresponding to the plurality of objects based on the extracted features and groups points corresponding to the plurality of objects based on the predicted center points; a spatial click propagation module that calculates, based on the extracted features and the encoded first user input, information on objects of the same class as the object corresponding to the first user input; and a detection head module that outputs the 3D bounding boxes corresponding to the plurality of objects based on the output of the centroid aggregation module and the output of the spatial click propagation module.

According to an embodiment, the point encoder may include at least one fully connected layer and at least one downsampling layer, and the at least one processor may further execute the one or more instructions to concatenate the encoded first user input to the output of the at least one downsampling layer.

According to an embodiment, the spatial click propagation module may calculate a similarity between information corresponding to the encoded first user input and the extracted features. The detection head module may obtain data in which the similarity is concatenated to the output of the centroid aggregation module and, based on the obtained data, detect objects of the same class as the object corresponding to the first user input.
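The similarity step described above can be illustrated with a minimal sketch. The disclosure does not state which similarity measure is used or how the "information corresponding to the encoded first user input" is formed; this sketch assumes cosine similarity and assumes that information is the extracted feature of the clicked point. The function names `cosine_similarity`, `propagate_click`, and `append_similarity` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero feature is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def propagate_click(features, click_index):
    """Score every point feature against the feature of the clicked point."""
    reference = features[click_index]
    return [cosine_similarity(reference, f) for f in features]


def append_similarity(aggregated, similarity):
    """Concatenate the similarity score to each aggregated feature vector,
    mirroring the data the detection head is described as receiving."""
    return [list(f) + [s] for f, s in zip(aggregated, similarity)]
```

Under these assumptions, points whose features resemble the clicked point receive scores near 1, which is one way a single click could inform the detection of other objects of the same class.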

According to an embodiment, the artificial intelligence model may be trained to output a 3D bounding box corresponding to an object based on a point cloud dataset, at least one positive point corresponding to an object in the point cloud dataset, and at least one negative point corresponding to the background in the point cloud dataset.

According to an embodiment, the artificial intelligence model may include a negative click simulation module. During training of the artificial intelligence model, the negative click simulation module uses ground truth values and the extracted features to assign, as the at least one negative point, those background points (points that do not correspond to the plurality of objects) whose foreground score exceeds a threshold.
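As a rough illustration of the selection rule described above, the sketch below picks background points whose predicted foreground score exceeds a threshold. It assumes per-point foreground scores have already been computed from the extracted features and that the ground truth is available as a per-point foreground flag. The function name `simulate_negative_clicks`, the default threshold, and the optional `max_clicks` cap are assumptions for illustration, not details from the disclosure.

```python
def simulate_negative_clicks(foreground_scores, is_foreground_gt,
                             threshold=0.5, max_clicks=None):
    """Return indices of background points to use as simulated negative clicks.

    foreground_scores: per-point predicted foreground score.
    is_foreground_gt:  per-point ground-truth flag (True if the point lies on an object).
    threshold:         hypothetical cut-off; the source only says "a threshold".
    max_clicks:        hypothetical cap on how many negative clicks to keep.
    """
    candidates = [
        i for i, (score, is_fg) in enumerate(zip(foreground_scores, is_foreground_gt))
        if not is_fg and score > threshold
    ]
    # Most confidently wrong background points first.
    candidates.sort(key=lambda i: foreground_scores[i], reverse=True)
    if max_clicks is not None:
        candidates = candidates[:max_clicks]
    return candidates
```

The selected points are exactly those the model is most likely to mistake for objects, which is consistent with the stated goal of reducing incorrect bounding box inference.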

According to an embodiment, the first user input may include object location information and object class information.

According to an embodiment, the one or more instructions for encoding the first user input may include one or more instructions for encoding the first user input based on the distance between the (x, y) location information of the point cloud and the object location information of the first user input.

According to an embodiment, the artificial intelligence model may take as input a value obtained by concatenating the encoded first user input to the point cloud.

According to an embodiment, negative click simulation can reduce incorrect 3D bounding box inference. According to an embodiment, dense click guidance keeps the user input from being diluted as the artificial intelligence model processes the data, thereby maximizing the effect of user interaction. According to an embodiment, spatial click propagation enables accurate inference from only a small number of user inputs. According to an embodiment, converting the point cloud into a two-dimensional coordinate system before presenting it to the user provides a more convenient user experience and interaction tool.

FIGS. 1A and 1B are conceptual diagrams schematically showing the operation of an electronic device that performs 3D object detection using an artificial intelligence model according to an embodiment.
FIG. 2 is a block diagram showing the configuration of an electronic device according to an embodiment.
FIG. 3 is a block diagram showing an inference process using an artificial intelligence model according to an embodiment.
FIG. 4 is a conceptual diagram for explaining the operation of a spatial click propagation module according to an embodiment.
FIG. 5 is a block diagram showing the training process of an artificial intelligence model according to an embodiment.
FIG. 6 is a conceptual diagram for explaining the operation of a negative click simulation module according to an embodiment.
FIGS. 7A and 7B are diagrams showing an example of the operation of a 2D conversion module according to an embodiment.
FIGS. 8A and 8B are diagrams showing an example of the operation of an encoding module according to an embodiment.
FIGS. 9A to 9D are diagrams showing an example of the operation of a spatial click propagation module according to an embodiment.
FIG. 10 is a flowchart showing a method of operating an electronic device according to an embodiment.
FIG. 11 is a flowchart showing a method of operating an electronic device according to an embodiment.

Hereinafter, embodiments of the present disclosure are described in detail with reference to the attached drawings. The embodiments are described clearly and in enough detail that a person of ordinary skill in the art can readily practice the present disclosure. However, the scope of rights is not limited by these embodiments. The same or similar reference numerals are used for similar components in each drawing, and redundant descriptions of the same or similar components are omitted.

The terms used in the description below were selected as those that are general and common in the related technical field, but other terms may exist depending on the development and/or change of technology, convention, the preferences of practitioners, and the like. Accordingly, the terms used in the description below should not be understood as limiting the technical idea, but as illustrative terms for describing the embodiments.

In certain cases, some terms were arbitrarily selected by the applicant, and in such cases their meaning is set out in detail in the corresponding part of the description. Therefore, the terms used in the description below should be understood based on their meaning and the overall content of the specification, not merely on their names.

Singular expressions may include plural expressions unless the context clearly indicates otherwise. Terms used herein, including technical or scientific terms, may have the same meaning as generally understood by a person of ordinary skill in the technical field described in this specification. In addition, terms including ordinal numbers, such as "first" or "second", may be used in this specification to describe various components, but the components should not be limited by these terms. These terms are used only to distinguish one component from another.

Throughout the specification, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them. In addition, terms such as "unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Embodiments of the present disclosure are described in detail below with reference to the attached drawings so that a person of ordinary skill in the art to which the present invention pertains can readily implement them. However, the present disclosure may be implemented in many different forms and is not limited to the embodiments described herein. To describe the present disclosure clearly, parts unrelated to the description are omitted from the drawings, and similar parts are given similar reference numerals throughout the specification. In addition, the reference numerals used in each drawing serve only to describe that drawing, and different reference numerals used in different drawings are not intended to indicate different elements. The present disclosure is described in detail below with reference to the attached drawings.

In the present disclosure, an artificial intelligence model may be composed of a plurality of neural network layers. Each of the neural network layers has a plurality of weight values and performs a neural network operation by operating on the result of the previous layer and the plurality of weights. The weights of the neural network layers may be optimized by training the deep neural network model. For example, the weights may be updated during training so that a loss value or cost value obtained from the deep neural network model is reduced or minimized. For example, the deep neural network model may include a Convolutional Neural Network (CNN), a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Restricted Boltzmann Machine (RBM), a Deep Belief Network (DBN), a Bidirectional Recurrent Deep Neural Network (BRDNN), or Deep Q-Networks, but is not limited thereto.

In the present disclosure, a user input may include object location information in a two-dimensional coordinate system and object class (which may also be referred to as category) information. For example, a user input may be defined as (p_k, c_k). Here, k may be defined as the number of points corresponding to the user input. p_k is the object location information and may be defined, for example, as (p_{k,x}, p_{k,y}). c_k is the object class information and may be defined, for example, as the class of the corresponding point among C predefined classes.
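To make the (p_k, c_k) definition concrete, the following is a minimal sketch of one possible in-memory representation. The names `Click`, `make_user_input`, and `NUM_CLASSES`, as well as the value chosen for C, are hypothetical and do not come from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

NUM_CLASSES = 3  # hypothetical value of C; the disclosure does not fix one


@dataclass(frozen=True)
class Click:
    """One clicked point: a 2D position p_k and a class index c_k."""
    position: Tuple[float, float]  # (p_{k,x}, p_{k,y}) in the 2D coordinate system
    class_id: int                  # index into the C predefined classes

    def __post_init__(self):
        if not 0 <= self.class_id < NUM_CLASSES:
            raise ValueError(f"class_id must be in [0, {NUM_CLASSES})")


def make_user_input(points: Sequence[Tuple[float, float]],
                    class_ids: Sequence[int]) -> List[Click]:
    """Pair each clicked position with its class to form a user input."""
    if len(points) != len(class_ids):
        raise ValueError("points and class_ids must have the same length")
    return [Click((float(x), float(y)), c) for (x, y), c in zip(points, class_ids)]
```

For instance, `make_user_input([(1.0, 2.0)], [0])` would describe a single click at (1.0, 2.0) labeled with class 0.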

FIGS. 1A and 1B are conceptual diagrams schematically showing the operation of an electronic device that performs 3D object detection using an artificial intelligence model according to an embodiment.

Referring to FIG. 1A, the electronic device 100 may infer 3D bounding boxes (e.g., 160a, 160b) corresponding to a plurality of objects in the point cloud 110 by using the artificial intelligence model 150, which takes the point cloud 110 and the first user input 10 as input. The electronic device 100 may infer the 3D bounding boxes (e.g., 160a, 160b) over a predetermined number of iterations. FIG. 1A shows an example of the initial iteration.

For example, the electronic device 100 may be a device that includes a display 130. The electronic device 100 may be a device that outputs a two-dimensional image corresponding to the point cloud 110 through the display 130. For example, the electronic device 100 may include a smartphone, a laptop computer, a personal computer (PC), a tablet PC, a digital camera, a closed-circuit television (CCTV), an e-book terminal, a digital broadcasting terminal, a personal digital assistant (PDA), a portable multimedia player (PMP), a navigation device, or an MP3 player, but is not limited thereto. The electronic device 100 may be implemented as various types and forms of devices that include the display 130.

The electronic device 100 according to an embodiment may include a 2D conversion module 120, an encoding module 140, and an artificial intelligence model 150.

The electronic device 100 may obtain a point cloud 110 corresponding to a 3D scene including a plurality of objects. For example, the electronic device 100 may include a sensor (not shown) for generating the point cloud 110. For example, the sensor (not shown) may consist of at least one of an image sensor, a LiDAR sensor, an RGB-D sensor, a depth sensor, a time-of-flight (ToF) sensor, an ultrasonic sensor, a radar sensor, and a stereo camera, but the present disclosure is not limited thereto.

In an embodiment, the sensor (not shown) may obtain sensing data from the outside. In an embodiment, the electronic device 100 may generate the point cloud 110 based on the sensing data. In an embodiment, the electronic device 100 may receive the point cloud 110 from an external server or an external electronic device.

The 2D conversion module 120 may convert the point cloud 110 into a two-dimensional coordinate system. The electronic device 100 may display the converted point cloud 111 on the display 130. The electronic device 100 may receive a first user input 10 that includes at least one point (a positive click or positive point), among the converted point cloud 111, corresponding to the plurality of objects. For example, the electronic device 100 may receive, through an input interface, a first user input 10 corresponding to a point that belongs to one of the plurality of objects. For example, the first user input 10 may include object location information in the converted point cloud 111 and object class information (e.g., a pedestrian class).
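The passage above does not specify how the point cloud is converted to two dimensions. As one plausible illustration only, the sketch below assumes a bird's-eye-view projection that drops the z coordinate and maps (x, y) to pixel indices on a display grid. The function name `to_bev_pixels` and the parameters `x_range`, `y_range`, and `resolution` are hypothetical.

```python
def to_bev_pixels(points, x_range, y_range, resolution):
    """Project 3D points to 2D pixel coordinates (assumed bird's-eye view).

    points:     iterable of (x, y, z) tuples in scene units.
    x_range:    (x_min, x_max) extent to keep along x.
    y_range:    (y_min, y_max) extent to keep along y.
    resolution: scene units per pixel.
    Returns a list of (col, row) integer pixel coordinates for in-range points.
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    pixels = []
    for x, y, _z in points:  # z is discarded in this assumed projection
        if x_min <= x < x_max and y_min <= y < y_max:
            col = int((x - x_min) / resolution)
            row = int((y - y_min) / resolution)
            pixels.append((col, row))
    return pixels
```

With such a mapping, a click on the display can be converted back to scene (x, y) coordinates by inverting the same scale and offset, which is what makes a 2D click usable as a location hint for the 3D model.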

The encoding module 140 may encode the first user input 10 based on the point cloud 110. For example, the encoding module 140 may encode the first user input 10 based on the distance between the (x, y) location information of the point cloud and the object location information of the first user input 10. In an embodiment, the electronic device 100 may concatenate the encoded first user input to the point cloud 110.
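The distance-based encoding and concatenation described above could look roughly like the following sketch. The disclosure states only that the encoding depends on the distance between each point's (x, y) position and the click location; the Gaussian falloff, the one-channel-per-class layout, and the names `encode_clicks`, `concat_features`, and `sigma` are assumptions made for illustration.

```python
import math


def encode_clicks(points, clicks, num_classes, sigma=1.0):
    """Encode clicks as per-point, per-class values that decay with distance.

    points: list of (x, y, z) tuples.
    clicks: list of ((px, py), class_id) pairs.
    Returns one list of `num_classes` floats per point; each value is in (0, 1]
    near a click of that class and approaches 0 far from it.
    """
    encoded = []
    for x, y, _z in points:
        channels = [0.0] * num_classes
        for (px, py), class_id in clicks:
            dist_sq = (x - px) ** 2 + (y - py) ** 2
            weight = math.exp(-dist_sq / (2.0 * sigma ** 2))  # assumed Gaussian kernel
            channels[class_id] = max(channels[class_id], weight)
        encoded.append(channels)
    return encoded


def concat_features(points, encoded):
    """Concatenate each point's coordinates with its encoded click channels."""
    return [list(p) + e for p, e in zip(points, encoded)]
```

In this layout every point carries the click information directly in its input features, so the model receives the user guidance alongside the raw geometry.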

The electronic device 100 may input the point cloud 110 and the encoded first user input to the (pre-trained) artificial intelligence model 150. Based on the point cloud 110 and the encoded first user input, the artificial intelligence model 150 may infer (or predict) 3D bounding boxes (e.g., 160a, 160b) corresponding to the plurality of objects in the point cloud 110. For example, the artificial intelligence model 150 may output annotation values corresponding to the 3D bounding boxes (e.g., 160a, 160b). That is, the inference result may represent the annotation values of the 3D bounding boxes (e.g., 160a, 160b). For example, the annotation values may include the center point, height, width, depth, class, and angle of a 3D bounding box, but are not limited thereto.

According to an embodiment, because information about positive points corresponding to objects is provided to the artificial intelligence model 150, the 3D bounding boxes can be inferred with high accuracy.

For example, the 3D bounding boxes 160a_1 to 160a_5 may correspond to the pedestrian class, and the 3D bounding boxes 160b_1 and 160b_2 may correspond to the vehicle class. The 3D bounding boxes 160a_1 to 160a_3 and the 3D bounding box 160b_1 may be classified as true positives. The 3D bounding boxes 160a_4 and 160a_5 and the 3D bounding box 160b_2 may be classified as false positives. In the present disclosure, a true positive indicates that an object that should be detected was detected, and a false positive indicates that an object that should not be detected was detected. Accordingly, the 3D bounding boxes 160a_4, 160a_5, and 160b_2 are erroneous detections: the corresponding regions should have been treated as background or detected as other objects.

Referring to FIG. 1B together with FIG. 1A, FIG. 1B shows the operation of the electronic device 100 performing object detection in an iteration that follows the initial iteration of FIG. 1A.

The electronic device 100 may display, on the display 130, the converted point cloud 111 and data corresponding to the 3D bounding boxes 160a_1 to 160a_5, 160b_1, and 160b_2 inferred in the initial iteration.

The electronic device 100 may obtain a second user input 11 that includes at least one of (i) a point that corresponds to an inferred 3D bounding box but does not correspond to the plurality of objects (i.e., a negative point or negative click), and (ii) a point that does not correspond to any inferred 3D bounding box but corresponds to the plurality of objects (i.e., a positive point or positive click). For example, the second user input 11 may include a plurality of negative points and/or a plurality of positive points. Referring to FIG. 1B, the 3D bounding box 160a_4 was detected even though the underlying object does not belong to the pedestrian class, so the second user input 11 may be a negative point.

The encoding module 140 may encode the second user input 11 based on the point cloud 110. In an embodiment, the encoding module 140 may encode a value obtained by concatenating the user input of the current iteration (e.g., the second user input 11) with the user input of the previous iteration (e.g., the first user input 10).

The electronic device 100 may infer 3D bounding boxes corresponding to the plurality of objects in the point cloud 110 by using the artificial intelligence model 150, which takes the point cloud 110, the encoded first user input, and the encoded second user input as input.
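The iterative flow of FIGS. 1A and 1B (collect clicks, accumulate them with those of earlier iterations, encode, and infer again) can be sketched as a simple loop. This is only an outline under assumptions: `get_clicks`, `encode`, and `model` are hypothetical callables standing in for the input interface, the encoding module 140, and the artificial intelligence model 150, and the function name `run_interactive_detection` is not from the disclosure.

```python
def run_interactive_detection(point_cloud, get_clicks, encode, model, num_iterations):
    """Run a fixed number of click-and-infer iterations.

    get_clicks(iteration, boxes) -> list of new clicks for this iteration;
        `boxes` holds the previous iteration's result so the user can correct it.
    encode(point_cloud, clicks)  -> encoded representation of all clicks so far.
    model(point_cloud, encoded)  -> list of inferred 3D bounding boxes.
    """
    all_clicks = []  # clicks accumulated across iterations
    boxes = []       # nothing has been inferred before the first iteration
    for iteration in range(num_iterations):
        new_clicks = get_clicks(iteration, boxes)
        all_clicks.extend(new_clicks)  # keep earlier clicks alongside the new ones
        encoded = encode(point_cloud, all_clicks)
        boxes = model(point_cloud, encoded)
    return boxes, all_clicks
```

Because earlier clicks are retained, each later iteration refines the result with the full history of positive and negative guidance rather than only the most recent click.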

Referring to FIG. 1B, the artificial intelligence model 150 not only does not infer the 3D bounding box 160a_4 corresponding to the second user input 11, but also does not infer the 3D bounding box 160a_5, which has features similar to those of the 3D bounding box 160a_4.

According to one embodiment, a 3D bounding box may be inferred with high accuracy even though the artificial intelligence model 150 is given information about only a small number of positive points corresponding to objects or a small number of negative points not corresponding to objects.

FIG. 2 is a block diagram showing the configuration of an electronic device according to an embodiment. The configuration, operation, and function of the electronic device 200 may correspond to those of the electronic device 100 shown in FIGS. 1A and 1B. For convenience of explanation, content that overlaps with the description of FIGS. 1A and 1B is omitted.

Referring to FIG. 2, the electronic device 200 according to one embodiment may include a communication interface 210, a user interface 220, a processor 230, and a memory 240. However, not all of the components shown in FIG. 2 are essential; the electronic device 200 may omit some of them or may further include additional components. For example, the electronic device 200 may be configured to include a display 130, a processor 230, and a memory 240.

The communication interface 210 may support establishing a wired or wireless communication channel between the electronic device 200 and another external electronic device (not shown) or a server (not shown), and performing communication through the established channel. In one embodiment, the communication interface 210 may receive data from, or transmit data to, another external electronic device (not shown) or a server (not shown) through wired or wireless communication. In one embodiment, the communication interface 210 may include a wireless communication module (e.g., a cellular communication module, a short-range wireless communication module, or a global navigation satellite system (GNSS) communication module) or a wired communication module (e.g., a local area network (LAN) communication module or a power line communication module), and may use any one of these modules to communicate with another external electronic device (not shown) or a server (not shown) through at least one network, such as a short-range communication network (e.g., Bluetooth, WiFi Direct, or Infrared Data Association (IrDA)) or a long-range communication network (e.g., a cellular network, the Internet, or a computer network such as a LAN or WAN).

In one embodiment, the electronic device 200 may receive a point cloud from another external electronic device (not shown) or a server (not shown) through the communication interface 210.

The user interface 220 may include an input interface 221 and an output interface 222.

The input interface 221 receives input from a user (hereinafter, user input). The input interface 221 may be at least one of a key pad, a dome switch, a touch pad (e.g., contact capacitive, pressure resistive, infrared sensing, surface ultrasonic conduction, integral tension measurement, or piezoelectric effect type), a touch screen, a controller, a jog wheel, a jog switch, a mouse, a keyboard, a wireless pen, a voice recognition device, and a gesture recognition device, but is not limited thereto.

The output interface 222 may include a display 130 for outputting a video signal. According to one embodiment, the electronic device 200 may display information related to the electronic device 200 through the display 130. For example, the electronic device 200 may display, on the display 130, images that visualize the point cloud after it has been converted into a two-dimensional coordinate system.

When the display 130 and a touch pad form a layered structure constituting a touch screen, the display 130 may be used as an input device as well as an output device. The display 130 may include at least one of a liquid crystal display, a thin film transistor-liquid crystal display, a light-emitting diode (LED) display, an organic light-emitting diode display, a flexible display, a 3D display, and an electrophoretic display. Depending on its implementation, the electronic device 200 may include two or more displays.

The processor 230 may be implemented through a combination of software and a general-purpose processor such as an application processor (AP), a central processing unit (CPU), or a graphics processing unit (GPU). A dedicated processor may include a memory for implementing an embodiment of the present disclosure, or a memory processing unit for using an external memory.

The processor 230 may be composed of a plurality of processors. In this case, it may be implemented as a combination of dedicated processors, or through a combination of software and a plurality of general-purpose processors such as APs, CPUs, or GPUs.

In one embodiment, the processor 230 may include a processor dedicated to artificial intelligence (AI). The AI-dedicated processor may be manufactured as a dedicated hardware chip for AI, or as part of an existing general-purpose processor (e.g., a CPU or application processor) or graphics-dedicated processor (e.g., a GPU), and mounted on the electronic device 200. For example, the AI-dedicated processor may perform the data processing required for training and/or inference of the artificial intelligence model 150.

Functions related to artificial intelligence according to the present disclosure are performed through the processor 230 and the memory 240. The processor 230 may be composed of one or more processors. The one or more processors may be general-purpose processors such as a CPU, an AP, or a digital signal processor (DSP); graphics-dedicated processors such as a GPU or a vision processing unit (VPU); or AI-dedicated processors such as a neural processing unit (NPU). The one or more processors control the processing of input data according to a predefined operation rule or an artificial intelligence model (e.g., the artificial intelligence model 150) stored in the memory. Alternatively, when the one or more processors are AI-dedicated processors, they may be designed with a hardware structure specialized for processing the artificial intelligence model 150.

The predefined operation rule or the artificial intelligence model 150 is created through training. Here, being created through training means that a base artificial intelligence model is trained by a learning algorithm on a large amount of training data, thereby producing a predefined operation rule or artificial intelligence model configured to perform a desired characteristic (or purpose). Such training may be performed on the device itself on which the artificial intelligence according to the present disclosure runs, or through a separate server and/or system. Examples of learning algorithms include supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning, but are not limited to these.

The memory 240 may store programs for processing and control by the processor 230, and may store input/output data. The memory 240 may also store the trained artificial intelligence model 150.

The memory 240 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk. In addition, the electronic device 200 may operate a web storage or a cloud server that performs a storage function on the Internet.

In one embodiment, the memory 240 may store data processed or to be processed by the processor 230, firmware, software, and process code. In one embodiment, the memory 240 may store data and program code corresponding to at least one of the 2D conversion module 120, the encoding module 140, and the artificial intelligence model 150.

FIG. 3 is a block diagram showing an inference process using an artificial intelligence model according to an embodiment. The configuration, operation, and function of the artificial intelligence model 150 may correspond to those of the artificial intelligence model 150 shown in FIGS. 1A to 2. For convenience of explanation, content that overlaps with the description of FIGS. 1A to 2 is omitted.

Referring to FIG. 3, the artificial intelligence model 150 may include a point encoder 310, a centroid aggregation module 320, a spatial click propagation module 330, and a detection head module 340. For example, the artificial intelligence model 150 may include a plurality of neural network layers. According to their functions, the plurality of neural network layers may be divided into the point encoder 310, the centroid aggregation module 320, the spatial click propagation module 330, and the detection head module 340.

The point encoder 310 may obtain the point cloud and the encoded user input. In one embodiment, the point encoder 310 may obtain data in which the encoded user input is concatenated to the point cloud. The point encoder 310 may extract features (which may also be referred to as a feature map, feature vector, latent representation, latent vector, or the like) based on the point cloud and the encoded user input (e.g., the first user input 10 of FIG. 1A and/or the second user input 11 of FIG. 1B).

The point encoder 310 may include at least one fully connected layer and at least one downsampling layer. For example, the point encoder 310 may encode all N points of the point cloud using a fully connected layer and an activation function. For example, the point encoder 310 may sample the encoded points by inputting them to a downsampling layer. In one embodiment, the downsampling layer may sample the encoded points by distance (farthest point sampling), or may sample the n points with the highest foreground scores. However, the present disclosure is not limited to this, and the encoded points may be sampled by max pooling, average pooling, strided convolution, global pooling, or similar techniques.
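The two sampling criteria named above can be illustrated with a small pure-Python sketch. The function names `farthest_point_sampling` and `top_k_by_foreground_score` are hypothetical, the greedy formulation is the commonly used one rather than one confirmed by the disclosure, and a real implementation would operate on tensors inside the network.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def farthest_point_sampling(points: Sequence[Point], n: int, start: int = 0) -> List[int]:
    """Return indices of n points chosen greedily to be far apart."""
    if not 0 < n <= len(points):
        raise ValueError("n must be between 1 and len(points)")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < n:
        # Pick the point that is farthest from the current selection.
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
    return selected


def top_k_by_foreground_score(scores: Sequence[float], n: int) -> List[int]:
    """Return indices of the n points with the highest foreground scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
fps_indices = farthest_point_sampling(cloud, 3)
score_indices = top_k_by_foreground_score([0.1, 0.9, 0.5, 0.3], 2)
```

Distance-based sampling preserves spatial coverage of the scene, whereas score-based sampling concentrates the retained points on likely objects.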

In one embodiment, the encoded user input may be concatenated to the output of each downsampling layer. In the present disclosure, the technique of concatenating the encoded user input to the output of each downsampling layer may be referred to as dense click guidance. FIG. 3 shows the encoded user input concatenated to the outputs of three downsampling layers, but the number of downsampling layers is not limited to this. In one embodiment, the encoded user input may be concatenated to the outputs of at least some of the downsampling layers of the point encoder 310.
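Dense click guidance can be pictured as re-attaching the click channels after every downsampling stage. The sketch below assumes that each stage reports which original point indices it kept; the helper name `concat_click_channels` and the list-based data layout are hypothetical.

```python
from typing import List, Sequence


def concat_click_channels(features: Sequence[Sequence[float]],
                          click_channels: Sequence[Sequence[float]],
                          kept: Sequence[int]) -> List[List[float]]:
    """Re-attach the encoded click values to the points kept by one stage.

    features[j] is the feature of the j-th kept point, kept[j] is that
    point's index in the original point cloud, and click_channels[i] holds
    the encoded user input for original point i.
    """
    return [list(f) + list(click_channels[i]) for f, i in zip(features, kept)]


# Encoded user input for four original points (one channel per point).
click_channels = [[0.0], [1.0], [0.2], [0.0]]

# Stage 1 keeps points 0, 1, 2; stage 2 keeps points 1 and 2.
stage1 = concat_click_channels([[0.5, 0.1], [0.3, 0.3], [0.9, 0.2]],
                               click_channels, kept=[0, 1, 2])
stage2 = concat_click_channels([[0.7], [0.4]], click_channels, kept=[1, 2])
```

Because the click values are looked up from the original encoding at every stage, they reach deeper layers unchanged instead of being mixed into the learned features.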

According to one embodiment, information corresponding to the user input is prevented from being diluted as it passes through the downsampling layers, so the artificial intelligence model 150 can perform more efficient and accurate inference.

The centroid aggregation module 320 may receive the extracted features from the point encoder 310. Based on the extracted features, the centroid aggregation module 320 may predict center points corresponding to the plurality of objects in the 3D scene. The centroid aggregation module 320 may group the points corresponding to the plurality of objects based on the predicted center points.

The spatial click propagation module 330 may receive the extracted features from the point encoder 310. The spatial click propagation module 330 may also receive the output of the centroid aggregation module 320. Based on the extracted features (or the output of the centroid aggregation module 320) and the encoded user input, the spatial click propagation module 330 may compute information about objects of the same class as the object corresponding to the user input.

In one embodiment, the spatial click propagation module 330 may calculate the similarity between information corresponding to the encoded user input and the extracted features. The spatial click propagation module 330 may pass information corresponding to the similarity to the detection head module 340.

According to one embodiment, information about positive points and/or negative points that were not obtained as user input is propagated to objects of the same class by the spatial click propagation module 330, so that 3D bounding boxes can be inferred efficiently.

The detection head module 340 may receive the output of the centroid aggregation module 320 and the output of the spatial click propagation module 330. In one embodiment, the detection head module 340 may obtain (e.g., receive) data in which the output of the spatial click propagation module 330 is concatenated to the output of the centroid aggregation module 320. In one embodiment, the detection head module 340 may detect, based on the obtained data, objects of the same class as the object corresponding to a user input (e.g., the first user input or the second user input). The detection head module 340 may output 3D bounding boxes corresponding to the plurality of objects based on the output of the centroid aggregation module 320 and the output of the spatial click propagation module 330.

FIG. 4 is a conceptual diagram for explaining the operation of the spatial click propagation module according to an embodiment. Referring to FIG. 3 together with FIG. 4, the spatial click propagation module 330 may receive the encoded user input (E_k) and the extracted features (F). The spatial click propagation module 330 may generate a click prototype vector (P_k) by taking the inner product of the encoded user input (E_k) and the extracted features (F). The spatial click propagation module 330 may perform global sum pooling on the click prototype vector (P_k). The spatial click propagation module 330 may calculate the cosine similarity (M_k) between the extracted features (F) and the pooled click prototype vector (P_k). The spatial click propagation module 330 may generate a per-class correlation map based on the cosine similarity (M_k). The per-class correlation map may be concatenated to the output of the centroid aggregation module 320.
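One way to read the steps above is sketched below in plain Python. It assumes that E_k holds one scalar per point and F one feature vector per point, so that weighting each feature by its click value and summing over all points combines the inner product and the global sum pooling in one pass. The function names and the small `eps` term guarding against division by zero are assumptions, not details from the disclosure.

```python
import math
from typing import List, Sequence


def click_prototype(encoded_click: Sequence[float],
                    features: Sequence[Sequence[float]]) -> List[float]:
    """Weight each point feature by its click value, then sum over points."""
    dim = len(features[0])
    proto = [0.0] * dim
    for weight, feat in zip(encoded_click, features):
        for d in range(dim):
            proto[d] += weight * feat[d]
    return proto


def cosine_similarity(a: Sequence[float], b: Sequence[float],
                      eps: float = 1e-8) -> float:
    """Cosine of the angle between a and b (eps avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + eps)


def correlation_map(encoded_click: Sequence[float],
                    features: Sequence[Sequence[float]]) -> List[float]:
    """Similarity of every point feature to the pooled click prototype."""
    proto = click_prototype(encoded_click, features)
    return [cosine_similarity(f, proto) for f in features]


F = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]   # three points, two feature dims
E_k = [1.0, 0.0, 0.0]                      # only the first point was clicked
M_k = correlation_map(E_k, F)
```

In this toy input the third point was never clicked, yet it scores close to 1 because its feature points in the same direction as the clicked point's feature, which is the propagation effect the module is meant to provide.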

FIG. 5 is a block diagram showing the training process of an artificial intelligence model according to an embodiment. Referring to FIG. 5, the configuration, operation, and function of the point encoder 310, the centroid aggregation module 320, the spatial click propagation module 330, and the detection head module 340 may correspond to those of the point encoder 310, the centroid aggregation module 320, the spatial click propagation module 330, and the detection head module 340 of FIG. 3. Therefore, content that overlaps with the description of FIGS. 3 and 4 is omitted.

During training of the artificial intelligence model 150, no user input is provided. Therefore, data corresponding to positive points 520 and/or negative points 530 must be supplied to the artificial intelligence model 150. In one embodiment, the positive points 520 may be randomly selected from the ground truth values of the point cloud dataset. In one embodiment, the negative points 530 may be generated from the output of the negative click simulation module 510. In one embodiment, the encoding module 140 may encode the positive points 520 and/or the negative points 530 based on the point cloud.
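Random selection of positive points from the ground truth might look like the following sketch. It assumes the ground truth is available as a mapping from each annotated object to the indices of the points that belong to it, and it draws one point per object; both the data layout and the one-click-per-object policy are illustrative assumptions, and `sample_positive_clicks` is a hypothetical name.

```python
import random
from typing import Dict, List, Optional


def sample_positive_clicks(object_point_indices: Dict[int, List[int]],
                           rng: Optional[random.Random] = None) -> Dict[int, int]:
    """Pick one random point index per ground-truth object.

    Objects without any points are skipped, since there is nothing a user
    could have clicked on.
    """
    rng = rng or random.Random()
    return {obj_id: rng.choice(indices)
            for obj_id, indices in object_point_indices.items() if indices}


ground_truth = {0: [3, 4, 5], 1: [9], 2: []}
positive_clicks = sample_positive_clicks(ground_truth, random.Random(0))
```

Passing a seeded `random.Random` instance keeps the simulated clicks reproducible across training runs.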

The artificial intelligence model 150 may be trained to output a 3D bounding box corresponding to an object, based on a point cloud dataset, at least one positive point 520 corresponding to the object in the point cloud dataset, and at least one negative point 530 corresponding to the background (or to an object with non-matching class information) in the point cloud dataset.

In one embodiment, the artificial intelligence model 150 may include a negative click simulation module 510. During training of the artificial intelligence model 150, the negative click simulation module 510 may, based on the ground truth values and the extracted features, assign as negative points those background points not corresponding to the plurality of objects whose foreground score exceeds a threshold. In one embodiment, the threshold may be predefined by a user or manufacturer setting.

FIG. 6 is a conceptual diagram for explaining the operation of the negative click simulation module according to an embodiment. Referring to FIG. 5 together with FIG. 6, the negative click simulation module 510 may calculate foreground scores based on the extracted features. The negative click simulation module 510 may refer to the ground truth values to determine which background points have a foreground score exceeding the threshold. The negative click simulation module 510 may assign the determined points as negative points (which may also be referred to as negative clicks). The assigned negative points may be encoded by the encoding module 140. The encoded negative points may be used in the next training iteration of the artificial intelligence model 150.
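The selection rule described above reduces to a simple filter, sketched here under the assumption that a per-point foreground score and a per-point ground-truth foreground flag are already available. The function name `simulate_negative_clicks` is hypothetical.

```python
from typing import List, Sequence


def simulate_negative_clicks(foreground_scores: Sequence[float],
                             is_foreground_gt: Sequence[bool],
                             threshold: float) -> List[int]:
    """Return indices of background points the model mistakes for foreground.

    A point qualifies when the ground truth marks it as background and its
    predicted foreground score is strictly above the threshold.
    """
    return [i
            for i, (score, is_fg) in enumerate(zip(foreground_scores, is_foreground_gt))
            if not is_fg and score > threshold]


scores = [0.9, 0.8, 0.2, 0.6]
gt_foreground = [True, False, False, False]
negative_indices = simulate_negative_clicks(scores, gt_foreground, threshold=0.5)
```

Points selected this way are exactly the likely false positives, so feeding them back as negative clicks in the next iteration mimics a user correcting the model's mistakes.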

FIGS. 7A and 7B are diagrams illustrating the operation of the 2D conversion module according to an embodiment. Referring to FIG. 1A together with FIGS. 7A and 7B, the 2D conversion module 120 may generate a 2D point group 720 by converting a point cloud 710, which represents a 3D scene, into a two-dimensional coordinate system.

Referring to FIG. 7A, the point cloud 710 may be composed of points in an x, y, z coordinate system. Suppose that user inputs 12 and 13 are received in the 3D coordinate system of the point cloud 710. The user may intend to move user input 12 in the positive x-axis direction, but the point may instead be moved along an unintended axis (e.g., the y axis), so that user input 13 is received.

Referring to FIG. 7B, suppose that the 2D point group 720 is displayed through an output interface (e.g., a display); that is, suppose that user inputs 14 and 15 are received in a two-dimensional coordinate system. When the user moves user input 14 in the positive x-axis direction, no depth value needs to be considered, so the point moves along the intended axis and user input 15 is received as intended.

According to one embodiment, because the point cloud is converted into a two-dimensional coordinate system by the 2D conversion module 120 before being presented to the user, user input can be received quickly, easily, and accurately.
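A minimal sketch of such a conversion is given below. It assumes the 2D view simply keeps each point's (x, y) coordinates and discards z, which is consistent with the later use of (x_i, y_i) in the click encoding but is not spelled out as the only option. The helper names `to_2d` and `nearest_point_index` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def to_2d(points: Sequence[Point3D]) -> List[Point2D]:
    """Drop z so that every point is displayed at its (x, y) position."""
    return [(x, y) for x, y, _z in points]


def nearest_point_index(click: Point2D, points_2d: Sequence[Point2D]) -> int:
    """Return the index of the displayed point closest to a 2D click."""
    return min(range(len(points_2d)),
               key=lambda i: math.dist(click, points_2d[i]))


cloud = [(0.0, 0.0, 1.5), (4.0, 1.0, 0.2), (4.2, 1.1, 3.0)]
flat = to_2d(cloud)
picked = nearest_point_index((4.0, 0.9), flat)
```

With the depth axis removed, a click is matched against only two coordinates, so it cannot drift along a third axis the user did not intend to change.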

FIGS. 8A and 8B are diagrams illustrating the operation of the encoding module according to an embodiment. Referring to FIGS. 1A, 1B, 3, and 5 together with FIGS. 8A and 8B, the encoding module 140 may encode user input (i.e., positive points and/or negative points). The encoded user input may represent a distance-based heatmap.

The encoding module 140 may encode the user input based on a point cloud composed of N given points. For example, the point cloud may be expressed as [expression not reproduced in this text]. For example, the user input may be expressed, as described above, as [expression not reproduced in this text]. The encoded user input may be expressed as E_k, and the method of encoding the user input may follow Equation 1.

[Equation 1 is not reproduced in this text.]

Referring to Equation 1, d is defined as the two-dimensional Euclidean distance between p_k and (x_i, y_i), that is, $d = \lVert p_k - (x_i, y_i) \rVert_2$. A hyperparameter (its symbol is not reproduced in this text) may control the distance threshold.
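Because Equation 1 itself is not available in this text, the following sketch only illustrates the general idea of a distance-based heatmap. The clipped linear falloff `1 - min(d, tau) / tau` is an assumed functional form chosen for illustration, not the formula of the disclosure. The only parts taken from the description are the 2D Euclidean distance between the click and each point's (x, y) position and the existence of a distance-threshold hyperparameter, here called `tau` (a hypothetical name).

```python
import math
from typing import List, Sequence, Tuple


def encode_click(click_xy: Tuple[float, float],
                 points: Sequence[Sequence[float]],
                 tau: float) -> List[float]:
    """Encode one click as a per-point heatmap value in [0, 1].

    ASSUMPTION: the value falls off linearly with 2D distance and is zero
    at or beyond the distance threshold tau.
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    encoded = []
    for x, y, *_rest in points:  # z and any extra channels are ignored
        d = math.hypot(x - click_xy[0], y - click_xy[1])
        encoded.append(1.0 - min(d, tau) / tau)
    return encoded


cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 5.0), (4.0, 0.0, 0.0)]
E_k = encode_click((0.0, 0.0), cloud, tau=2.0)
```

Under this assumed form, the second point receives a nonzero value even though its z coordinate differs greatly, because only the 2D distance enters the encoding.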

In one embodiment, the user input may be encoded separately for each class. The number of classes is defined as C and may be, for example, [value not reproduced in this text]. The user input encoded for each class may be expressed as U_c and may follow Equation 2.

[Equation 2 is not reproduced in this text.]

Once the C encoded user inputs have been generated, the per-class encoded user inputs (i.e., [expression not reproduced in this text]) may be concatenated to the point cloud.
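The per-class step can be sketched as follows. Since Equation 2 is not available in this text, merging the heatmaps of several clicks of the same class with a pointwise maximum is an assumption made for illustration, as are the helper names `per_class_channels` and `concat_to_points`. What the description does state is that there are C class-wise encodings and that they are concatenated to the point cloud.

```python
from typing import Dict, List, Sequence, Tuple


def per_class_channels(encoded_by_class: Dict[int, List[List[float]]],
                       num_classes: int,
                       num_points: int) -> List[List[float]]:
    """Build one channel per class from the click heatmaps of that class.

    ASSUMPTION: heatmaps of the same class are merged with a pointwise max;
    a class without clicks gets an all-zero channel.
    """
    channels = []
    for c in range(num_classes):
        maps = encoded_by_class.get(c, [])
        if maps:
            channels.append([max(values) for values in zip(*maps)])
        else:
            channels.append([0.0] * num_points)
    return channels


def concat_to_points(points: Sequence[Sequence[float]],
                     channels: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    """Append every class channel value to the matching point."""
    return [tuple(p) + tuple(ch[i] for ch in channels)
            for i, p in enumerate(points)]


cloud = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
heatmaps = {0: [[1.0, 0.0], [0.2, 0.5]]}   # two clicks of class 0, none of class 1
U = per_class_channels(heatmaps, num_classes=2, num_points=len(cloud))
model_input = concat_to_points(cloud, U)
```

Each point then carries its coordinates followed by C extra values, so the network input has a fixed width regardless of how many clicks the user has made.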

Referring to FIG. 8A, the user input may be at least one point corresponding to objects 16 and 17 of the vehicle class. Referring to FIG. 8B, the encoding module 140 may generate distance-based heatmaps 18 and 19 corresponding to the objects 16 and 17 of the vehicle class. Values corresponding to the distance-based heatmaps 18 and 19 may be concatenated to the point cloud and input to the artificial intelligence model 150.

FIGS. 9A to 9D are diagrams illustrating an example operation of a spatial click propagation module according to an embodiment. Referring to FIGS. 3 and 5 together with FIGS. 9A to 9D, information about the object corresponding to the user input may be used to detect objects of the same class in the point cloud.

Referring to FIG. 9A, values corresponding to the encoded user input (that is, the encoded positive points or the encoded negative points) may be input to the point encoder 310. For example, the encoded user input may correspond to the distance-based heatmap indicated by the arrow in FIG. 9A.

Referring to FIG. 9B, information corresponding to the encoded user input may be sampled through the downsampling layer of the point encoder 310. Referring to FIG. 9C, the spatial click propagation module 330 may receive the downsampled user input. The spatial click propagation module 330 may calculate the similarity between the points corresponding to the downsampled user input and the points of the downsampled point cloud. The spatial click propagation module 330 may generate a correlation map based on the similarity.
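The similarity computation can be illustrated with a short sketch. The disclosure does not name the similarity measure, so cosine similarity between feature vectors is used here purely as an assumption, and the function name, the array shapes, and the max-over-clicks reduction are hypothetical.

```python
import numpy as np

def correlation_map(point_feats, click_feats, eps=1e-8):
    """Score every downsampled point against the clicked points.

    point_feats: (M, F) features of the downsampled point cloud.
    click_feats: (K, F) features of the points matching the user input.
    Returns an (M,) array: for each point, its highest cosine
    similarity to any clicked point (assumed measure, see lead-in).
    """
    p = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + eps)
    c = click_feats / (np.linalg.norm(click_feats, axis=1, keepdims=True) + eps)
    sim = p @ c.T            # (M, K) pairwise cosine similarities
    return sim.max(axis=1)   # keep the best match per point

feats = np.array([[1.0, 0.0],
                  [0.9, 0.1],
                  [0.0, 1.0]])
click = feats[[0]]           # the user clicked the first point
corr = correlation_map(feats, click)
```

Points whose features resemble the clicked object receive a high score, which is how information from one click could reach other objects of the same class.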

Referring to FIG. 9D, not only the points corresponding to the user input but also the points determined by the spatial click propagation module 330 may be inferred as three-dimensional bounding boxes of the corresponding class.

FIG. 10 is a flowchart illustrating an operating method of an electronic device according to an embodiment. Descriptions that overlap with those given for FIGS. 1 to 9D are omitted. For convenience of explanation, FIG. 10 is described with reference to FIGS. 1A and 2.

Referring to FIG. 10, steps S1010 to S1060 may be performed by the electronic device 100, 200 or by the processor 230 of the electronic device 200. The operating method of the electronic device 100, 200 according to the present disclosure is not limited to what is shown in FIG. 10; any of the steps shown in FIG. 10 may be omitted, and steps not shown in FIG. 10 may be further included.

In an embodiment, steps S1010 to S1060 may correspond to the initial iteration of the artificial intelligence model 150 that infers the three-dimensional bounding box.

In step S1010, the electronic device 100, 200 may obtain a point cloud corresponding to a three-dimensional scene including a plurality of objects. For example, the point cloud may be received from an external device or a server. As another example, the point cloud may be acquired by a sensor of the electronic device 100, 200.

In step S1020, the electronic device 100, 200 may convert the point cloud into a two-dimensional coordinate system. For example, the point cloud may be expressed in an (x, y, z) coordinate system, and the converted point cloud may be expressed in an (x, y) coordinate system. In an embodiment, the electronic device 100, 200 may project the points of the point cloud to convert them into the two-dimensional coordinate system.
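A minimal sketch of this conversion is given below. The disclosure says only that the points are projected from (x, y, z) to (x, y); the sketch assumes a top-down orthographic projection that drops z, and the pixel mapping used for display (ranges, resolution, image origin at the top-left) is a hypothetical choice rather than part of the disclosure.

```python
import numpy as np

def project_to_2d(points_xyz):
    """Assumed top-down projection: keep x and y, drop z."""
    return points_xyz[:, :2].copy()

def to_pixels(points_xy, x_range, y_range, resolution):
    """Map metric (x, y) coordinates to integer (col, row) pixels.

    x_range, y_range: (min, max) extents covered by the image.
    resolution:       metres per pixel (hypothetical parameter).
    The image origin is the top-left corner, so rows grow as y shrinks.
    """
    cols = np.floor((points_xy[:, 0] - x_range[0]) / resolution).astype(int)
    rows = np.floor((y_range[1] - points_xy[:, 1]) / resolution).astype(int)
    return np.stack([cols, rows], axis=1)

cloud = np.array([[1.0, 2.0, 0.3],
                  [-1.0, 0.0, 1.2]])
xy = project_to_2d(cloud)
pix = to_pixels(xy, x_range=(-2.0, 2.0), y_range=(-2.0, 2.0), resolution=0.5)
```

Keeping the mapping invertible matters for the later steps, because a click on the displayed image has to be mapped back to an (x, y) position in the point cloud before it can be encoded.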

In step S1030, the electronic device 100, 200 may display the converted point cloud on the display 130. For example, the electronic device 100, 200 may output an image corresponding to the converted point cloud to the display 130.

In step S1040, the electronic device 100, 200 may obtain a first user input including at least one point, in the converted point cloud, that corresponds to the plurality of objects. For example, the first user input may be at least one positive point. However, the present disclosure is not limited thereto, and the first user input may also include at least one negative point.

In step S1050, the electronic device 100, 200 may encode the first user input based on the point cloud. In an embodiment, the encoded user input may be concatenated to the point cloud and input to the artificial intelligence model 150.

In step S1060, the electronic device 100, 200 may infer a three-dimensional bounding box corresponding to the plurality of objects in the point cloud, using the artificial intelligence model 150 that takes the point cloud and the encoded first user input as input.

FIG. 11 is a flowchart illustrating an operating method of an electronic device according to an embodiment. Descriptions that overlap with those given for FIGS. 1 to 10 are omitted. For convenience of explanation, FIG. 11 is described with reference to FIGS. 1B and 2.

Referring to FIG. 11, after step S1060 of FIG. 10, steps S1110 to S1140 may be performed by the electronic device 100, 200 or by the processor 230 of the electronic device 200. The operating method of the electronic device 100, 200 according to the present disclosure is not limited to what is shown in FIG. 11; any of the steps shown in FIG. 11 may be omitted, and steps not shown in FIG. 11 may be further included.

In an embodiment, steps S1110 to S1140 may correspond to one of the iterations that follow the initial iteration of the artificial intelligence model 150 that infers the three-dimensional bounding box.

In step S1110, the electronic device 100, 200 may display the converted point cloud and the inferred three-dimensional bounding box on the display 130. For example, the electronic device 100, 200 may generate a two-dimensional image using the converted point cloud together with data obtained by converting the inferred three-dimensional bounding box into two dimensions. The electronic device 100, 200 may display the two-dimensional image on the display 130.

In step S1120, the electronic device 100, 200 may obtain a second user input including at least one of: a point that corresponds to the inferred three-dimensional bounding box but does not correspond to the plurality of objects (i.e., a negative point), and a point that does not correspond to the inferred three-dimensional bounding box but corresponds to the plurality of objects (i.e., a positive point). The number of negative points and the number of positive points may each be at least one.

In step S1130, the electronic device 100, 200 may encode the second user input based on the point cloud. In an embodiment, the electronic device 100, 200 may encode the first user input of the previous iteration together with the second user input. In an embodiment, the electronic device 100, 200 may concatenate the encoded first user input and the encoded second user input to the point cloud.

In step S1140, the electronic device 100, 200 may infer a three-dimensional bounding box corresponding to the plurality of objects in the point cloud, using the artificial intelligence model 150 that takes the point cloud, the encoded first user input, and the encoded second user input as input.

In an embodiment, the electronic device 100, 200 may repeat the operations corresponding to steps S1110 to S1140 until a predefined number of iterations is reached or a user input indicating the end of the iterations is received.
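The iteration control described above can be sketched as a simple loop. Everything named here is hypothetical: `get_clicks`, `encode`, and `model` stand in for the display and input handling, the encoding module, and the artificial intelligence model, and `max_iters` plays the role of the predefined number of iterations. The sketch shows only the control flow, not the author's implementation.

```python
def interactive_detection(points, get_clicks, encode, model, max_iters=5):
    """Run click-driven detection until the user stops or the cap is hit.

    get_clicks(boxes) shows the current boxes (None on the first pass)
    and returns a list of new clicks, or None when the user ends the
    session. Clicks from earlier iterations are kept and re-encoded
    together with the new ones.
    """
    clicks = []
    boxes = None
    for _ in range(max_iters):          # predefined iteration cap
        new_clicks = get_clicks(boxes)  # display results, collect input
        if new_clicks is None:          # user input ending the iterations
            break
        clicks.extend(new_clicks)       # accumulate across iterations
        boxes = model(points, encode(points, clicks))
    return boxes, clicks

# Stub session: two rounds of clicks, then the user ends the session.
script = iter([[(0.0, 0.0, 0)], [(5.0, 5.0, 0)], None])
boxes, clicks = interactive_detection(
    points=[(0.0, 0.0, 0.0)],
    get_clicks=lambda boxes: next(script),
    encode=lambda pts, cl: list(cl),
    model=lambda pts, enc: [f"box_for_{len(enc)}_clicks"],
)
```

The first pass through the loop corresponds to the initial iteration of FIG. 10, and each later pass to one round of steps S1110 to S1140.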

In an embodiment, the method according to embodiments of the present disclosure may be provided as part of a computer program product. The computer program product may be distributed in the form of a machine-readable recording medium. Alternatively, the computer program product may be distributed through an application store or directly between a plurality of devices.

In the embodiments described above, components according to the technical idea of the present disclosure have been described using terms such as first, second, and third. However, such terms are used only to distinguish components from one another and do not limit the technical idea of the present disclosure. Terms such as first, second, and third do not imply an order or any form of numerical meaning.

The embodiments described above are specific embodiments for carrying out the present disclosure. The present disclosure should be understood to include not only the embodiments described above but also embodiments that can be obtained from them through simple design changes or easy modification. Accordingly, the scope of the present disclosure should not be limited to the embodiments described above, but should be determined by the claims set forth below and their equivalents.

Claims

An electronic device for performing 3D object detection using an artificial intelligence model, the electronic device comprising:
a display;
a memory storing one or more instructions; and
at least one processor configured to execute the one or more instructions stored in the memory, wherein the at least one processor executes the one or more instructions to:
obtain a point cloud corresponding to a three-dimensional scene including a plurality of objects,
convert the point cloud into a two-dimensional coordinate system,
display the converted point cloud on the display,
obtain a first user input including at least one point, in the converted point cloud, corresponding to the plurality of objects,
encode the first user input based on the point cloud, and
infer a three-dimensional bounding box corresponding to the plurality of objects in the point cloud, using an artificial intelligence model that takes the point cloud and the encoded first user input as input.

The electronic device of claim 1, wherein the at least one processor further executes the one or more instructions to:
display the converted point cloud and the inferred three-dimensional bounding box on the display,
obtain a second user input including at least one of a point that corresponds to the inferred three-dimensional bounding box but does not correspond to the plurality of objects, and a point that does not correspond to the inferred three-dimensional bounding box but corresponds to the plurality of objects,
encode the second user input based on the point cloud, and
infer a three-dimensional bounding box corresponding to the plurality of objects in the point cloud, using an artificial intelligence model that takes the point cloud, the encoded first user input, and the encoded second user input as input.

The electronic device of claim 1, wherein the artificial intelligence model comprises:
a point encoder that extracts features based on the point cloud and the encoded first user input;
a centroid aggregation module that predicts center points corresponding to the plurality of objects based on the extracted features and groups points corresponding to the plurality of objects based on the predicted center points;
a spatial click propagation module that calculates information about objects of the same class as the object corresponding to the first user input, based on the extracted features and the encoded first user input; and
a detection head module that outputs the three-dimensional bounding box corresponding to the plurality of objects based on the output of the centroid aggregation module and the output of the spatial click propagation module.

The electronic device of claim 3,
wherein the point encoder includes at least one fully connected layer and at least one downsampling layer, and
wherein the at least one processor
further executes the one or more instructions to concatenate the encoded first user input to an output of the at least one downsampling layer.

The electronic device of claim 3,
wherein the spatial click propagation module
calculates a similarity between information corresponding to the encoded first user input and the extracted features, and
wherein the detection head module
obtains data in which the similarity is concatenated to the output of the centroid aggregation module, and
detects an object of the same class as the object corresponding to the first user input based on the obtained data.

The electronic device of claim 3,
wherein the artificial intelligence model is trained to output a three-dimensional bounding box corresponding to an object, based on a point cloud dataset, at least one positive point corresponding to an object in the point cloud dataset, and at least one negative point corresponding to a background in the point cloud dataset.

The electronic device of claim 6,
wherein the artificial intelligence model
further comprises a negative click simulation module that, during training of the artificial intelligence model and based on ground truth values and the extracted features, assigns as the at least one negative point those points, among background points not corresponding to the plurality of objects, whose foreground scores exceed a threshold.

The electronic device of claim 1,
wherein the first user input includes object location information and object class information.

The electronic device of claim 8,
wherein the one or more instructions for encoding the first user input
include one or more instructions for encoding the first user input based on a distance between (x, y) location information of the point cloud and the object location information of the first user input.

The electronic device of claim 1,
wherein the artificial intelligence model receives, as input, a value obtained by concatenating the encoded first user input to the point cloud.