KR20240073747A

KR20240073747A - Method of processing image data and Electronic device for performing the Method

Info

Publication number: KR20240073747A
Application number: KR1020230111343A
Authority: KR
Inventors: 저우 이; 박승인; 유병인; 정상일; 장 후이
Original assignee: 삼성전자주식회사
Priority date: 2022-11-18
Filing date: 2023-08-24
Publication date: 2024-05-27
Also published as: CN118057477A

Abstract

Embodiments of the present application provide an image processing method, an electronic device, and a storage medium. The image processing method includes obtaining video features of a video that includes a plurality of video frames, determining a target object representation of the video based on the video features using a neural network, and determining a panoramic segmentation result of the video based on the target object representation. By predicting the panoramic segmentation result of the video through a clip-level target object representation, embodiments of the present application can effectively simplify the network structure and improve segmentation accuracy and robustness. The image processing method performed in the electronic device may be performed using an artificial intelligence model.

Description

Image processing method and electronic device for performing the method {Method of processing image data and Electronic device for performing the Method}

This application relates to an image processing method, an electronic device, a storage medium, and a program product, and more specifically to the field of image processing using artificial intelligence technology.

Panoramic segmentation is the process of assigning label information to each pixel of a two-dimensional image. Panoramic segmentation of video extends panoramic segmentation into the time domain: in addition to the panoramic segmentation of each image, it incorporates an object tracking task, that is, assigning the same label to pixels that belong to the same instance in different images.

In existing video panoramic segmentation techniques, the accuracy of panoramic segmentation is low when the representation of a panoramic object is determined for a single-frame image. In addition, an extra tracking module is needed to obtain correspondence information between the video frames of a video, which makes the network structure complex.

Embodiments of the present application provide an image processing method, an electronic device, and a storage medium that aim to simplify the network structure and improve the accuracy and robustness of panoramic segmentation. The technical solution is as follows.

According to one embodiment, an image processing method using a panoramic segmentation model includes obtaining video features of a video that includes a plurality of video frames, determining a target object representation of the video based on the video features using a neural network, and determining a panoramic segmentation result of the video based on the target object representation. The determining of the target object representation of the video based on the video features using the neural network may include determining the target object representation of the video by performing a plurality of iterations of processing on the video features using the neural network.

The determining of the target object representation of the video by performing the plurality of iterations of processing on the video features using the neural network may include, using the neural network, performing an iteration based on the video features and the object representation from the previous iteration of the video to determine the object representation of the current iteration of the video.

For the first of the plurality of iterations, the object representation from the previous iteration may be a pre-configured initial object representation.
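The iterative scheme described above can be sketched as a simple loop. This is a minimal illustration, not the patented network: the function names, the tensor shapes, and the `toy_update` rule that stands in for the neural network are all assumptions made for the example.

```python
import numpy as np

def refine_object_representation(video_features, initial_queries, update_fn, num_iterations=3):
    """Each pass consumes the video features and the previous pass's object representation."""
    queries = initial_queries  # first pass: the pre-configured initial object representation
    for _ in range(num_iterations):
        queries = update_fn(video_features, queries)
    return queries  # representation after the last pass (the target object representation)

def toy_update(video_features, queries):
    # Hypothetical stand-in for the neural network: move each query halfway
    # toward the mean feature vector of the clip.
    return 0.5 * queries + 0.5 * video_features.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
video_features = rng.normal(size=(2 * 4 * 4, 8))  # assumed layout: T*H*W positions, C channels
initial_queries = np.zeros((3, 8))                # assumed: 3 queries of length C = 8
target = refine_object_representation(video_features, initial_queries, toy_update)
```

Only the control flow is taken from the text: the first pass starts from a fixed initial representation, and every later pass starts from the output of the pass before it.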

The performing of the iteration based on the video features and the object representation from the previous iteration of the video to determine the object representation of the current iteration of the video may include: obtaining a mask by applying a transformation to the object representation from the previous iteration of the video; obtaining a first object representation by processing the video features, the object representation from the previous iteration, and the mask; and determining the object representation of the current iteration of the video based on the first object representation.

The obtaining of the first object representation by processing the video features, the object representation from the previous iteration, and the mask may include: obtaining a mask-related object representation by performing attention processing on the video features, the object representation from the previous iteration, and the mask; and obtaining the first object representation by performing self-attention processing and classification processing based on the mask-related object representation and the object representation from the previous iteration.

The obtaining of the mask-related object representation by performing the attention processing on the video features, the object representation from the previous iteration, and the mask may include: obtaining a second object representation based on key features corresponding to the video features, the object representation from the previous iteration, and the mask; determining, based on the second object representation, a first probability indicating an object category included in the video; and obtaining the mask-related object representation based on the first probability, value features corresponding to the video features, and the video features.
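One plausible reading of the key/mask/probability/value sequence above is standard masked cross-attention. The sketch below assumes that reading; the scaling by the square root of the channel count, the boolean mask convention, and the use of the raw features as both keys and values are assumptions for illustration, not details given in the text.

```python
import numpy as np

def masked_attention(queries, keys, values, mask):
    """queries: (L, C); keys, values: (N, C); mask: (L, N) boolean, True = may attend."""
    # Masked query-key scores (playing the role of the "second object representation").
    scores = queries @ keys.T / np.sqrt(queries.shape[1])
    scores = np.where(mask, scores, -1e9)
    # Softmax over positions (playing the role of the "first probability").
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=1, keepdims=True)
    # Probability-weighted sum of value features: the mask-related representation.
    return probs @ values, probs

rng = np.random.default_rng(0)
queries = rng.normal(size=(2, 4))   # object representation from the previous pass
feats = rng.normal(size=(6, 4))     # flattened features; identity key/value projection assumed
mask = np.ones((2, 6), dtype=bool)
mask[0, :] = False
mask[0, 3] = True                   # query 0 may only look at position 3
out, probs = masked_attention(queries, feats, feats, mask)
```

Because query 0 is restricted to a single position, its output equals the value feature at that position, which shows how the mask confines each query to its own region.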

The determining of the object representation of the current iteration of the video based on the first object representation may include: determining, based on the video features and the first object representation, an object representation corresponding to each video frame of at least one video frame; and determining the object representation of the current iteration of the video based on the first object representation and the determined object representation corresponding to each video frame.

The determining of the object representation corresponding to each video frame of the at least one video frame based on the video features and the first object representation may include: determining a fourth object representation based on the key features corresponding to the video features and the first object representation; determining, based on the fourth object representation, a second probability indicating the object category included in the video; and determining the object representation corresponding to each video frame of the at least one video frame based on the second probability and the value features corresponding to the video features.

The determining of the object representation of the current iteration of the video based on the first object representation and the determined object representation corresponding to each video frame may include: obtaining a third object representation corresponding to the video by performing classification processing and self-attention processing on the determined object representation corresponding to each video frame; and determining the object representation of the current iteration of the video based on the first object representation and the third object representation.

The determining of the panoramic segmentation result of the video based on the target object representation may include: performing linear transformation processing on the target object representation; determining mask information of the video based on the linearly transformed target object representation and the video features; and determining category information of the video based on the linearly transformed target object representation.
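The segmentation step above can be sketched as two linear projections of the target object representation, one combined with the video features to give per-pixel masks and one giving class scores. The weight matrices, the tensor layout, the zero threshold, and the choice of separate projections for the two branches are assumptions made for this example; the text does not state whether the branches share a transform.

```python
import numpy as np

def segmentation_head(queries, video_features, w_mask, w_cls):
    """queries: (L, C); video_features: (T, H, W, C); w_mask: (C, C); w_cls: (C, K)."""
    mask_embed = queries @ w_mask    # linear transform used for the mask branch
    # Dot product of every transformed query with every pixel feature in every frame.
    mask_logits = np.einsum("lc,thwc->lthw", mask_embed, video_features)
    class_logits = queries @ w_cls   # linear transform used for the category branch
    return mask_logits > 0, class_logits.argmax(axis=1)

queries = np.array([[1.0, 0.0], [0.0, 1.0]])
video_features = np.zeros((2, 1, 2, 2))   # T=2 frames, H=1, W=2, C=2
video_features[0, ..., 0] = 1.0
video_features[0, ..., 1] = -1.0
video_features[1, ..., 0] = -1.0
video_features[1, ..., 1] = 1.0
masks, categories = segmentation_head(queries, video_features, np.eye(2), np.eye(2))
```

Each query yields one mask spanning all frames of the clip, so the same object keeps the same index across frames without a separate tracking step.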

According to one embodiment, in a method of training a panoramic segmentation model performed by a processor, the panoramic segmentation model includes a first module and a second module, and the method includes: obtaining training data, the training data including a training video, first video features of the training video, and a sample panoramic segmentation result corresponding to the training video; obtaining second video features by changing the frame order of the first video features; determining, through the first module, a first predicted object representation and a second predicted object representation of the training video based on the first video features and the second video features, respectively; determining, through the second module, a first prediction result and a second prediction result of the training video based on the first predicted object representation and the second predicted object representation, respectively; and training the panoramic segmentation model using a target loss function based on the sample panoramic segmentation result, the first predicted object representation, the second predicted object representation, the first prediction result, and the second prediction result.

The training of the panoramic segmentation model using the target loss function based on the sample panoramic segmentation result, the first predicted object representation, the second predicted object representation, the first prediction result, and the second prediction result may include: determining a first similarity matrix based on the first predicted object representation and the second predicted object representation; determining a second similarity matrix based on the sample panoramic segmentation result, the first prediction result, and the second prediction result; and outputting the trained panoramic segmentation model when the target loss function, which is based on the first similarity matrix and the second similarity matrix, is determined to be at its minimum.
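The passage does not say how either similarity matrix or the loss is computed, so the following is only a sketch under stated assumptions: frame shuffling is a permutation along the time axis, the first matrix is cosine similarity between the two sets of predicted object representations, the second matrix marks query pairs assigned to the same ground-truth object (the assignment indices are hypothetical inputs), and the loss is mean squared error.

```python
import numpy as np

def shuffle_frames(video_features, order):
    """Reorder the time axis of features shaped (T, ...)."""
    return video_features[np.asarray(order)]

def cosine_similarity_matrix(a, b, eps=1e-8):
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def target_similarity_matrix(gt_ids_first, gt_ids_second):
    """Entry (i, j) is 1 when query i (original order) and query j (shuffled order)
    were assigned to the same ground-truth object; a negative id means unassigned."""
    a = np.asarray(gt_ids_first)[:, None]
    b = np.asarray(gt_ids_second)[None, :]
    return ((a == b) & (a >= 0)).astype(float)

def consistency_loss(q_first, q_second, gt_ids_first, gt_ids_second):
    s1 = cosine_similarity_matrix(q_first, q_second)
    s2 = target_similarity_matrix(gt_ids_first, gt_ids_second)
    return float(np.mean((s1 - s2) ** 2))

q_first = np.eye(2)            # predicted representations from the original frame order
q_second = np.eye(2)[::-1]     # same objects, emitted in swapped query slots
loss_match = consistency_loss(q_first, q_second, [0, 1], [1, 0])
loss_mismatch = consistency_loss(q_first, q_second, [0, 1], [0, 1])
frames = shuffle_frames(np.arange(3)[:, None], [2, 0, 1])
```

Under these assumptions the loss is near zero when queries that describe the same object have similar representations regardless of frame order, and large otherwise.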

The beneficial effects of the technical solutions provided in the embodiments of the present application are as follows:

The present application provides an image processing method, an electronic device, a storage medium, and a program product. Specifically, the obtained video features correspond to a video that includes at least two video frames; that is, the video features carry the feature information of at least two video frames. A neural network is then used to determine the target object representation of the video based on the video features, and the panoramic segmentation result of the video can be predicted based on the target object representation. The present application operates on clip-level quantities (for example, video features and object representations that correspond to at least two video frames), so correspondence information between the video frames of a video can be obtained without building an additional object tracking module into the network, which effectively simplifies the network structure. Furthermore, because the clip-level quantities themselves carry the time-domain information of the video, implementations of the present application can use the video information more completely, which helps improve the accuracy and robustness of panoramic segmentation.

To describe the technical solutions of the embodiments of the present application more clearly, the accompanying drawings used in describing the embodiments are briefly introduced below.
FIG. 1 is a flowchart of an image processing method according to an embodiment.
FIG. 2 is a schematic diagram of an example of panoramic segmentation according to an embodiment.
FIG. 3 is an example diagram of the process of determining a clip query visualization according to an embodiment.
FIG. 4 is a network architecture diagram of a panoramic segmentation model according to an embodiment.
FIG. 5 is a flowchart of a panoramic segmentation algorithm according to an embodiment.
FIG. 6A is a schematic diagram of the framework of a mask decoder according to an embodiment.
FIG. 6B is a schematic diagram of the framework of another mask decoder according to an embodiment.
FIG. 7A is a schematic diagram of the framework of a hierarchical interaction module according to an embodiment.
FIG. 7B is a schematic diagram of the framework of another hierarchical interaction module according to an embodiment.
FIG. 8 is a schematic diagram of the framework of the clip feature query interaction 401 shown in FIGS. 7A and 7B, according to an embodiment.
FIG. 9 is a schematic diagram of the framework of the clip feature query interaction 402 shown in FIG. 7B, according to an embodiment.
FIG. 10 is a schematic diagram of the structure of the mask attention module 501 shown in FIG. 8, according to an embodiment.
FIG. 11 is a schematic diagram of the structure of the frame query generation module 601 shown in FIG. 9, according to an embodiment.
FIG. 12 is a schematic diagram of the structure of the mutual attention module 605 shown in FIG. 9, according to an embodiment.
FIG. 13 is a schematic diagram of the structure of the segmentation head module 303 according to an embodiment.
FIG. 14 is a schematic diagram of the training process of a panoramic segmentation model according to an embodiment.
FIG. 15 is an effect comparison diagram according to an embodiment.
FIG. 16 is another effect comparison diagram according to an embodiment.
FIG. 17 is yet another effect comparison diagram according to an embodiment.
FIG. 18 is a schematic diagram of the structure of an electronic device according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only, and the embodiments may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of the present disclosure includes the modifications, equivalents, and substitutes that fall within the technical idea described through the embodiments.

Terms such as "first" and "second" may be used to describe various components, but these terms should be interpreted only as distinguishing one component from another. For example, a "first component" may be named a "second component", and similarly a "second component" may be named a "first component".

When a component is described as being "connected" to another component, it should be understood that it may be directly connected or coupled to that other component, or that other components may be present between them.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In the present disclosure, terms such as "include" or "have" specify the presence of the described features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

It will be apparent to those skilled in the art that the singular forms "a", "an", and "the" used herein may also include the plural forms unless specifically stated otherwise. The terms "include" and "contain" used in the embodiments of this specification mean that the corresponding feature may be implemented as the stated feature, information, data, step, operation, element, and/or component, and do not exclude other features, information, data, steps, operations, elements, components, and/or combinations thereof that are supported in the art. When one element is said to be "connected" or "coupled" to another element, the one element may be directly connected or coupled to the other element, or the connection between the two may be established through an intermediate element. In addition, "connection" or "coupling" in this specification may include a wireless connection or wireless coupling. The term "and/or" in this specification denotes at least one of the items it joins; for example, "A and/or B" may be implemented as "A", as "B", or as "A and B".

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the relevant art. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless explicitly so defined in the present disclosure.

Artificial intelligence (AI) is the theory, methods, technologies, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, AI is a comprehensive technology of computer science that seeks to understand the nature of intelligence and to produce new intelligent machines that can respond in a manner similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines so that machines have the capabilities of perception, reasoning, and decision-making. AI technology is a comprehensive discipline that spans a wide range of fields and includes both hardware-level and software-level technologies. The basic technologies of AI generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. AI software technology mainly includes computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and smart transportation.

Specifically, the present application relates to machine learning (ML) and computer vision (CV). Machine learning is an interdisciplinary field that draws on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied across the various fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and formal learning. Computer vision is the science of how to make machines "see". More specifically, it refers to machine vision that uses cameras and computers in place of human eyes to identify and measure targets, and that further applies graphics processing so that the computer-processed result is better suited to observation by the human eye or to detection by an instrument. As a scientific discipline, computer vision studies the related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/action recognition, 3D object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, autonomous driving, and smart transportation, and also includes common biometric technologies such as face recognition and fingerprint recognition.

The present disclosure proposes an image processing method, an electronic device, a storage medium, and a program product. Specifically, it can implement a video panoptic segmentation algorithm based on robust clip-object-centric representation learning (Robust Clip-object-centric Representation Learning for Video Panoptic Segmentation). Implementing this approach does not require an object tracking module, simplifies the algorithm structure, and at the same time improves segmentation accuracy and robustness to a certain degree.

Several exemplary embodiments are described below. The implementations may refer to one another or be combined, and identical terms, similar functions, and similar implementation steps that appear in different implementations are not described repeatedly.

FIG. 1 is a flowchart of an image processing method according to an embodiment. The method may be executed by any electronic device, such as a terminal or a server. The terminal may be a smartphone, tablet, laptop, desktop computer, smart speaker, smart watch, in-vehicle device, or the like. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms, but is not limited to these.

To better explain the embodiments, the panoramic segmentation scenario is described below with reference to FIGS. 2 and 3.

Image panoramic segmentation is the process of assigning label information (for example, semantic label and instance label information) to each pixel of a two-dimensional image. Image content can be divided into two categories. One is 'stuff', which refers to content for which different objects do not need to be distinguished, such as grass, sky, and buildings; for this content a semantic label is predicted, as in the semantic segmentation result of FIG. 2. The other is 'thing', which refers to content for which different objects do need to be distinguished, such as people and cars; for this content an instance label is predicted, as in the instance segmentation result of FIG. 2. The panoramic segmentation task can be viewed as a composite of semantic segmentation and instance segmentation, as in the panoramic segmentation result of FIG. 2.
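The composite nature of the task can be illustrated by merging a semantic label map and an instance label map into one per-pixel label. The encoding `class_id * 1000 + instance_id`, the class numbering, and the function name are assumptions for this example (a commonly used convention), not something the text specifies.

```python
import numpy as np

THING_CLASSES = [2]  # hypothetical numbering: 0 = sky (stuff), 1 = grass (stuff), 2 = car (thing)

def to_panoptic(semantic, instance, label_divisor=1000):
    """Combine per-pixel semantic and instance labels into one label map."""
    semantic = np.asarray(semantic)
    instance = np.asarray(instance)
    is_thing = np.isin(semantic, THING_CLASSES)
    # 'stuff' pixels keep instance id 0, so all pixels of one stuff class share a label;
    # 'thing' pixels keep their instance id, so individual objects stay separate.
    return semantic * label_divisor + np.where(is_thing, instance, 0)

semantic = [[0, 0, 2],
            [1, 2, 2]]
instance = [[0, 0, 1],
            [0, 2, 2]]
panoptic = to_panoptic(semantic, instance)
```

In this toy map the two cars receive different labels (2001 and 2002), while every sky pixel and every grass pixel shares a single label per class.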

Video panoramic segmentation is an extension of image panoramic segmentation into the time domain. Specifically, in addition to performing panoramic segmentation on each frame, video panoramic segmentation also incorporates an object tracking task: pixels that belong to the same instance in different frames are assigned the same label.

In one embodiment, for the video panoramic segmentation scenario, a clip-level object representation is proposed to represent the panoramic objects in any video clip. Specifically, the two kinds of content, 'stuff' and 'thing', are represented uniformly as panoramic objects. For 'stuff' content (e.g., sky, grass), all pixels of the same type in an image form one panoramic object (e.g., all pixels of the sky category form a sky panoramic object). For 'thing' content (e.g., pedestrians, cars), each individual forms its own panoramic object. The panoramic object representation of a single video frame may be called a frame query, that is, an object representation for a single image. When a video clip is processed, a panoramic object on a single frame can be treated as a panoramic object within the video clip, that is, an object representation of the video (the clip-object representation shown in FIG. 5), which is also called a clip query (Clip Query) in the embodiments of the present application. A clip query can be expressed as a vector of length C, where C may be a hyperparameter. Assuming there are L clip queries on a video clip, where L may be an integer greater than or equal to 0, all clip panoramic object representations on that video clip form an L×C matrix, that is, a clip-object-centric representation.

예를 들어 영상 클립에 포함된 클립 쿼리는 아래 [수학식 1]과 같이 나타낼 수 있다.For example, a clip query included in a video clip can be expressed as [Equation 1] below.

In [Equation 1], all clip queries of a video clip are expressed as L vectors of length C, and each vector represents one clip-level object. C is the vector dimension, a hyperparameter that can be used to control the complexity of a clip-level object.
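The equation itself is not reproduced in this text. A form consistent with the description above is sketched here as an assumption; the symbols S (the set of clip queries, matching the "clip query S" used later) and s_i (the i-th clip query) are chosen for illustration:

```latex
S = \{ s_1, s_2, \dots, s_L \}, \qquad s_i \in \mathbb{R}^{C}, \qquad \text{equivalently } S \in \mathbb{R}^{L \times C}
```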

선택적으로, 클립 쿼리는 네트워크 훈련 단계에서 무작위로 초기화할 수 있고 시간 영역 및 공간 정보와 같은 시공간 정보와의 상호 작용을 통해 점진적으로 최적화할 수 있는 일련의 학습 가능한 매개변수이다.Optionally, a clip query is a set of learnable parameters that can be randomly initialized during the network training phase and progressively optimized through interaction with spatiotemporal information such as time domain and spatial information.

구체적으로, 도 3에 도시된 바와 같이, 영상 프레임이 시간에 따라 진행함에 따라 4개의 클립 쿼리의 예시가 제공되며, 각 클립 쿼리는 각 영상 프레임의 특징맵에 일대일로 대응한다.Specifically, as shown in Figure 3, an example of four clip queries is provided as the video frame progresses over time, and each clip query corresponds one-to-one to the feature map of each video frame.

이하, 일 실시예의 첨부된 도면에 포함될 수 있는 벡터 및 텐서를 나타내는 일부 상수의 차원, 연산자 등에 대해 설명한다.Hereinafter, dimensions, operators, etc. of some constants representing vectors and tensors that may be included in the accompanying drawings of one embodiment will be described.

T: 영상 클립의 길이를 나타내며, 즉 영상 클립의 프레임 수이다.T: Indicates the length of the video clip, that is, the number of frames of the video clip.

L: 클립 쿼리의 최대 수를 나타내며 클립의 파노라마 객체 수이다.L: Represents the maximum number of clip queries and is the number of panoramic objects in the clip.

C: 특징맵 및 클립 쿼리의 채널 차원(channel dimension)을 나타낸다.C: Indicates the channel dimension of feature maps and clip queries.

H 및 W: 이미지(즉, 영상 프레임)의 해상도를 나타내며, H는 높이이고 W는 너비이다. 또한 2차원 이미지의 길이와 너비를 나타낼 수도 있다.H and W: Indicate the resolution of the image (i.e. video frame), where H is the height and W is the width. It can also indicate the length and width of a two-dimensional image.

nc: 총 범주 수를 나타낸다.nc: Indicates the total number of categories.

[operator symbol]: denotes an element-wise addition operation.

[operator symbol]: denotes a matrix multiplication operation.

이하, 일 실시 예에 따른 영상 처리 방법에 대해 설명한다.Hereinafter, an image processing method according to an embodiment will be described.

구체적으로, 도 1에 도시된 바와 같이, 일 실시 예에 따른 영상 처리 방법은 다음 단계 S101-S103를 포함한다.Specifically, as shown in FIG. 1, the image processing method according to one embodiment includes the following steps S101-S103.

Step S101: Obtain video features of a video, the video including at least two video frames.

Step S102: Determine a target object representation of the video based on the video features using a neural network.

Step S103: Determine a panoramic segmentation result of the video based on the target object representation.

Specifically, the present application may implement the above image processing method through a panoramic segmentation model. As shown in FIG. 4, the panoramic segmentation model may include a mask decoder and a segmentation head module. Optionally, as shown in FIG. 5, the panoramic segmentation model may further include a feature extraction module. That is, the panoramic segmentation model may process feature maps that another network has extracted from a video (also called a video clip, meaning data that contains at least two video frames), or it may itself extract features from the acquired video and then process the extracted video features.

The feature extraction module (clip feature extractor) may adopt a general-purpose feature extraction network structure, such as a backbone network (Res50-backbone) and a pixel decoder, and the present disclosure is not limited in this respect. Optionally, as shown in FIG. 5, the feature extraction module 301 may perform feature extraction on the input video clip to extract clip-level multi-scale features.

For all frames of the input video clip (t-T+1, t-T+2, …, t, where T is the length of the video clip, that is, the number of video frames it contains), video features (also called clip features) can be extracted by the feature extraction module. The feature extraction module may take multiple frames or a single frame as input. For a video that contains multiple video frames, the frames may be input to the feature extraction module together, or each frame may be input one at a time. Optionally, to simplify feature extraction and increase its speed, the video features may be extracted using the single-frame input method.

The mask decoder (Masked decoder) may consist of N hierarchical interaction modules (HIMs), where N is an integer greater than or equal to 1; the mask decoder may thus consist of a plurality of cascaded HIMs. For example, the hierarchical interaction modules at the different levels have the same structure but different parameters. Specifically, the mask decoder may determine the target object representation of the video based on the input video features.

The segmentation head (Segmentation Head) module may output the segmentation result of the panoramic objects in the video clip, such as the category, mask, and object ID, based on the target object representation. Specifically, the obtained mask is the mask of a clip panoramic object defined over multiple frames, and pixels that belong to the same mask in different frames indicate the correspondence between objects in those frames. This means that object IDs can be obtained automatically, without any matching or tracking operation between video frames. A mask can be understood as a template for an image filter. When a target object is extracted from a feature map, the image can be filtered through an n*n matrix so that the target object is highlighted; the value of n may be chosen based on factors such as the receptive field and accuracy, and may be set to, for example, 3*3, 5*5, or 7*7.
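How object IDs can fall out of clip-level masks without tracking can be sketched as follows. This is a minimal illustration, not the patented head: the per-pixel argmax over queries, the linear classifier `class_weights`, and the row order of `video_features` (frame, then height, then width) are all assumptions introduced here.

```python
import numpy as np

def segmentation_head(clip_queries, video_features, class_weights, T, H, W):
    """Toy segmentation head (hypothetical): one mask per clip query, spanning all T frames."""
    L = clip_queries.shape[0]
    # Mask logits for every query over every pixel of every frame: (L, T, H, W).
    mask_logits = (clip_queries @ video_features.T).reshape(L, T, H, W)
    # One category per clip query from an assumed linear classifier: (L,).
    class_ids = (clip_queries @ class_weights).argmax(axis=1)
    # Assumption: each pixel is assigned to the query with the highest mask logit.
    # The query index is the object ID, and it is shared across frames by construction.
    object_ids = mask_logits.argmax(axis=0)      # (T, H, W)
    categories = class_ids[object_ids]           # (T, H, W)
    return object_ids, categories

rng = np.random.default_rng(0)
T, H, W, C, L, nc = 2, 4, 5, 8, 3, 6
X = rng.standard_normal((T * H * W, C))          # clip features, rows ordered (t, h, w)
S_out = rng.standard_normal((L, C))              # clip query output of the mask decoder
W_cls = rng.standard_normal((C, nc))             # hypothetical classifier weights
object_ids, categories = segmentation_head(S_out, X, W_cls, T, H, W)
```

Because the same query index labels pixels in every frame, no matching step between frames is needed in this sketch.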

이하, 본 출원의 실시예에서의 상호작용 과정에 대해 설명한다.Hereinafter, the interaction process in the embodiments of the present application will be described.

일 실행 가능한 실시예에서, 단계 S102에서, 신경망을 사용하여 상기 영상 특징을 기반으로 상기 영상의 타겟 객체 표현을 결정하는 단계는, 다음 단계 A1을 포함한다.In one feasible embodiment, in step S102, determining a target object representation of the image based on the image features using a neural network includes the following step A1.

단계 A1: 신경망을 사용하여 상기 영상 특징에 대해 복수의 반복 처리를 수행하여 상기 영상의 타겟 객체 표현을 결정한다.Step A1: Perform multiple iterative processing on the image features using a neural network to determine a target object representation of the image.

Each iteration includes determining the object representation of the video for the current iteration based on the video features and the object representation of the video from the previous iteration.

선택적으로, 복수의 반복 처리 중 첫 번째 반복인 경우 상기 이전 반복 처리에 의한 객체 표현은 미리 구성된 초기 객체 표현이다.Optionally, in the case of the first iteration among a plurality of iterations, the object representation by the previous iteration is a pre-configured initial object representation.
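The iterative scheme of step A1 can be sketched as a simple loop. This is an illustration under assumptions: `run_mask_decoder` and `toy_module` are hypothetical names, and `toy_module` is a parameter-free stand-in for one iteration, not the actual module.

```python
import numpy as np

def run_mask_decoder(video_features, initial_queries, modules):
    """Each module receives the video features and the previous object representation."""
    queries = initial_queries              # pre-configured initial object representation
    for module in modules:
        queries = module(video_features, queries)
    return queries                         # target object representation

def toy_module(video_features, queries):
    """Hypothetical stand-in for one iteration: attend to the features, add a residual."""
    logits = queries @ video_features.T                    # (L, THW)
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return queries + weights @ video_features              # (L, C)

rng = np.random.default_rng(0)
T, H, W, C, L, N = 2, 4, 4, 8, 3, 3
X = rng.standard_normal((T * H * W, C))    # video features
S0 = rng.standard_normal((L, C))           # initial clip queries
S_out = run_mask_decoder(X, S0, [toy_module] * N)
```

In a trained model each entry of `modules` would carry its own parameters; the same function is reused here only to keep the example short.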

Optionally, the mask decoder may include one or more hierarchical interaction modules. When it includes two or more, the hierarchical interaction modules are cascaded in sequence, and the output of the module at one level serves as the input of the module at the next level. With this network structure, the mask decoder can perform multiple iterations over the video features. The number of iterations is related to the number of hierarchical interaction modules included in the mask decoder.

The hierarchical interaction module at each level processes the video features of the video and the object representation from the previous iteration (that is, the output of the hierarchical interaction module at the previous level), and outputs the object representation for the current iteration. For the hierarchical interaction module at the first level, the input includes the video features of the video and the pre-configured initial object representation. For the hierarchical interaction modules at the second and subsequent levels, the input includes the video features and the object representation output by the hierarchical interaction module at the previous level (that is, the object representation from the previous iteration).

Optionally, the initial object representation is the same for any video that requires panoramic segmentation processing.

As shown in FIGS. 4 and 5, the input of the first-level hierarchical interaction module in the mask decoder includes not only the video features extracted by the feature extraction module but also the initial clip query (also called the initial object representation, that is, parameters obtained through network training). The clip query input to the hierarchical interaction module at each subsequent level is the clip query output by the hierarchical interaction module at the previous level. Specifically, the mask decoder may perform a relational operation between the clip queries and the video features to obtain the clip queries specific to the video clip (that is, the target object representation, shown as the clip query output in FIG. 5).

본 출원 실시예의 알고리즘 프로세스는 클립 쿼리와 영상 클립에서 추출된 영상 특징 간의 관계 연산을 수행하고, 최종적으로 클립 파노라마 객체의 분할 결과를 얻을 때까지 각 클립 쿼리를 영상 클립 상의 특정 클립 파노라마 객체와 정렬한다. The algorithmic process of the present application embodiment performs relationship operations between clip queries and video features extracted from video clips, and aligns each clip query with a specific clip panorama object on the video clip until finally obtaining the segmentation result of the clip panorama object. .

일 실행 가능한 실시예에서, 도 6a에 도시된 바와 같이, 마스크 디코더는 N개의 동일한 계층적 상호작용 모듈을 포함할 수 있다. 선택적으로, 도 6b에 도시된 바와 같이, 계층적 상호작용 모듈은 제1 계층적 상호작용 모듈과 제2 계층적 상호작용 모듈로 구별될 수 있다. 즉, 마스크 디코더는 M개의 제1 계층적 상호작용 모듈과 N-M개의 제2 계층적 상호작용 모듈을 포함할 수 있고, M은 N보다 작거나 같다.In one feasible embodiment, as shown in Figure 6A, the mask decoder may include N identical hierarchical interaction modules. Optionally, as shown in FIG. 6B, the hierarchical interaction module may be divided into a first hierarchical interaction module and a second hierarchical interaction module. That is, the mask decoder may include M first hierarchical interaction modules and N-M second hierarchical interaction modules, where M is less than or equal to N.

도 7a에 도시된 바와 같이, 제1 계층적 상호작용 모듈의 네트워크 구조는 클립 특징 쿼리 상호작용 모듈(401)을 포함할 수 있고, 도 7b에 도시된 바와 같이, 제2 계층적 상호작용 모듈은 클립 특징 쿼리 상호작용 모듈(401) 및 클립 프레임 쿼리 상호작용 모듈(402)을 포함할 수 있다. 즉, 제1 계층적 상호작용 모듈은 제2 계층적 상호작용 모듈의 단순화된 네트워크일 수 있다.As shown in Figure 7A, the network structure of the first hierarchical interaction module may include a clip feature query interaction module 401, and as shown in Figure 7B, the second hierarchical interaction module may include It may include a clip feature query interaction module 401 and a clip frame query interaction module 402. That is, the first hierarchical interaction module may be a simplified network of the second hierarchical interaction module.

선택적으로, 마스크 디코더는 제1 계층적 상호작용 모듈과 제2 계층적 상호작용 모듈의 임의의 조합으로 구성될 수 있다.Optionally, the mask decoder may be comprised of any combination of a first hierarchical interaction module and a second hierarchical interaction module.

Optionally, when the hierarchical interaction module adopted by the mask decoder includes the clip feature query interaction module 401, determining, in step A1, the object representation of the video for the current iteration based on the video features and the object representation of the video from the previous iteration includes the following steps A11-A13.

단계 A11: 이전 반복 처리에 의한 객체 표현에 대해 변환 처리하여 마스크를 얻는다.Step A11: Obtain a mask by converting the object expression from the previous iteration.

단계 A12: 상기 영상 특징, 상기 이전 반복 처리에 의한 객체 표현 및 상기 마스크를 처리하여 제1 객체 표현을 얻는다.Step A12: Process the image features, the object representation by the previous iterative process, and the mask to obtain a first object representation.

단계 A13: 상기 제1 객체 표현을 기반으로 현재 반복에 대한 객체 표현을 결정한다.Step A13: Determine the object representation for the current iteration based on the first object representation.

In step A11, the transformation of the object representation from the previous iteration may be handled by the mask branch of the network structure shown in FIG. 13. That is, a plurality of linear transformations are applied to the object representation from the previous iteration, and the result is then matrix-multiplied with the video features to obtain the transformed mask.
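The mask branch of step A11 can be sketched as below. FIG. 13 is not available in this text, so this is a sketch under assumptions: the ReLU between the linear layers and the thresholding that turns logits into a binary mask are not stated in the description, and all names are hypothetical.

```python
import numpy as np

def mask_branch(queries, video_features, layer_weights, threshold=0.0):
    """Several linear transformations of the queries, then a matrix product with the features."""
    h = queries
    for i, w in enumerate(layer_weights):
        h = h @ w
        if i < len(layer_weights) - 1:
            h = np.maximum(h, 0.0)         # assumption: ReLU between the linear layers
    mask_logits = h @ video_features.T     # (L, THW): one mask per query over all frames
    binary_mask = mask_logits > threshold  # assumption: binarize by thresholding
    return mask_logits, binary_mask

rng = np.random.default_rng(0)
T, H, W, C, L = 2, 4, 4, 8, 3
X = rng.standard_normal((T * H * W, C))                      # video features
S = rng.standard_normal((L, C))                              # previous object representation
weights = [rng.standard_normal((C, C)) * 0.3 for _ in range(3)]
mask_logits, binary_mask = mask_branch(S, X, weights)
```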

In one feasible embodiment, as shown in FIGS. 7A and 7B, in the current iteration, the data input to the clip feature query interaction module 401 may include the video features obtained in step S101 (clip features X), the object representation from the previous iteration (clip query S: if the current iteration is the first, clip query S may be the initial clip query shown in FIG. 5; otherwise, clip query S may be the clip query output by the hierarchical interaction module at the previous level), and a binary mask (clip mask MA) obtained by transforming the object representation from the previous iteration. After the clip query and the feature maps are processed, a clip query output, also called the first object representation, is obtained. Optionally, the first object representation may be used directly as the object representation for the current iteration.

클립 특징 쿼리 상호작용 모듈(401)에 의해 구현된 처리는 클립 특징의 모든 픽셀로부터 객체의 위치 및 외관 정보를 얻을 수 있다. 마스크를 사용하여 관련 없는 영역의 영향을 제거하고 학습 프로세스를 가속화할 수 있다.Processing implemented by the clip feature query interaction module 401 may obtain the location and appearance information of the object from every pixel of the clip feature. Masks can be used to remove the influence of irrelevant regions and accelerate the learning process.

일 실행 가능한 실시예에서, 단계 A12에서, 상기 영상 특징, 상기 이전 반복 처리에 의한 객체 표현 및 상기 마스크를 처리하여 제1 객체 표현을 얻는 단계는, 다음 단계 A121-A122를 포함한다. In one feasible embodiment, in step A12, processing the image features, the object representation by the previous iterative processing, and the mask to obtain a first object representation includes the following steps A121-A122.

단계 A121: 상기 영상 특징, 상기 이전 반복 처리에 의한 객체 표현 및 상기 마스크에 대해 어텐션 처리를 수행하여 마스크와 관련된 객체 표현을 얻는다.Step A121: Perform attention processing on the image feature, the object representation by the previous iteration process, and the mask to obtain an object representation related to the mask.

단계 A122: 상기 마스크와 관련된 객체 표현과 상기 이전 반복 처리에 의한 객체 표현을 기반으로 셀프 어텐션 처리 및 분류 처리를 수행하여 제1 객체 표현을 얻는다.Step A122: Obtain a first object representation by performing self-attention processing and classification processing based on the object representation related to the mask and the object representation by the previous iteration process.

In one feasible embodiment, as shown in FIG. 8, the input clip features X (video features), clip query S (object representation from the previous iteration), and clip mask MA (transformed from the object representation from the previous iteration) may be processed by the mask attention module 501 to obtain a refined clip query. The result of step A121 may then be added element by element to clip query S, and the sum normalized (implemented by the add-and-normalize module 502 shown in FIG. 8), giving normalized result A. Next, self-attention is applied to normalized result A by the self-attention module 503, the output is added element by element to normalized result A, and the sum is normalized (implemented by the add-and-normalize module 504 shown in FIG. 8), giving normalized result B. Then a feedforward operation (classification processing) is performed on normalized result B by the FFN (Feed Forward Network) module 505, the output is added element by element to normalized result B, and the sum is normalized (implemented by the add-and-normalize module 506 shown in FIG. 8) to obtain the clip query output (first object representation).
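The module order described for FIG. 8 (mask attention, self-attention, FFN, each followed by add-and-normalize) can be sketched as follows. This is a simplified illustration under assumptions: single-head attention without learned projections, layer normalization without learned scale or shift, a two-layer ReLU FFN, and a stand-in `simple_mask_attention`; none of these details are given in the description.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(q):
    # Q, K and V are all the (L, C) clip queries; learned projections omitted (assumption).
    return softmax(q @ q.T / np.sqrt(q.shape[-1])) @ q

def ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2

def clip_feature_query_interaction(X, S, MA, mask_attention, w1, w2):
    a = layer_norm(S + mask_attention(X, S, MA))   # modules 501 and 502 -> result A
    b = layer_norm(a + self_attention(a))          # modules 503 and 504 -> result B
    return layer_norm(b + ffn(b, w1, w2))          # modules 505 and 506 -> first object representation

def simple_mask_attention(X, S, MA):
    # Hypothetical stand-in for module 501: additive mask, softmax over all THW positions.
    return softmax(S @ X.T + MA) @ X

rng = np.random.default_rng(0)
T, H, W, C, L = 2, 3, 3, 8, 4
X = rng.standard_normal((T * H * W, C))
S = rng.standard_normal((L, C))
MA = np.zeros((L, T * H * W))                      # all-zero mask: nothing is suppressed
w1 = rng.standard_normal((C, 2 * C)) * 0.1
w2 = rng.standard_normal((2 * C, C)) * 0.1
first_repr = clip_feature_query_interaction(X, S, MA, simple_mask_attention, w1, w2)
```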

도 8에 도시된 바와 같이, 셀프 어텐션 모듈(503)에서, 입력된 Q, K 및 V는 클립 쿼리(정규화 처리된 결과 A)의 차원 L, C에 대응한다.As shown in Figure 8, in the self-attention module 503, the input Q, K, and V correspond to the dimensions L, C of the clip query (normalized result A).

Optionally, in the network structure shown in FIG. 8, the processing order of the FFN module 505 and the other modules that include an attention mechanism may be exchanged. For example, if the positions of the self-attention module 503 and the FFN module 505 are exchanged, the output of the add-and-normalize module 502 is used as the input of the FFN module 505 after the exchange.

선택적으로, 단계 A121에서, 상기 영상 특징, 상기 이전 반복 처리에 의한 객체 표현 및 상기 마스크에 대해 어텐션 처리를 수행하여 마스크와 관련된 객체 표현을 얻는 단계는, 다음 단계 A121a-A121c를 포함한다.Optionally, in step A121, performing attention processing on the image feature, the object representation by the previous iterative process, and the mask to obtain an object representation related to the mask includes the following steps A121a-A121c.

단계 A121a: 상기 영상 특징에 대응하는 키 특징, 상기 이전 반복 처리에 의한 객체 표현 및 상기 마스크를 기반으로 제2 객체 표현을 얻는다.Step A121a: Obtain a second object representation based on the key feature corresponding to the image feature, the object representation by the previous iterative process, and the mask.

단계 A121b: 상기 제2 객체 표현에 기초하여, 상기 영상에 포함된 객체 카테고리를 나타내는 제1 확률을 결정한다.Step A121b: Based on the second object representation, determine a first probability representing an object category included in the image.

단계 A121c: 상기 제1 확률, 상기 영상 특징에 대응하는 값 특징 및 상기 영상 특징을 기반으로 상기 마스크와 관련된 객체 표현을 얻는다.Step A121c: Obtain an object representation associated with the mask based on the first probability, a value feature corresponding to the image feature, and the image feature.

일 실행 가능한 실시예에서, 상기 단계 A121a 내지 A121c의 입력 및 출력은 텐서일 수 있고, 텐서의 차원은 도 10을 참조할 수 있다.In one feasible embodiment, the inputs and outputs of steps A121a to A121c may be tensors, and the dimensions of the tensors may refer to FIG. 10.

First, a linear operation is applied to each of the inputs Q (the clip query, that is, the object representation from the previous iteration, with dimensions L, C), K (the video features used as key features, with dimensions THW, C), and V (the video features used as value features, with dimensions THW, C). The linear operation modules 701, 702, and 703 shown in FIG. 10 may each include one or more linear layers. Next, the output of linear operation module 701 and the output of linear operation module 702 are matrix-multiplied by module 704, and the output of module 704 is added element by element to clip mask MA by module 705. The output of module 705 (second object representation) is converted by module 706, through a softmax operation (which may be performed along the THW dimension), into a first probability corresponding to categories. The first probability may represent all object categories included in the video; for example, the probability is x for the category car and y for the category lamppost. On this basis, the output of module 706 (the first probability) is matrix-multiplied by module 707 with the linearly transformed value features, and the result output by module 707 is superimposed element by element, by module 708, on the linearly transformed video features to obtain the final clip query output (mask-related object representation).
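The data flow through modules 701 to 707 can be sketched as below. This is an illustration under assumptions: each linear module is a single weight matrix, the clip mask MA is treated as an additive mask (0 where attention is allowed, a large negative value where it is suppressed), and the final superposition of module 708 and the optional position information are omitted. All names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mask_attention(Q, K, V, MA, Wq, Wk, Wv):
    q = Q @ Wq                       # module 701: (L, C)
    k = K @ Wk                       # module 702: (THW, C)
    v = V @ Wv                       # module 703: (THW, C)
    logits = q @ k.T                 # module 704: (L, THW)
    logits = logits + MA             # module 705: element-wise addition of the clip mask
    prob = softmax(logits, axis=-1)  # module 706: softmax along the THW dimension
    return prob @ v                  # module 707: (L, C)

rng = np.random.default_rng(0)
T, H, W, C, L = 2, 3, 3, 8, 4
THW = T * H * W
X = rng.standard_normal((THW, C))    # video features, used as both K and V
S = rng.standard_normal((L, C))      # clip query from the previous iteration
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.3 for _ in range(3))

# Query 0 is allowed to look at a single position j only; the other queries see everything.
j = 5
MA = np.zeros((L, THW))
MA[0, :] = -1e9
MA[0, j] = 0.0
out = mask_attention(S, X, X, MA, Wq, Wk, Wv)
```

With this mask, row 0 of `out` reduces to the transformed value feature at position `j`, which shows how the mask removes the influence of irrelevant regions.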

도 10에 도시된 위치 정보 (P_q) 및 (P_k)는 Q 및 K에 대응하는 위치 정보로서 범용 위치 정보 부호화 모듈에 의해 생성될 수 있고, 위치 정보는 마스크 어텐션 모듈(501)에 대해 선택적이다.The location information (P_q) and (P_k) shown in FIG. 10 are location information corresponding to Q and K and can be generated by a general-purpose location information encoding module, and the location information is optional for the mask attention module 501.

선택적으로, 마스크 디코더에 의해 채택된 계층적 상호작용 모듈이 클립 특징 쿼리 상호작용 모듈(401) 및 클립 프레임 쿼리 상호작용 모듈(402)을 포함할 때, 단계 A13에서, 상기 제1 객체 표현을 기반으로 현재 반복 처리에 의한 객체 표현을 결정하는 단계는, 다음 단계 A131-A132를 포함한다.Optionally, in step A13, when the hierarchical interaction module adopted by the mask decoder includes a clip feature query interaction module 401 and a clip frame query interaction module 402, based on the first object representation The step of determining the object representation by the current iteration process includes the following steps A131-A132.

단계 A131: 상기 영상 특징 및 상기 제1 객체 표현에 기초하여 적어도 하나의 영상 프레임 중 각각의 영상 프레임에 대응하는 객체 표현을 결정한다.Step A131: Determine an object representation corresponding to each image frame among at least one image frame based on the image feature and the first object representation.

단계 A132: 상기 제1 객체 표현과 결정된 영상 프레임에 대응하는 객체 표현을 기반으로 현재 반복 처리에 의한 객체 표현을 결정한다.Step A132: An object representation by the current iterative process is determined based on the first object representation and the object representation corresponding to the determined image frame.

Specifically, as shown in FIG. 9, the input clip features X (video features) and clip query S (the output of the clip feature query interaction module 401, that is, the first object representation) are processed by the frame query generation module 601 to obtain frame queries on frames [1, T], where T is the number of video frames included in the video. For each C-dimensional clip query vector, a C-dimensional frame query vector is obtained for each of the T frames, so the overall dimensions change from L, C to TL, C.

도 7b에 도시된 바와 같이, 현재의 반복 처리에서, 클립 특징 쿼리 상호작용 모듈(401)에 의해 출력된 클립 쿼리를 획득한 후, 해당 클립 쿼리와 프레임 쿼리(각 영상 프레임에 대응하는 객체 표현) 사이에서도 처리를 수행하여 현재 레벨의 계층적 상호작용 모듈의 최종 출력(현재 반복 처리에 의한 객체 표현)을 얻을 수 있다.As shown in Figure 7b, in the current iterative process, after obtaining the clip query output by the clip feature query interaction module 401, the corresponding clip query and frame query (object representation corresponding to each image frame) You can also perform processing between them to obtain the final output (object representation by the current iteration) of the hierarchical interaction module at the current level.

선택적으로, 제1 객체 표현은 하나 이상의 프레임 쿼리 또는 영상에 포함된 모든 영상 프레임에 각각 1:1 대응하는 프레임 쿼리로 처리될 수 있다.Optionally, the first object representation may be processed as one or more frame queries or frame queries each corresponding 1:1 to all image frames included in the image.

선택적으로, 단계 A131에서, 상기 영상 특징 및 상기 제1 객체 표현에 기초하여 적어도 하나의 영상 프레임 중 각각의 영상 프레임에 대응하는 객체 표현을 결정하는 단계는, 다음 단계 A131a-A131c를 포함한다.Optionally, in step A131, determining an object representation corresponding to each image frame of the at least one image frame based on the image features and the first object representation includes the following steps A131a-A131c.

단계 A131a: 상기 영상 특징에 대응하는 키 특징 및 상기 제1 객체 표현에 기초하여 제4 객체 표현을 결정한다.Step A131a: Determine a fourth object representation based on the key feature corresponding to the image feature and the first object representation.

단계 A131b: 상기 제4 객체 표현에 기초하여 상기 영상에 포함된 객체 카테고리를 나타내는 제2 확률을 결정한다.Step A131b: Determine a second probability representing an object category included in the image based on the fourth object representation.

단계 A131c: 상기 제2 확률 및 상기 영상 특징에 대응하는 값 특징에 기초하여 적어도 하나의 영상 프레임 중 각 영상 프레임에 대응하는 객체 표현을 결정한다.Step A131c: Determine an object representation corresponding to each image frame among at least one image frame based on the second probability and the value feature corresponding to the image feature.

구체적으로, 상술한 단계 A131a-A131c의 입력과 출력은 텐서가 될 수 있으며, 텐서의 차원은 도 11을 참조할 수 있다.Specifically, the input and output of the above-described steps A131a-A131c may be a tensor, and the dimension of the tensor may refer to FIG. 11.

First, a linear operation is applied to each of the inputs Q (the clip query, that is, the first object representation output by step A12, with dimensions L, C), K (the video features used as key features, with dimensions THW, C), and V (the video features used as value features, with dimensions THW, C). The linear operation modules 801, 802, and 803 shown in FIG. 11 may each include one or more linear layers. Next, a matrix multiplication is performed on the output of linear operation module 801 and the output of linear operation module 802 (by module 804), and the output of module 804 (fourth object representation) is converted by a softmax operation into a second probability corresponding to categories (by module 805; the second probability may represent the object categories included in the video). On this basis, a matrix multiplication is performed by module 806 on the linearly transformed value features and the output of module 805 (the second probability) to obtain the final frame query output (the object representation corresponding to each of the at least one video frame).

Optionally, a reshape operation may be performed on the input of module 806; the resulting change in tensor dimensions is shown in FIG. 11.

The position information P_q and P_k shown in FIG. 11 corresponds to Q and K, respectively, and may be generated by a general-purpose position encoding module. The position information is optional for the frame query generation module 601.
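
The data flow through modules 801-806 can be sketched as follows. This is a minimal NumPy illustration, not the implementation of FIG. 11: the function and parameter names are hypothetical, each linear operation module is reduced to a single weight matrix without bias, the position information is omitted, and the reshape before module 806 is assumed to regroup the attention scores per frame so that the softmax is taken over the H*W positions of each frame.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def frame_query_attention(clip_q, feats, w_q, w_k, w_v, num_frames):
    """clip_q: (L, C) clip query; feats: (T*H*W, C) image features (used as K and V)."""
    L, C = clip_q.shape
    hw = feats.shape[0] // num_frames
    q_lin = clip_q @ w_q                 # module 801: (L, C)
    k_lin = feats @ w_k                  # module 802: (THW, C)
    v_lin = feats @ w_v                  # module 803: (THW, C)
    scores = q_lin @ k_lin.T             # module 804: (L, THW)
    # Assumption: regroup scores and values per frame before module 806.
    scores = scores.reshape(L, num_frames, hw).transpose(1, 0, 2)  # (T, L, HW)
    probs = softmax(scores, axis=-1)     # module 805
    v_frames = v_lin.reshape(num_frames, hw, C)                    # (T, HW, C)
    out = probs @ v_frames               # module 806: (T, L, C)
    return out.reshape(num_frames * L, C)  # frame queries: (TL, C)

rng = np.random.default_rng(0)
T, H, W, L, C = 2, 3, 4, 5, 8
clip_q = rng.normal(size=(L, C))
feats = rng.normal(size=(T * H * W, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
frame_q = frame_query_attention(clip_q, feats, w_q, w_k, w_v, T)
print(frame_q.shape)  # (10, 8)
```

Under these assumptions each of the L clip queries produces one query per frame, which gives the TL, C dimensions that the later modules take as input.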

In one embodiment, determining, in step A132, the object representation of the current iteration based on the first object representation and the determined object representation corresponding to the image frames includes the following steps B1-B2.

Step B1: Perform classification processing and self-attention processing on the determined object representation corresponding to the image frames to obtain a third object representation corresponding to the image.

Step B2: Determine the object representation of the current iteration based on the first object representation and the third object representation.

Optionally, in the operation of the clip frame query interaction module 402, both the input and the output may be tensors, whose dimensions are shown in FIG. 9.

In one feasible embodiment, as shown in FIG. 9, the FFN module 602 performs a feedforward operation (classification processing) on the output of module 601 (the determined object representation corresponding to the image frames), and the self-attention module 603 performs self-attention processing on the output of module 602. The self-attention result is then added element-wise to the output of module 602, and the sum is normalized (implemented by the summation and normalization module 604 shown in FIG. 9) to obtain result C (the third object representation). Next, the mutual attention module 605 performs a mutual attention calculation on result C and the input clip query S; the output of module 605 has dimensions L, C, the same as the input clip query S. Finally, the output of module 605 is added element-wise to the clip query S, and the sum is normalized to obtain result D. Result D is the clip query output of the clip frame query interaction module 402 (the target object representation of the current iteration).

As shown in FIG. 9, in the self-attention module 603, the inputs Q, K, and V all have dimensions TL, C, matching the output of module 602.
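
The sequence of operations in the clip frame query interaction module 402 can be sketched as follows. This is a simplified NumPy illustration under stated assumptions rather than the structure of FIG. 9 itself: all names are hypothetical, the FFN is assumed to be two linear layers with a ReLU, the normalization is assumed to be layer normalization without learned parameters, and both attention steps omit their linear projections and position information.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Assumed normalization: per-row layer norm, no learned scale or shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q, k, v):
    """Plain dot-product attention; projections omitted for brevity."""
    return softmax(q @ k.T) @ v

def ffn(x, w1, w2):
    """Assumed FFN: linear, ReLU, linear."""
    return np.maximum(x @ w1, 0.0) @ w2

def clip_frame_interaction(frame_q, clip_q, w1, w2):
    """frame_q: (TL, C) output of module 601; clip_q: (L, C) clip query S."""
    ffn_out = ffn(frame_q, w1, w2)                 # module 602: (TL, C)
    self_out = attention(ffn_out, ffn_out, ffn_out)  # module 603: Q = K = V
    res_c = layer_norm(self_out + ffn_out)         # module 604: result C
    mutual = attention(clip_q, res_c, res_c)       # module 605: (L, C)
    res_d = layer_norm(mutual + clip_q)            # result D: (L, C)
    return res_d

rng = np.random.default_rng(0)
T, L, C = 2, 5, 8
frame_q = rng.normal(size=(T * L, C))
clip_q = rng.normal(size=(L, C))
w1 = rng.normal(size=(C, 2 * C)) * 0.1
w2 = rng.normal(size=(2 * C, C)) * 0.1
res_d = clip_frame_interaction(frame_q, clip_q, w1, w2)
print(res_d.shape)  # (5, 8)
```

The output has the same L, C dimensions as the clip query S, so it can be fed back as the clip query of the next iteration.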

Optionally, in the network structure shown in FIG. 9, the processing order of the FFN module 602 and the other modules that include an attention mechanism may be exchanged. For example, when the positions of the self-attention module 603 and the FFN module 602 are exchanged, the output of the frame query generation module 601 serves as the input of the self-attention module 603, and the output of the self-attention module 603 serves as the input of the FFN module 602.

Optionally, the specific network structure of the mutual attention module 605 shown in FIG. 9 may be as shown in FIG. 12.

First, a linear operation is performed on each of the inputs Q (the clip query, i.e. the first object representation output by the clip feature query interaction module 401, with dimensions L, C), K (the output of module 604 used as key features, with dimensions TL, C), and V (the output of module 604 used as value features, with dimensions TL, C). Each of the linear operation modules 1201, 1202, and 1203 shown in FIG. 12 may include one or more linear layers. Next, module 1204 performs a matrix multiplication of the output of linear operation module 1201 and the output of linear operation module 1202, and module 1205 applies a softmax operation to the output of module 1204, converting it into a third probability corresponding to the category. Finally, module 1206 performs a matrix multiplication of the linearly transformed value features and the output of module 1205 (the third probability), and the output of module 1206 is added to the linearly transformed input Q to obtain the mutual attention output of module 605.

The position information P_q and P_k shown in FIG. 12 corresponds to Q and K, respectively, and may be generated by a general-purpose position encoding module. The position information is optional for the mutual attention module 605.
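
The mutual attention described for FIG. 12 can be sketched as follows. This is an illustrative NumPy sketch with hypothetical names: each linear operation module is reduced to one weight matrix, and the optional position information is assumed to be added to Q and K before the linear operations, which the text does not specify.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mutual_attention(clip_q, frame_feats, w_q, w_k, w_v, p_q=None, p_k=None):
    """clip_q: (L, C) clip query; frame_feats: (TL, C) output of module 604."""
    # Assumption: optional position information is added to Q and K.
    q_in = clip_q if p_q is None else clip_q + p_q
    k_in = frame_feats if p_k is None else frame_feats + p_k
    q_lin = q_in @ w_q                   # module 1201: (L, C)
    k_lin = k_in @ w_k                   # module 1202: (TL, C)
    v_lin = frame_feats @ w_v            # module 1203: (TL, C)
    probs = softmax(q_lin @ k_lin.T)     # modules 1204 and 1205: (L, TL)
    # Module 1206, then add the linearly transformed Q.
    return probs @ v_lin + q_lin         # (L, C)

rng = np.random.default_rng(0)
T, L, C = 2, 5, 8
clip_q = rng.normal(size=(L, C))
frame_feats = rng.normal(size=(T * L, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
out = mutual_attention(clip_q, frame_feats, w_q, w_k, w_v)
print(out.shape)  # (5, 8)
```

Because the attention weights have shape L, TL, every clip query aggregates information from the frame queries of all T frames.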

The image panoramic segmentation process of the embodiments of the present application is described below.

Specifically, determining, in step S103, the panoramic segmentation result of the image based on the target object representation includes the following steps S103a-S103b.

Step S103a: Perform linear transformation processing on the target object representation.

Step S103b: Determine the mask information of the image based on the linearly transformed target object representation and the image features, and determine the category information of the image based on the linearly transformed target object representation.

In one feasible embodiment, as shown in FIG. 13, the inputs of the segmentation head module are the clip query S (the output of the mask decoder) and the clip feature X, and its outputs are the predicted mask information (the output of module 903) and the category information (the output of module 905). The segmentation head module can be divided into two branches: a mask branch consisting of modules 901, 902, and 903, and a category branch consisting of modules 904 and 905.

In the mask branch, a linear transformation is first applied to the clip query S to obtain the output of module 901, and a further linear transformation is applied to the output of module 901 to obtain the output of module 902. (Any number of linear operation modules, such as one, two, or three, may be placed in the mask branch; the present application does not limit this number.) A matrix multiplication of the output of module 902 and the clip feature X then yields the mask output. The mask output (mask information) shown in FIG. 13 is at the clip level, so object IDs are obtained automatically, without any matching or tracking between different image frames. That is, in the embodiments of the present application, the mask information output by the mask branch may include the masks and the object IDs.

In the category branch, a linear transformation is first applied to the clip query S to obtain the output of module 904, and a further linear transformation is applied to the output of module 904 to obtain the output of module 905. (Any number of linear operation modules, such as one, two, or three, may be placed in the category branch; the present application does not limit this number.) The output of module 905 is the category output (category information) shown in FIG. 13.
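
The two branches of the segmentation head can be sketched as follows. This is a minimal NumPy illustration with hypothetical names: each linear operation module is a single bias-free weight matrix, the clip feature X is assumed to be stored with shape (T, H, W, C), and the number of categories K is an arbitrary value chosen for the example.

```python
import numpy as np

def segmentation_head(clip_q, clip_feat, mask_ws, cls_ws):
    """clip_q: (L, C) clip query S; clip_feat: (T, H, W, C) clip feature X."""
    T, H, W, C = clip_feat.shape
    m = clip_q
    for w in mask_ws:                    # modules 901, 902 (any number allowed)
        m = m @ w
    # Module 903: matrix multiplication with the clip feature, (L, THW).
    masks = (m @ clip_feat.reshape(-1, C).T).reshape(-1, T, H, W)
    c = clip_q
    for w in cls_ws:                     # modules 904, 905
        c = c @ w
    return masks, c                      # (L, T, H, W) mask logits, (L, K) category logits

rng = np.random.default_rng(0)
T, H, W, L, C, K = 2, 3, 4, 5, 8, 6
clip_q = rng.normal(size=(L, C))
clip_feat = rng.normal(size=(T, H, W, C))
mask_ws = [rng.normal(size=(C, C)) * 0.1 for _ in range(2)]
cls_ws = [rng.normal(size=(C, C)) * 0.1, rng.normal(size=(C, K)) * 0.1]
masks, classes = segmentation_head(clip_q, clip_feat, mask_ws, cls_ws)
# Assumed read-out: the index of the strongest query at each pixel acts as the object ID.
ids = masks.argmax(axis=0)
print(masks.shape, classes.shape, ids.shape)  # (5, 2, 3, 4) (5, 6) (2, 3, 4)
```

Since one query yields a mask spanning all T frames, the query index can serve as an object ID that is the same in every frame, which is one way to read the statement that no inter-frame matching or tracking is needed.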

Optionally, when the image clip contains more than one frame, the mask information output by the mask branch is at the clip level, so the object IDs in all frames of the image are obtained automatically. The category information output by the category branch, in turn, is the semantic information of the objects (e.g., car, person).

The training process of the panoramic segmentation model of the embodiments of the present application is described below.

Specifically, when training the network, the prediction results (masks and categories) must be constrained to be similar to the ground-truth (GT, i.e. label) masks and categories, which can be achieved by minimizing a loss function. As shown in FIG. 14, a bipartite matching algorithm can be used to establish the correspondence between the predicted masks and the GT masks. The loss function used here includes several terms, such as a pixel-level cross-entropy loss, a panoramic-object-level panoramic quality loss, a mask ID loss, and a mask similarity loss (computed based on the Dice coefficient).
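
The matching between predicted masks and GT masks can be illustrated as follows. This is a toy sketch, not the matching procedure of FIG. 14: the function names are hypothetical, the Dice coefficient is assumed to be the only matching criterion, and the one-to-one assignment is found by brute force over permutations, which is practical only for a handful of masks. A polynomial-time solver such as the Hungarian algorithm would normally be used instead.

```python
from itertools import permutations

import numpy as np

def dice(a, b, eps=1e-6):
    """Dice coefficient between two binary masks."""
    inter = (a * b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def match(pred_masks, gt_masks):
    """Return best[g] = index of the prediction matched to GT mask g.

    Assumes len(pred_masks) >= len(gt_masks); brute force, toy sizes only.
    """
    sim = np.array([[dice(p, g) for g in gt_masks] for p in pred_masks])
    best, best_score = None, -1.0
    for perm in permutations(range(len(pred_masks)), len(gt_masks)):
        score = sum(sim[p, g] for g, p in enumerate(perm))
        if score > best_score:
            best, best_score = perm, score
    return list(best)

gt0 = np.zeros((4, 4)); gt0[:2, :2] = 1.0
gt1 = np.zeros((4, 4)); gt1[2:, 2:] = 1.0
pred = [gt1.copy(), np.zeros((4, 4)), gt0.copy()]
assignment = match(pred, [gt0, gt1])
print(assignment)  # [2, 0]
```

Once the assignment is fixed, the loss terms listed above can be evaluated between each GT mask and its matched prediction.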

In one feasible embodiment, the training of the panoramic segmentation model includes the following steps B1-B2.

Step B1: Obtain training data. The training data includes a training image, a first image feature of the training image, and a sample panoramic segmentation result corresponding to the training image.

Step B2: Train the panoramic segmentation model based on the training data to obtain a trained panoramic segmentation model. The following steps B21-B24 are performed during training.

Step B21: Change the frame order of the first image feature to obtain a second image feature.

Step B22: Through the hierarchical interaction module (i.e. the first module), determine a first predicted object representation and a second predicted object representation of the training image based on the first image feature and the second image feature, respectively.

Step B23: Through the segmentation head module (i.e. the second module), determine a first prediction result and a second prediction result of the training image based on the first predicted object representation and the second predicted object representation, respectively.

Step B24: Train the panoramic segmentation model using a target loss function, based on the sample panoramic segmentation result, the first predicted object representation, the second predicted object representation, the first prediction result, and the second prediction result.

Optionally, in the present application, the frame order of the clip feature map X (the first image feature shown in FIG. 14) can be permuted to obtain the permuted clip feature map X' (the second image feature shown in FIG. 14). X and X' are each passed through the hierarchical interaction module to obtain two sets of clip query outputs, out_S (the first predicted object representation) and out_S' (the second predicted object representation). During network training, clip query vectors in out_S and out_S' that correspond to the same clip panoramic object should be similar, and clip query vectors that correspond to different clip panoramic objects should be dissimilar. out_S and out_S' are each processed by the segmentation head module to obtain the first prediction result and the second prediction result, respectively.
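
Obtaining X' from X can be sketched as follows. This is a small NumPy illustration with hypothetical names; it assumes the clip feature map is stored with the frame axis first, as (T, H, W, C), and that the new frame order is a uniformly random permutation, which the text does not specify.

```python
import numpy as np

def permute_frames(clip_feat, rng):
    """clip_feat: (T, H, W, C). Returns the reordered features X' and the permutation used."""
    perm = rng.permutation(clip_feat.shape[0])
    return clip_feat[perm], perm

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 3, 8))        # X: T = 4 frames
x_prime, perm = permute_frames(x, rng)   # X': same frames, different order
print(x_prime.shape)  # (4, 3, 3, 8)
```

X and X' contain exactly the same frames, so the clip-level objects are the same in both, which is what makes it meaningful to compare out_S with out_S'.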

Optionally, training the panoramic segmentation model in step B24 using the target loss function, based on the sample panoramic segmentation result, the first predicted object representation, the second predicted object representation, the first prediction result, and the second prediction result, includes the following steps B241-B243.

Step B241: Determine a first similarity matrix between object representations based on the first predicted object representation and the second predicted object representation.

Step B242: Determine a second similarity matrix between segmentation results based on the sample panoramic segmentation result, the first prediction result, and the second prediction result.

Step B243: When the target loss function is determined to be at its minimum based on the first similarity matrix and the second similarity matrix, output the trained panoramic segmentation model.

Specifically, the target loss function (the clip contrastive loss function) can be expressed as [Equation 2] below.

Here, X is the similarity matrix computed between out_S and out_S' (the similarity between vectors can be computed with a conventional method such as cosine similarity), and Y is the ground-truth similarity matrix, which may be called the GT matrix (as shown in FIG. 14, an entry of Y is 1 if the two clip query vectors correspond to the same ground-truth clip panoramic object, and 0 otherwise). W is a weight matrix whose entries are 1 at positions where the two objects belong to the same category and 0 otherwise (in the present application, objects of different categories are considered clearly distinguishable and therefore need no supervision). In [Equation 2], the multiplication symbol "*" denotes element-wise multiplication, and Ave() denotes the average over all clip panoramic objects.
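
The quantities X, Y, and W defined above can be computed as in the following sketch. Note that [Equation 2] itself is not reproduced in this text, so the way the three matrices are combined here (a weighted squared difference averaged over all entries) is purely an assumption chosen to be consistent with the definitions, not the formula of the application. The function names and the ID and category arrays are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(a, b, eps=1e-8):
    """Pairwise cosine similarity between the rows of a and the rows of b."""
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    return a @ b.T

def clip_contrastive_loss(out_s, out_s_prime, ids, ids_prime, cats, cats_prime):
    """out_s, out_s_prime: (L, C) clip queries; ids/cats: GT object ID and category per query."""
    x = cosine_similarity_matrix(out_s, out_s_prime)              # X
    y = (ids[:, None] == ids_prime[None, :]).astype(float)        # Y (GT matrix)
    w = (cats[:, None] == cats_prime[None, :]).astype(float)      # W
    # Assumed combination: element-wise weighted squared error, then the average.
    return float((w * (x - y) ** 2).mean())

out_s = np.eye(3)                         # three mutually orthogonal query vectors
ids = np.array([0, 1, 2])
cats = np.array([0, 0, 0])                # all objects share one category
loss_same = clip_contrastive_loss(out_s, out_s, ids, ids, cats, cats)
loss_swapped = clip_contrastive_loss(out_s, out_s[[1, 0, 2]], ids, ids, cats, cats)
print(loss_same < loss_swapped)  # True
```

With this assumed form, the loss is close to zero when queries of the same object agree between the two frame orders, and grows when they do not; pairs from different categories are masked out by W.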

In the embodiments of the present application, the network structure can be implemented with a deep learning framework in Python. Compared with existing techniques, the network simplifies the network structure and effectively reduces computational complexity, while making full use of the image information to effectively improve segmentation accuracy. Accuracy here can be measured using video panoptic quality (VPQ).

To better illustrate the technical effects that the present application can achieve, the segmentation results shown in FIGS. 15 to 17 are described below.

As shown in FIG. 15, neither the panoramic segmentation result obtained with a conventional technique nor the sample panoramic segmentation result available for network training recognizes the car located between the cars marked by the two circles (some instances are missing). In the panoramic segmentation result obtained with the image processing method of the present disclosure, the car (front vehicle) located between the cars (rear vehicles) is identified.

As shown in FIG. 16, in the panoramic segmentation result of the conventional technique, the bicycle mask is incomplete, so no ID can be assigned to the bicycle; the smaller pedestrian in the third image frame cannot be identified; and the car (front vehicle) located between the cars (rear vehicles) in the first image frame cannot be identified. In the sample panoramic segmentation result used for network training, the car located between the cars in the first image frame (marked by a circle) is likewise not recognized. The panoramic segmentation result obtained with the image processing method provided by the embodiments of the present disclosure, however, effectively remedies these shortcomings of both the conventional technique and the sample panoramic segmentation result.

As shown in FIG. 17, the present disclosure obtains a clearer bicycle outline, which indicates that the present application can achieve better segmentation accuracy. Moreover, the segmentation results for the bicycle obtained by the present application are consistent across image frames, whereas in the sample panoramic segmentation result used for network training the segmentation results for the same bicycle differ from frame to frame, which indicates that the present application can achieve better robustness.

In an optional embodiment, an electronic device is provided. As shown in FIG. 18, the electronic device 4000 includes a processor 4001 and a memory 4003. The processor 4001 is connected to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between this electronic device and other electronic devices, such as transmitting and/or receiving data. It should be noted that, in practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.

The processor 4001 may be a CPU, a general-purpose processor, a DSP, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various example logic blocks, modules, and circuits described in the present application. The processor 4001 may also be a combination that realizes computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

The bus 4002 may include a path for transferring information between the above components. The bus 4002 may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. The bus 4002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is shown in FIG. 18, but this does not mean that there is only one bus or only one type of bus.

The memory 4003 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, or a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions. It may also be an EEPROM, a CD-ROM or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or other magnetic storage device, or any other computer-readable medium that can be used to carry or store a computer program; no limitation is imposed here.

The memory 4003 is used to store the computer program for executing the embodiments of the present application, and execution is controlled by the processor 4001. The processor 4001 is configured to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.

In the method provided by an embodiment, at least one of the plurality of modules may be implemented through an AI model. AI-related functions may be performed by a non-volatile memory, a volatile memory, and a processor.

The processor may include one or more processors. The one or more processors may be a general-purpose processor (e.g., a central processing unit (CPU) or an application processor (AP)), a graphics-dedicated processor (e.g., a graphics processing unit (GPU) or a visual processing unit (VPU)), and/or an AI-dedicated processor (e.g., a neural processing unit (NPU)).

The one or more processors control the processing of input data according to predefined operation rules or an artificial intelligence (AI) model stored in the non-volatile memory and the volatile memory. The predefined operation rules or AI model are provided through training or learning.

Here, being provided through learning means that a learning algorithm is applied to a plurality of training data to obtain predefined operation rules or an AI model with the desired characteristics. The learning may be performed on the device itself in which the AI according to the embodiment runs, and/or may be implemented by a separate server or system.

An AI model may consist of a plurality of neural network layers. Each layer has a plurality of weight values, and the computation of a layer is performed using the computation result of the previous layer and the weights of the current layer. Examples of neural networks include, but are not limited to, convolutional neural networks (CNN), deep neural networks (DNN), recurrent neural networks (RNN), restricted Boltzmann machines (RBM), deep belief networks (DBN), bidirectional recurrent deep neural networks (BRDNN), generative adversarial networks (GAN), and deep Q-networks.

A learning algorithm is a method of training a given target device (e.g., a robot) using a plurality of training data so as to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

An embodiment of the present application provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, the steps and corresponding content of the foregoing method embodiments can be implemented.

An embodiment of the present application also provides a computer program product including a computer program that, when executed by a processor, can implement the steps and corresponding content of the foregoing method embodiments.

The terms "first", "second", "third", "fourth", "1", "2", and the like (if any) used in the specification and claims of the present application, as well as in the accompanying drawings, are used to distinguish similar objects and do not necessarily describe a specific order or sequence. Data used in this way may be interchanged where appropriate, so it should be understood that the embodiments of the present application described herein may be implemented in an order other than that illustrated in the drawings or described in the text.

The modules described in the embodiments of the present application may be implemented through software or neural networks. In some cases, the name of a module does not constitute a limitation on the module itself; for example, the self-attention module may also be described as a "module for self-attention processing", a "first module", a self-attention network, a self-attention neural network, and so on.

The embodiments described above may be implemented with hardware components, software components, and/or a combination of hardware components and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and software applications that run on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to the execution of software. For ease of understanding, the description sometimes refers to a single processing device, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by a processing device or to provide instructions or data to it. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. A computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the embodiments, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that a computer can execute using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, those skilled in the art can apply various technical modifications and variations based on them. For example, appropriate results may be achieved even if the described techniques are performed in a different order than the described method, and/or components of the described system, structure, device, circuit, and the like are combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

An image processing method using a panoramic segmentation model, performed by a processor, the method comprising:
obtaining image features of an image including a plurality of image frames;
determining a target object representation of the image based on the image features using a neural network; and
determining a panoramic segmentation result of the image based on the target object representation.

The method of claim 1, wherein determining the target object representation of the image based on the image features using the neural network comprises:
determining the target object representation of the image by performing a plurality of iterations of processing on the image features using the neural network.

The method of claim 2, wherein determining the target object representation of the image by performing the plurality of iterations of processing on the image features using the neural network comprises:
in each iteration, using the neural network to determine the object representation of the current iteration for the image based on the image features and the object representation from the previous iteration.

The method of claim 3, wherein, for the first iteration among the plurality of iterations, the object representation from the previous iteration is a pre-configured initial object representation.
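
The iterative processing recited in the preceding dependent claims can be pictured as a simple loop. The following is a minimal sketch only, not the claimed neural network: `refine_step`, `num_iterations`, and the averaging rule in `toy_refine_step` are hypothetical stand-ins for one pass of the network.

```python
from typing import Callable, List

Vector = List[float]


def iterate_object_representation(
    image_features: List[Vector],
    initial_representation: Vector,
    refine_step: Callable[[List[Vector], Vector], Vector],
    num_iterations: int,
) -> Vector:
    """Apply refine_step num_iterations times and return the last output.

    The first pass starts from the pre-configured initial representation;
    every later pass starts from the output of the pass before it.
    """
    representation = initial_representation
    for _ in range(num_iterations):
        # Each pass sees the image features and the previous pass's output.
        representation = refine_step(image_features, representation)
    return representation


def toy_refine_step(image_features: List[Vector], previous: Vector) -> Vector:
    """Hypothetical update rule: move halfway toward the mean image feature."""
    count = len(image_features)
    mean = [sum(f[d] for f in image_features) / count for d in range(len(previous))]
    return [0.5 * p + 0.5 * m for p, m in zip(previous, mean)]
```

In this sketch the representation returned after the final pass plays the role of the target object representation.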

The method of claim 3, wherein determining the object representation of the current iteration for the image based on the image features and the object representation from the previous iteration comprises:
obtaining a mask by transforming the object representation from the previous iteration;
obtaining a first object representation by processing the image features, the object representation from the previous iteration, and the mask; and
determining the object representation of the current iteration for the image based on the first object representation.

The method of claim 5, wherein obtaining the first object representation by processing the image features, the object representation from the previous iteration, and the mask comprises:
obtaining an object representation related to the mask by performing attention processing on the image features, the object representation from the previous iteration, and the mask; and
obtaining the first object representation by performing self-attention processing and classification processing based on the object representation related to the mask and the object representation from the previous iteration.

The method of claim 6, wherein obtaining the object representation related to the mask by performing the attention processing on the image features, the object representation from the previous iteration, and the mask comprises:
obtaining a second object representation based on a key feature corresponding to the image features, the object representation from the previous iteration, and the mask;
determining, based on the second object representation, a first probability representing an object category included in the image; and
obtaining the object representation related to the mask based on the first probability, a value feature corresponding to the image features, and the image features.
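
One common way to combine a query, key features, value features, and a mask is masked dot-product attention. The sketch below uses that standard technique as an illustration; the claim does not state that this exact computation is used. Treating the previous object representation as the query, excluding masked-out positions by setting their scores to negative infinity, reading the softmax output as the "first probability", and falling back to unmasked attention when the mask is empty are all assumptions, as is the function name `masked_attention`.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(
    query: Vector,
    keys: List[Vector],
    values: List[Vector],
    mask: List[bool],
) -> Tuple[Vector, Vector]:
    """Return (probabilities, attended representation) for one object query."""
    if not any(mask):
        # Assumption: an empty mask falls back to attending everywhere.
        mask = [True] * len(keys)
    # Scores from the query and the key features; masked-out positions
    # receive -inf so that softmax assigns them zero probability.
    scores = [dot(query, k) if keep else float("-inf") for k, keep in zip(keys, mask)]
    probs = softmax(scores)
    # Weighted sum of the value features under the probabilities.
    attended = [sum(p * v[d] for p, v in zip(probs, values)) for d in range(len(values[0]))]
    return probs, attended
```

Here `keys` and `values` stand for the key and value features derived from the image features, with one entry per spatial position.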

The method of claim 5, wherein determining the object representation of the current iteration for the image based on the first object representation comprises:
determining, based on the image features and the first object representation, an object representation corresponding to each image frame of at least one image frame; and
determining the object representation of the current iteration for the image based on the first object representation and the determined object representation corresponding to each image frame.

The method of claim 8, wherein determining the object representation corresponding to each image frame of the at least one image frame based on the image features and the first object representation comprises:
determining a fourth object representation based on the first object representation and a key feature corresponding to the image features;
determining, based on the fourth object representation, a second probability representing an object category included in the image; and
determining the object representation corresponding to each image frame of the at least one image frame based on the second probability and a value feature corresponding to the image features.
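
The per-frame step can be illustrated by letting a single clip-level representation attend separately to the key and value features of each frame. This is a sketch under assumptions: dot-product scoring, softmax weights standing in for the "second probability", and the names `attend` and `per_frame_representations` are all hypothetical.

```python
import math
from typing import List

Vector = List[float]


def attend(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Dot-product attention of one query over one frame's positions."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum((w / total) * v[d] for w, v in zip(weights, values))
        for d in range(len(values[0]))
    ]


def per_frame_representations(
    clip_query: Vector,
    frame_keys: List[List[Vector]],
    frame_values: List[List[Vector]],
) -> List[Vector]:
    """Return one object representation per frame for a clip-level query."""
    # The same clip-level query is applied to each frame independently,
    # so the output list has exactly one vector per frame.
    return [attend(clip_query, k, v) for k, v in zip(frame_keys, frame_values)]
```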

The method of claim 8, wherein determining the object representation of the current iteration for the image based on the first object representation and the determined object representation corresponding to each image frame comprises:
obtaining a third object representation corresponding to the image by performing classification processing and self-attention processing on the determined object representation corresponding to each image frame; and
determining the object representation of the current iteration for the image based on the first object representation and the third object representation.

The method of claim 1, wherein determining the panoramic segmentation result of the image based on the target object representation comprises:
performing linear transformation processing on the target object representation; and
determining mask information of the image based on the linearly transformed target object representation and the image features, and determining category information of the image based on the linearly transformed target object representation.
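
The mask and category prediction described here can be sketched as two linear heads applied to one object representation. The separate weight matrices `mask_weights` and `class_weights`, the dot product against per-position features, and the decision rule (positive logit, equivalent to a sigmoid above 0.5) are assumptions made for illustration and are not taken from the claim.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def linear(weights: Matrix, x: Vector) -> Vector:
    """Linear transformation without bias: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def predict_mask_and_category(
    object_representation: Vector,
    position_features: List[Vector],
    mask_weights: Matrix,
    class_weights: Matrix,
) -> Tuple[List[bool], int]:
    """Return (mask over positions, index of the predicted category)."""
    # Mask branch: transform the representation, then compare it with the
    # image feature at every position.
    mask_embedding = linear(mask_weights, object_representation)
    mask = []
    for feature in position_features:
        logit = sum(m * f for m, f in zip(mask_embedding, feature))
        # sigmoid(logit) > 0.5 is equivalent to logit > 0.
        mask.append(logit > 0.0)
    # Category branch: transform the representation into per-class scores
    # and pick the highest one.
    class_logits = linear(class_weights, object_representation)
    category = max(range(len(class_logits)), key=class_logits.__getitem__)
    return mask, category
```

Note that the mask branch uses both the representation and the image features, while the category branch uses the representation alone, matching the two determinations in the claim.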

A method of training a panoramic segmentation model, performed by a processor, wherein the panoramic segmentation model includes a first module and a second module, the method comprising:
obtaining training data, wherein the training data includes a training image, a first image feature of the training image, and a sample panoramic segmentation result corresponding to the training image;
obtaining a second image feature by changing the frame order of the first image feature;
determining, through the first module, a first predicted object representation and a second predicted object representation of the training image based on the first image feature and the second image feature, respectively;
determining, through the second module, a first prediction result and a second prediction result of the training image based on the first predicted object representation and the second predicted object representation, respectively; and
training the panoramic segmentation model using a target loss function based on the sample panoramic segmentation result, the first predicted object representation, the second predicted object representation, the first prediction result, and the second prediction result.
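
The frame reordering step can be sketched as a permutation of per-frame features. The claim only requires that the frame order be changed; the use of a seeded random permutation and the function name `reorder_frames` are assumptions.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def reorder_frames(frame_features: Sequence[T], seed: int) -> Tuple[List[T], List[int]]:
    """Return the per-frame features in a changed order, plus the order used.

    order[i] is the original index of the frame placed at position i.
    """
    count = len(frame_features)
    identity = list(range(count))
    order = list(identity)
    rng = random.Random(seed)
    if count > 1:
        # Reshuffle until the order actually differs from the original,
        # since a shuffle can by chance return the identity permutation.
        while order == identity:
            rng.shuffle(order)
    return [frame_features[i] for i in order], order
```

Returning the permutation alongside the reordered features lets later steps relate results computed on the two orderings.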

The method of claim 12, wherein training the panoramic segmentation model using the target loss function based on the sample panoramic segmentation result, the first predicted object representation, the second predicted object representation, the first prediction result, and the second prediction result comprises:
determining a first similarity matrix based on the first predicted object representation and the second predicted object representation;
determining a second similarity matrix based on the sample panoramic segmentation result, the first prediction result, and the second prediction result; and
outputting a trained panoramic segmentation model when it is determined, based on the first similarity matrix and the second similarity matrix, that the target loss function is at its minimum.
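
The two similarity matrices can be illustrated as follows. This is a sketch under assumptions: cosine similarity is one possible choice for comparing the two sets of predicted object representations, and the second matrix is built here from hypothetical ground-truth object ids, assuming each prediction has already been matched to an object in the sample segmentation result. The claim specifies neither choice.

```python
import math
from typing import List

Vector = List[float]


def cosine_similarity_matrix(first: List[Vector], second: List[Vector]) -> List[List[float]]:
    """Entry [i][j] is the cosine similarity of first[i] and second[j]."""

    def norm(v: Vector) -> float:
        return math.sqrt(sum(x * x for x in v))

    def cosine(a: Vector, b: Vector) -> float:
        denominator = norm(a) * norm(b)
        if denominator == 0.0:
            return 0.0  # define similarity with a zero vector as 0
        return sum(x * y for x, y in zip(a, b)) / denominator

    return [[cosine(a, b) for b in second] for a in first]


def target_similarity_matrix(first_ids: List[int], second_ids: List[int]) -> List[List[float]]:
    """Entry [i][j] is 1.0 when both predictions map to the same object id."""
    return [[1.0 if a == b else 0.0 for b in second_ids] for a in first_ids]
```

A loss comparing the two matrices would then penalize representations of the same object that differ between the two frame orderings.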

An electronic device comprising:
a memory;
a processor; and
a computer program stored in the memory,
wherein the processor executes the computer program to implement the steps of the method according to any one of claims 1 to 13.

A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 13.