KR20220065666A

KR20220065666A - Apparatus and method for processing video

Info

Publication number: KR20220065666A
Application number: KR1020210130042A
Authority: KR
Inventors: 장 후이; 리 핑준; 유병인; 이한아; 저우 이
Original assignee: 삼성전자주식회사
Priority date: 2020-11-13
Filing date: 2021-09-30
Publication date: 2022-05-20
Also published as: CN114494929A

Abstract

The present disclosure relates to a video processing apparatus and method. The video processing method comprises: acquiring a first image feature of a first image of video data and a second image feature of a second image preceding the first image; performing time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result; and obtaining a panorama segmentation result of the first image according to the time-domain information fusion processing result.

Description

Apparatus and method for processing video

The present disclosure relates to the field of video segmentation technology, and in particular to a video processing apparatus and method.

Image panorama segmentation is the process of assigning label information to each pixel of a two-dimensional image. Image content falls into two categories. One is "stuff" (content without a fixed form), such as grass, sky, and buildings, for which individual objects need not be distinguished. The other is "things" (content with a fixed form), such as people and vehicles, for which individual objects must be distinguished.

The panorama segmentation task can be regarded as a combination of two tasks: semantic segmentation and instance segmentation. For a pixel belonging to the "stuff" category, its semantic label is predicted; for a pixel belonging to the "thing" category, its instance label is predicted.

Video panorama segmentation extends image panorama segmentation into the time domain. In addition to the panorama segmentation of each image, it incorporates object tracking, that is, assigning the same label to pixels that belong to the same instance in different images. Existing video panorama segmentation techniques are costly, slow, and low in accuracy.

A video processing method according to an embodiment of the present disclosure includes: acquiring a first image feature of a first image of video data and a second image feature of a second image preceding the first image; performing time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result; and obtaining a panorama segmentation result of the first image according to the time-domain information fusion processing result.

In this case, obtaining the panorama segmentation result of the first image according to the time-domain information fusion processing result may include: performing instance tracking on the first image based on the time-domain information fusion processing result to obtain an inter-frame instance correspondence of the first image; and obtaining the panorama segmentation result of the first image according to the inter-frame instance correspondence of the first image.

In this case, obtaining the panorama segmentation result of the first image according to the time-domain information fusion processing result may further include performing semantic segmentation, instance segmentation, and bounding box refinement on the first image based on the time-domain information fusion processing result to obtain a semantic segmentation result of the first image, an instance segmentation result of the first image, and a bounding box of the first image. Obtaining the panorama segmentation result of the first image according to the inter-frame instance correspondence of the first image may then include fusing the semantic segmentation result of the first image, the instance segmentation result of the first image, the bounding box of the first image, and the inter-frame instance correspondence of the first image to obtain the panorama segmentation result of the first image.

In this case, acquiring the first image feature of the first image of the video data and the second image feature of the second image preceding the first image may include performing feature extraction on each of the first image and the second image through a feature extraction network to obtain the first image feature of the first image and the second image feature of the second image.

In this case, the time-domain information fusion processing result may include a first time-domain integrated feature of the first image and a second time-domain integrated feature of the second image.

In this case, performing the time-domain information fusion processing on the first image feature and the second image feature to obtain the time-domain information fusion processing result may include: performing a combination operation on the first image feature and the second image feature; splitting the combined image feature into two paths and performing correlation processing on the first path; performing an element-wise addition operation on the correlation processing result of the first path and the second path; and obtaining the time-domain information fusion processing result according to the element-wise addition result.
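The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: features are stored as lists of channels (each channel a flat list of values), the combination operation is assumed to be channel concatenation, "splitting into two paths" is assumed to mean branching the same tensor, and `correlate` is a hypothetical stand-in for the correlation processing.

```python
def concat_channels(f1, f2):
    """Combination operation, assumed here to be channel concatenation."""
    return f1 + f2


def add_elementwise(a, b):
    """Element-wise addition of two features with identical shapes."""
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def temporal_fusion(f1, f2, correlate):
    """Combine, branch into two paths, correlate path 1, add path 2 back."""
    combined = concat_channels(f1, f2)
    path1, path2 = combined, combined      # assumption: both paths see the same data
    correlated = correlate(path1)          # hypothetical correlation processing
    return add_elementwise(correlated, path2)


# Toy usage: one channel per frame, two spatial positions each.
halve = lambda feat: [[0.5 * v for v in ch] for ch in feat]  # placeholder only
fused = temporal_fusion([[1.0, 2.0]], [[3.0, 4.0]], halve)
print(fused)  # [[1.5, 3.0], [4.5, 6.0]]
```

The addition of the untouched second path acts as a residual connection, so the correlation branch only has to learn a correction to the combined feature.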

In this case, performing the correlation processing on the first path may include: performing at least one convolution operation on the first path; and inputting the first path after the convolution operation into a network for spatial-domain fusion, and performing the correlation processing through the network for spatial-domain fusion.

In this case, performing the correlation processing through the network for spatial-domain fusion may include: dividing a feature input to the network for spatial-domain fusion into at least two paths, and performing subdomain extraction on some or all of the at least two paths; performing a matrix multiplication operation on the subdomain extraction results; and performing an element-wise addition operation on the matrix multiplication result and the feature input to the network for spatial-domain fusion.

In this case, performing the correlation processing through the network for spatial-domain fusion may include: dividing the convolved combined feature into four paths; performing subdomain extraction on each of a first path, a second path, and a third path among the four paths; performing a matrix multiplication operation on the subdomain extraction result of the first path and the subdomain extraction result of the second path, and performing a matrix multiplication operation on that matrix multiplication result and the subdomain extraction result of the third path; and performing an element-wise addition operation on the result of the matrix multiplication with the subdomain extraction result of the third path and a fourth path among the four paths.
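The four-path structure can be illustrated with the sketch below. The shapes and operand order are assumptions, since the text does not fix them: subdomain extraction is modeled as reshaping a C x H x W feature to C x N (N = H * W), the first path is additionally transposed to N x C, and no normalization is applied to the intermediate N x N matrix. All function names are hypothetical.

```python
def flatten(x):
    """Reshape C x H x W to C x (H*W); stands in for subdomain extraction."""
    return [[v for row in ch for v in row] for ch in x]


def unflatten(m, h, w):
    """Reshape C x (H*W) back to C x H x W."""
    return [[ch[r * w:(r + 1) * w] for r in range(h)] for ch in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]


def spatial_fusion(x):
    """Four paths: (p1 @ p2) gives position affinities, p3 is mixed by them,
    and the unchanged fourth path is added back element-wise."""
    h, w = len(x[0]), len(x[0][0])
    p1 = transpose(flatten(x))         # N x C
    p2 = flatten(x)                    # C x N
    p3 = flatten(x)                    # C x N
    p4 = x                             # C x H x W, left untouched
    affinity = matmul(p1, p2)          # N x N
    mixed = unflatten(matmul(p3, affinity), h, w)
    return [[[a + b for a, b in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(mixed, p4)]


out = spatial_fusion([[[1.0, 0.0], [0.0, 1.0]]])   # one channel, 2 x 2 map
print(out)  # [[[3.0, 0.0], [0.0, 3.0]]]
```

In the toy input the two active positions reinforce each other through the affinity matrix, while the inactive positions stay at zero.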

In this case, performing the subdomain extraction may include performing the subdomain extraction through data reconstruction.

In this case, performing the combination operation on the first image feature and the second image feature may include: performing at least one convolution operation on the first image feature; performing at least one convolution operation on the second image feature; and performing the combination operation on the convolved first image feature and the convolved second image feature.

In this case, obtaining the time-domain information fusion processing result according to the element-wise addition result may include: dividing the element-wise addition result into two paths; performing at least one convolution operation on the element-wise addition result of each of the two paths; performing an element-wise addition operation on the convolved element-wise addition result of the first path and the convolved first image feature to obtain the first time-domain integrated feature; and performing an element-wise addition operation on the convolved element-wise addition result of the second path and the convolved second image feature to obtain the second time-domain integrated feature.
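The two output branches can be sketched as below. As an assumption for illustration, each "convolution operation" is modeled as a 1 x 1 convolution, that is, a per-position linear mix of channels given by a weight matrix; the weights `w_a` and `w_b` and the function names are hypothetical.

```python
def conv1x1(feature, weight):
    """1 x 1 convolution: feature is C_in x N, weight is C_out x C_in."""
    n = len(feature[0])
    return [[sum(w * feature[c][i] for c, w in enumerate(w_row))
             for i in range(n)]
            for w_row in weight]


def add_elementwise(a, b):
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]


def integrate(add_result, f1_conv, f2_conv, w_a, w_b):
    """Branch the addition result, convolve each branch, and add the
    convolved per-frame features back to get the two integrated features."""
    path_a, path_b = add_result, add_result
    first = add_elementwise(conv1x1(path_a, w_a), f1_conv)
    second = add_elementwise(conv1x1(path_b, w_b), f2_conv)
    return first, second


add_result = [[1.0, 2.0], [3.0, 4.0]]       # 2 channels, 2 positions
first, second = integrate(
    add_result,
    f1_conv=[[10.0, 10.0]], f2_conv=[[20.0, 20.0]],
    w_a=[[1.0, 0.0]],                        # picks channel 0
    w_b=[[0.0, 1.0]],                        # picks channel 1
)
print(first, second)  # [[11.0, 12.0]] [[23.0, 24.0]]
```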

In this case, obtaining the time-domain information fusion processing result according to the element-wise addition result may include performing at least one convolution operation on the element-wise addition result, and taking the convolved element-wise addition result and the second image feature as the time-domain information fusion processing result.

In this case, performing instance tracking on the first image based on the time-domain information fusion processing result to obtain the inter-frame instance correspondence of the first image may include: updating an instance database of the video data based on the second time-domain integrated feature; and performing instance tracking on the first time-domain integrated feature based on the updated instance database.

In this case, updating the instance database of the video data based on the second time-domain integrated feature may include: selecting a first number of preset features from the second time-domain integrated feature; and adding the selected first number of preset features to the instance database of the video data.

In this case, performing instance tracking on the first time-domain integrated feature based on the updated instance database may include: selecting a second number of preset features from the first time-domain integrated feature; and establishing instance correspondence through a tracking network based on the selected second number of preset features and the updated instance database.
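The database update and matching steps can be sketched as follows. The tracking network itself is not specified at this point, so the sketch swaps in plain cosine similarity with a threshold as a stand-in; the function names, the threshold value, and the integer instance IDs are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def update_database(database, prev_features, next_id):
    """Add the features selected from the earlier frame to the database."""
    for feat in prev_features:
        database[next_id] = feat
        next_id += 1
    return next_id


def match_instances(database, cur_features, threshold=0.5):
    """For each current-frame feature, return the best matching instance ID,
    or None when nothing in the database is similar enough."""
    matches = []
    for feat in cur_features:
        best_id, best_sim = None, threshold
        for inst_id, ref in database.items():
            sim = cosine(feat, ref)
            if sim > best_sim:
                best_id, best_sim = inst_id, sim
        matches.append(best_id)
    return matches


db = {}
next_id = update_database(db, [[1.0, 0.0], [0.0, 1.0]], next_id=0)   # m = 2
matches = match_instances(db, [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]])  # n = 3
print(matches)  # [1, 0, None]
```

A `None` result marks a feature with no counterpart in the database, which a full system would presumably treat as a newly appeared instance.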

In this case, the preset features may include at least one of a region-of-interest feature, a feature expressed based on a bounding box, and a feature expressed based on a mask.

A video processing apparatus according to an embodiment of the present disclosure includes: a feature acquisition unit configured to acquire a first image feature of a first image of video data and a second image feature of a second image preceding the first image; a time-domain information fusion unit configured to perform time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result; and a panorama segmentation unit configured to obtain a panorama segmentation result of the first image according to the time-domain information fusion processing result.

In this case, the panorama segmentation unit may perform instance tracking on the first image based on the time-domain information fusion processing result to obtain an inter-frame instance correspondence of the first image, and may obtain the panorama segmentation result of the first image according to the inter-frame instance correspondence of the first image.

FIG. 1 is a diagram illustrating an exemplary network structure of a video processing algorithm according to an embodiment.
FIG. 2 is a flowchart illustrating a video processing process according to an embodiment.
FIG. 3 is a diagram illustrating an exemplary structure of a time-domain integration network and its flow of use according to an embodiment.
FIG. 4 is a diagram illustrating an exemplary structure of a time-domain integration network and its flow of use according to another embodiment.
FIG. 5 is a diagram illustrating an exemplary structure of a network for spatial fusion in a time-domain integration network according to an embodiment.
FIG. 6 is a diagram illustrating an exemplary structure and flow of use of a network for spatial fusion in a time-domain integration network according to an embodiment.
FIG. 7 is a diagram illustrating a video processing apparatus according to an embodiment.
FIG. 8 is a diagram illustrating a panorama segmentation unit of a video processing apparatus according to an embodiment.
FIG. 9 is a diagram illustrating an electronic device according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of the patent application is not restricted or limited by these embodiments. It should be understood that all modifications, equivalents, and substitutes for the embodiments are included in the scope of rights.

The terms used in the embodiments are used for the purpose of description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to designate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to preclude the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure number, and redundant descriptions thereof are omitted. In describing the embodiments, a detailed description of related known technology is omitted where it is determined that it may unnecessarily obscure the gist of the embodiments.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of the embodiments. These terms serve only to distinguish one component from another, and the nature, sequence, or order of the components is not limited by the terms. When a component is described as being "connected", "coupled", or "joined" to another component, it should be understood that the component may be directly connected or joined to the other component, but that yet another component may also be "connected", "coupled", or "joined" between them.

A component included in one embodiment and a component having a common function are described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment may also apply to other embodiments, and detailed descriptions are omitted to the extent that they overlap.

FIG. 1 is a diagram illustrating an exemplary network structure of a video processing algorithm according to an embodiment.

Referring to FIG. 1, the exemplary network structure of the video processing algorithm includes a feature extraction network 110, a feature extraction network 120, a time-domain integration module (TUM) 130, a bounding box proposal network (RPN) 140, a semantic segmentation module 150, a bounding box module 160, a mask module 170, and a tracking module 180.

The feature map of the t-th frame and the feature map of the (t-τ)-th frame, extracted by the feature extraction network 110 and the feature extraction network 120 respectively, are input to the time-domain integration module 130. The time-domain integration module 130 outputs a t-th frame time-domain integrated feature map and a (t-τ)-th frame time-domain integrated feature map (that is, features suitable for matching).

The video processing algorithm adds m masked features extracted from the time-domain integrated feature map of the (t-τ)-th frame to an instance database, and inputs n masked features extracted from the time-domain integrated feature map of the t-th frame, together with the instance database, to the tracking module to establish the instance correspondence between frames. Here, a masked feature is a feature expressed based on a mask.

The t-th frame time-domain integrated feature map is also input to the semantic segmentation module 150, the bounding box module 160, and the mask module 170.

The outputs of the semantic segmentation module 150, the bounding box module 160, the mask module 170, and the tracking module 180 are then combined into a t-th frame panorama label map.

More specifically, for the t-th frame, the feature extraction network 110 uses a general feature extraction network to extract multi-resolution image features (the different resolutions being represented as a pyramid) and fuses the extracted image features to obtain the t-th frame feature map.
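How a resolution pyramid might be fused into a single feature map can be sketched as follows. The text does not say how the levels are fused; this sketch assumes each level is half the size of the previous one and fuses them by nearest-neighbor upsampling followed by element-wise summation. Single-channel maps and the function names are illustrative only.

```python
def upsample2x(fm):
    """Nearest-neighbor upsampling of a 2-D map by a factor of two."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def fuse_pyramid(levels):
    """Fuse pyramid levels (finest first) by upsampling the coarser result
    and adding it to the next finer level."""
    fused = levels[-1]
    for finer in reversed(levels[:-1]):
        up = upsample2x(fused)
        fused = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(finer, up)]
    return fused


fine = [[1, 2], [3, 4]]      # 2 x 2 level
coarse = [[10]]              # 1 x 1 level
print(fuse_pyramid([fine, coarse]))  # [[11, 12], [13, 14]]
```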

For the (t-τ)-th frame, the feature extraction network 120 uses the same feature extraction network as the feature extraction network 110 to extract multi-resolution image features, and fuses the extracted image features to obtain the (t-τ)-th frame feature map.

The time-domain integration module 130 receives the t-th frame feature and the (t-τ)-th frame feature and uses them to obtain the t-th frame time-domain integrated feature map (that is, features suitable for matching) and the (t-τ)-th frame time-domain integrated feature map.

The bounding box proposal network 140 invokes a bounding box proposal network (RPN) on the t-th frame time-domain integrated feature map to obtain candidate bounding boxes (that is, boxes on the time-domain integrated feature map).

The semantic segmentation module 150 performs semantic segmentation on the t-th frame time-domain integrated feature map obtained by the time-domain integration module 130 to obtain the semantic labels of all pixels.

The bounding box module 160 refines, on the t-th frame time-domain integrated feature map obtained by the time-domain integration module 130, the candidate bounding boxes obtained by the bounding box proposal network 140, and determines their classes and precise bounding box positions.

The mask module 170 computes, on the t-th frame time-domain integrated feature map obtained by the time-domain integration module 130, a mask for each candidate bounding box obtained by the bounding box proposal network 140 (that is, it determines for each pixel inside the bounding box whether the pixel belongs to the foreground or the background).
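The per-pixel foreground/background decision inside a box can be sketched as a simple threshold. This assumes a hypothetical per-pixel foreground probability map is already available (how the mask module produces it is not described here), and it uses an assumed box format of `(x0, y0, x1, y1)` with exclusive upper bounds.

```python
def box_mask(prob, box, threshold=0.5):
    """Return a binary mask (1 = foreground) for the pixels inside the box."""
    x0, y0, x1, y1 = box
    return [[1 if prob[y][x] >= threshold else 0 for x in range(x0, x1)]
            for y in range(y0, y1)]


prob = [
    [0.1, 0.8, 0.9],
    [0.2, 0.7, 0.4],
    [0.0, 0.1, 0.3],
]
print(box_mask(prob, (1, 0, 3, 2)))  # [[1, 1], [1, 0]]
```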

The tracking module 180 receives the n masked features extracted from the t-th frame time-domain integrated feature map obtained by the time-domain integration module 130, and establishes the instance correspondence between frames (that is, identifies instances that appeared in previous frames) using the instance database to which the m masked features extracted from the (t-τ)-th frame time-domain integrated feature map have been added (during training, the correct masks manually annotated in advance are used).

Finally, the video processing algorithm may combine all the information computed by the four modules (the semantic segmentation module 150, the bounding box module 160, the mask module 170, and the tracking module 180) to obtain the panorama label map.
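One way the four outputs could be merged into a label map is sketched below. The data layout is an assumption: each pixel of the result holds a `(class_id, track_id)` pair, "stuff" pixels keep their semantic class with track ID 0, and each detected instance is a hypothetical dictionary carrying its box origin, binary mask, class, and the ID assigned by tracking. Overlap handling between instances is not addressed.

```python
def compose_panorama(semantic, instances):
    """Start from the semantic map, then paste each instance mask on top,
    labeling its foreground pixels with (class_id, track_id)."""
    label_map = [[(cls, 0) for cls in row] for row in semantic]
    for inst in instances:
        x0, y0 = inst["box"][0], inst["box"][1]
        for dy, mask_row in enumerate(inst["mask"]):
            for dx, fg in enumerate(mask_row):
                if fg:
                    label_map[y0 + dy][x0 + dx] = (inst["cls"], inst["track_id"])
    return label_map


semantic = [[0, 0, 0], [0, 0, 0]]                 # class 0 = some "stuff" class
instances = [{"box": (1, 0, 3, 2), "mask": [[1, 1], [1, 0]],
              "cls": 1, "track_id": 7}]
for row in compose_panorama(semantic, instances):
    print(row)
# [(0, 0), (1, 7), (1, 7)]
# [(0, 0), (1, 7), (0, 0)]
```

Because the track ID comes from the inter-frame correspondence, the same object keeps the same ID across the label maps of successive frames.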

FIG. 2 is a flowchart illustrating a video processing process according to an embodiment.

Referring to FIG. 2, the video processing apparatus acquires a first image feature of a first image of video data and a second image feature of a second image preceding the first image (210).

Specifically, the video processing apparatus may divide the video into a plurality of frame images, and then perform image panorama segmentation on the first frame image of the video data using any image panorama segmentation method to obtain the panorama segmentation result of the first frame image. Each frame image after the first frame image of the video data may then be taken in turn as the first image, and the video processing method of the present disclosure may be performed to obtain the panorama segmentation result of each frame image after the first frame image.

After the panorama segmentation result of the first frame image is obtained, image panorama segmentation may be performed on each subsequent frame image based on the panorama segmentation results of the preceding frame images.
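The frame-by-frame procedure can be sketched as a schedule of (first image, second image) index pairs. The text does not state which earlier frame is used when fewer than τ frames are available, so clamping the earlier index to frame 0 is an assumption, as are the function name and the default `tau`.

```python
def frame_pairs(num_frames, tau=1):
    """Return (t, t - tau) index pairs for every frame after frame 0.
    Frame 0 is handled separately by a single-image segmentation method."""
    pairs = []
    for t in range(1, num_frames):
        pairs.append((t, max(0, t - tau)))   # assumption: clamp to frame 0
    return pairs


print(frame_pairs(4, tau=2))  # [(1, 0), (2, 0), (3, 1)]
```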

In an embodiment, when acquiring, in operation 210, the first image feature of the first image and the second image feature of the second image preceding the first image from the video data, the video processing apparatus may perform feature extraction on each of the first image (e.g., the t-th frame image) and the second image (e.g., the (t-τ)-th frame image) through a feature extraction network to obtain the first image feature of the first image and the second image feature of the second image. Here, the first image feature and the second image feature may be multi-resolution image features; specifically, a pyramid may be used to represent the several different resolutions.

The video processing apparatus performs time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result (220).

In this case, the time-domain information fusion processing result may include a first time-domain integrated feature of the first image and a second time-domain integrated feature of the second image. In other words, the time-domain information fusion processing may be performed on the first image feature and the second image feature to obtain the first time-domain integrated feature of the first image and the second time-domain integrated feature of the second image.

In step 220, when performing time-domain information fusion processing on the first image feature and the second image feature, the video processing apparatus may first perform a combination operation on the first image feature and the second image feature, split the combined image feature into two paths, perform correlation processing on the first path, perform element-wise addition on the correlation processing result of the first path and the second path, and then obtain the time-domain information fusion processing result from the element-wise addition result. Here, when performing correlation processing on the first path, the video processing apparatus may first perform at least one convolution operation on the first path, input the convolved first path into a network for spatial-domain fusion, and perform the correlation processing through that network.

In step 220, when performing the combination operation on the first image feature and the second image feature, the video processing apparatus may first perform at least one convolution operation on the first image feature and at least one convolution operation on the second image feature, and then perform the combination operation on the convolved first image feature and the convolved second image feature.

In step 220, when obtaining the time-domain information fusion processing result from the element-wise addition result, the video processing apparatus may first split the element-wise addition result into two paths and perform at least one convolution operation on the element-wise addition result of each path. It may then perform element-wise addition on the convolved result of the first path and the convolved first image feature to obtain the first time-domain integrated feature, and perform element-wise addition on the convolved result of the second path and the convolved second image feature to obtain the second time-domain integrated feature.

In step 220, when performing correlation processing through the network for spatial-domain fusion, the video processing apparatus may first split the feature input to that network into at least two paths, extract sub-domains from some or all of the at least two paths, perform a matrix multiplication operation on the sub-domain extraction results, and then perform element-wise addition on the matrix multiplication result and the feature input to the network for spatial-domain fusion.

In step 220, when performing correlation processing through the network for spatial-domain fusion, the video processing apparatus may first split the convolved combined feature into four paths and perform sub-domain extraction on each of the first, second, and third paths. It may then perform a matrix multiplication operation on the sub-domain extraction results of the first path and the second path, perform a matrix multiplication operation on that result and the sub-domain extraction result of the third path, and finally perform element-wise addition on the result of this second matrix multiplication and the fourth path. In this case, the video processing apparatus may perform the sub-domain extraction through data reshaping.

Alternatively, in step 220, when obtaining the time-domain information fusion processing result from the element-wise addition result, the video processing apparatus may first perform at least one convolution operation on the element-wise addition result, and then take the convolved element-wise addition result together with the second image feature as the time-domain information fusion processing result.

The time-domain information fusion processing performed in step 220 on the first image feature and the second image feature through the time-domain integration network is described in more detail below with reference to FIGS. 3 to 6.

The video processing apparatus then obtains the panorama segmentation result of the first image according to the time-domain information fusion processing result (230).

In step 230, the video processing apparatus may first perform instance tracking on the first image based on the time-domain information fusion processing result to obtain the inter-frame instance correspondence of the first image, and then obtain the panorama segmentation result of the first image according to the inter-frame instance correspondence.

In step 230, the video processing apparatus may perform semantic segmentation, instance segmentation, and bounding box refinement on the first image based on the time-domain information fusion processing result to obtain the semantic segmentation result, the instance segmentation result, and the bounding boxes of the first image. In this case, when obtaining the panorama segmentation result of the first image according to the inter-frame instance correspondence, the video processing apparatus may fuse the semantic segmentation result, the instance segmentation result, the bounding boxes, and the inter-frame instance correspondence of the first image to obtain the panorama segmentation result of the first image.
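The fusion in step 230 is not spelled out in detail here, so the following is only a minimal sketch of one way such a merge could look. It assumes per-pixel semantic labels, boolean instance masks, and track ids from the inter-frame instance correspondence are already available, and it ignores the bounding boxes for brevity. The function name `merge_panoptic` and all of its arguments are hypothetical.

```python
import numpy as np


def merge_panoptic(semantic, instance_masks, track_ids, thing_classes):
    """Combine semantic labels with tracked instance masks (illustrative only).

    semantic:       (H, W) int array of class labels
    instance_masks: list of (H, W) bool arrays, one per detected instance
    track_ids:      list of ints taken from the inter-frame instance correspondence
    thing_classes:  set of class labels that are countable "things"
    Returns (class_map, id_map); id_map stays 0 on "stuff" pixels.
    """
    is_thing = np.isin(semantic, sorted(thing_classes))
    id_map = np.zeros_like(semantic)
    for mask, track_id in zip(instance_masks, track_ids):
        # Only pixels whose semantic class is a "thing" receive an instance id.
        id_map[mask & is_thing] = track_id
    return semantic.copy(), id_map


semantic = np.array([[0, 0], [1, 1]])  # 0 = sky ("stuff"), 1 = person ("thing")
masks = [np.array([[False, False], [True, False]])]
class_map, id_map = merge_panoptic(semantic, masks, track_ids=[5], thing_classes={1})
```

Because the id written into `id_map` comes from tracking rather than from per-frame detection order, the same object keeps the same id across frames.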

In step 230, when performing instance tracking on the first image based on the time-domain information fusion processing result, the video processing apparatus may first update the instance database of the video data based on the second time-domain integrated feature, and then perform instance tracking on the first time-domain integrated feature based on the updated instance database.

In step 230, when updating the instance database of the video data based on the second time-domain integrated feature, the video processing apparatus may first select a first number (e.g., m) of preset features from the second time-domain integrated feature, and then add the selected first number (e.g., m) of preset features to the instance database of the video data.

In step 230, when performing instance tracking on the first time-domain integrated feature based on the updated instance database, the video processing apparatus may first select a second number (e.g., n) of preset features from the first time-domain integrated feature, and then establish the instance correspondence through a tracking network, based on the selected second number (e.g., n) of preset features and the updated instance database.

In one embodiment, the instance database may record all instances from the first frame of the video up to the frame preceding the current frame. The size of the instance database is the number of instances, and it is stored as a dictionary: the key is the instance id, and the value is the corresponding instance feature. The initial state of the instance database may be an all-zero tensor.

Both read and write operations are performed for the first frame image through the (t-1)-th frame image. If the value corresponding to the id of an extracted instance is still in its initial state, that value is replaced with the current feature. If the value corresponding to the id of the extracted instance is not in its initial state, the original feature and the current feature are fused using a parameter alpha (e.g., 0.5, without limitation), that is, feat_memory_new = alpha*feat_memory_org + (1-alpha)*feat_current. For the t-th frame image, only a read is performed: the features of all instances in the instance database are read, and their similarity to the features of all instances detected in the t-th frame image is computed (any similarity measure may be chosen; for example, but not limited to, the vector dot product of the two features). For each instance in the t-th frame image, the predicted id is the id of the most similar instance in the instance database. In this way, the network has a sufficiently large instance database during training.
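The write and read rules above can be sketched as follows. This is an illustration, not the disclosed implementation. It assumes each instance feature is a flat vector, treats a missing key the same as the all-zero initial state, and uses the dot product as the similarity measure. The function names `update_database` and `predict_ids` are hypothetical.

```python
import numpy as np

ALPHA = 0.5  # example fusion weight from the text; not a required value


def update_database(database, instance_id, feat_current, alpha=ALPHA):
    """Write step (frames 1 .. t-1): store or fuse one instance feature."""
    feat_memory_org = database.get(instance_id)
    if feat_memory_org is None or not np.any(feat_memory_org):
        # Entry is still in its initial (all-zero) state: replace it outright.
        database[instance_id] = feat_current.copy()
    else:
        # feat_memory_new = alpha*feat_memory_org + (1-alpha)*feat_current
        database[instance_id] = alpha * feat_memory_org + (1.0 - alpha) * feat_current
    return database


def predict_ids(database, detected_feats):
    """Read step (frame t): give each detection the id of the most similar entry."""
    ids = list(database.keys())
    memory = np.stack([database[i] for i in ids])  # (num_db, D)
    similarity = detected_feats @ memory.T         # (num_detected, num_db) dot products
    return [ids[j] for j in similarity.argmax(axis=1)]


database = {}
update_database(database, 7, np.array([1.0, 0.0]))
update_database(database, 9, np.array([0.0, 1.0]))
update_database(database, 7, np.array([1.0, 1.0]))  # fused with the stored feature
predicted = predict_ids(database, np.array([[0.9, 0.1], [0.1, 2.0]]))
```

With alpha = 0.5 the stored feature is an equal blend of the previous memory and the newest observation; a larger alpha would make the memory change more slowly.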

In one embodiment, the preset feature may be a region-of-interest feature, that is, a feature expressed based on a bounding box (ROI feature), or a feature expressed based on a mask (masked feature).

An ROI feature represents an instance as a rectangular region (commonly called a bounding box), and all information on the feature map within that rectangular region is regarded as information of the instance. Because the shape of an instance is generally not rectangular, the instance feature then also contains part of the background, which makes it inexact. A masked feature uses the mask of the instance and takes only the information inside the mask region as instance information, which removes the influence of the background.

Here, the feature may be an ROI feature or a masked feature. Because an ROI feature is extracted with a (rectangular) bounding box, it picks up some background noise, which lowers the discriminability of the instance feature. Choosing the masked feature raises the discriminability of the instance feature, so the masked feature is preferred in order to improve the accuracy of instance id prediction.
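The difference between the two feature types can be seen on a toy feature map. The sketch below assumes average pooling, which the text does not specify, and the helper names `roi_feature` and `masked_feature` are hypothetical.

```python
import numpy as np


def roi_feature(feature_map, box):
    """Average every position inside the box (x0, y0, x1, y1), background included."""
    x0, y0, x1, y1 = box
    return feature_map[:, y0:y1, x0:x1].mean(axis=(1, 2))


def masked_feature(feature_map, mask):
    """Average only the positions where the instance mask is set."""
    weights = mask.astype(feature_map.dtype)
    return (feature_map * weights).sum(axis=(1, 2)) / max(weights.sum(), 1.0)


# One-channel 4x4 feature map: the instance responds with 1.0, the background with 0.0.
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = mask[2, 2] = True                # a diagonal, non-rectangular instance
feature_map = mask.astype(np.float64)[None]   # shape (C=1, H=4, W=4)
box = (1, 1, 3, 3)                            # tight box around the instance

roi = roi_feature(feature_map, box)
masked = masked_feature(feature_map, mask)
```

In this toy case half of the tight bounding box is background, so the ROI average is diluted, while the masked average reflects the instance response only.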

FIG. 3 is a diagram illustrating an exemplary structure of a time-domain integration network and its flow of use according to an embodiment.

Referring to FIG. 3, the time-domain integration network (module) includes a collection part, a network for spatial fusion (e.g., a partial non-local attention network), and a distribution part. The collection part combines the features of the two frames (ⓒ denotes combination), and the distribution part splits the combined information that has passed through the attention network back into the information of the two frames. The network for spatial fusion (e.g., the partial non-local attention network) is located between the collection part and the distribution part. It is a new attention network proposed in the present disclosure and can fuse the information of two frames without conventional optical flow computation or conventional alignment.

In FIG. 3, the inputs of the time-domain integration network are the first image feature (the feature map of the t-th frame) and the second image feature (the feature map of the (t-τ)-th frame). The feature maps of both frame images are represented as tensors of size N*C*H*W, where N is the number of images in the same batch (i.e., the batch size), C is the number of channels, and H and W are the height and width of the feature map.

The output of the time-domain integration network is the time-domain integrated features of the two frames. The specific process of time-domain information fusion processing in the time-domain integration network (module) of FIG. 3 may include steps 311 to 322.

The time-domain integration network performs a convolution operation (1*1 convolution) on the feature map of the t-th frame to obtain feature map A (311).

The time-domain integration network performs a convolution operation (1*1 convolution) on the feature map of the (t-τ)-th frame to obtain feature map A' (312).

The time-domain integration network combines A and A' to obtain feature map E (313). The size of feature map E is N*2C*H*W.

The time-domain integration network performs a convolution operation (e.g., 1*1 convolution) on the obtained feature map E (314).

The time-domain integration network then performs a further convolution operation (e.g., 3*3 convolution) (315).

The time-domain integration network then performs a further convolution operation (e.g., 1*1 convolution) to obtain feature map X (316).

Here, the convolution operations of steps 314 to 316 may use different parameters.

The time-domain integration network inputs feature map X into the network for spatial fusion (e.g., the partial non-local attention network) (317).

The time-domain integration network performs element-wise addition on the feature map output in step 317 and feature map E (318).

The time-domain integration network performs a convolution operation (e.g., 1*1 convolution) on the result output in step 318 (319).

The time-domain integration network performs another convolution operation (e.g., 1*1 convolution) on the result output in step 318 (320).

Here, the convolution operations of steps 319 and 320 may use different parameters.

The time-domain integration network performs element-wise addition on the result output in step 319 and feature map A to obtain the time-domain integrated feature map of the t-th frame (321).

The time-domain integration network performs element-wise addition on the result output in step 320 and feature map A' to obtain the time-domain integrated feature map of the (t-τ)-th frame (322).
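Steps 311 to 322 can be traced end to end in a few lines of NumPy. The sketch below is illustrative only. It assumes that the 1*1 convolutions in steps 311 and 312 keep C channels, that steps 314 to 316 keep 2C channels, and that steps 319 and 320 map 2C channels back to C, and it uses an identity function as a stand-in for the network for spatial fusion. The helper names (`conv1x1`, `conv3x3`, `init`) and the parameter keys are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def conv1x1(x, weight):
    """1*1 convolution: x is (N, C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,nchw->nohw", weight, x)


def conv3x3(x, weight):
    """3*3 convolution with zero padding of 1: weight is (C_out, C_in, 3, 3)."""
    n, _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (0, 0), (1, 1), (1, 1)))
    out = np.zeros((n, weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            out += np.einsum("oc,nchw->nohw", weight[:, :, dy, dx],
                             padded[:, :, dy:dy + h, dx:dx + w])
    return out


def init(*shape):
    """Small random weights, for illustration only."""
    return rng.standard_normal(shape) * 0.1


def time_domain_integration(feat_t, feat_prev, params, spatial_fusion):
    a = conv1x1(feat_t, params["w311"])              # 311: feature map A
    a_prev = conv1x1(feat_prev, params["w312"])      # 312: feature map A'
    e = np.concatenate([a, a_prev], axis=1)          # 313: E, (N, 2C, H, W)
    x = conv1x1(e, params["w314"])                   # 314
    x = conv3x3(x, params["w315"])                   # 315
    x = conv1x1(x, params["w316"])                   # 316: feature map X
    fused = spatial_fusion(x)                        # 317
    s = fused + e                                    # 318
    out_t = conv1x1(s, params["w319"]) + a           # 319 and 321
    out_prev = conv1x1(s, params["w320"]) + a_prev   # 320 and 322
    return out_t, out_prev


C = 4
params = {
    "w311": init(C, C), "w312": init(C, C),
    "w314": init(2 * C, 2 * C), "w315": init(2 * C, 2 * C, 3, 3),
    "w316": init(2 * C, 2 * C),
    "w319": init(C, 2 * C), "w320": init(C, 2 * C),
}
feat_t = rng.standard_normal((2, C, 8, 8))
feat_prev = rng.standard_normal((2, C, 8, 8))
out_t, out_prev = time_domain_integration(
    feat_t, feat_prev, params, spatial_fusion=lambda x: x)  # identity stand-in
```

The two residual additions in steps 321 and 322 mean each output starts from its own frame's feature map and adds information derived from both frames.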

FIG. 3 shows only an exemplary structure and an exemplary flow of the time-domain integration network. The time-domain integration network may have other structures that implement the same function and/or may follow other flows. The present disclosure is not limited to the time-domain integration network of FIG. 3.

FIG. 4 is a diagram illustrating an exemplary structure of a time-domain integration network and its flow of use according to another embodiment.

Referring to FIG. 4, the time-domain integration network (module) includes a network for spatial fusion (e.g., a partial non-local attention network) and can fuse the information of two frames without conventional optical flow computation or conventional alignment.

In FIG. 4, the inputs of the time-domain integration network (module) are the first image feature and the second image feature. The feature maps of both frame images are represented as tensors of size N*C*H*W, where N is the number of images in the same batch (i.e., the batch size), C is the number of channels, and H and W are the height and width of the feature map. The output of the time-domain integration network is the time-domain integrated feature of the two frames.

The specific process of time-domain information fusion processing in the time-domain integration network (module) of FIG. 4 may include steps 411 to 417.

The time-domain integration network combines the feature map of the t-th frame and the feature map of the (t-τ)-th frame to obtain a feature map of dimensions N, 2C, H, W (411).

The time-domain integration network performs a convolution operation (e.g., 1*1 convolution) on the result output in step 411 (412).

The time-domain integration network then performs a further convolution operation (e.g., 3*3 convolution) (413).

The time-domain integration network then performs a further convolution operation (e.g., 1*1 convolution) to obtain feature map X (414).

Here, the convolution operations of steps 412 to 414 may use different parameters.

The time-domain integration network inputs feature map X into the network for spatial fusion (e.g., the partial non-local attention network) (415).

The time-domain integration network performs element-wise addition on the result output in step 415 and the result output in step (5-1) (416).

The time-domain integration network performs a convolution operation (e.g., 1*1 convolution) on the result output in step 416 to obtain the time-domain integrated feature map of the t-th frame (417). The convolution operation in step 417 may use parameters different from those of the other convolution operations.

FIG. 4 shows only an exemplary structure and an exemplary flow of the time-domain integration network. The time-domain integration network may have other structures that implement the same function and/or may follow other flows. The present disclosure is not limited to the time-domain integration network of FIG. 4.

In an embodiment of the present disclosure, the time-domain integration network removes the need to compute the inter-frame optical flow that conventional approaches rely on, so computation speed is greatly improved and the amount of training data the network requires is greatly reduced. In addition, because the features of both frames pass through the same attention network within the time-domain integration network, semantic consistency is also improved.

FIG. 5 is a diagram illustrating an exemplary structure of the network for spatial fusion in the time-domain integration network according to an embodiment.

FIG. 6 is a diagram illustrating an exemplary structure and flow of use of the network for spatial fusion in the time-domain integration network according to an embodiment.

Referring to FIG. 5, the network for spatial fusion (e.g., the partial non-local attention network) may include a feature map receiving unit (510), convolution operation units (511, 521, 541, 551), sub-domain extraction units (512, 522, 542), matrix multiplication units (530, 550), a softmax operation unit (531), and an element-wise addition unit (552).

As shown in FIG. 5, the network for spatial fusion (e.g., the partial non-local attention network) learns inter-frame correlation through sub-domain extraction and matrix multiplication, so the optical-flow-based alignment used in the prior art can be eliminated. C/4 and C/2 in FIG. 5 are example channel counts; the number of channels in the present disclosure is not limited thereto. The input of the network for spatial fusion is a feature map (e.g., feature map X, which is the output of step 316 of FIG. 3 or of step 414 of FIG. 4) of size N*2C*H*W (written below as N, 2C, H, W). The sub-domain extraction units may perform sub-domain extraction through a data reshaping operation. When sub-domain extraction is performed through data reshaping, the specific process by which the network for spatial fusion learns inter-frame correlation may include steps 610 to 652 below, as shown in FIG. 6.

The network for spatial fusion receives feature map X from step 316 of FIG. 3 or step 414 of FIG. 4 (610) and performs a convolution operation (e.g., 1*1 convolution) on feature map X (611).

The network for spatial fusion reshapes the feature map obtained in step 611 to obtain a feature map of size N, C/4, H/k, W/k, k*k (612). C/4 is an example channel count; the number of channels in the present disclosure is not limited thereto.

Here, k denotes the size of the neighborhood associated with a given pixel. It may be set to 16, for example, in which case the neighborhood associated with each pixel has size 16×16.

The network for spatial fusion reshapes the feature map obtained in step 612 to obtain a feature map of size N*(H/k)*(W/k), k*k, C/4 (613). C/4 is an example channel count; the number of channels in the present disclosure is not limited thereto.

The network for spatial fusion performs a convolution operation (e.g., 1*1 convolution) on feature map X (621).

The network for spatial fusion reshapes the feature map obtained in step 621 to obtain a feature map of size N, C/4, H/k, W/k, k*k (622). C/4 is an example channel count; the number of channels in the present disclosure is not limited thereto.

The network for spatial fusion reshapes the feature map obtained in step 622 to obtain a feature map of size N*(H/k)*(W/k), C/4, k*k (623). C/4 is an example channel count; the number of channels in the present disclosure is not limited thereto.

The network for spatial fusion performs a convolution operation (e.g., 1*1 convolution) on feature map X (641).

The network for spatial fusion reshapes the feature map obtained in step 641 to obtain a feature map of size N, C/2, H/k, W/k, k*k (642). C/2 is an example channel count; the number of channels in the present disclosure is not limited thereto.

The network for spatial fusion reshapes the feature map obtained in step 642 to obtain a feature map of size N*(H/k)*(W/k), k*k, C/2 (643). C/2 is an example channel count; the number of channels in the present disclosure is not limited thereto.

그리고, 공간 융합을 위한 네트워크는 613단계 및 623단계에서 얻은 특징맵에 대해 행렬 곱셈 연산을 진행하여, 크기가 N*(H/k)*(W/k), k*k, k*k 인 특징맵을 획득한다(630). In addition, the network for spatial fusion performs matrix multiplication on the feature maps obtained in steps 613 and 623, and has sizes of N*(H/k)*(W/k), k*k, and k*k. A feature map is obtained ( 630 ).

그리고, 공간 융합을 위한 네트워크는 63단계에서 얻은 특징맵에 대해 소프트맥스(Softmax) 연산을 수행한다(631).Then, the network for spatial fusion performs a softmax operation on the feature map obtained in step 63 (631).

Next, the network for spatial fusion performs a matrix multiplication operation on the feature maps obtained in steps 631 and 643 to obtain a feature map of size (N*(H/k)*(W/k), k*k, C/2) (650). Here, C/2 is an example channel count, and the number of channels in the present disclosure is not limited thereto.

Next, the network for spatial fusion performs a convolution operation (for example, a 1*1 convolution) to obtain a feature map restored to size (N, 2C, H, W) (651).

Next, the network for spatial fusion performs an element-wise addition operation on the feature map output in step 651 and the feature map X (652).
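The shape bookkeeping in steps 642 to 652 can be sketched in NumPy as follows. This is only an illustrative sketch under stated assumptions, not the disclosed network: the function names, the 1*1 projections `w_q`, `w_k`, `w_v` and `w_out`, and the choice of contiguous k x k patches as sub-regions are all hypothetical, and the feature maps of steps 613 and 623 are assumed to be patch-reshaped projections of the same input.

```python
import numpy as np

def to_patches(x, k):
    """(N, C, H, W) -> (N*(H/k)*(W/k), k*k, C), one row group per k x k patch (assumed layout)."""
    n, c, h, w = x.shape
    x = x.reshape(n, c, h // k, k, w // k, k)
    x = x.transpose(0, 2, 4, 3, 5, 1)  # (N, H/k, W/k, k, k, C)
    return x.reshape(n * (h // k) * (w // k), k * k, c)

def from_patches(p, n, h, w, k):
    """Inverse of to_patches: (N*(H/k)*(W/k), k*k, C) -> (N, C, H, W)."""
    c = p.shape[-1]
    p = p.reshape(n, h // k, w // k, k, k, c)
    p = p.transpose(0, 5, 1, 3, 2, 4)  # (N, C, H/k, k, W/k, k)
    return p.reshape(n, c, h, w)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(x, weight):
    """1*1 convolution: x is (N, C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,nchw->nohw", weight, x)

def partial_non_local(x, w_q, w_k, w_v, w_out, k):
    n, _, h, w = x.shape
    q = to_patches(conv1x1(x, w_q), k)            # assumed counterpart of step 613
    key = to_patches(conv1x1(x, w_k), k)          # assumed counterpart of step 623
    v = to_patches(conv1x1(x, w_v), k)            # steps 642-643
    attn = softmax(q @ key.transpose(0, 2, 1))    # steps 630-631: (B, k*k, k*k)
    y = attn @ v                                  # step 650: (B, k*k, C/2)
    y = conv1x1(from_patches(y, n, h, w, k), w_out)  # step 651: back to (N, 2C, H, W)
    return x + y                                  # step 652: element-wise addition with X

rng = np.random.default_rng(0)
c_in, c_mid, k = 8, 2, 2                          # 2C = 8 and C/2 = 2 for C = 4
x = rng.normal(size=(2, c_in, 4, 4))
w_q, w_k, w_v = (rng.normal(size=(c_mid, c_in)) for _ in range(3))
w_out = rng.normal(size=(c_in, c_mid))
out = partial_non_local(x, w_q, w_k, w_v, w_out, k)
```

Because attention is computed only among the k*k positions of each sub-region, the attention matrix has size k*k by k*k per sub-region rather than (H*W) by (H*W) for the whole feature map.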

FIGS. 5 and 6 show only an example structure and an example flow of the network for spatial fusion (for example, a partial non-local attention network) in the time-domain integration module. It should be understood that this network may also have other structures capable of implementing the same function and/or other, different flows. The present disclosure is not limited to the network for spatial fusion of FIGS. 5 and 6.

The method according to the embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may store program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiment, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied permanently or temporarily in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The video processing method according to an embodiment has been described above with reference to FIGS. 1 to 6. A video processing apparatus and its units according to an embodiment are described below with reference to FIGS. 7 and 8.

FIG. 7 is a diagram illustrating a video processing apparatus according to an embodiment.

Referring to FIG. 7, the video processing apparatus includes a feature acquisition unit 710, a time-domain information fusion unit 720, and a panorama segmentation unit 730.

The feature acquisition unit 710 may be configured to obtain, from video data, a first image feature of a first image and a second image feature of a second image preceding the first image.

The feature acquisition unit 710 may be configured to perform feature extraction on each of the first image and the second image through a feature extraction network to obtain the first image feature of the first image and the second image feature of the second image.

The time-domain information fusion unit 720 may be configured to perform time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result.

The time-domain information fusion processing result may include a first time-domain integrated feature of the first image and a second time-domain integrated feature of the second image.

The time-domain information fusion unit 720 may be configured to perform a combination operation on the first image feature and the second image feature, divide the combined image feature into two paths, perform correlation processing on the first path, perform an element-wise addition operation on the correlation processing result of the first path and the second path, and obtain the time-domain information fusion processing result according to the element-wise addition operation result.

The time-domain information fusion unit 720 may be configured to perform at least one convolution operation on the first path, input the first path after the convolution operation into a network for spatial-domain fusion, and perform the correlation processing through the network for spatial-domain fusion.

The time-domain information fusion unit 720 may be configured to divide the feature input to the network for spatial-domain fusion into at least two paths, perform subdomain extraction on some or all of the at least two paths, perform a matrix multiplication operation on the subdomain extraction results, and perform an element-wise addition operation on the matrix multiplication operation result and the feature input to the network for spatial-domain fusion.

The time-domain information fusion unit 720 may be configured to divide the convolved combined feature into four paths; perform subdomain extraction on each of the first, second, and third paths among the four paths; perform a matrix multiplication operation on the subdomain extraction result of the first path and the subdomain extraction result of the second path; perform a matrix multiplication operation on that matrix multiplication operation result and the subdomain extraction result of the third path; and then perform an element-wise addition operation on the result of the matrix multiplication with the subdomain extraction result of the third path and the fourth path among the four paths.

The time-domain information fusion unit 720 may be configured to perform at least one convolution operation on the first image feature, perform at least one convolution operation on the second image feature, and perform the combination operation on the convolved first image feature and the convolved second image feature.

The time-domain information fusion unit 720 may be configured to divide the element-wise addition operation result into two paths and perform at least one convolution operation on the element-wise addition operation result of each of the two paths. It may then perform an element-wise addition operation on the convolved element-wise addition operation result of the first path and the convolved first image feature to obtain the first time-domain integrated feature, and perform an element-wise addition operation on the convolved element-wise addition operation result of the second path and the convolved second image feature to obtain the second time-domain integrated feature.

The time-domain information fusion unit 720 may be configured to perform at least one convolution operation on the element-wise addition operation result, and take the convolved element-wise addition operation result and the second image feature as the time-domain information fusion processing result.
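The data flow described for the time-domain information fusion unit 720 can be sketched as follows. This is a minimal sketch under several assumptions that the text does not fix: the combination operation is taken to be channel-wise concatenation, "dividing into two paths" is taken to mean routing the same combined feature down two branches, every convolution is a 1*1 convolution, and `correlate` is a caller-supplied stand-in for the network for spatial-domain fusion. The names `temporal_fusion`, `conv1x1`, and the weight keys are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1*1 convolution: x is (N, C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,nchw->nohw", weight, x)

def temporal_fusion(feat1, feat2, w, correlate):
    # Convolve each image feature, then combine them (concatenation is assumed).
    a1 = conv1x1(feat1, w["pre1"])
    a2 = conv1x1(feat2, w["pre2"])
    combined = np.concatenate([a1, a2], axis=1)

    # Two paths: path 1 is convolved and correlated, path 2 is passed through.
    path1, path2 = combined, combined
    correlated = correlate(conv1x1(path1, w["corr"]))
    summed = correlated + path2  # element-wise addition of the two paths

    # Split the sum into two branches, convolve each, and add the convolved
    # per-image features to obtain the two time-domain integrated features.
    out1 = conv1x1(summed, w["post1"]) + a1
    out2 = conv1x1(summed, w["post2"]) + a2
    return out1, out2

rng = np.random.default_rng(0)
c = 4
weights = {
    "pre1": rng.normal(size=(c, c)),
    "pre2": rng.normal(size=(c, c)),
    "corr": rng.normal(size=(2 * c, 2 * c)),
    "post1": rng.normal(size=(c, 2 * c)),
    "post2": rng.normal(size=(c, 2 * c)),
}
feat1 = rng.normal(size=(1, c, 2, 2))  # first image feature
feat2 = rng.normal(size=(1, c, 2, 2))  # second (preceding) image feature
out1, out2 = temporal_fusion(feat1, feat2, weights, correlate=lambda t: t)
```

In the variant where only the first image receives a fused feature, `out2` would simply be replaced by the second image feature itself.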

The panorama segmentation unit 730 is configured to obtain a panorama segmentation result of the first image according to the time-domain information fusion processing result.

The panorama segmentation unit 730 may be configured to perform instance tracking on the first image based on the time-domain information fusion processing result to obtain an instance correspondence between frames for the first image, and to obtain the panorama segmentation result of the first image according to the instance correspondence between frames.

FIG. 8 is a diagram illustrating the panorama segmentation unit of a video processing apparatus according to an embodiment.

Referring to FIG. 8, the panorama segmentation unit 730 may include an instance database update unit 810 and an instance tracking unit 820.

The instance database update unit 810 is configured to update an instance database of the video data based on the second time-domain integrated feature.

The instance database update unit 810 may be configured to select a first number of preset features from the second time-domain integrated feature and add the selected first number of preset features to the instance database of the video data.

The instance tracking unit 820 is configured to perform instance tracking on the first time-domain integrated feature based on the updated instance database.

The instance tracking unit 820 may be configured to select a second number of preset features from the first time-domain integrated feature and perform instance correspondence through a tracking network based on the selected second number of preset features and the updated instance database.

Here, the preset features may include a region-of-interest feature, a feature expressed based on a bounding box, or a feature expressed based on a mask.
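The update-then-track flow of units 810 and 820 can be illustrated with the toy sketch below. It is not the disclosed tracking network: cosine similarity with a fixed threshold is swapped in as a simple stand-in for the tracking network, selection of the preset features by a confidence score is an assumption, and all names (`select_top`, `InstanceDatabase`, `match`) are hypothetical.

```python
import numpy as np

def select_top(features, scores, count):
    """Pick `count` feature vectors with the highest scores (selection rule is assumed)."""
    order = np.argsort(scores)[::-1][:count]
    return features[order]

class InstanceDatabase:
    """Stores one feature vector and one instance id per known instance."""

    def __init__(self, dim):
        self.features = np.empty((0, dim))
        self.ids = np.empty((0,), dtype=int)

    def add(self, features, ids):
        self.features = np.vstack([self.features, features])
        self.ids = np.concatenate([self.ids, np.asarray(ids, dtype=int)])

def match(db, queries, threshold=0.5):
    """Cosine-similarity matching, a stand-in for the tracking network."""
    def unit(v):
        return v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = unit(queries) @ unit(db.features).T
    best = sim.argmax(axis=1)
    # -1 marks a query with no sufficiently similar database entry.
    return [int(db.ids[j]) if sim[i, j] >= threshold else -1
            for i, j in enumerate(best)]

# Features from the second (preceding) image update the database.
prev_feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
prev_scores = np.array([0.9, 0.8, 0.1])
db = InstanceDatabase(dim=2)
db.add(select_top(prev_feats, prev_scores, count=2), ids=[7, 9])

# Features from the first (current) image are matched against it.
curr_feats = np.array([[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]])
matches = match(db, curr_feats)
print(matches)  # [7, 9, -1]
```

A learned tracking network would replace `match`, but the surrounding order of operations (update the database from the preceding frame, then query it with the current frame) follows the description above.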

The panorama segmentation unit 730 may also be configured to perform semantic segmentation, instance segmentation, and bounding box refinement on the first image based on the time-domain information fusion processing result to obtain a semantic segmentation result, an instance segmentation result, and a bounding box of the first image, and to fuse the semantic segmentation result, the instance segmentation result, the bounding box, and the instance correspondence between frames of the first image to obtain the panorama segmentation result of the first image.

The video processing apparatus according to an embodiment has been described above with reference to FIGS. 7 and 8. Next, an electronic device according to an embodiment is described with reference to FIG. 9.

FIG. 9 is a diagram illustrating an electronic device according to an embodiment.

Referring to FIG. 9, the electronic device 900 includes a memory 910 and a processor 920, and a computer program 912 is stored in the memory 910. When the computer program 912 is executed by the processor 920, the video processing method according to an embodiment of the present disclosure is implemented.

When the computer program 912 is executed by the processor 920, the following steps may be implemented: obtaining a first image feature of a first image of video data and a second image feature of a second image preceding the first image; performing time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result; and obtaining a panorama segmentation result of the first image according to the time-domain information fusion processing result.
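The three steps above can be arranged as a per-frame loop, sketched below. This is only a skeleton under stated assumptions: `extract`, `fuse`, and `segment` are hypothetical caller-supplied callables standing in for the feature extraction network, the time-domain information fusion, and the panorama segmentation, and reusing the first frame's own feature when no preceding frame exists is an assumption the text does not address.

```python
def process_video(frames, extract, fuse, segment):
    """Run feature acquisition, time-domain fusion, and segmentation frame by frame."""
    results = []
    prev_feat = None
    for frame in frames:
        feat = extract(frame)                    # first image feature
        if prev_feat is None:
            prev_feat = feat                     # assumed handling of the first frame
        fused = fuse(feat, prev_feat)            # time-domain information fusion
        results.append(segment(fused))           # panorama segmentation result
        prev_feat = feat                         # becomes the second image feature next
    return results

# Toy stand-ins on plain numbers, only to show the order of calls.
results = process_video(
    [1, 2, 3],
    extract=lambda f: f * 2,
    fuse=lambda cur, prev: cur + prev,
    segment=lambda fused: fused,
)
print(results)  # [4, 6, 10]
```

Caching the current feature for use as the next frame's "second image feature" avoids extracting features twice per frame.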

The electronic device illustrated in FIG. 9 is merely an example and should not limit the function or scope of use of the embodiments of the present disclosure.

The video processing method and apparatus according to an embodiment of the present disclosure have been described above with reference to FIGS. 1 to 9. The video processing apparatus and its components shown in FIGS. 7 and 8 may each be implemented as software, hardware, firmware, or any combination of these to perform specific functions. The electronic device shown in FIG. 9 is not limited to the illustrated components; some components may be added or removed as needed, and components may be combined.

The video processing apparatus and method according to an embodiment of the present disclosure obtain a first image feature of a first image of video data and a second image feature of a second image preceding the first image, perform time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result, and obtain a panorama segmentation result of the first image according to the time-domain information fusion processing result, thereby lowering the cost of video processing and improving its speed and accuracy. The time-domain information fusion processing may be performed through an artificial intelligence network. The video processing method of the present disclosure can be implemented through artificial intelligence and can provide an artificial intelligence basis for applications that require a global view of video segmentation, such as autonomous driving, augmented reality, and video editing. With the video processing apparatus and method of the present disclosure, the quality and speed of automatic recognition of an autonomous vehicle's surroundings can be improved, which in turn can improve the safety of autonomous driving.

Although the embodiments have been described with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, an appropriate result may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of a system, structure, apparatus, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

1. A video processing method comprising:
obtaining a first image feature of a first image of video data and a second image feature of a second image preceding the first image;
performing time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result; and
obtaining a panorama segmentation result of the first image according to the time-domain information fusion processing result.

2. The method of claim 1,
wherein obtaining the panorama segmentation result of the first image according to the time-domain information fusion processing result comprises:
performing instance tracking on the first image based on the time-domain information fusion processing result to obtain an instance correspondence between frames for the first image; and
obtaining the panorama segmentation result of the first image according to the instance correspondence between frames for the first image.

3. The method of claim 2,
wherein obtaining the panorama segmentation result of the first image according to the time-domain information fusion processing result further comprises:
performing semantic segmentation, instance segmentation, and bounding box refinement on the first image based on the time-domain information fusion processing result to obtain a semantic segmentation result of the first image, an instance segmentation result of the first image, and a bounding box of the first image,
and wherein obtaining the panorama segmentation result of the first image according to the instance correspondence between frames for the first image comprises:
fusing the semantic segmentation result of the first image, the instance segmentation result of the first image, the bounding box of the first image, and the instance correspondence between frames for the first image to obtain the panorama segmentation result of the first image.

4. The method of claim 1,
wherein obtaining the first image feature of the first image of the video data and the second image feature of the second image preceding the first image comprises:
performing feature extraction on each of the first image and the second image through a feature extraction network to obtain the first image feature of the first image and the second image feature of the second image.

5. The method of claim 1,
wherein the time-domain information fusion processing result comprises
a first time-domain integrated feature of the first image and a second time-domain integrated feature of the second image.

6. The method of claim 1,
wherein performing the time-domain information fusion processing on the first image feature and the second image feature to obtain the time-domain information fusion processing result comprises:
performing a combination operation on the first image feature and the second image feature;
dividing the combined image feature into two paths, and performing correlation processing on a first path;
performing an element-wise addition operation on the correlation processing result of the first path and a second path; and
obtaining the time-domain information fusion processing result according to the element-wise addition operation result.

7. The method of claim 6,
wherein performing the correlation processing on the first path comprises:
performing at least one convolution operation on the first path; and
inputting the first path after the convolution operation into a network for spatial-domain fusion, and performing the correlation processing through the network for spatial-domain fusion.

8. The method of claim 7,
wherein performing the correlation processing through the network for spatial-domain fusion comprises:
dividing the feature input to the network for spatial-domain fusion into at least two paths, and performing subdomain extraction on some or all of the at least two paths;
performing a matrix multiplication operation on the subdomain extraction results; and
performing an element-wise addition operation on the matrix multiplication operation result and the feature input to the network for spatial-domain fusion.

9. The method of claim 7,
wherein performing the correlation processing through the network for spatial-domain fusion comprises:
dividing the convolved combined feature into four paths;
performing subdomain extraction on each of a first path, a second path, and a third path among the four paths;
performing a matrix multiplication operation on the subdomain extraction result of the first path and the subdomain extraction result of the second path among the four paths, and performing a matrix multiplication operation on that matrix multiplication operation result and the subdomain extraction result of the third path among the four paths; and
performing an element-wise addition operation on the result of the matrix multiplication operation with the subdomain extraction result of the third path and a fourth path among the four paths.

10. The method of claim 9,
wherein performing the subdomain extraction comprises
performing the subdomain extraction through data reshaping.

11. The method of claim 6,
wherein performing the combination operation on the first image feature and the second image feature comprises:
performing at least one convolution operation on the first image feature;
performing at least one convolution operation on the second image feature; and
performing the combination operation on the convolved first image feature and the convolved second image feature.

12. The method of claim 11,
wherein obtaining the time-domain information fusion processing result according to the element-wise addition operation result comprises:
dividing the element-wise addition operation result into two paths;
performing at least one convolution operation on the element-wise addition operation result of each of the two paths;
obtaining a first time-domain integrated feature by performing an element-wise addition operation on the convolved element-wise addition operation result of the first path and the convolved first image feature; and
obtaining a second time-domain integrated feature by performing an element-wise addition operation on the convolved element-wise addition operation result of the second path and the convolved second image feature.

13. The method of claim 6,
wherein obtaining the time-domain information fusion processing result according to the element-wise addition operation result comprises:
performing at least one convolution operation on the element-wise addition operation result, and taking the convolved element-wise addition operation result and the second image feature as the time-domain information fusion processing result.

14. The method of claim 2,
wherein performing instance tracking on the first image based on the time-domain information fusion processing result to obtain the instance correspondence between frames for the first image comprises:
updating an instance database of the video data based on a second time-domain integrated feature; and
performing instance tracking on a first time-domain integrated feature based on the updated instance database.

15. The method of claim 14,
wherein updating the instance database of the video data based on the second time-domain integrated feature comprises:
selecting a first number of preset features from the second time-domain integrated feature; and
adding the selected first number of preset features to the instance database of the video data.

16. The method of claim 14,
wherein performing instance tracking on the first time-domain integrated feature based on the updated instance database comprises:
selecting a second number of preset features from the first time-domain integrated feature; and
performing instance correspondence through a tracking network based on the selected second number of preset features and the updated instance database.

17. The method of claim 15,
wherein the preset features comprise
at least one of a region-of-interest feature, a feature expressed based on a bounding box, and a feature expressed based on a mask.

18. A computer-readable recording medium on which a program for executing the method of any one of claims 1 to 17 is recorded.

19. A video processing apparatus comprising:
a feature acquisition unit configured to obtain a first image feature of a first image of video data and a second image feature of a second image preceding the first image;
a time-domain information fusion unit configured to perform time-domain information fusion processing on the first image feature and the second image feature to obtain a time-domain information fusion processing result; and
a panorama segmentation unit configured to obtain a panorama segmentation result of the first image according to the time-domain information fusion processing result.

20. The apparatus of claim 19,
wherein the panorama segmentation unit is configured to:
perform instance tracking on the first image based on the time-domain information fusion processing result to obtain an instance correspondence between frames for the first image; and
obtain the panorama segmentation result of the first image according to the instance correspondence between frames for the first image.