KR20230157220A

KR20230157220A - Image processing apparatus and operating method for the same

Info

Publication number: KR20230157220A
Application number: KR1020220121078A
Authority: KR
Inventors: 강수민; 송영찬; 이태미; 박재연; 신하늘; 안일준
Original assignee: 삼성전자주식회사
Priority date: 2022-05-09
Filing date: 2022-09-23
Publication date: 2023-11-16

Abstract

An image processing apparatus that processes an image by using one or more neural networks, according to an embodiment, may include a memory storing one or more instructions, and at least one processor that, by executing the one or more instructions stored in the memory, obtains first feature data based on a first image; performs first image processing on the first feature data to obtain second feature data corresponding to first regions including a first number of pixels; obtains third feature data based on the first image; performs second image processing on the third feature data to obtain fourth feature data corresponding to second regions including a second number of pixels greater than the first number; and generates a second image based on the second feature data and the fourth feature data.

Description

Image processing apparatus and operating method for the same

Various embodiments relate to an image processing apparatus that processes an image by using a neural network, and an operating method thereof.

With the development of computer technology, data traffic has grown exponentially, and artificial intelligence has become an important trend driving future innovation. Because artificial intelligence imitates the way humans think, it can be applied to virtually every industry. Representative artificial intelligence technologies include pattern recognition, machine learning, expert systems, neural networks, and natural language processing.

A neural network models the characteristics of human biological neurons by mathematical expressions and uses an algorithm that imitates the human ability to learn. Through this algorithm, a neural network can generate a mapping between input data and output data, and the ability to generate such a mapping may be described as the learning ability of the neural network. In addition, based on what it has learned, a neural network has a generalization ability to generate correct output data for input data that was not used for training.

Neural networks may be used for image processing. In particular, a deep neural network (DNN) may be used to perform image processing that removes noise or artifacts from an image or increases the resolution of an image.

An image processing apparatus according to an embodiment may process an image by using one or more neural networks.

An image processing apparatus according to an embodiment may include a memory storing one or more instructions and at least one processor that executes the one or more instructions.

By executing the one or more instructions stored in the memory, the at least one processor according to an embodiment may obtain first feature data based on a first image.

By executing the one or more instructions stored in the memory, the at least one processor according to an embodiment may perform first image processing on the first feature data to obtain second feature data corresponding to first regions including a first number of pixels.

By executing the one or more instructions stored in the memory, the at least one processor according to an embodiment may obtain third feature data based on the first image.

By executing the one or more instructions stored in the memory, the at least one processor according to an embodiment may perform second image processing on the third feature data to obtain fourth feature data corresponding to second regions including a second number of pixels greater than the first number.

By executing the one or more instructions stored in the memory, the at least one processor according to an embodiment may generate a second image based on the second feature data and the fourth feature data.

An operating method of an image processing apparatus that processes an image by using one or more neural networks, according to an embodiment, may include obtaining first feature data based on a first image.

The operating method according to an embodiment may include performing first image processing on the first feature data to obtain second feature data corresponding to first regions including a first number of pixels.

The operating method according to an embodiment may include obtaining third feature data based on the first image.

The operating method according to an embodiment may include performing second image processing on the third feature data to obtain fourth feature data corresponding to second regions including a second number of pixels greater than the first number.

The operating method according to an embodiment may include generating a second image based on the second feature data and the fourth feature data.

A computer-readable recording medium according to an embodiment may store a program including at least one instruction for causing a computer to perform the operating method of the image processing apparatus that processes an image by using one or more neural networks, according to an embodiment.

FIG. 1 is a diagram illustrating an operation in which an image processing apparatus according to an embodiment processes an image by using an image processing network.
FIG. 2 is a diagram illustrating a second feature extraction network according to an embodiment.
FIG. 3 is a diagram illustrating a transform block according to an embodiment.
FIG. 4 is a diagram illustrating a first transform layer according to an embodiment.
FIG. 5 is a diagram illustrating a self-attention operation performed in a first self-attention module according to an embodiment.
FIG. 6 is a diagram illustrating a first self-attention module according to an embodiment.
FIG. 7 is a diagram illustrating a multi-layer perceptron module according to an embodiment.
FIG. 8 is a diagram illustrating a second transform layer according to an embodiment.
FIG. 9 is a diagram illustrating a self-attention operation performed in a second self-attention module according to an embodiment.
FIG. 10 is a diagram illustrating a second self-attention module according to an embodiment.
FIG. 11 is a diagram illustrating a second self-attention module according to an embodiment.
FIG. 12 is a diagram illustrating a transform block according to an embodiment.
FIG. 13 is a diagram illustrating a transform block according to an embodiment.
FIG. 14 is a flowchart illustrating an operating method of an image processing apparatus according to an embodiment.
FIG. 15 is a block diagram illustrating a configuration of an image processing apparatus according to an embodiment.

The terms used in this specification are briefly explained, and then the present invention is described in detail.

The terms used in the present invention have been selected, where possible, from general terms that are currently in wide use, in consideration of their functions in the present invention; however, they may vary according to the intention of those working in the art, precedent, the emergence of new technologies, and the like. In certain cases, a term has been arbitrarily selected by the applicant, and in such cases its meaning is described in detail in the corresponding part of the description. Therefore, the terms used in the present invention should be defined based on their meaning and the overall content of the present invention, rather than simply on their names.

Throughout the specification, when a part is said to "include" an element, this means that, unless specifically stated otherwise, it may further include other elements rather than excluding them. In addition, terms such as "... unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily carry them out. However, the present invention may be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar parts are given similar reference numerals throughout the specification.

FIG. 1 is a diagram illustrating an operation in which an image processing apparatus according to an embodiment processes an image by using an image processing network.

Referring to FIG. 1, an image processing network 103 according to an embodiment may receive a first image 101 and process the first image 101 to generate a second image 102. Here, the first image 101 may be an image containing noise or artifacts, and may be a low-resolution or low-quality image. An image processing apparatus 100 may generate the second image 102 by using the image processing network 103 to perform denoising, which removes noise while preserving the fine edges and textures of the first image 101. In addition, the second image 102 may have a higher resolution than the first image 101. In addition, the second image 102 may have better image quality than the first image 101. However, embodiments are not limited thereto.

The image processing network 103 according to an embodiment may include one or more neural networks. For example, the image processing network 103 may include a first feature extraction network 200, a second feature extraction network 300, and an image restoration network 400. However, embodiments are not limited thereto.

The first feature extraction network 200 maps the image space in which the first image 101 (input image) exists to a feature space, and may include one or more convolutional neural networks (CNNs). The second feature extraction network 300 may be a network that, based on the features extracted by the first feature extraction network 200, extracts features of a higher dimension (high-dimensional features) than those extracted by the first feature extraction network 200.

The image restoration network 400 may be a network that obtains the second image 102 (output image) by using the high-dimensional features. The image restoration network 400 may include one or more convolutional neural networks (CNNs). The image restoration network 400 according to an embodiment may perform upsampling to generate a second image having a higher resolution than the first image.

However, embodiments are not limited thereto, and the image processing network 103 according to an embodiment may include various neural networks.

Hereinafter, an image processing network according to an embodiment is described in detail with reference to the drawings.

FIG. 2 is a diagram illustrating a second feature extraction network according to an embodiment.

Referring to FIG. 2, the second feature extraction network 300 according to an embodiment may include one or more transform blocks 310, a convolution layer 320, and a summation layer 330.

Each of the transform blocks 310 according to an embodiment may extract second feature data by using information about neighboring pixels adjacent to each pixel included in input first feature data, and information about neighboring regions adjacent to each predetermined unit region included in the input first feature data. The transform block is described in detail with reference to FIG. 3.

In the convolution layer 320 according to an embodiment, a convolution operation may be performed between input data (or input information) input to the convolution layer 320 and a kernel included in the convolution layer 320. Here, the input data input to the convolution layer 320 may be result data output after operations are performed in the transform blocks 310. Although FIG. 2 shows the second feature extraction network 300 as including one convolution layer 320, embodiments are not limited thereto, and the second feature extraction network 300 may include two or more convolution layers.

In the summation layer 330 according to an embodiment, an element-wise summation may be performed on input data (x1) input to the second feature extraction network 300 and output data output from the convolution layer 320. Output data (y1) output from the summation layer 330 may be input to the image restoration network 400 of FIG. 1. However, embodiments are not limited thereto.
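The data flow described for FIG. 2 (transform blocks, then a convolution, then an element-wise sum with the network input) can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes a single-channel feature map, zero padding, transform blocks applied one after another, and hypothetical names (`conv2d_same`, `second_feature_extraction`) that do not appear in the specification.

```python
from typing import Callable, List, Sequence

Feature = List[List[float]]  # single-channel H x W feature map (simplifying assumption)


def conv2d_same(x: Feature, kernel: Sequence[Sequence[float]]) -> Feature:
    """Sliding-window convolution (cross-correlation form) that keeps the H x W size.

    Zero padding at the borders is an assumption of this sketch.
    """
    h, w = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    rr, cc = r + i - ph, c + j - pw
                    if 0 <= rr < h and 0 <= cc < w:  # outside the map counts as zero
                        acc += kernel[i][j] * x[rr][cc]
            out[r][c] = acc
    return out


def second_feature_extraction(
    x1: Feature,
    transform_blocks: Sequence[Callable[[Feature], Feature]],
    kernel: Sequence[Sequence[float]],
) -> Feature:
    """Compute y1 = x1 + Conv(TransformBlocks(x1))."""
    data = x1
    for block in transform_blocks:        # transform blocks 310 (sequential order assumed)
        data = block(data)
    conv_out = conv2d_same(data, kernel)  # convolution layer 320
    # summation layer 330: element-wise sum of the network input and the convolution output
    return [[a + b for a, b in zip(row_x, row_c)] for row_x, row_c in zip(x1, conv_out)]
```

Because the sum takes the network input x1 directly, the transform blocks and the convolution only need to model a correction to x1 rather than the full output.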

FIG. 3 is a diagram illustrating a transform block according to an embodiment.

The transform block 311 shown in FIG. 3 may be any one of the transform blocks 310 of FIG. 2. Referring to FIG. 3, the transform block 311 according to an embodiment may include one or more first residual transform blocks and one or more second residual transform blocks.

The first residual transform block 350 according to an embodiment may be referred to as an intra residual transformer block (aRTB), and the second residual transform block 370 may be referred to as an inter residual transformer block (eRTB).

In addition, the transform block 311 according to an embodiment may have a structure in which first residual transform blocks 350 and second residual transform blocks 370 are arranged alternately. However, embodiments are not limited thereto, and the transform block 311 may have a structure in which the first residual transform block 350 and the second residual transform block 370 are arranged in parallel.

The first residual transform block 350 may include one or more first transform layers 351, a convolution layer 352, and a summation layer 353. The first transform layer is described in detail with reference to FIG. 4.

In the convolution layer 352 according to an embodiment, a convolution operation may be performed between input data (or input information) input to the convolution layer 352 and a kernel included in the convolution layer 352. Here, the input data input to the convolution layer 352 may be result data output after operations are performed in the first transform layers 351. Although FIG. 3 shows the first residual transform block 350 as including one convolution layer 352, embodiments are not limited thereto, and the first residual transform block 350 may include two or more convolution layers.

In the summation layer 353 according to an embodiment, an element-wise summation may be performed on the input data input to the first residual transform block 350 and the output data output from the convolution layer 352. The data output from the summation layer 353 may be input to the second residual transform block 370. However, embodiments are not limited thereto.

In addition, referring to FIG. 3, the second residual transform block 370 may include one or more second transform layers 371, a convolution layer 372, and a summation layer 373. The second transform layer is described in detail with reference to FIG. 8.

In the convolution layer 372 according to an embodiment, a convolution operation may be performed between input data (or input information) input to the convolution layer 372 and a kernel included in the convolution layer 372. Here, the input data input to the convolution layer 372 may be result data output after operations are performed in the second transform layers 371. Although FIG. 3 shows the second residual transform block 370 as including one convolution layer 372, embodiments are not limited thereto, and the second residual transform block 370 may include two or more convolution layers.

In the summation layer 373 according to an embodiment, an element-wise summation may be performed on the input data input to the second residual transform block 370 and the output data output from the convolution layer 372. The data output from the summation layer 373 may be input to a first residual transform block located after the second residual transform block 370. However, embodiments are not limited thereto.
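The alternating arrangement of FIG. 3 can be sketched as a chain in which each residual transform block adds its input to the output of its convolution layer, and the result feeds the next block. The sketch below assumes flattened feature vectors and treats the transform layers and convolution layers as opaque callables; `make_residual_block` and `transform_block` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]                   # flattened feature data (simplifying assumption)
Op = Callable[[Vector], Vector]        # a transform layer, a convolution layer, or a whole block


def make_residual_block(transform_layers: Sequence[Op], conv: Op) -> Op:
    """Build a residual transform block: out = x + conv(layers(x))."""
    def block(x: Vector) -> Vector:
        data = x
        for layer in transform_layers:  # transform layers 351 (aRTB) or 371 (eRTB)
            data = layer(data)
        conv_out = conv(data)           # convolution layer 352 or 372
        # summation layer 353 or 373: element-wise sum with the block input
        return [a + b for a, b in zip(x, conv_out)]
    return block


def transform_block(x: Vector, intra_blocks: Sequence[Op], inter_blocks: Sequence[Op]) -> Vector:
    """Apply aRTB -> eRTB -> aRTB -> eRTB ... in alternation (the serial variant of FIG. 3)."""
    data = x
    for intra, inter in zip(intra_blocks, inter_blocks):
        data = intra(data)  # first residual transform block 350
        data = inter(data)  # second residual transform block 370
    return data
```

The parallel arrangement mentioned in the text would instead feed the same input to both block types and combine their outputs; how they are combined is not stated in this passage, so it is not sketched here.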

FIG. 4 is a diagram illustrating a first transform layer according to an embodiment.

The first transform layer 410 shown in FIG. 4 may be any one of the first transform layers 351 of FIG. 3. Referring to FIG. 4, the first transform layer 410 according to an embodiment may include a first normalization layer 411, a first self-attention module 412, a first summation layer 413, a second normalization layer 414, a multi-layer perceptron (MLP) module 415, and a second summation layer 416.

The first normalization layer 411 according to an embodiment may normalize input data (x2) input to the first transform layer 410. For example, the first normalization layer 411 may normalize the input data (x2) so that the sum of the input data input to the first transform layer 410 is 1. However, embodiments are not limited thereto, and the input data may be normalized by various normalization methods. The normalized input data may be input to the first self-attention module 412.
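As a small illustration of the example normalization (scaling the input so that its values sum to 1), the following sketch divides every element by the total. The text does not state over which elements the sum is taken, so summing over the whole flattened input is an assumption here, as is the handling of a zero total; `normalize_sum_to_one` is a hypothetical name.

```python
from typing import List


def normalize_sum_to_one(values: List[float], eps: float = 1e-12) -> List[float]:
    """Scale the values so that they sum to 1 (sum taken over the whole input, by assumption)."""
    total = sum(values)
    if abs(total) < eps:
        # The text does not say what happens for a zero sum; this sketch simply refuses.
        raise ValueError("sum of inputs is (near) zero; this normalization is undefined")
    return [v / total for v in values]
```

For example, `normalize_sum_to_one([1.0, 3.0])` scales the two values by their total of 4.0.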

The first self-attention module 412 according to an embodiment may perform a self-attention operation on first feature data (for example, the normalized input data) input to the first self-attention module 412 to obtain second feature data corresponding to first regions including a first number of pixels. Here, the first number may be one (a single pixel).

Here, an attention operation refers to an operation that obtains association information (for example, similarity information) between query data (Q) and key data (K), obtains weights based on the association information, applies the weights to value data (V) mapped to the key data (K), and computes a weighted sum of the value data (V) to which the weights have been applied.

An attention operation performed based on query data (Q), key data (K), and value data (V) obtained from the same input data may be referred to as a self-attention operation.

The self-attention operation performed in the first self-attention module 412 according to an embodiment is described in detail with reference to FIGS. 5 and 6.

FIG. 5 is a diagram illustrating a self-attention operation performed in a first self-attention module according to an embodiment.

The first self-attention module 412 according to an embodiment may be referred to as an intra multi-head self-attention (intra MSA) module.

Referring to FIG. 5, in the first self-attention module 412, query data (Q), key data (K), and value data (V) may be obtained based on first input data 510 input to the first self-attention module 412.

For example, the first input data 510 may have a size of W x H and C channels. For convenience of description, FIG. 5 is described using an example in which the first input data 510 includes one channel (C=1).

The first self-attention module 412 may apply self-attention to the pixels included in the first input data 510 in units of patches having a size of M x M. For example, the first self-attention module 412 may perform self-attention in units of the M² pixels included in one patch. For convenience of description, FIG. 5 describes the self-attention operation with respect to the pixels (x₁, x₂, ..., x_K) included in one patch, where K is M². Referring to FIG. 5, query data 521 respectively corresponding to the pixels (x₁, x₂, ..., x_K) may be obtained by multiplying the pixels (x₁, x₂, ..., x_K) included in one patch by a first weight matrix (W_Q1). In addition, key data 522 respectively corresponding to the pixels (x₁, x₂, ..., x_K) may be obtained by multiplying the pixels (x₁, x₂, ..., x_K) by a second weight matrix (W_K1). In addition, value data 523 respectively corresponding to the pixels (x₁, x₂, ..., x_K) may be obtained by multiplying the pixels (x₁, x₂, ..., x_K) by a third weight matrix (W_V1).
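The patch-wise setup can be sketched for the single-channel case (C=1) as follows. The sketch assumes non-overlapping patches, H and W divisible by M, and a weight matrix of shape 1 x d so that each scalar pixel becomes a d-dimensional vector; the function names and the dimension d are illustrative assumptions, not taken from the specification.

```python
from typing import List


def partition_into_patches(feature: List[List[float]], m: int) -> List[List[float]]:
    """Split an H x W single-channel map into M x M patches, each flattened to M*M pixels.

    Non-overlapping patches in row-major order are an assumption of this sketch.
    """
    h, w = len(feature), len(feature[0])
    if h % m or w % m:
        raise ValueError("H and W must be multiples of M in this sketch")
    patches = []
    for top in range(0, h, m):
        for left in range(0, w, m):
            patches.append(
                [feature[top + r][left + c] for r in range(m) for c in range(m)]
            )
    return patches


def project(pixels: List[float], weight: List[float]) -> List[List[float]]:
    """Multiply each scalar pixel (C=1) by a 1 x d weight matrix, giving one d-vector per pixel.

    Called with W_Q1, W_K1, or W_V1, this yields the query, key, or value data of a patch.
    """
    return [[x * w for w in weight] for x in pixels]
```

With M = 2 and a 4 x 4 input, this produces four patches of K = M² = 4 pixels each, and self-attention is then computed independently inside every patch.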

Association data (E, 530) may be obtained through an element-wise multiplication of the key data 522 and the query data 521. For example, the association data (E, 530) may be calculated by Equation 1 below.
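The equation itself is not reproduced in this text. A form consistent with the symbol definitions given for Equation 1, assuming the association is the plain product of a query and a key (a dot product when q_i and k_j are vectors) with no scaling factor, would be:

```latex
e_{ij} = q_i \cdot k_j, \qquad i, j \in \{1, \dots, K\}
```

Any scaling or bias term that the original equation may contain cannot be recovered from the surrounding description.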

In Equation 1, e_ij denotes the element at position (i, j) in the association data (E, 530), q_i denotes the query data corresponding to the i-th pixel (x_i) among the query data 521, and k_j denotes the key data corresponding to the j-th pixel (x_j) among the key data 522.

The first self-attention module 412 may obtain weight data (A, 540) by applying a softmax function to the association data (E, 530). For example, the weight data (A, 540) may be calculated by Equation 2 below.
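The equation itself is not reproduced in this text. The standard softmax, assuming it is taken over the key index j separately for each query index i, would read:

```latex
a_{ij} = \frac{\exp(e_{ij})}{\sum_{l=1}^{K} \exp(e_{il})}
```

Under this assumption the weights for each pixel i are non-negative and sum to 1 over the K pixels of the patch.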

In Equation 2, a_ij denotes the element at position (i, j) in the weight data (A, 540), and e_ij denotes the element at position (i, j) in the association data (E, 530).

The first self-attention module 412 may obtain first output data 550 corresponding to the pixels (x₁, x₂, ..., x_K) by performing a weighted sum of the weight data (A, 540) and the value data 523. For example, the first output data 550 may be calculated by Equation 3 below.

In Equation 3, y_i denotes the feature value, in the first output data 550, corresponding to the pixel x_i among the pixels (x₁, x₂, ..., x_K).
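The operations of FIG. 5 can be illustrated with a minimal sketch in plain Python. It assumes that the "element-wise multiplication" of keys and queries means the pairwise product e_ij = q_i · k_j (that is, Q·Kᵀ), that the softmax is taken per row, and that Equation 3 has the form y_i = Σ_j a_ij · v_j. The helper names (`matmul`, `softmax_rows`, `patch_self_attention`) and the example values are hypothetical and are not taken from the source.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(e):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in e:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([v / s for v in exps])
    return out

def patch_self_attention(x, w_q, w_k, w_v):
    """x: K x C pixel features of one patch; w_q, w_k, w_v: C x D weight matrices."""
    q = matmul(x, w_q)                     # query data (521)
    k = matmul(x, w_k)                     # key data (522)
    v = matmul(x, w_v)                     # value data (523)
    k_t = [list(col) for col in zip(*k)]   # transpose of the keys
    e = matmul(q, k_t)                     # correlation data E (assumed e_ij = q_i . k_j)
    a = softmax_rows(e)                    # weight data A
    return matmul(a, v), a                 # assumed y_i = sum_j a_ij * v_j

# Hypothetical 2 x 2 patch (K = 4), one channel, 1 x 1 identity weights.
x = [[1.0], [2.0], [3.0], [4.0]]
y, a = patch_self_attention(x, [[1.0]], [[1.0]], [[1.0]])
```

Each row of `a` sums to 1, so every output value is a convex combination of the value data within the patch.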

FIG. 6 is a diagram illustrating a first self-attention module according to an embodiment.

Referring to FIG. 6, the first self-attention module 412 according to an embodiment may include three linear layers 611, 612, and 613 arranged in parallel. The three linear layers 611, 612, and 613 may be fully connected layers.

For example, query data (Q) corresponding to the first input data (x, 510) may be obtained through a multiplication operation between the first input data (x, 510) and the first weight matrix (W_Q1) included in the first linear layer 611. In addition, key data (K) may be obtained through a multiplication operation between the first input data (x, 510) and the second weight matrix (W_K1) included in the second linear layer 612. In addition, value data (V) may be obtained through a multiplication operation between the first input data (x, 510) and the third weight matrix (W_V1) included in the third linear layer 613.

The first self-attention module 412 may obtain first correlation data (E1) through an element-wise multiplication operation between the query data (Q) and data (K^T) obtained by applying a transpose function to the key data (K).

The first self-attention module 412 may obtain second correlation data (E2) by adding a position bias (B) to the first correlation data (E1) and applying a window mask.

Here, the position bias (B) may be determined by the following equation.

Here, B[] denotes the position bias to be applied to the first correlation data, and d(x_i, x_j) may denote the distance between pixel x_i and pixel x_j. In addition, B_train[d(x_i, x_j)] is a value determined through training of the neural network and may be a pre-stored value.

Meanwhile, reflection padding may be applied to the first input data (x, 510) input to the first self-attention module 412. Reflection padding means that, when the size of the first input data (x, 510) is not a multiple of the patch size, for example, when the width W of the first input data (x, 510) is not a multiple of the patch size M or the height H of the first input data (x, 510) is not a multiple of the patch size M, padding is performed so that the size of the first input data becomes a multiple of the patch size. For example, when the size (resolution) of the first input data (x, 510) is 126 x 127 and the patch size is M = 8, the first self-attention module 412 may perform reflection padding so that the size (resolution) of the first input data (x, 510) becomes 128 x 128.
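The padding step can be sketched as follows. The source does not say on which sides the padding is added or whether the border sample is repeated, so this sketch assumes padding on the bottom and right only and mirroring that does not repeat the edge sample. The function names are hypothetical.

```python
def reflect_index(i, n):
    """Map an index beyond the border back into [0, n) by mirroring
    (the edge sample itself is not repeated)."""
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i

def reflection_pad(image, m):
    """Pad a 2-D list on the bottom and right so both sides are multiples of m."""
    h, w = len(image), len(image[0])
    new_h = -(-h // m) * m   # ceil(h / m) * m
    new_w = -(-w // m) * m
    return [[image[reflect_index(r, h)][reflect_index(c, w)]
             for c in range(new_w)] for r in range(new_h)]

# Example sizes from the text: a 126 x 127 input with patch size M = 8.
image = [[r * 127 + c for c in range(127)] for r in range(126)]
padded = reflection_pad(image, 8)   # 128 rows x 128 columns
```

Here the single added column mirrors the second-to-last original column, and the two added rows mirror the two rows just above the last original row.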

In addition, applying the window mask may have the effect of shifting the first input data (x, 510). For example, by applying the window mask, the first self-attention module 412 may shift the positions at which patches are partitioned. By shifting the partitioning positions, the patches can be composed in more varied ways, and as the composition of the patches containing a given pixel becomes more varied, the set of pixels adjacent to that pixel also becomes more varied.

The first self-attention module 412 may obtain weight data (A) by applying a softmax function to the second correlation data (E2).

The first self-attention module 412 may obtain first output data (y) by performing a weighted sum of the weight data (A) and the value data (V).
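The sequence E1 → E2 → A → y described for FIG. 6 can be sketched as below. The source does not specify how the window mask is applied; this sketch assumes the common approach of setting blocked pairs to negative infinity before the softmax, and it takes the position bias as an already filled K x K table. The function name and example values are hypothetical.

```python
import math

def attention_with_bias(q, k, v, bias, mask):
    """q, k, v: K x D lists; bias: K x K position bias B;
    mask: K x K booleans, False where the window mask blocks a pair.
    Every row of mask must allow at least one position."""
    n = len(q)
    out, weights = [], []
    for i in range(n):
        # E1 = Q K^T, then E2 = E1 + B, with blocked positions set to -inf.
        e2 = []
        for j in range(n):
            e1 = sum(a * b for a, b in zip(q[i], k[j]))
            e2.append(e1 + bias[i][j] if mask[i][j] else -math.inf)
        m = max(e2)
        exps = [math.exp(s - m) for s in e2]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        a_row = [s / total for s in exps]      # weight data A (softmax of E2)
        weights.append(a_row)
        # Weighted sum of the value data: y_i = sum_j a_ij * v_j
        out.append([sum(a_row[j] * v[j][d] for j in range(n))
                    for d in range(len(v[0]))])
    return out, weights

q = k = v = [[1.0], [2.0], [3.0]]
bias = [[0.0] * 3 for _ in range(3)]
mask = [[True, True, False], [True, True, True], [False, True, True]]
y, a = attention_with_bias(q, k, v, bias, mask)
```

In this example the first and third pixels are masked from each other, so `a[0][2]` and `a[2][0]` are exactly zero and the remaining weights in each row still sum to 1.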

Referring again to FIG. 4, the first output data (y) output from the first self-attention module 412 may be provided to the first summation layer 413. In the first summation layer 413, an element-wise summation of the first output data (y) output from the first self-attention module 412 and the input data (x2) input to the first transform layer 410 may be performed.

The second normalization layer 414 may normalize the second output data output from the first summation layer 413. The normalized data may be input to the multi-layer perceptron (MLP) module 415. The MLP module 415 is described in detail with reference to FIG. 7.

FIG. 7 is a diagram illustrating a multi-layer perceptron module according to an embodiment.

Referring to FIG. 7, the MLP module 415 according to an embodiment may include a first linear layer 710, a normalization layer 720, and a second linear layer 730.

In the first linear layer 710, third output data may be obtained through a multiplication operation between the data input to the first linear layer 710 and a first weight matrix included in the first linear layer 710. The third output data may be input to the normalization layer 720. The normalization layer 720 may normalize the data input to it. The normalized data may be input to the second linear layer 730.

In the second linear layer 730, fourth output data may be obtained through a multiplication operation between the data input to the second linear layer 730 and a second weight matrix included in the second linear layer 730.

Meanwhile, although FIG. 7 illustrates the MLP module 415 as including two linear layers, the disclosure is not limited thereto, and the MLP module 415 may include three or more linear layers.
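The three stages of FIG. 7 can be sketched as follows. The source does not state which normalization the layer 720 uses, so this sketch assumes a per-row standardization (zero mean, unit variance), and it omits bias terms and activation functions because the text mentions neither. All names and values are hypothetical.

```python
import math

def linear(x, w):
    """Multiply each row of x (tokens x in_features) by w (in_features x out_features)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def normalize(x, eps=1e-5):
    """Assumed normalization: zero mean and unit variance per row."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def mlp_module(x, w1, w2):
    third = linear(x, w1)        # first linear layer (710): third output data
    normed = normalize(third)    # normalization layer (720)
    return linear(normed, w2)    # second linear layer (730): fourth output data

x = [[1.0, 2.0], [3.0, 5.0]]
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]     # 2 -> 3 features
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 -> 2 features
y = mlp_module(x, w1, w2)
```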

Referring again to FIG. 4, the fourth output data obtained in the MLP module 415 may be provided to the second summation layer 416. In the second summation layer 416, an element-wise summation of the fourth output data output from the MLP module 415 and the second output data output from the first summation layer 413 may be performed.
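The data flow through one transform layer, with its two residual summations, can be sketched with the stages passed in as callables. The placeholder stages below only illustrate the wiring and are not the actual normalization, attention, or MLP operations.

```python
def add(a, b):
    """Element-wise sum of two equally shaped 2-D lists."""
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def transform_layer(x, norm1, attention, norm2, mlp):
    """Apply one transform layer; the four stages are supplied as callables."""
    y = attention(norm1(x))        # first normalization, then self-attention (412)
    second = add(y, x)             # first summation layer (413)
    fourth = mlp(norm2(second))    # second normalization (414), then MLP (415)
    return add(fourth, second)     # second summation layer (416)

# Placeholder stages, used only to show the data flow.
identity = lambda t: t
double = lambda t: [[2 * v for v in row] for row in t]
out = transform_layer([[1.0, 2.0]], identity, double, identity, double)
```

With these placeholders the input [1, 2] becomes [3, 6] after the first summation and [9, 18] after the second.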

FIG. 8 is a diagram illustrating a second transform layer according to an embodiment.

The second transform layer 810 illustrated in FIG. 8 may be any one of the second transform layers 371 of FIG. 3. Referring to FIG. 8, the second transform layer 810 according to an embodiment may include a first normalization layer 811, a second self-attention module 812, a first summation layer 813, a second normalization layer 814, a multi-layer perceptron (MLP) module 815, and a second summation layer 816.

The first normalization layer 811 according to an embodiment may normalize the input data (x3) input to the second transform layer 810. For example, the first normalization layer 811 may normalize the input data so that the sum of the input data input to the second transform layer 810 is 1. However, the disclosure is not limited thereto, and the input data may be normalized using various normalization methods. The normalized input data may be input to the second self-attention module 812.

The second self-attention module 812 according to an embodiment may perform a self-attention operation on the third feature data (for example, the normalized input data) input to the second self-attention module 812 to obtain fourth feature data corresponding to second regions each including a second number of pixels. Here, the second number may be larger than the first number described with reference to FIG. 5, each second region is larger than a first region, and each of the second regions may include a plurality of pixels.

The self-attention operation performed in the second self-attention module 812 according to an embodiment is described in detail with reference to FIGS. 9 and 10.

FIG. 9 is a diagram illustrating a self-attention operation performed in a second self-attention module according to an embodiment.

The second self-attention module 812 according to an embodiment may be referred to as an inter multi-head self-attention (inter MSA) module.

Referring to FIG. 9, in the second self-attention module 812, query data (Q), key data (K), and value data (V) may be obtained based on the second input data 910 input to the second self-attention module 812.

For example, the size of the second input data 910 may be W x H, and the number of channels may be C. In FIG. 9, for convenience of description, the second input data 910 is assumed to include one channel (C = 1). When the second input data 910 is divided into regions (patches) each including a predetermined number (for example, M²) of pixels, the number of patches included in one channel may be N.

The second self-attention module 812 according to an embodiment may obtain feature information corresponding to each of the patches (P₁, P₂, ..., P_N).

For example, query data 921 respectively corresponding to the patches (920, P₁, P₂, ..., P_N) included in the second input data 910 may be obtained through a multiplication operation between the patches (920, P₁, P₂, ..., P_N) and a first weight matrix (W_Q2).

In addition, key data 922 respectively corresponding to the patches (920, P₁, P₂, ..., P_N) included in the second input data 910 may be obtained through a multiplication operation between the patches (920, P₁, P₂, ..., P_N) and a second weight matrix (W_K2).

In addition, value data 923 respectively corresponding to the patches (920, P₁, P₂, ..., P_N) included in the second input data 910 may be obtained through a multiplication operation between the patches (920, P₁, P₂, ..., P_N) and a third weight matrix (W_V2).

Correlation data (E, 930) may be obtained through an element-wise multiplication operation between the key data 922 and the query data 921. Because this has been described with Equation 1 in connection with FIG. 5, the same description is omitted.

Weight data (A, 940) may be obtained by applying a softmax function to the correlation data (E, 930). Because this has been described with Equation 2 in connection with FIG. 5, the same description is omitted.

The second self-attention module 812 may obtain second output data 950 by performing a weighted sum of the weight data (A, 940) and the value data 923. Because this has been described with Equation 3 in connection with FIG. 5, the same description is omitted.

FIG. 10 is a diagram illustrating a second self-attention module according to an embodiment.

Referring to FIG. 10, the second self-attention module 812 according to an embodiment may include a first reshape layer 1010. The first reshape layer 1010 may obtain third input data by reshaping the second input data (910, x) so that the pixels included in the second input data are grouped with the other pixels belonging to the same patch. In this case, the third input data may be in a form divided into the patches 920 of FIG. 9.

The second self-attention module 812 according to an embodiment may include three linear layers 1021, 1022, and 1023 arranged in parallel. The three linear layers 1021, 1022, and 1023 may be fully connected layers.

For example, query data (Q') corresponding to the third input data may be obtained through a multiplication operation between the third input data and the first weight matrix (W_Q2) included in the first linear layer 1021. In addition, key data (K') may be obtained through a multiplication operation between the third input data and the second weight matrix (W_K2) included in the second linear layer 1022. In addition, value data (V') may be obtained through a multiplication operation between the third input data and the third weight matrix (W_V2) included in the third linear layer 1023.

The second self-attention module 812 may obtain first correlation data (E1') through an element-wise multiplication operation between the query data (Q') and data (K'^T) obtained by applying a transpose function to the key data (K').

The second self-attention module 812 may obtain second correlation data (E2') by adding a position bias (B') to the first correlation data (E1') and applying a window mask.

Here, the position bias (B') may be determined by Equation 5 below.

Here, B'[] denotes the position bias to be applied to the first correlation data, and d(P_i, P_j) may denote the distance between patch P_i and patch P_j. In addition, B'_train[d(P_i, P_j)] is a value determined through training of the neural network and may be a pre-stored value.

In addition, applying the window mask may have the effect of shifting the second input data (x, 910). For example, by applying the window mask, the second self-attention module 812 may shift the positions at which patches are partitioned. By shifting the partitioning positions, the patches can be composed in various ways. Accordingly, the composition of the patches adjacent to a given patch may also be diversified.

The second self-attention module 812 may obtain weight data (A') by applying a softmax function to the second correlation data (E2').

The second self-attention module 812 may obtain second output data (y') by performing a weighted sum of the weight data (A') and the value data (V').

In this case, the second output data (y') may be in a form divided into patches, and the second output data (y') may be reshaped into third output data (y) in the second reshape layer 1030. The third output data (y) may be in a form divided into pixels rather than into patches.
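The two reshape steps of FIG. 10 (grouping pixels by patch, then restoring the pixel layout) can be sketched for a single channel as follows. The sketch assumes non-overlapping M x M patches enumerated in row-major order, which the source does not specify. The function names are hypothetical.

```python
def to_patches(x, m):
    """Reshape an H x W list into N x (M*M), one row per M x M patch."""
    h, w = len(x), len(x[0])
    assert h % m == 0 and w % m == 0, "pad first so H and W are multiples of M"
    return [[x[r0 + dr][c0 + dc] for dr in range(m) for dc in range(m)]
            for r0 in range(0, h, m) for c0 in range(0, w, m)]

def from_patches(patches, h, w, m):
    """Inverse of to_patches: restore the H x W pixel layout."""
    out = [[None] * w for _ in range(h)]
    per_row = w // m
    for n, patch in enumerate(patches):
        r0, c0 = (n // per_row) * m, (n % per_row) * m
        for idx, value in enumerate(patch):
            out[r0 + idx // m][c0 + idx % m] = value
    return out

x = [[r * 4 + c for c in range(4)] for r in range(4)]   # 4 x 4 input, values 0..15
patches = to_patches(x, 2)                               # N = 4 patches of M*M = 4 pixels
restored = from_patches(patches, 4, 4, 2)
```

For this 4 x 4 input, the number of patches is N = (H / M) x (W / M) = 4, and `restored` equals `x`, so no information is lost by the regrouping.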

Referring again to FIG. 8, the third output data (y) output from the second self-attention module 812 may be provided to the first summation layer 813. In the first summation layer 813, an element-wise summation of the third output data (y) output from the second self-attention module 812 and the input data (x3) input to the second transform layer 810 may be performed.

The second normalization layer 814 may normalize the fourth output data output from the first summation layer 813. The normalized data may be input to the MLP module 815. Because the MLP module 815 has been described in detail with reference to FIG. 7, the same description is omitted.

The fifth output data obtained in the MLP module 815 may be provided to the second summation layer 816. In the second summation layer 816, an element-wise summation of the fifth output data output from the MLP module 815 and the fourth output data output from the first summation layer 813 may be performed.

FIG. 11 is a diagram illustrating a second self-attention module according to an embodiment.

Referring to FIG. 11, the second self-attention module 812 according to an embodiment may include three linear layers 1111, 1112, and 1113 arranged in parallel. The three linear layers 1111, 1112, and 1113 may be fully connected layers.

For example, first query data (Q1') corresponding to the second input data may be obtained through a multiplication operation between the second input data (910, x) and the first weight matrix (W_Q3) included in the first linear layer 1111. In addition, first key data (K1') may be obtained through a multiplication operation between the second input data and the second weight matrix (W_K3) included in the second linear layer 1112. In addition, first value data (V1') may be obtained through a multiplication operation between the second input data and the third weight matrix (W_V3) included in the third linear layer 1113.

The second self-attention module 812 according to an embodiment may include first to third reshape layers 1121, 1122, and 1123. For example, the first reshape layer 1121 is connected to the first linear layer 1111 and may reshape the first query data (Q1') output from the first linear layer 1111 to obtain second query data (Q'). For example, the second query data (Q') may be in a form in which the query values respectively corresponding to the pixels in the first query data (Q1') are grouped so that query values corresponding to pixels of the same patch are together. That is, the second query data (Q') may be in a form divided into query data respectively corresponding to the patches.

In addition, the second reshape layer 1122 is connected to the second linear layer 1112 and may reshape the first key data (K1') output from the second linear layer 1112 to obtain second key data (K'). For example, the second key data (K') may be in a form in which the key values respectively corresponding to the pixels in the first key data (K1') are grouped so that key values corresponding to pixels of the same patch are together. That is, the second key data (K') may be in a form divided into key data respectively corresponding to the patches.

In addition, the third reshape layer 1123 is connected to the third linear layer 1113 and may reshape the first value data (V1') output from the third linear layer 1113 to obtain second value data (V'). For example, the second value data (V') may be in a form in which the values respectively corresponding to the pixels in the first value data (V1') are grouped so that values corresponding to pixels of the same patch are together. That is, the second value data (V') may be in a form divided into value data respectively corresponding to the patches.

The second self-attention module 812 may obtain first correlation data (E1') through an element-wise multiplication operation between the second query data (Q') and data (K'^T) obtained by applying a transpose function to the second key data (K').

In this case, the second output data (y') may be in a form divided into patches, and the second output data (y') may be reshaped into third output data (y) in the fourth reshape layer 1130. The third output data (y) may be in a form divided into pixels rather than into patches.

In the case of the second self-attention module 812 of FIG. 10, the size of the second input data (x) may be H x W x C, and the size of the reshaped third input data input to the first to third linear layers 1021, 1022, and 1023 may be N x M² x C. Here, N is the number of patches, M² is the size of one patch (the number of pixels it contains), C is the number of channels, and H x W is equal to N x M². Accordingly, the number of parameters of the weight matrix included in each of the first to third linear layers 1021, 1022, and 1023 may be M²C x M²C.

In contrast, in the case of the second self-attention module 812 of FIG. 11, the size of the second input data input to the first to third linear layers 1111, 1112, and 1113 is H x W x C, and accordingly, the number of parameters of the weight matrix included in each of the first to third linear layers 1111, 1112, and 1113 may be C x C.

Therefore, processing the second input data in the linear layers first and then reshaping the results so that they are grouped into data corresponding to patches, as in FIG. 11, can reduce both the amount of computation and the number of parameters compared with first reshaping the second input data into patches and then processing it in the linear layers, as in FIG. 10. Accordingly, the second self-attention module of FIG. 11 can perform image processing with less computation and fewer parameters while maintaining the performance of the image processing (inter self-attention).
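The difference in parameter counts can be checked with simple arithmetic. The sketch below counts only weight-matrix entries (no bias terms, as an assumption) for the three query, key, and value layers; the values M = 8 and C = 64 are illustrative and are not taken from the source.

```python
def linear_params(in_features, out_features):
    """Number of weights in a bias-free fully connected layer."""
    return in_features * out_features

def qkv_params_reshape_first(m, c):
    """FIG. 10 ordering: each of the three layers maps M^2*C -> M^2*C."""
    return 3 * linear_params(m * m * c, m * m * c)

def qkv_params_linear_first(c):
    """FIG. 11 ordering: each of the three layers maps C -> C."""
    return 3 * linear_params(c, c)

m, c = 8, 64   # illustrative values, not taken from the source
ratio = qkv_params_reshape_first(m, c) // qkv_params_linear_first(c)
```

Under these assumptions the ratio between the two arrangements is M⁴, independent of C, which is 4096 for M = 8.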

FIG. 12 is a diagram illustrating a transform block according to an embodiment.

The transform block 1200 illustrated in FIG. 12 may be any one of the transform blocks 310 of FIG. 2.

Referring to FIG. 12, the transform block 1200 according to an embodiment may include a first residual transform block 1210, a second residual transform block 1220, a connection layer 1230, a normalization layer 1240, a linear layer 1250, and a summation layer 1260.

The transform block 1200 according to an embodiment may include a structure in which the first residual transform block 1210 and the second residual transform block 1220 are connected in parallel.

The first residual transform block 1210 according to an embodiment may be referred to as an intra residual transformer block (aRTB), and the second residual transform block 1220 may be referred to as an inter residual transformer block (eRTB). The first residual transform block 1210 corresponds to the first residual transform block 350 of FIG. 3, and the second residual transform block 1220 may correspond to the second residual transform block 370 of FIG. 3. Accordingly, the descriptions given with reference to FIGS. 3 to 11 of the first residual transform block 350, the second residual transform block 370, the first transform layer 410 included in the first residual transform block 350, and the second transform layer 810 included in the second residual transform block 370 apply equally to the first residual transform block 1210 and the second residual transform block 1220 of FIG. 12, and the same description is omitted.

The first input data (X1) input to the transform block 1200 according to an embodiment may be input to both the first residual transform block 1210 and the second residual transform block 1220. The first output data output from the first residual transform block 1210 and the second output data output from the second residual transform block 1220 may be concatenated in the connection layer 1230. For example, the connection layer 1230 may output, to the normalization layer 1240, third output data obtained by concatenating the first output data and the second output data in the channel direction. The third output data may be normalized in the normalization layer 1240, and the normalized fourth output data may be input to the linear layer 1250.

리니어 레이어(1250)에서는 리니어 레이어(1250)에 입력된 제4 출력 데이터와 리니어 레이어(1250)에 포함된 가중치 행렬과의 곱셈 연산을 통해 제5 출력 데이터가 획득될 수 있다. 제5 출력 데이터는 합산 레이어(1260)로 입력될 수 있다. 또한, 변환 블록(1200)으로 입력된 제1 입력 데이터(X1)는 합산 레이어(1260)로 입력될 수 있다.In the linear layer 1250, the fifth output data may be obtained through a multiplication operation between the fourth output data input to the linear layer 1250 and the weight matrix included in the linear layer 1250. The fifth output data may be input to the summation layer 1260. Additionally, the first input data (X1) input to the transform block 1200 may be input to the summing layer 1260.

합산 레이어(1260)에서는 합산 레이어(1260)에 입력된 제5 출력 데이터와 제1 입력 데이터의 요소별 합산 연산이 수행될 수 있다.In the summation layer 1260, a summation operation for each element of the fifth output data and the first input data input to the summation layer 1260 may be performed.
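The data flow through one module of the transform block 1200 can be illustrated with a minimal pure-Python sketch. It is not the implementation of the embodiment: the function names (`matmul`, `layer_norm`, `transform_module`) are hypothetical, the two residual transform blocks are passed in as arbitrary callables, the normalization is assumed to be a per-row layer normalization without learned parameters, and the weight matrix is assumed to map 2C channels back to C channels so that the element-wise summation with X1 is well defined.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (assumed normalization)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def transform_module(x1, intra_block, inter_block, weight):
    """Parallel branches -> channel concatenation -> normalization -> linear -> residual sum."""
    out1 = intra_block(x1)                           # first output data (block 1210)
    out2 = inter_block(x1)                           # second output data (block 1220)
    out3 = [r1 + r2 for r1, r2 in zip(out1, out2)]   # connection layer 1230: concatenate channels
    out4 = layer_norm(out3)                          # normalization layer 1240
    out5 = matmul(out4, weight)                      # linear layer 1250, assumed shape (2C x C)
    # summation layer 1260: element-wise sum with the module input X1
    return [[a + b for a, b in zip(r5, rx)] for r5, rx in zip(out5, x1)]
```

Because the module input is added back at the end, a weight matrix of all zeros makes the module return X1 unchanged, which is a convenient sanity check for the residual path.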

The transform block 1200 according to an embodiment may include a structure in which a module 1201, which includes the first residual transform block 1210, the second residual transform block 1220, the connection layer 1230, the normalization layer 1240, the linear layer 1250, and the summation layer 1260, is repeatedly arranged in series. However, the disclosure is not limited thereto.

FIG. 13 is a diagram illustrating a transform block according to an embodiment.

The transform block 1300 shown in FIG. 13 may be any one of the transform blocks 310 of FIG. 2.

Referring to FIG. 13, the transform block 1300 according to an embodiment may include a first residual transform block 1310, a second residual transform block 1320, a first normalization layer 1321, a second normalization layer 1322, a first linear layer 1331, a second linear layer 1332, a first attention layer 1341, a second attention layer 1342, a third linear layer 1351, a fourth linear layer 1352, a first summation layer 1361, a second summation layer 1362, a connection layer 1370, a third normalization layer 1380, a fifth linear layer 1390, and a third summation layer 1395.

The transform block 1300 according to an embodiment may include a structure in which the first residual transform block 1310 and the second residual transform block 1320 are connected in parallel.

The first residual transform block 1310 according to an embodiment may be referred to as an intra Residual Transformer Block (aRTB), and the second residual transform block 1320 may be referred to as an inter Residual Transformer Block (eRTB). The first residual transform block 1310 may correspond to the first residual transform block 350 of FIG. 3, and the second residual transform block 1320 may correspond to the second residual transform block 370 of FIG. 3. Accordingly, the descriptions given with reference to FIGS. 3 to 11 of the first residual transform block 350, the second residual transform block 370, the first transform layer 410 included in the first residual transform block 350, and the second transform layer 810 included in the second residual transform block 370 apply equally to the first residual transform block 1310 and the second residual transform block 1320 of FIG. 13, and redundant descriptions are omitted.

The first input data X1 input to the transform block 1300 according to an embodiment may be input to both the first residual transform block 1310 and the second residual transform block 1320. First output data output from the first residual transform block 1310 may be normalized in the first normalization layer 1321, and the normalized second output data may be input to the first linear layer 1331.

In the first linear layer 1331, third output data may be obtained by multiplying the second output data (that is, the normalized first output data) input to the first linear layer 1331 by a first weight matrix included in the first linear layer 1331.

The third output data may be used as an attention map applied to fourth output data output from the second residual transform block 1320. For example, the third output data and the fourth output data may be input to the second attention layer 1342, in which an element-wise multiplication of the third output data and the fourth output data may be performed.

In addition, the fourth output data output from the second residual transform block 1320 may be normalized in the second normalization layer 1322, and the normalized fifth output data may be input to the second linear layer 1332.

In the second linear layer 1332, sixth output data may be obtained by multiplying the fifth output data input to the second linear layer 1332 by a second weight matrix included in the second linear layer 1332.

The sixth output data may be used as an attention map applied to the first output data output from the first residual transform block 1310. For example, the first output data and the sixth output data may be input to the first attention layer 1341, in which an element-wise multiplication of the first output data and the sixth output data may be performed.

Seventh output data output from the first attention layer 1341 may be input to the third linear layer 1351.

In the third linear layer 1351, eighth output data may be obtained by multiplying the seventh output data input to the third linear layer 1351 by a third weight matrix included in the third linear layer 1351. The eighth output data may be input to the first summation layer 1361. The first output data output from the first residual transform block 1310 may also be input to the first summation layer 1361. In the first summation layer 1361, an element-wise summation of the eighth output data and the first output data may be performed.

In addition, ninth output data output from the second attention layer 1342 may be input to the fourth linear layer 1352.

In the fourth linear layer 1352, tenth output data may be obtained by multiplying the ninth output data input to the fourth linear layer 1352 by a fourth weight matrix included in the fourth linear layer 1352. The tenth output data may be input to the second summation layer 1362. The fourth output data output from the second residual transform block 1320 may also be input to the second summation layer 1362. In the second summation layer 1362, an element-wise summation of the tenth output data and the fourth output data may be performed.

Eleventh output data output from the first summation layer 1361 may be input to the connection layer 1370.

Twelfth output data output from the second summation layer 1362 may also be input to the connection layer 1370. The connection layer 1370 may concatenate the eleventh output data and the twelfth output data in the channel direction and output the resulting thirteenth output data to the third normalization layer 1380. The thirteenth output data may be normalized in the third normalization layer 1380, and the normalized fourteenth output data may be input to the fifth linear layer 1390.

In the fifth linear layer 1390, fifteenth output data may be obtained by multiplying the fourteenth output data input to the fifth linear layer 1390 by a fifth weight matrix included in the fifth linear layer 1390. The fifteenth output data may be input to the third summation layer 1395. The first input data X1 input to the transform block 1300 may also be input to the third summation layer 1395.

In the third summation layer 1395, sixteenth output data may be obtained by performing an element-wise summation of the fifteenth output data and the first input data X1 input to the third summation layer 1395.
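The cross-branch attention of FIG. 13, in which each branch is gated by an attention map derived from the other branch, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the embodiment's implementation: the function names are hypothetical, the residual transform blocks are passed in as callables, the normalization layers are assumed to be parameter-free per-row layer normalization, the first to fourth weight matrices are assumed to be C x C, and the fifth is assumed to be 2C x C. The variable names follow the numbering of the output data in the text.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m); both are lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (assumed normalization)."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def mul(a, b):
    """Element-wise product of two equally shaped matrices."""
    return [[p * q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

def cross_gated_module(x1, intra_block, inter_block, w1, w2, w3, w4, w5):
    out1 = intra_block(x1)                     # first output data (block 1310)
    out4 = inter_block(x1)                     # fourth output data (block 1320)
    out3 = matmul(layer_norm(out1), w1)        # layers 1321 and 1331: map from the intra branch
    out6 = matmul(layer_norm(out4), w2)        # layers 1322 and 1332: map from the inter branch
    out7 = mul(out1, out6)                     # first attention layer 1341
    out9 = mul(out4, out3)                     # second attention layer 1342
    out11 = add(matmul(out7, w3), out1)        # linear 1351, then summation 1361
    out12 = add(matmul(out9, w4), out4)        # linear 1352, then summation 1362
    out13 = [r1 + r2 for r1, r2 in zip(out11, out12)]  # connection layer 1370 (channel concat)
    out15 = matmul(layer_norm(out13), w5)      # layers 1380 and 1390
    return add(out15, x1)                      # third summation layer 1395: sixteenth output data
```

Note that the attention map for each branch is computed from the normalized output of the opposite branch, so local (intra) features modulate non-local (inter) features and vice versa before the two are concatenated.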

The transform block 1300 according to an embodiment may include a structure in which a module 1301, which includes the first residual transform block 1310, the second residual transform block 1320, the first normalization layer 1321, the second normalization layer 1322, the first linear layer 1331, the second linear layer 1332, the first attention layer 1341, the second attention layer 1342, the third linear layer 1351, the fourth linear layer 1352, the first summation layer 1361, the second summation layer 1362, the connection layer 1370, the third normalization layer 1380, the fifth linear layer 1390, and the third summation layer 1395, is repeatedly arranged in series. However, the disclosure is not limited thereto.

FIG. 14 is a flowchart illustrating a method of operating an image processing device according to an embodiment.

The image processing device 100 according to an embodiment may obtain first feature data based on a first image by using one or more neural networks (S1410).

For example, the image processing device 100 may extract the first feature data corresponding to the first image by using one or more convolutional neural networks.

The image processing device 100 according to an embodiment may obtain second feature data by performing first image processing on the first feature data (S1420).

For example, the image processing device 100 may obtain, from the first feature data, second feature data corresponding to first areas each including a first number of pixels. The image processing device 100 may obtain the second feature data respectively corresponding to the first areas, based on information about the areas surrounding each of the first areas.

The image processing device 100 may obtain the second feature data by performing self-attention on the first feature data.

The image processing device 100 according to an embodiment may obtain the second feature data corresponding to the first feature data by using the first self-attention module 412 shown and described with reference to FIGS. 5 and 6. The operations performed in the first self-attention module 412 have been described in detail with reference to FIGS. 5 and 6, so redundant descriptions are omitted.

The image processing device 100 according to an embodiment may obtain third feature data based on the first image (S1430).

For example, the image processing device 100 may extract the third feature data corresponding to the first image by using one or more convolutional neural networks. Alternatively, the image processing device 100 may obtain the third feature data based on the second feature data obtained in step S1420. However, the disclosure is not limited thereto.

The image processing device 100 according to an embodiment may obtain fourth feature data by performing second image processing on the third feature data (S1440).

For example, the image processing device 100 may obtain, from the third feature data, fourth feature data corresponding to second areas each including a second number of pixels, the second number being greater than the first number. The image processing device 100 may obtain the fourth feature data respectively corresponding to the second areas, based on information about the areas surrounding each of the second areas.

The image processing device 100 may obtain the fourth feature data by performing self-attention on the third feature data.

The image processing device 100 according to an embodiment may obtain the fourth feature data corresponding to the third feature data by using the second self-attention module 812 shown and described with reference to FIGS. 9 to 11. The operations performed in the second self-attention module 812 have been described in detail with reference to FIGS. 9 to 11, so redundant descriptions are omitted.

The image processing device 100 according to an embodiment may generate a second image based on the second feature data and the fourth feature data (S1450).

For example, the second feature data may be feature data based on information about neighboring pixels adjacent to each of the pixels included in the first image, and the fourth feature data may be feature data based on information about neighboring areas (patches) adjacent to each of the predetermined unit areas (patches) included in the first image. Accordingly, the image processing device 100 according to an embodiment may generate the second image by using both the information about adjacent neighboring pixels (local information) and the information about adjacent neighboring areas (non-local information).

Specifically, the image processing device 100 according to an embodiment may extract feature data by using the second feature extraction network 300, and may obtain the second image by inputting the extracted feature data to the image restoration network 400. Here, the second feature extraction network 300 may include a module for obtaining the second feature data (for example, the first self-attention module) and a module for obtaining the fourth feature data (for example, the second self-attention module).

The second image according to an embodiment may be an image with a higher resolution than the first image, or may be an image whose quality is improved over the first image because artifacts, noise, and the like have been removed from the first image.

FIG. 15 is a block diagram illustrating the configuration of an image processing device according to an embodiment.

The image processing device 100 of FIG. 15 may be a device that performs image processing by using the image processing network 103. The image processing network 103 according to an embodiment may include one or more neural networks. For example, the image processing network 103 may include the first feature extraction network 200, the second feature extraction network 300, and the image restoration network 400. However, the disclosure is not limited thereto.

Referring to FIG. 15, the image processing device 100 according to an embodiment may include a processor 110, a memory 120, and a display 130.

The processor 110 according to an embodiment may control the overall operation of the image processing device 100. The processor 110 according to an embodiment may execute one or more programs stored in the memory 120.

The memory 120 according to an embodiment may store various data, programs, or applications for driving and controlling the image processing device 100. A program stored in the memory 120 may include one or more instructions. A program (one or more instructions) or an application stored in the memory 120 may be executed by the processor 110.

The processor 110 according to an embodiment may include at least one of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), and a Video Processing Unit (VPU). Alternatively, depending on the embodiment, the processor 110 may be implemented as a System on Chip (SoC) integrating at least one of the CPU, the GPU, and the VPU. Alternatively, the processor 110 may further include a Neural Processing Unit (NPU).

The processor 110 according to an embodiment may generate a second image by processing a first image using one or more neural networks. For example, the processor 110 may use the image processing network 103 to generate a second image by denoising the first image, that is, by removing noise from the first image while preserving fine edges and textures. Alternatively, the processor 110 may use the image processing network 103 to generate a second image having a higher resolution than the first image.

The processor 110 according to an embodiment may obtain first feature data of the first image by using the first feature extraction network 200.

The processor 110 according to an embodiment may obtain second feature data (deep features) of the first image by using the second feature extraction network 300 of FIG. 2. The structure and operation of the second feature extraction network 300 of FIG. 2 have been described in detail with reference to FIGS. 3 to 14, so a detailed description is omitted.

In particular, the processor 110 according to an embodiment may use the first self-attention module 412 included in the second feature extraction network 300 to obtain feature data based on information about neighboring pixels adjacent to each of the pixels included in the first image. The structure and operation of the first self-attention module 412 have been described in detail with reference to FIGS. 5 and 6, so a detailed description is omitted.

In addition, the processor 110 according to an embodiment may use the second self-attention module 812 included in the second feature extraction network 300 to obtain feature data based on information about neighboring areas (patches) adjacent to each of the predetermined unit areas (patches) included in the first image. The structure and operation of the second self-attention module 812 have been described in detail with reference to FIGS. 9 to 11, so a detailed description is omitted.

In addition, the processor 110 according to an embodiment may obtain the second image by inputting the data extracted by the second feature extraction network 300 to the image restoration network 400.

Meanwhile, the image processing network 103 according to an embodiment may be a network trained by a server or an external device. The server or the external device may train the image processing network 103 based on training data. Here, the training data may include a plurality of data sets including image data containing noise and image data in which the noise is removed while edge characteristics or texture characteristics are preserved.

The server or the external device may determine the parameter values included in the kernels used in each of the plurality of convolution layers included in the image processing network 103 and the parameter values included in the weight matrices used in each of the linear layers. For example, the server or the external device may determine the parameter values so as to minimize the difference (loss information) between the image data generated by the image processing network 103 and the image data (training data) in which noise is removed while edge characteristics are preserved.
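The idea of choosing parameter values that minimize the difference between the network output and the clean target can be shown with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "network" is reduced to a single scalar weight, the loss is taken to be the mean squared error, and plain gradient descent is used as the optimizer. The text does not specify the loss function or the optimization method, and the names `mse` and `train_scale` are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between predicted and target pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def train_scale(noisy, clean, lr=0.1, steps=200):
    """Fit one weight w so that w * noisy approximates clean (toy stand-in for training)."""
    w = 0.0
    for _ in range(steps):
        # Gradient of mean((w*x - t)^2) with respect to w is mean(2 * (w*x - t) * x).
        grad = sum(2 * (w * x - t) * x for x, t in zip(noisy, clean)) / len(noisy)
        w -= lr * grad  # step in the direction that reduces the loss
    return w

# Toy training pair: the clean target is exactly half of the degraded input.
noisy_pixels = [1.0, 2.0, 3.0]
clean_pixels = [0.5, 1.0, 1.5]
w = train_scale(noisy_pixels, clean_pixels)
final_loss = mse([w * x for x in noisy_pixels], clean_pixels)
```

In a real network the same loop would update every convolution kernel and linear-layer weight matrix at once, with the gradients obtained by backpropagation.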

The image processing device 100 according to an embodiment may receive the trained image processing network 103 from the server or the external device and store it in the memory 120. For example, the memory 120 may store the structure and parameter values of the image processing network 103 according to an embodiment, and the processor 110 may use the parameter values stored in the memory 120 to generate, from the first image, a second image in which noise is removed while edge characteristics are preserved.

The display 130 according to an embodiment generates a driving signal by converting an image signal, a data signal, an OSD signal, a control signal, and the like processed by the processor 110. The display 130 may be implemented as a PDP, an LCD, an OLED display, a flexible display, or the like, and may also be implemented as a three-dimensional (3D) display. The display 130 may also be configured as a touch screen and used as an input device in addition to an output device.

The display 130 according to an embodiment may display the second image that has been processed by using the image processing network 103.

Meanwhile, the block diagram of the image processing device 100 shown in FIG. 15 is a block diagram for one embodiment. Components of the block diagram may be integrated, added, or omitted depending on the specifications of the image processing device 100 as actually implemented. That is, two or more components may be combined into one component, or one component may be subdivided into two or more components, as needed. In addition, the functions performed by each block are intended to describe the embodiments, and the specific operations or devices do not limit the scope of the present invention.

An image processing device according to an embodiment may include a memory storing one or more instructions and at least one processor executing the one or more instructions stored in the memory.

The at least one processor may, by executing the one or more instructions, obtain first feature data based on a first image.

The at least one processor may, by executing the one or more instructions, perform first image processing on the first feature data to obtain second feature data corresponding to first areas each including a first number of pixels.

The at least one processor may, by executing the one or more instructions, obtain third feature data based on the first image.

The at least one processor may, by executing the one or more instructions, perform second image processing on the third feature data to obtain fourth feature data corresponding to second areas each including a second number of pixels, the second number being greater than the first number.

The at least one processor may, by executing the one or more instructions, generate a second image based on the second feature data and the fourth feature data.

The first image processing and the second image processing may include self-attention.

The at least one processor may, by executing the one or more instructions, obtain the second feature data respectively corresponding to the first areas, based on information about the areas surrounding each of the first areas.

The at least one processor may, by executing the one or more instructions, obtain the fourth feature data respectively corresponding to the second areas, based on information about the areas surrounding each of the second areas.

The first number may be 1, and each of the first areas may include one pixel.

The image processing device according to an embodiment may generate the second image by processing the first image based on both information about neighboring pixels adjacent to each of the pixels included in the first image and information about neighboring areas adjacent to each of the predetermined unit areas included in the first image. Accordingly, the performance of the image processing according to an embodiment may be improved over existing image processing techniques. For example, the degree of image quality improvement or noise removal in the generated second image may be greater than in an image processed by existing image processing techniques.

The at least one processor may, by executing the one or more instructions, obtain query data, key data, and value data respectively corresponding to the first areas included in the first feature data.

The at least one processor may, by executing the one or more instructions, obtain a weight matrix based on the query data and the key data.

The at least one processor may, by executing the one or more instructions, obtain the second feature data based on the value data and the weight matrix.
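The three operations above (query, key, and value data; a weight matrix; the second feature data) can be illustrated with a minimal sketch. It is not the implementation of the embodiment: the linear projections `w_q`, `w_k`, `w_v`, the scaling by the square root of the key dimension, and the softmax normalization are assumptions borrowed from common self-attention practice, and for brevity every area attends to all areas rather than only to its surrounding areas.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def self_attention(features, w_q, w_k, w_v):
    """features: one feature vector per first area (hypothetical layout).

    Returns (outputs, weights): one output vector per area and the
    weight matrix derived from the queries and keys.
    """
    queries = [matvec(w_q, f) for f in features]
    keys = [matvec(w_k, f) for f in features]
    values = [matvec(w_v, f) for f in features]

    # Assumed scaling; the source only says the weights depend on Q and K.
    scale = math.sqrt(len(keys[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]

    # Each output is the weighted sum of the value vectors.
    dim = len(values[0])
    outputs = [
        [sum(w[j] * values[j][c] for j in range(len(values))) for c in range(dim)]
        for w in weights
    ]
    return outputs, weights
```

With identity projections, each area weights itself most heavily, which is a quick way to check that the weight matrix behaves as expected.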

The at least one processor may, by executing the one or more instructions stored in the memory (120), obtain a correlation matrix based on the query data and the key data.

The at least one processor may, by executing the one or more instructions, obtain the weight matrix by applying, to the correlation matrix, a position bias that is based on the size of the first image and the size of the images used to train the one or more neural networks.
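One way to picture a position bias that depends on both the input size and the training size is sketched below. This passage does not say how the bias is computed, so everything here is a hypothetical illustration: `bias_table` stands for a one-dimensional table of relative-position biases learned at the training size, the rescaling of offsets to the table range is an assumed rule, and adding the bias before a softmax is an assumed way of "applying" it.

```python
import math


def softmax(row):
    """Normalize a list of scores into weights that sum to 1."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def position_bias(n_input, bias_table):
    """Build an n_input x n_input bias matrix from a table learned at
    training size n_train, where len(bias_table) == 2 * n_train - 1 and
    entry k holds the bias for relative offset k - (n_train - 1).

    Offsets in the input are rescaled to the training range (assumed rule).
    """
    n_train = (len(bias_table) + 1) // 2
    scale = (n_train - 1) / (n_input - 1) if n_input > 1 else 0.0
    bias = []
    for i in range(n_input):
        row = []
        for j in range(n_input):
            offset = round((j - i) * scale)
            row.append(bias_table[offset + n_train - 1])
        bias.append(row)
    return bias


def weights_from_correlation(correlation, bias):
    """Add the bias to the correlation matrix, then normalize each row."""
    return [
        softmax([c + b for c, b in zip(c_row, b_row)])
        for c_row, b_row in zip(correlation, bias)
    ]
```

In this sketch, an input larger than the training size reuses the same table, so no new parameters are needed when the first image is bigger than the training images.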

The at least one processor may, by executing the one or more instructions, convert the third feature data, which is divided into third areas including the first number of pixels, so that it is divided into the second areas.

The at least one processor may, by executing the one or more instructions, obtain the fourth feature data by performing the second image processing on each of the second areas.

The at least one processor may, by executing the one or more instructions, obtain first query data, first key data, and first value data respectively corresponding to third areas that are included in the third feature data and that include the first number of pixels.

The at least one processor may, by executing the one or more instructions, obtain second query data, second key data, and second value data corresponding to the second areas by grouping each of the first query data, the first key data, and the first value data so as to correspond to each of the second areas.

The at least one processor may, by executing the one or more instructions, obtain a weight matrix based on the second query data and the second key data.

The at least one processor may, by executing the one or more instructions, obtain the fourth feature data based on the second value data and the weight matrix.
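The grouping step above can be sketched as follows, assuming the first number is one (each third area is a single pixel) and each second area is a square block of `p` by `p` pixels. The function name `group_into_areas`, the grid layout, and the choice to concatenate the per-pixel vectors are all assumptions made for illustration; the text only states that the data are grouped to correspond to the second areas. The same function would be applied separately to the first query, key, and value data.

```python
def group_into_areas(grid, p):
    """Group per-pixel vectors into non-overlapping p x p areas.

    grid: H x W nested list where grid[y][x] is the vector for one pixel.
    Returns an (H // p) x (W // p) nested list in which each entry is the
    concatenation of the p * p pixel vectors in that area (row-major order).
    """
    height, width = len(grid), len(grid[0])
    if height % p or width % p:
        raise ValueError("grid size must be a multiple of the area size")

    grouped = []
    for top in range(0, height, p):
        area_row = []
        for left in range(0, width, p):
            vector = []
            for dy in range(p):
                for dx in range(p):
                    vector.extend(grid[top + dy][left + dx])
            area_row.append(vector)
        grouped.append(area_row)
    return grouped
```

After grouping, each second area is represented by one longer vector, so attention computed on the grouped query and key data relates whole areas to each other instead of individual pixels.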

The third feature data according to an embodiment may be obtained from the second feature data.

The one or more neural networks according to an embodiment may include one or more convolutional neural networks.

The at least one processor may, by executing the one or more instructions, extract the first feature data from the first image using the one or more convolutional neural networks.

The at least one processor may, by executing the one or more instructions, obtain fifth feature data based on the second feature data and the fourth feature data.

The at least one processor may, by executing the one or more instructions, obtain the second image from the fifth feature data using the one or more convolutional neural networks.

The first image processing and the second image processing according to an embodiment may include self-attention.

The obtaining of the second feature data may include obtaining the second feature data respectively corresponding to the first areas, based on information about the areas surrounding each of the first areas.

The obtaining of the fourth feature data may include obtaining the fourth feature data respectively corresponding to the second areas, based on information about the areas surrounding each of the second areas.

The obtaining of the second feature data may include obtaining query data, key data, and value data respectively corresponding to the first areas included in the first feature data.

The obtaining of the second feature data may include obtaining a weight matrix based on the query data and the key data.

The obtaining of the second feature data may include obtaining the second feature data based on the value data and the weight matrix.

The obtaining of the weight matrix based on the query data and the key data may include obtaining a correlation matrix based on the query data and the key data, and obtaining the weight matrix by applying, to the correlation matrix, a position bias that is based on the size of the first image and the size of the images used to train the one or more neural networks.

The obtaining of the fourth feature data may include converting the third feature data, which is divided into third areas including the first number of pixels, so that it is divided into the second areas.

The obtaining of the fourth feature data may include obtaining the fourth feature data by performing the second image processing on each of the second areas.

The obtaining of the fourth feature data may include obtaining first query data, first key data, and first value data respectively corresponding to third areas that are included in the third feature data and that include the first number of pixels.

The obtaining of the fourth feature data may include obtaining second query data, second key data, and second value data corresponding to the second areas by grouping each of the first query data, the first key data, and the first value data so as to correspond to each of the second areas.

The obtaining of the fourth feature data may include obtaining a weight matrix based on the second query data and the second key data.

The obtaining of the fourth feature data may include obtaining the fourth feature data based on the second value data and the weight matrix.

The third feature data may be obtained from the second feature data.

The one or more neural networks may include one or more convolutional neural networks.

The obtaining of the first feature data may include extracting the first feature data from the first image using the one or more convolutional neural networks.

The generating of the second image may include obtaining fifth feature data based on the second feature data and the fourth feature data.

The generating of the second image may include obtaining the second image from the fifth feature data using the one or more convolutional neural networks.

A method of operating an image processing device according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known to and usable by those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

In addition, the image processing device and the method of operating the image processing device according to the disclosed embodiments may be provided as part of a computer program product. A computer program product may be traded between a seller and a buyer as a commodity.

A computer program product may include a software program and a computer-readable storage medium in which the software program is stored. For example, a computer program product may include a product in the form of a software program (e.g., a downloadable app) that is distributed electronically by the manufacturer of an electronic device or through an electronic market (e.g., Google Play Store, App Store). For electronic distribution, at least part of the software program may be stored in a storage medium or generated temporarily. In this case, the storage medium may be a storage medium of a server of the manufacturer, a server of the electronic market, or a relay server that temporarily stores the software program.

In a system composed of a server and a client device, the computer program product may include a storage medium of the server or a storage medium of the client device. Alternatively, when there is a third device (e.g., a smartphone) communicatively connected to the server or the client device, the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include the software program itself, which is transmitted from the server to the client device or the third device, or from the third device to the client device.

In this case, one of the server, the client device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments. Alternatively, two or more of the server, the client device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments in a distributed manner.

For example, a server (e.g., a cloud server or an artificial intelligence server) may execute a computer program product stored in the server and control a client device communicatively connected to the server to perform the method according to the disclosed embodiments.

Although the embodiments have been described in detail above, the scope of the present invention is not limited thereto. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention, as defined in the following claims, also fall within the scope of the present invention.

Claims

An image processing device that processes an image using one or more neural networks, the image processing device comprising:
a memory (120) storing one or more instructions; and
at least one processor (110) configured to, by executing the one or more instructions stored in the memory:
obtain first feature data based on a first image;
perform first image processing on the first feature data to obtain second feature data corresponding to first areas including a first number of pixels;
obtain third feature data based on the first image;
perform second image processing on the third feature data to obtain fourth feature data corresponding to second areas including a second number of pixels greater than the first number; and
generate a second image based on the second feature data and the fourth feature data.

The image processing device of claim 1,
wherein the first image processing and the second image processing include self-attention.

The image processing device of claim 1 or 2,
wherein the at least one processor (110) is configured to, by executing the one or more instructions stored in the memory (120):
obtain the second feature data respectively corresponding to the first areas, based on information about the areas surrounding each of the first areas; and
obtain the fourth feature data respectively corresponding to the second areas, based on information about the areas surrounding each of the second areas.

The image processing device of any one of claims 1 to 3,
wherein the first number is one, and each of the first areas includes one pixel.

The image processing device of any one of claims 1 to 4,
wherein the at least one processor (110) is configured to, by executing the one or more instructions stored in the memory (120):
obtain query data, key data, and value data respectively corresponding to the first areas included in the first feature data;
obtain a weight matrix based on the query data and the key data; and
obtain the second feature data based on the value data and the weight matrix.

The image processing device of claim 5,
wherein the at least one processor (110) is configured to, by executing the one or more instructions stored in the memory (120):
obtain a correlation matrix based on the query data and the key data, and obtain the weight matrix by applying, to the correlation matrix, a position bias that is based on the size of the first image and the size of the images used to train the one or more neural networks.

The image processing device of any one of claims 1 to 6,
wherein the at least one processor (110) is configured to, by executing the one or more instructions stored in the memory (120):
convert the third feature data, which is divided into third areas including the first number of pixels, so that it is divided into the second areas; and
obtain the fourth feature data by performing the second image processing on each of the second areas.

The image processing device of any one of claims 1 to 7,
wherein the at least one processor (110) is configured to, by executing the one or more instructions stored in the memory (120):
obtain first query data, first key data, and first value data respectively corresponding to third areas that are included in the third feature data and that include the first number of pixels;
obtain second query data, second key data, and second value data corresponding to the second areas by grouping each of the first query data, the first key data, and the first value data so as to correspond to each of the second areas;
obtain a weight matrix based on the second query data and the second key data; and
obtain the fourth feature data based on the second value data and the weight matrix.

The image processing device of any one of claims 1 to 8,
wherein the third feature data is obtained from the second feature data.

The image processing device of any one of claims 1 to 9,
wherein the one or more neural networks include one or more convolutional neural networks, and
the at least one processor (110) is configured to, by executing the one or more instructions stored in the memory (120),
extract the first feature data from the first image using the one or more convolutional neural networks.

The image processing device of any one of claims 1 to 10,
wherein the one or more neural networks include one or more convolutional neural networks, and
the at least one processor (110) is configured to, by executing the one or more instructions stored in the memory (120):
obtain fifth feature data based on the second feature data and the fourth feature data; and
obtain the second image from the fifth feature data using the one or more convolutional neural networks.

A method of operating an image processing device that processes an image using one or more neural networks, the method comprising:
obtaining first feature data based on a first image;
performing first image processing on the first feature data to obtain second feature data corresponding to first areas including a first number of pixels;
obtaining third feature data based on the first image;
performing second image processing on the third feature data to obtain fourth feature data corresponding to second areas including a second number of pixels greater than the first number; and
generating a second image based on the second feature data and the fourth feature data.

The method of claim 12,
wherein the first image processing and the second image processing include self-attention.

The method of claim 12 or 13,
wherein the obtaining of the second feature data comprises
obtaining the second feature data respectively corresponding to the first areas, based on information about the areas surrounding each of the first areas, and
the obtaining of the fourth feature data comprises
obtaining the fourth feature data respectively corresponding to the second areas, based on information about the areas surrounding each of the second areas.

The method of any one of claims 12 to 14,
wherein the first number is one, and each of the first areas includes one pixel.

The method of any one of claims 12 to 15,
wherein the obtaining of the second feature data comprises:
obtaining query data, key data, and value data respectively corresponding to the first areas included in the first feature data;
obtaining a weight matrix based on the query data and the key data; and
obtaining the second feature data based on the value data and the weight matrix.

The method of claim 16,
wherein the obtaining of the weight matrix based on the query data and the key data comprises
obtaining a correlation matrix based on the query data and the key data, and obtaining the weight matrix by applying, to the correlation matrix, a position bias that is based on the size of the first image and the size of the images used to train the one or more neural networks.

The method of any one of claims 12 to 17,
wherein the obtaining of the fourth feature data comprises:
converting the third feature data, which is divided into third areas including the first number of pixels, so that it is divided into the second areas; and
obtaining the fourth feature data by performing the second image processing on each of the second areas.

The method of any one of claims 12 to 18,
wherein the obtaining of the fourth feature data comprises:
obtaining first query data, first key data, and first value data respectively corresponding to third areas that are included in the third feature data and that include the first number of pixels;
obtaining second query data, second key data, and second value data corresponding to the second areas by grouping each of the first query data, the first key data, and the first value data so as to correspond to each of the second areas;
obtaining a weight matrix based on the second query data and the second key data; and
obtaining the fourth feature data based on the second value data and the weight matrix.

The method of any one of claims 12 to 19,
wherein the third feature data is obtained from the second feature data.

The method of any one of claims 12 to 20,
wherein the one or more neural networks include one or more convolutional neural networks, and
the obtaining of the first feature data comprises
extracting the first feature data from the first image using the one or more convolutional neural networks.

The method of any one of claims 12 to 21,
wherein the one or more neural networks include one or more convolutional neural networks, and
the generating of the second image comprises:
obtaining fifth feature data based on the second feature data and the fourth feature data; and
obtaining the second image from the fifth feature data using the one or more convolutional neural networks.

One or more computer-readable recording media storing a program for performing the method of any one of claims 12 to 22.