KR20230114893A

KR20230114893A - Self-supervised Swin transformer model structure and method of learning the self-supervised Swin transformer model

Info

Publication number: KR20230114893A
Application number: KR1020220011178A
Authority: KR
Inventors: 양지훈; 정복진
Original assignee: 서강대학교산학협력단
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2023-08-02

Abstract

The present invention relates to a self-supervised Swin transformer model structure. The self-supervised Swin transformer model structure comprises: an image input module that receives an image; a patch generation module that divides the input image into patches of a preset size and groups the patches into windows of a preset size; a patch embedding module that embeds each patch and converts it into a vector; a shifted window multi-head self-attention (SW-MSA) module that shifts the positions of the windows; a window multi-head self-attention (W-MSA) module that uses a local attention algorithm so that the size of the attention map is proportional to the window size; and a patch merging module that merges adjacent patches into one patch. The SW-MSA module and the W-MSA module form a pair that is executed repeatedly at each stage, and at the last stage a single window covers the entire image.

Description

Self-supervised Swin transformer model structure and method of learning the self-supervised Swin transformer model

The present invention relates to a Swin transformer model structure and a self-supervised learning method therefor. More specifically, it relates to a Swin transformer model structure, and a self-supervised learning method therefor, configured to improve the performance of a vision transformer by applying a self-supervised learning algorithm to a Swin Transformer model that uses a local attention algorithm, and by training the Swin transformer model with a multi-objective function that uses window tokens.

Recently, models that apply machine learning and deep learning have performed well on a variety of computer vision problems such as image classification, object detection, and video classification. Most of these models are trained by supervised learning on training data that humans have labeled with the correct answers. Obtaining such high-quality training data requires a great deal of time and labor.

However, a model trained by supervised learning performs below expectations when training data is scarce. To address this, research continues on using unlabeled data so that supervised learning can be applied efficiently even with little training data, and one such approach is model pre-training through self-supervised learning. Self-supervised learning assigns labels to the data by itself and then trains the model with those labels; it is already widely used in natural language processing to pre-train transformer-based models. Self-supervised learning is broadly divided into pretext task techniques, which find the hidden representation of a model by solving a predefined problem, and contrastive learning techniques, which train the model to maximize the gap between positive pairs and negative pairs.

Recently, transformer-based models have also been actively studied in the field of computer vision, and one of them is the Vision Transformer.

The transformer model solves the problems of earlier recurrent neural network based models and shows the best performance in natural language processing. FIG. 1 is a structural diagram showing the transformer model. In particular, the transformer model is built around the attention algorithm previously used with recurrent neural networks. FIG. 2 shows the multi-head attention structure of the transformer model. As shown in FIG. 2, to encourage the transformer model to attend in a variety of ways, a multi-head self-attention technique has been proposed that runs the self-attention algorithm h times and then combines the results to obtain the final representation.
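The multi-head idea described above (run self-attention h times, then combine the results) can be illustrated with a minimal sketch in plain Python. This is an illustration under simplifying assumptions, not the structure shown in FIG. 2: the learned query, key, value, and output projections are omitted, so each head simply attends over its own slice of the feature dimension. The function names are hypothetical.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # tokens: list of feature vectors. Queries, keys and values are all the
    # tokens themselves (assumption: no learned projections in this sketch).
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out

def multi_head_self_attention(tokens, h):
    # Split the feature dimension into h equal slices, run self-attention on
    # each slice independently, then concatenate the h results per token.
    d = len(tokens[0])
    assert d % h == 0, "feature size must be divisible by the number of heads"
    step = d // h
    heads = [self_attention([t[i:i + step] for t in tokens]) for i in range(0, d, step)]
    return [sum((head[n] for head in heads), []) for n in range(len(tokens))]
```

Because every output is a weighted average of the input vectors, the output keeps the same number of tokens and the same feature size as the input.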

FIG. 3 is a structural diagram showing the vision transformer model. The vision transformer model uses the structure of the existing transformer model as it is to solve image classification problems. As shown in FIG. 3, the image is cut into patches and the patches are treated as a single sequence. Patch embedding projects the patches into a dimension that the transformer model can take as input, and a class token that aggregates the content of the whole image is then added. Next, position embedding is applied so that the model knows the positions of the patches, similar to the way the existing transformer model gives its input data a notion of order. The resulting input is fed to the transformer, and of the output vectors only the one corresponding to the class token is used as the final visual representation vector of the image.
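The input preparation described above (cut into patches, add a class token, add position embeddings) can be sketched as follows. This is a minimal illustration, assuming a single-channel image stored as a list of rows; the learned patch embedding projection is omitted, so the flattened pixel values stand in for the embedded patch. The names `patchify` and `build_sequence` are hypothetical.

```python
def patchify(image, patch):
    # image: H x W grid (list of rows). Returns the patches in row-major
    # order, each flattened into one list of pixel values.
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def build_sequence(image, patch, class_token, pos_embed):
    # Prepend the class token, then add one position embedding per entry so
    # the model can tell where each patch came from.
    seq = [class_token] + patchify(image, patch)
    assert len(pos_embed) == len(seq)
    return [[x + p for x, p in zip(tok, pos)] for tok, pos in zip(seq, pos_embed)]
```

For an H x W image and patch size P, the resulting sequence has (H/P) * (W/P) + 1 entries, the extra one being the class token.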

This approach addresses the inefficiency that had prevented applying a transformer model directly to the pixels of an image treated as one sequence, namely that the computation and memory usage of a transformer grow with the length of the sequence. However, the patch size must still be chosen with the transformer's computation and memory usage in mind, and the model outperforms convolutional neural network based models only when the training data is large.

Meanwhile, the Swin Transformer is a model that applies hierarchical representation, an advantage of convolutional neural network (CNN) models, to the vision transformer. FIG. 4 is a structural diagram showing the Swin transformer model. As shown in FIG. 4, the Swin Transformer provides a window multi-head self-attention method that uses a local attention technique so that the size of the attention map the transformer must compute is proportional not to the number of patches but to the size of a window, which is a group of patches. Window multi-head self-attention considers only the correlations between patches within the same window rather than between all patches. Because this ignores correlations between patches in different windows, a shifted-window technique is proposed to compensate.
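The saving from window attention can be made concrete by counting attention map entries. The sketch below is plain arithmetic under one assumption: every token attends to every other token in its group, so a group of T tokens needs T * T entries. The function names are hypothetical.

```python
def global_attention_entries(n):
    # n x n patches, every patch attends to every patch.
    tokens = n * n
    return tokens * tokens

def window_attention_entries(n, m):
    # n x n patches grouped into (n/m)^2 windows of m x m patches each;
    # attention is computed inside each window only.
    assert n % m == 0
    windows = (n // m) ** 2
    per_window = (m * m) ** 2
    return windows * per_window
```

For example, with 56 x 56 patches and 7 x 7 windows, global attention needs 9,834,496 entries while window attention needs 153,664, a factor of 64 fewer, which equals the number of windows.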

FIG. 5 is a schematic diagram illustrating the shifted-window technique. As shown in FIG. 5, the shifted-window technique changes the positions of the windows at each layer.
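The effect of shifting the windows can be sketched by computing which window each patch falls into before and after the shift. This is an illustration only: the text does not state the shift amount, so the half-window shift used in the usage note is an assumption, and the handling of windows at the image border is ignored. The function names are hypothetical.

```python
def window_id(row, col, m, shift=0):
    # Window index of the patch at (row, col) when the window grid is
    # displaced by `shift` patches along both axes (shift=0 is the regular
    # partition into m x m windows).
    return ((row + shift) // m, (col + shift) // m)

def shares_window(a, b, m, shift=0):
    # True when patches a and b (each a (row, col) pair) fall in the same window.
    return window_id(a[0], a[1], m, shift) == window_id(b[0], b[1], m, shift)
```

With m = 4, the neighbouring patches (0, 3) and (0, 4) sit on opposite sides of a window boundary in the regular partition, but with a shift of 2 they land in the same window, so attention in the shifted layer can relate them.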

The Swin transformer model uses a patch merging technique to enlarge, stage by stage, the region of the image that each window processes. Patch merging combines four adjacent patches into one patch. With these techniques, the Swin transformer model can cut the image into smaller patches than the existing vision transformer, and as a result it performs better on a range of problems.
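Patch merging as described above can be sketched by concatenating the feature vectors of each 2 x 2 block of neighbouring patches. This is a simplified illustration: implementations of Swin-style models usually follow the concatenation with a learned linear layer that reduces the feature size, which is omitted here. The function name is hypothetical.

```python
def merge_patches(grid):
    # grid: n x n list of feature vectors (n even). Each 2 x 2 block of
    # neighbours is concatenated into one vector, halving the resolution
    # along each side and quadrupling the feature length.
    n = len(grid)
    assert n % 2 == 0
    merged = []
    for r in range(0, n, 2):
        row = []
        for c in range(0, n, 2):
            row.append(grid[r][c] + grid[r][c + 1]
                       + grid[r + 1][c] + grid[r + 1][c + 1])
        merged.append(row)
    return merged
```

Since the window size in patches stays the same while the patch grid shrinks, each window covers a larger part of the image after every merge.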

Meanwhile, training a vision transformer requires a large amount of data, but in practice it is difficult to obtain that much training data. Much research has therefore tried to solve the data-quantity problem through self-supervised learning. The self-supervised vision transformer (SiT) predicts the rotation of the image, minimizes a contrastive loss, and restores a corrupted image. FIG. 6 is a structural diagram showing the self-supervised vision transformer model. Referring to FIG. 6, the self-supervised vision transformer adds a rotation token and a contrastive loss (CL) token to the input image. In addition, whereas earlier models classified the image using only the output of the class token and left the per-patch outputs unused, SiT uses the output for each patch to restore that patch. The self-supervised learning method proposed by the SiT model is as follows. First, the image is augmented, the augmented image is rotated by one of 0, 90, 180, or 270 degrees chosen at random, and noise is randomly mixed into the patches. The augmented image is fed to the SiT model; from the outputs, the rotation token is used to predict the randomly chosen rotation angle, the CL token is used to minimize the contrastive loss function, and the per-patch outputs are used to restore the patches of the original image. When a vision transformer trained in this way is checked with linear evaluation, it performs better than a model trained with existing CL techniques.
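The rotation pretext task described above needs only the image itself to produce a training label. The following minimal sketch, assuming a square single-channel image stored as a list of rows, rotates the image by a random multiple of 90 degrees and returns the multiple as the label the rotation token would be trained to predict. The function names are hypothetical.

```python
import random

def rotate90(image, k):
    # Rotate a square grid clockwise by k * 90 degrees.
    for _ in range(k % 4):
        image = [list(row) for row in zip(*image[::-1])]
    return image

def rotation_sample(image, rng):
    # Pick one of 0, 90, 180, 270 degrees at random; the index k (0..3)
    # is the self-generated label for the rotation prediction task.
    k = rng.randrange(4)
    return rotate90(image, k), k
```

A call such as `rotation_sample(image, random.Random(0))` returns the rotated image together with its label, so no human annotation is needed.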

However, it is difficult for the vision transformer to consider correlations between patches smaller than the configured patch size, and it requires a large amount of data for training.

Korean Registered Patent Publication No. 10-2189373
Korean Registered Patent Publication No. 10-2306344
Korean Laid-open Patent Publication No. 10-2021-0043995
Korean Laid-open Patent Publication No. 10-2021-0152687

To solve the problems described above, an object of the present invention is to provide a method of improving the performance of a transformer by applying a self-supervised learning algorithm to a Swin Transformer model that uses a local attention algorithm.

Another object of the present invention is to provide a method of improving the performance of a vision transformer by training a Swin transformer model with a multi-objective function that uses window tokens.

To achieve the technical objects described above, a self-supervised Swin transformer model structure according to a first aspect of the present invention is a structure implemented by a program of a computer system and comprises: an image input module that receives an image; a patch generation module that divides the input image into patches of a preset size and groups the patches into windows of a preset size; a patch embedding module that embeds each patch and converts it into a vector; a shifted window multi-head self-attention (SW-MSA) module that shifts the positions of the windows; a window multi-head self-attention (W-MSA) module that uses a local attention algorithm so that the size of the attention map is proportional to the window size; and a patch merging module that merges adjacent patches into one patch. The SW-MSA module and the W-MSA module form a pair that is executed repeatedly at each stage, and at the last stage a single window covers the entire image.

In the self-supervised Swin transformer model structure according to the first aspect, the SW-MSA module is preferably configured to share information between windows by performing the self-attention algorithm on the patches in each window, excluding the window tokens.

In the self-supervised Swin transformer model structure according to the first aspect, the W-MSA module is preferably configured to put the information of each window into its window tokens by performing the self-attention algorithm on the window tokens together with the patches.

In the self-supervised Swin transformer model structure according to the first aspect, when moving to the next stage after the pair of the SW-MSA module and the W-MSA module has been executed, the patch merging module preferably performs patch embedding that merges four adjacent patches into one and token embedding that merges four adjacent window tokens into one.

In the self-supervised Swin transformer model structure according to the first aspect, the patch generation module preferably assigns a rotation token and a contrastive loss token to each window.

A learning method of a self-supervised Swin transformer model structure according to a second aspect of the present invention comprises: (a) augmenting an input image into two images using a preset image augmentation technique, rotating the two augmented images at random, and then randomly corrupting them; (b) inputting the two augmented images into the self-supervised Swin transformer model; (c) obtaining the image rotation and the contrastive loss using the token values output by the self-supervised Swin transformer model; and (d) restoring the original patches using the patch values.

By applying a Swin transformer with window tokens and a self-supervised learning algorithm based on a multi-objective function, the present invention solves two problems of the existing vision transformer: its inability to consider correlations between small patches and its need for a large amount of training data.

In addition, the self-supervised Swin transformer model structure according to the present invention predicts rotation more accurately than existing model structures, and can therefore find good visual representations of images. It also achieves excellent accuracy in linear evaluation experiments compared with existing model structures and shows excellent performance in fine-tuning (post-training) experiments.

FIG. 1 is a structural diagram showing the transformer model.
FIG. 2 shows the multi-head attention structure of the transformer model.
FIG. 3 is a structural diagram showing the vision transformer model.
FIG. 4 is a structural diagram showing the Swin transformer model.
FIG. 5 is a schematic diagram illustrating the shifted-window technique.
FIG. 6 is a structural diagram showing the self-supervised vision transformer model.
FIG. 7 shows images illustrating how window tokens operate in the Swin transformer model structure according to the present invention.
FIG. 8 is a structural diagram showing a self-supervised Swin transformer model according to a preferred embodiment of the present invention.
FIG. 9 is a flowchart sequentially illustrating a training method for the self-supervised Swin transformer model according to a preferred embodiment of the present invention, and FIG. 10 is an algorithm implementing the method of FIG. 9.

Hereinafter, a Swin transformer model structure using window tokens according to a preferred embodiment of the present invention, and a method of training the Swin transformer model, are described in detail with reference to the accompanying drawings.

The existing Swin transformer cuts the image into small patches and performs the self-attention algorithm window by window, continually searching for better patch representations. Patch merging is inserted between these steps to keep enlarging the region each window processes, so that at the last stage a single window can process the whole image. The final representation of the image is obtained by averaging the representations output for each patch. This existing Swin transformer model has the problem that it cannot be trained with a multi-objective function. Accordingly, the Swin transformer model structure according to the present invention is characterized by applying a window token technique.

The class token was used in the BERT model proposed by Google and has since been widely used when a transformer model serves as a classification model. As the input passes through the layers, the class token continually accumulates, from the information in the sequence, the information needed for the classification problem, and in the end only the output of the class token is fed to the linear layer used for classification. The window token works as a class token designed for the Swin transformer model with its local attention algorithm.

FIG. 7 shows images illustrating how window tokens operate in the Swin transformer model structure according to the present invention. Referring to FIG. 7, if the image is cut into N*N patches and the window size is M*M, the total number of windows is N/M * N/M, and window tokens are assigned to each window. A window token condenses, from the information about a particular window, only the information needed for a particular task. In the Swin transformer model, however, patch merging occurs at each stage, so the region processed by each window keeps growing. As shown in FIG. 7, a window merging technique is used alongside patch merging to handle this. Window merging is performed according to Equation 1, where ωn is the newly created window token and ω1, ω2, ω3, and ω4 are the adjacent window tokens of the previous stage.
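The bookkeeping behind window tokens can be sketched as follows. The sketch counts the windows for N*N patches and M*M windows, and lists which four previous-stage window tokens feed each new token. It deliberately does not implement Equation 1, which is not reproduced in this text; the assumption made here is only that the four adjacent tokens form a 2 x 2 block, matching the way patch merging doubles the side of the region each window covers. The function names are hypothetical.

```python
def window_count(n, m):
    # n x n patches with m x m patches per window gives (n/m) x (n/m) windows.
    assert n % m == 0
    return (n // m) ** 2

def merge_groups(windows_per_side):
    # For each window token of the next stage, list the (row, col) positions
    # of the four adjacent previous-stage window tokens it is built from
    # (assumed to be a 2 x 2 block of neighbouring windows).
    assert windows_per_side % 2 == 0
    groups = {}
    for r in range(0, windows_per_side, 2):
        for c in range(0, windows_per_side, 2):
            groups[(r // 2, c // 2)] = [(r, c), (r, c + 1),
                                        (r + 1, c), (r + 1, c + 1)]
    return groups
```

For an 8 x 8 grid of windows, `merge_groups(8)` yields 16 groups, one per window of the next stage.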

Hereinafter, the structure of the Swin transformer model trained by the self-supervised learning method according to the present invention is described.

FIG. 8 is a structural diagram showing a self-supervised Swin transformer model according to a preferred embodiment of the present invention. Referring to FIG. 8, the self-supervised Swin transformer model according to the present invention is a model implemented by a program or software running on a computer system and comprises an image input module, a patch generation module, a patch embedding module, a shifted window multi-head self-attention (SW-MSA) module, a window multi-head self-attention (W-MSA) module, and a patch merging module. The SW-MSA module and the W-MSA module form a pair that is executed repeatedly at each stage, and at the last stage a single window covers the entire image.

The image input module receives a 224*224*3 image.

The patch generation module cuts the image received through the image input module into 4*4*3 patches, producing 56*56 patches, and groups the patches into windows of 7*7 patches. This yields 8*8 windows in total, and a rotation token and a CL token are assigned to each window individually. The patches are converted into 96-dimensional vectors by the patch embedding module.
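The sizes quoted in this embodiment follow from simple division, which the sketch below reproduces. The default arguments are the values stated in the text; the function name and the dictionary keys are hypothetical.

```python
def embodiment_shapes(image_side=224, patch_side=4, window_side=7,
                      tokens_per_window=2, channels=3):
    patches_per_side = image_side // patch_side         # 224 / 4 = 56
    windows_per_side = patches_per_side // window_side  # 56 / 7 = 8
    windows = windows_per_side ** 2
    return {
        "patches": patches_per_side ** 2,
        "windows": windows,
        # one rotation token and one CL token per window
        "window_tokens": windows * tokens_per_window,
        # raw values in one patch before it is embedded into 96 dimensions
        "values_per_patch": patch_side * patch_side * channels,
    }
```

With the default values this gives 3,136 patches, 64 windows, 128 window tokens, and 48 raw values per patch.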

The layers that follow consist of W-MSA modules and SW-MSA modules. The shifted window multi-head self-attention (SW-MSA) module is executed before the window multi-head self-attention (W-MSA) module. The SW-MSA module performs the self-attention algorithm on the patches in the shifted windows, excluding the window tokens, which allows information to be shared between windows.

Next, the W-MSA module performs self-attention on the window tokens together with the patches, which puts the information of each window into its window tokens. The self-supervised Swin transformer model according to the preferred embodiment of the present invention consists of 12 layers in total, divided into four stages. On moving from one stage to the next, patch embedding merges four adjacent patches into one, and token embedding likewise merges four adjacent window tokens into one. With SW-MSA and W-MSA forming a pair, the four stages have 2, 2, 6, and 2 layers respectively, and at the last stage a single window covers the entire image.
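The claim that a single window covers the whole image at the last stage can be checked by tracing the patch grid through the four stages. The sketch below assumes that the window size stays at 7*7 patches in every stage and that each stage transition halves the patch grid along each side (four patches merged into one). The function name is hypothetical.

```python
def stage_plan(patches_per_side=56, window_side=7, layers=(2, 2, 6, 2)):
    # Returns one entry per stage with its layer count, patch grid side,
    # and number of windows.
    plan = []
    side = patches_per_side
    for stage, depth in enumerate(layers, start=1):
        plan.append({
            "stage": stage,
            "layers": depth,
            "patches_per_side": side,
            "windows": (side // window_side) ** 2,
        })
        side //= 2  # merging 2 x 2 patches halves each side
    return plan
```

The patch grid shrinks from 56 to 28, 14, and 7 per side, so the window count goes 64, 16, 4, 1, and the layer counts 2 + 2 + 6 + 2 add up to the 12 layers stated above.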

Hereinafter, a method of training the self-supervised Swin transformer model according to the present invention is described. FIG. 9 is a flowchart sequentially illustrating the training method for the self-supervised Swin transformer model according to a preferred embodiment of the present invention, and FIG. 10 is an algorithm implementing the method of FIG. 9.

Referring to FIGS. 9 and 10, in the training method for the self-supervised Swin transformer model according to the present invention, when an image is received as input, it is first augmented into two versions using a preset image augmentation technique. A random rotation is applied to each of the two augmented images. The two images are then randomly corrupted and fed into the self-supervised Swin transformer model described above. Finally, the token values output by the Swin transformer model are used to obtain the image rotation and the contrastive loss, and the patch values are used to restore the original patches.
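The data preparation part of this method (augment into two views, rotate, corrupt) can be sketched as below. This is not the algorithm of FIG. 10, which is not reproduced in this text. Several choices are assumptions made for illustration: a random horizontal flip stands in for the preset augmentation technique, corruption replaces a fixed fraction of patches with uniform noise, and the uncorrupted rotated view is kept as the restoration target. The model and the three losses are outside the scope of the sketch, and all names are hypothetical.

```python
import random

def rotate90(grid, k):
    # Rotate a square grid clockwise by k * 90 degrees.
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid

def corrupt_patches(image, patch, ratio, rng):
    # Overwrite a random subset of patches with noise. Returns the corrupted
    # copy and the top-left corners of the corrupted patches.
    n = len(image)
    out = [row[:] for row in image]
    corners = [(r, c) for r in range(0, n, patch) for c in range(0, n, patch)]
    chosen = rng.sample(corners, int(len(corners) * ratio))
    for top, left in chosen:
        for r in range(top, top + patch):
            for c in range(left, left + patch):
                out[r][c] = rng.random()
    return out, sorted(chosen)

def make_views(image, patch, ratio, rng):
    views = []
    for _ in range(2):
        # Stand-in augmentation (assumption): random horizontal flip.
        flip = rng.random() < 0.5
        aug = [row[::-1] if flip else row[:] for row in image]
        k = rng.randrange(4)  # rotation label, 0..3
        rotated = rotate90(aug, k)
        corrupted, corners = corrupt_patches(rotated, patch, ratio, rng)
        views.append({"input": corrupted, "target": rotated,
                      "rotation": k, "corrupted": corners})
    return views
```

Each view carries what the three objectives need: `rotation` for the rotation prediction, the pair of views for the contrastive loss, and `target` for restoring the corrupted patches.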

Although the present invention has been described above with reference to its preferred embodiments, these are only examples and do not limit the present invention. Those of ordinary skill in the art to which the present invention pertains will appreciate that various modifications and applications not illustrated above are possible without departing from the essential characteristics of the present invention. Differences related to such modifications and applications should be construed as falling within the scope of the present invention as defined in the appended claims.

Claims

1. A self-supervised Swin transformer model structure, comprising:
an image input module that receives an image;
a patch generation module that divides the input image into patches of a preset size and groups the patches into windows of a preset size;
a patch embedding module that embeds each patch and converts it into a vector;
a shifted window multi-head self-attention (SW-MSA) module that shifts the positions of the windows;
a window multi-head self-attention (W-MSA) module that uses a local attention algorithm so that the size of the attention map is proportional to the window size; and
a patch merging module that merges adjacent patches into one patch,
wherein the SW-MSA module and the W-MSA module form a pair that is executed repeatedly at each stage, and at the last stage a single window covers the entire image.

2. The self-supervised Swin transformer model structure of claim 1, wherein the SW-MSA module
is configured to share information between windows by performing a self-attention algorithm on the patches in each window, excluding the window tokens.

3. The self-supervised Swin transformer model structure of claim 1, wherein the W-MSA module
is configured to put the information of each window into its window token by performing a self-attention algorithm on the window token together with the patches.

4. The self-supervised Swin transformer model structure of claim 1, wherein the patch merging module,
when moving to the next stage after the pair of the SW-MSA module and the W-MSA module has been executed, performs patch embedding that merges four adjacent patches into one and token embedding that merges four adjacent window tokens into one.

5. The self-supervised Swin transformer model structure of claim 1, wherein the patch generation module
assigns a rotation token and a contrastive loss token to each window.

6. A learning method of a self-supervised Swin transformer model structure, comprising:
(a) a training image augmentation step of augmenting an input image into two images;
(b) inputting the two augmented images into a self-supervised Swin transformer model;
(c) obtaining an image rotation and a contrastive loss using token values output by the self-supervised Swin transformer model; and
(d) restoring the original patches using the patch values,
wherein the self-supervised Swin transformer model comprises:
an image input module that receives an image;
a patch generation module that divides the input image into patches of a preset size, groups the patches into windows of a preset size, and assigns a rotation token and a CL token to each window;
a patch embedding module that embeds each patch and converts it into a vector;
a shifted window multi-head self-attention (SW-MSA) module that shifts the positions of the windows;
a window multi-head self-attention (W-MSA) module that uses a local attention algorithm so that the size of the attention map is proportional to the window size; and
a patch merging module that merges adjacent patches into one patch,
and wherein the SW-MSA module and the W-MSA module form a pair that is executed repeatedly at each stage, and at the last stage a single window covers the entire image.

7. The learning method of claim 6, wherein in step (a)
the input image is augmented into two images using a preset image augmentation technique, and the two augmented images are randomly rotated and then randomly corrupted.

8. The learning method of claim 6, wherein the SW-MSA module
is configured to share information between windows by performing a self-attention algorithm on the patches in each window, excluding the window tokens.

9. The learning method of claim 6, wherein the W-MSA module
is configured to put the information of each window into its window token by performing a self-attention algorithm on the window token together with the patches.

10. The learning method of claim 6, wherein the patch merging module,
when moving to the next stage after the pair of the SW-MSA module and the W-MSA module has been executed, performs patch embedding that merges four adjacent patches into one and token embedding that merges four adjacent window tokens into one.