KR20220144687A

KR20220144687A - Dual attention multiple instance learning method

Info

Publication number: KR20220144687A
Application number: KR1020210051331A
Authority: KR
Inventors: 박상현; 치콘테 필립
Original assignee: 재단법인대구경북과학기술원
Priority date: 2021-04-20
Filing date: 2021-04-20
Publication date: 2022-10-27

Abstract

The present invention relates to a multiple instance learning device for analyzing a 3D image, which comprises: a memory in which a multiple instance learning model is stored; and at least one processor electrically connected to the memory. The multiple instance learning model may comprise: a convolution block which derives a feature map of each 2D instance of an input 3D image; a spatial attention block which derives a spatial attention map of the instances from the feature map derived by the convolution block; and an instance attention block which receives the result of combining the feature map with the spatial attention map, derives an attention score for each instance, and derives an aggregated embedding for the 3D image by aggregating the embeddings of the instances according to the attention scores. The present invention provides an attention-based, end-to-end, weakly supervised framework for rapid diagnosis based on multiple instance learning.

Description

DUAL ATTENTION MULTIPLE INSTANCE LEARNING METHOD

The present invention relates to a dual attention multiple instance learning method and, more particularly, to a multiple instance learning method that uses an unsupervised complementary loss.

Diagnosis and analysis based on chest computed tomography (chest CT) play a key role in combating pandemics that spread rapidly around the world, such as coronavirus disease 2019 (COVID-19).

However, accurate screening is challenging because it is difficult to annotate infected regions, to classify and structure large datasets, and to distinguish the subtle differences between the disease to be diagnosed and other diseases.

There is also the problem that the sensitivity of automated screening is limited and may not match radiologist-level performance.

Therefore, there is an urgent need to develop and improve robust screening methods based on chest CT.

Meanwhile, deep learning-based solutions have shown considerable success in medical image analysis, owing to their ability to extract rich features from clinical datasets and their applicability to a wide range of tasks such as organ segmentation and disease diagnosis.

For example, Korean Patent Laid-Open Publication No. 2021-0030730 discloses a method for predicting the likelihood of developing lung cancer using an artificial intelligence model that interprets medical images, and a medical image analysis apparatus.

However, although deep learning-based solutions show promising performance, most of them are supervised and require a considerable labeling effort.

Accordingly, unsupervised or weakly supervised learning methods that do not require excessive data pre-processing and/or strong prior knowledge may be the preferred option for accurate diagnosis.

In view of the above problems, an object of the present invention is to provide an attention-based, end-to-end, weakly supervised framework for rapid diagnosis based on multiple instance learning (MIL).

Another object of the present invention is to provide an unsupervised contrastive learning method that uses attention applied in both the spatial and the latent contexts, and to provide dual attention contrastive based MIL (DA-CMIL).

Another object of the present invention is to provide (a) patch-based, (b) slice-based, and (c) 3D CT-based methods for making a diagnostic decision on a disease.

Another object of the present invention is to provide a novel end-to-end, attention-based, weakly supervised framework that uses MIL and self-supervised contrastive learning on features for accurate disease diagnosis.

To solve the above technical problems, a multiple instance learning apparatus for analyzing a 3D image according to an embodiment of the present disclosure includes a memory in which a multiple instance learning model is stored and at least one processor electrically connected to the memory. The multiple instance learning model may include: a convolution block that derives a feature map for each instance, an instance being a 2D slice or patch of the input 3D image; a spatial attention block that derives a spatial attention map of the instances from the feature map derived by the convolution block; an instance attention block (latent attention block) that receives the result of combining the feature map with the spatial attention map, derives an attention score for each instance, and aggregates the embeddings of the instances according to the attention scores to derive an aggregated embedding for the 3D image; and an output block that outputs an analysis result for the 3D image based on the aggregated embedding.

In addition, in the training phase of the multiple instance learning model, the processor may train the learning model, using training data that includes 3D images labeled with the ground truth for the analysis result, so that the total loss function (L) of the learning model is minimized.

Here, the total loss function may be a combination of a bag-level loss function (L_B) on the result of the output block and a contrastive loss function (L_F) between the instance embeddings and the aggregated embedding.

The present invention provides unsupervised or weakly supervised learning methods that do not require excessive data pre-processing and/or strong prior knowledge, while enabling more accurate analysis of 3D images.

In addition, by using supervised learning together with a contrastive loss, the model of the present invention can avoid overfitting even when trained on smaller datasets and can improve feature robustness at the instance level without sacrificing accuracy.

In addition, the multiple instance learning method of the present invention may be robust to different CT sizes, in which the instance (e.g., slice or patch) count varies.

FIG. 1 illustrates a multiple instance learning apparatus and the framework of a multiple instance learning model according to an embodiment of the present disclosure.
FIG. 2 is a diagram for explaining contrastive learning applied to a multiple instance learning setting according to an embodiment of the present disclosure.
FIG. 3 shows examples of CT images that are pre-processed in a multiple instance learning method according to an embodiment of the present disclosure.
FIG. 4 is a graph comparing the characteristics of a multiple instance learning model according to an embodiment of the present disclosure with those of other learning models.
FIG. 5 is a graph comparing the prediction performance of a multiple instance learning model according to an embodiment of the present disclosure with that of other learning models on a CT slice dataset.
FIG. 6 is a graph comparing the prediction performance of a multiple instance learning model according to another embodiment of the present disclosure with that of other learning models on a CT patch dataset.
FIG. 7 illustrates spatial attention maps and instance attention scores derived from a multiple instance learning model according to an embodiment of the present disclosure.
FIG. 8 shows CT slices represented in the embedding space of a multiple instance learning model according to an embodiment of the present disclosure.
FIG. 9 is a flowchart illustrating a method of training a multiple instance learning model according to an embodiment of the present disclosure.

In order that the configuration and effects of the present invention may be fully understood, preferred embodiments of the present invention are described with reference to the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below; it may be embodied in various forms and various modifications may be made. The description of the embodiments is provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the present invention pertains. In the accompanying drawings, components are shown larger than their actual size for convenience of description, and the proportions of the components may be exaggerated or reduced.

Terms such as 'first' and 'second' may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a 'first component' may be termed a 'second component', and similarly a 'second component' may be termed a 'first component'. A singular expression includes the plural unless the context clearly dictates otherwise. Unless otherwise defined, terms used in the embodiments of the present invention may be interpreted as having the meanings commonly known to those of ordinary skill in the art.

The multiple instance learning apparatus and method according to the present disclosure use dual attention blocks and use a contrastive loss function in addition to a bag-level loss function, thereby enabling more accurate image analysis.

The multiple instance learning apparatus and method according to the present disclosure can be used for various kinds of 3D image analysis. Below, the concept of the present invention is described using the example of analyzing 3D CT images to distinguish COVID-19 pneumonia from bacterial pneumonia.

Multiple instance learning according to the present disclosure may be referred to as dual attention contrastive based multiple instance learning (DA-CMIL). DA-CMIL may receive the patient CT slices derived from a 3D CT image (treated as a bag of instances) as input and output a single label.

Attention-based pooling can be applied to select key slices in the latent space, and spatial attention can learn the context within each slice for interpretable diagnosis. A contrastive loss may be applied at the instance level so that features from the same patient CT are encoded to be similar to the representative pooled features of that patient CT.

The goal of DA-CMIL is to assign a single category label (e.g., COVID-19 pneumonia or bacterial pneumonia) to a patient CT, given a CT volume of multiple 2D slices as input.

In general, each patient CT scan can be regarded as a bag of instances, each of which may be positive or negative. In addition to locating the infected regions, it is also useful to identify which slices/instances contribute to the final patient diagnosis.

The present disclosure proposes an attention-based, permutation-invariant MIL method that pools slices to obtain a single representative feature for a patient CT.

In addition, spatial attention can be applied jointly to learn spatial features for discovering infected regions. The multiple instance learning method of the present disclosure may use contrastive learning at the instance level, in an unsupervised manner, so that the patient-level aggregated feature and the instance features from the same patient are semantically similar.

To this end, during training of the learning model, the unsupervised contrastive loss can be used together with the supervised loss computed from the patient category labels.

Existing work that applies MIL in various domains separates instance-level learning and bag-level learning into a two-step procedure: instance-level encoders are trained first, and aggregation models for inference are then trained using the trained encoders together with MIL pooling. However, training a robust encoder can be challenging because of the uncertainty and noise of the instance labels.

Therefore, the multiple instance learning model according to the present disclosure addresses this problem through end-to-end learning. Instance selection can be achieved through attention-based pooling of CT slices, with model optimization driven only by the accurate patient labels.

The present disclosure proposes a novel end-to-end model for weakly supervised classification of COVID-19 pneumonia and bacterial pneumonia, and shows that contrastive learning between instance features and patient-level features is feasible in the MIL setting.

FIG. 1 illustrates a multiple instance learning apparatus and the framework of a multiple instance learning model according to an embodiment of the present disclosure.

This embodiment considers a chest CT dataset D = {S_1, …, S_n}. The learning model may receive a set of m labeled example scans drawn from a joint distribution defined over S × Y.

Here, S_i is a patient CT scan consisting of instances (e.g., 2D CT slices or patches), and Y is the set of patient-level labels, which can be written as {0, 1} for binary classification of COVID-19 versus other cases.

In addition, S_i can be regarded as a bag of instances, S_i = {s_1, s_2, …, s_N}, where N denotes the total number of instances in the bag.

Each instance s_n can be assumed to have a label y_n ∈ {0, 1}, although the instances of a bag are not necessarily all negative or all positive.

Moreover, not every slice of a scan shows the infected regions that matter for diagnosis, since some slices may be noisy artifacts that are not useful for learning.

Accordingly, in this embodiment, MIL may assume that if a bag S_i is negative then all of its instances are negative, whereas a positive bag must satisfy the constraint that at least one of its instances is positive. This can be expressed as Equation 1.
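Equation 1 is not reproduced in this text. A standard way of writing the MIL assumption just described is sketched below; the bag-label symbol Y_i for the bag S_i is notation assumed here, while y_n and N are the instance labels and instance count defined above.

```latex
Y_i =
\begin{cases}
0, & \text{if } \sum_{n=1}^{N} y_n = 0, \\
1, & \text{otherwise.}
\end{cases}
```

Equivalently, Y_i = \max_n y_n: a single positive instance is enough to make the bag positive, and a negative bag contains no positive instance.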

Since both sets of bags considered in this embodiment (e.g., COVID-19 pneumonia and other pneumonia) contain negative and positive instances (lesions), the above condition may not hold strictly. Therefore, this embodiment considers a relaxed version of the constraint and applies an attention mechanism that implicitly weights the instances in order to learn the bag label.

In this embodiment, a CNN model for diagnosis at the patient CT scan level, distinguishing COVID-19 pneumonia from other pneumonia, can be implemented in a single end-to-end framework. A dual attention multiple instance learning deep model with unsupervised contrastive learning (DA-CMIL) may be applied.

As shown in FIG. 1, this embodiment takes as input a CT scan whose instances are unlabeled and learns key semantic representations. An attention-based pooling method may be used to transform the patient's instances into a single bag representation for the final prediction.

In this embodiment, unsupervised contrastive learning can be used so that, during training, the instances in a bag are semantically similar to the bag representation.

Meanwhile, in the framework proposed in this embodiment, a backbone CNN model may be implemented as a feature map extractor that transforms the j-th instance of a CT bag into a low-dimensional embedding g_ij with spatial dimensions of shape C×H×W. Here, C may denote the channel size, H the height, and W the width.

In this embodiment, g_ij may then be provided to a spatial attention module, which learns spatially representative features and outputs, with C = 1, a spatial attention map of size 1×H*×W* per instance.

The resulting maps can highlight key regions and are used to weight the features of every initial instance, yielding a single spatially pooled feature of dimension D per instance, where D may denote the dimension size of the feature map.

Here, the weighted summation between the attention map and the backbone feature map can be carried out as an Einstein summation.
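The contraction can be illustrated with a small pure-Python sketch. It assumes a feature map stored as nested lists of shape C×H×W and an attention map of shape H×W that has already been normalized; the function names `softmax_2d` and `spatial_pool` are hypothetical and are not taken from the source. The sum over h and w corresponds to the Einstein-summation subscripts `chw,hw->c`.

```python
import math

def softmax_2d(scores):
    """Normalize an H x W grid of scores so that all entries sum to 1."""
    peak = max(max(row) for row in scores)
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

def spatial_pool(feature_map, attention):
    """Contract a C x H x W feature map with an H x W attention map.

    Computes v[c] = sum over h, w of attention[h][w] * feature_map[c][h][w],
    which is the Einstein summation 'chw,hw->c'.
    """
    return [
        sum(a * f
            for a_row, f_row in zip(attention, channel)
            for a, f in zip(a_row, f_row))
        for channel in feature_map
    ]

# Toy instance with C = 2 channels on a 2 x 2 grid.
g = [[[1.0, 2.0], [3.0, 4.0]],
     [[0.0, 1.0], [0.0, 1.0]]]
attn = softmax_2d([[0.0, 0.0], [0.0, 0.0]])  # equal scores give uniform weights
pooled = spatial_pool(g, attn)               # one value per channel
```

With uniform attention the result reduces to the per-channel mean, which is the global-average-pooling special case; a learned attention map shifts the weight toward the highlighted positions.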

In this embodiment, to aggregate the instance features of each CT scan, a second module may be implemented that performs attention-based, permutation-invariant pooling to obtain a single bag representation, which has the same dimension for consistency.

In this embodiment, the bag representation z_n may then be passed to a patient-level classifier to obtain a prediction for the whole bag, where the classifier output may denote the probability that the CT scan is labeled as COVID-19 pneumonia or other pneumonia.

Meanwhile, in this embodiment, a bag loss L_B based on cross-entropy can be used, as expressed in Equation 2.
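Equation 2 is not reproduced in this text. For binary patient-level labels, a cross-entropy bag loss is commonly written as in the sketch below. The symbols \hat{y}_i (predicted probability for the i-th scan) and Y_i (its ground-truth label) are assumed notation, and averaging over the m training scans is also an assumption rather than something the source states.

```latex
\mathcal{L}_B = -\frac{1}{m} \sum_{i=1}^{m}
\Big[ Y_i \log \hat{y}_i + (1 - Y_i) \log\big(1 - \hat{y}_i\big) \Big]
```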

Dual Attention-Based Learning

Attention can be essential for learning robust features in the MIL setting. In particular, attention-based pooling may be preferred over conventional pooling operators such as max or mean pooling, because the conventional operators are either not differentiable or not applicable to end-to-end model updates.

This embodiment uses attention pooling based on both spatial embeddings and latent embeddings, each through its own module (block). In the spatial module, given an input feature map, two convolutional layers may be used, followed by hyperbolic tangent (tanh) and sigmoid (sigm) nonlinearities, respectively.

In this embodiment, the feature maps g_ij may be passed to each of these layers and then to a final convolutional layer with a single-channel output that indicates the presence of infection.

In particular, element-wise multiplication may be performed between the outputs of the two convolutional branches before the result is passed to the final layer to obtain the spatial scores. The spatial scores may then be normalized by a softmax operation and combined with the feature map through a summed matrix multiplication over both the height and width dimensions, which gives the final spatially pooled features.

Note that this embodiment implements gated spatial attention in place of the global average pooling (GAP) that is commonly applied to the initial backbone features.

In addition, the initial normalized spatial maps can be used to visualize the regions on which the model focuses when making its decision.
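The gated spatial attention described in this section can be sketched in pure Python as follows. The sketch makes several assumptions that the source does not state: the convolutions use 1×1 kernels without bias (so each is a per-position linear map over channels), the two branches have the same number of hidden channels, and the names `conv1x1`, `gated_spatial_attention`, `w_tanh`, `w_sigm`, and `w_out` are hypothetical.

```python
import math

def conv1x1(x, weight):
    """1x1 convolution: x is C_in x H x W (nested lists), weight is C_out x C_in."""
    height, width = len(x[0]), len(x[0][0])
    return [[[sum(w_c * x[c][i][j] for c, w_c in enumerate(row))
              for j in range(width)]
             for i in range(height)]
            for row in weight]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_spatial_attention(g, w_tanh, w_sigm, w_out):
    """Return an H x W attention map that sums to 1 for one instance.

    w_tanh and w_sigm must produce the same number of hidden channels;
    w_out maps those hidden channels to a single score channel.
    """
    t = conv1x1(g, w_tanh)  # tanh branch, before the nonlinearity
    s = conv1x1(g, w_sigm)  # sigmoid branch, before the nonlinearity
    height, width = len(g[0]), len(g[0][0])
    # Element-wise product of the two branch outputs (the gating step).
    gated = [[[math.tanh(t[k][i][j]) * sigmoid(s[k][i][j])
               for j in range(width)]
              for i in range(height)]
             for k in range(len(t))]
    scores = conv1x1(gated, w_out)[0]  # final layer: single-channel scores
    # Softmax over all H x W positions.
    peak = max(max(row) for row in scores)
    exps = [[math.exp(v - peak) for v in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]

# Toy feature map with C = 2 channels on a 2 x 2 grid.
g = [[[0.0, 1.0], [2.0, 3.0]],
     [[1.0, 1.0], [1.0, 1.0]]]
attention = gated_spatial_attention(
    g, w_tanh=[[1.0, 0.0]], w_sigm=[[0.0, 1.0]], w_out=[[1.0]])
```

The returned map is what would be visualized to show where the model focuses; pooling then weights the backbone features by this map and sums over the spatial positions.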

Meanwhile, in this embodiment, attention-based pooling can be used in the instance attention module to aggregate the features. This embodiment may adopt the same architectural design that was applied earlier for gated spatial attention on the initial backbone feature maps, except that every convolutional layer is replaced by a fully connected layer, because the attention is applied to instance embeddings.

In this embodiment, the features of a scan can be expressed as a bag with N instance features. Attention-based MIL pooling with a gating mechanism can then be expressed as Equations 3 and 4.
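Equations 3 and 4 are not reproduced in this text. A form consistent with the symbols described around them (three trainable parameters, tanh and sigmoid nonlinearities, an element-wise product, and a per-instance score a_n) is the gated attention pooling widely used in attention-based MIL. In the sketch below, the notation h_n for the n-th instance feature, z for the pooled bag feature, and w, V, U for the trainable parameters is assumed.

```latex
z = \sum_{n=1}^{N} a_n \, h_n ,
\qquad
a_n = \frac{\exp\!\big\{ w^{\top} \big( \tanh(V h_n^{\top}) \odot \operatorname{sigm}(U h_n^{\top}) \big) \big\}}
           {\sum_{j=1}^{N} \exp\!\big\{ w^{\top} \big( \tanh(V h_j^{\top}) \odot \operatorname{sigm}(U h_j^{\top}) \big) \big\}}
```

Because the scores a_n are normalized by a softmax over the bag and z is a weighted sum, the result does not depend on the order of the instances.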

Here, the weights of the attention module are trainable parameters, tanh(·) and sigm(·) denote element-wise nonlinearities whose outputs are combined by element-wise multiplication, and a_n denotes the per-instance attention score, which indicates the relevance of a given instance to the prediction for the whole bag.

From a technical standpoint, attention-based pooling can assign different weights to instances, which reduces the need for explicit instance selection and makes the final bag representation more informative. In this embodiment, the synergistic combination of spatial and attention-based pooling thus enables improved training for learning robust and interpretable features.
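A minimal pure-Python sketch of gated attention pooling over a bag is given below. It assumes fully connected layers without bias, and the names `gated_attention_pool`, `V`, `U`, and `w` are illustrative choices rather than identifiers from the source. The toy run at the end also checks that reversing the instance order leaves the pooled feature unchanged.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gated_attention_pool(instances, V, U, w):
    """Return (attention scores a_n, pooled bag feature z) for one bag."""
    logits = []
    for h in instances:
        t = [math.tanh(x) for x in matvec(V, h)]                # tanh branch
        s = [1.0 / (1.0 + math.exp(-x)) for x in matvec(U, h)]  # sigmoid gate
        logits.append(sum(w_k * t_k * s_k for w_k, t_k, s_k in zip(w, t, s)))
    # Softmax over the instances of the bag.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    # Attention-weighted sum of the instance features.
    z = [sum(a * h[d] for a, h in zip(scores, instances))
         for d in range(len(instances[0]))]
    return scores, z

bag = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
U = [[0.5, 0.5], [0.5, -0.5]]
w = [1.0, 1.0]
scores, z = gated_attention_pool(bag, V, U, w)
scores_rev, z_rev = gated_attention_pool(bag[::-1], V, U, w)  # same z expected
uniform_scores, z_mean = gated_attention_pool(bag, V, U, [0.0, 0.0])
```

Setting `w` to zeros makes every logit equal, so the pooling falls back to a plain mean over the instances; non-zero weights let the model emphasize the slices that matter for the bag-level prediction.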

Contrastive MIL

FIG. 2 is a diagram for explaining contrastive learning applied to a multiple instance learning setting according to an embodiment of the present disclosure.

In this embodiment, to improve the learning of instance-level feature maps, an unsupervised contrastive loss may be incorporated in addition to the MIL method presented above. The model of this embodiment can learn a representation that maximizes the agreement between the instance feature maps of a patient and the aggregated bag feature map of the same patient through a contrastive loss in the latent space. FIG. 2 shows the overall concept of the technique applied in this embodiment.

In previously proposed self-supervised frameworks that use a contrastive loss, stochastic data augmentation is applied to a 2D data sample to generate two correlated views of the same example. The augmentations may include random cropping, color distortion, and random Gaussian blurring.

The contrastive loss can then be applied to define a contrastive prediction task on unlabeled samples, in which positive and negative pairs are identified for a given sample. In this embodiment, stochastic data augmentation can be omitted because the contrastive loss is applied in the latent space. For a given patient CT scan, each slice can be regarded as a pseudo-augmentation of the overall patient characteristics. In other words, stochastic augmentation is considered to be applied implicitly, with the slices acting as different views of the same patient.

In this embodiment, z' denotes a patient's latent instance-level feature and z denotes the patient's bag-level feature obtained through the modules proposed above. After l2 normalization of the z' and z features, the contrastive loss function can be defined as in Equation 5 below.
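Equation 5 is not reproduced in this text. The description that follows it (an indicator that excludes k = i, a temperature parameter, a similarity function, and a sum over 2N terms) matches the normalized temperature-scaled cross-entropy loss used in self-supervised contrastive learning. Assuming the symbols \tau for the temperature and \operatorname{sim} for the similarity function, the loss for a positive pair (i, j) would read:

```latex
\ell(i, j) = -\log
\frac{\exp\!\big( \operatorname{sim}(z_i, z_j) / \tau \big)}
     {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \, \exp\!\big( \operatorname{sim}(z_i, z_k) / \tau \big)}
```

The contrastive loss L_F would then aggregate \ell(i, j) over the positive pairs formed by the instance features and the bag feature of the same patient; the exact aggregation used in the source is not recoverable from this text.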

Here, the indicator function evaluates to 1 if and only if k ≠ i, and the temperature parameter controls how strongly differences in similarity are reflected in the loss; its optimal value is found empirically, by trying different values, for a given embodiment of the model. The similarity function may be, for example, cosine similarity. z_i and z_j denote the features of the i-th and j-th patches (slices), respectively. N is the batch size defined in the training phase; when two sets of data are compared together as in FIG. 2, the number of compared features becomes 2N, so the summation runs up to 2N.

To obtain the contrastive loss, each instance (i = 1, …, total number of instances) is compared with each bag-level feature map, and the loss is determined essentially from the sum of all of these comparison values. The loss may be computed over all patient slice features and the respective bag-level features, which in this embodiment may be regarded as augmentations within each mini-batch. The final loss function (total loss function) of the whole framework of this embodiment can be defined as in Equation 6 below.

Here, $\lambda$ is a parameter that weights the contributions of the bag loss and the contrastive loss, and takes a value between 0 and 1.
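The contrastive loss and the weighted total loss described above can be sketched as follows. This is a minimal illustration that assumes the NT-Xent form suggested by the description (cosine similarity, temperature, indicator excluding k = i) and a total loss of the form λ·L_B + (1 − λ)·L_F, which is consistent with λ = 1.0 disabling the contrastive term. The function names, the default temperature, and the choice of which feature pair is treated as positive are hypothetical and are not taken from the specification.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine_sim(a, b):
    """Cosine similarity, computed as the dot product of L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(features, i, j, tau=0.5):
    """Loss for the positive pair (i, j) among all features in the batch.

    `features` is assumed to hold both instance-level and bag-level vectors;
    `tau` is the temperature parameter (0.5 is an arbitrary illustrative value).
    """
    pos = math.exp(cosine_sim(features[i], features[j]) / tau)
    denom = sum(
        math.exp(cosine_sim(features[i], features[k]) / tau)
        for k in range(len(features))
        if k != i  # indicator function: 1 iff k != i
    )
    return -math.log(pos / denom)


def total_loss(bag_loss, feat_loss, lam):
    """Weighted combination of the bag loss and the contrastive loss, 0 <= lam <= 1."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * bag_loss + (1.0 - lam) * feat_loss
```

With this form, a pair of aligned features yields a smaller loss than a pair of orthogonal ones, and setting `lam` to 1.0 returns the bag loss alone.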

For the detailed algorithm corresponding to the above description, refer to the algorithm table below.

[Algorithm Table]

In this embodiment, a chest CT dataset consisting of 173 samples was collected at Yeungnam University Medical Center (YUMC) in Daegu, Korea. The dataset includes 75 CT samples from patients with COVID-19 and 98 CT samples from patients with bacterial pneumonia, collected between February and April 2020.

In the MIL framework according to this embodiment, 2D CT slices or patches are used as instances, and the learning model according to this embodiment is evaluated for both the slice case and the patch case. In addition, a 3D CT volume dataset is prepared in order to train and test 3D-based methods under fully supervised settings.

For pre-processing, the lung regions were segmented in all CT samples. A ResNeSt model may be used for segmentation training and inference.

FIG. 3 shows examples of CT images that are pre-processed in a multiple instance learning method according to an embodiment of the present disclosure.

In FIG. 3, the upper images are pre-processed lung CT slices (left three) and patch samples (rightmost two) of a patient with COVID-19 pneumonia, and the lower images are pre-processed lung CT slices (left three) and patch samples (rightmost two) of a patient with bacterial pneumonia.

All datasets were split by patient ID into training, validation, and test sets at ratios of 0.5, 0.1, and 0.4, respectively. The same split was used for every dataset variant, and all versions use only the cropped lung regions. CT samples may be 512×512, 128×128, and 256×256×256 in size for the slice, patch, and 3D CT volume sets, respectively. Each CT slice may be resized from 512×512 to 256×256, and each patch from 256 to 128. In particular, the slice set contains about 14,000 slices and the patch version contains 64,000 patches, which mostly show at least 30% lung tissue. For 3D CT volumes, all slices belonging to one patient may be used to build the volume, with nearest-neighbor sampling applied to obtain the desired input size.
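A patient-level split of the kind described above might be sketched as follows. The specification states only that the split is made by patient ID at ratios of 0.5, 0.1, and 0.4; the shuffling, the seed, the rounding, and the function name `split_by_patient` are assumptions made for illustration.

```python
import random


def split_by_patient(patient_ids, ratios=(0.5, 0.1, 0.4), seed=0):
    """Split unique patient IDs into train/validation/test sets.

    Splitting on patient IDs (rather than on slices or patches) keeps all
    instances of one patient in the same subset. The shuffle and seed are
    illustrative assumptions, not details from the specification.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)

    n = len(unique_ids)
    n_train = int(round(ratios[0] * n))
    n_val = int(round(ratios[1] * n))

    train = set(unique_ids[:n_train])
    val = set(unique_ids[n_train:n_train + n_val])
    test = set(unique_ids[n_train + n_val:])  # remainder goes to the test set
    return train, val, test
```

Because the slice, patch, and 3D volume variants are all keyed by patient ID, the same three ID sets can be reused for every variant.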

The model of the present disclosure may be implemented in PyTorch. ResNet-34, fine-tuned from ImageNet pretrained weights, may be used as the feature extraction module, together with a single fully connected layer used as the bag classifier. The feature dimension may be fixed at 512, consistent with the 512×8×8 feature maps (C = 512) obtained from the feature extraction module. Following spatial pooling, the features may be reshaped back to 512 dimensions.

To evaluate the effectiveness of the method according to this embodiment, it is compared with recent MIL-based methods such as DeepAttention MIL, ClassicMIL, and JointMIL. The recent 3D-based methods DeCovNet and Zhang3DCNN are also included in the comparison. For a fair evaluation, the same backbone feature extractor is used for all methods except the 3D methods, for which publicly available implementations are used.

Quantitative and qualitative results of the method proposed according to an embodiment of the present disclosure are presented below. In addition, ablation studies are carried out on the effects of the bag size, the attention modules with and without contrastive learning, and the weighting parameter $\lambda$.

Table 1 shows that DA-CMIL with the contrastive loss $L_F$ according to an embodiment of the present disclosure achieves the best overall performance, with an accuracy of 98.6% and an area under the curve (AUC) of 98.4%. Even when $L_F$ is not applied during training, the model according to the embodiment shows an accuracy of 93% (+2.9 over JointMIL) and an AUC of 93.4% (+2.5 over JointMIL), outperforming JointMIL, the best of the compared weakly supervised methods.

To further validate the method according to an embodiment of the present disclosure, DA-CMIL can be applied to randomly cropped patches of the CT samples. Table 2 shows that, for patches as well, the DA-CMIL method consistently outperforms the compared methods.

FIG. 4 shows graphs comparing the characteristics of the multiple instance learning model according to an embodiment of the present disclosure with those of other learning models.

FIG. 4 shows the receiver operating characteristic (ROC) curves of the compared methods on the different datasets. FIG. 5 shows a graph comparing the prediction performance of the multiple instance learning model according to an embodiment of the present disclosure with that of other learning models. FIG. 6 shows a graph comparing the prediction performance of the multiple instance learning model according to another embodiment of the present disclosure with that of other learning models.

Referring to FIG. 4, the method according to an embodiment of the present disclosure shows, overall, a higher true positive rate (TPR) and a lower false positive rate (FPR) in all settings. This is also borne out by the confusion matrix summaries of the compared methods shown in FIGS. 5 and 6. In FIGS. 5 and 6, CP (common pneumonia) denotes ordinary pneumonia, which in this embodiment may mean bacterial pneumonia, and NCP denotes COVID-19 pneumonia.
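For reference, TPR and FPR follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; treating NCP as the positive class is an assumption, and the counts in the usage note are made-up illustrative values, not results from the specification.

```python
def tpr_fpr(tp, fn, fp, tn):
    """Return (TPR, FPR) from confusion matrix counts.

    TPR = TP / (TP + FN): fraction of positive cases correctly detected.
    FPR = FP / (FP + TN): fraction of negative cases wrongly flagged positive.
    """
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr
```

For example, with hypothetical counts of 8 true positives, 2 false negatives, 1 false positive, and 9 true negatives, `tpr_fpr(8, 2, 1, 9)` gives a TPR of 0.8 and an FPR of 0.1.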

Effect of bag size, weighting parameter $\lambda$, and the dual attention modules on learning

To evaluate the effect of bag size during training of the method according to an embodiment of the present disclosure, an ablation study can be performed in which each bag consists of at most k instances (slices or patches) and k is varied. As shown in Table 3, the performance of DA-CMIL improves as the bag size increases.
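One way to build a bag of at most k instances is sketched below. The specification says only that each bag holds up to k instances; uniform random sampling without replacement, preserving slice order, and the name `build_bag` are assumptions made for illustration.

```python
import random


def build_bag(instances, k, rng=None):
    """Return at most k instances from one patient, keeping their original order.

    If the patient has k or fewer instances, all of them are used. Otherwise
    k instances are sampled uniformly without replacement (an assumption; the
    specification does not state how instances are selected).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if len(instances) <= k:
        return list(instances)
    rng = rng or random.Random(0)
    chosen = sorted(rng.sample(range(len(instances)), k))
    return [instances[i] for i in chosen]
```

Varying `k` in such a routine reproduces the setup of the bag-size ablation, with larger values exposing more slices or patches of the same patient to the model in each training step.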

DA-CMIL may use contrastive feature learning over multiple instances, with the weighting parameter $\lambda$ balancing the effect of the losses. Referring to Table 4, when $\lambda$ = 1.0, that is, when $L_F$ is not used, $L_F$ has no effect on learning, and the performance, at 93%, is lower than when $L_F$ is used.

In the framework of this embodiment, to evaluate the effect of attention, several settings may be considered in which the contrastive module and the attention modules are each used or not used (see Tables 1 and 2).

Excluding attention from the framework of this embodiment may require the following two modifications: (1) instead of pooling the spatial-attention-based feature maps, global average pooling (GAP) of the instance feature maps is used by default for simplicity; and (2) instead of attention-based aggregation of the bag-level feature, the average of the instance features is used to obtain the overall bag-level feature z together with z′. With these modifications, the evaluation can be carried out easily.
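The two simplifications used when attention is excluded, global average pooling of each instance feature map and plain averaging of the instance features, can be sketched as follows. Nested Python lists stand in for tensors, and the function names are hypothetical.

```python
def global_average_pool(feature_map):
    """Reduce a C x H x W feature map (nested lists) to a C-dimensional vector.

    Each channel is replaced by the mean of its H * W spatial values.
    """
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def mean_bag_feature(instance_features):
    """Average the per-instance feature vectors into one bag-level feature.

    Every instance contributes equally, in contrast to attention-based
    aggregation where instances are weighted by their attention scores.
    """
    n = len(instance_features)
    return [sum(values) / n for values in zip(*instance_features)]
```

For a 512×8×8 feature map, `global_average_pool` would return a 512-dimensional instance feature, and `mean_bag_feature` applied to all instances of a patient would give the bag-level feature z of this baseline.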

Clearly, the best performance is achieved when both the attention modules and the contrastive loss are part of training. When the attention modules are excluded, on the other hand, overall performance drops markedly (-20% relative to the best-performing method). A similar degradation may be observed on the CT patch dataset.

When only the contrastive feature loss is used without the attention modules, the results deteriorate, with no improvement over the compared methods. This confirms that the combination of techniques in this embodiment (that is, both attention and the contrastive feature loss) is beneficial, and that complementary learning leads to improved results.

Qualitative results

FIG. 7 shows spatial attention maps and instance attention scores derived from the multiple instance learning model according to an embodiment of the present disclosure.

Referring to FIG. 7, qualitative results can be examined on the basis of the spatial attention maps and the attention scores. They show that DA-CMIL can use coarse maps to find the key slices related to infected regions. Low attention scores are observed for slices without infected regions, such as noisy slices or artifacts, which further indicates the usefulness of the method of this embodiment. In addition, the attention maps focus on key regions such as ground-glass opacities and consolidation, which is consistent with clinical findings.

The attention maps obtained when contrastive learning is not applied are also shown. In general, these results are similar to the maps obtained when the loss is applied. For some CT slices, however, both the spatial maps and the scores change slightly, and the localization of key regions degrades, in particular because of large differences in the attention scores.

This is presumably because the contrastive loss is intended to encourage similarity between the representative features of a subject. The benefit of using both losses is better verified through the quantitative evaluation of classification performance. In other words, the attention mechanism of this embodiment is relevant in both cases and can be very useful for clinical evaluation.

Multi-focal ground-glass opacities (GGO) in the lateral parts of the lower lobes are the most common early finding on CT, whereas other characteristics such as pleural thickening are observed less commonly in the imaging manifestations, depending on the severity stage. This is consistent with most spatial attention maps being concentrated mainly in the lower regions.

In general, class activation maps (CAMs) may not indicate exact lesion locations because of resolution limitations. Moreover, particularly in a weakly supervised setting, it can be difficult to force the model to generate tissue-constrained maps without access to the actual lesion locations.

In this embodiment, the current results remain convincing even when the maps are normalized only over the tissue regions, and the high-density regions clearly correspond to clinically relevant regions, namely lesions and/or GGOs.

RT-PCR is the standard method for diagnosing COVID-19, but the test is time-consuming, so it can take a long time to obtain results. A CT-based examination, which can yield results within a few minutes, can therefore be regarded as a reasonable alternative to that test.

Accordingly, this embodiment implements a new deep learning (CNN) approach to COVID-19 diagnosis under weak supervision, with clinical implications. For rapid assessment, it is important to use fully automated and interpretable methods in real-world settings. Moreover, because the differences between COVID-19 and other types of pneumonia are so subtle that even experts in the field find them hard to distinguish, accurate diagnosis based on imaging characteristics is very important.

The method of this embodiment can be evaluated on a dataset in which only patient-level diagnostic labels are available, without the lesion or infection regions of interest that conventional methods commonly rely on.

To further validate the approach of this embodiment described above, the regions on which the model focuses can be examined qualitatively through the coarse attention maps together with the attention scores. The method of this embodiment achieves an AUC of 98.4%, an accuracy of 98.6%, and a true positive rate (TPR) of 96.8%. In addition, the obtained attention maps highlight the main infected regions in most samples, together with the attention scores corresponding to the key slices.

This embodiment also demonstrates empirically the benefit of using an unsupervised contrastive loss to complement supervised learning from patient labels, and it can serve as a basis for more complex methods. In addition, the method of this embodiment substantially outperforms the 3D-based methods. This may be because the 3D CT volume dataset is limited in size. DeCovNet in particular performs poorly, which may be because it was trained from scratch and has a custom deep architecture.

The performance of Zhang3DCNN is considerably better than that of the latter (DeCovNet), but it still did not reach comparable performance even when the model was trained for longer.

It is also worth noting that, because COVID-19 and bacterial pneumonia exhibit similar characteristics, models trained with conventional extensive augmentation did not achieve significant improvements across the evaluation metrics.

FIG. 8 shows CT slices represented in the embedding space of the multiple instance learning model according to an embodiment of the present disclosure.

To further illustrate the benefit of the technique of this embodiment in capturing the overall statistics of a single subject, FIG. 8 shows instances and representative features plotted in a 2D space. In particular, the aggregated feature (top left of the figure, blue dot) captures the characteristics of the key slices (red) while ignoring the noisy artifacts of the other slices, which are clustered closely together. This shows that, although no explicit labels are used for instance discovery, the model of this embodiment can effectively learn which slices are useful for patient classification.

In this embodiment, a 2D CNN framework that includes dual attention modules and contrastive feature learning under a multiple instance learning (MIL) framework can be applied to distinguish COVID-19 from bacterial pneumonia sub-types in chest CT. The performance of the method of this embodiment is confirmed on both the CT patch and the CT slice versions of the dataset. The ablation experiments also show the benefit of using a large bag size during training and the effect of the weighted loss on stable learning. Thus, in this embodiment, COVID-19 and bacterial pneumonia can be distinguished more accurately through weakly supervised multiple instance learning for COVID-19 screening.

FIG. 9 is a flowchart illustrating a training method of the multiple instance learning model according to an embodiment of the present disclosure.

The multiple instance learning method for 3D image analysis according to an embodiment of the present disclosure may be performed by at least one processor in a computing device or a computing network.

Referring to FIG. 1, the multiple instance learning method according to an embodiment of the present disclosure may be performed by the processor 110 of the multiple instance learning device 100.

The multiple instance learning device 100 may include a processor 110 and a memory 130, and the memory 130 may store the multiple instance learning model and various instructions. Although FIG. 1 shows only one memory, the memory storing the multiple instance learning model and the memory storing the instructions may be configured separately.

According to the multiple instance learning method of an embodiment of the present disclosure, the processor 110 may derive a feature map for each 2D instance of the 3D image input to the multiple instance learning model (S100). The block or module that derives the feature maps may include a convolution layer and may be referred to as a convolution module or convolution block.

The processor 110 may derive a spatial attention map of the instances from the feature maps derived by the convolution block (S200). The block or module that derives the spatial attention map may be referred to as a spatial attention block. The spatial attention map indicates which parts of each instance should be emphasized for the analysis the model is intended to perform. The parameters of the spatial attention block may be initialized randomly and are then optimized through the training process described above.

Thereafter, the processor 110 may receive the result of combining the feature map with the spatial attention map and derive an attention score for each instance (S300). Here, the feature map and the spatial attention map may be combined by element-wise multiplication. The processor 110 may then aggregate the embeddings of the instances according to the attention scores to derive an aggregated embedding for the 3D image (S400). These operations may be performed by an instance attention block (or latent attention block). The parameters of the instance attention block may be initialized randomly and are then optimized through the training process described above.
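Steps S300 and S400 can be sketched in plain Python as follows. Only the element-wise multiplication and the attention-weighted aggregation come from the description; the single-channel attention map shared across feature channels, the average pooling, the linear scoring vector `w`, and the softmax normalization of the scores are assumptions made for illustration, and all names are hypothetical.

```python
import math


def apply_spatial_attention(feature_map, attention_map):
    """Element-wise multiply a C x H x W feature map by an H x W attention map.

    The same attention map is applied to every channel (an assumption).
    """
    return [
        [[v * a for v, a in zip(f_row, a_row)]
         for f_row, a_row in zip(channel, attention_map)]
        for channel in feature_map
    ]


def spatial_pool(feature_map):
    """Average each channel over its spatial positions to get an embedding."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_bag(embeddings, w):
    """Score each instance embedding and return (scores, aggregated embedding).

    `w` is a hypothetical learned scoring vector; the scores sum to 1 and
    weight each instance's contribution to the aggregated embedding.
    """
    logits = [sum(wi * ei for wi, ei in zip(w, emb)) for emb in embeddings]
    scores = softmax(logits)
    dim = len(embeddings[0])
    bag = [sum(s * emb[d] for s, emb in zip(scores, embeddings)) for d in range(dim)]
    return scores, bag
```

In this sketch an instance with a low attention score contributes little to the aggregated embedding, which is the mechanism that lets uninformative slices be down-weighted without slice-level labels.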

The processor 110 may output an analysis result for the 3D image on the basis of the aggregated embedding (S500). As described above, the analysis result may indicate whether the CT scan image shows pneumonia caused by COVID-19 or pneumonia caused by another disease, and it may of course also relate to other diseases or to any other meaning conveyed by the image. This output may be produced by an output block.

Meanwhile, in the training process of the multiple instance learning model, the 3D image whose 2D instances are input may be data labeled with a ground-truth value. Here, labeling is done only for the 3D image, at the bag level, and the individual 2D slices are not labeled. As described above, even if a 3D image is labeled as COVID-19 pneumonia, some of its 2D instances may show no COVID-19 pneumonia lesions at all.

To train the multiple instance learning model, the processor 110 may use training data including labeled 3D images and perform training so that the value of the total loss function of the multiple instance learning model is minimized (S600).

Here, performing training may mean adjusting the parameters set in the blocks used in the steps of deriving the feature maps, deriving the spatial attention map, deriving the attention scores, and deriving the aggregated embedding, so that the total loss function (L) of the multiple instance learning model takes its minimum value.

Meanwhile, the total loss function may be a combination of a bag-level loss function ($L_B$) on the result of the output block and a contrastive loss function ($L_F$) between the instance embeddings and the aggregated embedding.

As described above, the present disclosure proposes a technique that uses a spatial attention structure and an instance attention structure together, so that it can automatically extract both the important instances and the parts within each instance that play a major role in classification.

In addition, the deep learning network is trained to minimize the contrastive loss function together with the bag-level loss function, so that the embedding representing the instances and the individual instances are well separated. With the proposed technique, the instances that improve bag-level classification performance are selected well, and the features are embedded so that the intra-class variation of the selected instances decreases while the distance between instances of different classes increases.

Therefore, better results can be obtained when the final bag-level classification is performed on the basis of the features extracted in this way. In addition, it was confirmed qualitatively that, by using the spatial attention structure and the instance-level attention structure in a single network, abnormal lesion sites can be located more accurately with image-level labels alone.

The embodiments of the present invention described above may be implemented in the form of a computer program that can be executed on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of the computer program include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

In the specification of the present invention (particularly in the claims), the term "the" and similar referential terms may refer to both the singular and the plural. In addition, where a range is described in the present invention, the invention includes embodiments to which each individual value within that range is applied (unless stated otherwise), which is equivalent to reciting each individual value constituting the range in the detailed description of the invention.

Unless an order is explicitly stated for the steps constituting the method according to the present invention, or there is a statement to the contrary, the steps may be performed in any suitable order. The present invention is not necessarily limited to the order in which the steps are described. The use of any examples or exemplary terms (for example, "etc.") in the present invention is merely intended to describe the present invention in detail, and the scope of the present invention is not limited by such examples or exemplary terms unless limited by the claims. In addition, those skilled in the art will appreciate that various modifications, combinations, and changes may be made according to design conditions and factors within the scope of the appended claims or their equivalents.

Although embodiments according to the present invention have been described above, they are merely exemplary, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible therefrom. Accordingly, the true technical scope of protection of the present invention should be determined by the following claims.

Claims

1. A multi-instance learning device for analysis of 3D images, comprising:
a memory in which a multi-instance learning model is stored; and
at least one processor electrically connected to the memory,
wherein the multi-instance learning model comprises:
a convolution block for deriving a feature map of each 2D instance of an input 3D image;
a spatial attention block for deriving a spatial attention map of the instances from the feature map derived by the convolution block;
an instance attention block that receives a result of combining the feature map and the spatial attention map, derives an attention score for each instance, and derives an aggregated embedding for the 3D image by aggregating embeddings of the instances according to the attention scores; and
an output block for outputting an analysis result for the 3D image based on the aggregated embedding.
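
The data flow recited above (feature map, spatial attention map, per-instance attention score, aggregated embedding) can be illustrated with a minimal sketch. It is not the claimed implementation: the convolution block is omitted and toy feature maps are supplied directly, the sigmoid-of-channel-mean spatial attention and the linear `tanh` scorer `w` are assumptions chosen for brevity, and all function names are hypothetical.

```python
import math

def spatial_attention(feature_map):
    """feature_map: C channels, each a flattened list of H*W activations.
    Assumption: one weight in (0, 1) per location, from a sigmoid of the channel mean."""
    n_channels = len(feature_map)
    n_locations = len(feature_map[0])
    return [
        1.0 / (1.0 + math.exp(-sum(ch[p] for ch in feature_map) / n_channels))
        for p in range(n_locations)
    ]

def instance_embedding(feature_map, attn_map):
    """Combine features with the spatial map, then average-pool each channel."""
    return [
        sum(v * a for v, a in zip(ch, attn_map)) / len(attn_map)
        for ch in feature_map
    ]

def softmax(xs):
    m = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def instance_attention(embeddings, w):
    """Score each instance with a hypothetical linear scorer w, normalize the
    scores with softmax, and return the scores and the weighted-sum embedding."""
    raw = [math.tanh(sum(wi * ei for wi, ei in zip(w, e))) for e in embeddings]
    scores = softmax(raw)
    dim = len(embeddings[0])
    bag = [sum(s * e[d] for s, e in zip(scores, embeddings)) for d in range(dim)]
    return scores, bag

# Toy "3D image": three 2D instances, each with 2 channels and 4 locations.
slices = [
    [[0.1, 0.2, 0.0, 0.1], [0.0, 0.1, 0.1, 0.0]],
    [[2.0, 1.5, 1.8, 2.2], [1.0, 1.2, 0.9, 1.1]],
    [[0.3, 0.1, 0.2, 0.2], [0.1, 0.0, 0.2, 0.1]],
]
embeddings = [instance_embedding(fm, spatial_attention(fm)) for fm in slices]
scores, bag_embedding = instance_attention(embeddings, w=[1.0, 1.0])
```

In this toy input the second instance has the strongest activations, so it receives the largest attention score and dominates the aggregated embedding, which an output block would then classify.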

2. The multi-instance learning device of claim 1,
wherein the processor is configured to, in a training phase of the multi-instance learning model, train the multi-instance learning model using training data including a 3D image labeled with a ground truth for the analysis result, so that an overall loss function (L) of the multi-instance learning model has a minimum value,
wherein the overall loss function is a combination of a bag-level loss function (L_B) for the result of the output block and a contrastive loss function (L_F) between the instance embeddings and the aggregated embedding.

3. The multi-instance learning device of claim 2,
wherein the operation of training the multi-instance learning model includes adjusting parameters of the convolution block, the spatial attention block, and the instance attention block so that the overall loss function (L) of the multi-instance learning model has a minimum value.

4. The multi-instance learning device of claim 2,
wherein the overall loss function (L) is expressed by Equation 1 below:
Equation 1

where the weighting parameter in Equation 1 is a value between 0 and 1 and represents the weight of the bag-level loss function.

5. The multi-instance learning device of claim 2,
wherein the bag-level loss function L_B is expressed by Equation 2 below, and the contrastive loss function L_F is expressed by Equation 3 below:
Equation 2

where the symbols of Equation 2 denote, in order, the instance label and the probability that the label of the 3D image is the given label.
Equation 3

where the symbols of Equation 3 denote, in order, the instance-level feature, the bag-level feature, an indicator that has a value of 1 if k is not equal to i and a value of 0 if k = i, the temperature parameter, and the similarity function, and N is the number of instances of the 3D image.
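
Because Equations 1 to 3 are not reproduced in this text, the following is only a sketch of losses that are consistent with the symbols listed above, not the claimed equations. It assumes a binary cross-entropy for the bag-level loss, an NT-Xent-style contrastive loss in which each instance feature is pulled toward the bag-level feature and pushed away from the other instances (k not equal to i), cosine similarity as the similarity function, and a convex combination with a weight `lam` between 0 and 1. All function names and default values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity, used here as an assumed similarity function."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def bag_cross_entropy(prob_positive, bag_label):
    """Assumed bag-level loss: binary cross-entropy on the predicted probability."""
    eps = 1e-12
    p = min(max(prob_positive, eps), 1.0 - eps)
    return -(bag_label * math.log(p) + (1 - bag_label) * math.log(1.0 - p))

def contrastive_loss(instance_feats, bag_feat, tau=0.5):
    """Assumed NT-Xent-style loss averaged over the N instances.
    Positive pair: (instance i, bag). Negatives: instances k with k != i."""
    n = len(instance_feats)
    total = 0.0
    for i, f_i in enumerate(instance_feats):
        pos = math.exp(cosine(f_i, bag_feat) / tau)
        neg = sum(
            math.exp(cosine(f_i, f_k) / tau)
            for k, f_k in enumerate(instance_feats)
            if k != i  # the indicator that is 1 when k != i and 0 when k = i
        )
        total += -math.log(pos / (pos + neg))
    return total / n

def total_loss(l_bag, l_contrast, lam=0.5):
    """Assumed combination: lam weights the bag-level loss, (1 - lam) the rest."""
    return lam * l_bag + (1.0 - lam) * l_contrast

# Toy usage with three 2-dimensional instance features and a bag feature.
feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
bag = [0.8, 0.2]
loss = total_loss(bag_cross_entropy(0.7, 1), contrastive_loss(feats, bag), lam=0.5)
```

With `lam` close to 1 the training signal comes almost entirely from the bag-level label, while smaller values give more weight to the unsupervised contrastive term.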

6. A multi-instance learning device for analysis of 3D images, comprising:
a memory storing at least one instruction; and
at least one processor operatively coupled to the memory to execute the instruction,
wherein the instruction, when executed by the processor, causes the processor to perform operations of:
deriving a feature map of each 2D instance of a 3D image input to a multi-instance learning model;
deriving a spatial attention map of the instances from the derived feature map;
deriving an attention score for each instance by receiving a result of combining the feature map and the spatial attention map;
deriving an aggregated embedding for the 3D image by aggregating embeddings of the instances according to the attention scores; and
outputting an analysis result for the 3D image based on the aggregated embedding.

7. The multi-instance learning device of claim 6,
wherein the processor is configured to, in a training phase of the multi-instance learning model, perform an operation of training the multi-instance learning model using training data including a 3D image labeled with a ground truth for the analysis result, so that an overall loss function (L) of the multi-instance learning model has a minimum value,
wherein the overall loss function is a combination of a bag-level loss function (L_B) for the result of the output block and a contrastive loss function (L_F) between the instance embeddings and the aggregated embedding.

8. The multi-instance learning device of claim 7,
wherein the operation of training the multi-instance learning model includes an operation of adjusting parameters used in the operation of deriving the feature map, the operation of deriving the spatial attention map, the operation of deriving the attention score, and the operation of deriving the aggregated embedding, so that the overall loss function (L) of the multi-instance learning model has a minimum value.

9. The multi-instance learning device of claim 7,
wherein the overall loss function (L) is expressed by Equation 1 below:
Equation 1

10. The multi-instance learning device of claim 7,
wherein the bag-level loss function L_B is expressed by Equation 2 below, and the contrastive loss function L_F is expressed by Equation 3 below:
Equation 2

where the symbols of Equation 2 denote, in order, the instance label and the probability that the label of the 3D image is the given label.
Equation 3

where the symbols of Equation 3 denote, in order, the instance-level feature, the bag-level feature, an indicator that has a value of 1 if k is not equal to i and a value of 0 if k = i, the temperature parameter, and the similarity function, and N is the number of instances of the 3D image.

11. A multi-instance learning method for 3D image analysis performed by at least one processor in a computing device or computing network, the method comprising the steps of:
deriving a feature map of each 2D instance of a 3D image input to a multi-instance learning model;
deriving a spatial attention map of the instances from the derived feature map;
deriving an attention score for each instance by receiving a result of combining the feature map and the spatial attention map;
deriving an aggregated embedding for the 3D image by aggregating embeddings of the instances according to the attention scores; and
outputting an analysis result for the 3D image based on the aggregated embedding.

12. The multi-instance learning method of claim 11,
wherein the 3D image is labeled with a ground truth for the analysis result,
the method further comprising the step of, in a training phase of the multi-instance learning model, training the multi-instance learning model using training data including the labeled 3D image so that an overall loss function (L) of the multi-instance learning model has a minimum value,
wherein the overall loss function is a combination of a bag-level loss function (L_B) for the result of the output block and a contrastive loss function (L_F) between the instance embeddings and the aggregated embedding.

13. The multi-instance learning method of claim 12,
wherein the step of training the multi-instance learning model includes adjusting parameters used in the step of deriving the feature map, the step of deriving the spatial attention map, the step of deriving the attention score, and the step of deriving the aggregated embedding, so that the overall loss function (L) of the multi-instance learning model has a minimum value.

14. The multi-instance learning method of claim 12,
wherein the overall loss function (L) is expressed by Equation 1 below:
Equation 1

15. The multi-instance learning method of claim 12,
wherein the bag-level loss function L_B is expressed by Equation 2 below, and the contrastive loss function L_F is expressed by Equation 3 below:
Equation 2

where the symbols of Equation 2 denote, in order, the instance label and the probability that the label of the 3D image is the given label.
Equation 3

where the symbols of Equation 3 denote, in order, the instance-level feature, the bag-level feature, an indicator that has a value of 1 if k is not equal to i and a value of 0 if k = i, the temperature parameter, and the similarity function, and N is the number of instances of the 3D image.