KR102223116B1

KR102223116B1 - Image analysis method and apparatus

Info

Publication number: KR102223116B1
Application number: KR1020200059363A
Authority: KR
Inventors: 김민철; 박종찬; 나세일; 유동근; 박창민; 황의진
Original assignee: 주식회사 루닛; 서울대학교병원
Priority date: 2020-05-18
Filing date: 2020-05-18
Publication date: 2021-03-04

Abstract

A computing device extracts a first feature map from an input image based on a neural network, generates a plurality of attention maps from the first feature map based on the neural network, and generates context information based on the plurality of attention maps. Also, the computing device generates a second feature map by combining the context information with the first feature map, and outputs a reading result of the input image from the second feature map based on the neural network. The present invention provides an image reading method and device capable of considering other areas.

Description

Image reading method and apparatus {IMAGE ANALYSIS METHOD AND APPARATUS}

The present invention relates to an image reading method and apparatus.

Recently, machine learning models using neural networks have been applied to medical image reading, improving both its accuracy and its speed. In particular, reading systems have been developed that analyze X-ray images to mark regions with abnormal findings such as lung nodules and present the likelihood of each finding as an index.

However, when a doctor assesses an abnormal region, the doctor may look not only at the target region but also for a region that is meaningful to compare it with, and may judge the abnormality of the target region from the difference between the two. For example, when assessing a lung nodule, a region in one lung that is suspected to be a nodule may be compared with the similar region in the other lung to decide whether it is actually a nodule. A machine learning model, however, can judge whether a region is a lung nodule only from the feature information of the suspected region itself, and does not consider other regions.

In a neural network, the receptive field can be widened in order to consider regions outside it. However, enlarging the machine learning model to increase the receptive field requires substantial computing resources and can increase both the memory capacity needed to store data and the inference time. In addition, increasing the receptive field through pooling operations can lose high-resolution texture information that is important for inference, which may degrade inference performance.

"CBAM: Convolutional Block Attention Module", Sanghyun Woo, Jongchan Park, Joon-Young Lee, In So Kweon, The European Conference on Computer Vision (ECCV), 2018

An object of the present invention is to provide an image reading method and apparatus that can take other regions into account.

According to one embodiment of the present invention, an image reading method performed by a computing device is provided. The computing device extracts a first feature map from an input image based on a neural network, generates a plurality of attention maps from the first feature map based on the neural network, generates context information based on the plurality of attention maps, generates a second feature map by combining the context information with the first feature map, and outputs a reading result of the input image from the second feature map based on the neural network.

The computing device may generate a plurality of vectors based on the plurality of attention maps and the first feature map, and may generate the context information by comparing the plurality of vectors.

The computing device may generate the plurality of attention maps by applying 1×1 convolution operations having different parameters to the first feature map, and may generate each vector by taking a weighted average of the first feature map based on the corresponding attention map.

The computing device may normalize each attention map before taking the weighted average of the first feature map.
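The normalization and weighted averaging described above can be sketched roughly as follows. This is only an illustration: the text does not say which normalization is used, so a spatial softmax is assumed here, and the function names `spatial_softmax` and `weighted_average` are hypothetical.

```python
import numpy as np

def spatial_softmax(logits):
    """Normalize an (H, W) attention map so its weights are positive and sum to 1.

    Softmax is an assumption; the source only says the map is normalized.
    """
    shifted = logits - logits.max()   # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def weighted_average(feature_map, attention):
    """Average a (C, H, W) feature map over space with (H, W) weights.

    Returns a C-dimensional vector, one value per channel.
    """
    return np.tensordot(feature_map, attention, axes=([1, 2], [0, 1]))
```

With a uniform attention map the result reduces to the plain spatial mean of each channel, which is a convenient sanity check.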

The 1×1 convolution operation may be a group convolution operation performed after grouping the plurality of channels of the first feature map into a predetermined number of channel groups.

The computing device may calculate a loss of the context information based on the plurality of vectors, and may update the neural network by backpropagating the loss to the neural network.

The computing device may calculate the loss based on an orthogonal loss of the plurality of vectors.
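The text here does not give a formula for the orthogonal loss. One common way to encourage a set of vectors to be mutually orthogonal is to penalize the squared cosine similarity between distinct pairs. The sketch below uses that formulation as an assumption, not as the definition used in the patent; the name `orthogonal_loss` and the `eps` parameter are hypothetical.

```python
import numpy as np

def orthogonal_loss(vectors, eps=1e-8):
    """Mean squared cosine similarity over distinct pairs of K vectors.

    vectors: (K, C) array with K >= 2. The loss is about 0 when the vectors
    are mutually orthogonal and about 1 when they are all parallel.
    This formulation is assumed for illustration.
    """
    unit = vectors / (np.linalg.norm(vectors, axis=1, keepdims=True) + eps)
    gram = unit @ unit.T                            # (K, K) cosine similarities
    k = vectors.shape[0]
    off_diagonal = gram[~np.eye(k, dtype=bool)]     # drop self-similarities
    return float(np.mean(off_diagonal ** 2))
```

Minimizing such a term would push the region vectors, and therefore the attention maps that produce them, toward attending to different content.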

The computing device may normalize the first feature map before generating the plurality of attention maps.

The computing device may normalize the first feature map by subtracting a C-dimensional average vector of the first feature map from the first feature map. Here, C denotes the number of channels of the first feature map.
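As a minimal sketch of this normalization, assuming the C-dimensional average vector is the per-channel mean over all spatial positions (the function name `center_feature_map` is hypothetical):

```python
import numpy as np

def center_feature_map(feature_map):
    """Subtract the per-channel spatial mean from a (C, H, W) feature map.

    Returns the centered map and the C-dimensional average vector, which a
    later step can reuse (for example, to build a channel feature vector).
    """
    mean_vector = feature_map.mean(axis=(1, 2))            # shape (C,)
    centered = feature_map - mean_vector[:, None, None]    # broadcast over H, W
    return centered, mean_vector
```

After centering, every channel has zero spatial mean, so the attention maps computed from it respond to deviations from the image-wide average rather than to the average itself.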

The computing device may generate, based on the average vector, a channel feature vector that reduces the magnitude of a specific channel among the C channels, and may combine the channel feature vector with the feature map obtained by combining the context information with the first feature map.

The computing device may generate the channel feature vector by applying a 1×1 convolution operation and an activation function to the average vector.

The 1×1 convolution operation may be a group convolution operation performed after grouping the C channels of the average vector into a predetermined number of channel groups.
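A rough sketch of the channel feature vector follows. On a C-dimensional vector, a grouped 1×1 convolution amounts to a separate small matrix multiplication inside each channel group. The sigmoid activation, the omission of a bias term, and the name `channel_feature_vector` are assumptions made for illustration; the text only specifies "a 1×1 convolution operation and an activation function".

```python
import numpy as np

def channel_feature_vector(mean_vector, group_weights):
    """Grouped 1x1 convolution plus sigmoid applied to a C-dimensional vector.

    mean_vector:   (C,) average vector of the feature map.
    group_weights: (G, C // G, C // G), one weight block per channel group.
    Returns a (C,) vector of gates in (0, 1). Sigmoid is assumed here because
    a gate below 1 can reduce the magnitude of a channel.
    """
    groups, out_per_group, in_per_group = group_weights.shape
    grouped = mean_vector.reshape(groups, in_per_group)        # split channels
    mixed = np.einsum("goi,gi->go", group_weights, grouped)    # per-group mixing
    return 1.0 / (1.0 + np.exp(-mixed.reshape(-1)))
```

Compared with an ungrouped 1×1 convolution, which needs C×C weights, grouping into G groups needs only C×C/G weights.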

When the first feature map has dimensions C×H×W, the context information may have dimensions C×1×1. Here, C, H, and W denote the number of channels, the height, and the width of the first feature map, respectively.

The computing device may generate a plurality of second attention maps from the second feature map based on the neural network, generate second context information based on the plurality of second attention maps, generate a third feature map by combining the second context information with the second feature map, and output the reading result of the input image from the third feature map based on the neural network.

The computing device may calculate a loss of the context information, calculate a loss of the second context information, calculate a context loss based on the loss of the context information and the loss of the second context information, and update the neural network by backpropagating the context loss to the neural network.

According to another embodiment of the present invention, an image reading apparatus including a memory that stores instructions and a processor is provided. By executing the instructions, the processor extracts a first feature map from an input image based on a neural network, generates from the first feature map, based on the neural network, a first vector representing a region of interest and a second vector representing a region corresponding to the region of interest, generates context information based on the first vector and the second vector, generates a second feature map by combining the context information with the first feature map, and outputs a reading result of the input image from the second feature map based on the neural network.

The processor may generate a plurality of attention maps from the first feature map based on the neural network, and may generate the first vector and the second vector based on the plurality of attention maps.

According to yet another embodiment of the present invention, a computer program executed by a computing device and stored in a recording medium is provided. The computer program causes the computing device to execute the steps of: extracting a first feature map from an input image based on a neural network; generating from the first feature map, based on the neural network, a first vector representing a region of interest and a second vector representing a region corresponding to the region of interest; generating context information based on the first vector and the second vector; generating a second feature map by combining the context information with the first feature map; and outputting a reading result of the input image from the second feature map based on the neural network.

FIG. 1 is a diagram illustrating a learning device and a learning environment according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a neural network of a learning device according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating an image reading method according to an embodiment of the present invention.
FIGS. 4, 5, and 6 are diagrams illustrating neural networks according to various embodiments of the present invention.
FIG. 7 is a flowchart illustrating a method of generating context information according to an embodiment of the present invention.
FIGS. 8, 9, and 10 are diagrams illustrating context modules according to various embodiments of the present invention.
FIG. 11 is a flowchart illustrating a learning method according to an embodiment of the present invention.
FIG. 12 is a diagram illustrating a learning method according to an embodiment of the present invention.
FIG. 13 is a diagram illustrating a context module according to another embodiment of the present invention.
FIG. 14 is a diagram illustrating a neural network according to another embodiment of the present invention.
FIG. 15 is a diagram illustrating a neural network according to yet another embodiment of the present invention.
FIG. 16 is a diagram illustrating a computing device according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be implemented in many different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar reference numerals denote similar parts throughout the specification.

In the following description, an expression written in the singular may be interpreted as singular or plural unless an explicit expression such as "one" or "single" is used.

In the following description, terms including ordinal numbers, such as first and second, may be used to describe various elements, but the elements are not limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be referred to as a second element, and similarly, a second element may be referred to as a first element.

In the flowcharts described with reference to the drawings, the order of operations may be changed, several operations may be merged, an operation may be divided, and a specific operation may not be performed.

In the following description, a target model may refer to a model that performs a task and that is to be built through machine learning. Because the target model can be implemented on the basis of any machine learning model, including a neural network, the present invention is not limited by how the target model is implemented.

In addition, a neural network is a term that encompasses all kinds of machine learning models designed to mimic neural structures. For example, a neural network may include any kind of neural-network-based model, such as an artificial neural network (ANN) or a convolutional neural network (CNN).

Next, an image reading method and apparatus according to embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram illustrating a learning device and a learning environment according to an embodiment of the present invention, and FIG. 2 is a diagram illustrating a neural network of a learning device according to an embodiment of the present invention.

Referring to FIG. 1, the learning device 100 for image reading is a computing device that performs machine learning on a neural network in order to perform a target task. In some embodiments, the target task may include a task for image-based diagnosis, for example a task associated with visual recognition. In one embodiment, the target task may include a task of detecting a lesion from an image.

In some embodiments, the computing device may be a notebook, a desktop, a laptop, a server, or the like, but is not limited thereto and may be any kind of device equipped with a computing function. An example of such a computing device is described with reference to FIG. 16.

Although FIG. 1 shows an example in which the learning device 100 is implemented as a single computing device, in an actual physical environment the functions of the learning device 100 may be implemented by a plurality of computing devices.

The learning device 100 may train a neural network (for example, a convolutional neural network) using a data set 110 that includes a plurality of training samples. Each training sample may include an image 111 labeled with a correct answer 111a. In some embodiments, the learning device 100 may train the neural network by inputting the training sample 111 into the neural network, performing the target task to obtain a predicted value, and backpropagating to the neural network the loss between the predicted value and the correct answer 111a labeled on the training sample. The learning device 100 may predict a result 130 (for example, a lesion) by inputting an image 120 into the trained neural network and performing the target task. That is, the learning device 100 may read an image by performing the target task based on the trained neural network.

In some embodiments, the neural network may include a plurality of layers 210, as shown in FIG. 2. In some embodiments, the plurality of layers 210 may include a convolutional layer.

In some embodiments, the first layer 210 of the plurality of layers 210 may receive an input image and extract features from the input image. These features may be expressed in the form of a feature map. Each of the other layers 210 may extract features from the features passed on by the preceding layer 210 and pass them to the next layer 210. In some embodiments, a layer 210 may extract features by applying its operation to the input image or the input features. In some embodiments, a convolutional layer 210 may extract features by applying a convolution filter to the input image or the input features. The convolution filter may extract features by performing a convolution operation on a receptive field of the input image or the input features.

In some embodiments, the plurality of layers 210 may further include other layers such as a pooling layer and an activation layer. The pooling layer may extract features by performing a pooling operation on the input features. The activation layer performs a nonlinear transformation on its input data through an activation function, which may include a sigmoid function, a ReLU (Rectified Linear Unit) function, and the like.
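For reference, the two activation functions named above have the standard definitions sketched below (the function names are chosen for illustration):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Rectified Linear Unit: passes positive values and zeroes out the rest."""
    return np.maximum(x, 0.0)
```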

In some embodiments, the plurality of layers 210 may further include a fully connected layer that performs classification from the features output through the various layers 210. In some embodiments, a plurality of fully connected layers may be provided, and the fully connected layers may provide a probability for each label.

When each of the plurality of layers 210 is expressed as a function, the neural network may be expressed as a composite function of a plurality of functions (f_1, f_2, …, f_N) corresponding to the plurality of layers, as shown in Equation 1.

In Equation 1, X is the input image, Y is the final output of the plurality of layers 210, that is, the task output for the target task, and h_n is the feature extracted by the layer corresponding to each function f_n(·) (n is an integer from 1 to N-1).
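Equation 1 itself is not reproduced in this text. Based on the surrounding definitions (a composition of f_1, …, f_N with input X, output Y, and intermediate features h_n), it presumably has a form like the following; the symbol h_0 for the input is introduced here only for convenience and is an assumption.

```latex
\begin{aligned}
Y   &= f_N\bigl(f_{N-1}(\cdots f_2(f_1(X)) \cdots)\bigr), \\
h_n &= f_n(h_{n-1}), \quad n = 1, \dots, N-1, \qquad h_0 = X, \\
Y   &= f_N(h_{N-1}).
\end{aligned}
```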

FIG. 3 is a flowchart illustrating an image reading method according to an embodiment of the present invention.

Referring to FIG. 3, the computing device inputs an input image into the neural network and extracts a feature map from the input image (S310). In some embodiments, the computing device may extract the feature map from the input image through at least one layer.

The computing device extracts, based on the neural network, context information for accurate reading of the input image from the feature map (S320), and combines the context information with the feature map (S330). Because the context information is the result of comparing features that represent important regions of the feature map, combining it with the feature map makes the comparison result for those regions easier to identify. The features of a region of interest and of its corresponding region can thus be compared without enlarging the receptive field, which avoids increases in computing resources, memory capacity, and inference time. It can also prevent the loss of high-resolution texture information that is important for inference.

The computing device reads the image based on the combined feature map in the neural network (S340). In some embodiments, the computing device may extract a further feature map from the combined feature map through at least one layer and read the image based on the extracted feature map. In some embodiments, the computing device may again extract context information from the feature map that was extracted from the combined feature map through at least one layer, and may combine that context information with the feature map.
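The flow of steps S310 to S340 can be sketched as a simple pipeline. Everything below is hypothetical scaffolding: `read_image` and the four `toy_*` stand-ins are not from the source, and the real steps would be learned network layers and a context module rather than these fixed functions.

```python
import numpy as np

def read_image(image, extract, context_module, combine, head):
    """Run the four steps of the reading method in order."""
    feature_map = extract(image)                 # S310: extract a feature map
    context = context_module(feature_map)        # S320: extract context information
    combined = combine(feature_map, context)     # S330: combine it with the feature map
    return head(combined)                        # S340: read the image

# Toy stand-ins, used only to make the sketch runnable.
def toy_extract(image):
    return np.stack([image, image ** 2])                     # (2, H, W)

def toy_context(feature_map):
    return feature_map.mean(axis=(1, 2), keepdims=True)      # (C, 1, 1)

def toy_combine(feature_map, context):
    return feature_map + context                             # broadcast sum

def toy_head(feature_map):
    return float(feature_map.mean())                         # scalar "reading"
```

Passing the steps in as callables mirrors the point made in the text that the context step can be repeated: the output of `combine` could be fed through further layers and another context module before the final read-out.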

FIGS. 4, 5, and 6 are diagrams illustrating neural networks according to various embodiments of the present invention.

Referring to FIG. 4, the neural network 400 includes a first layer 410, a context module 420, a combining module 430, and a second layer 440.

The first layer 410 receives an image or a feature map extracted by a preceding layer, and extracts a feature map h_{n-1} from the input image or input feature map. In some embodiments, the first layer 410 may include any one of the plurality of layers 210 described with reference to FIG. 2.

The context module 420 generates a plurality of attention maps from the feature map h_{n-1} extracted by the first layer 410, generates a plurality of region feature vectors each representing the region corresponding to one of the attention maps, and generates context information c_n(h_{n-1}) based on the result of comparing the plurality of vectors. In some embodiments, the context module 420 may include a layer for generating the attention maps.

The combining module 430 combines the context information c_n(h_{n-1}) generated by the context module 420 with the feature map h_{n-1} extracted by the first layer 410. Because the vectors generated in the context module 420 are features representing important regions of the input feature map, combining the context information c_n(h_{n-1}) with the feature map h_{n-1} in the combining module 430 makes the comparison result for the important regions of the feature map easier to identify.

The second layer 440 processes the feature map combined in the combining module 430. In some embodiments, the second layer 440 may extract a feature map h_n from the feature map combined in the combining module 430. In some embodiments, the second layer 440 may include any one of the plurality of layers 210 described with reference to FIG. 2.

Referring to FIG. 5, a neural network 500 according to another embodiment includes a first layer 510, a context module 520, a combining module 530, and a second layer 540. The first layer 510, the context module 520, and the second layer 540 operate similarly to the first layer 410, the context module 420, and the second layer 440 described with reference to FIG. 4, so their description is omitted.

As shown in FIG. 5, the neural network 500 includes a summation module as the combining module 530. The summation module 530 may add the context information c_n(h_{n-1}) generated by the context module 520 to the feature map h_{n-1} extracted by the first layer 510.

Referring to FIG. 6, a neural network 600 according to yet another embodiment includes a first layer 610, a context module 620, a combining module 630, and a second layer 640. The first layer 610, the context module 620, and the second layer 640 operate similarly to the first layer 410, the context module 420, and the second layer 440 described with reference to FIG. 4, so their description is omitted.

As shown in FIG. 6, the neural network 600 includes a multiplication module as the combining module 630. The multiplication module 630 may multiply the feature map h_{n-1} extracted by the first layer 610 by the context information c_n(h_{n-1}) generated by the context module 620.
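The summation and multiplication variants of the combining module can be sketched together. The sketch assumes the context information has shape C×1×1, as stated earlier for a C×H×W feature map, so that it broadcasts over the spatial dimensions; the function name `combine` and its `mode` argument are hypothetical.

```python
import numpy as np

def combine(feature_map, context, mode="sum"):
    """Combine C x 1 x 1 context information with a C x H x W feature map.

    mode="sum" corresponds to a summation module, mode="mul" to a
    multiplication module. The context is broadcast over H and W, so every
    spatial position of a channel receives the same offset or scale.
    """
    channels = feature_map.shape[0]
    ctx = np.asarray(context, dtype=float).reshape(channels, 1, 1)
    if mode == "sum":
        return feature_map + ctx
    if mode == "mul":
        return feature_map * ctx
    raise ValueError(f"unknown mode: {mode!r}")
```

In both modes the output keeps the C×H×W shape of the input, so the following layer does not need to change.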

Although FIGS. 4 to 6 show a single context module inserted between two layers, a plurality of context modules may be used when the neural network includes a plurality of layers. In this case, each context module may be inserted between two of the plurality of layers.

Next, context modules and methods of generating context information according to various embodiments of the present invention are described with reference to FIGS. 7 to 10.

FIG. 7 is a flowchart illustrating a method of generating context information according to an embodiment of the present invention.

Referring to FIG. 7, the computing device receives a feature map from a preceding layer (S710) and generates a plurality of attention maps from the input feature map (S720).

The computing device applies each attention map to the input feature map to generate a region feature vector representing the region corresponding to that attention map (S730). The computing device then generates context information by comparing the plurality of region feature vectors (S740). Because the region feature vectors are features representing important regions of the input feature map, the context information can provide the result of comparing the important regions of the feature map.
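Steps S730 and S740 can be sketched as follows. The comparison operation is not specified at this point in the text, so an element-wise difference between two region feature vectors is assumed purely for illustration; the function name `context_information` is hypothetical, and the attention maps are assumed to be already normalized.

```python
import numpy as np

def context_information(feature_map, attention_maps):
    """Build C x 1 x 1 context information from attention maps.

    feature_map:    (C, H, W) input feature map.
    attention_maps: (K, H, W) maps, K >= 2, each assumed to sum to 1.
    """
    # S730: one region feature vector per attention map, shape (K, C).
    region_vectors = np.einsum("khw,chw->kc", attention_maps, feature_map)
    # S740 (assumed): compare the first two region vectors by their difference.
    difference = region_vectors[0] - region_vectors[1]
    return difference.reshape(-1, 1, 1)
```

If both attention maps attend to regions with the same features, the difference is close to zero and the context adds little; a large difference signals that the two regions disagree, which is the kind of left-versus-right lung contrast motivated earlier.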

FIGS. 8, 9, and 10 are diagrams illustrating context modules according to various embodiments of the present invention. In FIGS. 8, 9, and 10, for convenience of description, the feature map h_{n-1} input from the previous layer to the context module is denoted by X.

Referring to FIG. 8, a context module 800 according to an embodiment includes an attention module 810, a vector generation module 820, and a comparison module 830.

The attention module 810 receives a feature map X from a previous layer and generates a plurality of attention maps from the input feature map X. When the dimension of the input feature map X is C×H×W (X ∈ R^{C×H×W}), the attention module 810 may generate a plurality of attention maps each having a size of H×W. For convenience of description, FIG. 8 illustrates a case in which two attention maps are generated. Here, C denotes the number of channels of the input feature map, H denotes its height, and W denotes its width. In some embodiments, each attention map may be a single-channel attention map in R^{1×H×W}.

In some embodiments, the plurality of attention maps generated by the attention module 810 may differ from each other. In some embodiments, the attention module 810 may generate an attention map by applying a 1×1 convolution operation to the input feature map X. In this case, the attention module 810 may generate the attention map by applying a learnable parameter to the input feature map X. The attention module 810 may generate different attention maps by applying different learnable parameters (W_K, W_Q) to the input feature map X. In some embodiments, the learnable parameters (W_K, W_Q) may be the weights of the 1×1 convolution operation (W_K ∈ R^{C×1×1}, W_Q ∈ R^{C×1×1}). Accordingly, the attention module 810 may generate the attention maps by applying a 1×1 convolution operation to the input feature map X as shown in Equation 2.

In Equation 2, X_{i,j} denotes the vector at spatial position (i, j) of the input feature map, A^K_{i,j} denotes the vector at spatial position (i, j) of the attention map to which the parameter W_K is applied, and A^Q_{i,j} denotes the vector at spatial position (i, j) of the attention map to which the parameter W_Q is applied.

Next, the vector generation module 820 applies each attention map to the input feature map to generate region feature vectors (K, Q), each representing the region that corresponds to its attention map. The generated region feature vectors (K, Q) may be C×1×1-dimensional vectors (K ∈ R^{C×1×1}, Q ∈ R^{C×1×1}). In some embodiments, the vector generation module 820 may generate the region feature vectors (K, Q) based on the sum of the results of an element-wise operation (for example, element-wise multiplication) between the attention map and the input feature map. The two region feature vectors (K, Q) generated in this way may represent features that focus on two spatial regions of the input feature map X. That is, the region feature vectors (K, Q) can be used as features representing important regions of the input feature map X.

Of the two region feature vectors (K, Q), one region feature vector (K) may be a feature representing a region of interest, and the other region feature vector (Q) may be a feature representing another region (that is, the context) corresponding to the region of interest.

The comparison module 830 outputs the result of comparing the plurality of region feature vectors (K, Q) generated from the plurality of attention maps. In some embodiments, the comparison module 830 may output the result (K−Q) of subtracting one region feature vector from the other. In this case, the two region feature vectors may be subtracted element-wise.
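
The FIG. 8 pipeline described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the function name `context_module`, the random test data, and the absence of a bias term are all hypothetical. It relies only on the fact that a 1×1 convolution with one output channel is a per-position dot product over the channel axis.

```python
import numpy as np

def context_module(X, W_K, W_Q):
    """Sketch of a FIG. 8-style context module (hypothetical helper).

    X: feature map of shape (C, H, W); W_K, W_Q: 1x1 convolution weights of shape (C,).
    """
    # A 1x1 convolution with a single output channel is a dot product over channels.
    A_K = np.einsum("c,chw->hw", W_K, X)  # attention map, shape (H, W)
    A_Q = np.einsum("c,chw->hw", W_Q, X)
    # Element-wise product of each attention map with X, summed over spatial positions.
    K = (A_K[None] * X).sum(axis=(1, 2))  # region feature vector, shape (C,)
    Q = (A_Q[None] * X).sum(axis=(1, 2))
    return K - Q  # element-wise comparison result, shape (C,)

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3, 3))
context = context_module(X, rng.standard_normal(4), rng.standard_normal(4))
print(context.shape)  # (4,)
```

The returned vector has one entry per channel, so it can be broadcast over the H×W positions when it is combined with the feature map.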

Referring to FIG. 9, a context module 900 according to another embodiment includes an attention module 910, a normalization module 940, a vector generation module 920, and a comparison module 930. For convenience of description, FIG. 9 illustrates a case in which two attention maps are generated. The attention module 910, the vector generation module 920, and the comparison module 930 may operate similarly to the attention module 810, the vector generation module 820, and the comparison module 830 described with reference to FIG. 8, and a detailed description thereof is omitted.

The attention module 910 receives a feature map X from a previous layer and generates a plurality of attention maps from the input feature map X. In some embodiments, the attention module 910 may generate different attention maps by applying different learnable parameters (W_K, W_Q) to the input feature map X. In some embodiments, the learnable parameters (W_K, W_Q) may be the weights of a 1×1 convolution operation (W_K ∈ R^{C×1×1}, W_Q ∈ R^{C×1×1}).

The normalization module 940 normalizes the attention maps generated by the attention module 910. In some embodiments, the normalization module 940 may normalize the attention maps based on a softmax function. For example, the normalization module 940 may normalize the attention maps as shown in Equation 3. Equation 3 gives the normalized value for the vector X_{i,j} ∈ R^{C×1×1} at spatial position (i, j) of the input feature map.

In Equation 3, X_{i,j} denotes the vector at spatial position (i, j) of the input feature map, A^K_{i,j} denotes the normalized vector at spatial position (i, j) of the attention map to which the parameter W_K is applied, and A^Q_{i,j} denotes the normalized vector at spatial position (i, j) of the attention map to which the parameter W_Q is applied.

The vector generation module 920 applies each attention map to the input feature map to generate region feature vectors (K, Q), each representing the region that corresponds to its attention map. The generated region feature vectors (K, Q) may be C×1×1-dimensional vectors (K ∈ R^{C×1×1}, Q ∈ R^{C×1×1}). In some embodiments, the vector generation module 920 may generate the region feature vectors (K, Q) by taking a weighted average of the input feature map X based on the normalized attention maps. For example, the vector generation module 920 may generate the region feature vectors as shown in Equation 4.

The comparison module 930 outputs the result of comparing the plurality of region feature vectors (K, Q) generated from the plurality of attention maps. In some embodiments, the comparison module 930 may output the result (K−Q) of subtracting one region feature vector from the other.
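
The softmax normalization and weighted averaging of the FIG. 9 embodiment can be sketched as below. The helper names `spatial_softmax` and `region_vector` are hypothetical, and the sketch assumes the softmax is taken over all H×W spatial positions of a single-channel attention map, which is one natural reading of the description rather than a confirmed detail.

```python
import numpy as np

def spatial_softmax(A):
    """Softmax over all H*W positions of a single-channel attention map (assumed axis)."""
    e = np.exp(A - A.max())  # subtract the maximum for numerical stability
    return e / e.sum()

def region_vector(X, W):
    """Weighted average of X (C, H, W) under the normalized attention map.

    W: 1x1 convolution weights of shape (C,).
    """
    A = spatial_softmax(np.einsum("c,chw->hw", W, X))  # weights sum to 1
    return (A[None] * X).sum(axis=(1, 2))  # shape (C,)
```

Because the normalized weights sum to one, each region feature vector stays on the same scale as the feature map regardless of H and W, which the unnormalized sum of FIG. 8 does not guarantee.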

Although examples in which the context module generates two attention maps have been described above, the number of attention maps generated by the context module is not limited to two, and three or more attention maps may be generated. An example in which three attention maps are generated is described below with reference to FIG. 10.

Referring to FIG. 10, a context module 1000 according to yet another embodiment includes an attention module 1010, a vector generation module 1020, and a comparison module 1030. The attention module 1010 and the vector generation module 1020 may operate similarly to the attention module 810 and the vector generation module 820 described with reference to FIG. 8, and a detailed description thereof is omitted.

The attention module 1010 receives a feature map X from a previous layer and generates a plurality of attention maps from the input feature map X. In some embodiments, the attention module 1010 may generate different attention maps by applying different learnable parameters (W_K, W_Q, W_J) to the input feature map X. In some embodiments, the learnable parameters (W_K, W_Q, W_J) may be the weights of a 1×1 convolution operation (W_K ∈ R^{C×1×1}, W_Q ∈ R^{C×1×1}, W_J ∈ R^{C×1×1}).

The vector generation module 1020 applies each attention map to the input feature map to generate region feature vectors (K, Q, J), each representing the region that corresponds to its attention map. The generated region feature vectors (K, Q, J) may be C×1×1-dimensional vectors (K ∈ R^{C×1×1}, Q ∈ R^{C×1×1}, J ∈ R^{C×1×1}). The three region feature vectors (K, Q, J) generated in this way may represent features that focus on three spatial regions of the input feature map X.

The comparison module 1030 outputs a final comparison result based on the comparison result of one pair of region feature vectors (K, Q) and the comparison result of another pair of region feature vectors (Q, J), both pairs being taken from the plurality of region feature vectors (K, Q, J) generated from the plurality of attention maps. In some embodiments, the comparison module 1030 may include a subtraction module 1031 that subtracts the two region feature vectors (K, Q) of one pair and a subtraction module 1032 that subtracts the two region feature vectors (Q, J) of the other pair. The comparison module 1030 may further include a combining module 1033 that outputs the final comparison result based on the subtraction result (K−Q) of the subtraction module 1031 and the subtraction result (Q−J) of the subtraction module 1032. In some embodiments, the combining module 1033 may combine the two subtraction results (K−Q, Q−J) to generate the final comparison result. The combining module 1033 may use various operations to combine the subtraction results.
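
A sketch of the three-vector comparison follows. The description leaves the combining operation of module 1033 open, so the concatenation used here is purely an assumption chosen for illustration, and `compare_three` and `region_vector` are hypothetical names.

```python
import numpy as np

def region_vector(X, W):
    """Unnormalized region feature vector: attention map times X, summed over space."""
    A = np.einsum("c,chw->hw", W, X)
    return (A[None] * X).sum(axis=(1, 2))

def compare_three(X, W_K, W_Q, W_J, combine=None):
    """Compare (K, Q) and (Q, J), then combine the two differences."""
    K, Q, J = (region_vector(X, W) for W in (W_K, W_Q, W_J))
    d1 = K - Q  # role of subtraction module 1031
    d2 = Q - J  # role of subtraction module 1032
    if combine is None:
        # Assumed combining operation; the source does not fix one.
        combine = lambda a, b: np.concatenate([a, b])
    return combine(d1, d2)
```

Concatenation is used in the sketch because a plain sum of the two differences would cancel Q and reduce to K − J; any other operation can be passed through the `combine` argument.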

In some embodiments, the normalization module may also be applied to a context module that generates three or more attention maps, such as the one described with reference to FIG. 10.

Next, a training method for an image reading apparatus according to an embodiment of the present invention will be described with reference to FIGS. 11 and 12.

FIG. 11 is a flowchart illustrating a training method according to an embodiment of the present invention.

Referring to FIG. 11, the computing device uses a data set including a plurality of labeled images as training samples.

The computing device inputs an image of the data set into the neural network (S1110). As described above, the neural network includes a context module. The computing device generates context information in each context module included in the neural network (S1120) and calculates a context loss (S1130). In some embodiments, the computing device may input the input image into the neural network to extract a feature map from the input image, extract context information from the feature map, and combine the context information with the feature map.

The computing device predicts an object included in the image by performing a target task on the image through the neural network (S1140). The computing device calculates a task loss between the predicted class of the object and the ground-truth class given by the label (S1150).

The computing device updates the parameters of the neural network by backpropagating the context loss and the task loss through the neural network (S1160). The parameters of the neural network may include the weights used in the plurality of layers of the neural network and the weights used in the context module. Accordingly, not only the layers of the neural network but also the context module can be trained.

FIG. 12 is a diagram illustrating a training method according to an embodiment of the present invention.

Referring to FIG. 12, the neural network 1200 predicts an object in an image by performing a target task on an image 1230 of an image set. The neural network 1200 may predict the object through various layers 1210. In this case, the task loss 1250 may be calculated based on the output 1240 that the neural network 1200 produces by performing the target task and on the label given as the correct answer for the input image 1230. In some embodiments, the task loss 1250 may be calculated based on a loss function, for example an L1 loss function or an L2 loss function. The L1 loss function refers to least absolute errors, and the L2 loss function refers to least square errors.
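
For reference, the two loss functions named above can be written as follows. This is a generic sketch: the function names are hypothetical, and summing (rather than averaging) over elements is an assumption, since the description does not state which reduction is used.

```python
import numpy as np

def l1_loss(pred, target):
    """Least absolute errors: sum of absolute differences (sum reduction assumed)."""
    return np.abs(pred - target).sum()

def l2_loss(pred, target):
    """Least square errors: sum of squared differences (sum reduction assumed)."""
    return ((pred - target) ** 2).sum()
```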

The context loss 1260 may be calculated based on the region feature vectors, each representing the region that corresponds to an attention map computed in the context module 1220 of the neural network. In some embodiments, the context loss 1260 may be calculated based on an orthogonal loss between the region feature vectors. The orthogonal loss may be calculated based on the region feature vectors and the number of channels. For example, the context loss l_ctxt(K, Q) between two region feature vectors (K, Q) may be calculated as shown in Equation 5.

In Equation 5, C denotes the number of channels.

The final loss 1270 of the neural network 1200 is calculated based on the task loss 1250 and the context loss 1260. In some embodiments, the final loss 1270 may be calculated based on the task loss 1250 and the sum of the context losses 1260 of the context modules included in the neural network 1200. For example, the final loss 1270 may be calculated as shown in Equation 6.

In Equation 6, l_fin denotes the final loss, l_task denotes the task loss, λ is a constant that controls the effect of the context loss, K_m and Q_m denote the region feature vectors generated in the m-th context module of the neural network, and M denotes the number of context modules included in the neural network (M is an integer greater than or equal to 1).
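
Equations 5 and 6 are not reproduced in this text, so the sketch below only illustrates one form consistent with the symbol definitions. It assumes that the orthogonal loss is the dot product of K and Q divided by the channel count C, and that the final loss is the task loss plus λ times the sum of the per-module context losses. Both forms, the function names, and the default `lam` value are assumptions rather than the patent's stated formulas.

```python
import numpy as np

def context_loss(K, Q):
    """Assumed orthogonal loss: dot product of K and Q scaled by channel count C."""
    C = K.shape[0]
    return float(np.dot(K, Q)) / C

def final_loss(task_loss, region_pairs, lam=0.1):
    """Assumed final loss: l_task + lam * sum over the M context modules.

    region_pairs: list of (K_m, Q_m) tuples, one per context module.
    """
    return task_loss + lam * sum(context_loss(K, Q) for K, Q in region_pairs)
```

Under this assumed form the context loss is zero when K and Q are orthogonal, which would push the two attention maps toward different regions during training.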

The final loss 1270 is backpropagated through the neural network 1200, and the learnable parameters (for example, weights) of the neural network 1200 are updated. The neural network 1200 can be trained in this way.

Meanwhile, the context module may normalize the input feature map. Such an embodiment is described with reference to FIG. 13.

FIG. 13 is a diagram illustrating a context module according to another embodiment of the present invention.

Referring to FIG. 13, the context module 1300 includes a normalization module 1350, an attention module 1310, a vector generation module 1320, and a comparison module 1330. The attention module 1310, the vector generation module 1320, and the comparison module 1330 may operate similarly to the attention module 810, the vector generation module 820, and the comparison module 830 described with reference to FIG. 8, and a description thereof is omitted.

The normalization module 1350 normalizes the feature map X input from the previous layer and inputs the normalized feature map to the attention module 1310. The attention module 1310 generates a plurality of attention maps from the normalized input feature map.

In some embodiments, the normalization module 1350 may normalize the input feature map X by subtracting the C-dimensional mean vector of the input feature map X from the input feature map X. For example, the normalization module 1350 may normalize the input feature map X as shown in Equation 7.

In Equation 7, μ is the C-dimensional mean vector of X (μ ∈ R^{C×1×1}).
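
The mean subtraction described above can be sketched as follows. The sketch assumes that μ is the per-channel mean taken over the H×W spatial positions, which matches the stated shape C×1×1; the function name is hypothetical.

```python
import numpy as np

def normalize_feature_map(X):
    """Subtract the per-channel spatial mean from X of shape (C, H, W).

    Returns the normalized map and mu with shape (C, 1, 1).
    """
    mu = X.mean(axis=(1, 2), keepdims=True)  # assumed: mean over spatial positions
    return X - mu, mu
```

Returning μ alongside the normalized map is a convenience of this sketch, so that the same vector can be reused by a later stage such as channel recalibration.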

For convenience of description, FIG. 13 illustrates a case in which two attention maps are generated, but three or more attention maps may be generated. In addition, the attention maps may be normalized as described with reference to FIG. 9.

In some embodiments, the channels may be calibrated based on the vector used to normalize the input feature map. Such an embodiment is described with reference to FIG. 14.

FIG. 14 is a diagram illustrating a neural network according to another embodiment of the present invention.

Referring to FIG. 14, the neural network includes a context module 1420, combining modules 1430 and 1460, and a channel recalibration module 1450.

The context module 1420 includes a normalization module 1425, an attention module 1411, a vector generation module 1422, and a comparison module 1423. The normalization module 1425, the attention module 1411, the vector generation module 1422, and the comparison module 1423 may operate similarly to the normalization module 1350, the attention module 1310, the vector generation module 1320, and the comparison module 1330 described with reference to FIG. 13, and a description thereof is omitted. The context module 1420 generates context information from the input feature map X. The combining module 1430 combines the context information generated by the context module 1420 with the input feature map X.

The channel recalibration module 1450 calculates a weight for each channel based on the C-dimensional mean vector μ of the input feature map X. In some embodiments, the channel recalibration module 1450 may generate a channel feature vector P indicating which of the C channels to attend to. In some embodiments, the channel feature vector P may be a vector that scales down specific channels among the C channels. In some embodiments, the channel feature vector P may be a C×1×1-dimensional vector (P ∈ R^{C×1×1}).

The combining module 1460 combines the channel feature vector P generated by the channel recalibration module 1450 with the feature map output by the combining module 1430. In some embodiments, the combining module 1460 may multiply the feature map output by the combining module 1430 by the channel feature vector generated by the channel recalibration module 1450.

In some embodiments, the channel recalibration module 1450 may generate the channel feature vector P by applying a 1×1 convolution operation and an activation function to the C-dimensional mean vector μ. In some embodiments, the channel recalibration module 1450 may use a plurality of combinations of a 1×1 convolution operation and an activation function. In some embodiments, as shown in FIG. 14, the channel recalibration module 1450 may include a convolution module 1451, an activation module 1452, a convolution module 1453, and an activation module 1454. In FIG. 14, the activation module 1452 is shown using a rectified linear unit (ReLU) function as its activation function, and the activation module 1454 is shown using a sigmoid function as its activation function.
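
The convolution, ReLU, convolution, sigmoid chain of FIG. 14 can be sketched as below. On a C×1×1 input a 1×1 convolution reduces to a matrix-vector product, which is what the sketch uses. The function names, the hidden width `C_mid`, and the omission of bias terms are assumptions made for illustration.

```python
import numpy as np

def channel_recalibration(mu, W1, W2):
    """Compute the channel feature vector P from the mean vector mu.

    mu: (C,), W1: (C_mid, C), W2: (C, C_mid). Returns P of shape (C,) with entries in (0, 1).
    """
    hidden = np.maximum(W1 @ mu, 0.0)            # 1x1 convolution followed by ReLU
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden)))  # 1x1 convolution followed by sigmoid

def apply_recalibration(F, P):
    """Multiply each channel of the feature map F (C, H, W) by its weight in P."""
    return F * P[:, None, None]
```

Because the sigmoid output lies strictly between 0 and 1, multiplying by P can only shrink a channel, which agrees with the description of P as a vector that scales down specific channels.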

FIG. 15 is a diagram illustrating a neural network according to yet another embodiment of the present invention.

A neural network according to yet another embodiment may use a grouped convolution operation as its convolution operation. In this case, the input and the output are divided into G channel groups, and the grouped convolution operation may be performed separately for each group. Using the grouped convolution operation makes it possible to represent multiple important locations in the input in fine detail.

Referring to FIG. 15, the neural network includes a context module, combining modules 1530 and 1560, and a channel recalibration module.

The context module may include a normalization module 1525, an attention module 1521, a vector generation module 1522, a normalization module 1524, and a comparison module 1523, which may operate as described above, and a description thereof is omitted. In FIG. 15, the attention module 1521 is shown as a module that performs a grouped convolution operation, and the normalization module 1524 is shown as a module that performs a softmax operation. In this case, G attention maps may be output as a result of the grouped convolution operation 1521 and the softmax operation 1524. The region feature vectors (K, Q) each have G elements ([K^1, ..., K^G], [Q^1, ..., Q^G]) and may be expressed as shown in Equation 8. For convenience of description, FIG. 15 illustrates a case in which G is 3.

In Equation 8, g denotes the g-th group.
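
Equation 8 is not reproduced in this text, so the following is only a sketch of one plausible grouped variant: the C channels are split into G contiguous groups, each group gets its own single-channel attention map from its own slice of the 1×1 convolution weights, and each group's region vector is the softmax-weighted average of that group's channels. The function name, the contiguous grouping, and the requirement that C be divisible by G are assumptions.

```python
import numpy as np

def grouped_region_vectors(X, W_K, G):
    """Per-group region feature vectors [K^1, ..., K^G] (assumed form).

    X: (C, H, W) feature map, W_K: (C,) weights, G: number of groups (C % G == 0).
    Returns an array of shape (G, C // G).
    """
    C, H, Wd = X.shape
    Xg = X.reshape(G, C // G, H * Wd)          # channels split into G groups
    Wg = W_K.reshape(G, C // G)                # each group's slice of the weights
    logits = np.einsum("gc,gcn->gn", Wg, Xg)   # one attention map per group
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    A = np.exp(logits)
    A = A / A.sum(axis=1, keepdims=True)       # softmax over spatial positions
    return np.einsum("gn,gcn->gc", A, Xg)      # weighted average within each group
```

Calling the same function with W_Q in place of W_K would give [Q^1, ..., Q^G], after which the comparison can proceed group by group.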

The channel recalibration module may include a convolution module 1551, an activation module 1552, a convolution module 1553, and an activation module 1554, which may operate as described above, and a description thereof is omitted. In FIG. 15, the convolution modules 1551 and 1553 are shown as modules that perform a grouped convolution operation, and the activation modules 1552 and 1554 are shown as a module using a ReLU function and a module using a sigmoid function, respectively.

Next, an exemplary computing device 1600 capable of implementing an image reading apparatus according to an embodiment of the present invention will be described with reference to FIG. 16.

FIG. 16 is a diagram illustrating a computing device according to an embodiment of the present invention.

Referring to FIG. 16, the computing device 1600 includes a processor 1610, a memory 1620, a storage device 1630, a communication interface 1640, and a bus 1650. The computing device 1600 may further include other general-purpose components.

The processor 1610 controls the overall operation of each component of the computing device 1600. The processor 1610 may be implemented as at least one of various processing units such as a central processing unit (CPU), a microprocessor unit (MPU), a micro controller unit (MCU), and a graphic processing unit (GPU), and may also be implemented as a parallel processing unit. In addition, the processor 1610 may perform operations for a program that executes the image reading method described above.

The memory 1620 stores various data, instructions, and/or information. The memory 1620 may load a computer program from the storage device 1630 in order to execute the image reading method described above. The storage device 1630 may store the program in a non-transitory manner. The storage device 1630 may be implemented as a nonvolatile memory.

The communication interface 1640 supports wired and wireless Internet communication of the computing device 1600. The communication interface 1640 may also support various communication methods other than Internet communication.

The bus 1650 provides a communication function between the components of the computing device 1600. The bus 1650 may be implemented as various types of buses, such as an address bus, a data bus, and a control bus.

The computer program may include instructions that, when loaded into the memory 1620, cause the processor 1610 to perform the image reading method. That is, the processor 1610 may perform the operations of the image reading method by executing the instructions.

In some embodiments, the computer program may include instructions for extracting a first feature map from an input image based on a neural network, generating a plurality of attention maps from the first feature map based on the neural network, generating context information based on the plurality of attention maps, generating a second feature map by combining the context information with the first feature map, and outputting a reading result of the input image from the second feature map based on the neural network.

In some embodiments, the computer program may include instructions for extracting a first feature map from an input image based on a neural network, generating, from the first feature map and based on the neural network, a first vector representing a region of interest and a second vector representing a region corresponding to the region of interest, generating context information based on the first vector and the second vector, generating a second feature map by combining the context information with the first feature map, and outputting a reading result of the input image from the second feature map based on the neural network.
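For illustration only, the instruction sequence of these embodiments can be sketched in NumPy as follows. The sketch assumes that each attention map comes from a single-output 1×1 convolution (a weight vector of length C), that the attention maps are normalized with a softmax over spatial positions, that the two vectors are compared by subtraction, and that "combining" is a broadcast addition. The function names and the weight arguments `w_roi` and `w_ref` are hypothetical; feature extraction and the final reading head are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def context_step(feat, w_roi, w_ref):
    """feat: first feature map (C, H, W).
    w_roi, w_ref: hypothetical (C,) weights of two 1x1 convolutions."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)                     # (C, H*W)
    # One attention map per weight vector, normalized over positions.
    attn = softmax(np.stack([w_roi, w_ref]) @ flat)   # (2, H*W)
    # Each vector is the attention-weighted average of the feature map.
    vecs = attn @ flat.T                              # (2, C)
    # Compare the two vectors by subtraction (assumed comparison).
    context = (vecs[0] - vecs[1]).reshape(c, 1, 1)    # (C, 1, 1)
    # Assumption: the context is combined by broadcast addition.
    return feat + context, context

rng = np.random.default_rng(0)
first_map = rng.normal(size=(8, 4, 4))
second_map, context = context_step(first_map, rng.normal(size=8), rng.normal(size=8))
print(second_map.shape, context.shape)  # (8, 4, 4) (8, 1, 1)
```

When both weight vectors are identical, the two attention maps coincide and the context information is zero, so the second feature map equals the first.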

The image reading method or apparatus according to an embodiment of the present invention described above may be implemented as a computer-readable computer program on a computer-readable medium. In one embodiment, the computer-readable medium may be a removable recording medium or a fixed recording medium. In another embodiment, the computer program recorded on the computer-readable medium may be transmitted to another computing device through a network such as the Internet, and installed and executed on that computing device.

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

Claims

An image reading method performed by a computing device, the method comprising:
extracting a first feature map from an input image based on a neural network;
generating a plurality of attention maps from the first feature map based on the neural network;
generating context information based on the plurality of attention maps;
generating a second feature map by combining the context information with the first feature map; and
outputting a reading result of the input image from the second feature map based on the neural network,
wherein the generating of the context information comprises:
generating a plurality of vectors based on the plurality of attention maps and the first feature map; and
generating the context information by comparing the plurality of vectors.

The image reading method of claim 1,
wherein the generating of the context information by comparing the plurality of vectors comprises comparing the plurality of vectors by taking the difference between two vectors among the plurality of vectors.

The image reading method of claim 1,
wherein the generating of the plurality of vectors comprises:
generating the plurality of attention maps by applying, to the first feature map, 1×1 convolution operations having different parameters; and
generating each vector as a weighted average of the first feature map based on the corresponding attention map.

The image reading method of claim 3,
wherein the generating of the plurality of vectors further comprises normalizing each attention map before taking the weighted average of the first feature map.

The image reading method of claim 3,
wherein the 1×1 convolution operation is a group convolution operation performed after grouping a plurality of channels of the first feature map into a predetermined number of channel groups.
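As a non-limiting illustration of the group convolution recited above, the following NumPy sketch splits the input channels into groups and mixes channels only within each group. The function name, the kernel layout `(C_out, C_in // groups)`, and the example sizes are assumptions chosen for the sketch, not details taken from the claims.

```python
import numpy as np

def grouped_conv1x1(feat, weight, groups):
    """Grouped 1x1 convolution (no bias).

    feat:   (C_in, H, W) feature map
    weight: (C_out, C_in // groups) kernel, one row per output channel
    """
    c_in, h, w = feat.shape
    c_out = weight.shape[0]
    assert c_in % groups == 0 and c_out % groups == 0
    # Split input channels and output channels into the same number of groups.
    x = feat.reshape(groups, c_in // groups, h * w)
    k = weight.reshape(groups, c_out // groups, c_in // groups)
    # Each output channel only sees the input channels of its own group.
    out = np.einsum("goi,gip->gop", k, x)
    return out.reshape(c_out, h, w)

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 3, 3))
# Four groups of two channels, one output channel per group.
maps = grouped_conv1x1(feat, rng.normal(size=(4, 2)), groups=4)
print(maps.shape)  # (4, 3, 3)
```

With `groups=1` this reduces to an ordinary 1×1 convolution; with more groups the kernel holds `C_out * C_in / groups` weights instead of `C_out * C_in`.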

The image reading method of claim 1, further comprising:
calculating a loss of the context information based on the plurality of vectors; and
updating the neural network by backpropagating the loss to the neural network.

The image reading method of claim 6,
wherein the calculating of the loss comprises calculating the loss based on an orthogonal loss of the plurality of vectors.
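As a non-limiting illustration, one possible form of an orthogonal loss for two vectors is their inner product scaled by the number of channels C, which is zero when the vectors are orthogonal. This specific formula and the function name are assumptions made for the sketch; the claims do not fix the form of the loss. Backpropagation is not shown and would be handled by the training framework.

```python
import numpy as np

def orthogonal_loss(v1, v2):
    """Assumed form: inner product of the two vectors divided by C."""
    v1 = np.asarray(v1, dtype=float).ravel()
    v2 = np.asarray(v2, dtype=float).ravel()
    return float(v1 @ v2) / v1.size

print(orthogonal_loss([1, 0, 0, 0], [0, 1, 0, 0]))  # orthogonal vectors: 0.0
print(orthogonal_loss([1, 2, 0, 0], [1, 2, 0, 0]))  # identical vectors: 1.25
```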

The image reading method of claim 1,
wherein the generating of the plurality of attention maps comprises normalizing the first feature map before generating the plurality of attention maps.

The image reading method of claim 8,
wherein the normalizing of the first feature map comprises normalizing the first feature map by subtracting a C-dimensional average vector of the first feature map from the first feature map,
where C denotes the number of channels of the first feature map.
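As a non-limiting illustration of this normalization, the sketch below assumes that the C-dimensional average vector is the per-channel mean over all H×W positions. The function name is hypothetical.

```python
import numpy as np

def subtract_channel_mean(feat):
    """feat: (C, H, W). Returns the centered map and the (C,) average vector."""
    mu = feat.mean(axis=(1, 2))              # assumed: mean over spatial positions
    return feat - mu[:, None, None], mu      # broadcast the mean over H and W

rng = np.random.default_rng(0)
centered, mu = subtract_channel_mean(rng.normal(loc=3.0, size=(8, 4, 4)))
print(mu.shape, np.allclose(centered.mean(axis=(1, 2)), 0.0))  # (8,) True
```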

The image reading method of claim 9, further comprising:
generating, based on the average vector, a channel feature vector for reducing the magnitude of a specific channel among the C channels,
wherein the generating of the second feature map comprises combining the channel feature vector with a feature map in which the context information is combined with the first feature map.

The image reading method of claim 10,
wherein the generating of the channel feature vector comprises generating the channel feature vector by applying a 1×1 convolution operation and an activation function to the average vector.

The image reading method of claim 11,
wherein the 1×1 convolution operation is a group convolution operation performed after grouping the C channels of the average vector into a predetermined number of channel groups.
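As a non-limiting illustration of the channel feature vector, the sketch below treats the 1×1 convolution on the C-dimensional average vector as a matrix product, assumes a sigmoid as the activation function, and assumes that the vector is combined with the feature map by per-channel multiplication so that selected channels are scaled down. The function names and the `(C, C)` weight are hypothetical; a group convolution would correspond to a block-diagonal weight.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_feature_vector(mu, weight):
    """mu: (C,) average vector; weight: hypothetical (C, C) 1x1-conv kernel."""
    # On a C x 1 x 1 input, a 1x1 convolution is a matrix-vector product.
    return sigmoid(weight @ mu)               # (C,), each entry in (0, 1)

def apply_channel_vector(feat_with_context, p):
    """Assumption: combine by per-channel multiplication."""
    return feat_with_context * p[:, None, None]

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 4, 4))
p = channel_feature_vector(feat.mean(axis=(1, 2)), rng.normal(size=(8, 8)))
out = apply_channel_vector(feat, p)
print(out.shape)  # (8, 4, 4)
```

Because every entry of the vector lies between 0 and 1 under these assumptions, no channel is amplified; channels with small entries are suppressed.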

The image reading method of claim 1,
wherein, when the first feature map has dimensions C×H×W, the context information has dimensions C×1×1,
where C, H, and W denote the number of channels, the height, and the width of the first feature map, respectively.

The image reading method of claim 1,
wherein the outputting of the reading result of the input image comprises:
generating a plurality of second attention maps from the second feature map based on the neural network;
generating second context information based on the plurality of second attention maps;
generating a third feature map by combining the second context information with the second feature map; and
outputting the reading result of the input image from the third feature map based on the neural network.

The image reading method of claim 14, further comprising:
calculating a loss of the context information;
calculating a loss of the second context information;
calculating a context loss based on the loss of the context information and the loss of the second context information; and
updating the neural network by backpropagating the context loss to the neural network.
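As a non-limiting illustration of chaining two context stages and aggregating their losses, the sketch below passes a feature map through a list of modules and sums the per-module losses. The plain sum is an assumption (the claims only require that the context loss be based on both losses), and `toy_module` is a placeholder standing in for a real context module.

```python
import numpy as np

def run_stack(first_feat, modules):
    """Chain context modules and aggregate their losses.

    modules: callables mapping a (C, H, W) feature map to
             (next_feature_map, module_loss).
    """
    feat, losses = first_feat, []
    for module in modules:
        feat, loss = module(feat)
        losses.append(loss)
    # Assumption: the context loss is the plain sum of the per-module losses.
    return feat, sum(losses)

def toy_module(shift, loss):
    """Placeholder module: adds a constant context and reports a fixed loss."""
    return lambda feat: (feat + shift, loss)

first = np.zeros((2, 3, 3))
third, context_loss = run_stack(first, [toy_module(1.0, 0.25), toy_module(2.0, 0.5)])
print(third[0, 0, 0], context_loss)  # 3.0 0.75
```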

An image reading device comprising:
a memory that stores instructions; and
a processor that, by executing the instructions, extracts a first feature map from an input image based on a neural network, generates, from the first feature map and based on the neural network, a first vector representing a region of interest and a second vector representing a region corresponding to the region of interest, generates context information by comparing the first vector and the second vector, generates a second feature map by combining the context information with the first feature map, and outputs a reading result of the input image from the second feature map based on the neural network.

The image reading device of claim 16,
wherein the processor generates a plurality of attention maps from the first feature map based on the neural network, and generates the first vector and the second vector based on the plurality of attention maps.

A computer program stored in a recording medium and executed by a computing device,
the computer program causing the computing device to execute:
extracting a first feature map from an input image based on a neural network;
generating, from the first feature map and based on the neural network, a first vector representing a region of interest and a second vector representing a region corresponding to the region of interest;
generating context information by comparing the first vector and the second vector;
generating a second feature map by combining the context information with the first feature map; and
outputting a reading result of the input image from the second feature map based on the neural network.

The computer program of claim 18,
wherein the generating of the first vector and the second vector comprises:
generating a plurality of attention maps from the first feature map based on the neural network; and
generating the first vector and the second vector based on the plurality of attention maps.