KR102553934B1

KR102553934B1 - Apparatus for Image Fusion High Quality

Info

Publication number: KR102553934B1
Application number: KR1020210119081A
Authority: KR
Inventors: 전광길
Original assignee: 인천대학교 산학협력단
Priority date: 2021-09-07
Filing date: 2021-09-07
Publication date: 2023-07-10
Also published as: KR20230036343A

Abstract

A high-definition image fusion device fuses an infrared image and a visible-light image based on intermediate features, which reflect the main structure of the source images, and compensation features, which represent the details of the source images, thereby improving the quality of the fused image.

Description

High-definition image fusion device {Apparatus for Image Fusion High Quality}

The present invention relates to a high-definition image fusion device, and more particularly, to a high-definition image fusion device that fuses an infrared image and a visible-light image based on intermediate features, which reflect the main structure of the source images, and compensation features, which represent the details of the source images, thereby improving the quality of the fused image.

Image fusion is widely used in applications ranging from remote sensing and medical diagnosis to security and surveillance.

In general, fusion can contribute to object detection and recognition problems in the contexts of night-time environments, medical systems, photography, and remote-sensing mapping.

Image fusion is designed to integrate multi-source images from different sensors in order to obtain a more comprehensive and informative image. For example, thermal and visible-light imaging systems are widely adopted to enhance surveillance in both civilian and military applications.

Recently, numerous fusion algorithms for infrared and visible-light images have been proposed. As a low-level fusion technique, pixel-level image fusion directly processes the pixels of the images obtained from multiple sensors.

In practical applications, however, pixel-level fusion generally suffers from high computational cost and unacceptable artifacts.

Korean Registered Patent No. 10-0551826

To solve these problems, an object of the present invention is to provide a high-definition image fusion device that fuses an infrared image and a visible-light image based on intermediate features, which reflect the main structure of the source images, and compensation features, which represent the details of the source images, thereby improving the quality of the fused image.

To achieve the above object, a high-definition image fusion device according to a feature of the present invention includes:

a feature extraction unit that extracts features from each of the source images, namely an infrared image and a visible-light image;

an encoder unit that is composed of one or more convolutional layers functioning as feature extractors and that outputs intermediate features, which reflect the main structure of the source images, and compensation features, which represent the details of the source images;

a fusion unit that, when fusing the intermediate features and the compensation features, uses the intermediate features to extract from the feature maps an attention map of each source image that reflects the features of that source image, generates fused intermediate features using the generated attention maps, and generates fused compensation features using the compensation features; and

a decoder unit that is composed of one or more deconvolution layers and that decodes the fused intermediate features and the fused compensation features to output a final fused image.

With the above configuration, the present invention can greatly reduce the computational complexity and load caused by redundant information when fusing a visible-light image and an infrared image, and can prevent unacceptable artifacts.

FIG. 1 is a diagram showing the configuration of a high-definition image fusion device according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the training phase of the encoder unit and the decoder unit according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a visualization of attention maps and some of the obtained features according to an embodiment of the present invention.
FIG. 4 is a diagram showing the total training loss curve in the training phase according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating sample images.
FIG. 6 is a diagram illustrating the fusion strategy for intermediate features according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the fusion strategy for compensation features according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating a study of the presence or absence of skip connections according to an embodiment of the present invention.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a diagram showing the configuration of a high-definition image fusion device according to an embodiment of the present invention, and FIG. 2 is a diagram showing the training phase of the encoder unit and the decoder unit according to an embodiment of the present invention.

The high-definition image fusion device 100 according to an embodiment of the present invention includes an encoder unit 110, a fusion unit 120, and a decoder unit 130.

The high-definition image fusion device 100 is an encoder-decoder fusion device based on residual blocks for fusing an infrared image and a visible-light image.

The encoder unit 110 is composed of one or more convolutional layers functioning as feature extractors and outputs intermediate features, which reflect the main structure of the source images, and compensation features, which represent the details of the source images.

When fusing the intermediate features and the compensation features, the fusion unit 120 uses the intermediate features to extract from the feature maps an attention map of each source image that reflects the features of that source image, generates fused intermediate features using the generated attention maps, and generates fused compensation features using the compensation features.

The decoder unit 130 is composed of one or more deconvolution layers and decodes the fused intermediate features and the fused compensation features to output the final fused image.

In this way, a variety of representative features can be extracted effectively from the source images.

Furthermore, these features can be fused because the same convolutional layer, having the same weights, outputs features of the same type.

Unlike existing frameworks, which use only the intermediate features, the framework of the present invention uses all of the features generated by the convolutional layers for fusion.

Specifically, the two source images (a visible-light image and an infrared image) are encoded separately with the trained model to extract a series of features.

The intermediate features generated by the residual block are denoted [symbol], where k denotes the k-th input source image: k = 1 is the infrared (IR) image and k = 2 is the visible-light image. i is the total number of intermediate features of the input image, and res denotes the residual-layer output.

Likewise, the output features of the first convolution (conv1) and the second convolution (conv2) are denoted [symbols], where m is the total number of features in the conv1 layer and n is the total number of features in the conv2 layer. These features are called compensation features.

FIG. 3 is a diagram illustrating a visualization of attention maps and some of the obtained features according to an embodiment of the present invention.

FIG. 3(a) shows the source images, FIG. 3(b) shows the attention maps, and FIG. 3(c) shows some of the features obtained from the first convolutional layer and the residual block.

It can be seen that the intermediate features (red boxes) convey the main structure of the source images, whereas the compensation features (green boxes) retain some of their details.

As shown in FIG. 3(c), some of the features obtained from the first convolutional layer and the residual block are selected for visualization.

The intermediate features in the red boxes reflect the main structure of the source images. These deep features can therefore provide the background information of the fusion result. In contrast, the compensation features in the green boxes convey various texture details of the source images.

Therefore, the shallow features generated by the first and second convolutional layers are equally important for reconstructing the final fusion result.

The fusion unit 120 applies two fusion strategies, one for the intermediate features and one for the compensation features: attention-based feature fusion and compensation feature fusion.

The decoder unit 130 decodes the two fused parts mentioned above to reconstruct the final fused image. Similarly, the features of the last (dconv1) and second-to-last (dconv2) deconvolutions are denoted [symbols], respectively.

Training phase

As shown in FIG. 2, in the training phase the SEDRFuse network of the present invention includes the encoder unit 110 and the decoder unit 130 without the fusion part.

The SEDRFuse network as a whole comprises the encoder unit 110, the fusion unit 120, and the decoder unit 130.

The SEDRFuse network is trained on the available KAIST and FLIR data sets.

The goal of the training phase is to reconstruct the original data set accurately while minimizing the reconstruction loss; the smaller the reconstruction error, the more representative the extracted features. The basic units of the training network are convolutional layers, deconvolutional layers, a residual block, skip connections, and rectified linear unit (ReLU) functions. Pooling layers are removed to avoid losing useful details of the original data.

(1) Encoder unit

The encoder unit 110 is composed of three convolutional layers and one residual block. The input training data have a size of 256 × 256 (height × width).

The first convolutional layer does not change the input size, whereas the second and third convolutional layers (downsampling) each halve their input size.

To compensate for image details lost during convolution, earlier features are additionally reused, following ResNet.

In the SEDRFuse network, the encoder unit 110 adds one residual block after the last convolutional layer. All convolution operations act as feature extractors while fully preserving the texture and structure information of the original image.

The output of the encoder consists of 256 intermediate features of size 64 × 64.
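The spatial sizes quoted for the encoder follow from the two downsampling layers. The sketch below traces them under the assumption that each downsampling convolution uses stride 2 with "same"-style padding. The text states only that the second and third layers halve the input size, so the strides, the padding, and the helper names `conv_output_size` and `encoder_sizes` are illustrative assumptions.

```python
import math


def conv_output_size(size: int, stride: int) -> int:
    """Spatial size after a convolution with 'same'-style padding (assumed)."""
    return math.ceil(size / stride)


def encoder_sizes(input_size: int = 256, strides=(1, 2, 2)) -> list:
    """Return the feature-map side length after each of the three conv layers."""
    sizes = []
    size = input_size
    for stride in strides:
        size = conv_output_size(size, stride)
        sizes.append(size)
    return sizes


print(encoder_sizes())  # [256, 128, 64]
```

Under these assumptions the third layer produces 64 × 64 maps, which matches the stated encoder output size if the residual block leaves the spatial size unchanged.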

(2) Decoder unit

To obtain an output image of the same size as the input, the decoder unit 130 uses deconvolutions that are symmetric to the corresponding convolutions in the encoder unit 110.

Deconvolution is generally used to reconstruct the original image from the extracted intermediate features through upsampling. The kernel size of each deconvolution layer must equal that of the corresponding convolutional layer so that the sizes match exactly.

In the SEDRFuse network, all kernel sizes are set to 3 × 3. Moreover, the decoder unit 130 contains only two types of units: deconvolutional layers and ReLU functions.

(3) Residual Block

As shown in FIG. 2, one residual block is added to the encoder unit 110 for two purposes.

The first is to ensure good training convergence in the deep network used to extract more representative intermediate features.

The second is to make full use of the earlier features generated by the third convolutional layer.

The number of residual blocks is chosen according to experimental performance.

The residual block consists of an input feature x and a mapping F(x) composed of one or more weight layers; the output of F(x) and the input feature x are merged element-wise, and the result is passed through a rectified linear unit (ReLU) function.
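The element-wise merge in the residual block can be read as y = ReLU(F(x) + x). The sketch below illustrates that reading on a flat list of feature values. Here `toy_f` is only a stand-in for the weight layers F, whose parameters are not given in the text, and the names `relu`, `residual_block`, and `toy_f` are hypothetical.

```python
def relu(values):
    """Rectified linear unit applied element-wise."""
    return [max(0.0, v) for v in values]


def residual_block(x, f):
    """Compute ReLU(F(x) + x) for a flat list of feature values x."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) must have the same shape as x")
    return relu([a + b for a, b in zip(fx, x)])


def toy_f(x):
    """Stand-in for the weight layers: a fixed scaling, for illustration only."""
    return [-2.0 * v for v in x]


print(residual_block([1.0, -3.0, 0.5], toy_f))  # [0.0, 3.0, 0.0]
```

Because the input x is added back before the activation, the block only has to learn the residual F(x), which is the usual argument for why such blocks ease the training of deeper networks.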

(4) Skip Connections

The convolution operation preserves the basic image content, but texture details of the image may be lost. In addition, deconvolution can restore only the structural details of the image content from the extracted features, which have lost a certain amount of information during downsampling in the encoder unit 110.

In general, the output of the decoder unit 130 is therefore a filtered version of the input image, and the image fusion performance is unsatisfactory. The present invention accordingly uses skip connections to pass texture feature information, merged in an element-wise choose-max manner, from each convolutional layer to the corresponding deconvolution layer. These skip connections make the proposed framework easier to train and speed up convergence.

Two strategies for merging the intermediate features and the compensation features in the fusion phase are described below. As can be seen in FIG. 3(c), the intermediate features represent the main structure of the source images. Consequently, with a simple weighted-average or [symbol]-norm strategy, salient features cannot be distinguished during fusion, and the fusion result has low contrast and low brightness.

To preserve the brightness information of the original images, an attention-based feature fusion strategy is designed for fusing the intermediate features.

FIG. 3(b) shows the attention maps of the source images. Salient objects in the source images are assigned higher attention scores, which makes the maps suitable for fusing the intermediate features.

Nevertheless, as the number of convolutional layers increases, much detail is lost from the intermediate features. The features generated by the earlier, shallow layers (green boxes in FIG. 3(c)) contain details of the source images. Therefore, to retain these visual details in the fused result, the compensation features are reused through a skip connection strategy.

1) Attention-Based Feature Fusion

Recently, models based on attention mechanisms have been introduced into the training of CNN architectures. They are applied in many vision tasks, particularly to regions of interest in visual scenes. The goal of infrared and visible-light image fusion is to retain visual details and salient thermal-radiation regions simultaneously.

Motivated by this earlier work, the intermediate features are used to obtain the attention maps of the original images. In the framework of the present invention, the output of the residual block is a series of intermediate features.

Each of them shows a particular kind of information about the source image. To reflect the salient features of the source image accurately, the fusion unit 120 generates an attention map from the feature maps. Each feature map has its own weight, given by a Softmax operation that computes probabilities along the channel direction.

Because the Softmax function can compute the probability with which each intermediate feature contributes to the attention map, it is suitable for computing an activity-level measure of the source image. The fusion unit 120 calculates the probability weight maps with the Softmax function according to Equation 1 below.

[Equation 1]

Here, [symbol] is the probability weight map of each feature, (x, y) denotes the same position across all intermediate feature channels, and i is the channel number. The intermediate features generated by the residual block are denoted [symbol], where k is the k-th input source image (k = 1 is the infrared (IR) image and k = 2 is the visible-light image), i is the total number of intermediate features of the input image, and res is the residual-layer output.

The Softmax function can be expressed as Equation 2 below.

[Equation 2]

Here, [symbol] is an element of a vector sequence.

The fusion unit 120 generates an attention map for each source image by multiplying every intermediate feature by its probability weight and summing the results, according to Equation 3 below.

[Equation 3]

Here, [symbol] is the attention map, which reflects the activity-level measure of the source image. Following the saliency mechanism, the present invention uses the attention maps to optimize the intermediate features before feature-level fusion.

The fusion unit 120 generates the fused intermediate features using Equations 4 and 5 below.

[Equation 4]

[Equation 5]

Here, j denotes the j-th input source image, [symbol] is the optimal weight map for the features of the j-th source image, and [symbol] is the fused intermediate feature that is sent to the decoder unit 130.
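One reading of Equations 1 to 5 that is consistent with the surrounding description can be sketched as follows: a per-pixel Softmax over channels gives the probability weights, their weighted sum gives the attention map of each source, and the two attention maps are normalised into per-pixel weights for blending the sources. The normalisation step (each source's share of the summed attention) is an assumption, since the formula for the optimal weight map is not reproduced in this text. The sketch also assumes non-negative features, stores each channel as a flat list of pixel values, and uses hypothetical function names throughout.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention_map(features):
    """features[i][p] is channel i at pixel p; returns one attention value per pixel."""
    n_pixels = len(features[0])
    amap = []
    for p in range(n_pixels):
        column = [channel[p] for channel in features]
        weights = softmax(column)  # per-pixel probability over channels
        amap.append(sum(w * v for w, v in zip(weights, column)))
    return amap


def fuse_intermediate(feat_ir, feat_vis, eps=1e-12):
    """Blend two sets of intermediate features with attention-derived weights."""
    a_ir, a_vis = attention_map(feat_ir), attention_map(feat_vis)
    fused = []
    for ch_ir, ch_vis in zip(feat_ir, feat_vis):
        channel = []
        for p, (u, v) in enumerate(zip(ch_ir, ch_vis)):
            total = a_ir[p] + a_vis[p]
            # Assumed normalisation: each source's share of the summed attention.
            w_ir = a_ir[p] / total if total > eps else 0.5
            channel.append(w_ir * u + (1.0 - w_ir) * v)
        fused.append(channel)
    return fused
```

With this weighting, a pixel where the infrared attention dominates is taken almost entirely from the infrared features, which is how the strategy preserves salient thermal regions instead of averaging them away.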

2) Compensation Feature Fusion

The compensation features can be used to reconstruct the details lost during the convolution process in the encoder unit 110. Because each feature pixel value after compression represents a receptive field of the original image, the choose-max strategy, applied element-wise, is the better choice for merging them.

The fusion unit 120 generates the fused compensation features from the compensation features received from the encoder unit 110.

The fusion unit 120 generates the fused first compensation features for the first convolution using Equation 6 below.

[Equation 6]

Here, [symbols] denote the infrared features, the visible-light features, and the fused first compensation features of the first convolution, respectively, and m is the total number of features of the first convolution. (x, y) is the pixel coordinate within a feature, and [symbol] is the element-wise maximum-selection function.

The fusion unit 120 generates the fused second compensation features for the second convolution using Equation 7 below.

[Equation 7]

Here, [symbols] denote the infrared features, the visible-light features, and the fused second compensation features of the second convolution, respectively.

Equation 7 has the same form as Equation 6.
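The choose-max strategy described for Equations 6 and 7 keeps, at every channel and pixel, the larger of the infrared and visible-light responses. A minimal sketch follows, with each channel stored as a flat list of pixel values; the function name `choose_max` and the sample numbers are illustrative only.

```python
def choose_max(feat_ir, feat_vis):
    """Element-wise maximum of two equally shaped feature sets.

    feat_ir[i][p] and feat_vis[i][p] hold channel i at pixel p.
    """
    if len(feat_ir) != len(feat_vis):
        raise ValueError("both sources need the same number of channels")
    return [
        [max(u, v) for u, v in zip(ch_ir, ch_vis)]
        for ch_ir, ch_vis in zip(feat_ir, feat_vis)
    ]


conv1_ir = [[0.2, 0.9], [0.5, 0.1]]
conv1_vis = [[0.7, 0.3], [0.4, 0.6]]
print(choose_max(conv1_ir, conv1_vis))  # [[0.7, 0.9], [0.5, 0.6]]
```

The same function applies unchanged to the conv2 features, which is consistent with the statement that Equation 7 has the same form as Equation 6.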

The decoder unit 130 generates a first input feature by computing, according to Equation 8 below, the element-wise summation of the output of the first deconvolution layer and the fused first compensation features received from the fusion unit 120.

[Equation 8]

Here, [symbol] is the first input feature and [symbol] is the output of the first deconvolution layer.

The decoder unit 130 generates a second input feature by computing, according to Equation 9 below, the element-wise summation of the output of the second deconvolution layer and the fused second compensation features received from the fusion unit 120.

[Equation 9]

Here, [symbol] is the second input feature and [symbol] is the output of the second deconvolution layer.

As a result, the fused compensation features of the first and second convolutional layers can compensate for the visual details of the fused image. They are passed to the corresponding deconvolutional layers by element-wise summation, as expressed in Equations 8 and 9.

In these equations, [symbols] are the outputs of the first and second deconvolution layers, respectively, and [symbols] are the input features, which are passed to the second and third deconvolution layers of the decoder unit 130.
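The element-wise summation of Equations 8 and 9 can be sketched as below. The sketch shows only the addition itself; which deconvolution output is paired with which compensation features follows the text and is not decided here, and the two operands must have identical shapes. The function name `add_skip` and the sample values are illustrative.

```python
def add_skip(deconv_out, fused_comp):
    """Element-wise sum of a deconvolution output and fused compensation features."""
    if len(deconv_out) != len(fused_comp):
        raise ValueError("channel counts must match")
    return [
        [d + c for d, c in zip(ch_d, ch_c)]
        for ch_d, ch_c in zip(deconv_out, fused_comp)
    ]


print(add_skip([[1.0, 2.0]], [[0.5, 0.5]]))  # [[1.5, 2.5]]
```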

Experimental results

A. Data preparation

The KAIST and FLIR data sets, which contain pairs of infrared and visible-light images, were used for training. The KAIST data set consists of 95,000 color-thermal pairs captured from a vehicle. The FLIR data set provides about 14,452 thermal and visible image pairs for autonomous driving research.

The two data sets were recorded as series of video clips at 20 Hz and 30 Hz, respectively.

The data sets cover a variety of scenes, including campuses, roads, downtown areas, streets, and highways, captured by day and by night. To further diversify the training data, the frame interval was enlarged through resampling. In total, 52,000 infrared and visible-light images were used: 50,000 for training and 2,000 for validation. Details of the data sets are given in Table 1 (data sets used for training). In the experiments, all images were resized to 256 × 256 pixels and converted to grayscale images in the range [0, 1].

B. Experimental setup

1) Training details

The SEDRFuse network was trained on the prepared training data set (Table 1). The batch size and the number of epochs were set to 2 and 50, respectively.

The learning rate was set to 1 × 10^-4. The total loss function combines a pixel loss and an SSIM loss; together, these two losses constrain both the pixel error and the edge error of the reconstruction. The total loss function is given by Equation 10.

[Equation 10]

Here, Tloss, Ploss, and SSIMloss denote the total loss, the pixel loss, and the SSIM loss, respectively. SSIM is the structural similarity score, and the SSIM loss is defined by Equation 11.

[Equation 11]

Here, out and in denote the reconstructed data and the input training data, respectively, and [symbol] denotes the SSIM function, which computes the structural similarity score.

[Equation 12]

In Equation 12, M and N are the image dimensions and (x, y) is the pixel position.
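The loss described around Equations 10 to 12 can be illustrated with the sketch below. It rests on several assumptions, because the equation images are not reproduced in this text: the pixel loss is taken as the mean squared error over the M × N pixels, the SSIM loss as 1 − SSIM(out, in), and the total loss as their weighted sum with a hypothetical `ssim_weight`. The SSIM here is a simplified single-window version computed over the whole image with the customary constants, not the windowed SSIM normally used in practice. Images are flat lists of values in [0, 1].

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def global_ssim(a, b, data_range=1.0):
    """Single-window SSIM over whole images given as flat lists (simplified)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = mean(a), mean(b)
    var_a = mean([(v - mu_a) ** 2 for v in a])
    var_b = mean([(v - mu_b) ** 2 for v in b])
    cov = mean([(u - mu_a) * (v - mu_b) for u, v in zip(a, b)])
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def pixel_loss(out, inp):
    """Mean squared error over all pixels (assumed form of the pixel loss)."""
    return mean([(o - i) ** 2 for o, i in zip(out, inp)])


def total_loss(out, inp, ssim_weight=1.0):
    """Pixel loss plus weighted (1 - SSIM); the weight is a hypothetical knob."""
    return pixel_loss(out, inp) + ssim_weight * (1.0 - global_ssim(out, inp))
```

A perfect reconstruction gives a total loss of zero under these definitions, since the pixel error vanishes and SSIM equals 1.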

The SEDRFuse network of the present invention was implemented on a computer equipped with an NVIDIA GTX 1070 Ti GPU, 32 GB of RAM, and an Intel Core i5-8500 CPU. The network architecture was programmed in TensorFlow.

FIG. 4 is a diagram showing the total training loss curve in the training phase according to an embodiment of the present invention.

A total loss value is plotted every 1,000 iterations. Each epoch requires 25 × 10^3 iterations. The curve shows that the total loss tends to stabilize at about 50 epochs, which indicates that the trained model has reached an optimal setting.
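The iteration count per epoch is consistent with the stated training-set size and batch size: 50,000 training samples divided by a batch size of 2 gives 25,000 iterations. The short check below reproduces that arithmetic; the helper name `iterations_per_epoch` is illustrative.

```python
def iterations_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of parameter updates needed to see every sample once."""
    return -(-num_samples // batch_size)  # ceiling division


per_epoch = iterations_per_epoch(50_000, 2)
print(per_epoch)               # 25000
print(per_epoch * 50)          # 1250000 iterations over 50 epochs
print(per_epoch * 50 // 1000)  # 1250 loss values at one per 1000 iterations
```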

테스트 세부 정보는 실험에서 Nederlandse Organisatie voor Toegepast-Natuurwetenschappelijk Onderzoek(TNO) 데이터 세트4가 훈련된 모델을 테스트하기 위해 채택되었다. 다양한 장면을 포함하는 20쌍의 적외선 영상 및 가시광선 영상이 선택되었다. 이 모든 소스 영상의 쌍은 완전히 등록되었으며, 도 5는 샘플 영상을 나타낸 도면이다.The test details were adapted from the experiment to test the trained model on the Nederlandse Organisatie voor Toegepast-Natuurwetenschappelijk Onderzoek (TNO) dataset4. Twenty pairs of infrared and visible light images containing various scenes were selected. All pairs of source images have been fully registered, and FIG. 5 shows sample images.

FIG. 5 shows some of the source images from the TNO dataset; the first and third rows are infrared images, and the second and fourth rows are visible light images.

Evaluation metrics

To verify fusion performance, both subjective and objective evaluations were carried out in the experiments. The objective metrics are average gradient (AG), correlation coefficient (CC), entropy (EN), spatial frequency (SF), the structural similarity fusion metric (SSIMfm), visual information fidelity for fusion (VIFF) and standard deviation (SD). For each of these metrics, a larger value indicates better fusion performance. See the corresponding literature for details. In particular, the SSIMfm fusion metric used in the experiments is calculated as in Equation 13, where SSIM(·) denotes the SSIM function, A and B are the two source images, and F is the fused image, which serves as the reference image.
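Three of the no-reference metrics listed above can be sketched as follows, using their commonly cited textbook definitions. These are assumptions: the specification defers the exact formulas to the literature, and normalization conventions (for example, the divisor used in SF) vary between papers. The image is assumed to be a list of rows of integer gray levels in the range 0 to 255.

```python
import math


def entropy(img, levels=256):
    # EN: Shannon entropy (in bits) of the gray-level histogram
    hist = [0] * levels
    for row in img:
        for p in row:
            hist[int(p)] += 1
    n = sum(hist)
    return -sum((h / n) * math.log2(h / n) for h in hist if h)


def standard_deviation(img):
    # SD: spread of pixel intensities around the image mean
    pixels = [p for row in img for p in row]
    mu = sum(pixels) / len(pixels)
    return math.sqrt(sum((p - mu) ** 2 for p in pixels) / len(pixels))


def spatial_frequency(img):
    # SF: combines horizontal (row) and vertical (column) first differences
    rows, cols = len(img), len(img[0])
    rf = sum((img[y][x] - img[y][x - 1]) ** 2
             for y in range(rows) for x in range(1, cols))
    cf = sum((img[y][x] - img[y - 1][x]) ** 2
             for y in range(1, rows) for x in range(cols))
    return math.sqrt((rf + cf) / (rows * cols))
```

A uniform image scores zero on all three metrics, whereas a high-contrast pattern scores high, which matches the convention that larger values indicate a more informative fused image.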

Ablation Study

In the ablation study, the parameters of the framework of the present invention, including the number of residual blocks and the number of training epochs, were optimized first.

Three comparative experiments are performed to verify the validity of the selected parameters. The number of residual blocks and the number of training epochs can be chosen by trial and error.

For example, the number of residual blocks can be determined by observing how fusion performance, in terms of a particular metric, varies with the number of blocks. Similarly, the optimal number of epochs can be found from a plot of the fusion performance metric values against the number of epochs. In the experiments, one residual block was selected for the framework of the present invention, and 50 epochs were found to be sufficient to train the model. These choices also take computational efficiency into account.

FIG. 6 is a diagram illustrating fusion strategies for intermediate features according to an embodiment of the present invention.

Fusion strategy for intermediate features

For intermediate feature fusion, average-based, norm-based and attention map-based fusion strategies were investigated. FIG. 6 shows the results of fusing two scenes, Trench and House, with the three strategies. Observation of the human head and the corner of the house (highlighted and magnified) shows that the attention-based fusion strategy preserves more brightness information than the norm-based and average-based strategies for intermediate feature fusion.

The fusion results on the TNO dataset were also evaluated with the nine selected fusion evaluation metrics. The quantitative results reported in Table 2 show that the attention map-based fusion strategy again outperforms both the averaging and the norm-based fusion strategies.

Therefore, the attention-based strategy is more suitable for fusing intermediate features.

In FIG. 6, (a) is the infrared image, (b) is the visible light image, (c) is the result of the averaging strategy, (d) is the result of the norm-based strategy, and (e) is the result of the attention map strategy.

The magnified regions of the averaging and norm-based results show less brightness information than the corresponding regions produced by the attention-based fusion strategy.
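The attention map-based strategy can be sketched as follows, based on the verbal description of Equations 1 to 5 in the claims. Because the equation images are not reproduced in this text, the exact forms are assumptions: the Softmax is taken across channels at each position, the attention map is the weighted sum of the features over channels, and the weight map of each source is its attention value divided by the sum over both sources. The features are assumed to be non-negative (for example, ReLU outputs) and stored as `features[channel][y][x]`; all function names are hypothetical.

```python
import math


def softmax(values):
    # Numerically stable softmax over a list of numbers
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    s = sum(exps)
    return [e / s for e in exps]


def attention_map(features):
    # For each position, weight every channel by its softmax probability
    # and sum the weighted features over channels (assumed Equations 1 to 3)
    channels = len(features)
    rows, cols = len(features[0]), len(features[0][0])
    att = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            column = [features[i][y][x] for i in range(channels)]
            weights = softmax(column)
            att[y][x] = sum(w * v for w, v in zip(weights, column))
    return att


def fuse_intermediate(feat_ir, feat_vis):
    # Normalize the two attention maps into per-position weights and blend
    # the features channel by channel (assumed Equations 4 and 5)
    att_ir, att_vis = attention_map(feat_ir), attention_map(feat_vis)
    fused = []
    for ch_ir, ch_vis in zip(feat_ir, feat_vis):
        out = []
        for y, (row_ir, row_vis) in enumerate(zip(ch_ir, ch_vis)):
            out_row = []
            for x, (a, b) in enumerate(zip(row_ir, row_vis)):
                total = att_ir[y][x] + att_vis[y][x]
                # Fall back to equal weights where both attention values are zero
                w_ir = att_ir[y][x] / total if total else 0.5
                out_row.append(w_ir * a + (1.0 - w_ir) * b)
            out.append(out_row)
        fused.append(out)
    return fused
```

In this sketch, a position where only the infrared features are active receives the full infrared response, which is consistent with the observation that the attention-based strategy preserves more brightness information than plain averaging.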

FIG. 7 is a diagram illustrating fusion strategies for compensation features according to an embodiment of the present invention.

Fusion strategy for compensation features

For compensation feature fusion, pixel-wise summation, concatenation and max-selection fusion strategies were investigated. FIG. 7 shows the results of fusing two image pairs, Bunker and Jeep, with the three strategies.

In FIG. 7, (a) is the IR image, (b) is the visible light image, (c) is the result of the concatenation strategy, (d) is the result of the pixel-wise summation strategy, and (e) is the result of the max-selection strategy. The magnified regions show that the roof and window areas fused with the max-selection strategy are clearer than those produced by the other two strategies.

According to visual assessment, the pixel-wise summation and concatenation strategies perform similarly in preserving the brightness information of the IR image, and their results are slightly darker than that of the max-selection strategy. In particular, the roof and window areas (highlighted and magnified) fused with the max-selection strategy are clearer than with the other two strategies.

The three strategies were further tested on the TNO dataset using the selected fusion metrics.

The results in Table 3 show that the max-selection strategy achieves the best overall performance in terms of the fusion metrics. Therefore, the max-selection strategy is adopted in the framework of the present invention.
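The three compensation feature strategies compared above can be sketched as follows. This is an illustration only: features are assumed to be stored as `features[channel][y][x]`, concatenation is assumed to stack the two sources along the channel axis, and the function names are hypothetical.

```python
def _elementwise(op, feat_a, feat_b):
    # Apply a binary operation position by position to two feature stacks
    return [[[op(a, b) for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(feat_a, feat_b)]


def fuse_max(feat_ir, feat_vis):
    # Max-selection: keep the stronger response at every position
    return _elementwise(max, feat_ir, feat_vis)


def fuse_sum(feat_ir, feat_vis):
    # Pixel-wise summation of the two sources
    return _elementwise(lambda a, b: a + b, feat_ir, feat_vis)


def fuse_concat(feat_ir, feat_vis):
    # Concatenation along the channel axis (doubles the channel count)
    return feat_ir + feat_vis


ir = [[[1.0, 5.0], [0.0, 2.0]]]
vis = [[[3.0, 4.0], [0.5, 2.0]]]
print(fuse_max(ir, vis))  # [[[3.0, 5.0], [0.5, 2.0]]]
```

Max-selection keeps the channel count unchanged and passes through whichever source has the stronger detail response at each position, whereas concatenation would require the decoder to accept twice as many channels.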

FIG. 8 is a diagram illustrating a study of the framework with and without skip connections according to an embodiment of the present invention.

To examine the effect of skip connections, two experiments were performed, one with and one without skip connections.

The fusion results for the Sandpath and Lake image pairs are shown in FIG. 8. Without skip connections, detailed information from the visible light image can be severely lost in the fused result. For example, the fence and the chair are barely visible in FIG. 8(c).

In FIG. 8, (a) is the infrared image, (b) is the visible light image, (c) is the result without skip connections, and (d) is the result with skip connections.

The magnified regions of the results without skip connections suffer from missing detail. In contrast, the magnified regions of the results obtained with the method of the present invention preserve the details of the source images well.

Fusion with skip connections thus preserves more detail.

In addition, the quantitative results in Table 4 show that fusion with skip connections is generally superior. Therefore, skip connections are used in the framework of the present invention to retain detailed information during image reconstruction.
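The role of the skip connections in the decoder can be sketched as follows, following the element-wise summation recited in the claims for Equations 8 and 9. This is a minimal illustration: the deconvolution itself is not modeled, features are assumed to be stored as `features[channel][y][x]`, and `add_skip` is a hypothetical helper name.

```python
def add_skip(deconv_out, fused_comp):
    # Element-wise sum of a deconvolution layer output and the matching
    # fused compensation features delivered over the skip connection
    if len(deconv_out) != len(fused_comp):
        raise ValueError("channel counts must match for element-wise summation")
    return [[[d + c for d, c in zip(row_d, row_c)]
             for row_d, row_c in zip(ch_d, ch_c)]
            for ch_d, ch_c in zip(deconv_out, fused_comp)]


deconv_out = [[[0.2, 0.0], [0.1, 0.4]]]   # coarse structure from the decoder
fused_comp = [[[0.0, 0.5], [0.0, 0.25]]]  # fine detail from the encoder side
next_input = add_skip(deconv_out, fused_comp)
print(next_input)  # [[[0.2, 0.5], [0.1, 0.65]]]
```

Without the skip connection, the next decoder layer would receive only `deconv_out`, so detail that exists only in the compensation features (the second column in this toy example) would not reach the reconstruction.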

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited thereto; various modifications and improvements made by those skilled in the art using the basic concept of the present invention defined in the following claims also fall within the scope of the present invention.

100: high-definition image fusion device
110: encoder unit
120: fusion unit
130: decoder unit

Claims

A high-definition image fusion device comprising:
an encoder unit composed of one or more convolutional layers that function as a feature extractor for extracting features from each of the source images, namely an infrared image and a visible light image, the encoder unit outputting intermediate features that reflect the main structure of the source images and compensation features that represent the details of the source images;
a fusion unit that, when fusing the intermediate features and the compensation features, extracts from the feature maps, using the intermediate features, an attention map that reflects the features of each source image, generates fused intermediate features using the generated attention map, and generates fused compensation features using the compensation features; and
a decoder unit composed of one or more deconvolution layers, which decodes the fused intermediate features and the fused compensation features to output a final fused image,
wherein the encoder unit adds one residual block after the last convolutional layer, the residual block consists of an input feature x and F(x) composed of one or more weight layers, and the output of F(x) and the input feature x are combined element-wise and then passed through a rectified linear unit (ReLU) function,
wherein the fusion unit calculates a probability weight map with a Softmax function according to Equations 1 and 2 below,
[Equation 1]

where the quantity given by Equation 1 is the probability weight map of each feature, (x, y) denotes the same position in all intermediate feature channels, i is the channel number, the Softmax function is applied to the intermediate features generated by the residual block, k denotes the kth input source image (k = 1 for the infrared (IR) image and k = 2 for the visible light image), i runs up to the total number of intermediate features of the input image, and res denotes the residual layer output,
[Equation 2]

where the argument of the Softmax function is an element of the vector sequence,
wherein the fusion unit generates an attention map for each source image by multiplying all intermediate features by their corresponding probability weights and summing the products, according to Equation 3 below,
[Equation 3]

where the quantity given by Equation 3 is the attention map, which reflects the activity level measurement of the source image,
wherein the fusion unit generates the fused intermediate features using Equations 4 and 5 below,
[Equation 4]

[Equation 5]

where j denotes the jth input source image, Equation 4 gives the optimal weight map for the features of the jth source image, and Equation 5 gives the fused intermediate features that are sent to the decoder unit,
wherein the fusion unit generates fused first compensation features for the first convolution using Equation 6 below,
[Equation 6]

where the three feature terms denote the infrared features, the visible light features and the fused first compensation features of the first convolution, respectively, m is the total number of first convolutions, (x, y) are the pixel coordinates of the feature, and the maximum selection function is applied element-wise,
wherein the fusion unit generates fused second compensation features for the second convolution using Equation 7 below,
[Equation 7]

where the three feature terms denote the infrared features, the visible light features and the fused second compensation features of the second convolution, respectively.

(Deleted)

The high-definition image fusion device of claim 1,
wherein the decoder unit generates a first input feature by computing the element-wise summation of the output of the first deconvolution layer and the fused first compensation features received from the fusion unit, according to Equation 8 below.
[Equation 8]

Here, the quantity given by Equation 8 is the first input feature, and the deconvolution term in Equation 8 is the output of the first deconvolution layer.

The high-definition image fusion device of claim 1,
wherein the decoder unit generates a second input feature by computing the element-wise summation of the output of the second deconvolution layer and the fused second compensation features received from the fusion unit, according to Equation 9 below.
[Equation 9]

Here, the quantity given by Equation 9 is the second input feature, and the deconvolution term in Equation 9 is the output of the second deconvolution layer.