KR102589551B1

KR102589551B1 - Multi-scale object detection method and device

Info

Publication number: KR102589551B1
Application number: KR1020220130174A
Authority: KR
Inventors: 백준기; 박혁진
Original assignee: 중앙대학교 산학협력단
Priority date: 2022-10-12
Filing date: 2022-10-12
Publication date: 2023-10-13
Also published as: KR102589551B9

Abstract

A multi-scale object detection method and device are disclosed. The multi-scale object detection device includes: a backbone network module having a plurality of encoders arranged in a bottom-up hierarchy; a plurality of decoders arranged in a top-down hierarchy; a plurality of deformable convolution network (DCN) modules that apply a deformable convolution network algorithm to the feature maps output by the encoders and pass the results to the decoders; and a plurality of up-sampling modules, each located between decoders, that output an up-sampling result by applying global average pooling to the feature map output by the upper-layer decoder. Each decoder receives the element-wise sum of the up-sampling result and the lower-layer feature map delivered through the DCN modules, performs a deconvolution operation on it, and outputs a multi-scale feature map.

Description

Multi-scale object detection method and device

The present invention relates to a multi-scale object detection method and device.

Object detection in images is a fundamental task, yet it remains a difficult problem. In computer vision, deep-learning-based object detection methods are widely used in fields such as medical imaging, face detection, object tracking, pedestrian detection, and autonomous driving.

Convolutional neural networks (CNNs) have developed rapidly in recent years, starting with AlexNet, and have greatly improved results in object detection.

Pyramid methods for detecting multi-scale objects use multi-scale feature maps from the upper layers down to the lower layers at the cost of increased computational load and memory, but meaningful information is lost when the multi-scale feature maps are fused.

An object of the present invention is to provide a multi-scale object detection method and device.

A further object of the present invention is to provide a multi-scale object detection method and device that can reduce the loss of meaningful information during up-sampling when detecting multi-scale objects using a pyramid network structure.

According to one aspect of the present invention, an object detection device is provided that can reduce the loss of meaningful information when fusing multi-scale feature maps in a pyramid network structure.

According to one embodiment of the present invention, an object detection device may be provided that includes: a backbone network module having a plurality of encoders arranged in a bottom-up hierarchy; a plurality of decoders arranged in a top-down hierarchy; a plurality of deformable convolution network (DCN) modules that apply a deformable convolution network algorithm to the feature maps output by the encoders and pass the results to the decoders; and a plurality of up-sampling modules, each located between decoders, that output an up-sampling result by applying global average pooling to the feature map output by the upper-layer decoder, wherein each of the plurality of decoders receives the element-wise sum of the up-sampling result and the lower-layer feature map delivered through the plurality of DCN modules, performs a deconvolution operation, and outputs a multi-scale feature map.

Each of the plurality of up-sampling modules may include: a first branch module that generates an attention map by applying global average pooling and a convolution operation to the feature map output by the upper-layer decoder; a second branch module that resizes the feature map output by the upper-layer decoder by applying neighbor interpolation; and a fusion module that outputs the up-sampling result by fusing the attention map output by the first branch module with the resized feature map output by the second branch module.

The n-th DCN module (n being a natural number) applies a deformable convolution operation to the n-th feature map output by the n-th encoder and passes the result to the n-th decoder, and the n-th decoder may fuse the up-sampling result output by the n-th up-sampling module with the feature map delivered through the n-th DCN module.

The n-th DCN module may fix the number of channels at 256 through point-wise conversion and output the result of the deformable convolution operation to the n-th decoder.

According to another aspect of the present invention, an object detection method is provided that can reduce the loss of meaningful information when fusing multi-scale feature maps in a pyramid network structure.

According to one embodiment of the present invention, an object detection method may be provided that includes: (a) an encoding step of generating a plurality of multi-scale feature maps by applying convolution operations through a plurality of encoders arranged in a bottom-up hierarchy; (b) a step of applying a deformable convolution network algorithm to each feature map output by each encoder; (c) an up-sampling step, performed between a plurality of decoders arranged in a top-down hierarchy, of outputting an up-sampling result by applying global average pooling to the feature map output by the upper-layer decoder; and (d) a decoding step of applying to a decoder the element-wise sum of the up-sampling result and the result of the deformable convolution operation, and outputting the result of a deconvolution operation.

In step (b), the number of channels may be fixed at 256 through point-wise conversion, and the result of the deformable convolution operation may be output.

Step (c) may include: generating an attention map by applying global average pooling and a convolution operation to the feature map output by the upper-layer decoder; resizing the feature map output by the upper-layer decoder by applying a neighbor interpolation algorithm; and outputting the up-sampling result by fusing the attention map with the resized feature map.

In step (d), the deconvolution operation may be applied to the result of fusing the n-th up-sampling result with the result of applying the deformable convolution network algorithm to the n-th feature map output by the n-th encoder.

The multi-scale object detection method and device according to an embodiment of the present invention can reduce the loss of meaningful information that occurs during up-sampling when detecting multi-scale objects using a pyramid network structure.

As a result, the present invention has the advantage of enabling accurate multi-scale object detection.

FIG. 1 is a block diagram schematically showing the configuration of a multi-scale object detection device according to an embodiment of the present invention.
FIG. 2 is a diagram showing a pyramid network structure according to an embodiment of the present invention.
FIG. 3 is a diagram showing an up-sampling structure according to an embodiment of the present invention.
FIG. 4 is a diagram comparing multi-scale object detection according to the prior art and according to an embodiment of the present invention.
FIG. 5 is a flowchart showing a multi-scale object detection method according to an embodiment of the present invention.

As used herein, singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "consists of" or "comprises" should not be construed as necessarily including all of the components or steps described in the specification; some of those components or steps may be omitted, and additional components or steps may be included. In addition, terms such as "... unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of the two.

Hereinafter, embodiments of the present invention are described in detail with reference to the attached drawings.

FIG. 1 is a block diagram schematically showing the configuration of a multi-scale object detection device according to an embodiment of the present invention, FIG. 2 is a diagram showing a pyramid network structure according to an embodiment of the present invention, FIG. 3 is a diagram showing an up-sampling structure according to an embodiment of the present invention, and FIG. 4 is a diagram comparing multi-scale object detection according to the prior art and according to an embodiment of the present invention.

The multi-scale object detection device 100 described below detects multi-scale objects based on a pyramid network structure such as that shown in FIG. 2. The backbone network module and the decoder module described below should be understood as generating feature maps in a bottom-up hierarchy and a top-down hierarchy, respectively.

Referring to FIG. 1, the multi-scale object detection device 100 according to an embodiment of the present invention includes a backbone network module 110, a plurality of DCN modules 120, a decoder module 130, a plurality of up-sampling modules 140, an object detection unit 145, a memory 150, and a processor 160.

The backbone network module 110 includes a plurality of encoders arranged in a bottom-up hierarchy.

As shown in FIG. 2, each encoder of the backbone network module 110 applies a convolution operation to the input image or to a feature map, so that the encoders together extract multi-scale feature maps.

For ease of understanding and explanation, the description below assumes that the backbone network module 110 has a first encoder, a second encoder, and a third encoder.

The first encoder may apply a convolution operation to the input image to extract a first feature map having a first resolution. The second encoder may apply a convolution operation to the first feature map to extract a second feature map having a second resolution. The third encoder may apply a convolution operation to the second feature map to extract a third feature map having a third resolution. The first feature map may have the highest resolution and the third feature map the lowest.
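The resolution ordering of the three encoder outputs can be illustrated with a small sketch. The text does not state how much each encoder reduces the resolution, so the per-encoder stride of 2 and the function name `encoder_resolutions` are assumptions made purely for illustration.

```python
def encoder_resolutions(height, width, num_encoders=3, stride=2):
    """Return the (height, width) of each encoder's feature map.

    Assumption (not stated in the source): every encoder halves the
    spatial resolution of its input, i.e. stride == 2.
    """
    sizes = []
    h, w = height, width
    for _ in range(num_encoders):
        h, w = h // stride, w // stride
        sizes.append((h, w))
    return sizes


# First, second, and third feature-map sizes for a 256 x 256 input image.
sizes = encoder_resolutions(256, 256)
```

Under these assumptions the first feature map is the largest and the third the smallest, matching the ordering described above.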

The plurality of deformable convolution network (DCN) modules 120 apply a deformable convolution operation to the feature map output by each encoder, fix the number of channels at 256 through point-wise conversion, and output the result to the decoders.

For example, the plurality of DCN modules 120 are referred to below as a first DCN module, a second DCN module, and a third DCN module.

The first DCN module may apply a deformable convolution operation to the first feature map output by the first encoder and pass the result to the first decoder. The second DCN module may apply a deformable convolution operation to the second feature map output by the second encoder and pass the result to the second decoder. The third DCN module may apply a deformable convolution operation to the third feature map output by the third encoder and pass the result to the third decoder.

Applying a deformable convolution operation to the feature map generated by each encoder produces salient context information for each feature map, and passing this information to the decoder reduces the loss of semantic information in the upper-layer feature maps.

This embodiment assumes that DCNv2 is applied; because DCN is well known to those skilled in the art, a detailed description of it is omitted.

In addition, each DCN module (that is, the first to third DCN modules) may expand the number of channels to 256 when applying the deformable convolution operation to the encoder's feature map.
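The channel adjustment can be sketched as follows, assuming that the "point-wise conversion" mentioned above acts like a point-wise (1 × 1) convolution that mixes channels independently at every pixel. The function name `pointwise_conv`, the `[channel][row][column]` list layout, and the toy sizes are illustrative assumptions; the text fixes the output at 256 channels, while the toy example uses 4 to stay readable.

```python
def pointwise_conv(feature, weights, bias=None):
    """Point-wise (1 x 1) convolution.

    feature: nested list indexed as [C_in][H][W]
    weights: nested list indexed as [C_out][C_in]
    Returns a nested list indexed as [C_out][H][W].
    """
    c_in = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    out = []
    for o, row in enumerate(weights):
        b = bias[o] if bias is not None else 0.0
        # Each output pixel is a weighted sum over input channels only.
        plane = [
            [b + sum(row[c] * feature[c][y][x] for c in range(c_in))
             for x in range(width)]
            for y in range(height)
        ]
        out.append(plane)
    return out


# Toy input: 2 channels of size 2 x 2, expanded to 4 channels.
feature = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
projected = pointwise_conv(feature, weights)
```

Because the kernel covers a single pixel, the spatial size is unchanged and only the channel count differs between input and output.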

The result of each DCN module can be expressed as Equation 1.

[Equation 1]
$$y(p) = \sum_{k=1}^{K} w_k \cdot x(p + p_k + \Delta p_k) \cdot \Delta m_k$$

Here, the sum runs over the $K$ sampling positions of the kernel; $p_k$ and $w_k$ denote the offset and weight for the k-th position, respectively; $y(p)$ and $x(p)$ denote the features at position $p$ in the output feature map and the input feature map, respectively; and $\Delta p_k$ and $\Delta m_k$ denote the learnable offset and the modulation scalar for the k-th position, respectively.

Such a DCN module improves deformable convolution by fine-tuning the spatial support region with a modulation scalar in the range [0, 1], while the offsets are real numbers with unbounded range. The module uses a flexible kernel that adapts easily, and the modulation scalar allows fine adjustment of the receptive field. The DCN module can therefore focus on sparse spatial locations more strongly and efficiently than a fixed kernel.
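As a rough illustration of Equation 1, the sketch below evaluates a modulated deformable convolution at a single output position of a single-channel map. It assumes a 3 × 3 kernel, bilinear interpolation for fractional sampling positions (as is usual for deformable convolution), and zero contribution from positions outside the map; the function names and the list-based layout are illustrative and are not taken from the patent.

```python
import math


def bilinear_sample(x, py, px):
    """Sample the 2-D map x at a real-valued position (py, px).

    Positions that fall outside the map contribute zero.
    """
    h, w = len(x), len(x[0])
    y0, x0 = math.floor(py), math.floor(px)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(py - yy)) * (1.0 - abs(px - xx))
                value += weight * x[yy][xx]
    return value


# Pre-specified offsets p_k of a regular 3 x 3 kernel (an assumption).
GRID_3X3 = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def modulated_deform_conv_at(x, p, weights, offsets, modulation):
    """y(p) = sum_k w_k * x(p + p_k + dp_k) * dm_k for one position p.

    weights:    9 kernel weights w_k
    offsets:    9 learnable offsets dp_k as (dy, dx) pairs
    modulation: 9 modulation scalars dm_k, each expected in [0, 1]
    """
    py, px = p
    total = 0.0
    for (gy, gx), w_k, (oy, ox), m_k in zip(GRID_3X3, weights, offsets, modulation):
        total += w_k * bilinear_sample(x, py + gy + oy, px + gx + ox) * m_k
    return total
```

With all offsets set to zero and all modulation scalars set to one, the function reduces to an ordinary 3 × 3 convolution at that position; setting a modulation scalar to zero removes the corresponding sampling position from the sum.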

Using the DCN module to pass the encoder's output feature map to the decoder improves the representational power of the feature map generated at each stage, and extracting high-level features from feature maps that are more robust to geometric deformation improves object detection performance.

The decoder module 130 includes a plurality of decoders arranged in a top-down hierarchy. Each decoder receives the result of fusing an upper-layer feature map with a lower-layer feature map and applies a deconvolution operation to it.

Each decoder is described in more detail below with reference to FIG. 2.

After the first DCN module is applied, the output of the first encoder is fused with the output (up-sampling result) of the first attention module and then input to the first decoder. The first decoder can then apply a deconvolution operation to the fused input.

That is, the first decoder receives a feature map obtained by fusing the result of up-sampling the upper-layer decoder's feature map through the first up-sampling module with the lower-layer feature map obtained by applying a deformable convolution operation to the first encoder's feature map, and applies a deconvolution operation to it.

The other decoders operate in the same way. However, because no upper-layer decoder exists above the top-layer decoder, the top-layer decoder receives the result of applying the DCN module to the top-layer encoder's feature map and applies the deconvolution operation to it.

The up-sampling module 140 is located between one decoder and the next, and outputs an up-sampling result by applying global average pooling to the feature map output by the upper-layer decoder.

The detailed structure of the up-sampling module 140 is shown in FIG. 3.

The up-sampling module 140 includes a first branch module 310, a second branch module 320, and a fusion module 330.

The first branch module 310 generates an attention map by applying a global average pooling operation and a convolution operation to the output (feature map) of the upper-layer decoder.

By applying the global average pooling operation to the feature map output by the upper-layer decoder, the first branch module 310 summarizes and reflects spatial information over the entire image and can thus extract globally important features. Applying the global average pooling operation to the feature map in this way also makes it possible to emphasize edges and contours.

That is, a global average pooling operation is performed on the feature map output by the upper-layer decoder, global context information is extracted through a 1 × 1 × C convolution operation, and an attention map is then generated through batch normalization and a ReLU operation.
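The first branch can be sketched as the sequence global average pooling, 1 × 1 convolution, batch normalization, and ReLU. In the sketch below the attention map is represented as one scalar per channel, and batch normalization is applied in its inference form with fixed statistics; both choices, along with all names and parameter values, are assumptions made for illustration.

```python
import math


def global_average_pool(feature):
    """Reduce a [C][H][W] feature map to a list of C channel means."""
    return [
        sum(sum(row) for row in plane) / (len(plane) * len(plane[0]))
        for plane in feature
    ]


def attention_vector(feature, conv_w, bn_mean, bn_var, gamma, beta, eps=1e-5):
    """Global average pooling -> 1 x 1 convolution -> batch norm -> ReLU.

    conv_w is indexed as [C_out][C_in]. The batch-norm statistics
    (bn_mean, bn_var) and affine parameters (gamma, beta) are assumed to be
    given per output channel, as they would be at inference time.
    """
    pooled = global_average_pool(feature)
    # On a 1 x 1 spatial map, a 1 x 1 convolution is a matrix-vector product.
    conv = [sum(w * v for w, v in zip(row, pooled)) for row in conv_w]
    normed = [
        g * (c - m) / math.sqrt(v + eps) + b
        for c, m, v, g, b in zip(conv, bn_mean, bn_var, gamma, beta)
    ]
    return [max(0.0, n) for n in normed]
```

Because ReLU clips negative values to zero, channels whose normalized response is negative are suppressed entirely when the attention values are later multiplied into the resized feature map.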

The second branch module 320 may resize the feature map output by the upper-layer decoder by applying an interpolation operation to it.

For example, the second branch module 320 may resize the feature map by applying a neighbor interpolation operation.
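A minimal sketch of the resizing step follows, assuming that "neighbor interpolation" means nearest-neighbor interpolation and that the feature map is enlarged by an integer factor of 2. Neither the factor nor the function name comes from the text.

```python
def nearest_upsample(plane, scale=2):
    """Enlarge a 2-D map [H][W] to [H*scale][W*scale].

    Every output pixel copies the nearest input pixel, so no new values
    are created; each input value is simply repeated scale x scale times.
    """
    height, width = len(plane), len(plane[0])
    return [
        [plane[y // scale][x // scale] for x in range(width * scale)]
        for y in range(height * scale)
    ]


upsampled = nearest_upsample([[1, 2], [3, 4]])
```

For a multi-channel feature map the same function would be applied to each channel independently.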

The fusion module 330 multiplies the output of the first branch module 310 (the attention map) by the output of the second branch module 320 (the resized feature map) to generate an up-sampling result that focuses on the semantic information of the upper-layer feature map.
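The multiplication performed by the fusion module can be sketched as a per-channel scaling, assuming the attention map holds one value per channel that is broadcast over all spatial positions. The function name `fuse` and the data layout are illustrative assumptions.

```python
def fuse(attention, resized):
    """Scale each channel of the resized feature map by its attention value.

    attention: list of C scalars (one per channel)
    resized:   nested list indexed as [C][H][W]
    """
    if len(attention) != len(resized):
        raise ValueError("attention and feature map must have the same channel count")
    return [
        [[a * v for v in row] for row in plane]
        for a, plane in zip(attention, resized)
    ]


fused = fuse([2.0, 0.0], [[[1.0, 2.0]], [[3.0, 4.0]]])
```

Channels with a large attention value are amplified, while channels with an attention value of zero are removed from the up-sampling result.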

In a conventional pyramid network structure, up-sampling the upper-layer feature map and fusing it with the lower-layer feature map unavoidably produces checkerboard artifacts due to overlap, which degrades performance.

To solve this, as in an embodiment of the present invention, a global average pooling operation and a convolution operation are performed on the upper-layer decoder's feature map to generate an attention map, which is then multiplied by the up-sampled map (that is, the upper-layer decoder's feature map resized through interpolation) to generate the up-sampling result. This has the advantage of producing an up-sampling result that attends to the semantic information of the upper-layer feature map.

Each decoder may receive the element-wise sum of the up-sampling result for the upper-layer feature map and the lower-layer feature map delivered through the DCN module, and then output the result of a deconvolution operation.
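The element-wise sum that forms each decoder's input can be sketched as below. The shape check reflects the requirement that both inputs have the same channel count and spatial size, which the channel fixing and resizing steps are presumably meant to guarantee; the function name is an illustrative assumption.

```python
def elementwise_sum(upsampled, lateral):
    """Add two [C][H][W] feature maps position by position."""
    shape_a = (len(upsampled), len(upsampled[0]), len(upsampled[0][0]))
    shape_b = (len(lateral), len(lateral[0]), len(lateral[0][0]))
    if shape_a != shape_b:
        raise ValueError(f"shape mismatch: {shape_a} vs {shape_b}")
    return [
        [[u + v for u, v in zip(row_a, row_b)] for row_a, row_b in zip(plane_a, plane_b)]
        for plane_a, plane_b in zip(upsampled, lateral)
    ]


merged = elementwise_sum([[[1.0, 2.0], [3.0, 4.0]]], [[[10.0, 20.0], [30.0, 40.0]]])
```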

The object detection unit 145 may extract each object using the multi-scale feature maps output through the decoder module 130.

The memory 150 stores program code for providing an object detection method according to an embodiment of the present invention that can reduce the loss of meaningful information when fusing multi-scale feature maps in a pyramid network structure.

The processor 160 is a means for controlling the internal components of the object detection device 100 according to an embodiment of the present invention (for example, the backbone network module 110, the plurality of DCN modules 120, the decoder module 130, the plurality of up-sampling modules 140, the object detection unit 145, and the memory 150).

FIG. 4 shows a comparison between the prior art and an embodiment of the present invention. The first and second rows of FIG. 4 compare object detection results on images containing objects of various sizes, and show that the object detection device according to the present invention can successfully detect even small objects.

The third and fourth rows of FIG. 4 compare object detection results on images containing large objects. Whereas the conventional method detects several objects incorrectly, the object detection device according to an embodiment of the present invention successfully detects large objects because it performs feature fusion while attending to global context information.

FIG. 5 is a flowchart showing a multi-scale object detection method according to an embodiment of the present invention. The multi-scale object detection device 100 described below should be broadly interpreted as operating on the basis of a pyramid network structure.

In step 510, the multi-scale object detection device 100 applies the input image to a backbone network module composed of a plurality of encoders arranged in a bottom-up hierarchy to extract a plurality of feature maps at multiple scales.

In step 515, the multi-scale object detection device 100 applies a deformable convolution network algorithm to the feature map of each encoder of the backbone network module and outputs the result to the corresponding decoder.

That is, the result of applying the DCN module to the feature map of the first encoder, which has the first resolution, is passed on for use as an input to the first decoder, which has the same first resolution. This result is not input to the first decoder directly; as shown in FIG. 2, it is fused with the result of up-sampling the upper-layer decoder's output and then input to the first decoder.

However, as shown in FIG. 2, no decoder is located above the top-layer decoder, so the result of applying the DCN module to the top-layer encoder's feature map does not go through the up-sampling process. That result can therefore be used directly as the input of the top-layer decoder.

In step 520, the multi-scale object detection device 100 applies global average pooling to the output (feature map) of the upper-layer decoder in the up-sampling module located between the decoders of the top-down hierarchy, and outputs an up-sampling result.

As described above, the multi-scale object detection device 100 generates an attention map by applying global average pooling and a convolution operation to the output (feature map) of the upper-layer decoder through the first branch module, and resizes the output (feature map) of the upper-layer decoder by interpolation through the second branch module.

The multi-scale object detection device 100 may then multiply the attention map by the resized feature map and output the up-sampling result to the lower-layer decoder.

In step 525, the multi-scale object detection device 100 takes the element-wise sum of the result of up-sampling the upper-layer decoder's feature map and the result of applying the DCN module to the feature map of the encoder at the same layer (resolution), and outputs the result of a deconvolution operation performed by the decoder.

The multi-scale object detection device 100 detects multi-scale objects using the results (multi-resolution feature maps) output by the plurality of decoders of this top-down hierarchy.
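The data flow of steps 510 to 525 can be summarized in a short sketch. Every stage is passed in as a plain callable, so the sketch shows only the wiring between encoders, DCN modules, up-sampling modules, and decoders; the function name `run_pyramid`, the argument names, and the convention that index 0 is the lowest (highest-resolution) layer are all assumptions made for illustration.

```python
def run_pyramid(image, encoders, dcn_modules, upsamplers, decoders, add):
    """Wire the stages together; index 0 is the highest-resolution layer.

    encoders, dcn_modules, decoders: one callable per layer
    upsamplers: one callable per layer except the top one
    add: callable that fuses two feature maps (element-wise sum)
    """
    # Step 510: bottom-up encoding, each encoder consumes the previous output.
    features = []
    x = image
    for encoder in encoders:
        x = encoder(x)
        features.append(x)

    # Step 515: apply the DCN module of each layer to its encoder feature map.
    laterals = [dcn(f) for dcn, f in zip(dcn_modules, features)]

    # The top-layer decoder has no decoder above it, so it takes the
    # DCN result directly, without up-sampling or fusion.
    top = len(decoders) - 1
    outputs = [None] * len(decoders)
    outputs[top] = decoders[top](laterals[top])

    # Steps 520 and 525: top-down pass through the remaining decoders.
    for n in range(top - 1, -1, -1):
        upsampled = upsamplers[n](outputs[n + 1])
        outputs[n] = decoders[n](add(upsampled, laterals[n]))
    return outputs
```

A quick way to check the wiring is to substitute scalar functions for the stages, for example encoders that add 1, identity DCN modules and decoders, and up-sampling modules that double their input; the returned list then contains one value per decoder, ordered from the lowest layer to the top layer.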

본 발명의 실시 예에 따른 장치 및 방법은 다양한 컴퓨터 수단을 통하여 수행될 수 있는 프로그램 명령 형태로 구현되어 컴퓨터 판독 가능 매체에 기록될 수 있다. 컴퓨터 판독 가능 매체는 프로그램 명령, 데이터 파일, 데이터 구조 등을 단독으로 또는 조합하여 포함할 수 있다. 컴퓨터 판독 가능 매체에 기록되는 프로그램 명령은 본 발명을 위하여 특별히 설계되고 구성된 것들이거나 컴퓨터 소프트웨어 분야 통상의 기술자에게 공지되어 사용 가능한 것일 수도 있다. 컴퓨터 판독 가능 기록 매체의 예에는 하드 디스크, 플로피 디스크 및 자기 테이프와 같은 자기매체(magnetic media), CD-ROM, DVD와 같은 광기록 매체(optical media), 플롭티컬 디스크(floptical disk)와 같은 자기-광 매체(magneto-optical media) 및 롬(ROM), 램(RAM), 플래시 메모리 등과 같은 프로그램 명령을 저장하고 수행하도록 특별히 구성된 하드웨어 장치가 포함된다. 프로그램 명령의 예에는 컴파일러에 의해 만들어지는 것과 같은 기계어 코드뿐만 아니라 인터프리터 등을 사용해서 컴퓨터에 의해서 실행될 수 있는 고급 언어 코드를 포함한다.Devices and methods according to embodiments of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. Computer-readable media may include program instructions, data files, data structures, etc., singly or in combination. Program instructions recorded on a computer-readable medium may be specially designed and constructed for the present invention or may be known and usable by those skilled in the computer software field. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical media such as CD-ROMs and DVDs, and magnetic media such as floptical disks. -Includes magneto-optical media and hardware devices specifically configured to store and execute program instructions, such as ROM, RAM, flash memory, etc. Examples of program instructions include machine language code, such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

The present invention has been described above with a focus on its embodiments. A person of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered illustrative rather than restrictive. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

100: object detection device
110: backbone network module
120: DCN module
130: decoder module
140: up-sampling module
145: object detection unit
150: memory
160: processor

Claims

An object detection device comprising:
a backbone network module having a plurality of encoders with a bottom-up hierarchical structure;
a plurality of decoders with a top-down hierarchical structure;
a plurality of DCN (deformable convolution network) modules that apply a deformable convolutional network algorithm to the feature maps output from the plurality of encoders and transmit the results to the decoders; and
a plurality of up-sampling modules, each located between the plurality of decoders, that output an up-sampling result by applying global average pooling to the feature map output from the upper-layer decoder,
wherein each of the plurality of decoders receives the element-wise sum of the up-sampling result and the lower-layer feature map transmitted through the plurality of DCN modules, performs a deconvolution operation, and outputs a multi-scale feature map,
wherein each of the plurality of up-sampling modules includes:
a first branch module that generates an attention map by applying global average pooling and convolution operations to the feature map output from the upper-layer decoder;
a second branch module that resizes the feature map output from the upper-layer decoder by applying neighborhood interpolation; and
a fusion module that outputs the up-sampling result by fusing the attention map, which is the output of the first branch module, with the resized feature map, which is the output of the second branch module,
wherein the n-th (n is a natural number) DCN module applies a deformable convolution operation to the n-th feature map, which is the output of the n-th encoder, and transmits the result to the n-th decoder, and
wherein the n-th decoder fuses the up-sampling result, which is the output of the n-th up-sampling module, with the feature map transmitted through the n-th DCN module.
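
For illustration only, the two-branch up-sampling module recited above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the function name `upsample_module`, the parameters `weight` and `bias`, the sigmoid activation, the reading of "neighborhood interpolation" as nearest-neighbor interpolation, and fusion by channel-wise multiplication are all assumptions, since the claim does not specify them.

```python
import numpy as np

def upsample_module(feat, weight, bias, scale=2):
    """Sketch of one up-sampling module for a (C, H, W) feature map.

    weight (C, C) and bias (C,) are hypothetical 1x1-convolution parameters.
    """
    # First branch: global average pooling collapses each channel to one value.
    pooled = feat.mean(axis=(1, 2))                      # shape (C,)
    # A 1x1 convolution on a 1x1 map reduces to a matrix-vector product.
    logits = weight @ pooled + bias                      # shape (C,)
    # Assumption: a sigmoid squashes the attention weights into (0, 1).
    attention = 1.0 / (1.0 + np.exp(-logits))

    # Second branch (assumed nearest-neighbor): repeat each pixel `scale` times.
    resized = feat.repeat(scale, axis=1).repeat(scale, axis=2)  # (C, sH, sW)

    # Fusion (assumption): channel-wise re-weighting of the resized map.
    return resized * attention[:, None, None]

feat = np.arange(8, dtype=float).reshape(2, 2, 2)
out = upsample_module(feat, np.zeros((2, 2)), np.zeros(2))
print(out.shape)  # (2, 4, 4)
```

With zero weights and bias, every attention value is 0.5, so the output is simply the resized map scaled by one half; learned parameters would instead emphasize or suppress individual channels.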

(deleted)

The object detection device according to claim 1,
wherein the n-th DCN module fixes the number of channels to 256 through point-wise conversion and outputs the result of the deformable convolution operation to the n-th decoder.
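
As a hedged illustration of the channel fixing recited above, the sketch below assumes that "point-wise conversion" means a 1x1 convolution, which mixes channels independently at each pixel. The function name `pointwise_project`, the random weights, and the input channel counts (512, 1024, 2048) are hypothetical; only the output width of 256 comes from the claim. The deformable convolution itself is not shown.

```python
import numpy as np

def pointwise_project(feat, weight):
    """Map a (C_in, H, W) feature map to (C_out, H, W) with a 1x1 convolution."""
    # Each output pixel is a linear mix of the input channels at that same pixel.
    return np.einsum("oc,chw->ohw", weight, feat)

rng = np.random.default_rng(0)
# Hypothetical encoder outputs: channel count grows as resolution shrinks.
encoder_maps = [rng.standard_normal((c, s, s))
                for c, s in [(512, 16), (1024, 8), (2048, 4)]]
projected = [pointwise_project(f, rng.standard_normal((256, f.shape[0])) * 0.01)
             for f in encoder_maps]
print([p.shape for p in projected])  # [(256, 16, 16), (256, 8, 8), (256, 4, 4)]
```

Giving every level the same channel count is what allows the later element-wise sum between an up-sampling result and a DCN output, since both operands must have identical shapes.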

An object detection method comprising:
(a) an encoding step of generating a plurality of multi-scale feature maps by applying a convolution operation through a plurality of encoders with a bottom-up hierarchical structure;
(b) a step of applying a deformable convolutional network algorithm to each feature map output from each encoder;
(c) an up-sampling step, performed between a plurality of decoders having a top-down hierarchical structure, of outputting an up-sampling result by applying global average pooling to the feature map output from the upper-layer decoder; and
(d) a decoding step of inputting, to a decoder, the element-wise sum of the up-sampling result and the result of the deformable convolution operation, and outputting the result of a deconvolution operation,
wherein step (c) includes:
generating an attention map by applying global average pooling and convolution operations to the feature map output from the upper-layer decoder;
resizing the feature map output from the upper-layer decoder by applying a neighborhood interpolation algorithm; and
fusing the attention map and the resized feature map to output the up-sampling result, and
wherein, in step (d), the deconvolution operation is applied to the result of fusing the n-th up-sampling result with the result of applying the deformable convolutional network algorithm to the n-th feature map, which is the output of the n-th encoder.
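
The data flow of steps (b) through (d) can be sketched as follows, purely to show the order in which the feature maps are combined. All function names are hypothetical. The deformable convolution and the deconvolution are replaced by identity placeholders, the up-sampling step omits the learned convolution, nearest-neighbor resizing is assumed, and passing the coarsest DCN output directly to the top decoder is also an assumption.

```python
import numpy as np

def dcn_stub(feat):
    # Placeholder for the deformable convolution of step (b); identity here.
    return feat

def upsample_stub(feat, scale=2):
    # Step (c), simplified: attention from global average pooling (no learned
    # convolution), applied to a nearest-neighbor resized map.
    attention = 1.0 / (1.0 + np.exp(-feat.mean(axis=(1, 2))))
    resized = feat.repeat(scale, axis=1).repeat(scale, axis=2)
    return resized * attention[:, None, None]

def decoder_stub(feat):
    # Placeholder for the deconvolution of step (d); identity here.
    return feat

def top_down(encoder_maps):
    """encoder_maps: list of (C, H, W) arrays ordered from fine to coarse."""
    lateral = [dcn_stub(f) for f in encoder_maps]      # step (b)
    outputs = [decoder_stub(lateral[-1])]              # coarsest level first
    for lat in reversed(lateral[:-1]):
        up = upsample_stub(outputs[-1])                # step (c)
        outputs.append(decoder_stub(up + lat))         # step (d): element-wise sum
    return outputs[::-1]                               # back to fine-to-coarse order

maps = [np.ones((4, s, s)) for s in (16, 8, 4)]
print([o.shape for o in top_down(maps)])  # [(4, 16, 16), (4, 8, 8), (4, 4, 4)]
```

Each level's output has the same shape as the corresponding encoder feature map, so the method yields one feature map per scale.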

The object detection method according to claim 5,
wherein, in step (b), the result of the deformable convolution operation is output with the number of channels fixed to 256 through point-wise conversion.

(deleted)

A computer-readable recording medium recording program code for performing the method according to claim 5 or 6.