KR102612936B1

KR102612936B1 - System and method for detecting multi-size target

Info

Publication number: KR102612936B1
Application number: KR1020230053198A
Authority: KR
Inventors: 김해문; 안종식; 이태영; 최병인
Original assignee: 한화시스템(주)
Priority date: 2023-04-24
Filing date: 2023-04-24
Publication date: 2023-12-12

Abstract

A multi-size target detection system according to the present invention includes: an encoder unit composed of a plurality of convolutional layers that receives an image containing at least one of a predetermined small target, medium target, and large target and outputs a first feature map; a decoder unit composed of a plurality of convolutional layers that is residually connected to the encoder unit and outputs a second feature map restoring the first feature map of the encoder unit; and a result prediction neural network area unit that outputs detection and classification results for objects contained in the image based on the first and second feature maps.

Description

SYSTEM AND METHOD FOR DETECTING MULTI-SIZE TARGET

The present invention relates to a multi-size target detection system and method, and more particularly to a multi-size target detection system and method, based on a neural network with a super-resolution model structure, that can readily detect objects by analyzing images containing objects of various sizes and shapes, such as aerial images containing small, densely packed objects as well as images containing medium and large objects.

In surveillance systems, object detection is a field of computer image processing research that represents the locations of multiple objects in an image as bounding boxes and distinguishes the class of the object in each bounding box.

A surveillance system requires an object detection algorithm for images that exhibit diversity in object size, overlap between objects, similarity between objects and background, and background complexity. For example, images captured during high-altitude and low-altitude flights of an unmanned aerial vehicle exhibit all of these characteristics. Accordingly, the present invention describes the object detection algorithm using aerial images as an example.

FIGS. 1A to 1C show examples of images captured in daily life and from the air.

Meanwhile, with advances in artificial neural network technology, recent object detection algorithms use CNN (Convolutional Neural Network) and Transformer-based feature extraction networks and can detect objects with high accuracy in images of everyday scenes, such as those in the COCO (Microsoft Common Objects in COntext) dataset (FIG. 1A). However, they show low detection accuracy on images containing small objects with high overlap and density (Very Tiny, Tiny, and Small in Table 1), as in FIG. 1B, and on images containing objects of various sizes, as in FIG. 1C.

In addition, a small object detector, whose main purpose is detecting small objects, detects small objects well but shows relatively low detection accuracy on images in which, depending on the shooting angle, objects appear with medium or large sizes and shapes.

Therefore, there is a need for a deep learning-based object detection algorithm that can effectively detect objects in images exhibiting the diversity of object sizes shown in Table 1, overlap between objects, similarity between objects and background, and background complexity.

[Table 1] Categorization by object size (columns: Object Size, Metric; size classes: Very Tiny, Tiny, Small, Medium, Large)

The problem to be solved by the present invention is to provide a multi-size target detection system and method, based on a neural network with a super-resolution model structure, that can readily detect objects by analyzing images containing objects of various sizes and shapes, such as aerial images containing small, densely packed objects as well as images containing medium and large objects.

However, the problems to be solved by the present invention are not limited to those described above, and other problems may exist.

A multi-size target detection system according to a first aspect of the present invention for solving the above-described problem includes: an encoder unit composed of a plurality of convolutional layers that receives an image containing a plurality of small, medium, and large targets and outputs a first feature map; a decoder unit composed of a plurality of convolutional layers that is residually connected to the encoder unit and outputs a second feature map restoring the first feature map of the encoder unit; and a result prediction neural network area unit that outputs detection and classification results for objects contained in the image based on the first and second feature maps.

In some embodiments of the present invention, the convolutional layers of the encoder unit and the decoder unit may be based on the CSPDarkNet feature extraction network.

In some embodiments of the present invention, each convolutional layer of the decoder unit may be connected, based on a feature map analysis result, to the first and second feature maps of a predetermined number of upper and lower layers of the encoder unit and the decoder unit, respectively.

In some embodiments of the present invention, the input feature map of each convolutional layer of the decoder unit may be determined by the following equation from the first feature map output by the encoder unit and the second feature map output by the decoder unit.

[Equation]
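
The equation itself is not reproduced in this text, so the following is only an illustrative sketch of how a decoder layer's input might be assembled from encoder and decoder feature maps. It assumes plain channel-wise concatenation of maps that already share the same spatial size; the function names `spatial_size` and `fuse_decoder_input` and the nested-list layout are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

# A feature map is stored as nested lists indexed [channel][row][col].
FeatureMap = List[List[List[float]]]


def spatial_size(fmap: FeatureMap) -> Tuple[int, int]:
    """Return (height, width) of a feature map."""
    return len(fmap[0]), len(fmap[0][0])


def fuse_decoder_input(encoder_map: FeatureMap,
                       decoder_maps: List[FeatureMap]) -> FeatureMap:
    """Concatenate one encoder map with decoder maps along the channel axis.

    Assumption: fusion is channel concatenation of maps with identical
    height and width. The patent's actual equation may differ.
    """
    target = spatial_size(encoder_map)
    fused = list(encoder_map)
    for dmap in decoder_maps:
        if spatial_size(dmap) != target:
            raise ValueError("feature maps must share height and width")
        fused.extend(dmap)
    return fused


enc = [[[1.0, 2.0], [3.0, 4.0]]]                            # 1 channel, 2x2
dec = [[[0.5, 0.5], [0.5, 0.5]], [[0.0, 1.0], [1.0, 0.0]]]  # 2 channels, 2x2
fused = fuse_decoder_input(enc, [dec])
print(len(fused), spatial_size(fused))
```

Concatenation keeps the encoder's spatial detail and the decoder's restored features side by side, leaving it to the following convolution to weigh them; other fusion rules (for example, element-wise addition) would need equal channel counts.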

Some embodiments of the present invention may further include a parallel segment head unit that receives the second feature map of the decoder unit and improves object detection accuracy.

In some embodiments of the present invention, the parallel segment head unit may be applied during training but not during inference.
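
The training-only behavior described above can be pictured with a toy sketch. The class `ToyDetector`, its methods, and the `training` flag are hypothetical placeholders introduced for illustration; they stand in for the encoder, decoder, result prediction unit, and parallel segment head and perform no real computation.

```python
class ToyDetector:
    """Toy stand-in showing an auxiliary head that runs only in training."""

    def __init__(self):
        self.training = True

    def _extract(self, image):
        # Placeholder for encoder + decoder: returns the input unchanged.
        return image

    def _predict(self, features):
        # Placeholder for the result prediction head.
        return {"detections": features}

    def _segment(self, features):
        # Placeholder for the parallel segment head.
        return {"mask": features}

    def forward(self, image):
        features = self._extract(image)
        outputs = self._predict(features)
        if self.training:
            # The auxiliary branch is evaluated only while training.
            outputs.update(self._segment(features))
        return outputs


model = ToyDetector()
train_out = model.forward([1, 2, 3])
model.training = False
infer_out = model.forward([1, 2, 3])
print(sorted(train_out), sorted(infer_out))
```

Because the segment branch is skipped at inference, it adds supervision during training without adding computation to the deployed detector.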

In some embodiments of the present invention, the parallel segment head unit may be trained based on a ground-truth segmentation mask generated using a one-hot encoding technique.

In some embodiments of the present invention, the ground-truth segmentation mask may be generated by applying the one-hot encoding technique to the bounding box regions of objects of all sizes in the image.
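
As a minimal sketch of how such a mask could be built from bounding box labels, the function below fills each box region with its class and then expands the result into one channel per class. The box format `(x_min, y_min, x_max, y_max, class_id)` with exclusive maxima, the extra background channel, and the rule that a later box overwrites an earlier one are all assumptions of this example, not details stated in the patent.

```python
def boxes_to_one_hot_mask(boxes, height, width, num_classes):
    """Build a [num_classes + 1][height][width] one-hot mask from boxes.

    Channel 0 is background; channel k + 1 marks pixels inside a box of
    class k. Where boxes overlap, the later box wins (an assumption).
    """
    # First pass: integer label map, 0 meaning background.
    label = [[0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max, class_id in boxes:
        for y in range(max(0, y_min), min(height, y_max)):
            for x in range(max(0, x_min), min(width, x_max)):
                label[y][x] = class_id + 1
    # Second pass: expand the label map into one binary channel per class.
    return [
        [[1 if label[y][x] == c else 0 for x in range(width)]
         for y in range(height)]
        for c in range(num_classes + 1)
    ]


mask = boxes_to_one_hot_mask([(1, 1, 3, 3, 0)], height=4, width=4, num_classes=2)
for row in mask[1]:
    print(row)
```

Every pixel ends up with exactly one active channel, which is what makes the mask one-hot.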

In some embodiments of the present invention, the parallel segment head unit is trained together with the result prediction neural network area unit based on the loss function of the following equation, and a loss function based on MS-SSIM (Multi-Scale Structural SIMilarity) may be applied as the loss function of the parallel segment head unit.

[Equation]

In the above [Equation], the terms denote, in order: the loss function of the result prediction neural network area unit; a weight; the means and standard deviations of the network-generated mask and the ground-truth mask; a constant that prevents division by zero in the loss function; and the relative importance of each resolution.
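
Because the equation is not reproduced here, the sketch below only illustrates the general shape of such a loss: a detection loss plus a weighted segmentation loss of the form 1 minus an MS-SSIM score. It is deliberately simplified and is not the patent's formula: SSIM is computed from whole-image statistics instead of sliding windows, the full SSIM term is used at every scale, and the scales come from 2x2 average pooling. The names `global_ssim`, `avg_pool2x`, `ms_ssim_loss`, `total_loss`, and the default constants and weights are assumptions of this example.

```python
def _mean(values):
    return sum(values) / len(values)


def global_ssim(pred, target, c1=1e-4, c2=9e-4):
    """SSIM from whole-image mean, variance and covariance (no windows)."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    mu_p, mu_t = _mean(p), _mean(t)
    var_p = _mean([(v - mu_p) ** 2 for v in p])
    var_t = _mean([(v - mu_t) ** 2 for v in t])
    cov = _mean([(a - mu_p) * (b - mu_t) for a, b in zip(p, t)])
    # c1 and c2 keep the denominator away from zero.
    num = (2 * mu_p * mu_t + c1) * (2 * cov + c2)
    den = (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    return num / den


def avg_pool2x(img):
    """Halve resolution by averaging non-overlapping 2x2 blocks."""
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4
         for x in range(0, len(img[0]) - 1, 2)]
        for y in range(0, len(img) - 1, 2)
    ]


def ms_ssim_loss(pred, target, betas=(0.5, 0.5)):
    """1 - product over scales of SSIM ** beta; betas weight each scale."""
    score = 1.0
    for i, beta in enumerate(betas):
        # Clamp at zero so a fractional power stays real-valued.
        score *= max(global_ssim(pred, target), 0.0) ** beta
        if i < len(betas) - 1:
            pred, target = avg_pool2x(pred), avg_pool2x(target)
    return 1.0 - score


def total_loss(detection_loss, pred_mask, target_mask, weight=1.0):
    """Detection loss plus weighted segmentation (MS-SSIM) loss."""
    return detection_loss + weight * ms_ssim_loss(pred_mask, target_mask)


target = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
empty = [[0, 0, 0, 0] for _ in range(4)]
print(ms_ssim_loss(target, target), ms_ssim_loss(empty, target))
```

A perfect mask drives the segmentation term toward zero, so the total reduces to the detection loss; a mask that misses the object pushes the term toward one.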

In addition, a multi-size target detection method according to a second aspect of the present invention includes: inputting an image containing a plurality of small, medium, and large targets into an encoder unit composed of a plurality of convolutional layers; outputting a first feature map from each convolutional layer of the encoder unit; receiving the first feature map at a decoder unit that is residually connected to the encoder unit and composed of a plurality of convolutional layers; outputting, from each convolutional layer of the decoder unit, a second feature map restoring the first feature map; and outputting detection and classification results for objects contained in the image based on the first and second feature maps.

In some embodiments of the present invention, the step of receiving the first feature map at the decoder unit, which is residually connected to the encoder unit and composed of a plurality of convolutional layers, may include receiving, at each convolutional layer of the decoder unit, an input feature map determined by the following equation from the first feature map output by the encoder unit and the second feature map output by the decoder unit.

[Equation]

Some embodiments of the present invention may further include a step in which a parallel segment head unit for improving the detection accuracy of the objects receives the second feature map of the decoder unit and performs location estimation, detection, and classification of multi-size objects.

Some embodiments of the present invention may further include: generating a ground-truth segmentation mask using a one-hot encoding technique; and training the parallel segment head unit based on the ground-truth segmentation mask.

In some embodiments of the present invention, the step of generating the ground-truth segmentation mask using the one-hot encoding technique may generate the ground-truth segmentation mask by applying the one-hot encoding technique to the bounding box regions of the predetermined small objects.

In some embodiments of the present invention, in the step of training the parallel segment head unit based on the ground-truth segmentation mask, training is performed together with the result prediction neural network area unit based on the loss function of the following equation, and a loss function based on MS-SSIM (Multi-Scale Structural SIMilarity) may be applied as the loss function of the parallel segment head unit.

[Equation]

In the above [Equation], the terms denote, in order: the loss function of the result prediction neural network area unit; a weight; the means and standard deviations of the network-generated mask and the ground-truth mask; a constant that prevents division by zero in the loss function; and the relative importance of each resolution.

A computer program according to another aspect of the present invention for solving the above-described problem is combined with a computer, which is hardware, to execute the multi-size target detection method, and is stored in a computer-readable recording medium.

Other specific details of the present invention are included in the detailed description and drawings.

According to an embodiment of the present invention, strong detection of small, densely packed objects and of objects against complex backgrounds can be expected, while medium and large objects are also detected well. That is, because objects of various sizes can be detected, an object detection neural network specialized for unmanned aerial vehicles can be provided.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

The accompanying drawings are intended to aid understanding of the present embodiments and present the embodiments together with the detailed description. However, the technical features of the embodiments are not limited to any specific drawing, and features disclosed in the individual drawings may be combined to form new embodiments.
FIGS. 1A to 1C show examples of images captured in daily life and from the air.
FIG. 2 shows an example visualization of the feature maps at each stage of CSPDarkNet.
FIG. 3 is a block diagram of a multi-size target detection system according to an embodiment of the present invention.
FIG. 4 illustrates the YOLOv5U model applied in the multi-size target detection system according to an embodiment of the present invention.
FIG. 5 illustrates the parallel segment head unit in an embodiment of the present invention.
FIG. 6 is a flowchart of a multi-size target detection method according to an embodiment of the present invention.
FIGS. 7 and 8 show test results comparing target detection by the prior art and by the present invention.
FIG. 9 shows feature map visualization results for multi-size objects in an embodiment of the present invention.

The advantages and features of the present invention, and the methods for achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully convey the scope of the invention to those skilled in the art to which the invention pertains, and the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless the context specifically states otherwise. As used in the specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more elements other than those mentioned. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each of the mentioned elements and every combination of one or more of them. Although "first", "second", and the like are used to describe various elements, these elements are not limited by these terms. The terms are used only to distinguish one element from another. Therefore, a first element mentioned below may also be a second element within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those skilled in the art to which the present invention pertains. In addition, terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless expressly and specifically defined.

Hereinafter, to aid the understanding of those skilled in the art, the background against which the present invention was proposed is explained first, and the present invention is then described in detail.

CSPDarkNet, the feature extraction network of the YOLOv5l model, is a deep learning network architecture for object detection that extracts the semantic features of objects as the input passes through the down-sampling layer of each stage.

FIG. 2 shows an example visualization of the feature maps at each stage of CSPDarkNet. Referring to FIG. 2, the feature maps were visualized at each stage of CSPDarkNet, and the following stage-wise characteristics were identified.

The shallow layers of the feature extraction network, corresponding to stages 1 and 2, are rich in spatial information suited to estimating the locations of individual small objects. In contrast, the deep layers, such as stages 4 and 5, make individual objects hard to distinguish but are rich in semantic information suited to object detection and classification. Stage 3 exhibits the characteristics of both shallow and deep layers evenly.

That is, in the CSPDarkNet feature extraction network, the spatial information of small objects is richest in the shallow feature maps, and as the network proceeds to deeper feature maps, semantic information increases while spatial information decreases.
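
The loss of spatial information with depth can be made concrete with a little arithmetic. The sketch below assumes a 640-pixel square input and that each of the five stages halves the resolution; both values, and the helper names `stage_resolutions` and `cells_covered`, are assumptions chosen for illustration and are not stated in this document.

```python
def stage_resolutions(input_size, num_stages=5, stride_per_stage=2):
    """Return (stage, feature-map side length) after each down-sampling stage."""
    sizes = []
    size = input_size
    for stage in range(1, num_stages + 1):
        size //= stride_per_stage
        sizes.append((stage, size))
    return sizes


def cells_covered(object_pixels, input_size, feature_size):
    """Width of an object measured in feature-map cells."""
    return object_pixels * feature_size / input_size


resolutions = stage_resolutions(640)
for stage, size in resolutions:
    # A 16-pixel-wide object shrinks with every stage.
    print(stage, size, cells_covered(16, 640, size))
```

Under these assumptions, an object 16 pixels wide spans several cells in stage 1 but less than one cell in stage 5, which is consistent with the observation that deep feature maps struggle to separate individual small objects.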

With this CSPDarkNet feature extraction network, the conventional YOLOv5l model can detect medium (Table 1, Medium) and large (Table 1, Large) objects, which make up most everyday scenes, because sufficient spatial information for them is retained in the deep feature maps of stages 3, 4, and 5.

In contrast, a small object detector improves small object detection accuracy by performing detection mainly on the shallow feature maps of stages 1 and 2, which retain the spatial information of objects of size Small or below that occupy much of the image. Such a small object detector cannot extract sufficient features from large objects, so its detection accuracy for large objects is degraded.

Unlike this prior art, an embodiment of the present invention can improve detection accuracy for objects of all sizes present in an image.

The YOLOv5U model according to an embodiment of the present invention applies a U-shaped encoder-decoder structure built around the conventional YOLOv5l model and uses residual connections (skip connections) so that the feature information of all stages is connected. In addition, a Head network (the parallel segment head unit) is added to perform location estimation, detection, and classification of objects of various sizes. Furthermore, to improve detection accuracy for small objects, an embodiment of the present invention also applies a segmentation loss function in the result prediction neural network area unit 230, which is the object detector network, enabling object detection and classification at the pixel level.

Hereinafter, the multi-size target detection system 100 and method according to an embodiment of the present invention are described in more detail with reference to the accompanying drawings.

FIG. 3 is a block diagram of the multi-size target detection system 100 according to an embodiment of the present invention.

The multi-size target detection system 100 according to an embodiment of the present invention includes an input unit 110, a communication unit 120, a display unit 130, a memory 140, and a processor 150.

The input unit 110 generates input data in response to user input to the multi-size target detection system 100. The user input may be, for example, input concerning an image that the multi-size target detection system 100 is to process. The input unit 110 includes at least one input means and may include a keyboard, a keypad, a dome switch, a touch panel, a touch key, a mouse, a menu button, and the like.

The communication unit 120 communicates with a camera or other external device to receive data. The communication unit 120 may include both a wired communication module and a wireless communication module. The wired communication module may be implemented as a power line communication device, a telephone line communication device, cable home (MoCA), Ethernet, IEEE 1394, an integrated wired home network, or an RS-485 control device. The wireless communication module may be composed of modules implementing functions such as WLAN (wireless LAN), Bluetooth, HDR WPAN, UWB, ZigBee, Impulse Radio, 60 GHz WPAN, Binary-CDMA, wireless USB, and wireless HDMI technology, as well as 5G (5th generation communication), LTE-A (long term evolution-advanced), LTE (long term evolution), and Wi-Fi (wireless fidelity).

The display unit 130 displays data resulting from the operation of the multi-size target detection system 100, such as training results and inference results. The display unit 130 may include a liquid crystal display (LCD), a light emitting diode (LED) display, an organic light emitting diode (OLED) display, a micro electro mechanical systems (MEMS) display, or an electronic paper display. The display unit 130 may be combined with the input unit 110 and implemented as a touch screen.

The memory 140 stores the operating programs of the multi-size target detection system 100. Here, the memory 140 collectively refers to non-volatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices. For example, the memory 140 may include NAND flash memory such as a compact flash (CF) card, a secure digital (SD) card, a memory stick, a solid-state drive (SSD), and a micro SD card; magnetic computer storage devices such as a hard disk drive (HDD); and optical disc drives such as CD-ROM and DVD-ROM.

The processor 150 may execute software such as a program to control at least one other component (for example, a hardware or software component) of the multi-size target detection system 100, and may perform various data processing or computation.

Upon receiving a predetermined image, the processor 150 performs the training process of the encoder unit 210 and the decoder unit 220, each composed of a plurality of convolutional layers, and of the result prediction neural network area unit 230. Once training is complete, the processor 150 performs the inference process through the encoder unit 210, the decoder unit 220, and the result prediction neural network area unit 230.

Meanwhile, in an embodiment of the present invention, the processor 150 may use at least one of machine learning, a neural network, or a deep learning algorithm as its artificial intelligence algorithm. Examples of neural networks include models such as a CNN (Convolutional Neural Network), a DNN (Deep Neural Network), and an RNN (Recurrent Neural Network).

FIG. 4 illustrates the YOLOv5U model applied in the multi-size target detection system 100 according to an embodiment of the present invention.

Compared with the prior-art YOLOv5l, an embodiment of the present invention removes the PAN (Path Aggregation) structure and thereby reduces the number of neural network parameters. In addition, the CSPDarkNet feature extraction network of the prior art is applied to both the encoder unit 210 and the decoder unit 220. In the example of FIG. 3, five convolutional layers are applied in the encoder unit 210 and three convolutional layers in the decoder unit 220.

In addition, an embodiment of the present invention generates a high-resolution feature map rich in spatial information by performing an up-sampling operation on the feature map together with feature extraction in the encoder-decoder structure. This U-shaped encoder-decoder structure resembles a super-resolution model, but the present invention differs technically from existing U-shaped super-resolution models in that it improves resolution at the feature-map level.

Specifically, the multi-size target detection system 100 according to an embodiment of the present invention is a YOLOv5U model that includes an encoder unit 210, a decoder unit 220, and a result prediction neural network area unit 230.

The encoder unit 210 receives an image containing at least one of a predetermined small target, medium target, and large target and outputs a first feature map. The encoder unit 210 consists of a plurality of convolution layers that extract high-dimensional abstract information from each pixel of the input image and map it to a lower-dimensional space. In the example of FIG. 3, it is shown as consisting of five stages. By removing unnecessary information from the input image and emphasizing important features, the encoder unit 210 can make object detection easier.

The decoder unit 220 is residually connected to the encoder unit 210 and outputs a second feature map that restores the first feature map of the encoder unit 210. The decoder unit 220 consists of a plurality of convolution layers; in the example of FIG. 3, it is shown as consisting of three convolution layers.

Here, an embodiment of the present invention can compensate for computational loss incurred during feature extraction through the residual connection between the encoder unit 210 and the decoder unit 220. Taking aerial images captured by an unmanned aerial vehicle as an example, the residual connection method of the present invention takes into account the spatial information of objects in the image and the computational cost.

A simple residual connection that links the feature maps of every encoder stage to the decoder relies on general missing-value interpolation, which distorts the spatial information of small objects and adds noise to the extracted feature maps. This lowers the object detection accuracy of the neural network and also causes an unnecessary increase in computational cost.

Therefore, unlike the existing simple residual connection method, in an embodiment of the present invention each convolution layer of the decoder unit 220 is connected, based on a feature map analysis result, to the first and second feature maps of a predetermined number of upper and lower stages of the encoder unit 210 and the decoder unit 220, respectively. For example, each decoder stage may be connected to the feature maps of one upper and one lower stage of the encoder unit 210 and the decoder unit 220.

That is, a specific stage of the decoder unit 220 can be used in connection with the feature map of one upper or lower stage of the encoder unit 210 (for example, the second stage of the encoder unit and the third stage of the decoder unit). This residual connection method enables information sharing and combination between feature maps, so more precise object detection and classification performance can be expected. Moreover, because this connection scheme differs from the existing simple residual connection, it can improve object detection accuracy while minimizing the increase in computational cost.
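The selective connection pattern described above can be sketched as a small index calculation. This is only an illustration under stated assumptions: the mirrored pairing of decoder stage d with encoder stage (5 - d), the inclusion of the mirrored stage itself, and the names `aligned_encoder_stage`, `selective_skip_sources`, and `dense_skip_sources` are hypothetical and are not taken from the specification.

```python
def aligned_encoder_stage(decoder_stage, num_encoder_stages=5):
    """Encoder stage mirrored across the U shape (assumed pairing)."""
    return num_encoder_stages - decoder_stage


def selective_skip_sources(decoder_stage, num_encoder_stages=5, reach=1):
    """Encoder stages feeding one decoder stage: the mirrored stage plus
    `reach` neighbours on each side, clipped to the valid stage range."""
    center = aligned_encoder_stage(decoder_stage, num_encoder_stages)
    low = max(1, center - reach)
    high = min(num_encoder_stages, center + reach)
    return list(range(low, high + 1))


def dense_skip_sources(num_encoder_stages=5):
    """Baseline for comparison: every encoder stage feeds every decoder stage."""
    return list(range(1, num_encoder_stages + 1))


for stage in (1, 2, 3):
    print(stage, selective_skip_sources(stage), dense_skip_sources())
```

Under these assumptions the third decoder stage draws on encoder stages 1 to 3, which includes the second encoder stage mentioned in the example, while the dense baseline always draws on all five.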

Meanwhile, the input feature map of each convolution layer of the decoder unit 220 can be expressed as in Equation 1. The input feature map may be determined based on the first feature map output from the encoder unit 210 and the second feature map output from the decoder unit 220.

[Equation 1]

In Equation 1, the two feature-map symbols denote the first and second feature maps output from each convolution layer of the encoder unit 210 and the decoder unit 220, respectively, and the index denotes the stage number.

The result prediction neural network area unit 230 outputs detection and classification results for objects included in the image based on the first and second feature maps (Detection-P2 to Detection-P5).
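To give a feel for what the four outputs Detection-P2 to Detection-P5 might look like in practice, the sketch below computes the grid size of each level. It assumes the naming convention common in YOLO-family detectors, where level P_k has a stride of 2^k pixels, and a hypothetical square input of 640 pixels; neither assumption is stated in the specification.

```python
def detection_grid_sizes(image_size=640, levels=(2, 3, 4, 5)):
    """Grid side length per detection level, assuming stride 2**k at level P_k."""
    return {f"Detection-P{k}": image_size // (2 ** k) for k in levels}


grids = detection_grid_sizes()
for name, side in grids.items():
    print(f"{name}: {side} x {side} cells")
```

Under this convention the P2 level has the finest grid, which is the level most relevant to small, densely packed objects.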

FIG. 5 is a diagram illustrating the parallel segment head unit 240 in an embodiment of the present invention.

To improve the detection accuracy of small objects, an embodiment of the present invention applies multi-task learning, in which tasks are learned while sharing common neural network parameters, to the training of the result prediction neural network area unit 230. Accordingly, an embodiment of the present invention further includes a parallel segment head unit 240 as shown in FIG. 5. Applying the parallel segment head unit 240 to the result prediction neural network area unit 230 enables localization that classifies the type of object at the pixel level.

Unlike existing multi-task learning, the parallel segment head unit 240 aims at improving object detection accuracy, so it is applied in the training process but not used in the inference process. Accordingly, an embodiment of the present invention has the advantage of enabling effective object detection with no increase in computational cost over the existing object detection neural network.
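The training-only role of the auxiliary head can be illustrated with a minimal sketch. The class `DetectorWithAuxiliaryHead` and the toy head functions are hypothetical stand-ins, not the patented network; the point is only that the segmentation branch is evaluated when `training` is true and skipped otherwise, so inference cost is unchanged.

```python
class DetectorWithAuxiliaryHead:
    """Wraps a detection head and an auxiliary head that is used only in training."""

    def __init__(self, detection_head, segment_head):
        self.detection_head = detection_head
        self.segment_head = segment_head

    def forward(self, features, training):
        outputs = {"detections": self.detection_head(features)}
        if training:
            # The auxiliary branch runs only while training.
            outputs["segmentation"] = self.segment_head(features)
        return outputs


calls = {"segment": 0}


def toy_detection_head(features):
    return [sum(features)]


def toy_segment_head(features):
    calls["segment"] += 1  # count how often the auxiliary branch is evaluated
    return [1 if value > 0 else 0 for value in features]


model = DetectorWithAuxiliaryHead(toy_detection_head, toy_segment_head)
train_out = model.forward([0.2, -0.1, 0.7], training=True)
infer_out = model.forward([0.2, -0.1, 0.7], training=False)
print(sorted(train_out), sorted(infer_out), calls["segment"])
```

After one training pass and one inference pass, the auxiliary head has been called exactly once.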

In one embodiment, the parallel segment head unit 240 may be trained on a ground-truth region segmentation mask generated with a one-hot encoding technique. The ground-truth region segmentation mask may be generated by applying one-hot encoding to bounding box regions, with a focus on improving the detection accuracy of predetermined small objects (Tiny, Small).

Generating the ground-truth region segmentation mask is a task used for training the parallel segment head unit 240, and refers to the process of creating a mask that accurately marks the positions of objects in the image for object detection. For this purpose, an embodiment of the present invention uses the one-hot encoding technique.

Here, the one-hot encoding technique represents the mask marking object positions in binarized form; for example, a value of 1 may be assigned to pixels where an object is present and a value of 0 to pixels where no object is present.
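The binarized mask described above can be sketched as follows. This is a minimal illustration, assuming boxes are given as integer `(x1, y1, x2, y2)` pixel coordinates with an exclusive lower-right corner; the function name `boxes_to_mask` and the box format are assumptions, not part of the specification.

```python
def boxes_to_mask(height, width, boxes):
    """Build a binary mask: 1 inside any bounding box, 0 elsewhere.

    Boxes are (x1, y1, x2, y2) with x2 and y2 exclusive (assumed format).
    """
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        for y in range(max(0, y1), min(height, y2)):
            for x in range(max(0, x1), min(width, x2)):
                mask[y][x] = 1
    return mask


mask = boxes_to_mask(4, 6, [(1, 1, 3, 3)])
for row in mask:
    print(row)
```

Boxes that extend past the image border are clipped, and overlapping boxes simply mark the same pixels once.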

In addition, in an embodiment of the present invention, the parallel segment head unit 240 may be trained together with the result prediction neural network area unit 230 based on the loss function according to Equation 2.

[Equation 2]

In Equation 2, the terms denote, respectively, the total loss function, the loss function of the result prediction neural network area unit 230, the region segmentation loss function L_Seg of the parallel segment head unit 240, and a weight. The loss function of the result prediction neural network area unit 230 may be the same as that of the prior-art YOLOv5l neural network, and the parallel segment head unit 240 may be trained with a loss function based on MS-SSIM (Multi Scale-Structural SIMilarity).
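For orientation only, one common way to combine a detection loss with an auxiliary loss is a weighted sum. The form below is an assumption, since the equation itself is not reproduced in this text; the symbols for the total loss, the detection loss, and the weight are chosen here for illustration, and only L_Seg appears in the claims.

```latex
% Assumed weighted-sum form; not confirmed by the source text.
\mathcal{L}_{\mathrm{Total}} = \mathcal{L}_{\mathrm{Det}} + \lambda \, \mathcal{L}_{\mathrm{Seg}}
```

With this form, setting the weight to zero recovers plain detector training, which matches the statement that the auxiliary head serves only to support the detection task.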

The MS-SSIM loss function is computed by comparing the SSIM (Structural SIMilarity) of the network-generated image and the ground-truth image at multiple resolutions. In one embodiment, the image is converted to multiple resolutions, the SSIM value is computed at each resolution, and these values are combined by weighted averaging to obtain the final MS-SSIM value. In the present invention, training was carried out by using MS-SSIM to compare the structural characteristics of the region segmentation masks.

[Equation 3]

In Equation 3, the symbols denote the means and standard deviations of the network-generated mask and the ground-truth mask, a constant that prevents the loss function from being divided by zero, and the relative importance at each of the multiple resolutions.

With this loss function, the result prediction neural network area unit 230 can make a coarse estimate of object positions while learning the pixel-level object classification task. This coarse position estimation from the ground-truth region segmentation mask can positively affect the feature representation ability required for learning to determine whether an object is present in each individual cell of the anchor grid of the existing object detection neural network.

Hereinafter, a method performed by the multi-size target detection system 100 according to an embodiment of the present invention is described with reference to FIG. 6.

FIG. 6 is a flowchart of a multi-size target detection method according to an embodiment of the present invention.

First, an image containing at least one of a predetermined small target, medium target, and large target is input to the encoder unit 210 composed of a plurality of convolution layers (S110), and a first feature map is output from each convolution layer of the encoder unit 210 (S120).

Next, when the decoder unit 220, which is residually connected to the encoder unit 210 and composed of a plurality of convolution layers, receives the first feature map (S130), each convolution layer of the decoder unit 220 outputs a second feature map that restores the first feature map (S140).

Next, detection and classification results for objects included in the image are output based on the first and second feature maps (S150).
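The data flow of steps S110 to S150 can be sketched with toy stand-ins on a one-dimensional signal. Everything here is hypothetical: `toy_encoder`, `toy_decoder`, `toy_predictor`, the pair-averaging and repetition operations, and the threshold of 5.0 merely mimic the encoder, the skip-connected decoder, and the prediction step; they are not the convolutional networks of the embodiment.

```python
def detect_targets(image, encoder, decoder, predictor):
    first_maps = encoder(image)                 # S110, S120: encoder feature maps
    second_maps = decoder(first_maps)           # S130, S140: restored feature maps
    return predictor(first_maps, second_maps)   # S150: detection result


def toy_encoder(image):
    """Halve the signal length at each of three stages by pair-averaging."""
    maps, current = [], list(image)
    for _ in range(3):
        current = [(current[i] + current[i + 1]) / 2
                   for i in range(0, len(current) - 1, 2)]
        maps.append(current)
    return maps


def toy_decoder(first_maps):
    """Upsample by repetition and add the encoder map of matching length."""
    restored, current = [], first_maps[-1]
    for skip in reversed(first_maps[:-1]):
        upsampled = [value for value in current for _ in range(2)]
        current = [u + s for u, s in zip(upsampled, skip)]
        restored.append(current)
    return restored


def toy_predictor(first_maps, second_maps):
    """Report cells of the finest restored map whose response exceeds 5.0."""
    finest = second_maps[-1]
    return [index for index, value in enumerate(finest) if value > 5.0]


result = detect_targets([0, 0, 4, 4, 0, 0, 0, 0],
                        toy_encoder, toy_decoder, toy_predictor)
print(result)
```

In this toy run the strongest response falls in the cell covering the non-zero samples of the input, which is the only cell reported.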

Meanwhile, in the above description, steps S110 to S150 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Some steps may be omitted as needed, and the order of steps may be changed. In addition, even where not repeated here, the content described with reference to FIGS. 1 to 5 also applies to the method of FIG. 6, and vice versa.

FIGS. 7 and 8 are diagrams showing test results that compare target detection by the prior art and by the present invention.

FIGS. 7 and 8 show target detection results on example aerial images for the YOLOv5l neural network of the prior art, the YOLOv5-TA neural network regarded as a small-object detection neural network, and the YOLOv5U neural network of the present invention.

First, the YOLOv5-TA neural network detects the densely packed small objects of FIG. 7 well, but does not detect the medium and large objects of FIG. 8 well.

In contrast, the YOLOv5U neural network according to the present invention can detect objects of various sizes. That is, it correctly classifies the types of the large objects in FIG. 8 and identifies the individual small objects packed into the small region of FIG. 7. Accordingly, the YOLOv5U neural network according to an embodiment of the present invention was confirmed to be an object detection neural network specialized for unmanned aerial vehicles.

FIG. 9 is a diagram showing feature map visualization results for multi-size objects in an embodiment of the present invention.

Furthermore, the feature map visualization of the object detection results of the present invention, shown in FIG. 9, confirms that features of objects of all sizes are effectively detected.

In addition, in a quantitative evaluation on the Visdrone-DET2019 dataset, a representative dataset of aerial images captured by unmanned aerial vehicles, the present invention achieved 36.3 on the metric that reflects objects of all sizes, the highest among the compared object detector neural networks. This is a 13.5% improvement over the conventional YOLOv5l neural network. In addition, compared with the small-object detector YOLOv5l-TA, the YOLOv5U neural network according to the present invention shows somewhat lower small-object detection performance, but its medium- and large-object detection performance is 29.9 and 46.6, improvements of 6.3% and 143%, respectively.

| Training | Model | Params | FLOPs | FPS | [1] | [2] | [3] | [4] | [5] | [6] | [7] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Without Parallel Segmentation Loss | YOLOv5l | 46.27M | 107.9G | 65 | 31.2 | 17.1 | 0.5 | 4.0 | 11.9 | 26.8 | 40.8 |
| Without Parallel Segmentation Loss | YOLOv5l-TA | 2.23M | 214.7G | 25 | 34.7 | 17.7 | 1.9 | 6.2 | 16.2 | 28.1 | 19.1 |
| Without Parallel Segmentation Loss | YOLOv5U | 41.9M | 177.0G | 35 | 36.1 | 19.4 | 1.8 | 5.1 | 13.8 | 29.9 | 46.6 |
| With Parallel Segmentation Loss | YOLOv5l | 46.27M | 107.9G | 65 | 33.8 | 18.5 | 1.2 | 4.2 | 13.1 | 29.1 | 46.0 |
| With Parallel Segmentation Loss | YOLOv5U | 41.9M | 177.0G | 35 | 36.3 | 19.5 | 1.2 | 5.5 | 14.4 | 29.8 | 42.2 |

Columns [1] to [7] are accuracy metrics whose header symbols did not survive in this text; per the description above, [1] reflects objects of all sizes, and [6] and [7] are the medium- and large-object detection performance.
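The relative improvements quoted for medium and large objects can be checked with a short calculation. The sketch below assumes that the last two numeric columns of the table are the medium- and large-object scores (as the text implies) and compares the YOLOv5U and YOLOv5l-TA rows trained without the parallel segmentation loss; the helper name `relative_gain_percent` is hypothetical.

```python
def relative_gain_percent(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


medium_gain = relative_gain_percent(29.9, 28.1)  # YOLOv5U vs YOLOv5l-TA, medium
large_gain = relative_gain_percent(46.6, 19.1)   # YOLOv5U vs YOLOv5l-TA, large
print(f"medium: {medium_gain:.1f}%  large: {large_gain:.1f}%")
```

The results come out at roughly 6.4% and 144.0%, in line with the 6.3% and 143% quoted in the text up to rounding.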

The multi-size target detection method according to an embodiment of the present invention described above may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware.

The program described above may include code written in a computer language such as C, C++, JAVA, Ruby, or machine language that the processor (CPU) of the computer can read through the device interface of the computer, so that the computer reads the program and executes the methods implemented as the program. Such code may include functional code related to functions that define the functions required to execute the methods, and may include control code related to the execution procedure required for the processor of the computer to execute those functions in a predetermined order. Such code may further include memory-reference code indicating the location (address) in the internal or external memory of the computer at which additional information or media required for the processor to execute the functions should be referenced. In addition, when the processor of the computer needs to communicate with another remote computer or server to execute the functions, the code may further include communication code specifying how to communicate with the remote computer or server using the communication module of the computer, and what information or media to transmit and receive during communication.

The storage medium refers not to a medium that stores data for a short moment, such as a register, cache, or memory, but to a medium that stores data semi-permanently and can be read by a device. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices. That is, the program may be stored on various recording media on various servers that the computer can access, or on various recording media on the user's computer. The medium may also be distributed across computer systems connected by a network, so that computer-readable code is stored in a distributed manner.

The foregoing description of the present invention is for illustrative purposes, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: multi-size target detection system
110: input unit
120: communication unit
130: display unit
140: memory
150: processor
210: encoder unit
220: decoder unit
230: result prediction neural network area unit
240: parallel segment head unit

Claims

A multi-size target detection system comprising:
an encoder unit composed of a plurality of convolution layers, which receives an image including a plurality of small targets, medium targets, and large targets and outputs a first feature map;
a decoder unit composed of a plurality of convolution layers, which is residually connected to the encoder unit and outputs a second feature map that restores the first feature map of the encoder unit; and
a result prediction neural network area unit that outputs detection and classification results for objects included in the image based on the first and second feature maps,
wherein the convolution layers of the encoder unit and the decoder unit are constructed based on a CSPDarkNet feature extraction neural network,
wherein each convolution layer of the decoder unit is connected, based on a feature map analysis result, to the first and second feature maps of a predetermined number of upper and lower stages of the encoder unit and the decoder unit, respectively,
wherein the input feature map of each convolution layer of the decoder unit is determined by the equation below based on the first feature map output from the encoder unit and the second feature map output from the decoder unit,
[Equation]

wherein the system further comprises a parallel segment head unit that receives the second feature map of the decoder unit and improves object detection accuracy,
wherein the parallel segment head unit is applied in the training process but not in the inference process,
wherein the parallel segment head unit is trained based on a ground-truth region segmentation mask generated by a one-hot encoding technique,
wherein the ground-truth region segmentation mask is generated by applying the one-hot encoding technique to the bounding box regions of objects of all sizes in the image,
wherein the parallel segment head unit is trained together with the result prediction neural network area unit using the loss function of the equation below, the loss function (L_Seg) of the parallel segment head unit being a loss function based on MS-SSIM (Multi Scale-Structural SIMilarity),
[Equation]

where the symbols in the above equations denote, respectively, the loss function of the result prediction neural network area unit, the weight, the means and standard deviations, a constant that prevents the loss function from being divided by zero, and the relative importance at multiple resolutions.

(Deleted)

A multi-size target detection method comprising:
inputting an image including a plurality of small targets, medium targets, and large targets into an encoder unit composed of a plurality of convolution layers;
outputting a first feature map from each convolution layer of the encoder unit;
receiving the first feature map at a decoder unit that is residually connected to the encoder unit and composed of a plurality of convolution layers;
outputting, from each convolution layer of the decoder unit, a second feature map that restores the first feature map; and
outputting detection and classification results for objects included in the image based on the first and second feature maps,
wherein the convolution layers of the encoder unit and the decoder unit are constructed based on a CSPDarkNet feature extraction neural network,
wherein each convolution layer of the decoder unit is connected, based on a feature map analysis result, to the first and second feature maps of a predetermined number of upper and lower stages of the encoder unit and the decoder unit, respectively,
wherein the step of receiving the first feature map at the decoder unit, which is residually connected to the encoder unit and composed of a plurality of convolution layers, comprises:
receiving, at each convolution layer of the decoder unit, an input feature map determined by the equation below based on the first feature map output from the encoder unit and the second feature map output from the decoder unit,
[Equation]

wherein the method further comprises a step in which a parallel segment head unit, provided to improve detection accuracy of the objects, receives the second feature map from the decoder unit and performs position estimation, detection, and classification of multi-size objects;
a step of generating a ground-truth region segmentation mask based on a one-hot encoding technique; and
a step of training the parallel segment head unit based on the ground-truth region segmentation mask,
wherein the parallel segment head unit is applied in the training process but not in the inference process,
wherein the step of generating the ground-truth region segmentation mask based on the one-hot encoding technique comprises:
generating the ground-truth region segmentation mask by applying the one-hot encoding technique to the bounding box regions of objects of all sizes in the image,
wherein the step of training the parallel segment head unit based on the ground-truth region segmentation mask comprises:
training with the loss function of the equation below together with the result prediction neural network area unit, the loss function (L_Seg) of the parallel segment head unit being a loss function based on MS-SSIM (Multi Scale-Structural SIMilarity),
[Equation]

where the symbols in the above equations denote, respectively, the loss function of the result prediction neural network area unit, the weight, the means and standard deviations, a constant that prevents the loss function from being divided by zero, and the relative importance at multiple resolutions.

(Deleted)