KR20230053347A - Object Detection Network using Coordinate Information

Info

Publication number: KR20230053347A
Application number: KR1020210136757A
Authority: KR
Inventors: 정제창; 유동균
Original assignee: 한양대학교 산학협력단
Priority date: 2021-10-14
Filing date: 2021-10-14
Publication date: 2023-04-21
Also published as: KR102649119B1

Abstract

Disclosed are a YOLO-based object detection method and apparatus capable of accurately marking the position of a bounding box. The disclosed YOLO-based object detection apparatus comprises: a backbone network that extracts image features from an input image; a neck network that includes regression information of a bounding box for the extracted image features and, in order to increase the accuracy of the bounding box position, outputs the regression information of the bounding box to which a coordinate convolution module having coordinate information on the position of the bounding box is applied; and a head network that detects an object by applying a YOLO layer to the output of the neck network and performs regression learning on the detected object.

Description

Object Detection Network using Coordinate Information

The present invention relates to a method and apparatus for detecting an object using coordinate information.

Object detection is required in many fields, such as medical image analysis, autonomous driving, face recognition, and person recognition through CCTV. Applying it to such services requires both high detection performance and high speed. Conventional object detection methods were difficult to apply to real-time services, but advances in deep learning have shown that object detection can be made both accurate and fast. Before deep learning, object detection typically relied on a sliding window. The sliding window technique searches for objects by moving blocks of various sizes from the upper left to the lower right of the image. Because the computation must be repeated many times over a single image, it is inefficient in terms of speed. Methods such as DPM (Deformable Part Model) were proposed to address this problem [1].

Object detectors based on deep learning can be broadly divided into two-stage and one-stage object detectors. Two-stage object detectors were proposed in R-CNN [2], Fast R-CNN [3], and Faster R-CNN [4]. To overcome the inefficiency of the sliding window technique, a two-stage detector first uses a region proposal technique that quickly finds regions where an object exists, and then classifies the object. Because of these two steps, it is slow. One-stage object detectors, by contrast, have been studied in CenterNet [5], EfficientDet [6], and YOLO [7-9]. A one-stage detector predicts the bounding box and the object class in a single pass, so its accuracy is lower than that of a two-stage detector, but its detection speed is higher.

One-stage object detectors can be divided into those that use anchor boxes and those that do not. Anchor boxes provide prior information about bounding boxes at the start of training by averaging the object sizes in the dataset by group, so training converges quickly when anchor boxes are used. CenterNet is a representative detector that does not use anchor boxes. Detectors that use anchor boxes were proposed in EfficientDet, YOLOv3 [7], YOLOv4 [8], and PP-YOLO [9], and they achieve relatively high accuracy.

The technical problem to be solved by the present invention is to provide a method and apparatus that improve detection performance by applying various techniques to a YOLO-based one-stage object detector. The proposed object detection network using coordinate information uses CIOU loss, Swish, and EMA to improve training performance, and uses a Focus module, an SPP module, and PANet to widen the receptive field of the network. In addition, a coordinate convolution (CoordConv) module carrying coordinate information is added to mark the position of the bounding box accurately, and the CSP technique is used to reduce the parameters and computation of the model.

In one aspect, the YOLO-based object detection apparatus proposed in the present invention includes: a backbone network that extracts image features from an input image; a neck network that includes regression information of a bounding box for the extracted image features and, in order to increase the accuracy of the bounding box position, outputs the regression information of the bounding box to which a coordinate convolution module having coordinate information on the position of the bounding box is applied; and a head network that detects an object by applying a YOLO layer to the output of the neck network and performs regression learning on the detected object.

The backbone network uses a Focus module to divide the input image into grid cells and then appends the divided images to the channel dimension, thereby widening the receptive field.

The neck network applies the coordinate convolution module to increase the accuracy of the bounding box: it adds the X and Y coordinates of the bounding box to the channels and then performs convolution.

The neck network applies the CSP (Cross Stage Partial) technique to reduce the parameters and computation of the model; uses PANet (Path Aggregation Network), which adds a bottom-up path to the FPN (Feature Pyramid Network), to obtain additional information on the position of the bounding box; and uses an SPP (Spatial Pyramid Pooling) module, which applies max pooling at multiple sizes and concatenates the results along the channel dimension, to increase the receptive field of the network.

The head network performs regression learning according to the center position of the bounding box and the IOU (Intersection over Union), and uses the CIOU (Complete Intersection over Union) loss, which computes the aspect ratio using the arctangent in order to impose aspect-ratio consistency.

In another aspect, the YOLO-based object detection method proposed in the present invention includes: extracting image features from an input image through a backbone network; outputting regression information of a bounding box for the extracted image features through a neck network to which a coordinate convolution module having coordinate information on the position of the bounding box is applied, in order to increase the accuracy of the bounding box position; and, in a head network, detecting an object by applying a YOLO layer to the output of the neck network and performing regression learning on the detected object.

According to embodiments of the present invention, detection performance can be improved by applying various techniques to a YOLO-based one-stage object detector. The proposed object detection network using coordinate information can improve training performance by using CIOU loss, Swish, and EMA, and can widen the receptive field of the network by using a Focus module, an SPP module, and PANet. In addition, adding a coordinate convolution (CoordConv) module carrying coordinate information allows the position of the bounding box to be marked accurately, and the CSP technique reduces the parameters and computation of the model.

FIG. 1 is a diagram for explaining the output structure of a YOLO layer according to the prior art.
FIG. 2 is a diagram for explaining a Focus module according to the prior art.
FIG. 3 is a diagram for explaining the configuration of an object detection network using coordinate information according to an embodiment of the present invention.
FIG. 4 is a diagram for explaining a coordinate convolution (CoordConv) module according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining an SPP module according to an embodiment of the present invention.
FIG. 6 is a diagram for explaining a CSP module according to an embodiment of the present invention.
FIG. 7 is a flowchart illustrating an object detection method using coordinate information according to an embodiment of the present invention.

The main purpose of object detection is to find the class and location of objects in an image. The present invention proposes a method for improving detection performance by applying various techniques to a YOLO-based one-stage object detector. The proposed algorithm uses CIOU loss, Swish, and EMA to improve training performance, and uses a Focus module, an SPP module, and PANet to widen the receptive field of the network. In addition, a coordinate convolution (CoordConv) module carrying coordinate information is added to mark the position of the bounding box accurately, and the CSP technique reduces the model parameters by about 23% and the computation by about 27%. The detector was evaluated on the MS COCO 2017 test dataset, and the proposed detector (CSP-Coords YOLO) showed higher accuracy (46.3% mAP) and higher speed (96.2 FPS) than YOLOv4. Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram for explaining the output structure of a YOLO layer according to the prior art.

The prior-art YOLO layer is the algorithm proposed in YOLOv3 [7]. To detect objects, the input to the YOLO layer is represented as a set of k elements. The index n increases each time the data passes through a convolution with a stride of 2 in the model, and as n grows the size of the input is reduced by a factor of 2^n. Each element of the set is a tensor whose form is given by Equation (1):

Equation (1)

In Equation (1), b, c, W_S, and H_S are the batch size, the number of object classes, and the width and height of the image input from the backbone, respectively. The YOLO layer reshapes the tensor, and regression learning is performed on the bounding boxes. The output of the YOLO layer is likewise a set of k elements, and the output tensor of each element is given by Equation (2):

Equation (2)

In Equation (2), the last dimension, 5 + c, may be organized as shown in FIG. 1. X_B and Y_B denote the center position of the bounding box, and W_B and H_B are its width and height. C is the probability that an object exists in the bounding box, and P_C is the probability of each object class. When the size of one image in the YOLO layer is W_S × H_S, the number of bounding boxes that can be extracted is given by Equation (3):

Equation (3)

In FIG. 1, when C is greater than the confidence threshold, C is multiplied by each P_C. When C × P_C is greater than the confidence threshold, an object of the corresponding class is considered to exist in the bounding box, and P_C is converted to P. The upper-left coordinates (X₁, Y₁) and lower-right coordinates (X₂, Y₂) of the bounding box are calculated from X_B, Y_B, W_B, and H_B. The six values X₁, Y₁, X₂, Y₂, C, and P of each bounding box are concatenated to form the bounding box. NMS (Non-Maximum Suppression) then extracts the bounding box with the highest probability of containing an object.
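The post-processing steps described above can be sketched in Python with NumPy. This is a minimal illustration and not the implementation of the invention: the function names, the threshold values (0.25 and 0.45), the reading of P as a class index, and the choice to run NMS per class are all assumptions made for the example.

```python
import numpy as np

def iou_one_to_many(box, boxes, eps=1e-9):
    """IoU between one corner-format box and an (N, 4) array of boxes."""
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + eps)

def postprocess(pred, conf_thres=0.25, iou_thres=0.45):
    """pred rows: [X_B, Y_B, W_B, H_B, C, P_1..P_c] -> rows [X1, Y1, X2, Y2, C, P]."""
    pred = np.asarray(pred, dtype=np.float64)
    pred = pred[pred[:, 4] > conf_thres]        # keep boxes with C above the threshold
    scores = pred[:, 4:5] * pred[:, 5:]         # C x P_C for every class
    cls = scores.argmax(axis=1)                 # assumed meaning of P: best class index
    score = scores.max(axis=1)
    keep = score > conf_thres                   # C x P_C must also pass the threshold
    pred, cls, score = pred[keep], cls[keep], score[keep]
    half_w, half_h = pred[:, 2] / 2, pred[:, 3] / 2
    boxes = np.stack([pred[:, 0] - half_w, pred[:, 1] - half_h,
                      pred[:, 0] + half_w, pred[:, 1] + half_h], axis=1)
    order = np.argsort(-score)                  # highest score first
    kept = []
    while order.size > 0:                       # greedy NMS (per class: an assumption)
        i, rest = order[0], order[1:]
        kept.append(i)
        overlap = iou_one_to_many(boxes[i], boxes[rest]) > iou_thres
        order = rest[~((cls[rest] == cls[i]) & overlap)]
    kept = np.array(kept, dtype=int)
    return np.column_stack([boxes[kept], pred[kept, 4], cls[kept]])

pred = np.array([
    [50, 50, 20, 20, 0.9, 0.8, 0.1],     # class 0
    [52, 50, 20, 20, 0.8, 0.7, 0.2],     # overlaps the first box, lower score
    [150, 150, 30, 30, 0.9, 0.1, 0.9],   # class 1
    [10, 10, 5, 5, 0.1, 0.9, 0.1],       # C below the threshold
])
detections = postprocess(pred)
print(detections)
```

In this toy input the second box is suppressed because it overlaps the first box of the same class, and the fourth is dropped by the confidence threshold, leaving two detections.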

FIG. 2 is a diagram for explaining a Focus module according to the prior art.

The Focus module is the module used in the YOLOv5 backbone [10]. It divides the input image as shown in FIG. 2 and appends the divided images to the channel dimension. An RGB image has three channels, but because the Focus module increases the number of channels, the receptive field is widened.

$$(b, c, w, h) \rightarrow \left(b, 4c, \tfrac{w}{2}, \tfrac{h}{2}\right) \tag{4}$$

In Equation (4), b, c, w, and h denote the batch size, number of channels, width, and height, respectively. The Focus module outputs four times as many channels as its input, with the width and height each halved.
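The shape change performed by the Focus module can be illustrated with a short NumPy sketch. It assumes a (batch, channel, height, width) layout and a particular ordering of the four sub-images, neither of which is specified in the text, and it omits any convolution that follows the rearrangement. The function name `focus_slice` is hypothetical.

```python
import numpy as np

def focus_slice(x):
    """Rearrange (b, c, h, w) into (b, 4c, h/2, w/2) by sampling every second pixel."""
    b, c, h, w = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Four sub-images, one per (row offset, column offset) pair; the order is an assumption
    patches = [
        x[:, :, 0::2, 0::2],
        x[:, :, 1::2, 0::2],
        x[:, :, 0::2, 1::2],
        x[:, :, 1::2, 1::2],
    ]
    return np.concatenate(patches, axis=1)

x = np.arange(2 * 3 * 8 * 8, dtype=np.float32).reshape(2, 3, 8, 8)
y = focus_slice(x)
print(x.shape, "->", y.shape)  # (2, 3, 8, 8) -> (2, 12, 4, 4)
```

No pixel values are discarded: the spatial resolution is traded for channels, which is why the channel count grows by exactly four while width and height are halved.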

FIG. 3 is a diagram for explaining the configuration of an object detection network using coordinate information according to an embodiment of the present invention.

The present invention proposes a one-stage object detection algorithm using anchor boxes to improve object detection speed and detection performance. The proposed object detector consists of a backbone network 310 that extracts image features from an image, a neck network 320 that holds regression information about bounding boxes, and a head network 330 that detects objects. The proposed detector uses a Focus [10] module, an SPP [11] module, and PANet [12] to widen the receptive field of the network, and uses CIOU loss [13], Swish [14], and EMA [9] to improve training performance. In addition, a coordinate convolution (CoordConv) [15] module carrying coordinate information is added to the neck 320 of the model to mark bounding boxes accurately, and the CSP [16] technique is applied to reduce the parameters and computation of the model.

The object detection network using coordinate information of the proposed YOLO-based object detection apparatus includes a backbone network 310, a neck network 320, and a head network 330.

The backbone network 310 according to an embodiment of the present invention extracts image features from an input image.

The backbone network 310 uses the Focus module to divide the input image into grid cells and then appends the divided images to the channel dimension, thereby widening the receptive field.

According to an embodiment of the present invention, the Focus module may be the module used in the YOLOv5 backbone 311 [10]. The input image is divided as shown in FIG. 2, and the divided images are appended to the channel dimension. An RGB image has three channels, but because the Focus module increases the number of channels, the receptive field is widened. As described above, the Focus module outputs four times as many channels as its input, with the width and height each halved.

The structure of the backbone network 310 according to an embodiment of the present invention is shown in Table 1.

<Table 1>

CC-YOLO according to an embodiment of the present invention uses the backbone structure used in YOLOv5 instead of the Darknet53 structure used in YOLOv3. The outputs extracted from the backbone network 310 are denoted P₃, P₄, and P₅ as shown in FIG. 3, and their sizes are 1/8, 1/16, and 1/32 of the input image, respectively.
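As a quick arithmetic illustration of these scale factors, the following Python snippet computes the side length of each backbone output for a square input. The input size of 640 is an assumption chosen for the example and is not stated in the text; the function name is hypothetical.

```python
def feature_map_sizes(input_size, strides=(8, 16, 32)):
    """Side length of P3, P4, P5 for a square input downscaled by 1/8, 1/16, 1/32."""
    return {f"P{level}": input_size // stride
            for level, stride in zip((3, 4, 5), strides)}

sizes = feature_map_sizes(640)  # 640 is an assumed input size
print(sizes)  # {'P3': 80, 'P4': 40, 'P5': 20}
```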

The neck network 320 according to an embodiment of the present invention includes regression information of a bounding box for the extracted image features and, in order to increase the accuracy of the bounding box position, outputs the regression information of the bounding box to which a coordinate convolution module having coordinate information on the position of the bounding box is applied.

The neck network 320 applies the coordinate convolution module to increase the accuracy of the bounding box: it adds the X and Y coordinates of the bounding box to the channels and then performs convolution.

The existing YOLOv3 used the FPN [18] structure to connect the image features from the backbone network. CC-YOLO according to an embodiment of the present invention uses PANet 321 instead of the FPN structure. PANet 321 adds a bottom-up path to the FPN, which is a top-down scheme, and can thereby obtain more information about the position of the bounding box. The image features P₃, P₄, and P₅ obtained from the backbone network are used as the inputs of PANet. The outputs of PANet 321 are used as the inputs of the head network 330: the detector (Detect) fed from P₅ detects relatively large objects, and the detector (Detect) fed from P₃ detects relatively small objects. The detailed structure of PANet 321 is shown in FIG. 3.

The neck network 320 according to an embodiment of the present invention may apply the CSP (Cross Stage Partial) technique to reduce the parameters and computation of the model. It may also use PANet (Path Aggregation Network), which adds a bottom-up path to the FPN (Feature Pyramid Network), to obtain additional information on the position of the bounding box. Furthermore, to increase the receptive field of the network, it may use an SPP (Spatial Pyramid Pooling) module that applies max pooling at multiple sizes and concatenates the results along the channel dimension.

The head network 330 according to an embodiment of the present invention detects an object by applying the YOLO layer 331 to the output of the neck network and performs regression learning on the detected object.

The head network 330 performs regression learning according to the center position of the bounding box and the IOU (Intersection over Union), and uses the CIOU (Complete Intersection over Union) loss, which computes the aspect ratio using the arctangent in order to impose aspect-ratio consistency.

CC-YOLO according to an embodiment of the present invention uses the YOLO layer 331, the object detection scheme of YOLOv3. The inputs P₃, P₄, and P₅ to the detectors (Detect) are transformed by the YOLO layer scheme described above, and regression learning is performed on the output of each detector. When performing object detection rather than model training, objects are detected from the three detectors using the bounding box extraction method described above.

FIG. 4 is a diagram for explaining a coordinate convolution (CoordConv) module according to an embodiment of the present invention.

Existing CNNs had the problem that, when given coordinate information and convolved, the object was not located at the position corresponding to those coordinates. In the present invention, adding the X and Y coordinates to the channels through the coordinate convolution (CoordConv) module makes it possible to predict the position of the object at the correct coordinates. Because object detection is also a problem of predicting bounding box positions, adding coordinate information to the neck network, which holds the bounding box information, as shown in FIG. 4, allows the bounding box to be positioned accurately.
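The channel-augmentation step of a coordinate convolution can be sketched as follows in NumPy. The sketch covers only the step that appends X and Y coordinate channels; the convolution that follows is omitted. Normalizing the coordinates to the range [-1, 1] and the (batch, channel, height, width) layout are assumptions of this example, and `add_coord_channels` is a hypothetical name.

```python
import numpy as np

def add_coord_channels(x):
    """Append an X-coordinate and a Y-coordinate channel to a (b, c, h, w) tensor."""
    b, c, h, w = x.shape
    ys = np.linspace(-1.0, 1.0, h, dtype=np.float32)   # row positions, top to bottom
    xs = np.linspace(-1.0, 1.0, w, dtype=np.float32)   # column positions, left to right
    yy, xx = np.meshgrid(ys, xs, indexing="ij")        # both of shape (h, w)
    xx = np.broadcast_to(xx, (b, 1, h, w))
    yy = np.broadcast_to(yy, (b, 1, h, w))
    return np.concatenate([x, xx, yy], axis=1)         # shape (b, c + 2, h, w)

features = np.zeros((2, 3, 4, 6), dtype=np.float32)
augmented = add_coord_channels(features)
print(augmented.shape)  # (2, 5, 4, 6)
```

A convolution applied to the augmented tensor sees, at every position, where that position lies in the feature map, which is the property the text relies on for locating bounding boxes.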

FIG. 5 is a diagram for explaining an SPP module according to an embodiment of the present invention.

As shown in FIG. 5, the SPP module according to an embodiment of the present invention applies max pooling at various sizes and concatenates the results along the channel dimension. In this embodiment, kernel sizes of 1, 5, 9, and 13 are used, all with a stride of 1. Batch normalization and Swish are applied after each 1x1 convolution (Conv). In this way, the receptive field of the network can be increased effectively. In this embodiment, the module is applied to the topmost extracted feature P₅, as shown in FIG. 3.
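The pooling-and-concatenation core of the SPP module can be sketched in NumPy as below. The sketch uses the kernel sizes 1, 5, 9, and 13 and the stride of 1 from the text, and assumes "same" padding so that all branches keep the input resolution; the 1x1 convolutions, batch normalization, and Swish are omitted. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def max_pool_same(x, k):
    """Max pooling with a k x k kernel, stride 1, and padding that preserves (h, w)."""
    if k == 1:
        return x
    p = k // 2
    padded = np.pad(x, ((0, 0), (0, 0), (p, p), (p, p)), constant_values=-np.inf)
    windows = sliding_window_view(padded, (k, k), axis=(2, 3))  # (b, c, h, w, k, k)
    return windows.max(axis=(-2, -1))

def spp(x, kernels=(1, 5, 9, 13)):
    """Concatenate max-pooled copies of a float (b, c, h, w) tensor along channels."""
    return np.concatenate([max_pool_same(x, k) for k in kernels], axis=1)

x = np.zeros((1, 1, 16, 16), dtype=np.float32)
x[0, 0, 8, 8] = 1.0                      # a single active pixel
out = spp(x)
print(out.shape)                         # (1, 4, 16, 16)
print(out.sum(axis=(0, 2, 3)))           # one pixel spreads over 1, 25, 81, 169 positions
```

The single active pixel influences a progressively larger neighborhood in each branch, which illustrates how the module enlarges the receptive field without changing the spatial resolution.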

FIG. 6 is a diagram for explaining a CSP module according to an embodiment of the present invention.

CSPNet according to an embodiment of the present invention has the advantage that it can be applied to any network. As shown in FIG. 6, CSPNet splits the input channels of an existing module into two parts and feeds each into a 1x1 convolution (Conv); one part passes through the bottleneck module, the other bypasses it, and the channels are merged again afterward. Because only half of the channels are passed through the module, the amount of computation is reduced. In CC-YOLO according to an embodiment of the present invention, CSPNet is applied to every bottleneck in the backbone network and the neck network.
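The two-path structure can be sketched in NumPy as follows. This is one possible reading of the description, stated as an assumption: each 1x1 convolution maps the full input to half of the channels, one half goes through an inner module, and a final 1x1 convolution merges the concatenated halves. The inner bottleneck is passed in as a callable, and all names and weights are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: weight has shape (out_channels, in_channels)."""
    return np.einsum("oc,bchw->bohw", weight, x)

def csp_block(x, w_main, w_skip, w_out, inner):
    """Cross-stage-partial sketch: only the main path runs through `inner`."""
    main = inner(conv1x1(x, w_main))    # half of the channels, through the bottleneck
    skip = conv1x1(x, w_skip)           # other half, bypassing the bottleneck
    return conv1x1(np.concatenate([main, skip], axis=1), w_out)

x = np.arange(1 * 4 * 2 * 2, dtype=np.float64).reshape(1, 4, 2, 2)
eye = np.eye(4)
# Selector weights make the split visible: channels 0-1 go to `inner`, 2-3 bypass it
y = csp_block(x, eye[:2], eye[2:], eye, inner=lambda t: 2.0 * t)
print(y.shape)  # (1, 4, 2, 2)
```

With these selector weights, only the first two channels are doubled by the inner module while the last two pass through unchanged, which makes the "half through, half around" behavior easy to verify.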

The existing YOLOv3 used the MSE loss [7] for bounding box regression. However, because regression learning is then performed only where the boxes overlap, the CIOU loss [13] and GIOU loss [19] works pointed out this problem and addressed it using the IOU loss [20]. The CIOU loss according to an embodiment of the present invention performs regression according to the center position of the box, the aspect ratio, and the IOU (Intersection over Union), and therefore has the advantage of converging quickly. The proposed algorithm therefore uses the CIOU loss, which takes all of these factors into account.
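For reference, the commonly published form of the CIOU loss combines an IOU term, a normalized center-distance term, and an arctangent-based aspect-ratio term. The Python sketch below follows that published form for a single pair of corner-format boxes; it is an illustration under that assumption and is not taken from the patent text. The function name and the small `eps` constant are choices made for the example.

```python
import math

def ciou_loss(pred, target, eps=1e-9):
    """CIOU loss for two boxes given as (x1, y1, x2, y2)."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = target
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # IOU term
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Squared distance between box centers, normalized by the squared
    # diagonal of the smallest box enclosing both boxes
    rho2 = ((px1 + px2) - (gx1 + gx2)) ** 2 / 4 + ((py1 + py2) - (gy1 + gy2)) ** 2 / 4
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term, computed with the arctangent
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(pw / (ph + eps))) ** 2
    alpha = v / (1 - iou + v + eps)

    return 1 - iou + rho2 / c2 + alpha * v

same = ciou_loss((0, 0, 10, 10), (0, 0, 10, 10))
apart = ciou_loss((0, 0, 10, 10), (20, 20, 30, 30))
print(same, apart)
```

Unlike a plain IOU loss, the loss for the two non-overlapping boxes is larger than 1 and still depends on how far apart the centers are, so it provides a useful training signal even without overlap.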

According to an embodiment of the present invention, the activation function used in Table 1 is Swish [22] rather than ReLU [20] or Mish [21]. Swish is used in the proposed detector because it is computationally faster than Mish and reaches the target value more quickly.
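The two activation functions compared here have simple closed forms: Swish is x multiplied by the sigmoid of x, and Mish is x multiplied by tanh(softplus(x)). The NumPy sketch below writes both out for comparison; it uses the basic Swish form without a learnable scale, which is an assumption about the variant intended.

```python
import numpy as np

def swish(x):
    """Swish: x * sigmoid(x), which needs a single exponential."""
    x = np.asarray(x, dtype=np.float64)
    return x / (1.0 + np.exp(-x))

def mish(x):
    """Mish: x * tanh(softplus(x)), which needs an exponential, a logarithm, and a tanh."""
    x = np.asarray(x, dtype=np.float64)
    return x * np.tanh(np.log1p(np.exp(x)))

xs = np.array([-2.0, 0.0, 1.0, 3.0])
print(swish(xs))
print(mish(xs))
```

Swish evaluates one transcendental function per element where Mish evaluates three, which is consistent with the statement that Swish is cheaper to compute.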

EMA according to an embodiment of the present invention computes a moving average of the learned parameters while the model is being trained. Because using the average of the learned parameters is more efficient than using only the last learned parameters, the EMA method is applied in this embodiment.
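A parameter moving average of this kind can be sketched as follows. The class name, the dictionary-of-arrays parameter format, and the decay value are assumptions for illustration; the text does not state the decay used.

```python
import numpy as np

class ParameterEMA:
    """Keeps an exponential moving average of model parameters during training."""

    def __init__(self, params, decay=0.999):
        self.decay = decay  # assumed value; closer to 1 means slower averaging
        self.shadow = {name: np.array(value, dtype=np.float64)
                       for name, value in params.items()}

    def update(self, params):
        """Blend the latest parameters into the running average after a training step."""
        for name, value in params.items():
            self.shadow[name] = (self.decay * self.shadow[name]
                                 + (1.0 - self.decay) * np.asarray(value, dtype=np.float64))

ema = ParameterEMA({"w": np.array([0.0])}, decay=0.5)
ema.update({"w": np.array([1.0])})   # average becomes 0.5
ema.update({"w": np.array([1.0])})   # average becomes 0.75
print(ema.shadow["w"])
```

At evaluation time the averaged values in `shadow` would be used in place of the last raw parameters.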

도 7은 본 발명의 일 실시예에 따른 좌표 정보를 이용한 객체 검출 방법을 설명하기 위한 흐름도이다.7 is a flowchart illustrating a method of detecting an object using coordinate information according to an embodiment of the present invention.

제안하는 좌표 정보를 이용한 객체 검출 방법은 백본(backbone) 네트워크를 통해 입력 영상 내의 이미지 특징을 추출하는 단계(710), 상기 추출된 이미지 특징에 대한 바운딩 박스의 회귀 정보 중 바운딩 박스 위치의 정확도를 높이기 위해 바운딩 박스 위치에 대한 좌표 정보를 갖는 좌표 컨볼루션 모듈을 적용한 넥(Neck) 네트워크를 통해 바운딩 박스의 회귀 정보를 출력하는 단계(720) 및 헤드(Head) 네트워크에서 넥 네트워크의 출력에 대해 YOLO 레이어를 사용하여 객체를 검출하고, 검출된 객체에 대한 회귀 학습을 수행하는 단계(730)를 포함한다. The object detection method using the proposed coordinate information includes extracting image features in an input image through a backbone network (710), and increasing the accuracy of the bounding box location among regression information of the bounding box for the extracted image features. Step 720 of outputting regression information of the bounding box through a neck network to which a coordinate convolution module having coordinate information on the location of the bounding box is applied, and a YOLO layer for output of the neck network in the head network and detecting an object using , and performing regression learning on the detected object (730).

In step 710, image features are extracted from the input image through the backbone network.

The backbone network according to an embodiment of the present invention uses a Focus module to divide the input image into grid cells and then adds the divided images to the channel information to widen the receptive field.

According to an embodiment of the present invention, the Focus module may be the module used in the YOLOv5 backbone. The input image is divided as shown in FIG. 2, and the divided images are added to the channel information. An RGB image has three channels, but the Focus module increases the number of channels, which widens the receptive field. As described above, the number of output channels of the Focus module is four times the number of input channels, and the width and height are halved.
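The slicing step of the Focus module can be sketched as follows. This is only the rearrangement of pixels into channels, assuming an image stored as a list of channels, each an H x W nested list with even H and W; the convolution that follows the slicing in YOLOv5 is omitted, and the function name is illustrative.

```python
def focus_slice(image):
    """Rearrange a C x H x W image into 4C x (H/2) x (W/2).

    Every second pixel is taken at four (row, column) offsets, and the
    four sub-images are stacked along the channel dimension.
    """
    out = []
    for row_off, col_off in ((0, 0), (1, 0), (0, 1), (1, 1)):
        for channel in image:
            out.append([row[col_off::2] for row in channel[row_off::2]])
    return out


# Example: a 3-channel 4 x 4 image becomes 12 channels of size 2 x 2.
image = [[[c * 100 + r * 10 + col for col in range(4)] for r in range(4)]
         for c in range(3)]
sliced = focus_slice(image)
```

No pixel is discarded: the spatial resolution is halved in each direction and the channel count is multiplied by four, matching the description above.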

In step 720, regression information of the bounding box is output through a neck network to which a coordinate convolution module having coordinate information on the bounding box position is applied, in order to increase the accuracy of the bounding box position among the regression information of the bounding box for the extracted image features.

To increase the accuracy of the bounding box, the neck network according to an embodiment of the present invention applies a coordinate convolution module, which adds the X and Y coordinates of the bounding box to the channels and then performs convolution.
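The channel-augmentation step of coordinate convolution can be sketched as below, following the general CoordConv idea in [15] of appending coordinate maps before an ordinary convolution. The normalization of coordinates to the range [-1, 1] and the function name are assumptions; the patent does not state how the coordinates are scaled, and the convolution itself is omitted.

```python
def add_coord_channels(features):
    """Append an X-coordinate map and a Y-coordinate map as two extra channels.

    features: list of C channels, each an H x W nested list.
    Coordinates are scaled to [-1, 1] (an assumed convention).
    """
    h = len(features[0])
    w = len(features[0][0])

    def norm(i, n):
        # Map index 0..n-1 to -1..1; a single row or column maps to 0.
        return 2.0 * i / (n - 1) - 1.0 if n > 1 else 0.0

    x_channel = [[norm(j, w) for j in range(w)] for _ in range(h)]
    y_channel = [[norm(i, h) for _ in range(w)] for i in range(h)]
    return features + [x_channel, y_channel]


# Example: one 2 x 3 feature map gains two coordinate channels.
augmented = add_coord_channels([[[0.5, 0.1, 0.2], [0.3, 0.9, 0.4]]])
```

A convolution applied to the augmented tensor can then condition its output on position, which is what allows the position of the bounding box to be regressed more precisely.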

The neck network according to an embodiment of the present invention may apply a CSP (Cross Stage Partial) technique to reduce the number of model parameters and the amount of computation. In addition, information on the location of the bounding box may be additionally obtained by using a PANet (Path Aggregation Network), which adds a bottom-up path to an FPN (Feature Pyramid Network). In addition, to increase the receptive field of the network, an SPP (Spatial Pyramid Pooling) module may be used to apply max pooling with a plurality of kernel sizes and to concatenate the results along the channel dimension.
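The SPP step described above (max pooling at several sizes, followed by channel-wise concatenation) might look like the following sketch. The kernel sizes 5, 9, and 13 and the stride of 1 with border clipping are assumptions borrowed from common YOLO variants; the patent does not list them.

```python
def max_pool_same(channel, k):
    """Stride-1 max pooling with a k x k window; output keeps the input size.

    Windows are clipped at the border, which is equivalent to padding
    with negative infinity.
    """
    h, w = len(channel), len(channel[0])
    r = k // 2
    return [[max(channel[y][x]
                 for y in range(max(0, i - r), min(h, i + r + 1))
                 for x in range(max(0, j - r), min(w, j + r + 1)))
             for j in range(w)]
            for i in range(h)]


def spp(features, kernel_sizes=(5, 9, 13)):
    """Concatenate the input channels with their pooled versions."""
    out = list(features)
    for k in kernel_sizes:
        out.extend(max_pool_same(ch, k) for ch in features)
    return out


# Example: one 3 x 3 channel pooled with a single 3 x 3 window.
pooled = spp([[[1, 2, 3], [4, 5, 6], [7, 8, 9]]], kernel_sizes=(3,))
```

With three kernel sizes, C input channels become 4C output channels, and each output location summarizes neighborhoods of several sizes.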

In step 730, the head network detects an object by applying a YOLO layer to the output of the neck network and performs regression learning on the detected object.

The head network according to an embodiment of the present invention performs regression learning according to the center position of the bounding box and the IOU (Intersection Over Union), and uses a CIOU (Complete Intersection Over Union) loss, which computes the aspect ratio with the arctangent in order to impose aspect ratio consistency.
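A minimal sketch of a CIOU loss for one pair of boxes is given below, following the usual formulation associated with [13]: one minus the IOU, plus a normalized center-distance term, plus an aspect-ratio term built from arctangents. The box format (x1, y1, x2, y2), the small epsilon, and the function name are assumptions made for illustration.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIOU loss between a predicted box and a ground-truth box.

    Both boxes are (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    """
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # Overlap term: intersection over union.
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter
    iou = inter / (union + eps)

    # Center term: squared center distance over squared enclosing diagonal.
    rho2 = (((x1 + x2) - (gx1 + gx2)) ** 2 + ((y1 + y2) - (gy1 + gy2)) ** 2) / 4.0
    cw = max(x2, gx2) - min(x1, gx1)
    ch = max(y2, gy2) - min(y1, gy1)
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio term: difference of arctangents of width / height.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - iou + rho2 / c2 + alpha * v
```

Unlike a plain IOU loss, the center term still provides a gradient when the boxes do not overlap, and the arctangent term penalizes a mismatch in aspect ratio.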

To verify the efficiency of the object detection network structure using coordinate information according to an embodiment of the present invention, object detection performance was compared on the MS COCO 2017 dataset, and an ablation test was performed for each of the proposed techniques, as shown in Table 2.

<Table 2>

In each experiment, one RTX 2080 Ti GPU was used with a default image size of 640×640. The training methods proposed in [23], which was introduced as a way to improve performance without modifying the architecture (learning rate warmup, zero γ, no bias decay, and cosine learning rate decay), were applied. The network was trained with SGD (Stochastic Gradient Descent), and the initial learning rate was set to 0.01. The SGD momentum was set to 0.937, and the weight decay was set to 0.0005. The backbone network of the proposed detector is initialized with pretrained weights and was set not to be updated during training. The PANet in the neck network and the head network were not pretrained, and their initial values were set randomly. mAP (mean Average Precision) is an evaluation metric for object detection performance that evaluates the model by comparing the ground-truth bounding boxes with the predicted bounding boxes. mAP50 is an evaluation scheme in which a prediction is counted as correct when the predicted bounding box overlaps the ground-truth bounding box by 50% or more. The parameter count indicates the model size, and GFLOPS (Giga Floating Point Operations Per Second) indicates the amount of computation a computer can perform in one second. FPS (Frames Per Second) is the number of frames the object detector can process in one second.
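Two of the training methods listed above, learning rate warmup and cosine learning rate decay, can be combined into a single schedule as in the sketch below. Only the initial learning rate of 0.01 comes from the text; the linear shape of the warmup, the number of warmup steps, and the total number of steps are assumptions.

```python
import math


def learning_rate(step, total_steps, base_lr=0.01, warmup_steps=1000):
    """Learning rate at a given step: linear warmup, then cosine decay to zero.

    base_lr matches the initial learning rate stated in the text;
    warmup_steps and total_steps are illustrative values.
    """
    if step < warmup_steps:
        # Ramp up linearly from base_lr / warmup_steps to base_lr.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Example: sample the schedule at a few points of a 10,000-step run.
samples = [learning_rate(s, 10000) for s in (0, 999, 5500, 10000)]
```

The warmup avoids large updates while the randomly initialized neck and head are still far from a reasonable solution, and the cosine phase lowers the rate smoothly toward the end of training.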

Comparing A and B in Table 2, the original YOLOv3 uses leaky ReLU in the model structure and an MSE loss for the bounding box, whereas using Swish and the CIOU loss reduced the loss, and applying the EMA method improved mAP overall.

Comparing B and C in Table 2, adding the SPP module increased the parameter count by about 2% and the amount of computation by 0.5%. However, the network perceives a wider region of the image, and mAP increased by about 0.4.

Comparing C and D in Table 2, applying CSPNet reduced the parameter count by about 23%, from 62.99M to 48.51M, and reduced the amount of computation from 157.1G to 115.7G. Up to epoch 40 the mAP of model C was higher, but after that the order gradually reversed, and the final mAP increased by about 0.2.
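The relative changes quoted for models C and D can be checked with a short calculation. The helper name is illustrative; the four input figures are the ones given in the text.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Parameter count: 62.99M -> 48.51M
params_change = percent_change(62.99, 48.51)

# Amount of computation: 157.1G -> 115.7G
compute_change = percent_change(157.1, 115.7)
```

The parameter count drops by roughly 23%, as stated, and the amount of computation drops by roughly 26%.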

Comparing D and E in Table 2, darknet53, the backbone network of the original YOLOv3, was replaced with the YOLOv5 backbone network. The YOLOv5 backbone network adds the Focus module and stacks further backbone layers after the SPP block, which increases the overall receptive field. The parameter count increased slightly, but mAP increased by about 0.3.

Comparing E and F in Table 2, PANet was used instead of FPN as the neck network of the model. Because more deep feature values are available for the bounding box, the accuracy of the object class and location within the bounding box improved. mAP50 decreased slightly, but mAP increased by about 0.8.

Comparing F and G in Table 2, a coordinate convolution (CoordConv) carrying coordinate information was added to the PANet, which carries bounding box information, so that the location of the bounding box is predicted more accurately. The parameter count increased by about 0.02M and the amount of computation by about 0.1G, but mAP increased by about 0.7 and mAP50 increased by about 0.9.

Comparing G and H in Table 2, the experiment used a backbone network initialized by transfer learning, a method of improving performance described in [23,24]. mAP increased by about 1.3, and mAP50 also increased substantially, by about 1.2.

As described above, according to an embodiment of the present invention, detection performance can be improved by applying various techniques to a YOLO-based one-stage object detector. The proposed object detection network using coordinate information can improve training performance by using the CIOU loss, Swish, and EMA, and can widen the receptive field of the network by using the Focus module, the SPP module, and PANet. In addition, the location of the bounding box can be marked accurately by adding a coordinate convolution (CoordConv) module that carries coordinate information, and the number of model parameters and the amount of computation can be reduced by using the CSP technique.

The devices described above may be implemented as hardware components, software components, and/or a combination of hardware components and software components. For example, the devices and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the description sometimes refers to a single processing device, but those skilled in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions that can be executed by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code such as that produced by a compiler, as well as high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to a limited set of embodiments and drawings, those skilled in the art can make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described system, structure, device, circuit, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

<References>

[1] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 9, pp. 1627-1645, 2010.

[2] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," arXiv [cs.CV], pp. 580-587, 2013.

[3] R. Girshick, "Fast R-CNN," in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440-1448.

[4] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1137-1149, 2017.

[5] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian, "CenterNet: Keypoint triplets for object detection," in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6569-6578, 2019.

[6] M. Tan, R. Pang, and Q. V. Le, "EfficientDet: Scalable and Efficient Object Detection," arXiv [cs.CV], pp. 10781-10790, 2019.

[7] J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv [cs.CV], 2018.

[8] A. Bochkovskiy, C. Y. Wang, and H. Y. M. Liao, "YOLOv4: Optimal speed and accuracy of object detection," arXiv [cs.CV], 2020.

[9] X. Long, K. Deng, G. Wang, Y. Zhang, Q. Dang, Y. Gao, H. Shen, J. Ren, S. Han, E. Ding, and S. Wen, "PP-YOLO: An effective and efficient implementation of object detector," arXiv [cs.CV], 2020.

[10] YOLO-v5. https://github.com/ultralytics/yolov5

[11] K. He, X. Zhang, S. Ren, and J. Sun, "Spatial pyramid pooling in deep convolutional networks for visual recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1904-1916, Sep. 2015.

[12] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, "Path aggregation network for instance segmentation," in 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8759-8768, 2018.

[13] Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye, and D. Ren, "Distance-IoU loss: Faster and better learning for bounding box regression," arXiv [cs.CV], 2019.

[14] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for Activation Functions," arXiv [cs.NE], 2017.

[15] R. Liu, J. Lehman, P. Molino, F. P. Such, E. Frank, A. Sergeev, and J. Yosinski, "An intriguing failing of convolutional neural networks and the CoordConv solution," arXiv [cs.CV], 2018.

[16] C. Y. Wang, H. Y. Mark Liao, Y. H. Wu, P. Y. Chen, J. W. Hsieh, and I. H. Yeh, "CSPNet: A New Backbone that can Enhance Learning Capability of CNN," in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 390-391, 2020.

[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision - ECCV 2014, Cham: Springer International Publishing, pp. 740-755, 2014.

[18] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117-2125, 2017.

[19] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, "Generalized intersection over Union: A metric and A loss for bounding box regression," arXiv [cs.CV], pp. 658-666, 2019.

[20] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, "UnitBox: An advanced object detection network," in Proceedings of the 2016 ACM on Multimedia Conference - MM '16, 2016.

[21] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," Stanford.edu, 2013.

[22] D. Misra, "Mish: A self regularized non-monotonic activation function," arXiv [cs.LG], 2019.

[23] T. He, Z. Zhang, H. Zhang, Z. Zhang, J. Xie, and M. Li, "Bag of tricks for image classification with convolutional neural networks," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 558-567, 2019.

[24] Z. Zhang, T. He, H. Zhang, Z. Zhang, J. Xie, and M. Li, "Bag of freebies for training object detection neural networks," arXiv [cs.CV], 2019.

Claims

A YOLO-based object detection device comprising:
a backbone network that extracts image features from an input image;
a neck network that includes regression information of a bounding box for the extracted image features and outputs the regression information of the bounding box, wherein a coordinate convolution module having coordinate information on the bounding box position is applied to increase the accuracy of the bounding box position among the regression information of the bounding box; and
a head network that detects an object by applying a YOLO layer to the output of the neck network and performs regression learning on the detected object.

The YOLO-based object detection device of claim 1,
wherein the backbone network
divides the input image into grid cells using a Focus module and then adds the divided images to the channel information to widen the receptive field.

The YOLO-based object detection device of claim 1,
wherein the neck network,
in order to increase the accuracy of the bounding box, applies a coordinate convolution module to add the X and Y coordinates of the bounding box to the channels and then performs convolution.

The YOLO-based object detection device of claim 3,
wherein the neck network:
applies a CSP (Cross Stage Partial) technique to reduce the number of model parameters and the amount of computation;
additionally obtains information on the location of the bounding box by using a PANet (Path Aggregation Network), which adds a bottom-up path to an FPN (Feature Pyramid Network); and
in order to increase the receptive field of the network, uses an SPP (Spatial Pyramid Pooling) module to apply max pooling with a plurality of sizes and concatenates the results along the channel dimension.

The YOLO-based object detection device of claim 1,
wherein the head network
performs regression learning according to the center position of the bounding box and the IOU (Intersection Over Union), and uses a CIOU (Complete Intersection Over Union) loss that computes the aspect ratio with the arctangent in order to impose aspect ratio consistency.

A YOLO-based object detection method comprising:
extracting image features from an input image through a backbone network;
outputting regression information of a bounding box through a neck network to which a coordinate convolution module having coordinate information on the bounding box position is applied, in order to increase the accuracy of the bounding box position among the regression information of the bounding box for the extracted image features; and
detecting, in a head network, an object by applying a YOLO layer to the output of the neck network, and performing regression learning on the detected object.

The YOLO-based object detection method of claim 6,
wherein the step of extracting image features from the input image through the backbone network comprises
dividing the input image into grid cells using a Focus module and then adding the divided images to the channel information to widen the receptive field.

The YOLO-based object detection method of claim 6,
wherein the step of outputting the regression information of the bounding box through the neck network, to which the coordinate convolution module having coordinate information on the bounding box position is applied in order to increase the accuracy of the bounding box position among the regression information of the bounding box for the extracted image features, comprises
applying the coordinate convolution module, in order to increase the accuracy of the bounding box, to add the X and Y coordinates of the bounding box to the channels and then performing convolution.

The YOLO-based object detection method of claim 8,
wherein the neck network applies a CSP (Cross Stage Partial) technique to reduce the number of model parameters and the amount of computation,
additionally obtains information on the location of the bounding box by using a PANet (Path Aggregation Network), which adds a bottom-up path to an FPN (Feature Pyramid Network), and,
in order to increase the receptive field of the network, uses an SPP (Spatial Pyramid Pooling) module to apply max pooling with a plurality of sizes and concatenates the results along the channel dimension.

The YOLO-based object detection method of claim 6,
wherein the step of detecting, in the head network, an object by applying a YOLO layer to the output of the neck network and performing regression learning on the detected object comprises
performing regression learning according to the center position of the bounding box and the IOU (Intersection Over Union), and using a CIOU (Complete Intersection Over Union) loss that computes the aspect ratio with the arctangent in order to impose aspect ratio consistency.