KR102512151B1 - Method and apparatus for object detection

Info

Publication number: KR102512151B1
Application number: KR1020200156354A
Authority: KR
Inventors: 원웅재; 순 권; 박재형; 김태훈
Original assignee: 재단법인대구경북과학기술원
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2023-03-20
Also published as: KR20220069336A

Abstract

An object detection method and apparatus using a high-performance, low-computation object detection model are provided. The model enables object detection that is robust to variations in object shape and size. In addition, an efficient post-processing step improves the accuracy of object region determination.

Description

Object detection method and apparatus {METHOD AND APPARATUS FOR OBJECT DETECTION}

The present invention relates to a method and apparatus for object detection and, more particularly, to a method and apparatus for detecting an object from an input image using an object detection model.

The content described below is provided only as background information related to embodiments of the present invention and does not necessarily constitute prior art.

Deep learning-based object detection models are broadly divided into two-stage detection models and one-stage detection models.

A two-stage detection model consists of a module that generates Regions of Interest (RoIs) and a module that estimates the object class and location of each RoI through RoI sampling. A one-stage detection model consists of a single module that estimates the class and location of an object in one pass, without RoI sampling.

In terms of detection performance, two-stage detection models show superior results, whereas in terms of speed, one-stage detection models are superior.

Meanwhile, to detect objects of various sizes, conventional object detection models define detection templates in advance, namely aspect ratios and base object sizes (anchors). Recent one-stage models, building on research into multi-scale feature fusion, achieve strong detection performance without anchors, but the multi-scale fusion requires many learnable parameters and operations.

Developing a high-performance, low-computation deep learning model for object detection requires a one-stage model structure that needs no anchors. It also requires a feature decoding method that can represent features robust to variations in size and shape, together with a training method based on it. Furthermore, an efficient NMS (Non-Maximum Suppression) post-processing method is needed to remove overlapping regions among the predicted object regions.

Korean Laid-Open Patent Publication (A) No. 10-2019-0005045 (published 2019.01.15)

One object of the present invention is to provide an object detection method and apparatus based on an object detection model that does not require anchors.

One object of the present invention is to provide an object detection model that is robust to changes in the size and shape of objects.

One object of the present invention is to provide an efficient NMS post-processing method capable of removing overlapping regions among predicted object regions.

The objects of the present invention are not limited to those mentioned above; other objects and advantages not mentioned can be understood from the following description and will be more clearly understood from the embodiments of the present invention. It will also be appreciated that the objects and advantages of the present invention can be realized by the means indicated in the claims and combinations thereof.

An object detection method according to an embodiment of the present invention may include: obtaining, for each decoding layer of a plurality of decoding layers, at least one piece of object information on at least one candidate object from an input image by using an object detection model including the plurality of decoding layers; performing an E-NMS (Efficient Non-Maximum Suppression) operation for each decoding layer based on the at least one piece of object information obtained from that decoding layer; and re-performing the E-NMS operation based on the results of the E-NMS operations for the respective decoding layers.

For example, the object information may include object region prediction information, object IOU (Intersection Over Union) prediction information, and object classification prediction information.

For example, the E-NMS operation may include a first operation of determining an initial value of a confidence score for the object region prediction information of the at least one candidate object, a second operation of updating the confidence score based on the object region prediction information and the object IOU prediction information of the at least one candidate object, and a third operation of determining a final object region of the at least one candidate object based on the updated confidence score.

An object detection apparatus according to an embodiment of the present invention includes a memory storing an object detection model including a plurality of decoding layers, and one or more processors. The processor may be configured to obtain, for each decoding layer of the plurality of decoding layers, at least one piece of object information on at least one candidate object from an input image by using the object detection model, to perform an E-NMS (Efficient Non-Maximum Suppression) operation for each decoding layer based on the at least one piece of object information obtained for that decoding layer, and to re-perform the E-NMS operation based on the results of the E-NMS operations for the respective decoding layers.

Other aspects, features, and advantages beyond those described above will become apparent from the following drawings, claims, and detailed description of the invention.

According to embodiments of the present invention, a high-performance, low-computation object detection model that does not require anchors is provided.

According to embodiments of the present invention, object detection that is robust to variations in object shape and size is possible.

According to embodiments of the present invention, an efficient post-processing step (E-NMS) improves the accuracy of object region determination.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is an exemplary diagram schematically illustrating an object detection process according to an embodiment.
FIG. 2 is a block diagram of an object detection apparatus according to an embodiment.
FIG. 3 is a flowchart of an object detection method according to an embodiment.
FIG. 4 is a schematic illustration of an object detection model according to an embodiment.
FIG. 5 is a configuration diagram of an object detection model according to an embodiment.
FIG. 6 is a configuration diagram of the common decoder of an object detection model according to an embodiment.
FIG. 7 is a configuration diagram of the header and decoder of an object detection model according to an embodiment.
FIG. 8 is a schematic illustration of an E-NMS operation according to an embodiment.
FIG. 9 is a flowchart of an E-NMS operation according to an embodiment.
FIG. 10 is a flowchart of the second operation of an E-NMS operation according to an embodiment.
FIG. 11 is an exemplary diagram of an IOU matrix according to an embodiment.
FIG. 12 is a diagram showing, by way of example, an object detection result of an object detection model according to an embodiment.

Hereinafter, the present invention is described in more detail with reference to the drawings. The invention may be embodied in many different forms and is not limited to the embodiments described herein. In the following embodiments, parts not directly related to the description are omitted in order to describe the present invention clearly, but this does not mean that the omitted components are unnecessary when implementing a device or system to which the spirit of the present invention is applied. In addition, the same reference numerals are used for the same or similar elements throughout the specification.

In the following description, terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms; the terms are used only to distinguish one component from another. Also, in the following description, singular expressions include plural expressions unless the context clearly indicates otherwise.

In the following description, terms such as "include" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

The present invention is described in detail below with reference to the drawings.

FIG. 1 is an exemplary diagram schematically illustrating an object detection process according to an embodiment.

The object detection apparatus 100 according to an embodiment receives an input image. For example, the input image may be obtained directly using a camera connected to the object detection apparatus 100 or may be received from an external device through a communication network. Here, the input image includes both still images and video.

The object detection apparatus 100 analyzes the received input image using an object detection model. The object detection model is an image-based deep learning model for obtaining object information from the input image. The object detection model is described later with reference to FIGS. 4 to 7.

In one example, the object information may include object region prediction information, IOU (Intersection Over Union) prediction information, and object classification prediction information.

Object region prediction information is prediction information on the image region that corresponds to an object in the input image.

Object IOU prediction information indicates the degree of overlap between the object region prediction information and the Ground Truth (GT) information. For example, the object IOU prediction information may be determined based on the value obtained by dividing the size of the intersection of the predicted object region and the GT region of the object by the size of their union.
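The intersection-over-union ratio described above can be sketched as follows. This is a minimal illustration, assuming axis-aligned boxes in `(x1, y1, x2, y2)` corner format; the box format and the function name `iou` are assumptions made for the example and are not specified in the text.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed format).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union = area(A) + area(B) - intersection.
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


predicted = (1, 1, 3, 3)      # predicted object region
ground_truth = (0, 0, 2, 2)   # GT region of the same object
overlap = iou(predicted, ground_truth)  # intersection 1, union 7
```

A value of 1.0 means the predicted region coincides with the GT region, and 0.0 means the two do not overlap at all.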

Object classification prediction information is a probabilistic prediction of which kind of object the predicted object region contains.

Additionally, the object information may further include object size prediction information and object classification accuracy prediction information. These are described later with reference to FIG. 7.

The object detection apparatus 100 corresponds to any of various types of electronic devices capable of executing the object detection model.

In one example, the object detection apparatus 100 may be mounted on a vehicle. In another example, the object detection apparatus 100 may be an electronic device capable of communicating with a vehicle; for example, it may include a server connected to the vehicle through a network. In yet another example, the object detection apparatus 100 may include a terminal device such as a robot or a smartphone, but is not limited thereto.

FIG. 2 is a block diagram of an object detection apparatus according to an embodiment.

The object detection apparatus 100 according to an embodiment may include a processor 110 and a memory 120. The components shown in FIG. 2 are exemplary, and the object detection apparatus 100 may further include additional components.

The object detection apparatus 100 may include a processor 110.

The processor 110 is a kind of central processing unit and may control the operation of the object detection apparatus 100 by executing one or more instructions stored in the memory 120. The processor 110 may include any type of device capable of processing data by executing instructions.

The processor 110 may refer, for example, to a data processing device embedded in hardware that has physically structured circuitry for performing functions expressed as code or instructions contained in a program. Examples of such hardware-embedded data processing devices include a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field-programmable gate array (FPGA), but are not limited thereto. The processor 110 may include one or more processors.

Additionally, the object detection apparatus 100 may include a memory 120.

The memory 120 may store an object detection model for obtaining object information. The memory 120 may also store instructions for the process by which the object detection apparatus 100 executes the object detection model.

The processor 110 may execute the object detection process according to an embodiment based on the program, instructions, and object detection model stored in the memory 120.

The memory 120 may further store intermediate data and results produced by the algorithm during the computation for object detection.

The memory 120 may include internal and/or external memory, and may include volatile memory such as DRAM, SRAM, or SDRAM; non-volatile memory such as OTPROM (one-time programmable ROM), PROM, EPROM, EEPROM, mask ROM, flash ROM, NAND flash memory, or NOR flash memory; a flash drive such as an SSD, CF (compact flash) card, SD card, Micro-SD card, Mini-SD card, xD card, or memory stick; or a storage device such as an HDD. The memory 120 may include magnetic storage media or flash storage media, but is not limited thereto.

FIG. 3 is a flowchart of an object detection method according to an embodiment.

The object detection method according to an embodiment may include: obtaining, for each decoding layer of a plurality of decoding layers, at least one piece of object information on at least one candidate object from an input image by using an object detection model including the plurality of decoding layers (S10); performing an E-NMS (Efficient Non-Maximum Suppression) operation for each decoding layer based on the at least one piece of object information obtained from that decoding layer (S20); and re-performing the E-NMS operation based on the results of the E-NMS operations for the respective decoding layers (S30).

In step S10, the processor 110 may use the object detection model including the plurality of decoding layers to obtain, for each decoding layer, at least one piece of object information on at least one candidate object from the input image.

Here, the plurality of decoding layers correspond to the bottom-up path aggregation layers of the common decoder 2, described later with reference to FIG. 5.

In one example, the object information may include object region prediction information (pred_box_posts), IOU (Intersection Over Union) prediction information (pred_reward_iou), and object classification prediction information (pred_confidence).

In step S20, the processor 110 may perform an E-NMS operation for each decoding layer based on the at least one piece of object information obtained from that decoding layer in step S10.

In step S20, the processor 110 performs the E-NMS operation separately for each decoding layer. That is, in step S20 the processor 110 performs a layer-wise E-NMS operation.

The E-NMS operation may include a first operation of determining an initial value of the confidence score of the at least one candidate object, a second operation of updating the confidence score based on the object region prediction information and the object IOU prediction information of the at least one candidate object, and a third operation of determining a final object region of the at least one candidate object based on the updated confidence score. The E-NMS operation according to an embodiment is described in detail later with reference to FIGS. 8 to 11.

In step S30, the processor 110 may re-perform the E-NMS operation based on the results of the E-NMS operations performed for the respective decoding layers in step S20.

That is, in step S30 the processor 110 may combine the results of the E-NMS operations performed per decoding layer in step S20 and re-perform the E-NMS operation on the combined object information.
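The two-pass flow of steps S20 and S30 can be sketched as follows. This is only an illustration of the control flow: the suppression routine here is ordinary greedy NMS used as a stand-in, not the E-NMS operation of the embodiment (whose score update is described with reference to FIGS. 8 to 11). The names `Candidate`, `greedy_nms`, and `layerwise_then_merged`, the box format, and the threshold of 0.5 are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    box: tuple    # (x1, y1, x2, y2), assumed corner format
    score: float  # confidence score of the candidate


def _iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(candidates, iou_thr=0.5):
    """Stand-in suppression: keep the best-scoring box, drop boxes overlapping a kept one."""
    kept = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if all(_iou(cand.box, k.box) <= iou_thr for k in kept):
            kept.append(cand)
    return kept


def layerwise_then_merged(per_layer_candidates, iou_thr=0.5):
    """S20: suppress within each decoding layer; S30: suppress again on the merged survivors."""
    survivors = []
    for layer_candidates in per_layer_candidates:        # S20, layer-wise
        survivors.extend(greedy_nms(layer_candidates, iou_thr))
    return greedy_nms(survivors, iou_thr)                # S30, on the combined results


# Two decoding layers that both see object A; only the second layer sees object B.
layer_a = [Candidate((0, 0, 10, 10), 0.9), Candidate((1, 1, 10, 10), 0.6)]
layer_b = [Candidate((0, 0, 10, 11), 0.8), Candidate((20, 20, 30, 30), 0.7)]
final = layerwise_then_merged([layer_a, layer_b])
```

The first pass removes duplicates produced within one layer, so the second pass only has to compare the much smaller set of survivors across layers.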

An object detection model according to an embodiment is described below with reference to FIGS. 4 to 7.

FIG. 4 is a schematic illustration of an object detection model according to an embodiment.

The object detection model according to an embodiment may include a common encoder 1; a common decoder 2; a header 3 including an object region header 3-1 and an object classification header 3-2; and a decoder 4 including an object region decoder 4-1 and an object classification decoder 4-2.

The object detection model is an object detection deep learning model based on a fully convolutional neural network.

The common encoder 1 and common decoder 2 of the object detection model are positioned ahead of the headers 3 and decoders 4, which are implemented per task (for example, an object region prediction task and an object classification prediction task), and extract common feature maps from the input image.

That is, because the common encoder 1 and common decoder 2 are shared between tasks, the object detection model can eliminate redundant computation for encoding and decoding the features of the input image.

In the object detection model, the header 3 and the decoder 4 may be implemented per task related to object detection. That is, the header 3 and the decoder 4 are implemented with layers and operations suited to the object information that each task is to detect.

The header 3 converts all or part of the feature maps output through the common encoder 1 and the common decoder 2 into feature maps suited to the decoder 4. The decoder 4 receives the feature maps converted by the header 3 and outputs the object information to be detected.

The object detection model according to an embodiment is designed so that tasks beyond the object region prediction task and object classification prediction task described above can be easily plugged in or out by connecting an additional header 3 and decoder 4 to the common encoder 1 and common decoder 2.
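The shared-trunk, per-task arrangement described above can be sketched in plain Python as follows. This is a structural illustration only: the class `DetectionModel`, its `plug_in` and `plug_out` methods, and the toy callables are hypothetical names introduced for the example, and real encoders, headers, and decoders would be neural network modules rather than simple functions.

```python
class DetectionModel:
    """Shared encoder/decoder trunk with per-task (header, decoder) pairs."""

    def __init__(self, common_encoder, common_decoder):
        self.common_encoder = common_encoder
        self.common_decoder = common_decoder
        self.tasks = {}  # task name -> (header, task_decoder)

    def plug_in(self, name, header, task_decoder):
        """Attach a task by connecting its header and decoder to the shared trunk."""
        self.tasks[name] = (header, task_decoder)

    def plug_out(self, name):
        """Detach a task without touching the shared trunk or other tasks."""
        self.tasks.pop(name, None)

    def __call__(self, image):
        # The common feature maps are computed once and reused by every task.
        shared = self.common_decoder(self.common_encoder(image))
        return {
            name: task_decoder(header(shared))
            for name, (header, task_decoder) in self.tasks.items()
        }


# Toy stand-ins so the sketch runs; a counter shows the trunk runs once per input.
calls = {"encoder": 0}


def toy_encoder(x):
    calls["encoder"] += 1
    return [x, x * 2]


def toy_decoder(features):
    return [f + 1 for f in features]


model = DetectionModel(toy_encoder, toy_decoder)
model.plug_in("region", header=lambda f: f[0], task_decoder=lambda h: ("box", h))
model.plug_in("classification", header=lambda f: f[1], task_decoder=lambda h: ("cls", h))
out = model(3)
```

Each header selects the part of the shared feature maps that its decoder needs, which mirrors the statement that a header may use all or only part of the common output.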

The structure of the object detection model is described in more detail below with reference to FIG. 5.

FIG. 5 is a configuration diagram of an object detection model according to an embodiment.

The common encoder 1 may include a series of encoding layers (Conv1, Conv2, Conv3, Conv4, Conv5, Conv6, and Conv7). The common encoder 1 outputs a plurality of first common feature maps that encode the features of the input image through the series of encoding layers. The plurality of first common feature maps correspond to the feature maps output from the respective encoding layers.

The common encoder 1 may include the Conv1 to Conv5 encoding layers. The Conv1 to Conv5 encoding layers may adopt a structure such as VGGNet, ResNet, XceptionNet, ResNeXt, ShuffleNet, or MobileNet.

The common encoder 1 may additionally include the Conv6 and Conv7 encoding layers. This structure allows features of large objects and regions in the input image to be represented well and is needed for multi-scale feature fusion in the common decoder 2.

The common decoder 2 may include multiple decoding layers. The multiple decoding layers may include a Feature Pyramid Network (FPN) structure, which is a decoder structure for object detection, and a Bottom-up Path Aggregation (BPA) layer.

The common decoder 2 outputs a plurality of second common feature maps (P3, P4, P5, P6, and P7) through the multiple decoding layers. The plurality of second common feature maps correspond to the feature maps output from the respective layers of the BPA layer.

The plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the common decoder 2 allow features to be extracted that are robust to variations in object size and shape. This is possible because the common decoder 2 generates the second common feature maps by multi-scale feature fusion, and the header 3 and decoder 4 described later operate on these second common feature maps. The header 3 constructs an input feature map based on at least some of the second common feature maps (P3, P4, P5, P6, and P7) and provides it to the decoder 4.

The plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the common decoder 2 are input to both the header 3-1 implemented for object region prediction and the header 3-2 implemented for object classification prediction. The header 3 and the decoder 4 are described later with reference to FIG. 7.

The common decoder 2 is described in detail below with reference to FIG. 6.

FIG. 6 is a configuration diagram of the common decoder of an object detection model according to an embodiment.

The common decoder 2 includes multiple decoding layers that extract the plurality of second common feature maps through multi-scale feature fusion.

The multiple decoding layers may include: a top-down layer (2-1) that upsamples and adds a part (C3, C4, and C5) of the plurality of first common feature maps (C3, C4, C5, C6, and C7) output from the common encoder (1); an intermediate layer (2-2) that performs convolution on the remaining first common feature maps (C6 and C7) and on the output of the top-down layer (2-1); and a bottom-up path aggregation layer (2-3) that convolves and adds the outputs of the intermediate layer (2-2) to output the plurality of second common feature maps (P3, P4, P5, P6, and P7).

The top-down layer (2-1) upsamples and adds a part of the plurality of first common feature maps output from the common encoder (1), and outputs the result.

In one example, the top-down layer operates on the three feature maps (C3, C4, and C5) corresponding to medium resolution among the plurality of first common feature maps (C3, C4, C5, C6, and C7) generated by the common encoder (1).

In stage 1, the top-down layer (2-1) applies a 1x1x256 convolution to the first common feature map (C5) output from the fifth encoding layer (Conv5) of the common encoder (1) and outputs the resulting feature map (C5'). In stage 2, it upsamples the stage 1 output (C5') by a factor of 2, adds the feature map (C4') obtained by applying a 1x1x256 convolution to the first common feature map (C4) output from the fourth encoding layer (Conv4), and outputs the sum (C4' + C5').

In the same manner, in stage 3 the top-down layer (2-1) upsamples the stage 2 output (C4' + C5') by a factor of 2, adds the feature map (C3') obtained by applying a 1x1x256 convolution to the first common feature map (C3) output from the third encoding layer (Conv3), and outputs the sum (C3' + C4' + C5'). In this way, low-resolution and high-resolution features are cumulatively fused in the top-down direction.
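The three top-down stages can be illustrated with a minimal Python sketch. It is not the embodiment's implementation: it assumes single-channel maps stored as nested lists, assumes nearest-neighbour interpolation for the 2x upsampling (the text does not name the interpolation method), and takes C3', C4', and C5' as already produced by the 1x1x256 convolutions. The function names are hypothetical.

```python
def upsample2x(fm):
    """Nearest-neighbour 2x upsampling of a 2-D map (assumed interpolation)."""
    out = []
    for row in fm:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out


def add2d(a, b):
    """Element-wise sum of two maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(c3p, c4p, c5p):
    """Stages 1 to 3: C5', then C4' + C5', then C3' + C4' + C5'."""
    t5 = c5p                             # stage 1
    t4 = add2d(upsample2x(t5), c4p)      # stage 2
    t3 = add2d(upsample2x(t4), c3p)      # stage 3
    return t3, t4, t5


# Toy inputs: C5' is 1x1, C4' is 2x2, C3' is 4x4.
c5p = [[1.0]]
c4p = [[1.0, 2.0], [3.0, 4.0]]
c3p = [[0.5] * 4 for _ in range(4)]
t3, t4, t5 = top_down(c3p, c4p, c5p)
```

Each stage doubles the spatial size of the running sum before adding the next map, which is why the finest output carries contributions from all three levels.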

The intermediate layer (2-2) performs convolution on the first common feature maps (C6 and C7) that remain after excluding the part used by the top-down layer, and on the output of the top-down layer (2-1).

The intermediate layer (2-2) applies a 3x3x256 convolution to each of the two feature maps (C6 and C7) that are not used by the top-down layer (2-1) among the plurality of first common feature maps (C3, C4, C5, C6, and C7) output from the common encoder (1). That is, the intermediate layer (2-2) applies a 3x3x256 convolution to the first common feature map (C7) output from the last encoding layer (C7) of the common encoder (1) to output the intermediate feature map (M7). Likewise, it applies a 3x3x256 convolution to the first common feature map (C6) output from the sixth encoding layer (C6) of the common encoder (1) to output the intermediate feature map (M6).

The intermediate layer (2-2) also applies a 3x3x256 convolution to each output of the top-down layer (2-1) to output the intermediate feature maps (M5, M4, and P3).

The bottom-up path aggregation layer (2-3) convolves and adds the outputs of the intermediate layer (2-2) to output the plurality of second common feature maps (P3, P4, P5, P6, and P7). It generates these maps while proceeding from the lowest level of the pyramid to the highest.

The bottom-up path aggregation layer (2-3) applies a 3x3x256, stride=2 convolution to the intermediate feature map P3 output from the intermediate layer (2-2) to generate a lower-resolution feature map (P3'), and adds it to the intermediate feature map (M4) output from the intermediate layer (2-2) (P3' + M4) to generate the second common feature map (P4). Starting from the generated second common feature map (P4), the remaining second common feature maps (P5, P6, and P7) are generated in the same manner.
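The bottom-up recursion can be sketched in Python as follows. This is a simplified illustration under stated assumptions: maps are single-channel nested lists, and the 3x3x256, stride=2 convolution is replaced by plain stride-2 subsampling so that only the data flow (downsample, then add the next intermediate map) is shown. `downsample2x` and `bottom_up` are hypothetical names.

```python
def downsample2x(fm):
    """Keep every second row and column (stand-in for a stride-2 convolution)."""
    return [row[::2] for row in fm[::2]]


def add2d(a, b):
    """Element-wise sum of two maps of equal size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bottom_up(p3, intermediates):
    """Build [P3, P4, P5, P6, P7] from P3 and the maps [M4, M5, M6, M7]."""
    outputs = [p3]
    for m in intermediates:
        down = downsample2x(outputs[-1])   # e.g. P3 -> P3'
        outputs.append(add2d(down, m))     # e.g. P3' + M4 -> P4
    return outputs


def ones(n):
    """n x n map filled with 1.0 (toy data)."""
    return [[1.0] * n for _ in range(n)]


# Toy pyramid: P3 is 16x16, M4 is 8x8, M5 is 4x4, M6 is 2x2, M7 is 1x1.
pyramid = bottom_up(ones(16), [ones(8), ones(4), ones(2), ones(1)])
sizes = [len(p) for p in pyramid]
```

With these toy inputs every level adds one more contribution, so the value stored in each map counts how many levels have been fused into it.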

In this specification, the plurality of decoding layers refers to the final decoding layers of the common decoder (2). For example, the plurality of decoding layers may correspond to the bottom-up path aggregation layer (2-3).

The plurality of second common feature maps (P3, P4, P5, P6, and P7) comprise a series of multi-scale feature maps output from the bottom-up path aggregation layer (2-3). That is, the bottom-up path aggregation layer (2-3) cumulatively fuses low-resolution and high-resolution features in the bottom-up direction to generate a series of multi-scale feature maps, which makes it possible to represent features that are robust to images containing a mix of various objects and to changes in object size and shape.

FIG. 7 is a configuration diagram of the header and decoder of the object detection model according to an embodiment.

In the object detection model according to the embodiment, the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the bottom-up path aggregation layer (2-3) of the common decoder (2), described with reference to FIG. 6, are input to the header (3), as shown in FIG. 4.

The object detection model according to the embodiment includes an object region header (3-1) and an object classification header (3-2).

The object region header (3-1) is a domain adaptation layer that converts the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the plurality of decoding layers into an input feature map for object region detection.

The object classification header (3-2) is a kind of domain adaptation layer that converts the plurality of second common feature maps (P3, P4, P5, P6, and P7) generated by the plurality of decoding layers into an input feature map for object classification.

As described above, the plurality of decoding layers correspond to the bottom-up path aggregation layer (2-3) of the common decoder (2) described with reference to FIG. 6.

The header (3) for object detection uses a domain adaptation layer to convert each of the second common feature maps (P3, P4, P5, P6, and P7) into an input feature map for the decoder (4) for object detection. Here, the input feature map includes an object region feature map and an object classification feature map.

The header (3) consists of the object region header (3-1) and the object classification header (3-2). The object region header (box_feature) and the object classification header (class_feature) use the same parameters (sharing layer) to generate, from the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the common decoder (2), an object region feature map expressing object region features and an object classification feature map expressing object classification features.

In one example, the object region header (3-1) and the object classification header (3-2) may include a plurality of convolution layers that share weights with each other.

In one example, the object region header (3-1) and the object classification header (3-2) may each perform a 3x3x256 convolution four times on the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the bottom-up path aggregation layer (2-3) of the common decoder (2) to generate the object region feature map and the object classification feature map, respectively.

Using the same parameters, the object region header (3-1) and the object classification header (3-2) extract the multi-scale features of the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the plurality of decoding layers into the object region feature map and the object classification feature map.

The decoder (4) of the object detection model predicts at least one piece of object information using the input feature map generated by the header (3).

In one example, the decoder (4) may use an anchor-free Multi-Scale Fully-Convolutional Neural Network (MS-FCN), which achieves high performance and high efficiency by eliminating redundant and extraneous operations.

The object region feature map is used as the input of the object region decoder (4-1), which predicts object region prediction information (pred_box_posts), object IOU prediction information (pred_reward_iou), and object size prediction information (pred_reward_scale).

The object region decoder (4-1) performs a 3x3x4 convolution on the object region feature map (the four channels x1, y1, x2, and y2 give the object's position as offsets from the current grid position) to generate object boxes (pred_boxes), and multiplies the generated object boxes (pred_boxes) by a scale factor to generate the object region prediction information (pred_box_posts).

In one example, the scale factor is a learned constant that enlarges the predicted object region size values, and it may be learned based on the object size prediction information (pred_reward_scale).

The object region decoder (4-1) generates the object IOU prediction information (pred_reward_iou) and the object size prediction information (pred_reward_scale), respectively, through a 1x1x256 convolution and a 1x1x1 convolution applied to the object region feature map.

The object IOU prediction information (pred_reward_iou) predicts how closely the predicted object region (pred_box_posts) matches the ground-truth object region, which can increase object detection accuracy during training. Applying the object IOU prediction information (pred_reward_iou) to E-NMS, the object detection post-processing, can also improve object region detection performance.

The object classification feature map is used as the input of the object classification decoder (4-2), which predicts object classification prediction information (pred_confidence) and object classification accuracy prediction information (pred_reward_identification).

The object classification decoder (4-2) generates the object classification prediction information (pred_confidence) by applying a 3x3xN convolution (N: background plus the number of object classes) to the object classification feature map.

Meanwhile, the object classification accuracy prediction information (pred_reward_identification) and the object size prediction information (pred_reward_scale) are used in training the object detection model but not in its actual inference.

That is, the object classification accuracy prediction information (pred_reward_identification) and the object size prediction information (pred_reward_scale) are virtual auxiliary tasks that assist training of the object detection model and are not used during actual inference.

Here, a virtual auxiliary task is a technique in which, during training, a loss function is computed from the prediction and reflected in the parameter update so that the object information is learned well, while in execution mode the prediction is not made, so the computational load does not increase.
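The split between training-time and inference-time outputs might be organized as in the following Python sketch. It is a hedged illustration only: the head names reuse the identifiers from the text, but `run_decoder`, `make_head`, and the stand-in head functions are hypothetical and do not perform real convolutions.

```python
INFERENCE_HEADS = ("pred_box_posts", "pred_reward_iou", "pred_confidence")
AUXILIARY_HEADS = ("pred_reward_scale", "pred_reward_identification")


def run_decoder(heads, features, training):
    """Evaluate the auxiliary heads only when training."""
    names = INFERENCE_HEADS + AUXILIARY_HEADS if training else INFERENCE_HEADS
    return {name: heads[name](features) for name in names}


calls = []  # records which heads were actually evaluated


def make_head(name):
    """Stand-in for a prediction head; a real head would be a convolution."""
    def head(features):
        calls.append(name)
        return sum(features)
    return head


heads = {name: make_head(name) for name in INFERENCE_HEADS + AUXILIARY_HEADS}
train_out = run_decoder(heads, [1.0, 2.0], training=True)
infer_out = run_decoder(heads, [1.0, 2.0], training=False)
```

Because the auxiliary heads are never called in inference mode, they add no computation at run time, which matches the stated purpose of the virtual auxiliary tasks.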

The object size prediction information (pred_reward_scale) predicts whether the object's size is being predicted accurately. Training the object IOU prediction information (pred_reward_iou) together with the object size prediction information (pred_reward_scale) strengthens object region prediction performance and improves object detection accuracy.

The object classification accuracy prediction information (pred_reward_identification) predicts whether the object classification information is being predicted accurately, and it can strengthen object classification prediction performance during training.

The processor (110) may determine a final object region through E-NMS-based post-processing of the object information obtained using the decoder (4).

In one example, the processor (110) may select the final object region by inputting the object region prediction information (pred_box_posts), the object IOU prediction information (pred_reward_iou), and the object classification prediction information (pred_confidence) to the E-NMS operation described later with reference to FIG. 8.

FIG. 8 is a schematic illustration of the E-NMS operation according to an embodiment.

The object detection method according to the embodiment proposes an E-NMS operation for post-processing the object information output from the object detection model.

The E-NMS operation performs post-processing to determine the final object region for at least one candidate object predicted to be included in the input image, based on the object classification prediction information (pred_confidence), the object IOU prediction information (pred_reward_iou), and the object region prediction information (pred_box_posts).

The object detection method according to the embodiment executes the E-NMS operation in two steps. That is, referring to FIG. 3, the processor (110) performs the E-NMS operation in step S20 and in step S30.

In step S20, the E-NMS operation is performed for each decoding layer (layer-wise E-NMS). In step S30, an overall E-NMS operation is performed on the results of the layer-wise E-NMS operations of step S20.

The E-NMS operation is described in more detail below with reference to FIGS. 9 and 10.

FIG. 9 is a flowchart of the E-NMS operation according to an embodiment.

The E-NMS operation may include a first operation (S31) of determining an initial value of the confidence score of at least one candidate object, a second operation (S32) of updating the confidence score based on the object region prediction information (pred_box_posts) and the object IOU prediction information (pred_reward_iou) of the at least one candidate object, and a third operation (S33) of determining the final object region of the at least one candidate object based on the updated confidence score.

In the first operation (S31), the processor (110) determines the initial value of the confidence score of the at least one candidate object.

In one example, when the processor (110) performs the E-NMS operation for step S20, it may determine the initial value of the confidence score in the first operation (S31) based on a function of the object IOU prediction information (pred_reward_iou).

In this case, the initial value of the confidence score can be expressed by Equation 1.

Here, Confidence on the right-hand side is the object classification prediction information (pred_confidence) output from the object classification decoder (4-2) of the object detection model.

f(pred_reward_iou) is a weighting factor based on the object IOU prediction information (pred_reward_iou), and various functions may be used for it. For example, f(x) = x^0.8 may be used as f(pred_reward_iou).

Setting the initial value in this way allows E-NMS to be performed on the fused information of the object classification prediction information (pred_confidence) and the object IOU prediction information (pred_reward_iou).
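Equation 1 itself is not reproduced in this text. As a minimal sketch, and assuming that the weighting factor is applied multiplicatively to the classification confidence (an assumption, since only the terms are described), the initial value could be computed in Python as follows. `initial_confidence` is a hypothetical name; the exponent 0.8 is the example given in the text.

```python
def initial_confidence(pred_confidence, pred_reward_iou, exponent=0.8):
    """Fuse classification confidence with predicted IOU (assumed product form)."""
    return [c * (r ** exponent) for c, r in zip(pred_confidence, pred_reward_iou)]


# A confidently classified but poorly localized box versus a slightly
# less confident but well localized box (made-up values).
scores = initial_confidence([0.9, 0.8], [0.3, 0.9])
```

Under this assumed form, the second box ends up with the higher initial score even though its classification confidence is lower, which illustrates how the fused score lets localization quality influence the ranking used by E-NMS.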

In one example, when the processor (110) performs the E-NMS operation for step S30, it may determine the initial value of the confidence score in the first operation (S31) based on the result of the E-NMS operation for each decoding layer in step S20.

For example, the processor (110) may take the confidence score updated during the second operation (S32) of the E-NMS operation for each decoding layer in step S20 as the initial value of the confidence score in the first operation (S31) of the E-NMS operation in step S30.

In the second operation (S32), the processor (110) updates the confidence score based on the object region prediction information (pred_box_posts) and the object IOU prediction information (pred_reward_iou) of the at least one candidate object. The second operation (S32) is described with reference to FIG. 10.

In the third operation (S33), the processor (110) determines the final object region of the at least one candidate object based on the updated confidence score.

For example, the processor (110) may determine, among the regions corresponding to the object region prediction information (pred_box_posts), the region with the highest updated confidence score as the final object region of the corresponding object.

As a result, by executing this E-NMS operation the processor (110) can determine the final object region for at least one candidate object predicted to be included in the input image.

FIG. 10 is a flowchart of the second operation of the E-NMS operation according to an embodiment.

The processor (110) executes the following operations to carry out the second operation (S32) of the E-NMS operation. That is, the second operation (S32) includes an operation (S32-1) of sorting the at least one candidate object based on the initial value determined in the first operation (S31), an operation (S32-2) of determining the IOU between two different pieces of object region prediction information for the at least one candidate object, an operation (S32-3) of re-determining the IOU based on a predefined weight function, and an operation (S32-4) of determining the confidence score of the at least one candidate object based on the re-determined IOU.

In operation S32-1, the processor (110) sorts the object region prediction information (pred_box_posts) for the at least one candidate object based on the confidence score determined in the first operation (S31). For example, the object region prediction information (pred_box_posts) may be sorted in descending order of the candidate objects' confidence scores.

In one example, the processor (110) can perform the operation of determining the IOU (S32-2), the operation of re-determining the IOU (S32-3), and the operation of determining the confidence score (S32-4) in parallel using an IOU matrix.

The IOU matrix method generates a lower triangular IOU matrix between the predicted object regions and computes on it. It reduces the ambiguity of overlap between different objects and improves parallel computation efficiency.

The IOU matrix is a square matrix with as many rows and columns as there are pieces of object region prediction information (pred_box_posts). The E-NMS operation is described below in terms of the IOU matrix.

In operation S32-2, the processor (110) determines the IOU between two different pieces of object region prediction information (pred_box_posts) for the at least one candidate object.

1) In operation S32-2, the processor (110) determines the IOU between Box_i, the i-th object region prediction information, and Box_j, the j-th object region prediction information, and stores it in b_ij, the matrix element in the i-th row and j-th column of the IOU matrix. Here, the IOU value of b_ii, where the row and column numbers match, is 1.

2) The values of b_ij (i<j) located in the upper triangle relative to b_ii are set to 0. The IOU values of b_ii, where the row and column numbers match, are also set to 0.

3) To eliminate performance degradation caused by overlap between different candidate objects, for each IOU matrix, a matrix element is set to 0 when the objects on its vertical and horizontal axes are not the same object.
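Steps 1) to 3), together with the sort of operation S32-1, can be sketched in plain Python. This is an illustrative sketch, not the parallel implementation: boxes are assumed to be (x1, y1, x2, y2) corner tuples, and "the same object" is interpreted as "the same predicted class label", which is an assumption. All function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sort_by_confidence(boxes, labels, conf):
    """Operation S32-1: reorder candidates by descending confidence."""
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    return ([boxes[i] for i in order],
            [labels[i] for i in order],
            [conf[i] for i in order])


def build_iou_matrix(boxes, labels):
    """Lower triangular IOU matrix with zero diagonal and zero upper triangle."""
    n = len(boxes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):                 # j < i: strictly lower triangle
            if labels[i] == labels[j]:     # step 3: different objects stay 0
                matrix[i][j] = box_iou(boxes[i], boxes[j])
    return matrix


boxes = [(0, 0, 10, 5), (0, 0, 10, 10), (20, 20, 30, 30)]
labels = [0, 0, 1]
conf = [0.6, 0.9, 0.8]
boxes, labels, conf = sort_by_confidence(boxes, labels, conf)
matrix = build_iou_matrix(boxes, labels)
```

After sorting, row i only holds overlaps with higher-scoring boxes, so each row can later be processed independently of the others.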

In operation S32-3, the processor (110) re-determines the IOU based on a predefined weight function.

Equation 2 is an exemplary predefined weight function.

Here, n is a linear weight function, the IOU on the right-hand side is the IOU currently stored in the IOU matrix, and the IOU on the left-hand side is the IOU re-determined according to Equation 2.

In operation S32-4, the processor (110) determines the confidence score based on the re-determined IOU.

The processor (110) determines the confidence score for the object region corresponding to the object region prediction information (pred_box_posts) of the at least one candidate object based on Equation 3.

Here, conf_n denotes the confidence score of the n-th object region (Box_n).
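Equations 2 and 3 are not reproduced in this text, so the following Python sketch is only one plausible reading of operations S32-3 and S32-4, not the embodiment's definitive formulas. It assumes that the linear weight function is n(x) = 1 - x and that each confidence score is multiplied by the product of the re-determined values in its own row of the IOU matrix. `reweight` and `update_confidence` are hypothetical names, and the numbers are made up.

```python
def reweight(iou_matrix):
    """Assumed linear weight n(x) = 1 - x applied to every stored IOU."""
    return [[1.0 - v for v in row] for row in iou_matrix]


def update_confidence(conf, weights):
    """Scale each confidence by the product of its row of weights (assumed form)."""
    updated = []
    for c, row in zip(conf, weights):
        product = 1.0
        for w in row:
            product *= w
        updated.append(c * product)
    return updated


# Box 1 overlaps the higher-scoring box 0 with IOU 0.5; box 2 overlaps nothing.
iou_matrix = [[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.0, 0.0]]
conf = [0.9, 0.8, 0.7]
updated = update_confidence(conf, reweight(iou_matrix))

# Third operation (S33): pick the region with the highest updated score.
best = max(range(len(updated)), key=lambda i: updated[i])
```

Under these assumptions, matrix entries that were zeroed out map to a weight of 1 and leave the score unchanged, while a box that overlaps a higher-scoring box of the same object is decayed rather than discarded outright.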

FIG. 11 is an exemplary diagram of an IOU matrix according to an embodiment.

The left-hand diagram shows an example IOU matrix storing the IOUs re-determined in operation S32-3.

The right-hand diagram shows b_i computed by multiplying the IOU values in the i-th row, in order to determine the confidence score for each object region (b_i) according to Equation 3.

FIG. 12 is a diagram showing an example of object detection results of the object detection model according to an embodiment.

Objects of various shapes and sizes are detected well, and the object regions are identified accurately even when objects are very close together or occluded.

The training process of the object detection model according to the embodiment is described below.

The loss function for training the object detection model according to the embodiment considers an object classification error (class), an object region estimation error (box), an IOU estimation error (iou), an object size accuracy prediction error (scale), and an object classification accuracy prediction error (id). The final loss for training is obtained as a weighted sum of these five errors.

(1) The class error is computed using focal loss, as in Equation 4.

(2) The object region prediction error is computed using GIOU loss, as in Equation 5.

(3) The IOU estimation error (iou) uses Equation 6.

(4) The object size accuracy prediction error (scale) can be obtained using Equation 7.

(5) The object classification accuracy prediction error (id) can be obtained using Equation 8.

Here, u is the unit step function.

(6) Finally, the final error function required for model training is computed as the weighted sum of the error values obtained in (1) to (5), as in Equation 9.
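Equations 4 to 9 are not reproduced in this text. The Python sketch below therefore uses the commonly published definitions of focal loss and GIOU loss, which may differ in detail from the embodiment's equations, and combines named error terms by a weighted sum. The weights and the iou, scale, and id values are placeholder numbers, not results, and all function names are hypothetical.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Standard binary focal loss for a predicted probability p (0 < p < 1)."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def giou_loss(pred, gt):
    """Standard GIOU loss, 1 - GIOU, for (x1, y1, x2, y2) boxes of positive area."""
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_p + area_g - inter
    # Smallest axis-aligned box enclosing both boxes.
    enclose = ((max(pred[2], gt[2]) - min(pred[0], gt[0]))
               * (max(pred[3], gt[3]) - min(pred[1], gt[1])))
    giou = inter / union - (enclose - union) / enclose
    return 1.0 - giou


def total_loss(errors, weights):
    """Weighted sum of the named error terms."""
    return sum(weights[name] * value for name, value in errors.items())


errors = {
    "class": focal_loss(0.8, 1),
    "box": giou_loss((0, 0, 2, 2), (1, 0, 3, 2)),
    "iou": 0.1,    # placeholder value
    "scale": 0.2,  # placeholder value
    "id": 0.3,     # placeholder value
}
weights = {name: 1.0 for name in errors}  # hypothetical equal weights
loss = total_loss(errors, weights)
```

Keeping the terms in a dictionary makes it easy to change which errors enter the sum or how they are weighted without touching the individual loss functions.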

Hereinafter, artificial intelligence technology related to embodiments of the present invention will be described.

Artificial intelligence (AI) is a field of computer science and information technology that studies how to enable computers to perform the thinking, learning, and self-improvement that human intelligence is capable of; it refers to enabling computers to imitate intelligent human behavior.

Artificial intelligence does not exist in isolation; it is directly and indirectly related to many other fields of computer science. In particular, attempts to introduce artificial intelligence elements into various fields of information technology and to use them to solve problems in those fields are being made very actively.

Machine learning is a branch of artificial intelligence, a field of study that gives computers the ability to learn without being explicitly programmed.

Specifically, machine learning is a technology for studying and building systems that learn from empirical data, make predictions, and improve their own performance, together with the algorithms for such systems. Rather than executing strictly defined, static program instructions, machine learning algorithms build a specific model in order to derive predictions or decisions from input data.

In the original Korean text, the term '머신 러닝' (the transliteration of 'machine learning') may be used interchangeably with the term '기계 학습' (the native Korean rendering of the same term).

Many machine learning algorithms have been developed to address how data should be classified. Representative examples include decision trees, Bayesian networks, support vector machines (SVMs), and artificial neural networks (ANNs).

A decision tree is an analysis method that performs classification and prediction by charting decision rules in a tree structure.

A Bayesian network is a model that represents probabilistic relationships (conditional independence) among multiple variables in a graph structure. Bayesian networks are suitable for data mining through unsupervised learning.

A support vector machine is a supervised learning model for pattern recognition and data analysis, used mainly for classification and regression analysis.

An artificial neural network models the operating principle of biological neurons and the connections between them; it is an information processing system in which a number of neurons, called nodes or processing elements, are connected in a layered structure.

An artificial neural network is a model used in machine learning: a statistical learning algorithm, studied in machine learning and cognitive science, that is inspired by biological neural networks (the central nervous system of animals, particularly the brain).

Specifically, an artificial neural network may refer to any model in which artificial neurons (nodes), networked through synaptic connections, acquire problem-solving ability by changing the strength of those connections through learning.

The term 'artificial neural network' may be used interchangeably with the term 'neural network'.

An artificial neural network may include a plurality of layers, and each layer may include a plurality of neurons. An artificial neural network may also include synapses that connect neurons to one another.

An artificial neural network may generally be defined by three factors: (1) the connection pattern between neurons in different layers, (2) the learning process that updates the connection weights, and (3) the activation function that generates an output value from the weighted sum of the inputs received from the previous layer.

Artificial neural networks may include, but are not limited to, network models such as a Deep Neural Network (DNN), a Recurrent Neural Network (RNN), a Bidirectional Recurrent Deep Neural Network (BRDNN), a Multilayer Perceptron (MLP), and a Convolutional Neural Network (CNN).

In the original Korean text of this specification, the term '레이어' (the transliteration of 'layer') may be used interchangeably with the term '계층' (the native Korean rendering of the same term).

Artificial neural networks are classified into single-layer neural networks and multi-layer neural networks according to the number of layers.

A typical single-layer neural network consists of an input layer and an output layer.

A typical multi-layer neural network consists of an input layer, one or more hidden layers, and an output layer.

The input layer receives external data, and the number of neurons in the input layer equals the number of input variables. The hidden layer is located between the input layer and the output layer; it receives signals from the input layer, extracts features, and passes them to the output layer. The output layer receives signals from the hidden layer and outputs an output value based on the received signals. The input signals between neurons are each multiplied by their connection strength (weight) and then summed; if this sum exceeds the neuron's threshold, the neuron is activated and outputs the value obtained through the activation function.
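By way of a non-limiting illustration, the weighted sum and activation described above can be sketched as a small forward pass. The sigmoid activation, the bias term (which plays the role of a negated threshold), and the function names are assumptions chosen for illustration and are not taken from the embodiment.

```python
import math


def sigmoid(z):
    """Smooth activation function; an assumed choice for this sketch."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases, activation=sigmoid):
    """Compute the outputs of one layer.

    weights[j][i] is the connection strength from input i to neuron j.
    Each neuron sums its weighted inputs, adds its bias, and applies the activation.
    """
    outputs = []
    for weight_row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(activation(z))
    return outputs


def mlp_forward(x, layers):
    """Pass x through a list of (weights, biases) pairs, one pair per layer."""
    for weights, biases in layers:
        x = layer_forward(x, weights, biases)
    return x
```

A network with two inputs, three hidden neurons, and one output neuron would be described by two `(weights, biases)` pairs: a 3x2 weight matrix followed by a 1x3 weight matrix.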

Meanwhile, a deep neural network, which includes a plurality of hidden layers between the input layer and the output layer, may be a representative artificial neural network that implements deep learning, a type of machine learning technology.

In the original Korean text, the term '딥 러닝' (the transliteration of 'deep learning') may be used interchangeably with the term '심층 학습' (the native Korean rendering of the same term).

An artificial neural network may be trained using training data. Here, training may refer to the process of determining the parameters of the artificial neural network using training data in order to achieve a purpose such as classifying, regressing, or clustering input data. Representative examples of the parameters of an artificial neural network include the weights assigned to synapses and the biases applied to neurons.

An artificial neural network trained on training data can classify or cluster input data according to the patterns in the input data.

Meanwhile, an artificial neural network trained using training data may be referred to in this specification as a trained model.

Next, the learning methods of artificial neural networks are described.

The learning methods of artificial neural networks can be broadly classified into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

Supervised learning is a machine learning method for inferring a function from training data.

Among the functions inferred in this way, one that outputs continuous values is called regression, and one that predicts and outputs the class of an input vector is called classification.

In supervised learning, an artificial neural network is trained with labels given for the training data.

Here, a label may mean the correct answer (or result value) that the artificial neural network must infer when the training data is input to it.

In this specification, the correct answer (or result value) that the artificial neural network must infer when training data is input is referred to as a label or labeling data.

Also, in this specification, assigning labels to training data for the training of an artificial neural network is referred to as labeling the training data with labeling data.

In this case, the training data and the labels corresponding to the training data constitute one training set, and may be input to the artificial neural network in the form of a training set.

Meanwhile, the training data represents a plurality of features, and labeling the training data may mean that labels are attached to the features represented by the training data. In this case, the training data may represent the features of an input object in vector form.

Using the training data and the labeling data, the artificial neural network can infer a function describing the relationship between them. The parameters of the artificial neural network may then be determined (optimized) by evaluating the inferred function.

Unsupervised learning is a type of machine learning in which no labels are given for the training data.

Specifically, unsupervised learning may be a learning method that trains an artificial neural network to find and classify patterns in the training data itself, rather than the relationship between the training data and corresponding labels.

Examples of unsupervised learning include clustering and independent component analysis.

In the original Korean text of this specification, the term '군집화' (the native Korean rendering of 'clustering') may be used interchangeably with the term '클러스터링' (the transliteration of the same term).

Examples of artificial neural networks that use unsupervised learning include the generative adversarial network (GAN) and the autoencoder (AE).

A generative adversarial network is a machine learning method in which two different artificial intelligences, a generator and a discriminator, compete with each other to improve performance.

In this case, the generator is a model that creates new data, and it can generate new data based on original data.

The discriminator is a model that recognizes patterns in data, and it can determine whether input data is original data or new data generated by the generator.

The generator learns from data that failed to deceive the discriminator, and the discriminator learns from generated data that succeeded in deceiving it. Accordingly, the generator can evolve to deceive the discriminator as well as possible, and the discriminator can evolve to distinguish well between original data and data generated by the generator.

An autoencoder is a neural network that aims to reproduce its input as its output.

An autoencoder includes an input layer, at least one hidden layer, and an output layer.

In this case, since the number of nodes in the hidden layer is smaller than the number of nodes in the input layer, the dimensionality of the data is reduced, and compression or encoding is performed accordingly.

The data output from the hidden layer then enters the output layer. Since the number of nodes in the output layer is greater than the number of nodes in the hidden layer, the dimensionality of the data increases, and decompression or decoding is performed accordingly.

Meanwhile, the autoencoder adjusts the connection strengths of its neurons through training so that the input data is represented as hidden layer data. The hidden layer represents information with fewer neurons than the input layer, so being able to reproduce the input data as the output may mean that the hidden layer has discovered and represented a hidden pattern in the input data.

Semi-supervised learning is a type of machine learning and may refer to a learning method that uses both labeled and unlabeled training data.

One semi-supervised learning technique infers labels for the unlabeled training data and then performs training using the inferred labels. This technique is useful when the cost of labeling is high.

Reinforcement learning is based on the theory that, given an environment in which an agent can judge what action to take at each moment, the agent can find the best path through experience, without data.

Reinforcement learning may be performed mainly by means of a Markov decision process (MDP).

The Markov decision process can be described as follows. First, an environment containing the information the agent needs to take its next action is given. Second, how the agent acts in that environment is defined. Third, it is defined what the agent is rewarded for doing well and penalized for failing to do. Fourth, the experience is repeated until the future reward reaches its maximum, and the optimal policy is thereby derived.
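By way of a non-limiting illustration, the four elements above (environment, actions, reward and penalty, and an optimal policy) can be sketched on a toy corridor of five cells. Note that the sketch uses value iteration, a planning method that assumes the transition model is known, rather than learning from repeated experience as described in the text. The corridor layout, the reward values, and the discount factor `gamma` are hypothetical.

```python
def value_iteration(n_states=5, gamma=0.9, step_penalty=-0.01, goal_reward=1.0, tol=1e-9):
    """Derive state values and a greedy policy for a 1-D corridor MDP.

    States are cells 0..n_states-1; the last cell is the terminal goal.
    Actions are -1 (move left) and +1 (move right). All numbers are assumptions.
    """
    goal = n_states - 1
    actions = (-1, +1)

    def step(state, action):
        # Environment: clamp the move to the corridor and return (next_state, reward).
        nxt = min(max(state + action, 0), goal)
        reward = goal_reward if nxt == goal else step_penalty
        return nxt, reward

    values = [0.0] * n_states  # the terminal goal keeps value 0
    while True:
        delta = 0.0
        for s in range(goal):
            best = max(r + gamma * values[n] for n, r in (step(s, a) for a in actions))
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break

    # Policy: in each non-terminal state, pick the action with the highest return.
    policy = []
    for s in range(goal):
        returns = {a: step(s, a)[1] + gamma * values[step(s, a)[0]] for a in actions}
        policy.append(max(returns, key=returns.get))
    return values, policy
```

With the default settings the derived policy moves right in every non-terminal cell, since that is the only way to reach the rewarded goal.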

The structure of an artificial neural network is specified by its model configuration, activation function, loss function or cost function, learning algorithm, optimization algorithm, and so on. Hyperparameters are set in advance before training, and model parameters are then set through training, which specifies the content of the network.

For example, the factors that determine the structure of an artificial neural network may include the number of hidden layers, the number of hidden nodes in each hidden layer, the input feature vector, and the target feature vector.

Hyperparameters include the various parameters that must be set initially for training, such as the initial values of the model parameters. Model parameters include the various parameters to be determined through training.

For example, hyperparameters may include the initial values of the weights between nodes, the initial values of the biases between nodes, the mini-batch size, the number of training iterations, and the learning rate. Model parameters may include the weights between nodes and the biases between nodes.

The loss function may be used as an index (criterion) for determining the optimal model parameters during the training of an artificial neural network. In an artificial neural network, training means the process of adjusting the model parameters to reduce the loss function, and the purpose of training can be regarded as determining the model parameters that minimize the loss function.

The loss function may typically be the mean squared error (MSE) or the cross entropy error (CEE), but the present invention is not limited thereto.

The cross entropy error may be used when the correct-answer label is one-hot encoded. One-hot encoding is an encoding method in which the label value is set to 1 only for the neuron corresponding to the correct answer and to 0 for all other neurons.
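By way of a non-limiting illustration, the two loss functions and one-hot encoding mentioned above can be sketched as follows. The function names and the small constant `eps`, which guards against taking the logarithm of zero, are assumptions made for illustration.

```python
import math


def one_hot(index, num_classes):
    """Label vector with 1.0 at the correct class and 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def mean_squared_error(pred, target):
    """Average of the squared differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy_error(probs, target, eps=1e-12):
    """Cross entropy between predicted class probabilities and a one-hot target.

    With a one-hot target, only the term for the correct class contributes.
    """
    return -sum(t * math.log(p + eps) for p, t in zip(probs, target))
```

For a one-hot target, the cross entropy error reduces to the negative logarithm of the probability assigned to the correct class, so it grows sharply as that probability approaches zero.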

In machine learning or deep learning, a learning optimization algorithm can be used to minimize the loss function. Learning optimization algorithms include gradient descent (GD), stochastic gradient descent (SGD), momentum, Nesterov accelerated gradient (NAG), Adagrad, AdaDelta, RMSProp, Adam, and Nadam.

Gradient descent is a technique that adjusts the model parameters in the direction that reduces the value of the loss function, taking into account the gradient of the loss function at the current state.

The direction in which the model parameters are adjusted is called the step direction, and the amount by which they are adjusted is called the step size.

In this case, the step size may mean the learning rate.

Gradient descent obtains the gradient by taking the partial derivative of the loss function with respect to each model parameter, and updates the model parameters by moving them against the obtained gradient, that is, in the direction that reduces the loss, by an amount set by the learning rate.
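By way of a non-limiting illustration, the update rule described above can be sketched on a toy two-parameter loss. The loss function, its hand-derived partial derivatives, and the function names are assumptions chosen so that the minimum is known in advance.

```python
def gradient_descent(grad_fn, params, learning_rate=0.1, steps=100):
    """Repeatedly move each parameter against its partial derivative.

    grad_fn returns one partial derivative per parameter; learning_rate is the step size.
    """
    params = list(params)
    for _ in range(steps):
        grads = grad_fn(params)
        params = [p - learning_rate * g for p, g in zip(params, grads)]
    return params


def example_grad(params):
    """Partial derivatives of the toy loss (w0 - 3)^2 + (w1 + 1)^2."""
    w0, w1 = params
    return [2.0 * (w0 - 3.0), 2.0 * (w1 + 1.0)]
```

Calling `gradient_descent(example_grad, [0.0, 0.0])` moves the parameters toward the minimum at (3, -1). With this loss, each step shrinks the remaining distance by a constant factor determined by the learning rate.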

Stochastic gradient descent is a technique that divides the training data into mini-batches and performs gradient descent on each mini-batch, thereby increasing the frequency of gradient descent updates.
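By way of a non-limiting illustration, mini-batch updates can be sketched by fitting a line y = w*x + b with a squared-error loss. The model, the data, the fixed random seed, and the default hyperparameter values are assumptions made for illustration.

```python
import random


def sgd_linear_fit(xs, ys, learning_rate=0.05, batch_size=2, epochs=500, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent.

    Each epoch shuffles the sample order, splits it into mini-batches, and applies
    one gradient descent update per mini-batch.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradients of the mean squared error over the current mini-batch only.
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b
```

With four samples and `batch_size=2`, the parameters are updated twice per pass over the data instead of once, which is the increase in update frequency referred to above.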

Adagrad, AdaDelta, and RMSProp are techniques that improve optimization accuracy in SGD by adjusting the step size. Momentum and NAG are techniques that improve optimization accuracy in SGD by adjusting the step direction. Adam combines momentum and RMSProp to adjust both the step size and the step direction. Nadam combines NAG and RMSProp to adjust both the step size and the step direction.
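By way of a non-limiting illustration, the way Adam combines a momentum-style direction estimate with an RMSProp-style step-size scaling can be sketched as follows. The sketch follows the commonly published form of the Adam update with its usual default coefficients; the function name and the toy gradient used to exercise it are assumptions.

```python
import math


def adam_minimize(grad_fn, params, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a function given its gradient, using the Adam update rule."""
    params = list(params)
    m = [0.0] * len(params)  # running mean of gradients (momentum part: step direction)
    v = [0.0] * len(params)  # running mean of squared gradients (RMSProp part: step size)
    for t in range(1, steps + 1):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            # Bias correction compensates for m and v starting at zero.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            params[i] -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

One consequence of the scaling is that the very first update has a magnitude close to `learning_rate` regardless of how large the gradient is, because the bias-corrected ratio reduces to the sign of the gradient.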

The training speed and accuracy of an artificial neural network depend heavily not only on the structure of the network and the type of learning optimization algorithm, but also on the hyperparameters. Therefore, to obtain a good trained model, it is important not only to choose an appropriate network structure and learning algorithm, but also to set appropriate hyperparameters.

Typically, hyperparameters are set experimentally: the artificial neural network is trained with various candidate values, and the values that yield stable training speed and accuracy are selected.
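By way of a non-limiting illustration, the experimental selection described above can be sketched as a grid search over candidate learning rates. The toy loss (w - 3)^2, the candidate values, and the function names are assumptions made for illustration; a practical search would train the actual model and evaluate it on held-out data.

```python
def final_loss(learning_rate, steps=50):
    """Run gradient descent on the toy loss (w - 3)^2 and return the final loss."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * 2.0 * (w - 3.0)
    return (w - 3.0) ** 2


def grid_search(candidates, steps=50):
    """Try each candidate learning rate and return the best one with all results."""
    results = {lr: final_loss(lr, steps) for lr in candidates}
    best = min(results, key=results.get)
    return best, results
```

Running `grid_search([0.001, 0.01, 0.1, 1.1])` illustrates both failure modes: the smallest values converge too slowly within the step budget, and the largest value overshoots on every step and diverges.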

The embodiments according to the present invention described above may be implemented in the form of a computer program that can be executed on a computer through various components, and such a computer program may be recorded on a computer-readable medium. The medium may include hard disks, solid state disks (SSDs), silicon disk drives (SDDs), magnetic media such as floppy disks and magnetic tape, optical recording media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer programs include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The above description of the embodiments of the present invention is for illustrative purposes, and those of ordinary skill in the art to which the present invention pertains will understand that the invention can easily be modified into other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is indicated by the claims set forth below rather than by the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100 object detection device
110 processor
120 memory

Claims

obtaining at least one piece of object information about at least one candidate object from an input image for each decoding layer of a plurality of decoding layers by using an object detection model including the plurality of decoding layers;
performing an Efficient Non-Maximum Suppression (E-NMS) operation for each decoding layer based on the at least one piece of object information obtained from each decoding layer; and
re-performing the E-NMS operation based on object information obtained by combining the results of the E-NMS operation for each decoding layer,
including,
The object information,
object region prediction information, object intersection over union (IOU) prediction information, and object classification prediction information
including,
The E-NMS operation,
a first operation for determining an initial value of a confidence score for object region prediction information of the at least one candidate object;
a second operation of updating the confidence score based on object region prediction information and object IOU prediction information of the at least one candidate object; and
a third operation for determining a final object region of the at least one candidate object based on the updated confidence score;
including,
object detection method.

According to claim 1,
The object detection model,
a common encoder including a series of encoding layers for extracting a plurality of first common feature maps from the input image; and
a common decoder including multiple decoding layers that generate a plurality of second common feature maps by multi-scale feature fusion of the plurality of first common feature maps
including,
object detection method.

According to claim 2,
The multiple decoding layers,
a top-down layer that upsamples and adds some of the plurality of first common feature maps and outputs the result;
an intermediate layer that performs a convolution operation on the output of the top-down layer and on the remainder of the plurality of first common feature maps; and
a bottom-up path aggregation layer configured to output the plurality of second common feature maps by convolving and summing outputs of the intermediate layer
including,
The plurality of decoding layers correspond to the bottom-up path aggregation layer,
object detection method.

According to claim 1,
The object detection model,
an object region header for converting the plurality of second common feature maps output from the plurality of decoding layers into an object region feature map for object region detection; and
an object classification header for converting the plurality of second common feature maps into an object classification feature map for object classification
including,
object detection method.

According to claim 4,
The object region header and the object classification header include a plurality of convolutional layers that share weights with each other,
object detection method.

According to claim 1,
In the step of performing the E-NMS operation for each decoding layer, the first operation,
determines the initial value based on a function of the object IOU prediction information of the candidate object,
object detection method.

According to claim 1,
The second operation,
an operation of sorting the object region prediction information of the at least one candidate object based on the initial value;
an operation of determining an IOU between two different pieces of object region prediction information for the at least one candidate object;
an operation of re-determining the IOU based on a predefined weight function; and
an operation of determining the confidence score based on the re-determined IOU
including,
object detection method.

According to claim 7,
The operation of determining the IOU, the operation of re-determining the IOU, and the operation of determining the confidence score can be performed in parallel using an IOU matrix,
object detection method.

According to claim 4,
The object detection model
outputs object size prediction information (pred_reward_scale) based on the object region feature map, and
outputs object classification accuracy prediction information (pred_reward_identification) based on the object classification feature map,
object detection method.

According to claim 9,
The object size prediction information and the object classification accuracy prediction information are information used in the learning process of the object detection model and not used in the inference process.
object detection method.

a memory for storing an object detection model including a plurality of decoding layers; and
one or more processors
including,
the processor is configured to
obtain at least one piece of object information about at least one candidate object from an input image for each decoding layer of the plurality of decoding layers by using the object detection model,
perform an Efficient Non-Maximum Suppression (E-NMS) operation for each decoding layer based on the at least one piece of object information obtained for each decoding layer, and
re-perform the E-NMS operation based on object information obtained by combining the results of the E-NMS operation for each decoding layer;
The object information,
object region prediction information, object IOU prediction information, and object classification prediction information
including,
The E-NMS operation,
a first operation for determining an initial value of a confidence score for object region prediction information of the at least one candidate object;
a second operation of updating the confidence score based on object region prediction information and object IOU prediction information of the at least one candidate object; and
a third operation for determining a final object region of the at least one candidate object based on the updated confidence score;
including,
object detection device.

According to claim 11,
The object detection model,
a common encoder including a series of encoding layers for extracting a plurality of first common feature maps from the input image; and
a common decoder including multiple decoding layers that generate a plurality of second common feature maps by multi-scale feature fusion of the plurality of first common feature maps
including,
object detection device.

According to claim 12,
The multiple decoding layers,
a top-down layer that upsamples and sums a portion of the plurality of first common feature maps and outputs the result;
an intermediate layer that performs a convolution operation on the outputs of the top-down layer and on the first common feature maps other than said portion; and
a bottom-up path aggregation layer configured to output the plurality of second common feature maps by convolving and summing the outputs of the intermediate layer
including,
The plurality of decoding layers correspond to the bottom-up path aggregation layer,
object detection device.

According to claim 11,
The object detection model,
an object region header for converting a plurality of second common feature maps output from the plurality of decoding layers into an object region feature map for object region detection; and
an object classification header for converting the plurality of second common feature maps into an object classification feature map for object classification
including,
object detection device.

According to claim 14,
The object region header and the object classification header include a plurality of convolutional layers that share weights with each other,
object detection device.

According to claim 11,
the processor is configured to,
in order to perform the E-NMS operation for each decoding layer,
determine the initial value in the first operation based on a function of the object IOU prediction information of the candidate object,
object detection device.

According to claim 11,
The second operation,
an operation of sorting the object region prediction information of the at least one candidate object based on the initial value;
an operation of determining an IOU between two different pieces of object region prediction information for the at least one candidate object;
an operation of re-determining the IOU based on a predefined weight function; and
an operation of determining the confidence score based on the re-determined IOU
including,
object detection device.
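
The four sub-operations of the second operation can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claim only says that the weight function is predefined, so the Gaussian form `exp(-iou**2 / sigma)`, the `min` aggregation over higher-ranked candidates, and the names `pairwise_iou` and `update_scores` are hypothetical.

```python
import math

def pairwise_iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def update_scores(boxes, init_scores, sigma=0.5):
    """Return updated confidence scores in the original candidate order."""
    n = len(boxes)
    # 1) Sort candidate indices by initial confidence value, highest first.
    order = sorted(range(n), key=lambda i: init_scores[i], reverse=True)
    # 2) IOU matrix in sorted order; only entries with row < col are used,
    #    i.e. the row candidate outranks the column candidate.
    iou_m = [[pairwise_iou(boxes[order[r]], boxes[order[c]]) if r < c else 0.0
              for c in range(n)] for r in range(n)]
    # 3) Re-determine each IOU with the assumed Gaussian weight function.
    w_m = [[math.exp(-(v * v) / sigma) for v in row] for row in iou_m]
    # 4) Confidence = initial value times the strongest decay from above.
    updated = [0.0] * n
    for c in range(n):
        decay = min((w_m[r][c] for r in range(c)), default=1.0)
        updated[order[c]] = init_scores[order[c]] * decay
    return updated
```

Because steps 2 to 4 operate on a whole matrix rather than on one suppressed box at a time, each of them could in principle be evaluated for all candidate pairs at once.
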

According to claim 17,
the processor is configured to,
perform, in parallel using an IOU matrix, the operation of determining the IOU, the operation of re-determining the IOU, and the operation of determining the confidence score,
object detection device.

According to claim 14,
The processor is configured to, by executing the object detection model,
extract object size prediction information (pred_reward_scale) based on the object region feature map; and
extract object classification accuracy prediction information (pred_reward_identification) based on the object classification feature map,
object detection device.

According to claim 19,
the processor is configured to,
use the object size prediction information and the object classification accuracy prediction information in a training process of the object detection model and not use them in an inference process,
object detection device.