KR20220069336A

KR20220069336A - Method and apparatus for object detection

Info

Publication number: KR20220069336A
Application number: KR1020200156354A
Authority: KR
Inventors: 원웅재; 순 권; 박재형; 김태훈
Original assignee: 재단법인대구경북과학기술원
Priority date: 2020-11-20
Filing date: 2020-11-20
Publication date: 2022-05-27
Also published as: KR102512151B1

Abstract

An object detection method and apparatus using a high-performance, low-computation object detection model are provided. The model enables object detection that is robust to variations in object shape and size. In addition, an efficient post-processing step improves the accuracy of object region determination. The object detection method may include the steps of: obtaining at least one piece of object information; performing an Efficient Non-Maximum Suppression (E-NMS) operation; and re-performing the E-NMS operation.

Description

Object detection method and apparatus

The present invention relates to an object detection method and apparatus, and more particularly, to a method and apparatus for detecting an object from an input image using an object detection model.

The following description is provided only as background information related to embodiments of the present invention, and does not necessarily constitute prior art.

Deep learning-based object detection models are broadly divided into two-stage detection models and one-stage detection models.

A two-stage detection model consists of a module that generates regions of interest (RoIs) and a module that estimates the object class and location of each RoI through RoI sampling. A one-stage detection model consists of a single module that estimates object class and location in one pass, without an RoI sampling step.

Two-stage detection models show better results in terms of detection performance, whereas one-stage detection models show better results in terms of speed.

Meanwhile, conventional object detection models detect objects of various sizes by defining detection templates in advance, namely aspect ratios and base object sizes (anchors). Recent one-stage models, through research on multi-scale feature fusion, achieve good detection performance without anchors, but the multi-scale fusion requires many learnable parameters and operations.

Developing a high-performance, low-computation deep learning model for object detection requires a one-stage model structure that does not need anchors. It also requires a feature decoding method that can represent features robust to various changes in size and shape, together with a training method based on it. Furthermore, an efficient NMS post-processing method is needed to remove overlap among the predicted object regions.

One object of the present invention is to provide an object detection method and apparatus based on an object detection model that does not require anchors.

One object of the present invention is to provide an object detection model that is robust to changes in the size and shape of objects.

One object of the present invention is to provide an efficient NMS post-processing method capable of removing overlap among predicted object regions.

The objects of the present invention are not limited to those mentioned above. Other objects and advantages of the present invention that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present invention. It will also be readily apparent that the objects and advantages of the present invention can be realized by the means recited in the claims and combinations thereof.

An object detection method according to an embodiment of the present invention may include: obtaining, using an object detection model that includes a plurality of decoding layers, at least one piece of object information on at least one candidate object from an input image for each of the decoding layers; performing an Efficient Non-Maximum Suppression (E-NMS) operation for each decoding layer based on the at least one piece of object information obtained at that decoding layer; and re-performing the E-NMS operation based on the results of the E-NMS operations for the respective decoding layers.

For example, the object information may include object region prediction information, object Intersection over Union (IOU) prediction information, and object classification prediction information.

For example, the E-NMS operation may include a first operation of determining an initial value of a confidence score for the object region prediction information of at least one candidate object, a second operation of updating the confidence score based on the object region prediction information and the object IOU prediction information of the at least one candidate object, and a third operation of determining a final object region of the at least one candidate object based on the updated confidence score.

An object detection apparatus according to an embodiment of the present invention includes a memory that stores an object detection model including a plurality of decoding layers, and one or more processors. The processor may be configured to: obtain, using the object detection model, at least one piece of object information on at least one candidate object from an input image for each of the decoding layers; perform an Efficient Non-Maximum Suppression (E-NMS) operation for each decoding layer based on the at least one piece of object information obtained at that decoding layer; and re-perform the E-NMS operation based on the results of the E-NMS operations for the respective decoding layers.

Aspects, features, and advantages other than those described above will become apparent from the following drawings, claims, and detailed description.

According to an embodiment of the present invention, a high-performance, low-computation object detection model that does not require anchors is provided.

According to an embodiment of the present invention, object detection that is robust to variations in object shape and size is possible.

According to an embodiment of the present invention, an efficient post-processing step (E-NMS) is provided, which improves the accuracy of object region determination.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is an exemplary diagram schematically illustrating an object detection process according to an embodiment.
FIG. 2 is a block diagram of an object detection apparatus according to an embodiment.
FIG. 3 is a flowchart of an object detection method according to an embodiment.
FIG. 4 is a schematic illustration of an object detection model according to an embodiment.
FIG. 5 is a configuration diagram of an object detection model according to an embodiment.
FIG. 6 is a configuration diagram of a common decoder of an object detection model according to an embodiment.
FIG. 7 is a configuration diagram of a header and a decoder of an object detection model according to an embodiment.
FIG. 8 is a schematic illustration of an E-NMS operation according to an embodiment.
FIG. 9 is a flowchart of an E-NMS operation according to an embodiment.
FIG. 10 is a flowchart of the second operation of an E-NMS operation according to an embodiment.
FIG. 11 is an exemplary diagram of an IOU matrix according to an embodiment.
FIG. 12 is a diagram showing example object detection results of an object detection model according to an embodiment.

Hereinafter, the present invention is described in more detail with reference to the drawings. The present invention may be implemented in many different forms and is not limited to the embodiments described herein. In the following embodiments, parts not directly related to the description are omitted in order to describe the present invention clearly; this does not mean, however, that the omitted elements are unnecessary when implementing an apparatus or system to which the spirit of the present invention is applied. In addition, the same reference numerals are used for the same or similar elements throughout the specification.

In the following description, terms such as "first" and "second" may be used to describe various components, but the components should not be limited by these terms; the terms are used only to distinguish one component from another. Also, in the following description, a singular expression includes the plural unless the context clearly indicates otherwise.

In the following description, terms such as "include" or "have" specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, the present invention is described in detail with reference to the drawings.

FIG. 1 is an exemplary diagram schematically illustrating an object detection process according to an embodiment.

The object detection apparatus 100 according to an embodiment receives an input image. For example, the input image may be acquired directly using a camera connected to the object detection apparatus 100, or received from an external device over a communication network. Here, the input image includes still images and video.

The object detection apparatus 100 analyzes the received input image using an object detection model. The object detection model is an image-based deep learning model for obtaining object information from an input image. The object detection model is described later with reference to FIGS. 4 to 7.

In one example, the object information may include object region prediction information, Intersection over Union (IOU) prediction information, and object classification prediction information.

The object region prediction information is prediction information on the image region corresponding to an object in the input image.

The object IOU prediction information is information on the degree of overlap between the object region prediction information and ground truth (GT) information. For example, the object IOU prediction information may be determined based on the area of the intersection of the predicted object region and the GT region of the object, divided by the area of their union.
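
The intersection-over-union ratio described above can be sketched for axis-aligned boxes as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `box_iou` and the `(x1, y1, x2, y2)` corner format are assumptions made for the example.

```python
def box_iou(box_a, box_b):
    """IOU of two axis-aligned boxes given as (x1, y1, x2, y2); format is assumed."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    # Union area = sum of both areas minus the part counted twice.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0
```

For instance, two 2x2 boxes offset from each other by one unit in each direction overlap in a 1x1 square, so their IOU is 1 / (4 + 4 - 1) = 1/7.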

The object classification prediction information is a probabilistic prediction of which kind of object the predicted object region contains.

The object information may further include object size prediction information and object classification accuracy prediction information. These are described later with reference to FIG. 7.

The object detection apparatus 100 corresponds to any of various types of electronic devices capable of executing the object detection model.

In one example, the object detection apparatus 100 may be mounted on a vehicle. In one example, the object detection apparatus 100 may be an electronic device capable of communicating with a vehicle. For example, the object detection apparatus 100 may include a server connected to the vehicle through a network. In one example, the object detection apparatus 100 may include a robot or a terminal device such as a smartphone, but is not limited thereto.

FIG. 2 is a block diagram of an object detection apparatus according to an embodiment.

The object detection apparatus 100 according to an embodiment may include a processor 110 and a memory 120. The components shown in FIG. 2 are exemplary, and the object detection apparatus 100 may further include additional components.

The object detection apparatus 100 may include the processor 110.

The processor 110 is a kind of central processing unit and may control the operation of the object detection apparatus 100 by executing one or more instructions stored in the memory 120. The processor 110 may include any type of device capable of processing data by executing instructions.

The processor 110 may refer to, for example, a data processing device embedded in hardware that has circuitry physically structured to perform functions expressed as code or instructions contained in a program. Examples of such a hardware-embedded data processing device include, but are not limited to, a microprocessor, a central processing unit (CPU), a processor core, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA). The processor 110 may include one or more processors.

Additionally, the object detection apparatus 100 may include the memory 120.

The memory 120 may store an object detection model for obtaining object information. The memory 120 may also store instructions for the process by which the object detection apparatus 100 executes the object detection model.

The processor 110 may execute the object detection process according to an embodiment based on the program, the instructions, and the object detection model stored in the memory 120.

The memory 120 may further store intermediate data and results produced during the computation for object detection.

The memory 120 may include internal memory and/or external memory, and may include: volatile memory such as DRAM, SRAM, or SDRAM; non-volatile memory such as one time programmable ROM (OTPROM), PROM, EPROM, EEPROM, mask ROM, flash ROM, NAND flash memory, or NOR flash memory; a flash drive such as an SSD, a compact flash (CF) card, an SD card, a Micro-SD card, a Mini-SD card, an xD card, or a memory stick; or a storage device such as an HDD. The memory 120 may include, but is not limited to, magnetic storage media or flash storage media.

FIG. 3 is a flowchart of an object detection method according to an embodiment.

The object detection method according to an embodiment may include: obtaining, using an object detection model that includes a plurality of decoding layers, at least one piece of object information on at least one candidate object from an input image for each of the decoding layers (S10); performing an Efficient Non-Maximum Suppression (E-NMS) operation for each decoding layer based on the at least one piece of object information obtained at that decoding layer (S20); and re-performing the E-NMS operation based on the results of the E-NMS operations for the respective decoding layers (S30).

In step S10, the processor 110 may use the object detection model including the plurality of decoding layers to obtain, for each decoding layer, at least one piece of object information on at least one candidate object from the input image.

Here, the plurality of decoding layers corresponds to the bottom-up path aggregation layer of the common decoder 2, described later with reference to FIG. 5.

In one example, the object information may include object region prediction information (pred_box_posts), Intersection over Union (IOU) prediction information (pred_reward_iou), and object classification prediction information (pred_confidence).
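
As a rough illustration, the three pieces of object information could be held in one record per candidate object. The field names below are the identifiers given in the text; the class name `ObjectInfo`, the field types, the value ranges, and the helper `top_class` are assumptions made for this sketch and are not stated in the source.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ObjectInfo:
    """One candidate object's predictions (container layout is hypothetical)."""

    # Predicted object region; an (x1, y1, x2, y2) box is assumed here.
    pred_box_posts: Tuple[float, float, float, float]
    # Predicted overlap with the ground truth; assumed to lie in [0, 1].
    pred_reward_iou: float
    # Classification prediction; assumed to be one probability per class.
    pred_confidence: Tuple[float, ...]

    def top_class(self) -> int:
        """Index of the most probable class (hypothetical helper)."""
        return max(range(len(self.pred_confidence)),
                   key=self.pred_confidence.__getitem__)
```

A record would then be created as `ObjectInfo((0, 0, 10, 10), 0.8, (0.1, 0.7, 0.2))`, one per candidate object and per decoding layer.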

In step S20, the processor 110 may perform the E-NMS operation for each decoding layer based on the at least one piece of object information obtained at that decoding layer in step S10.

In step S20, the processor 110 performs the E-NMS operation separately for each decoding layer. That is, in step S20 the processor 110 performs a layer-wise E-NMS operation.

The E-NMS operation may include a first operation of determining an initial value of the confidence score of at least one candidate object, a second operation of updating the confidence score based on the object region prediction information and the object IOU prediction information of the at least one candidate object, and a third operation of determining a final object region of the at least one candidate object based on the updated confidence score. The E-NMS operation according to an embodiment is described in detail later with reference to FIGS. 8 to 11.

In step S30, the processor 110 may re-perform the E-NMS operation based on the results of the E-NMS operations for the respective decoding layers in step S20.

That is, in step S30 the processor 110 may combine the results of the E-NMS operations performed per decoding layer in step S20, and re-perform the E-NMS operation on the combined object information.
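
The control flow of steps S20 and S30 (suppress within each decoding layer, merge the survivors, suppress again) can be sketched as below. Note the substitution: the E-NMS operation itself is detailed with FIGS. 8 to 11 and is not reproduced here, so this sketch uses ordinary greedy NMS as a stand-in. The names `greedy_nms` and `two_pass_nms`, the `(box, score)` tuples, and the 0.5 threshold are all assumptions.

```python
def _iou(a, b):
    """IOU of two (x1, y1, x2, y2) boxes; box format is an assumption."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections, iou_threshold=0.5):
    """Standard greedy NMS over (box, score) pairs; a stand-in, NOT E-NMS."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(_iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


def two_pass_nms(per_layer_detections, iou_threshold=0.5):
    """S20: suppress within each layer; S30: merge results and suppress again."""
    layer_results = [greedy_nms(dets, iou_threshold) for dets in per_layer_detections]
    merged = [det for layer in layer_results for det in layer]
    return greedy_nms(merged, iou_threshold)
```

The first pass removes duplicates produced within a single decoding layer; the second pass removes duplicates of the same object that were detected at different layers.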

Hereinafter, the object detection model according to an embodiment is described with reference to FIGS. 4 to 7.

FIG. 4 is a schematic illustration of an object detection model according to an embodiment.

The object detection model according to an embodiment may include a common encoder 1; a common decoder 2; a header 3 including an object region header 3-1 and an object classification header 3-2; and a decoder 4 including an object region decoder 4-1 and an object classification decoder 4-2.

The object detection model is a deep learning model for object detection based on a fully convolutional neural network.

The common encoder 1 and the common decoder 2 of the object detection model are placed in front of the header 3 and the decoder 4, which are implemented per task (for example, an object region prediction task and an object classification prediction task), and extract common feature maps from the input image.

That is, because the common encoder 1 and common decoder 2 are shared between tasks, the object detection model avoids redundant computation for encoding and decoding the features of the input image.

In the object detection model, the header 3 and the decoder 4 may be implemented per task related to object detection. That is, the header 3 and the decoder 4 are implemented with layers and operations suited to the object information that each task is to detect.

The header 3 converts all or part of the feature maps output through the common encoder 1 and the common decoder 2 into feature maps suitable for the decoder 4. The decoder 4 receives the feature maps converted by the header 3 and outputs the object information to be detected.

The object detection model according to an embodiment is designed so that, by connecting an additional header 3 and decoder 4 to the common encoder 1 and common decoder 2, tasks beyond the object region prediction task and the object classification prediction task described above can easily be plugged in and out.
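
The shared-trunk, per-task-branch structure described above can be sketched with plain callables. This is only a structural illustration under stated assumptions: the class name `MultiTaskDetector` and the methods `plug_in` and `plug_out` are hypothetical, and the toy functions in the usage note stand in for real network layers.

```python
class MultiTaskDetector:
    """Shared encoder/decoder trunk with pluggable (header, decoder) branches."""

    def __init__(self, common_encoder, common_decoder):
        self.common_encoder = common_encoder
        self.common_decoder = common_decoder
        self.branches = {}  # task name -> (header, decoder)

    def plug_in(self, task, header, decoder):
        """Attach a task-specific header/decoder pair to the shared trunk."""
        self.branches[task] = (header, decoder)

    def plug_out(self, task):
        """Detach a task; the trunk and the other tasks are unaffected."""
        self.branches.pop(task, None)

    def __call__(self, image):
        # The common feature maps are computed once and reused by every task.
        common_maps = self.common_decoder(self.common_encoder(image))
        return {task: decoder(header(common_maps))
                for task, (header, decoder) in self.branches.items()}
```

With toy stand-ins such as `model = MultiTaskDetector(lambda x: x * 2, lambda f: [f, f + 1])`, calling `model.plug_in("region", lambda maps: maps[0], lambda m: ("box", m))` adds one task without touching the trunk, which is the point of the design: the cost of encoding and decoding the image is paid once regardless of how many tasks are attached.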

Hereinafter, the structure of the object detection model is described in more detail with reference to FIG. 5.

FIG. 5 is a configuration diagram of an object detection model according to an embodiment.

The common encoder 1 may include a series of encoding layers (Conv1, Conv2, Conv3, Conv4, Conv5, Conv6, and Conv7). Through this series of encoding layers, the common encoder 1 outputs a plurality of first common feature maps that encode features of the input image. The plurality of first common feature maps corresponds to the feature maps output from the respective encoding layers.

The common encoder 1 may include the Conv1 to Conv5 encoding layers. The Conv1 to Conv5 encoding layers may adopt the structure of a network such as VGGNet, ResNet, XceptionNet, ResNeXt, ShuffleNet, or MobileNet.

The common encoder 1 may additionally include the Conv6 and Conv7 encoding layers. These layers allow features of large objects and regions in the input image to be represented well, and are needed for multi-scale feature fusion in the common decoder 2.

The common decoder 2 may include multiple decoding layers. The multiple decoding layers may include a Feature Pyramid Network (FPN) structure, which is a decoder structure for object detection, and a Bottom-up Path Aggregation (BPA) layer.

The common decoder 2 outputs a plurality of second common feature maps (P3, P4, P5, P6, and P7) through the multiple decoding layers. The plurality of second common feature maps corresponds to the feature maps output from the respective layers of the BPA layer.

The second common feature maps (P3, P4, P5, P6, and P7) output from the common decoder 2 support feature extraction that is robust to changes in the size and shape of various objects. This is possible because the common decoder 2 generates the second common feature maps by multi-scale feature fusion, and the header 3 and the decoder 4, described later, operate on these maps. The header 3 constructs an input feature map based on at least some of the second common feature maps (P3, P4, P5, P6, and P7) and provides it to the decoder 4.

The second common feature maps (P3, P4, P5, P6, and P7) output from the common decoder 2 are input both to the header 3-1 implemented for object region prediction and to the header 3-2 implemented for object classification prediction. The header 3 and the decoder 4 are described later with reference to FIG. 7.

Hereinafter, the common decoder 2 is described in detail with reference to FIG. 6.

FIG. 6 is a configuration diagram of a common decoder of an object detection model according to an embodiment.

The common decoder 2 includes multiple decoding layers that extract the plurality of second common feature maps through multi-scale feature fusion.

The multiple decoding layers may include: a top-down layer (2-1) that upsamples and adds some (C3, C4, and C5) of the plurality of first common feature maps (C3, C4, C5, C6, and C7) output from the common encoder (1); an intermediate layer (2-2) that performs a convolution operation on the remaining first common feature maps (C6 and C7) and on the output of the top-down layer (2-1); and a bottom-up path aggregation layer (2-3) that convolves and adds the outputs of the intermediate layer (2-2) to output the plurality of second common feature maps (P3, P4, P5, P6, and P7).

The top-down layer (2-1) upsamples and adds some of the plurality of first common feature maps output from the common encoder (1), and outputs the result.

In one example, the top-down layer operates on the three feature maps (C3, C4, and C5) that correspond to intermediate resolutions among the plurality of first common feature maps (C3, C4, C5, C6, and C7) generated by the common encoder (1).

In step 1, the top-down layer (2-1) applies a 1x1x256 convolution to the first common feature map (C5) output from the fifth encoding layer (Conv5) of the common encoder (1) and outputs the resulting feature map (C5'). In step 2, the top-down layer (2-1) upsamples the step 1 output (C5') by a factor of 2, adds it to the feature map (C4') obtained by applying a 1x1x256 convolution to the first common feature map (C4) output from the fourth encoding layer (Conv4) of the common encoder (1), and outputs the sum (C4' + C5').

In the same manner, in step 3 the top-down layer (2-1) upsamples the step 2 output (C4' + C5') by a factor of 2, adds it to the feature map (C3') obtained by applying a 1x1x256 convolution to the first common feature map (C3) output from the third encoding layer (Conv3) of the common encoder (1), and outputs the sum (C3' + C4' + C5'). In this way, low-resolution and high-resolution features are cumulatively fused in the top-down direction.
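The three top-down steps can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: each feature map is reduced to a single channel, the 1x1x256 lateral convolution is replaced by a hypothetical `lateral` callable (identity by default), and nearest-neighbour interpolation is assumed for the 2x upsampling, since the text does not name the upsampling method.

```python
from typing import Callable, List, Tuple

Map = List[List[float]]  # single-channel feature map (assumption: channels omitted)


def upsample2x(m: Map) -> Map:
    """Nearest-neighbour 2x upsampling (the interpolation mode is an assumption)."""
    out: Map = []
    for row in m:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a: Map, b: Map) -> Map:
    """Element-wise sum of two maps with identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(c3: Map, c4: Map, c5: Map,
             lateral: Callable[[Map], Map] = lambda m: m) -> Tuple[Map, Map, Map]:
    """Return the step 3, step 2 and step 1 outputs of the top-down layer."""
    c5p = lateral(c5)                         # step 1: C5'
    t4 = add(upsample2x(c5p), lateral(c4))    # step 2: C4' + C5'
    t3 = add(upsample2x(t4), lateral(c3))     # step 3: C3' + C4' + C5'
    return t3, t4, c5p
```

With a 1x1 map for C5, a 2x2 map for C4 and a 4x4 map for C3, each step doubles the spatial size before the addition, so the shapes always match.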

The intermediate layer (2-2) performs a convolution operation on the remaining first common feature maps (C6 and C7), that is, those of the plurality of first common feature maps (C3, C4, C5, C6, and C7) of the common encoder (1) that the top-down layer does not use, and on the output of the top-down layer (2-1).

The intermediate layer (2-2) applies a 3x3x256 convolution to each of the two feature maps (C6 and C7) that are not used by the top-down layer (2-1) among the plurality of first common feature maps (C3, C4, C5, C6, and C7) output from the common encoder (1). That is, the intermediate layer (2-2) applies a 3x3x256 convolution to the first common feature map (C7) output from the last encoding layer (C7) of the common encoder (1) to output an intermediate feature map (M7). Likewise, it applies a 3x3x256 convolution to the first common feature map (C6) output from the sixth encoding layer (C6) of the common encoder (1) to output an intermediate feature map (M6).

The intermediate layer (2-2) also applies a 3x3x256 convolution to each output of the top-down layer (2-1) to output the intermediate feature maps (M5, M4, and P3).

The bottom-up path aggregation layer (2-3) convolves and adds the outputs of the intermediate layer (2-2) to output the plurality of second common feature maps (P3, P4, P5, P6, and P7). It generates these maps while proceeding from the lowest level of the pyramid to the highest.

The bottom-up path aggregation layer (2-3) applies a 3x3x256 convolution with stride=2 to the P3 intermediate feature map output from the intermediate layer (2-2) to generate a lower-resolution feature map (P3'), and adds it to the intermediate feature map (M4) output from the intermediate layer (2-2), giving the sum (P3' + M4) as the second common feature map (P4). Starting from the generated second common feature map (P4), the remaining second common feature maps (P5, P6, and P7) are generated in the same manner.
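The bottom-up recurrence can be sketched in the same simplified single-channel setting. The assumptions here are that each level is obtained as P(k+1) = conv_stride2(P(k)) + M(k+1), that the 3x3 convolution uses zero padding of 1, and that one hypothetical `kernel` is reused at every level for brevity; the text states only the kernel size and the stride.

```python
from typing import List, Sequence

Map = List[List[float]]  # single-channel feature map (assumption: channels omitted)


def conv3x3_stride2(m: Map, kernel: Sequence[Sequence[float]]) -> Map:
    """3x3 convolution with stride 2 and zero padding 1 (padding is an assumption)."""
    h, w = len(m), len(m[0])
    out: Map = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            acc = 0.0
            for di in range(3):
                for dj in range(3):
                    y, x = i + di - 1, j + dj - 1
                    if 0 <= y < h and 0 <= x < w:
                        acc += kernel[di][dj] * m[y][x]
            row.append(acc)
        out.append(row)
    return out


def add(a: Map, b: Map) -> Map:
    """Element-wise sum of two maps with identical shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bottom_up(p3: Map, mids: Sequence[Map],
              kernel: Sequence[Sequence[float]]) -> List[Map]:
    """mids holds M4, M5, ... in order; returns [P3, P4, P5, ...]."""
    pyramid = [p3]
    for m in mids:
        down = conv3x3_stride2(pyramid[-1], kernel)  # e.g. P3' from P3
        pyramid.append(add(down, m))                 # e.g. P4 = P3' + M4
    return pyramid
```

Each stride-2 convolution halves the spatial size, so the downsampled map matches the next intermediate map before the two are added.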

In this specification, the plurality of decoding layers refers to the final decoding layers of the common decoder (2). For example, the plurality of decoding layers may correspond to the bottom-up path aggregation layer (2-3).

The plurality of second common feature maps (P3, P4, P5, P6, and P7) comprise a series of multi-scale feature maps output from the bottom-up path aggregation layer (2-3). That is, the bottom-up path aggregation layer (2-3) cumulatively fuses low-resolution and high-resolution features in the bottom-up direction to generate the series of multi-scale feature maps, which can represent features that are robust to images containing many kinds of objects and to changes in object size and shape.

FIG. 7 is a block diagram of the header and the decoder of the object detection model according to an embodiment.

In the object detection model according to the embodiment, the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the bottom-up path aggregation layer (2-3) of the common decoder (2), described with reference to FIG. 6, are input to the header (3), as shown in FIG. 4.

The object detection model according to the embodiment includes an object region header (3-1) and an object classification header (3-2).

The object region header (3-1) is a domain adaptation layer that converts the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the plurality of decoding layers into an input feature map for object region detection.

The object classification header (3-2) is a kind of domain adaptation layer that converts the plurality of second common feature maps (P3, P4, P5, P6, and P7) generated by the plurality of decoding layers into an input feature map for object classification.

As described above, the plurality of decoding layers correspond to the bottom-up path aggregation layer (2-3) of the common decoder (2) described with reference to FIG. 6.

The header (3) for object detection converts, through a domain adaptation layer, each of the plurality of second common feature maps (P3, P4, P5, P6, and P7) into an input feature map to be input to the decoder (4) for object detection. Here, the input feature map includes an object region feature map and an object classification feature map.

The header (3) consists of the object region header (3-1) and the object classification header (3-2). The object region header (box_feature) and the object classification header (class_feature) apply the same parameters (sharing layer) to the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the common decoder (2) to generate an object region feature map representing object region features and an object classification feature map representing object classification features.

In one example, the object region header (3-1) and the object classification header (3-2) may include a plurality of convolutional layers that share weights with each other.

In one example, the object region header (3-1) and the object classification header (3-2) may each apply a 3x3x256 convolution four times to the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the bottom-up path aggregation layer (2-3) of the common decoder (2) to generate the object region feature map and the object classification feature map, respectively.

Using the same parameters, the object region header (3-1) and the object classification header (3-2) extract the multi-scale features of the plurality of second common feature maps (P3, P4, P5, P6, and P7) output from the plurality of decoding layers into the object region feature map and the object classification feature map.

The decoder (4) of the object detection model predicts at least one piece of object information using the input feature map generated by the header (3).

In one example, the decoder (4) may use an anchor-free Multi-Scale Fully-Convolutional Neural Network (MS-FCN), which can achieve high performance and high efficiency by eliminating redundant and additional operations.

The object region feature map is used as the input of the object region decoder (4-1), which predicts object region prediction information (pred_box_posts), object IOU prediction information (pred_reward_iou), and object size prediction information (pred_reward_scale).

The object region decoder (4-1) applies a 3x3x4 convolution to the object region feature map to generate object boxes (pred_boxes), where the four channels (x1, y1, x2, y2) give the position of the object relative to the current grid position. It then multiplies the generated object boxes (pred_boxes) by a scale factor to generate the object region prediction information (pred_box_posts).

In one example, the scale factor is a learned constant that enlarges the predicted object region size values, and it may be learned based on the object size prediction information (pred_reward_scale).

The object region decoder (4-1) generates the object IOU prediction information (pred_reward_iou) and the object size prediction information (pred_reward_scale), respectively, by applying a 1x1x256 convolution and a 1x1x1 convolution to the object region feature map.

The object IOU prediction information (pred_reward_iou) predicts how closely the predicted object region (pred_box_posts) matches the ground-truth object region. Learning this prediction can raise object detection accuracy during training, and applying pred_reward_iou to E-NMS, the object detection post-processing, can improve object region detection performance.

The object classification feature map is used as the input of the object classification decoder (4-2), which predicts object classification prediction information (pred_confidence) and object classification accuracy prediction information (pred_reward_identification).

The object classification decoder (4-2) applies a 3x3xN convolution (N: background plus the number of object classes) to the object classification feature map to generate the object classification prediction information (pred_confidence).

Meanwhile, the object classification accuracy prediction information (pred_reward_identification) and the object size prediction information (pred_reward_scale) are used to train the object detection model but are not used during actual inference.

That is, the object classification accuracy prediction information (pred_reward_identification) and the object size prediction information (pred_reward_scale) are virtual auxiliary tasks that assist the training of the object detection model, and they are not used in the actual inference process.

Here, a virtual auxiliary task is one whose prediction is used during training to compute a loss function that is reflected in the parameter update, which helps the model learn the object information well, and which is not predicted in execution mode, so that it does not add to the computational load.

The object size prediction information (pred_reward_scale) predicts whether the size of the object has been predicted accurately. Training the object IOU prediction information (pred_reward_iou) together with the object size prediction information (pred_reward_scale) strengthens object region prediction performance and improves object detection accuracy.

The object classification accuracy prediction information (pred_reward_identification) predicts whether the object classification information has been predicted accurately, and it can strengthen object classification prediction performance during training.
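The split between outputs needed at inference and training-only auxiliary outputs can be sketched as below. This is an illustrative sketch only: the `run_decoders` function, the `heads` dictionary, and the use of scalar stand-ins for feature maps are hypothetical; only the output names come from the description.

```python
from typing import Callable, Dict

Head = Callable[[float], float]  # stand-in for a convolutional prediction head


def run_decoders(box_feature: float, class_feature: float,
                 heads: Dict[str, Head], training: bool) -> Dict[str, float]:
    """Evaluate the auxiliary heads only when training is True."""
    out = {
        "pred_box_posts": heads["pred_box_posts"](box_feature),
        "pred_reward_iou": heads["pred_reward_iou"](box_feature),
        "pred_confidence": heads["pred_confidence"](class_feature),
    }
    if training:
        # Auxiliary tasks: they contribute loss terms during training,
        # and are skipped at inference so they add no computation there.
        out["pred_reward_scale"] = heads["pred_reward_scale"](box_feature)
        out["pred_reward_identification"] = (
            heads["pred_reward_identification"](class_feature)
        )
    return out
```

At inference only the three outputs consumed by the E-NMS post-processing are computed.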

The processor (110) may determine the final object region through E-NMS-based post-processing of the object information obtained using the decoder (4).

In one example, the processor (110) may select the final object region by inputting the object region prediction information (pred_box_posts), the object IOU prediction information (pred_reward_iou), and the object classification prediction information (pred_confidence) into the E-NMS operation described later with reference to FIG. 8.

FIG. 8 is a schematic illustration of the E-NMS operation according to an embodiment.

The object detection method according to the embodiment proposes an E-NMS operation for post-processing the object information output from the object detection model.

The E-NMS operation performs post-processing to determine the final object region of at least one candidate object predicted to be contained in the input image, based on the object classification prediction information (pred_confidence), the object IOU prediction information (pred_reward_iou), and the object region prediction information (pred_box_posts).

The object detection method according to the embodiment executes the E-NMS operation in two steps. That is, referring to FIG. 3, the processor (110) performs the E-NMS operation in step S20 and in step S30.

In step S20, the E-NMS operation is performed for each decoding layer (layer-wise E-NMS). In step S30, an overall E-NMS operation is performed based on the results of the layer-wise E-NMS operation of step S20.

Hereinafter, the E-NMS operation is described in more detail with reference to FIGS. 9 and 10.

FIG. 9 is a flowchart of the E-NMS operation according to an embodiment.

The E-NMS operation may include a first operation (S31) of determining an initial value of the confidence score of at least one candidate object, a second operation (S32) of updating the confidence score based on the object region prediction information (pred_box_posts) and the object IOU prediction information (pred_reward_iou) of the at least one candidate object, and a third operation (S33) of determining the final object region of the at least one candidate object based on the updated confidence score.

In the first operation (S31), the processor (110) determines the initial value of the confidence score of the at least one candidate object.

In one example, when performing the E-NMS operation for step S20, the processor (110) may, in the first operation (S31), determine the initial value of the confidence score based on a function of the object IOU prediction information (pred_reward_iou).

In this case, the initial value of the confidence score may be expressed by Equation 1 below.

Here, Confidence on the right-hand side denotes the object classification prediction information (pred_confidence) output from the object classification decoder (4-2) of the object detection model.

f(pred_reward_iou) is a weighting factor derived from the object IOU prediction information (pred_reward_iou), and various functions may be used for it. For example, f(x) = x^0.8 may be used as f(pred_reward_iou).

Setting the initial value in this way allows E-NMS to be performed on the fused information of the object classification prediction information (pred_confidence) and the object IOU prediction information (pred_reward_iou).
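Equation 1 itself is not reproduced in this text. One reading that is consistent with the surrounding description is that the classification confidence is multiplied by the weighting factor f(pred_reward_iou); the multiplication and the `initial_confidence` helper below are assumptions made for illustration, while f(x) = x^0.8 is the example given in the description.

```python
def initial_confidence(pred_confidence: float, pred_reward_iou: float,
                       power: float = 0.8) -> float:
    """Fuse classification confidence with predicted IOU.

    Assumes Equation 1 has the form Confidence * f(pred_reward_iou),
    with f(x) = x ** power and power = 0.8 as in the example above.
    """
    return pred_confidence * (pred_reward_iou ** power)
```

Under this reading, a box with high classification confidence but a low predicted IOU starts E-NMS with a reduced score, so a better-localized box is more likely to be ranked first.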

In one example, when performing the E-NMS operation for step S30, the processor (110) may, in the first operation (S31), determine the initial value of the confidence score based on the result of the E-NMS operation for each decoding layer in step S20.

For example, the processor (110) may take the confidence score updated during the second operation (S32) of the E-NMS operation for each decoding layer in step S20 as the initial value of the confidence score in the first operation (S31) of the E-NMS operation of step S30.

In the second operation (S32), the processor (110) updates the confidence score based on the object region prediction information (pred_box_posts) and the object IOU prediction information (pred_reward_iou) of the at least one candidate object. The second operation (S32) is described with reference to FIG. 10.

In the third operation (S33), the processor (110) determines the final object region of the at least one candidate object based on the updated confidence score.

For example, the processor (110) may determine, among the regions corresponding to the object region prediction information (pred_box_posts), the region with the highest updated confidence score as the final object region of the corresponding object.

As a result, by executing this E-NMS operation, the processor (110) can determine the final object region of at least one candidate object predicted to be contained in the input image.

FIG. 10 is a flowchart of the second operation of the E-NMS operation according to an embodiment.

The processor (110) executes the following operations to carry out the second operation (S32) of the E-NMS operation. That is, the second operation (S32) includes an operation (S32-1) of sorting the at least one candidate object based on the initial value determined in the first operation (S31), an operation (S32-2) of determining the IOU between two different pieces of object region prediction information for the at least one candidate object, an operation (S32-3) of re-determining the IOU based on a predefined weight function, and an operation (S32-4) of determining the confidence score of the at least one candidate object based on the re-determined IOU.

In operation S32-1, the processor (110) sorts the object region prediction information (pred_box_posts) of the at least one candidate object based on the confidence score determined in the first operation (S31). For example, the object region prediction information (pred_box_posts) may be sorted in descending order of the candidate objects' confidence scores.

In one example, the processor (110) can perform the IOU determination operation (S32-2), the IOU re-determination operation (S32-3), and the confidence score determination operation (S32-4) in parallel using an IOU matrix.

The IOU-matrix method generates a lower triangular IOU matrix between the predicted object regions and computes on it. It resolves the ambiguity of overlap between different objects and improves the efficiency of parallel computation.

The IOU matrix is a square matrix with as many rows and columns as there are pieces of object region prediction information (pred_box_posts). The E-NMS operation is described below in terms of the IOU matrix.

In operation S32-2, the processor (110) determines the IOU between two different pieces of object region prediction information (pred_box_posts) for the at least one candidate object.

1) In operation S32-2, the processor (110) determines the IOU between Box_i, the i-th object region prediction information, and Box_j, the j-th object region prediction information, and stores the determined IOU in b_ij, the matrix element in the i-th row and j-th column of the IOU matrix. Here, the IOU value of b_ii, whose row and column numbers coincide, is 1.

2) The values of b_ij (i<j), located in the upper triangle relative to b_ii, are set to 0. The IOU value of b_ii, whose row and column numbers coincide, is also set to 0.

3) To eliminate performance degradation caused by overlap between different candidate objects, for each IOU matrix, if the objects on the row axis and the column axis are not the same object, the corresponding matrix element is set to 0.
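Steps 1) to 3) can be sketched as follows. The sketch assumes that the boxes are already sorted in descending order of confidence (operation S32-1), that boxes are given as (x1, y1, x2, y2) corners, and that "the same object" in step 3) is decided by comparing a per-box class label; the `labels` argument and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners (assumption)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def iou_matrix(boxes: Sequence[Box], labels: Sequence[int]) -> List[List[float]]:
    """Lower triangular IOU matrix with zero diagonal and upper triangle."""
    n = len(boxes)
    m = [[0.0] * n for _ in range(n)]   # step 2: diagonal and upper triangle stay 0
    for i in range(n):
        for j in range(i):              # strictly lower triangle only (j < i)
            if labels[i] == labels[j]:  # step 3: keep same-object pairs only
                m[i][j] = iou(boxes[i], boxes[j])  # step 1: b_ij
    return m
```

Because the boxes are sorted, row i then holds the overlaps of box i with every higher-scoring box, and each row can be processed independently of the others.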

In operation S32-3, the processor (110) re-determines the IOU based on a predefined weight function.

Equation 2 below is an example of a predefined weight function.

Here, n is a linear weight function, the IOU on the right-hand side is the IOU currently stored in the IOU matrix, and the IOU on the left-hand side is the IOU re-determined according to Equation 2.

In operation S32-4, the processor (110) determines the confidence score based on the re-determined IOU.

The processor (110) determines the confidence score of the object region corresponding to the object region prediction information (pred_box_posts) of the at least one candidate object based on Equation 3 below.

Here, conf_n denotes the confidence score of the n-th object region (Box_n).

FIG. 11 is an exemplary diagram of an IOU matrix according to an embodiment.

The left-hand diagram shows an example of the IOU matrix that stores the IOU values re-determined in operation S32-3.

The right-hand diagram shows b_i, computed by multiplying together the IOU values in the i-th row, in order to determine the confidence score of each object region (b_i) according to Equation 3.
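Equations 2 and 3 are not reproduced in this text, so the following is only a sketch under stated assumptions: the linear weight function is taken to be n(x) = 1 - x, and the updated confidence of box i is taken to be its current confidence multiplied by the product of the re-determined IOU values in row i, as the description of FIG. 11 suggests. The function names are hypothetical.

```python
from math import prod
from typing import List, Sequence


def linear_weight(x: float) -> float:
    """Assumed form of the linear weight function n: n(x) = 1 - x."""
    return 1.0 - x


def update_confidence(conf: Sequence[float],
                      iou_m: Sequence[Sequence[float]]) -> List[float]:
    """Decay each confidence by the row product of re-weighted IOU values."""
    reweighted = [[linear_weight(v) for v in row] for row in iou_m]  # S32-3
    return [c * prod(row) for c, row in zip(conf, reweighted)]       # S32-4
```

With this choice, entries that were set to 0 in the IOU matrix become 1 after re-weighting and leave the product unchanged, so a box is penalized only by higher-scoring boxes of the same object that overlap it. Every row is independent, which is what allows the computation to run in parallel.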

FIG. 12 is a diagram showing an example of object detection results of the object detection model according to an embodiment.

Objects of various shapes and sizes are detected well, and the object regions are specified accurately even when objects are very close together or occluded.

Hereinafter, the training process of the object detection model according to an embodiment is described.

The loss function for training the object detection model according to the embodiment takes into account the object classification error (class), the object region estimation error (box), the IOU estimation error (iou), the object size accuracy prediction error (scale), and the object classification accuracy prediction error (id). The final loss for training is obtained as a weighted sum of these errors.

(1) The class error is calculated using focal loss, as in Equation 4.

(2) The object region prediction error is calculated using GIOU loss, as in Equation 5.

(3) The IOU estimation error (iou) is calculated using Equation 6.

(4) The object size accuracy prediction error (scale) can be obtained using Equation 7.

(5) The object classification accuracy prediction error (id) can be obtained using Equation 8.

Here, u is the unit step function.

(6) Finally, the final error function required for model training is calculated as the weighted sum of the error values obtained in (1) to (5), as in Equation 9.
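Equations 4 to 9 are not reproduced in this text. For orientation, the sketch below gives the commonly used forms of focal loss and GIOU loss and a generic weighted sum. These are not necessarily the exact expressions of the embodiment: the defaults alpha = 0.25 and gamma = 2.0, the (x1, y1, x2, y2) box format, and the dictionary keys are assumptions, and the iou, scale and id terms of Equations 6 to 8 are left out because their form cannot be recovered from the text.

```python
import math
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) corners (assumption)


def focal_loss(p: float, target: int, alpha: float = 0.25,
               gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p if target == 1 else 1.0 - p
    alpha_t = alpha if target == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


def giou_loss(a: Box, b: Box) -> float:
    """GIOU loss: 1 - (IOU - (hull - union) / hull), for non-degenerate boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    # Smallest axis-aligned box enclosing both a and b.
    hull = ((max(a[2], b[2]) - min(a[0], b[0]))
            * (max(a[3], b[3]) - min(a[1], b[1])))
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou


def total_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of the component errors (the weights are hypothetical)."""
    return sum(weights[name] * value for name, value in losses.items())
```

Unlike a plain IOU loss, the GIOU term stays informative for non-overlapping boxes, because the enclosing-box penalty still changes as the boxes move apart.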

Hereinafter, artificial intelligence technology related to embodiments of the present invention is described.

Artificial intelligence (AI) is a field of computer engineering and information technology that studies how to enable computers to perform the thinking, learning, and self-improvement that human intelligence makes possible. It means enabling computers to imitate intelligent human behavior.

또한, 인공지능은 그 자체로 존재하는 것이 아니라, 컴퓨터 과학의 다른 분야와 직간접으로 많은 관련을 맺고 있다. 특히 현대에는 정보기술의 여러 분야에서 인공지능적 요소를 도입하여, 그 분야의 문제 풀이에 활용하려는 시도가 매우 활발하게 이루어지고 있다.In addition, artificial intelligence does not exist by itself, but is directly or indirectly related to other fields of computer science. In particular, in modern times, attempts are being made to introduce artificial intelligence elements in various fields of information technology and use them to solve problems in that field.

머신 러닝(machine learning)은 인공지능의 한 분야로, 컴퓨터에 명시적인 프로그램 없이 배울 수 있는 능력을 부여하는 연구 분야이다. Machine learning is a branch of artificial intelligence, a field of study that gives computers the ability to learn without an explicit program.

구체적으로 머신 러닝은, 경험적 데이터를 기반으로 학습을 하고 예측을 수행하고 스스로의 성능을 향상시키는 시스템과 이를 위한 알고리즘을 연구하고 구축하는 기술이라 할 수 있다. 머신 러닝의 알고리즘들은 엄격하게 정해진 정적인 프로그램 명령들을 수행하는 것이라기보다, 입력 데이터를 기반으로 예측이나 결정을 이끌어내기 위해 특정한 모델을 구축하는 방식을 취한다.Specifically, machine learning can be said to be a technology to study and build a system and algorithms for learning based on empirical data, making predictions, and improving its own performance. Machine learning algorithms build specific models to make predictions or decisions based on input data, rather than executing strictly set static program instructions.

In the Korean text, the transliterated term '머신 러닝' and the native term '기계 학습' both denote machine learning and are used interchangeably.

기계 학습에서 데이터를 어떻게 분류할 것인가를 놓고, 많은 기계 학습 알고리즘이 개발되었다. 의사결정나무(Decision Tree)나 베이지안 망(Bayesian network), 서포트벡터머신(SVM: support vector machine), 그리고 인공 신경망(ANN: Artificial Neural Network) 등이 대표적이다.With regard to how to classify data in machine learning, many machine learning algorithms have been developed. Decision trees, Bayesian networks, support vector machines (SVMs), and artificial neural networks (ANNs) are representative examples.

의사결정나무는 의사결정규칙(Decision Rule)을 나무구조로 도표화하여 분류와 예측을 수행하는 분석방법이다.Decision tree is an analysis method that performs classification and prediction by charting decision rules in a tree structure.

베이지안 망은 다수의 변수들 사이의 확률적 관계(조건부독립성: conditional independence)를 그래프 구조로 표현하는 모델이다. 베이지안 망은 비지도 학습(unsupervised learning)을 통한 데이터마이닝(data mining)에 적합하다. The Bayesian network is a model that expresses the probabilistic relationship (conditional independence) between multiple variables in a graph structure. Bayesian networks are suitable for data mining through unsupervised learning.

서포트벡터머신은 패턴인식과 자료분석을 위한 지도 학습(supervised learning)의 모델이며, 주로 분류와 회귀분석을 위해 사용한다.The support vector machine is a model of supervised learning for pattern recognition and data analysis, and is mainly used for classification and regression analysis.

인공신경망은 생물학적 뉴런의 동작원리와 뉴런간의 연결 관계를 모델링한 것으로 노드(node) 또는 처리 요소(processing element)라고 하는 다수의 뉴런들이 레이어(layer) 구조의 형태로 연결된 정보처리 시스템이다.An artificial neural network is an information processing system in which a number of neurons called nodes or processing elements are connected in the form of a layer structure by modeling the operating principle of biological neurons and the connection relationship between neurons.

인공 신경망은 기계 학습에서 사용되는 모델로써, 기계학습과 인지과학에서 생물학의 신경망(동물의 중추신경계 중 특히 뇌)에서 영감을 얻은 통계학적 학습 알고리즘이다.Artificial neural network is a model used in machine learning, and it is a statistical learning algorithm inspired by neural networks in biology (especially the brain in the central nervous system of animals) in machine learning and cognitive science.

Specifically, an artificial neural network may refer to any model in which artificial neurons (nodes), forming a network through synaptic connections, acquire problem-solving ability by changing the strength of those connections through learning.

용어 인공신경망은 용어 뉴럴 네트워크(Neural Network)와 혼용되어 사용될 수 있다.The term artificial neural network may be used interchangeably with the term neural network.

인공신경망은 복수의 레이어(layer)를 포함할 수 있고, 레이어들 각각은 복수의 뉴런(neuron)을 포함할 수 있다. 또한 인공신경망은 뉴런과 뉴런을 연결하는 시냅스를 포함할 수 있다.The artificial neural network may include a plurality of layers, and each of the layers may include a plurality of neurons. Also, the artificial neural network may include neurons and synapses connecting neurons.

An artificial neural network is generally defined by three factors: (1) the connection pattern between neurons in different layers, (2) the learning process that updates the connection weights, and (3) the activation function that generates an output value from the weighted sum of the inputs received from the previous layer.

The artificial neural network may include network models such as a Deep Neural Network (DNN), Recurrent Neural Network (RNN), Bidirectional Recurrent Deep Neural Network (BRDNN), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN), but is not limited thereto.

In this specification, the Korean terms '레이어' and '계층' both denote a layer and are used interchangeably.

인공신경망은 계층 수에 따라 단층 신경망(Single-Layer Neural Networks)과 다층 신경망(Multi-Layer Neural Networks)으로 구분된다.Artificial neural networks are divided into single-layer neural networks and multi-layer neural networks according to the number of layers.

일반적인 단층 신경망은, 입력층과 출력층으로 구성된다.A typical single-layer neural network consists of an input layer and an output layer.

또한 일반적인 다층 신경망은 입력층(Input Layer)과 하나 이상의 은닉층(Hidden Layer), 출력층(Output Layer)으로 구성된다.In addition, a general multilayer neural network consists of an input layer, one or more hidden layers, and an output layer.

The input layer receives external data, and the number of neurons in the input layer equals the number of input variables. The hidden layer is located between the input layer and the output layer; it receives signals from the input layer, extracts features, and passes them to the output layer. The output layer receives signals from the hidden layer and outputs a value based on them. Input signals between neurons are each multiplied by their connection strength (weight) and then summed; if this sum exceeds the neuron's threshold, the neuron is activated and outputs the value obtained through the activation function.
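As a minimal sketch of the weighted sum and activation described above, the following computes the output of a single neuron and of one fully connected layer. The sigmoid activation, the use of a bias term in place of an explicit threshold, and the names `neuron_output` and `dense_layer` are assumptions made for illustration.

```python
import math

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs plus a bias, passed through a sigmoid activation.

    The bias plays the role of a (negated) threshold in this sketch.
    """
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weight_rows, biases):
    """One layer: each row of weights defines one neuron fed by all inputs."""
    return [neuron_output(inputs, w, b) for w, b in zip(weight_rows, biases)]

# Two inputs feeding a hidden layer of three neurons (arbitrary example values).
hidden = dense_layer([1.0, -1.0],
                     [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]],
                     [0.0, 0.0, 0.0])
```

Stacking several such layers, with the output of one used as the input of the next, gives the multi-layer structure described in the text.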

한편 입력층과 출력 층 사이에 복수의 은닉층을 포함하는 심층 신경망은, 기계 학습 기술의 한 종류인 딥 러닝을 구현하는 대표적인 인공 신경망일 수 있다.Meanwhile, a deep neural network including a plurality of hidden layers between an input layer and an output layer may be a representative artificial neural network that implements deep learning, which is a type of machine learning technology.

Meanwhile, the Korean terms '딥 러닝' and '심층 학습' both denote deep learning and are used interchangeably.

An artificial neural network may be trained using training data. Here, training may refer to the process of determining the parameters of the artificial neural network using training data in order to achieve an objective such as classification, regression, or clustering of input data. Representative examples of such parameters are the weights assigned to synapses and the biases applied to neurons.

훈련 데이터에 의하여 학습된 인공 신경망은, 입력 데이터를 입력 데이터가 가지는 패턴에 따라 분류하거나 군집화 할 수 있다. The artificial neural network learned by the training data may classify or cluster input data according to a pattern of the input data.

한편 훈련 데이터를 이용하여 학습된 인공 신경망을, 본 명세서에서는 학습 모델(a trained model)이라 명칭 할 수 있다.Meanwhile, an artificial neural network trained using training data may be referred to as a trained model in the present specification.

다음은 인공 신경망의 학습 방식에 대하여 설명한다.The following describes the learning method of the artificial neural network.

인공 신경망의 학습 방식은 크게, 지도 학습, 비 지도 학습, 준 지도 학습(Semi-Supervised Learning), 강화 학습(Reinforcement Learning)으로 분류될 수 있다.Learning methods of artificial neural networks can be broadly classified into supervised learning, unsupervised learning, semi-supervised learning, and reinforcement learning.

지도 학습은 훈련 데이터로부터 하나의 함수를 유추해내기 위한 기계 학습의 한 방법이다. Supervised learning is a method of machine learning to infer a function from training data.

그리고 이렇게 유추되는 함수 중, 연속적인 값을 출력하는 것을 회귀분석(Regression)이라 하고, 입력 벡터의 클래스(class)를 예측하여 출력하는 것을 분류(Classification)라고 할 수 있다.And among these inferred functions, outputting a continuous value is called regression, and predicting and outputting a class of an input vector can be called classification.

지도 학습에서는, 훈련 데이터에 대한 레이블(label)이 주어진 상태에서 인공 신경망을 학습시킨다.In supervised learning, an artificial neural network is trained in a state in which a label for training data is given.

여기서 레이블이란, 훈련 데이터가 인공 신경망에 입력되는 경우 인공 신경망이 추론해 내야 하는 정답(또는 결과 값)을 의미할 수 있다.Here, the label may mean a correct answer (or result value) that the artificial neural network should infer when training data is input to the artificial neural network.

본 명세서에서는 훈련 데이터가 입력되는 경우 인공 신경망이 추론해 내야 하는 정답(또는 결과값)을 레이블 또는 레이블링 데이터(labeling data)이라 명칭 한다.In the present specification, when training data is input, the correct answer (or result value) that the artificial neural network must infer is called a label or labeling data.

또한 본 명세서에서는, 인공 신경망의 학습을 위하여 훈련 데이터에 레이블을 설정하는 것을, 훈련 데이터에 레이블링 데이터를 레이블링(labeling) 한다고 명칭 한다.Also, in this specification, setting a label on the training data for learning of the artificial neural network is called labeling the labeling data on the training data.

In this case, the training data and the label corresponding to it constitute one training set, and may be input to the artificial neural network in the form of a training set.

한편 훈련 데이터는 복수의 특징(feature)을 나타내고, 훈련 데이터에 레이블이 레이블링 된다는 것은 훈련 데이터가 나타내는 특징에 레이블이 달린다는 것을 의미할 수 있다. 이 경우 훈련 데이터는 입력 객체의 특징을 벡터 형태로 나타낼 수 있다.On the other hand, training data represents a plurality of features, and labeling the training data may mean that the feature represented by the training data is labeled. In this case, the training data may represent the features of the input object in a vector form.

인공 신경망은 훈련 데이터와 레이블링 데이터를 이용하여, 훈련 데이터와 레이블링 데이터의 연관 관계에 대한 함수를 유추할 수 있다. 그리고, 인공 신경망에서 유추된 함수에 대한 평가를 통해 인공 신경망의 파라미터가 결정(최적화)될 수 있다.The artificial neural network may infer a function for the relationship between the training data and the labeling data by using the training data and the labeling data. In addition, parameters of the artificial neural network may be determined (optimized) through evaluation of the function inferred from the artificial neural network.

비 지도 학습은 기계 학습의 일종으로, 훈련 데이터에 대한 레이블이 주어지지 않는다.Unsupervised learning is a type of machine learning where no labels are given to training data.

구체적으로, 비 지도 학습은, 훈련 데이터 및 훈련 데이터에 대응하는 레이블의 연관 관계 보다는, 훈련 데이터 자체에서 패턴을 찾아 분류하도록 인공 신경망을 학습시키는 학습 방법일 수 있다.Specifically, the unsupervised learning may be a learning method of learning the artificial neural network to find and classify patterns in the training data itself, rather than the association between the training data and the labels corresponding to the training data.

비 지도 학습의 예로는, 군집화 또는 독립 성분 분석(Independent Component Analysis)을 들 수 있다.Examples of unsupervised learning include clustering or independent component analysis.

In this specification, the Korean terms '군집화' and '클러스터링' both denote clustering and are used interchangeably.

비지도 학습을 이용하는 인공 신경망의 일례로 생성적 적대 신경망(GAN: Generative Adversarial Network), 오토 인코더(AE: Autoencoder)를 들 수 있다.Examples of artificial neural networks using unsupervised learning include a generative adversarial network (GAN) and an autoencoder (AE).

생성적 적대 신경망이란, 생성기(generator)와 판별기(discriminator), 두 개의 서로 다른 인공지능이 경쟁하며 성능을 개선하는 머신 러닝 방법이다.A generative adversarial neural network is a machine learning method in which two different artificial intelligences, a generator and a discriminator, compete to improve performance.

이 경우 생성기는 새로운 데이터를 창조하는 모형으로, 원본 데이터를 기반으로 새로운 데이터를 생성할 수 있다.In this case, the generator is a model that creates new data, and can generate new data based on the original data.

또한 판별기는 데이터의 패턴을 인식하는 모형으로, 입력된 데이터가 원본 데이터인지 또는 생성기에서 생성한 새로운 데이터인지 여부를 감별하는 역할을 수행할 수 있다.In addition, the discriminator is a model for recognizing patterns in data, and may play a role of discriminating whether input data is original data or new data generated by the generator.

The generator learns from the data that failed to deceive the discriminator, and the discriminator learns from the generated data by which it was deceived. Accordingly, the generator may evolve to deceive the discriminator as well as possible, and the discriminator may evolve to distinguish the original data from the data produced by the generator.

오토 인코더는 입력 자체를 출력으로 재현하는 것을 목표로 하는 신경망이다.An autoencoder is a neural network that aims to reproduce the input itself as an output.

오토 인코더는 입력층, 적어도 하나의 은닉층 및 출력층을 포함한다. The auto-encoder includes an input layer, at least one hidden layer and an output layer.

이 경우 은닉 계층의 노드 수가 입력 계층의 노드 수보다 적으므로 데이터의 차원이 줄어들게 되며, 이에 따라 압축 또는 인코딩이 수행되게 된다.In this case, since the number of nodes in the hidden layer is smaller than the number of nodes in the input layer, the dimension of data is reduced, and thus compression or encoding is performed.

또한 은닉 계층에서 출력한 데이터는 출력 계층으로 들어간다. 이 경우 출력 계층의 노드 수는 은닉 계층의 노드 수보다 많으므로, 데이터의 차원이 늘어나게 되며, 이에 따라 압축 해제 또는 디코딩이 수행되게 된다.Also, the data output from the hidden layer goes into the output layer. In this case, since the number of nodes of the output layer is greater than the number of nodes of the hidden layer, the dimension of data is increased, and decompression or decoding is performed accordingly.

Meanwhile, the autoencoder adjusts the connection strengths of its neurons through learning so that the input data is represented as hidden-layer data. The hidden layer represents information with fewer neurons than the input layer, so being able to reproduce the input data as the output may mean that the hidden layer has discovered and represented a hidden pattern in the input data.

준 지도 학습은 기계 학습의 일종으로, 레이블이 주어진 훈련 데이터와 레이블이 주어지지 않은 훈련 데이터를 모두 사용하는 학습 방법을 의미할 수 있다.Semi-supervised learning is a type of machine learning, and may refer to a learning method using both labeled and unlabeled training data.

One semi-supervised learning technique infers labels for the unlabeled training data and then performs training using the inferred labels; this technique is useful when labeling is costly.

Reinforcement learning is based on the theory that, given an environment in which an agent can judge what action to take at each moment, the agent can find the best path through experience, without data.

강화 학습은 주로 마르코프 결정 과정(MDP: Markov Decision Process)에 의하여 수행될 수 있다.Reinforcement learning may be mainly performed by a Markov Decision Process (MDP).

The Markov decision process can be described as follows: first, an environment is given that contains the information the agent needs to take its next action; second, how the agent acts in that environment is defined; third, what the agent is rewarded for doing well and penalized for doing poorly is defined; and fourth, the agent repeats the experience until the future reward reaches its maximum, thereby deriving the optimal policy.

The structure of an artificial neural network is specified by the model configuration, activation function, loss function or cost function, learning algorithm, optimization algorithm, and the like; hyperparameters are set before training, and model parameters are then set through training, which specifies the content of the model.

예컨대, 인공 신경망의 구조를 결정하는 요소에는 은닉층의 개수, 각 은닉층에 포함된 은닉 노드의 개수, 입력 특징 벡터(Input Feature Vector), 대상 특징 벡터(Target Feature Vector) 등이 포함될 수 있다.For example, factors determining the structure of an artificial neural network may include the number of hidden layers, the number of hidden nodes included in each hidden layer, an input feature vector, a target feature vector, and the like.

하이퍼파라미터는 모델 파라미터의 초기값 등과 같이 학습을 위하여 초기에 설정하여야 하는 여러 파라미터들을 포함한다. 그리고, 모델 파라미터는 학습을 통하여 결정하고자 하는 여러 파라미터들을 포함한다.Hyperparameters include several parameters that must be initially set for learning, such as initial values of model parameters. And, the model parameter includes several parameters to be determined through learning.

예컨대, 하이퍼파라미터에는 노드 간 가중치 초기값, 노드 간 편향 초기값, 미니 배치(Mini-batch) 크기, 학습 반복 횟수, 학습률(Learning Rate) 등이 포함될 수 있다. 그리고, 모델 파라미터에는 노드 간 가중치, 노드 간 편향 등이 포함될 수 있다.For example, the hyperparameter may include an initial weight value between nodes, an initial bias value between nodes, a mini-batch size, a number of learning repetitions, a learning rate, and the like. In addition, the model parameters may include inter-node weights, inter-node biases, and the like.

손실 함수는 인공 신경망의 학습 과정에서 최적의 모델 파라미터를 결정하기 위한 지표(기준)로 이용될 수 있다. 인공 신경망에서 학습은 손실 함수를 줄이기 위하여 모델 파라미터들을 조작하는 과정을 의미하며, 학습의 목적은 손실 함수를 최소화하는 모델 파라미터를 결정하는 것으로 볼 수 있다.The loss function may be used as an index (reference) for determining the optimal model parameter in the learning process of the artificial neural network. In artificial neural networks, learning refers to the process of manipulating model parameters to reduce the loss function, and the purpose of learning can be seen to determine the model parameters that minimize the loss function.

손실 함수는 주로 평균 제곱 오차(MSE: Mean Squared Error) 또는 교차 엔트로피 오차(CEE, Cross Entropy Error)를 사용할 수 있으며, 본 발명이 이에 한정되지는 않는다. The loss function may mainly use a mean squared error (MSE) or a cross entropy error (CEE), but the present invention is not limited thereto.

The cross-entropy error can be used when the correct-answer label is one-hot encoded. One-hot encoding sets the label value to 1 only for the neuron corresponding to the correct answer and to 0 for all other neurons.
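A minimal sketch of one-hot encoding and of the cross-entropy error computed against a one-hot label is given below. The helper names `one_hot` and `cross_entropy`, and the small `eps` used to avoid taking the logarithm of zero, are assumptions for illustration.

```python
import math

def one_hot(index, num_classes):
    """Label vector with 1.0 at the correct class and 0.0 elsewhere."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def cross_entropy(probs, target, eps=1e-12):
    """Cross-entropy between predicted class probabilities and a one-hot target."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))

label = one_hot(1, 3)                          # correct class is index 1
loss = cross_entropy([0.1, 0.7, 0.2], label)   # only the index-1 term contributes
```

Because every entry of the one-hot label other than the correct class is zero, the error reduces to the negative logarithm of the probability assigned to the correct class.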

In machine learning or deep learning, a learning optimization algorithm can be used to minimize the loss function. Such algorithms include Gradient Descent (GD), Stochastic Gradient Descent (SGD), Momentum, Nesterov Accelerated Gradient (NAG), Adagrad, AdaDelta, RMSProp, Adam, and Nadam.

경사 하강법은 현재 상태에서 손실 함수의 기울기를 고려하여 손실 함수값을 줄이는 방향으로 모델 파라미터를 조정하는 기법이다. Gradient descent is a technique that adjusts model parameters in a direction to reduce the loss function value by considering the gradient of the loss function in the current state.

모델 파라미터를 조정하는 방향은 스텝(step) 방향, 조정하는 크기는 스텝 사이즈(size)라고 칭한다.The direction in which the model parameter is adjusted is referred to as a step direction, and the size to be adjusted is referred to as a step size.

이때, 스텝 사이즈는 학습률을 의미할 수 있다.In this case, the step size may mean a learning rate.

Gradient descent obtains the gradient by partially differentiating the loss function with respect to each model parameter, and updates the model parameters by moving them along the obtained gradient, in the direction that reduces the loss, by an amount set by the learning rate.
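The update rule can be sketched for a single parameter as follows. The quadratic example loss f(w) = (w - 3)^2, the function name `gradient_descent`, and the chosen learning rate and step count are assumptions for illustration.

```python
def gradient_descent(grad, w0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    w = w0
    for _ in range(steps):
        w = w - learning_rate * grad(w)  # step direction: negative gradient
    return w

# Example loss f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

With several parameters the same update is applied to each one using its own partial derivative.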

Stochastic gradient descent divides the training data into mini-batches and performs gradient descent on each mini-batch, which increases the frequency of gradient descent updates.
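A minimal sketch of mini-batch stochastic gradient descent is shown below, fitting a single weight w in the model y = w * x under a squared-error loss. The function name `sgd_fit`, the toy data, and all hyperparameter values are assumptions for illustration.

```python
import random

def sgd_fit(xs, ys, batch_size=2, learning_rate=0.05, epochs=200, seed=0):
    """Fit y = w * x by gradient descent on shuffled mini-batches."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new mini-batch split every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Mean gradient of (w * x - y)^2 over the mini-batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # generated with w = 2
w_fit = sgd_fit(xs, ys)
```

Each epoch here performs two parameter updates instead of one, which is the higher update frequency that the text attributes to the mini-batch scheme.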

Adagrad, AdaDelta 및 RMSProp는 SGD에서 스텝 사이즈를 조절하여 최적화 정확도를 높이는 기법이다. SGD에서 모멘텀 및 NAG는 스텝 방향을 조절하여 최적화 정확도를 높이는 기법이다. Adam은 모멘텀과 RMSProp를 조합하여 스텝 사이즈와 스텝 방향을 조절하여 최적화 정확도를 높이는 기법이다. Nadam은 NAG와 RMSProp를 조합하여 스텝 사이즈와 스텝 방향을 조절하여 최적화 정확도를 높이는 기법이다.Adagrad, AdaDelta, and RMSProp are techniques to increase optimization accuracy by adjusting the step size in SGD. In SGD, momentum and NAG are techniques to increase optimization accuracy by adjusting the step direction. Adam is a technique to increase optimization accuracy by adjusting the step size and step direction by combining momentum and RMSProp. Nadam is a technique to increase optimization accuracy by adjusting the step size and step direction by combining NAG and RMSProp.
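As a hedged illustration of how Adam combines a momentum-style average of the gradient (step direction) with an RMSProp-style average of the squared gradient (step size), the sketch below applies the commonly published Adam update to a single parameter. The name `adam_minimize`, the example loss, and the hyperparameter values are assumptions.

```python
import math

def adam_minimize(grad, w0, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-parameter loss with the Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g        # momentum: average gradient
        v = beta2 * v + (1.0 - beta2) * g * g    # RMSProp: average squared gradient
        m_hat = m / (1.0 - beta1 ** t)           # bias correction for early steps
        v_hat = v / (1.0 - beta2 ** t)
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w

# Example loss f(w) = (w - 3)^2 with gradient 2 * (w - 3).
w_adam = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
```

Dividing by the root of the squared-gradient average makes the effective step size adapt per parameter, while the gradient average smooths the step direction.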

인공 신경망의 학습 속도와 정확도는 인공 신경망의 구조와 학습 최적화 알고리즘의 종류뿐만 아니라, 하이퍼파라미터에 크게 좌우되는 특징이 있다. 따라서, 좋은 학습 모델을 획득하기 위하여는 적당한 인공 신경망의 구조와 학습 알고리즘을 결정하는 것뿐만 아니라, 적당한 하이퍼파라미터를 설정하는 것이 중요하다.The learning speed and accuracy of an artificial neural network have a characteristic that it largely depends on hyperparameters as well as the structure of the artificial neural network and the type of learning optimization algorithm. Therefore, in order to obtain a good learning model, it is important not only to determine an appropriate artificial neural network structure and learning algorithm, but also to set appropriate hyperparameters.

통상적으로 하이퍼파라미터는 실험적으로 다양한 값으로 설정해가며 인공 신경망을 학습시켜보고, 학습 결과 안정적인 학습 속도와 정확도를 제공하는 최적의 값으로 설정한다.Typically, hyperparameters are set to various values experimentally to train the artificial neural network, and as a result of learning, they are set to optimal values that provide stable learning speed and accuracy.

The embodiments according to the present invention described above may be implemented in the form of a computer program executable through various components on a computer, and such a computer program may be recorded on a computer-readable medium. The medium may include a hard disk, a solid state disk (SSD), a silicon disk drive (SDD), magnetic media such as floppy disks and magnetic tape, optical recording media such as CD-ROM and DVD, magneto-optical media such as floptical disks, and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

한편, 상기 컴퓨터 프로그램은 본 발명을 위하여 특별히 설계되고 구성된 것이거나 컴퓨터 소프트웨어 분야의 당업자에게 공지되어 사용 가능한 것일 수 있다. 컴퓨터 프로그램의 예에는, 컴파일러에 의하여 만들어지는 것과 같은 기계어 코드뿐만 아니라 인터프리터 등을 사용하여 컴퓨터에 의해서 실행될 수 있는 고급 언어 코드도 포함될 수 있다.Meanwhile, the computer program may be specially designed and configured for the present invention, or may be known and used by those skilled in the computer software field. Examples of the computer program may include not only machine code generated by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

The above description of the embodiments of the present invention is for illustration, and a person of ordinary skill in the art to which the present invention pertains will understand that the embodiments can be easily modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

본 발명의 범위는 상기 상세한 설명보다는 후술하는 청구범위에 의하여 나타내어지며, 청구범위의 의미 및 범위 그리고 그 균등 개념으로부터 도출되는 모든 변경 또는 변형된 형태가 본 발명의 범위에 포함되는 것으로 해석되어야 한다.The scope of the present invention is indicated by the following claims rather than the above detailed description, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as being included in the scope of the present invention.

100 object detection device
110 processor
120 memory

Claims

obtaining at least one object information on at least one candidate object from an input image for each decoding layer of the plurality of decoding layers by using an object detection model including a plurality of decoding layers;
performing an Efficient Non-Maximum Suppression (E-NMS) operation for each decoding layer based on the at least one object information obtained from each decoding layer; and
Re-performing the E-NMS operation based on the result of the E-NMS operation for each decoding layer
including,
The object information is
Object area prediction information, object intersection over union (IOU) prediction information, and object classification prediction information
including,
The E-NMS operation is
a first operation for determining an initial value of a confidence score for object region prediction information of the at least one candidate object;
a second operation of updating the confidence score based on the object region prediction information and the object IOU prediction information of the at least one candidate object; and
A third operation for determining a final object area of the at least one candidate object based on the updated confidence score
containing,
Object detection method.

The method of claim 1,
The object detection model is
a common encoder including a series of encoding layers for extracting a plurality of first common feature maps from the input image; and
a common decoder including multiple decoding layers for generating a plurality of second common feature maps by multi-scale feature fusion for the plurality of first common feature maps;
containing
Object detection method.

3. The method of claim 2,
The multiple decoding layer,
a top-down layer for outputting a portion of the plurality of first common feature maps by upsampling and adding;
an intermediate layer that performs a convolution operation on the remainder except for a portion of the plurality of first common feature maps and the output of the top-down layer; and
Bottom-Up Path Aggregation Layer for outputting the plurality of second common feature maps by convolution and summing outputs of the intermediate layer
including,
The plurality of decoding layers correspond to the bottom-up path aggregation layer,
Object detection method.

The method of claim 1,
The object detection model is
an object region header for converting a plurality of second common feature maps output from the plurality of decoding layers into an object region feature map for object region detection; and
An object classification header for converting the plurality of second common feature maps into an object classification feature map for object classification
containing,
Object detection method.

5. The method of claim 4,
The object region header and the object classification header include a plurality of convolutional layers sharing a weight with each other,
Object detection method.

The method of claim 1,
In the step of performing the E-NMS operation for each decoding layer, the first operation is
determining the initial value based on a function for the object IOU prediction information of the candidate object,
Object detection method.

The method of claim 1,
The second operation is
an operation of aligning object region prediction information of the at least one candidate object based on the initial value;
an operation for determining an IOU between two different object region prediction information for the at least one candidate object;
an operation of re-determining the IOU based on a predefined weight function; and
Calculation to determine the confidence score based on the re-determined IOU
containing,
Object detection method.

8. The method of claim 7,
The operation for determining the IOU, the operation for re-determining the IOU, and the operation for determining the confidence score can be performed in parallel using an IOU matrix,
Object detection method.

9. The method of claim 4,
The object detection model is
Outputs object size prediction information (pred_reward_scale) based on the object region feature map,
Outputting object classification accuracy prediction information (pred_reward_identification) based on the object classification feature map,
Object detection method.

10. The method of claim 9,
The object size prediction information and the object classification accuracy prediction information are information used in the learning process of the object detection model and not used in the inference process,
Object detection method.

a memory for storing an object detection model including a plurality of decoding layers; and
one or more processors
including,
The processor is
obtaining at least one object information on at least one candidate object from an input image for each decoding layer of the plurality of decoding layers using the object detection model;
Efficient Non-Maximum Suppression (E-NMS) operation is performed on each decoding layer based on the at least one object information obtained for each decoding layer,
configured to re-perform the E-NMS operation based on the result of the E-NMS operation for each decoding layer,
The object information is
Object area prediction information, object IOU prediction information, and object classification prediction information
including,
The E-NMS operation is
a first operation for determining an initial value of a confidence score for object region prediction information of the at least one candidate object;
a second operation of updating the confidence score based on the object region prediction information and the object IOU prediction information of the at least one candidate object; and
A third operation for determining a final object area of the at least one candidate object based on the updated confidence score
containing,
object detection device.

12. The apparatus of claim 11,
The object detection model is
a common encoder including a series of encoding layers for extracting a plurality of first common feature maps from the input image; and
a common decoder including multiple decoding layers for generating a plurality of second common feature maps by multi-scale feature fusion for the plurality of first common feature maps;
containing
object detection device.

13. The object detection device of claim 12,
The multiple decoding layers include:
a top-down layer that upsamples and adds a portion of the plurality of first common feature maps and outputs the result;
an intermediate layer that performs a convolution operation on the remainder of the plurality of first common feature maps, excluding that portion, and on the output of the top-down layer; and
a bottom-up path aggregation layer that outputs the plurality of second common feature maps by convolving and summing the outputs of the intermediate layer,
and the plurality of decoding layers correspond to the bottom-up path aggregation layer,
object detection device.
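
One possible reading of the three-part decoder is sketched below on toy 1-D feature maps. Everything here is an assumption made for illustration: the function names, the choice of three scales (`c3`, `c4`, `c5`), nearest-neighbour upsampling, stride-2 downsampling, and a scalar multiply standing in for a learned convolution. The claim does not fix which feature maps form the "portion" handled by the top-down layer.

```python
def upsample2(x):
    """Nearest-neighbour upsampling by a factor of 2."""
    return [v for v in x for _ in range(2)]


def downsample2(x):
    """Stride-2 subsampling."""
    return x[::2]


def conv(x, w):
    """Scalar multiply used as a stand-in for a learned convolution."""
    return [w * v for v in x]


def add(a, b):
    return [p + q for p, q in zip(a, b)]


def decode(c3, c4, c5, w=1.0):
    """c3, c4, c5: first common feature maps, finest to coarsest."""
    # Top-down layer: upsample the coarser map and add it to the finer one.
    t4 = add(c4, upsample2(c5))
    t3 = add(c3, upsample2(t4))
    # Intermediate layer: convolve the top-down outputs and the remaining map.
    m3, m4, m5 = conv(t3, w), conv(t4, w), conv(c5, w)
    # Bottom-up path aggregation layer: downsample, sum, and convolve.
    n3 = m3
    n4 = conv(add(m4, downsample2(n3)), w)
    n5 = conv(add(m5, downsample2(n4)), w)
    return n3, n4, n5  # second common feature maps, one per decoding layer


n3, n4, n5 = decode([1, 1, 1, 1], [1, 1], [1])
print(len(n3), len(n4), len(n5))  # prints 4 2 1
```

Each returned map keeps the resolution of its input scale, so a detection header can be applied to every decoding layer output separately.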

14. The object detection device of claim 11,
The object detection model includes:
an object region header for converting a plurality of second common feature maps output from the plurality of decoding layers into an object region feature map for object region detection; and
an object classification header for converting the plurality of second common feature maps into an object classification feature map for object classification,
object detection device.

15. The object detection device of claim 14,
The object region header and the object classification header include a plurality of convolutional layers that share weights with each other,
object detection device.

16. The object detection device of claim 11,
The processor is configured to, in order to perform the E-NMS operation for each decoding layer, determine the initial value in the first operation based on a function of the object IOU prediction information of the candidate object,
object detection device.

17. The object detection device of claim 11,
The second operation includes:
an operation of sorting the object region prediction information of the at least one candidate object based on the initial value;
an operation of determining an IOU between two different pieces of object region prediction information for the at least one candidate object;
an operation of re-determining the IOU based on a predefined weight function; and
an operation of determining the confidence score based on the re-determined IOU,
object detection device.
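
The four sub-operations of the second operation can be illustrated with an IOU matrix as follows. This sketch swaps in the publicly known Matrix NMS formulation with a Gaussian weight; the claims only speak of a "predefined weight function", so the Gaussian form, the `sigma` parameter, and the names `box_iou`, `weight`, and `update_scores` are all assumptions.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weight(iou, sigma):
    """Assumed weight function: Gaussian decay of the IoU."""
    return math.exp(-(iou * iou) / sigma)


def update_scores(boxes, init_scores, sigma=0.5):
    # 1) Sort candidates by their initial confidence, highest first.
    order = sorted(range(len(boxes)), key=lambda i: init_scores[i], reverse=True)
    b = [boxes[i] for i in order]
    s = [init_scores[i] for i in order]
    n = len(b)
    # 2) Upper-triangular IoU matrix: m[i][j] is the IoU of box i with the
    #    lower-scored box j.
    m = [[box_iou(b[i], b[j]) if i < j else 0.0 for j in range(n)]
         for i in range(n)]
    # Largest overlap of each box with any higher-scored box.
    cmax = [max((m[k][i] for k in range(i)), default=0.0) for i in range(n)]
    # 3) and 4) Re-weight the IoUs and turn them into a decay factor per box.
    #    Column j depends only on the matrix, not on other columns' results,
    #    so the columns could be evaluated in parallel.
    updated = []
    for j in range(n):
        decay = min((weight(m[i][j], sigma) / weight(cmax[i], sigma)
                     for i in range(j)), default=1.0)
        updated.append(s[j] * decay)
    return order, updated


order, scores = update_scores(
    [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)], [0.8, 0.9, 0.7])
print(order)  # prints [1, 0, 2]
```

With these inputs the top-scored box and the isolated box keep their scores, while the box that heavily overlaps the top-scored one is decayed to less than half of its initial value.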

18. The object detection device of claim 17,
The processor is configured to be able to perform, in parallel by using an IOU matrix, the operation of determining the IOU, the operation of re-determining the IOU, and the operation of determining the confidence score,
object detection device.

19. The object detection device of claim 14,
The processor is configured to, by executing the object detection model:
extract object size prediction information (pred_reward_scale) based on the object region feature map; and
extract object classification accuracy prediction information (pred_reward_identification) based on the object classification feature map,
object detection device.

20. The object detection device of claim 19,
The processor is configured to use the object size prediction information and the object classification accuracy prediction information in the training process of the object detection model and not to use them in the inference process,
object detection device.
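
The idea of auxiliary outputs that exist only during training can be sketched as below. The function `run_heads`, its `training` flag, and the toy mean and max computations are hypothetical stand-ins for the real headers; only the output names `pred_reward_scale` and `pred_reward_identification` come from the claims.

```python
def mean(xs):
    return sum(xs) / len(xs)


def run_heads(region_feat, cls_feat, training=False):
    """Toy headers over flat feature lists; real headers would be conv layers."""
    out = {
        "region": mean(region_feat),       # stand-in for object region prediction
        "classification": mean(cls_feat),  # stand-in for class prediction
    }
    if training:
        # Auxiliary predictions that feed only the training loss.
        out["pred_reward_scale"] = max(region_feat)
        out["pred_reward_identification"] = max(cls_feat)
    return out


print(sorted(run_heads([1.0, 2.0], [0.5, 0.7])))
# prints ['classification', 'region']
```

Because the auxiliary branches are skipped at inference time, they add no computation to the deployed model under this reading.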