KR102261894B1

KR102261894B1 - Apparatus and method for object detection

Info

Publication number: KR102261894B1
Application number: KR1020190117656A
Authority: KR
Inventors: 유영준; 한동윤; 윤상두
Original assignee: 네이버 주식회사; 라인 가부시키가이샤
Priority date: 2019-06-13
Filing date: 2019-09-24
Publication date: 2021-06-08
Also published as: KR20200143193A

Abstract

The present application relates to an object recognition apparatus and an object recognition method. An object recognition apparatus according to an embodiment of the present invention may include: a feature extraction block that, upon receiving a target image, performs a feature extraction operation to generate a feature image; and a backbone network that repeatedly performs a feature map extraction operation on the feature image to generate a plurality of first feature maps, each having a different resolution depending on the number of times the feature map extraction operation has been applied.

Description

Object recognition apparatus and method for object recognition {Apparatus and method for object detection}

The present application relates to an object recognition apparatus and an object recognition method capable of recognizing an object included in a target image.

Recently, as various applications using face information have emerged, interest in practical face detection methods has grown. Face recognition systems are used in security systems that grant access to specific persons and to protect personal privacy in surveillance environments. Facial expression recognition is used to analyze changes of expression in the face region and to interpret human emotions from outward changes in expression. As applications that use face information expand in scope and number, research is actively being conducted on highly practical face detection methods capable of accurately extracting face regions in various environments.

Recently, learning-based convolutional neural network (CNN) methods have achieved great results in various fields of computer vision. Although CNN-based face detection methods have made great progress in detection performance, the increased complexity of such systems has raised doubts about their practicality. The number of windows that can be extracted from a 320×240 image is on the order of a billion. For each of these numerous patches, feature information is extracted using a CNN and the patch is classified as face or non-face. This illustrates the trade-off between face detection performance and system complexity. In addition, convolution operations are performed repeatedly on the overlapping regions between adjacent windows, which introduces redundant computation. Furthermore, because the input and output sizes of the fully-connected layers of a convolutional neural network are fixed, all input data passing through the network must be resized to a fixed size, which further increases the computational complexity of the system.
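The text does not say how the window count is obtained. One way to arrive at a figure of this magnitude, assuming that a "window" means any axis-aligned rectangle with integer pixel corners, is sketched below; the function name `count_windows` is hypothetical.

```python
def count_windows(width: int, height: int) -> int:
    """Count all axis-aligned rectangular windows inside a width x height image.

    Assumption: a window is defined by a horizontal pixel interval and a
    vertical pixel interval, and every size and position is allowed.
    """
    # Number of intervals [a, b] with 1 <= a <= b <= n is n * (n + 1) / 2.
    horizontal_intervals = width * (width + 1) // 2
    vertical_intervals = height * (height + 1) // 2
    return horizontal_intervals * vertical_intervals


print(count_windows(320, 240))  # 1485331200, i.e. roughly 1.5 billion
```

Under this counting rule the 320×240 case yields about 1.5 billion candidate windows, which is consistent with the order of magnitude stated above.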

An object of the present application is to provide an object recognition apparatus and an object recognition method capable of achieving high object recognition performance for objects of various sizes with a relatively small capacity.

Another object of the present application is to provide an object recognition apparatus and an object recognition method capable of generating a plurality of feature maps by iteratively reusing a backbone network.

An object recognition apparatus according to an embodiment of the present invention may include: a feature extraction block that, upon receiving a target image, performs a feature extraction operation to generate a feature image; and a backbone network that repeatedly performs a feature map extraction operation on the feature image to generate a plurality of first feature maps, each having a different resolution depending on the number of times the feature map extraction operation has been applied.

An object recognition method according to an embodiment of the present invention may include: generating a feature image by performing a feature extraction operation on an input target image; repeatedly performing a feature map extraction operation on the feature image to generate a plurality of first feature maps, each having a different resolution depending on the number of times the feature map extraction operation has been applied; and, using respective bounding boxes that move within the plurality of first feature maps, determining whether an object is included in a bounding box and, when the object is included, regressing the bounding box in the first feature map to a position in the target image to extract position information of the object.

An object recognition method according to another embodiment of the present invention may include: generating a feature image by performing a feature extraction operation on an input target image; repeatedly performing a feature map extraction operation on the feature image to generate a plurality of first feature maps, each having a different resolution depending on the number of times the feature map extraction operation has been applied; setting the first feature map having the lowest resolution among the plurality of first feature maps as the initial second feature map; generating a second feature map by upsampling the most recently generated second feature map and connecting the upsampled result, by a skip connection, with the first feature map having the same resolution; and, using respective bounding boxes that move within the plurality of second feature maps, determining whether an object is included in a bounding box and, when the object is included, regressing the bounding box in the second feature map to a position in the target image to extract position information of the object.

The means for solving the problem described above do not enumerate all features of the present invention. The various features of the present invention, and the advantages and effects that follow from them, may be understood in more detail with reference to the specific embodiments below.

The object recognition apparatus and object recognition method according to an embodiment of the present invention can generate a plurality of feature maps by iteratively reusing a backbone network, and can therefore dramatically reduce the number of parameters required to implement the object recognition apparatus.

According to the object recognition apparatus and object recognition method according to an embodiment of the present invention, iterative learning is performed on objects of various sizes, so the recognition rate for relatively small objects can be increased.

However, the effects achievable by the object recognition apparatus and object recognition method according to embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a block diagram illustrating an object recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating a feature extraction unit according to an embodiment of the present invention.
FIG. 3 is a block diagram illustrating an inverted residual module according to an embodiment of the present invention.
FIG. 4 is a block diagram illustrating an upsampling module, a classification unit, and a position detection unit according to an embodiment of the present invention.
FIG. 5 is a block diagram illustrating the structure of a backbone network according to an embodiment of the present invention.
FIG. 6 is a block diagram illustrating an object recognition apparatus according to another embodiment of the present invention.
FIG. 7 is a flowchart illustrating an object recognition method according to an embodiment of the present invention.
FIG. 8 is a flowchart illustrating an object recognition method according to another embodiment of the present invention.

Hereinafter, embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. The same or similar components are assigned the same reference numerals regardless of the figure in which they appear, and redundant descriptions are omitted. The suffixes 'module' and 'unit' used for components in the following description are assigned or used interchangeably only for ease of drafting the specification and do not by themselves have distinct meanings or roles. That is, the term 'unit' as used in the present invention refers to software or a hardware component such as an FPGA or an ASIC, and a 'unit' performs certain roles. However, a 'unit' is not limited to software or hardware. A 'unit' may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors. Thus, as an example, a 'unit' includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functions provided within components and 'units' may be combined into a smaller number of components and 'units' or further divided into additional components and 'units'.

In addition, in describing the embodiments disclosed in this specification, detailed descriptions of related known technologies are omitted where they are judged likely to obscure the gist of the embodiments. The accompanying drawings are provided only to facilitate understanding of the embodiments disclosed in this specification; the technical idea disclosed herein is not limited by the accompanying drawings and should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

FIG. 1 is a block diagram illustrating an object recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 1, the object recognition apparatus according to an embodiment of the present invention may include a feature extraction unit 110, a backbone network unit 120, an upsampling unit 130, a classification unit 140, and a position detection unit 150.

Hereinafter, the object recognition apparatus according to an embodiment of the present invention is described with reference to FIG. 1.

Upon receiving a target image, the feature extraction unit 110 may generate a feature image by performing a feature extraction operation on the target image. As illustrated in FIG. 2, the feature extraction unit 110 may include a 3×3 convolution layer, a batch normalization layer, and an activation layer, and may perform the feature extraction operation by passing the input through each layer in sequence. Here, for the 3×3 convolution layer, s denotes the stride of the convolution layer, p denotes whether padding is performed, g denotes the groups, c_in denotes the width of the input channels, and c_out denotes the width of the output channels. That is, according to FIG. 2, the feature extraction unit 110 may be configured to perform a 3×3 convolution filter operation with a stride of 2, with padding, an input channel width of 3, and an output channel width of h.

For example, when the target image is a color image, the input may include three channels corresponding to the three RGB colors, and the 3×3 convolution layer of the feature extraction unit 110 may have three input channels. Because the output channel width is h, the 3×3 convolution layer can expand the feature image to h channels, and because the stride is 2, it can reduce the resolution of the feature image to half that of the target image. For example, when the resolution of the target image is 640×640, the feature image may be reduced to 320×320.
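The halving from 640×640 to 320×320 can be checked with the usual output-size formula for a strided convolution. The sketch below assumes a padding of one pixel on each side, which the text does not state explicitly (it only says that padding is performed); `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(size: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial output size of a convolution along one dimension.

    Uses floor((size + 2 * padding - kernel) / stride) + 1.
    The padding value of 1 is an assumption made for this illustration.
    """
    return (size + 2 * padding - kernel) // stride + 1


# A 640-pixel side becomes 320 pixels after the stride-2 3x3 convolution.
print(conv_output_size(640))  # 320
```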

The backbone network unit 120 may repeatedly perform a feature map extraction operation on the feature image and may generate first feature maps of different resolutions depending on the number of times the feature map extraction operation has been performed.

Referring to FIG. 1, the backbone network unit 120 may include a plurality of inverted residual modules 121. Completing the operations of one set comprising all of the inverted residual modules 121 corresponds to performing the feature map extraction operation once.

That is, performing the feature map extraction operation once yields a first feature map f1, and the first feature map f1 may then be fed back into the backbone network unit 120 to perform the feature map extraction operation a second time. In this case, the backbone network unit 120 may generate a first feature map f2 corresponding to two applications of the feature map extraction operation. By repeating in the same manner, first feature maps f3, f4, f5, and f6 may be generated in sequence. Although an embodiment that generates six first feature maps is presented here, the invention is not limited thereto, and the number of first feature maps generated may vary depending on the embodiment.
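The feedback loop described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: `extract_first_feature_maps` is a hypothetical name, and the stand-in operations track only the side length of a square map, under the assumption (taken from the 640×640 example in this description) that the feature extraction and each backbone pass halve the resolution.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def extract_first_feature_maps(
    x: T, E: Callable[[T], T], F: Callable[[T], T], n_maps: int = 6
) -> List[T]:
    """Apply E once, then the same F repeatedly, keeping every output of F."""
    f = E(x)  # feature image produced by the feature extraction unit
    maps: List[T] = []
    for _ in range(n_maps):
        f = F(f)  # the same backbone (same parameters) is reused each time
        maps.append(f)
    return maps


# Stand-ins: represent a map only by its side length and halve it per step.
def halve(side: int) -> int:
    return side // 2


print(extract_first_feature_maps(640, E=halve, F=halve))
# [160, 80, 40, 20, 10, 5]
```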

A conventional SSD (Single Shot Detector) applies a plurality of convolution filters to the target image in sequence and generates a feature map corresponding to each convolution filter. That is, because a conventional SSD generates feature maps by passing the input through different convolution filters in sequence, generating six feature maps requires designing the parameters of six different convolution filters.

In contrast, the present invention repeatedly reuses the same feature map extraction operation, so a plurality of feature maps can be extracted by designing only the parameters for that one operation. Therefore, according to the present invention, the number of required parameters can be dramatically reduced compared to a conventional SSD.
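The parameter saving can be illustrated with a simple count. The sketch below is a rough model only: it assumes every stage of the non-shared design has the same number of parameters, and the figure of 50,000 parameters per pass is a hypothetical value chosen for illustration, not a number from the patent.

```python
def backbone_parameters(params_per_pass: int, n_maps: int, shared: bool) -> int:
    """Total backbone parameters needed to produce n_maps feature maps.

    shared=True  : one set of weights is reused for every pass.
    shared=False : each feature map has its own stage with its own weights
                   (assumed here to be the same size as the shared stage).
    """
    return params_per_pass if shared else params_per_pass * n_maps


PARAMS_PER_PASS = 50_000  # hypothetical value for illustration only

print(backbone_parameters(PARAMS_PER_PASS, 6, shared=True))   # 50000
print(backbone_parameters(PARAMS_PER_PASS, 6, shared=False))  # 300000
```

Under these assumptions the shared design needs one sixth of the backbone parameters when six feature maps are produced.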

Meanwhile, the backbone network unit 120 may be expressed as follows.
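The equation referred to here is missing from this text. Judging from the symbol definitions that accompany it (E, F, x, f₀, and f₁ through f_N) and from the description of feeding the output back in as the input, a plausible reconstruction, offered as an assumption rather than the patent's exact notation, is:

```latex
f_0 = E(x), \qquad f_n = F(f_{n-1}), \quad n = 1, 2, \ldots, N
```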

Here, {f₁, f₂, ..., f_N} are the respective first feature maps, N is an integer greater than or equal to 1, and F(·) is the feature map extraction operation of the backbone network unit 120. In addition, E(·) is the feature extraction operation of the feature extraction unit 110, x is the target image, and f₀ is the feature image. That is, the plurality of first feature maps may be generated by repeatedly feeding the output back in as the input.

Additionally, the inverted residual modules 121 included in the backbone network unit 120 may be divided into a first inverted residual module and second inverted residual modules. Here, the first inverted residual module is the one computed first within the backbone network unit 120, and the second inverted residual modules are those computed in sequence after the first inverted residual module. Specifically, as shown in FIG. 5, the backbone network unit 120 may be implemented to include one first inverted residual module and a plurality of second inverted residual modules. FIG. 3 shows examples of the two module types: FIG. 3(a) corresponds to an example of the first inverted residual module, and FIG. 3(b) to an example of the second inverted residual module.

Here, each inverted residual module may perform a depth-wise separable convolution operation, which reduces the amount of computation required for the feature map extraction operation.

Specifically, as shown in FIG. 3(a), the first inverted residual module may include a 3×3 convolution layer, a batch normalization layer, an activation layer, a 1×1 convolution layer, and a batch normalization layer. Because the feature extraction unit 110 has already expanded the channel width of the feature image from 3 to h, the 3×3 convolution layer performs a depth-wise convolution operation, and the 1×1 convolution layer then performs a point-wise convolution operation that reduces the channel width from h back to c.

In addition, as shown in FIG. 3(b), the second inverted residual module may sequentially include a 1×1 convolution layer, a batch normalization layer, an activation layer, a 3×3 convolution layer, a batch normalization layer, an activation layer, a 1×1 convolution layer, and a batch normalization layer, and may thereby perform a depth-wise separable convolution operation. That is, the first 1×1 convolution layer performs a point-wise convolution operation that expands the channel width from c to h, the 3×3 convolution layer then performs a depth-wise convolution operation on the expanded channel width, and the final 1×1 convolution layer performs a point-wise convolution operation that reduces the channel width from h back to c.
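To make the saving from depth-wise separable convolution concrete, the following sketch counts the convolution weights in one second inverted residual module and compares a depth-wise plus point-wise pair with a standard 3×3 convolution. Batch normalization parameters and biases are ignored, and the values c = 32 and h = 64 are hypothetical, since the text does not give concrete channel widths.

```python
def inverted_residual_params(c: int, h: int, kernel: int = 3) -> int:
    """Convolution weights in a c -> h -> h -> c inverted residual module."""
    expand = c * h                    # 1x1 point-wise convolution, c -> h
    depthwise = kernel * kernel * h   # 3x3 depth-wise convolution, one filter per channel
    project = h * c                   # 1x1 point-wise convolution, h -> c
    return expand + depthwise + project


def standard_conv_params(c_in: int, c_out: int, kernel: int = 3) -> int:
    """Weights in an ordinary (non-grouped) kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out


C, H = 32, 64  # hypothetical channel widths for illustration only

print(inverted_residual_params(C, H))  # 4672
# Depth-wise 3x3 followed by point-wise 1x1 (h -> c) versus one standard 3x3 (h -> c):
print(3 * 3 * H + H * C)               # 2624
print(standard_conv_params(H, C))      # 18432
```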

Meanwhile, each activation layer included in the first and second inverted residual modules may use PReLU (Parametric Rectified Linear Unit) or Leaky-ReLU as its activation function. Conventionally, ReLU (Rectified Linear Unit) has been used as the activation function, but because ReLU sets negative values to 0, problems such as loss of information can arise when the feature map extraction operation is applied repeatedly. Therefore, PReLU or Leaky-ReLU, which can preserve negative values, is used here as the activation function to enable more accurate object recognition.
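The difference between these activation functions on negative inputs can be seen in a few lines. This is a scalar sketch for illustration; the slope values are hypothetical, and in PReLU the slope would be a learned parameter rather than a fixed constant.

```python
def relu(x: float) -> float:
    """ReLU: negative inputs are replaced by 0, so their magnitude is lost."""
    return x if x > 0 else 0.0


def leaky_relu(x: float, negative_slope: float = 0.01) -> float:
    """Leaky-ReLU: negative inputs are scaled by a small fixed slope.

    PReLU has the same form, but negative_slope is learned during training.
    """
    return x if x >= 0 else negative_slope * x


for value in (-2.0, -0.5, 1.5):
    print(value, relu(value), leaky_relu(value, negative_slope=0.25))
# -2.0 0.0 -0.5
# -0.5 0.0 -0.125
# 1.5 1.5 1.5
```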

Additionally, the backbone network unit 120 may set the stride of the inverted residual module 121 that is computed last among the plurality of inverted residual modules to 2, and set the stride of the remaining inverted residual modules 121 to 1. In this case, the resolution of each newly generated first feature map is reduced to half that of its input. For example, when the resolution of the feature image is 320×320, the resolution of the first feature map f1 is reduced to 160×160, and the first feature maps f2, f3, f4, f5, and f6 are then generated with resolutions of 80×80, 40×40, 20×20, 10×10, and 5×5, respectively. The number of feature maps can be increased or decreased depending on whether the priority is to detect smaller faces at the cost of speed and computation, or to secure high speed and low computation.

Here, when the resolution of a feature map is halved, relatively larger objects can be recognized from that feature map. Object detection may determine whether an object is contained within a preset bounding box, and when the resolution of the feature map is halved, the region covered by the bounding box becomes twice as wide. An object that previously did not fit entirely within the bounding box may then be fully contained in it, which makes it possible to detect relatively large objects. That is, relatively small objects can be detected in high-resolution feature maps, and relatively large objects can be detected in low-resolution feature maps.
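The relationship between feature map resolution and the image region a bounding box covers can be sketched numerically. The example assumes a 640×640 target image and the six feature map sizes given in this description; the box size of 3 cells and the helper names are hypothetical.

```python
def cell_stride(image_size: int, map_size: int) -> int:
    """Number of image pixels spanned by one feature map cell along one side."""
    return image_size // map_size


def box_extent_in_image(box_cells: int, image_size: int, map_size: int) -> int:
    """Side length in image pixels of a box that is box_cells wide on the feature map."""
    return box_cells * cell_stride(image_size, map_size)


IMAGE_SIZE = 640
BOX_CELLS = 3  # hypothetical fixed box size, measured in feature map cells

for map_size in (160, 80, 40, 20, 10, 5):
    print(map_size, box_extent_in_image(BOX_CELLS, IMAGE_SIZE, map_size))
# 160 12
# 80 24
# 40 48
# 20 96
# 10 192
# 5 384
```

Each halving of the feature map doubles the side length of the image region that the same box covers, which is why the lower-resolution maps are suited to larger objects.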

In this way, by varying the resolution of each first feature map it generates, the backbone network unit 120 can recognize objects of various sizes included in the target image.

Depending on the embodiment, a 3×3 convolution filter with a stride of 2 may instead be added after the last of the plurality of inverted residual modules included in the backbone network unit 120, so that the resolution of the feature maps generated by the backbone network unit 120 is halved.

Meanwhile, as shown in FIG. 6, objects may also be detected directly from the first feature maps generated by the backbone network unit 120. However, when the first feature maps are used, relatively small objects may be difficult to detect. That is, relatively small objects are detected in the low-level first feature maps, and these low-level first feature maps are generated before the inverted residual modules have been stacked to a sufficient depth.

To address this, the object recognition apparatus according to an embodiment of the present invention may further include an upsampling unit 130. That is, an FPN (Feature Pyramid Network) structure may be introduced so that even the low-level feature maps benefit from a sufficient depth of inverted residual modules.

Specifically, the upsampling unit 130 may generate a second feature map by upsampling the most recently generated second feature map and connecting the upsampled result, by a skip connection, with the first feature map having the same resolution. Here, the upsampling unit 130 may set the first feature map having the lowest resolution among the plurality of first feature maps as the initial second feature map.

That is, as shown in FIG. 1, the last generated first feature map f6 may be set as the initial second feature map g1. The second feature map g1 is then upsampled, and the upsampled result is added to the preceding first feature map f5 to generate the second feature map g2. Here, the second feature map g2 may have the same resolution as the first feature map f5.

In addition, the upsampling unit 130 may upsample the second feature map g2 and add it to the first feature map f4, which has the same resolution, to generate the second feature map g3, and may then generate the remaining second feature maps by repeating in the same manner.
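The top-down procedure described above can be sketched with plain nested lists. This is a simplified illustration, not the patent's upsampling module: it substitutes nearest-neighbor 2× upsampling for the bilinear upsampling and convolution layers, treats each map as a single-channel 2D grid, and uses hypothetical function names.

```python
from typing import List

Grid = List[List[float]]


def upsample2x(m: Grid) -> Grid:
    """Nearest-neighbor 2x upsampling (a stand-in for the upsampling module)."""
    out: Grid = []
    for row in m:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two grids of the same size (the skip connection)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def build_second_feature_maps(first_maps: List[Grid]) -> List[Grid]:
    """first_maps is ordered from highest to lowest resolution (f1 ... fN)."""
    g = first_maps[-1]            # the lowest-resolution first map is the initial second map
    second_maps = [g]
    for f in reversed(first_maps[:-1]):
        g = add(upsample2x(g), f)  # upsample the previous second map, add the matching first map
        second_maps.append(g)
    return second_maps


f1 = [[1.0] * 4 for _ in range(4)]
f2 = [[1.0] * 2 for _ in range(2)]
f3 = [[1.0]]
g1, g2, g3 = build_second_feature_maps([f1, f2, f3])
print(g2)  # [[2.0, 2.0], [2.0, 2.0]]
```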

Here, the operation of the upsampling unit 130 may be expressed by the following equation, where {g₁, g₂, ..., g_N} are the respective second feature maps, {f₁, f₂, ..., f_N} are the respective first feature maps, N is an integer greater than or equal to 1, and U_i(·) is the upsampling function.
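The equation itself is missing from this text. Based on the symbols defined above and on the worked description (g1 is set to f6, g2 is the upsampled g1 added to f5, g3 is the upsampled g2 added to f4), one consistent reconstruction is given below. The exact indexing of U_i is an assumption.

```latex
g_1 = f_N, \qquad g_{i+1} = U_i(g_i) + f_{N-i}, \quad i = 1, 2, \ldots, N-1
```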

Meanwhile, as shown in FIG. 1, the upsampling unit 130 may include an upsampling module 131, and the upsampling module 131 may perform upsampling on each of the second feature maps. Referring to FIG. 4(a), the upsampling module 131 may include a bilinear upsample layer, a 3×3 convolution layer, a 1×1 convolution layer, a batch normalization layer, and an activation function layer.

The classification unit 140 may determine whether each bounding box contains an object, using bounding boxes that move within the plurality of second feature maps. Here, the object detected by the classification unit 140 may be a face.

In addition, when the classification unit 140 detects an object, the position detection unit 150 may extract position information of the object by regressing the bounding box in the second feature map to a position in the target image. That is, position information of the object included in the target image may be provided, and the position of the object may be marked in the target image using the position information.

Meanwhile, FIGS. 4(b) and 4(c) correspond to the classification unit 140 and the position detection unit 150, respectively, and each of the classification unit 140 and the position detection unit 150 may operate as a 3×3 convolution filter. The classification unit 140 indicates two cases, namely that the bounding box contains an object or that it does not, so it may have two output channels. When there are four output channels, Maxout may be used to select two of the four channels, which can lower the false positive rate for relatively small objects.
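As an illustration of reducing four classification channels to two with Maxout, the sketch below takes the maximum over consecutive groups of channels. The text does not state how the four channels are grouped, so the pairwise grouping used here, and the function name `maxout`, are assumptions made only for illustration.

```python
def maxout(scores, group_size=2):
    """Maxout over consecutive channel groups.

    scores: per-channel values for one bounding box position.
    Assumption: channels are grouped as (0, 1), (2, 3), ...; the patent
    text does not specify the grouping.
    """
    if len(scores) % group_size != 0:
        raise ValueError("channel count must be a multiple of group_size")
    return [max(scores[i:i + group_size])
            for i in range(0, len(scores), group_size)]


raw_scores = [0.2, 1.5, -0.3, 0.9]   # four output channels
two_channels = maxout(raw_scores)    # reduced to two channels
```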

In the case of the position detection unit 150, the output may have four dimensions, which may be, for example, the width, the height, and the center-point coordinates of the bounding box.
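A four-value box description of this kind (center coordinates, width and height) is commonly converted to corner coordinates before the box is drawn on the target image. The helper below is a minimal sketch of that conversion. The function name and the (cx, cy, w, h) ordering are assumptions, not taken from the patent.

```python
def center_to_corners(cx, cy, w, h):
    """Convert a box given by center (cx, cy), width w and height h
    into (x_min, y_min, x_max, y_max) corner coordinates."""
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


box = center_to_corners(50, 40, 20, 10)
```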

Meanwhile, the object recognition apparatus according to an embodiment of the present invention may be trained jointly using a multitask loss function. That is, the feature extraction unit 110, the backbone network unit 120, the upsampling unit 130, the classification unit 140, the position detection unit 150, and the other components may be trained simultaneously using the multitask loss function.

Specifically, the multitask loss function may be of the form

$$ L(\{c_j\}, \{r_j\}) = \frac{1}{N_{cls}} \sum_j l_c(c_j, c_j^*) + \lambda \frac{1}{N_{reg}} \sum_j c_j^* \, l_r(r_j, r_j^*) $$

Here, l_c is the classification loss, l_r is the regression loss, j is the index of an anchor box, c_j and r_j are the classification and regression outputs for that anchor box, and r_j^* is the ground truth corresponding to the anchor box. In addition, c_j^* is set to 0 or 1: it may be set to 1 if the Jaccard overlap of the anchor box is greater than or equal to a reference value t, and to 0 if it is less than the reference value t. N_cls is the total number of positive samples and negative samples used during training, N_reg = ∑_j c_j^*, and λ may be an arbitrary parameter. Because the regression loss is computed only for positive samples, N_reg can be calculated as ∑_j c_j^*. The classification loss may be set to a cross-entropy loss, and the regression loss may be set to a smooth L1 loss.
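The ingredients named above (Jaccard overlap for labeling, cross-entropy for classification, smooth L1 for regression, with the regression term counted only for positive samples) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the classification output is simplified to a single object probability per box, the default values t = 0.5 and lam = 1.0 are arbitrary choices, and all function names are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard overlap (IoU) of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def assign_label(anchor, ground_truth, t=0.5):
    """c_j^* = 1 if the overlap reaches the reference value t, else 0.
    The value t = 0.5 is an assumption for illustration."""
    return 1 if jaccard(anchor, ground_truth) >= t else 0


def smooth_l1(pred, target):
    """Smooth L1 loss summed over the box coordinates."""
    total = 0.0
    for p, q in zip(pred, target):
        d = abs(p - q)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def cross_entropy(p_object, label):
    """Binary cross-entropy for one box; p_object is the predicted
    probability that the box contains an object."""
    p = min(max(p_object, 1e-12), 1.0 - 1e-12)
    return -math.log(p) if label == 1 else -math.log(1.0 - p)


def multitask_loss(p_obj, labels, r_pred, r_true, lam=1.0):
    """Classification term averaged over all samples plus lam times the
    regression term averaged over positive samples only."""
    n_cls = len(labels)
    n_reg = sum(labels)  # N_reg = sum of c_j^*
    cls = sum(cross_entropy(p, c) for p, c in zip(p_obj, labels)) / n_cls
    reg = sum(c * smooth_l1(r, rt)
              for c, r, rt in zip(labels, r_pred, r_true))
    reg = reg / n_reg if n_reg else 0.0
    return cls + lam * reg


loss = multitask_loss(
    p_obj=[0.5, 0.5],
    labels=[1, 0],
    r_pred=[[0.5, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    r_true=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
)
```

In the example call, the second box is a negative sample, so its large regression error does not contribute to the loss.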

Thereafter, using the input training data, the parameters of each component of the object recognition apparatus may be set so that the classification loss and the regression loss of the multitask loss function are minimized.

Here, because the object recognition apparatus according to an embodiment of the present invention generates the feature maps by repeatedly reusing the backbone network unit 120, the number of required parameters can be greatly reduced compared with a conventional object recognition apparatus. In addition, inverted residual modules for feature map extraction can be added in proportion to the reduction in parameters, so that more accurate object recognition can be achieved.

In addition, because the backbone network unit 120 according to an embodiment of the present invention is produced by repeatedly learning objects of various sizes, the recognition rate for relatively small objects can be improved.

FIG. 5 is a block diagram showing the structure of a backbone network according to an embodiment of the present invention. FIG. 5 shows three embodiments of the backbone network, corresponding to FIGS. 5(a), 5(b) and 5(c), respectively.

The first embodiment has 32 output channels, the second embodiment has 48 output channels, and the third embodiment has 64 output channels. On the other hand, the first embodiment includes seven inverted residual modules, while the second and third embodiments each include five inverted residual modules. In this case, the first embodiment may include 60,000 parameters, the second embodiment 100,000 parameters, and the third embodiment 160,000 parameters. In the second and third embodiments, the number of inverted residual modules was reduced to limit the number of parameters in exchange for the larger number of output channels.

In terms of object recognition performance, the third embodiment performed best, followed by the second embodiment and then the first embodiment. This indicates that the number of channels is a more important factor than depth, that is, the number of inverted residual modules.

FIG. 7 is a flowchart illustrating an object recognition method according to an embodiment of the present invention.

Referring to FIG. 7, the object recognition method according to an embodiment of the present invention may include a feature image generation step (S110), a first feature map generation step (S120), and an object recognition step (S130), and each step may be performed by the object recognition apparatus.

Hereinafter, the object recognition method according to an embodiment of the present invention will be described with reference to FIG. 7.

In the feature image generation step (S110), a feature image may be generated by performing a feature extraction operation on the received target image. Here, the feature image may be generated through a 3×3 convolution operation, and the stride of the 3×3 convolution operation may be set to 2 so that the resolution of the feature image is half that of the target image. In addition, the feature image may be generated by expanding the three RGB channels of the target image to h channels.
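The halving of resolution by a stride-2 3×3 convolution follows from the standard convolution output-size formula. The short sketch below evaluates it, assuming a padding of 1 pixel and a 640-pixel input side. Neither value is stated in the text, and the function name is hypothetical.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output side length of a convolution along one spatial dimension.
    padding=1 is an assumption; the text only gives kernel 3 and stride 2."""
    return (size + 2 * padding - kernel) // stride + 1


halved = conv_out_size(640)            # stride 2 halves the resolution
same = conv_out_size(640, stride=1)    # stride 1 keeps the resolution
```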

In the first feature map generation step (S120), a feature map extraction operation is repeatedly performed on the feature image, and a plurality of first feature maps, each having a different resolution depending on the number of times the feature map extraction operation has been applied, may be generated.

Here, the feature map extraction operation may correspond to the operation of one set that includes all of a plurality of preset inverted residual modules. That is, performing the feature map extraction operation once on the feature image produces the first feature map f1, and performing the feature map extraction operation again on the first feature map f1 produces the first feature map f2. The first feature maps f3, f4, f5 and f6 may then be generated sequentially by repeating the same procedure. Because the same feature map extraction operation is reused repeatedly, a plurality of feature maps can be extracted while designing parameters for only a single feature map extraction operation.

Here, each inverted residual module may perform a depth-wise separable convolution operation, which reduces the amount of computation required for the feature map extraction operation. Meanwhile, the inverted residual modules may use PReLU (Parametric Rectified Linear Unit) or Leaky-ReLU as an activation function.
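To see why a depth-wise separable convolution is cheaper, the sketch below compares weight counts (biases ignored) for a standard k×k convolution and for a depth-wise k×k convolution followed by a 1×1 point-wise convolution. It counts only that pair of layers, not a full inverted residual module, and the channel count of 32 is used purely as an example. A PReLU helper is included; Leaky-ReLU is the same function with a fixed rather than learned slope. All names are hypothetical.

```python
def standard_conv_weights(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_weights(c_in, c_out, k=3):
    """Weights of a depth-wise k x k convolution (one filter per input
    channel) followed by a 1 x 1 point-wise convolution (no bias)."""
    return k * k * c_in + c_in * c_out


def prelu(x, a):
    """PReLU: identity for positive inputs, slope a for the rest.
    With a fixed (not learned) this is Leaky-ReLU."""
    return x if x > 0 else a * x


standard = standard_conv_weights(32, 32)
separable = depthwise_separable_weights(32, 32)
```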

Additionally, the stride of the inverted residual module that is computed last among the plurality of inverted residual modules may be set to 2, and the stride of the remaining inverted residual modules may be set to 1. That is, each feature map extraction operation halves the resolution of the first feature map. By changing the resolution of each first feature map in this way, objects of various sizes included in the target image can be detected.
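The repeated reuse of one feature map extraction operation can be sketched as a loop that applies the same function to its own output. In the toy example below each feature map is represented only by its side length, and the shared operation is a stand-in that halves it, which is what a set of modules with a single stride-2 module would do to the resolution. The 320-pixel feature image size and the names `iterate_backbone` and `halve` are assumptions for illustration.

```python
def iterate_backbone(extract, feature_image, n):
    """Apply the same extraction function n times, collecting each result:
    f_i = extract(f_{i-1}), starting from the feature image."""
    maps, current = [], feature_image
    for _ in range(n):
        current = extract(current)   # same operation (same parameters) reused
        maps.append(current)
    return maps


def halve(side):
    """Stand-in for the feature map extraction operation: it only models
    the halving of resolution caused by the final stride-2 module."""
    return side // 2


# Assumed 320-pixel feature image (for example, a 640-pixel target image
# after a stride-2 feature extraction step).
sizes = iterate_backbone(halve, 320, 6)
```

Under these assumptions the six first feature maps have side lengths 160, 80, 40, 20, 10 and 5, so each map covers a different object scale.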

In the object recognition step (S130), whether each bounding box contains an object may be determined using bounding boxes that move within the plurality of first feature maps. When an object is included, position information of the object may be extracted by regressing the bounding box in the first feature map to a position in the target image. Here, the object recognition step (S130) may be implemented using a 3×3 convolution filter, and the position information may include the width, the height, and the center-point coordinates of the bounding box.

FIG. 8 is a flowchart illustrating an object recognition method according to another embodiment of the present invention.

Referring to FIG. 8, the object recognition method according to another embodiment of the present invention may, compared with the object recognition method of FIG. 7, further include an initial setting step (S130) and a second feature map generation step (S140).

That is, although objects can be detected using the first feature maps as in FIG. 7, relatively small objects may be difficult to detect when only the first feature maps are used.

To address this, the object recognition method according to another embodiment of the present invention may further include the initial setting step (S130) and the second feature map generation step (S140), thereby introducing an FPN (Feature Pyramid Network) structure.

Specifically, in the initial setting step (S130), the first feature map having the lowest resolution among the plurality of first feature maps may be set as the initial second feature map. Then, in the second feature map generation step (S140), a second feature map may be generated by upsampling the second feature map generated immediately before and combining the upsampled result, through a skip connection, with the first feature map that has the same resolution as the upsampled result.

That is, the last-generated first feature map f6 may be set as the initial second feature map g1. The second feature map g1 is then upsampled, and the upsampled result is added to the preceding first feature map f5 to generate the second feature map g2. Here, the second feature map g2 may have the same resolution as the first feature map f5.

In addition, the second feature map g2 may be upsampled and added to the first feature map f4, which has the same resolution, to generate the second feature map g3. The remaining second feature maps may be generated by repeating the same procedure.

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium may store a computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined, is not limited to a medium directly connected to a particular computer system, and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software. Accordingly, the above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the equivalent scope of the present invention are included in the scope of the present invention.

The present invention is not limited by the above-described embodiments and the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that the components according to the present invention can be substituted, modified and changed without departing from the technical spirit of the present invention.

100: object recognition apparatus 110: feature extraction unit
120: backbone network unit 121: inverted residual module
130: upsampling unit 131: upsampling module
140: classification unit 150: position detection unit

Claims

An object recognition apparatus comprising:
a feature extraction unit (feature extraction block) that, upon receiving a target image, generates a feature image by performing a feature extraction operation; and
a backbone network unit that repeatedly performs a feature map extraction operation on the feature image to generate a plurality of first feature maps each having a different resolution depending on the number of times the feature map extraction operation is applied,
wherein the backbone network unit generates the first feature maps {f₁, f₂, ..., f_N} according to

$$ f_0 = E(x), \qquad f_i = F(f_{i-1}), \quad i = 1, \dots, N $$

where N is an integer greater than or equal to 1, F(·) is the feature map extraction operation of the backbone network unit, E(·) is the feature extraction operation of the feature extraction unit, x is the target image, and f₀ is the feature image.

(Deleted)

The object recognition apparatus of claim 1, wherein the backbone network unit
comprises a plurality of inverted residual modules and performs the feature map extraction operation by sequentially applying the plurality of inverted residual modules to the feature image.

The object recognition apparatus of claim 3, wherein each inverted residual module
performs a depth-wise separable convolution operation and uses PReLU (Parametric Rectified Linear Unit) or Leaky-ReLU as an activation function.

The object recognition apparatus of claim 3, wherein the backbone network unit
sets the stride of the inverted residual module computed last among the plurality of inverted residual modules to 2 and sets the stride of the remaining inverted residual modules to 1.

The object recognition apparatus of claim 1, further comprising:
a classification unit (classification head) that determines whether an object is included in each bounding box, using bounding boxes that move within the plurality of first feature maps; and
a position detection unit (regression head) that, when the object is included, extracts position information of the object by regressing the bounding box in the first feature map to a position in the target image.

7. The object recognition apparatus of claim 6,
wherein the feature extraction unit, the backbone network unit, the classification unit and the position detection unit are trained simultaneously using a multitask loss function.

The object recognition apparatus of claim 1,
further comprising an upsampling unit that generates a second feature map by upsampling the second feature map generated immediately before and connecting, using a skip connection, the first feature map having the same resolution as the upsampled result,
wherein the upsampling unit
sets the first feature map having the lowest resolution among the plurality of first feature maps as the initial second feature map.

An object recognition method comprising:
generating a feature image by performing a feature extraction operation on an input target image;
repeatedly performing a feature map extraction operation on the feature image, and generating a plurality of first feature maps each having a different resolution depending on the number of times the feature map extraction operation is applied; and
determining whether an object is included in each bounding box, using bounding boxes that move within the plurality of first feature maps, and, when the object is included, extracting position information of the object by regressing the bounding box in the first feature map to a position in the target image,
wherein the generating of the plurality of first feature maps generates the first feature maps {f₁, f₂, ..., f_N} according to

$$ f_0 = E(x), \qquad f_i = F(f_{i-1}), \quad i = 1, \dots, N $$

where N is an integer greater than or equal to 1, F(·) is the feature map extraction operation of the backbone network unit, E(·) is the feature extraction operation of the feature extraction unit, x is the target image, and f₀ is the feature image.

generating a feature image by performing a feature extraction operation on the input target image;
repeatedly performing a feature map extraction operation on the feature image, and generating a plurality of first feature maps each having different resolutions according to the number of times the feature map extraction operation is applied;
setting a first feature map having the lowest resolution among a plurality of first feature maps as an initial second feature map;
generating a second feature map by upsampling the previously generated second feature map and connecting the first feature map having the same resolution as the upsampled result using a skip connection; and
determining whether an object is included in each bounding box, using bounding boxes that move within the plurality of second feature maps, and, when the object is included, extracting position information of the object by regressing the bounding box in the second feature map to a position in the target image,
The step of generating the plurality of first feature maps includes:

A computer program stored in a medium, in combination with hardware, to execute the object recognition method of claim 9 or claim 10.