KR20220050617A

KR20220050617A - Method and device for detecting object location based on weak supervised learning

Info

Publication number: KR20220050617A
Application number: KR1020200134493A
Authority: KR
Inventors: 변혜란; 기민송; 이원영; 고성필
Original assignee: 연세대학교 산학협력단
Priority date: 2020-10-16
Filing date: 2020-10-16
Publication date: 2022-04-25
Also published as: KR102537947B1

Abstract

The present invention relates to a method and device for detecting an object location based on weak supervised learning. According to the present invention, the method comprises the following steps by an object location detection device: extracting a feature map for an input image by using a first convolutional layer of a deep learning model; generating a first attention map by using the feature map; generating a dropped foreground mask from the first attention map, wherein the dropped foreground mask is generated on the basis of a contrastive attention loss; and estimating an object location on the basis of an importance map generated from the dropped foreground mask or the first attention map. Accordingly, the method provides an effect of preventing performance degradation of object location estimation.

Description

Weak supervised learning-based object location detection method and device {METHOD AND DEVICE FOR DETECTING OBJECT LOCATION BASED ON WEAK SUPERVISED LEARNING}

The disclosed technology relates to a method and apparatus for detecting object location based on weakly supervised learning.

Supervised learning requires near-perfect labeling in order to identify the objects contained in an image input to a convolutional neural network. The labels are what allow the network to classify objects accurately. Because this makes labeling costly, object detection techniques based on unsupervised learning, or on weakly supervised learning that provides only image-level annotations, have recently come into use.

Meanwhile, conventional object location detection techniques using convolutional neural networks identify an object around the region that is most strongly identified within it. As a result, their discriminative power is concentrated on a part of the object rather than on finding the whole object.

To address this problem, methods have been developed that train with the important region hidden so that the entire object can be found. However, because this induces the learning model to also learn the relatively less important background of the image, it produces a box much larger than the actual object, which lowers the accuracy of identifying the object's location.

Korean Patent Publication No. 10-2020-0074940

The disclosed technology aims to provide a method and apparatus for detecting object location using weakly supervised learning through in-image sample contrast.

To achieve the above technical object, a first aspect of the disclosed technology provides an object location detection method comprising: extracting, by an object location detection device, a feature map for an input image using a first convolutional layer of a deep learning model; generating, by the object location detection device, a first attention map using the feature map; generating, by the object location detection device, a dropped foreground mask from the first attention map, wherein the dropped foreground mask is generated on the basis of a contrastive attention loss; and estimating, by the object location detection device, an object location on the basis of the dropped foreground mask or an importance map generated from the first attention map.

To achieve the above technical object, a second aspect of the disclosed technology provides an object location detection device comprising: an input device that receives an image containing an object; a storage device that stores a deep learning model; and a computing device that extracts a feature map for the image using a first convolutional layer of the deep learning model, generates a first attention map by applying an attention block to the feature map, generates a dropped foreground mask from the first attention map, wherein the dropped foreground mask is generated on the basis of a contrastive attention loss, and estimates an object location on the basis of the dropped foreground mask or an importance map generated from the first attention map.

Embodiments of the disclosed technology may have effects including the following advantages. This does not mean, however, that every embodiment must include all of them, so the scope of the disclosed technology should not be understood as being limited by them.

According to an embodiment of the disclosed technology, the weakly supervised learning-based object location detection method and apparatus generate a box indicating the region of an object while preventing the box from becoming excessively large.

In addition, using the contrastive attention loss and the foreground consistency loss for training prevents degradation of object location estimation performance.

In addition, estimating the object location on the basis of an attention map generated by a non-local attention block covers the entire region of the object without being biased toward a specific part of it.

FIG. 1 is a diagram illustrating object location estimation using a weakly supervised learning-based object location detection system according to an embodiment of the disclosed technology.
FIG. 2 is a diagram illustrating the overall flow of estimating an object location according to an embodiment of the disclosed technology.
FIG. 3 is a diagram illustrating calculation of the contrastive attention loss according to an embodiment of the disclosed technology.
FIG. 4 is a diagram showing the configuration of an object location detection device according to an embodiment of the disclosed technology.
FIG. 5 is a flowchart of an object location detection method according to an embodiment of the disclosed technology.
FIG. 6 is a diagram comparing object location estimation results.
FIG. 7 is a diagram showing the disclosed technology applied to conventional image datasets.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to specific embodiments, and it should be understood to cover all modifications, equivalents and substitutes falling within the spirit and technical scope of the present invention.

Terms such as first, second, A and B may be used to describe various components, but the components are not limited by these terms; the terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component. The term "and/or" includes a combination of a plurality of related listed items or any one of a plurality of related listed items.

As used herein, a singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "comprise" indicate that the stated feature, number, step, operation, component, part or combination thereof is present, and should be understood as not excluding the presence or possible addition of one or more other features, numbers, steps, operations, components, parts or combinations thereof.

Before the drawings are described in detail, it should be made clear that the division of components in this specification is merely a division by the main function each component is responsible for. That is, two or more of the components described below may be combined into one component, or one component may be divided into two or more components by more finely subdivided function.

Each component described below may, in addition to its own main function, additionally perform some or all of the functions of other components, and some of the main functions of each component may of course be carried out entirely by another component. The presence or absence of each component described in this specification should therefore be interpreted functionally.

FIG. 1 is a diagram illustrating object location estimation using a weakly supervised learning-based object location detection system according to an embodiment of the disclosed technology. Referring to FIG. 1, the object location detection system 100 may receive an image input by a user through a computing device 110 such as the user's PC or smartphone. Any other kind of device capable of performing operations according to an algorithm may also serve as the computing device. The computing device may include a device that receives the user's input, a device that stores algorithms and programs for computation, a device that performs computation using the user's input and the stored algorithm or program, and a device that outputs the computation result to the user. The input device may receive an image input by the user and an annotation for the image. In one embodiment, the input device may include a mouse for uploading an image file and a keyboard for entering an annotation for the image file. To receive such input, the computing device may store an application or program that presents a predetermined interface to the user.

Meanwhile, the user may input data into the computing device 110. The data input by the user is basically an image of an object. The image may be one taken by the user or a specific image provided by a separate dataset. Along with the image, the input data includes an annotation for the object. Here, only an image-level annotation is input, not full labeling data. That is, the user may input an image containing an object and an annotation for the object into the computing device 110 as data. The computing device receives the data and may estimate the location of the object by performing weakly supervised learning according to the stored algorithm. The estimation may be carried out by the device responsible for computation within the computing device, and the result may be provided to the user through the output device.

Meanwhile, to estimate the object location more accurately, the disclosed technology trains the convolutional layers using two loss functions: a contrastive attention loss and a foreground consistency loss. To compute the contrastive attention loss, an attention map is first generated from the feature map extracted from the image, and thresholds are applied to the attention map to generate a plurality of masks. The attention map is generated by applying an attention block to the feature map, and the attention block may be configured somewhat differently depending on the algorithm stored in the computing device. The attention block referred to in the disclosed technology, however, is intended to replace channel pooling and has a non-local character; that is, it generates an enhanced attention map from the feature map by taking spatial similarity into account. Given a feature map, the attention block embeds it into channel, height and width, respectively, and outputs the spatial sum of the third embedding weighted by the similarity between the first two embeddings. The enhanced attention map is defined from the feature map by this result, which takes the place of channel pooling.
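The non-local attention block described above can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: the projection matrices `w_query`, `w_key` and `w_value`, the softmax normalization of the similarity, and the function name `nonlocal_attention_map` are all assumptions introduced here. It only shows the stated idea of a spatial sum of a third embedding weighted by the similarity between the first two.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nonlocal_attention_map(feature, w_query, w_key, w_value):
    """Hypothetical non-local attention block.

    feature: (C, H, W) feature map.
    w_query, w_key: (D, C) projections (assumed 1x1 convolutions).
    w_value: (1, C) projection to a single attention channel.
    Returns an (H, W) attention map.
    """
    c, h, w = feature.shape
    flat = feature.reshape(c, h * w)              # (C, N), N = H * W positions
    query = w_query @ flat                        # first embedding,  (D, N)
    key = w_key @ flat                            # second embedding, (D, N)
    value = w_value @ flat                        # third embedding,  (1, N)
    similarity = softmax(query.T @ key, axis=-1)  # (N, N), each row sums to 1
    # For each position i: sum over j of similarity[i, j] * value[j].
    attended = value @ similarity.T               # (1, N)
    return attended.reshape(h, w)
```

Because every position aggregates over all other positions, the resulting map is not restricted to a local receptive field, which is the property the text attributes to the block.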

Meanwhile, three masks are generated from the attention map: a foreground mask, a background mask, and a dropped foreground mask, which is the foreground mask with a specific region hidden. Here, the specific region means the most discriminative region among the background and foreground of the image. Conventional neural network-based object location detection techniques generate the bounding box for the feature map starting from this specific region, so they sometimes fail to cover the entire object or generate a bounding box much larger than the actual object. To resolve this problem, the disclosed technology uses a dropped foreground mask generated by hiding the most discriminative region. The computing device may embed the features of the image by multiplying each mask by the input feature.
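The three masks can be illustrated with a short NumPy sketch. The threshold names and values (`tau_fg`, `tau_bg`, `tau_drop`) and the masked-average pooling in `masked_embedding` are assumptions made for illustration; the source states only that thresholds are applied to the attention map and that each mask is multiplied by the input feature.

```python
import numpy as np

def make_masks(attention, tau_fg=0.5, tau_bg=0.2, tau_drop=0.8):
    """Build foreground, background and dropped foreground masks.

    attention: (H, W) attention map. All thresholds are hypothetical.
    """
    foreground = (attention > tau_fg).astype(np.float32)
    background = (attention < tau_bg).astype(np.float32)
    # The most discriminative region is where the attention is highest.
    most_discriminative = attention > tau_drop
    # Dropped foreground: the foreground with that region hidden.
    dropped_foreground = foreground * (~most_discriminative)
    return foreground, background, dropped_foreground

def masked_embedding(feature, mask):
    """Multiply a (C, H, W) feature by an (H, W) mask and average it.

    The pooling into a (C,) vector is an assumption, used here so the
    three embeddings can be compared with each other.
    """
    weighted = feature * mask[None, :, :]   # broadcast over channels
    return weighted.sum(axis=(1, 2)) / max(mask.sum(), 1.0)
```

With an attention map such as `[[0.1, 0.6], [0.9, 0.3]]`, the pixel at 0.9 belongs to the foreground mask but is removed from the dropped foreground mask.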

Meanwhile, the dropped foreground mask is generated on the basis of the contrastive attention loss. For example, when the first attention map is embedded into the feature space, the dropped foreground mask may be generated so that its embedding is as similar as possible to the embedding result of the foreground mask and as far as possible from the embedding result of the background mask. The dropped foreground mask may be generated by applying a threshold to the first attention map to hide the most discriminative region among the background and foreground, and the feature values embedded through the generated mask may lie close to those of the foreground mask and somewhat apart from those of the background mask.

Meanwhile, the computing device may apply the contrastive attention loss sequentially from the initial layer to the higher layers by inputting a feature map to each of the plurality of convolutional layers. The initial layer, where the contrastive attention loss is computed first, uses the feature map extracted from the image as its input as is. Each subsequent layer uses as its input the feature map multiplied by one of the dropped foreground mask and the importance map generated in the preceding layer, selected at random. Here, the dropped foreground mask is generated by applying a threshold to the attention map as described above, and the importance map is generated by applying sigmoid activation to the attention map.

That is, each subsequent layer receives a feature map weighted by the dropped foreground mask or the importance map and applies the contrastive attention loss in the same way as the preceding layer. Here, the dropped foreground mask or the importance map is applied by pixel-wise multiplication, meaning element-wise multiplication broadcast over the channel dimension. Because the input of the subsequent layer has been weighted, that layer generates an attention map that differs from the one generated in the preceding layer. The foreground consistency loss is therefore applied so that the model learns to reduce the difference between the two attention maps. Only two layers are mentioned here for the sake of explanation; since a convolutional neural network actually has many layers, the foreground consistency loss may be computed so that the attention maps of the first layer and the last layer become similar to each other.
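How the input of a subsequent layer is formed can be sketched as below. This is an illustrative NumPy example under stated assumptions: the thresholds `tau_fg` and `tau_drop`, the selection probability `drop_rate`, and all function names are hypothetical. The sigmoid importance map, the random choice between the two weights, and the channel-broadcast pixel-wise multiplication follow the description above.

```python
import numpy as np

def importance_map(attention):
    # Sigmoid activation applied to the attention map.
    return 1.0 / (1.0 + np.exp(-attention))

def dropped_foreground_mask(attention, tau_fg=0.5, tau_drop=0.8):
    # Foreground pixels with the most discriminative region hidden.
    # Both thresholds are assumptions made for this sketch.
    keep = (attention > tau_fg) & (attention <= tau_drop)
    return keep.astype(attention.dtype)

def next_layer_input(feature, attention, rng, drop_rate=0.5):
    """Weight a (C, H, W) feature map for the next layer.

    One of the dropped foreground mask and the importance map is
    chosen at random; drop_rate (hypothetical) is the probability of
    choosing the mask.
    """
    if rng.random() < drop_rate:
        weight = dropped_foreground_mask(attention)
    else:
        weight = importance_map(attention)
    # Pixel-wise multiplication: the (H, W) weight is broadcast over
    # the channel dimension of the (C, H, W) feature map.
    return feature * weight[None, :, :]
```

A call such as `next_layer_input(feature, attention, np.random.default_rng(0))` returns an array with the same shape as `feature`.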

FIG. 2 is a diagram illustrating the overall flow of estimating an object location according to an embodiment of the disclosed technology. Referring to FIG. 2, the contrastive attention loss 220 is applied at each of the plurality of convolutional layers included in the deep learning model 200, and the foreground consistency loss 230 is applied so that the attention map 210 of the first layer 201 and the attention map of the last layer 202 take similar values.

First, the first layer 201 may extract a feature map for the input image. The feature map is extracted in a manner similar to that of a conventional convolutional neural network. An attention map may then be generated, for example, by applying an attention block with a non-local character to the feature map. A dropped foreground mask is generated on the basis of the generated attention map, and the object location is estimated using the attention map and the dropped foreground mask. The result of the location estimation may be adjusted through the contrastive attention loss. For example, the deep learning model may be trained so that the contrastive attention loss value becomes small in order to estimate the object location.

Meanwhile, the dropped foreground mask generated on the basis of the contrastive attention loss may be calculated according to Equation 1 below.

[Equation 1]

Here, A denotes the attention map, and the two remaining symbols of Equation 1 denote the predefined foreground region and the predefined background region, respectively. That is, the dropped foreground mask is generated by removing the most discriminative region from the attention map, such that the feature embedded through it becomes similar to the embedding result of the foreground mask and moves away from the embedding result of the background mask.

Meanwhile, the contrastive attention loss may be calculated with reference to Equation 2 below.

[Equation 2]

Here, the symbols of Equation 2 denote, in order: the feature embedded through the dropped foreground mask; the feature embedded through the foreground mask; the feature embedded through the background mask; the arrangement of the instances of the first mask and the second mask; the arrangement of the instances of the first mask and the third mask; and the margin. That is, the instances of the feature output through the first mask and the feature output through the second mask are similar to each other, while the instances of the feature of the first mask and the feature of the third mask are output so as to differ from each other.

The contrastive attention loss function yields a small value when the query is similar to its equivalent instance and dissimilar to the different instance. That is, the dropped foreground mask is a mask that makes the computed contrastive attention loss small. Because including background in the attention map incurs a penalty due to dissimilarity, the contrastive attention loss can guide the attention map until it reaches the object boundary. Accordingly, unlike a triplet loss, there is no need to sample examples or to manage high-valued negative samples; the features output through the first to third masks can themselves be regarded as the positive and negative samples.
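A loss with this behavior can be sketched in the familiar triplet form. The exact form of Equation 2 is not available in this text, so the Euclidean distance, the hinge, and the default `margin` below are assumptions; only the roles of the three embeddings (query, positive, negative) and the presence of a margin come from the description.

```python
import numpy as np

def contrastive_attention_loss(z_dfg, z_fg, z_bg, margin=1.0):
    """Triplet-style sketch of the contrastive attention loss.

    z_dfg: embedding through the dropped foreground mask (query).
    z_fg:  embedding through the foreground mask (positive).
    z_bg:  embedding through the background mask (negative).
    The distance measure and margin value are hypothetical.
    """
    d_pos = np.linalg.norm(z_dfg - z_fg)   # should become small
    d_neg = np.linalg.norm(z_dfg - z_bg)   # should become large
    return float(max(0.0, d_pos - d_neg + margin))
```

The loss is zero once the query is closer to the foreground embedding than to the background embedding by at least the margin, and grows as the query drifts toward the background. All three embeddings come from the same image, so no separate sampling step is needed.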

Meanwhile, the first layer applies a weight to the feature map to generate the input of the next layer. The weight applied to the feature map may be either the dropped foreground mask or the importance map. The input of the next layer is computed by multiplying the feature map by whichever of the two is selected at random in the deep learning model.

Meanwhile, the second layer takes the weighted feature map as its input and performs the same operations as the first layer. The third layer likewise takes the weighted feature map computed in the second layer as its input and performs the same operations. In this manner the operations proceed sequentially up to the last layer 202. Once the operations of the last layer are complete, the attention map of the first layer can be compared with that of the last layer. Depending on the number of layers, the difference between the attention maps generated in the first and last layers may grow large. The object location detection device may train the deep learning model so that the two attention maps become similar to each other. For example, the foreground consistency loss, which reduces the difference between the two attention maps, may be computed by reducing the background activation of the first layer with reference to the last layer.

Meanwhile, the foreground consistency loss may be calculated with reference to Equation 3 below.

[Equation 3]

Here, the two symbols of Equation 3 denote the i-th attention map and the j-th attention map, respectively. That is, the i-th attention map is the one output from an early layer, and the j-th attention map is the one output from a later layer. An attention map roughly shows the activation at every location, and an early layer tends to be activated more by locally distinctive regions such as edges or corners than by the full extent of the object. To resolve this problem, the foreground consistency loss is computed so that the attention map of the early layer and that of the later layer remain consistent, and the model is thereby trained so that the two attention maps become similar.

Meanwhile, the object location detection device may calculate the total loss of the deep learning model according to Equation 4 below.

[Equation 4]

Here, the three symbols of Equation 4 denote the conventional class loss, the contrastive attention loss and the foreground consistency loss, respectively. The class loss can be computed, given a one-hot ground-truth label, from the softmax output produced by placing a global average pooling (GAP) layer at the end of the network. The contrastive attention loss is computed through Equation 2 and the foreground consistency loss through Equation 3, as described above.

The object location detection device may thus compute the contrastive attention loss and the foreground consistency loss and add them to the classification loss used for classifying the image, yielding the total loss of the deep learning model. The device may generate the dropped foreground mask so that the contrastive attention loss becomes small, and train the deep learning model so that the foreground consistency loss becomes small and the attention maps of the layers become similar. That is, by repeating the training process so that the total loss decreases, it can generate a bounding box that covers the entire object while preventing the box from becoming excessively large. Accordingly, unlike models trained in the conventional hierarchical manner, which suffer performance degradation, it can improve object location estimation performance by forming a box close to the region occupied by the whole object.
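The combination of the three losses can be sketched as follows. The GAP, softmax and one-hot cross-entropy steps follow the description of the class loss; the weights `w_ca` and `w_fc` are hypothetical, since the text states only that the losses are summed, and the function names are introduced here for illustration.

```python
import numpy as np

def classification_loss(class_maps, one_hot):
    """Cross-entropy after global average pooling (GAP).

    class_maps: (K, H, W) per-class activation maps at the network end.
    one_hot:    (K,) one-hot ground-truth label.
    """
    logits = class_maps.mean(axis=(1, 2))           # GAP over H and W
    logits = logits - logits.max()                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())  # log-softmax
    return float(-(one_hot * log_probs).sum())

def total_loss(l_cls, l_ca, l_fc, w_ca=1.0, w_fc=1.0):
    # Sum of class, contrastive attention and foreground consistency
    # losses; the weights are assumptions (1.0 gives a plain sum).
    return l_cls + w_ca * l_ca + w_fc * l_fc
```

For instance, `total_loss(1.0, 0.5, 0.25)` evaluates to `1.75` with the default weights.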

FIG. 3 is a diagram illustrating the calculation of the contrastive attention loss according to an embodiment of the disclosed technology. Referring to FIG. 3, each layer may generate three masks using its attention map. As in the prior art, a foreground mask may be generated by removing the region corresponding to the background of the image, and a background mask may be generated by removing the region corresponding to the foreground of the image. In addition, the dropped foreground mask 310 may be generated in the manner described above with reference to FIGS. 1 and 2. The dropped foreground mask 310 is generated by applying a threshold 310a to the first attention map. Here, the threshold 310a may correspond to the region in which the most discriminative feature appears among the background and foreground of the image. For example, when the object is an animal, the animal's face may be regarded as the discriminative region.
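The three masks can be sketched as below. This is only an illustration under stated assumptions: the attention map is min-max normalized first, and the two threshold values (`fg_threshold`, `drop_threshold`) and the function name `make_masks` are hypothetical; the text only states that the dropped foreground mask is obtained by applying a threshold so that the most discriminative region is covered.

```python
import numpy as np

def make_masks(attention, fg_threshold=0.5, drop_threshold=0.9):
    """Return (foreground, background, dropped_foreground) binary masks."""
    # Normalize to [0, 1] so the thresholds are comparable across images (assumption).
    span = attention.max() - attention.min() + 1e-12
    a = (attention - attention.min()) / span
    # Foreground keeps strongly attended pixels; background is its complement.
    foreground = (a >= fg_threshold).astype(np.float32)
    background = 1.0 - foreground
    # Dropped foreground: the foreground with the most discriminative pixels removed.
    dropped_foreground = foreground * (a < drop_threshold).astype(np.float32)
    return foreground, background, dropped_foreground
```

For an animal image, the pixels above `drop_threshold` would typically fall on the face, so the dropped foreground mask keeps the rest of the body.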

As training progresses, the dropped foreground mask 310 and the foreground mask may be guided to learn from the discriminative region out to the less important parts of the object region, while the background mask may be guided so that its discriminative region arises toward the background. Through this process, the embeddings of the dropped foreground mask and the foreground mask are trained to become similar to each other, and the embeddings of the dropped foreground mask and the background mask are trained to move apart. That is, training can proceed using a triplet-loss form.

The first attention map, input to each of the three masks, is embedded into a feature space 310b. Within the feature space, the features of the dropped foreground mask and the features of the foreground mask are represented as similarly as possible, and the features of the dropped foreground mask and the features of the background mask are represented as dissimilarly as possible. For example, the features of the dropped foreground mask and of the foreground mask may be represented at adjacent positions in the feature space, while the features of the dropped foreground mask and of the background mask may be represented as far apart as possible.
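A triplet-style loss of this kind can be sketched as follows. This is a hedged illustration and does not reproduce Equation 1: the embedding is assumed to be a mask-weighted average of a `(C, H, W)` feature tensor, the distance is squared Euclidean, and `margin`, `masked_embedding`, and `contrastive_attention_loss` are hypothetical names and parameters.

```python
import numpy as np

def masked_embedding(features, mask):
    """Average the (C, H, W) features over the pixels kept by an (H, W) mask."""
    weight = mask.sum() + 1e-12
    return (features * mask[None, :, :]).sum(axis=(1, 2)) / weight

def contrastive_attention_loss(features, fg, bg, dropped_fg, margin=1.0):
    """Triplet-form loss: anchor = dropped foreground, positive = foreground,
    negative = background."""
    anchor = masked_embedding(features, dropped_fg)
    positive = masked_embedding(features, fg)
    negative = masked_embedding(features, bg)
    d_pos = np.sum((anchor - positive) ** 2)  # should become small
    d_neg = np.sum((anchor - negative) ** 2)  # should become large
    return float(max(0.0, d_pos - d_neg + margin))
```

The loss is zero once the dropped foreground embedding is closer to the foreground embedding than to the background embedding by at least the margin.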

FIG. 4 is a diagram illustrating the configuration of an object position detection apparatus according to an embodiment of the disclosed technology. Referring to FIG. 4, the object position detection apparatus 400 includes an input device 410, a storage device 420, and a computing device 430. The object position detection apparatus 400 may be a computing apparatus that carries a program or algorithm for performing a specific calculation and that outputs data of a specific value by performing the calculation on input data. For example, a device such as a PC or a smartphone may be used.

The input device 410 receives an image containing an object. The input device may be mounted on or connected to the object position detection apparatus and may receive, as data, an image input by a user. For example, a keyboard or a mouse may be used as the input device.

The storage device 420 stores a deep learning model trained to detect the position of an object in an image. The storage device corresponds to the memory of the object position detection apparatus, and the stored deep learning model may be loaded under the control of the computing device.

The computing device 430 extracts a feature map for the image using the first convolutional layer of the deep learning model. It then generates a first attention map by setting an attention block in the feature map, and generates a dropped foreground mask from the first attention map. As described above, the dropped foreground mask is generated based on the contrastive attention loss. The computing device 430 estimates the object position based on the dropped foreground mask or an importance map generated from the first attention map. The computing device may be a CPU or processor with sufficient performance to calculate the contrastive attention loss and the foreground consistency loss. Using the input data delivered from the input device and the deep learning model stored in the storage device, the computing device may perform the operations for object position detection and generate output data. Here, the input data may be an image containing an object together with its annotation, and the output data may be the position or coordinates of the object in the image.

Meanwhile, an output device capable of outputting the result processed by the computing device 430 may additionally be provided as needed. The output device may be a display that outputs the object position detection result delivered from the computing device. The output device may be a display integrated with the object position detection apparatus or a separately connected monitor.

FIG. 5 is a flowchart of an object position detection method according to an embodiment of the disclosed technology. Referring to FIG. 5, the object position detection apparatus may perform the object position detection method in the following order.

First, in step 510, the object position detection apparatus extracts a feature map for the input image using the first convolutional layer. The first convolutional layer refers to an early layer among a plurality of convolutional layers. The apparatus may receive an image input by a user and feed it to the first convolutional layer, and the first convolutional layer may extract a feature map for the object contained in the image.

Second, in step 520, the object position detection apparatus generates a first attention map using the feature map. The apparatus sets an attention block in the feature map to generate the first attention map for the first convolutional layer.

Third, in step 530, the object position detection apparatus generates a dropped foreground mask based on the first attention map. The dropped foreground mask serves as a weight that reduces the value of the contrastive attention loss. The apparatus may generate three masks based on the first attention map generated from the input image. In one embodiment, it may generate a foreground mask by covering the background, a background mask by covering the foreground, and a dropped foreground mask by covering the most discriminative region among the foreground and the background. The dropped foreground mask may then be generated so that its embedding lies close to the embedding of the foreground mask and far from the embedding of the background mask.

Next, in step 540, the position of the object is estimated based on the dropped foreground mask or the importance map. The position of the object may be calculated from the position of a bounding box generated around the object. Here, the object position detection apparatus may use the dropped foreground mask so that the bounding box is not biased toward a particular region of the object, covers the entire object, and is not generated excessively large.
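Deriving a bounding box from a mask or importance map can be sketched as below. This is a simplified illustration rather than the procedure of the disclosure: it boxes every pixel at or above a hypothetical `threshold`, whereas practical systems often keep only the largest connected region; the function name `bounding_box` is also hypothetical.

```python
import numpy as np

def bounding_box(score_map, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) of pixels >= threshold, or None."""
    # Row and column indices of all pixels that pass the threshold.
    ys, xs = np.nonzero(score_map >= threshold)
    if ys.size == 0:
        return None  # nothing was activated
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())
```

The returned coordinates are inclusive pixel indices in the resolution of the map, so they would still need rescaling to the input image size.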

Meanwhile, once the operations in the first convolutional layer are finished, the object position detection apparatus inputs a weighted object image to the second convolutional layer in step 550. The weighted object image is generated by applying, as a weight to the first attention map, a randomly selected one of the dropped foreground mask and the importance map.
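The random choice between the two weights can be sketched as follows. This is an illustration under assumptions: the importance map is taken to be the sigmoid of the first attention map, as the claims state, while the selection probability `drop_rate` and the function name `weight_attention` are hypothetical.

```python
import numpy as np

def weight_attention(attention, dropped_fg, rng, drop_rate=0.5):
    """Weight the attention map by the dropped mask or by the importance map."""
    # Importance map: sigmoid of the attention map.
    importance = 1.0 / (1.0 + np.exp(-attention))
    # Pick one of the two weights at random (drop_rate is an assumed parameter).
    use_dropped = rng.random() < drop_rate
    weight = dropped_fg if use_dropped else importance
    return attention * weight

rng = np.random.default_rng(0)
weighted = weight_attention(np.array([[0.0, 2.0]]), np.array([[1.0, 0.0]]), rng)
```

Setting `drop_rate` to 1.0 always applies the dropped foreground mask, and 0.0 always applies the importance map, which is convenient for checking each branch.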

In step 560, the object position detection apparatus generates a second attention map using the second convolutional layer. The weighted image is input to the second convolutional layer, and the second attention map may be generated by applying an attention block to it in the same way as in the first convolutional layer.

In step 570, the object position detection apparatus trains the deep learning model so that the first attention map and the second attention map become similar to each other. In one embodiment, the foreground consistency loss is calculated so that the first attention map output from the first convolutional layer and the second attention map output from the second convolutional layer take similar values.
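A consistency term of this kind can be sketched as below. Equation 2 is not reproduced here, so the mean squared difference is used purely as a stand-in, and the two maps are assumed to have already been brought to the same spatial size; the function name is hypothetical.

```python
import numpy as np

def foreground_consistency_loss(first_attention, second_attention):
    """Mean squared difference between two attention maps of equal shape.

    Stand-in for Equation 2 of the disclosure, which may differ in form.
    """
    diff = first_attention - second_attention
    return float(np.mean(diff ** 2))
```

The value is zero when the two attention maps agree everywhere and grows as they diverge, which is the behavior the training step relies on.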

FIG. 6 is a diagram comparing object position estimation results. Referring to FIG. 6, the green boxes indicate the predicted values and the red boxes indicate the actual object regions. When the conventional ADL (Attention-based Dropout Layer) and SPG (Self-Produced Guidance) methods are compared with the object position detection technique according to an embodiment of the disclosed technology, the green boxes of ADL are seen to expand excessively because ADL attempts to activate more of the less discriminative parts. SPG tries to suppress the background but overestimates the object region, so some of its green boxes are still too large. In contrast, with the disclosed object position detection technique, the green boxes closely match the red boxes, covering the full extent of the object without expanding excessively into the background.

FIG. 7 is a diagram illustrating the application of the disclosed technology to conventional image datasets. Referring to FIG. 7, regions shown in red are regions of high importance and regions shown in green are regions of low importance. A region of high importance is a region containing the object, and a region of low importance is the background. Tests on the conventional ImageNet and CUB datasets confirm that the green boxes and red boxes produced with the model to which the disclosed technology is applied are similar. Accordingly, estimating the object position based on a non-local attention map has the effect of covering the entire object region without being biased toward a particular region.

The weak supervised learning-based object position detection method and apparatus according to an embodiment of the disclosed technology have been described with reference to the embodiments shown in the drawings to aid understanding, but these are merely examples, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible. Accordingly, the true technical scope of protection of the disclosed technology should be defined by the appended claims.

Claims

An object position detection method comprising:
extracting, by an object position detection apparatus, a feature map for an input image using a first convolutional layer of a deep learning model;
generating, by the object position detection apparatus, a first attention map using the feature map;
generating, by the object position detection apparatus, a dropped foreground mask from the first attention map, wherein the dropped foreground mask is generated based on a contrastive attention loss; and
estimating, by the object position detection apparatus, a position of an object based on the dropped foreground mask or an importance map generated from the first attention map.

The method of claim 1, wherein generating the first attention map comprises:
generating, by the object position detection apparatus, the first attention map by setting an attention block in the feature map.

The method of claim 1, wherein generating the dropped foreground mask comprises:
generating, by the object position detection apparatus, a foreground mask for the first attention map;
generating, by the object position detection apparatus, a background mask for the first attention map; and
embedding the first attention map into a feature space using the dropped foreground mask, the foreground mask, and the background mask,
wherein the dropped foreground mask is generated so that the embeddings of the dropped foreground mask and the foreground mask are as similar to each other as possible, and the embeddings of the dropped foreground mask and the background mask are as dissimilar to each other as possible.

The method of claim 1,
wherein the dropped foreground mask is a mask generated by applying a first threshold value to the first attention map so as to cover the most discriminative region among the background and the foreground.

The method of claim 1, further comprising:
generating, by the object position detection apparatus, a second attention map using a second convolutional layer; and
estimating, by the object position detection apparatus, the object position based on the second attention map.

The method of claim 5,
wherein the second attention map is generated by applying, as a weight to the first attention map, a randomly selected one of the dropped foreground mask and the importance map.

The method of claim 1,
wherein the dropped foreground mask is generated by applying a threshold value to the first attention map, and the importance map is generated by applying a sigmoid function to the first attention map.

The method of claim 5, further comprising:
calculating, by the object position detection apparatus, a foreground consistency loss such that the first attention map and the second attention map become similar to each other.

The method of claim 8, wherein the foreground consistency loss
reduces a background activation of the first convolutional layer, which generated the first attention map, based on a background activation of the second convolutional layer.

The method of claim 8, further comprising:
calculating a final loss of the deep learning model by summing the contrastive attention loss, the foreground consistency loss, and a classification loss for the first attention map, and training the deep learning model to minimize the contrastive attention loss and the foreground consistency loss.

An object position detection apparatus comprising:
an input device for receiving an image including an object;
a storage device for storing a deep learning model; and
a computing device that extracts a feature map for the image using a first convolutional layer of the deep learning model, generates a first attention map by setting an attention block in the feature map, generates a dropped foreground mask from the first attention map, wherein the dropped foreground mask is generated based on a contrastive attention loss,
and estimates an object position based on the dropped foreground mask or an importance map generated from the first attention map.

The apparatus of claim 11,
wherein the computing device generates a foreground mask and a background mask for the first attention map, and embeds the first attention map into a feature space using the dropped foreground mask, the foreground mask, and the background mask,
and generates the dropped foreground mask so that the embeddings of the dropped foreground mask and the foreground mask are as similar to each other as possible, and the embeddings of the dropped foreground mask and the background mask are as dissimilar to each other as possible.

The apparatus of claim 11,
wherein the dropped foreground mask is a mask generated by applying a first threshold value to the first attention map so as to cover the most discriminative region among the background and the foreground.

The apparatus of claim 11,
wherein the computing device generates a second attention map using a second convolutional layer and estimates the object position based on the second attention map.

The apparatus of claim 14,
wherein the second attention map is generated by applying, as a weight to the first attention map, a randomly selected one of the dropped foreground mask and the importance map.

The apparatus of claim 11,
wherein the dropped foreground mask is generated by applying a threshold value to the first attention map, and the importance map is generated by applying a sigmoid function to the first attention map.

The apparatus of claim 14,
wherein the computing device calculates a foreground consistency loss such that the first attention map and the second attention map become similar to each other.

The apparatus of claim 17, wherein the foreground consistency loss
reduces a background activation of the first convolutional layer, which generated the first attention map, based on a background activation of the second convolutional layer.

The apparatus of claim 17,
wherein the computing device calculates a final loss of the deep learning model by summing the contrastive attention loss, the foreground consistency loss, and a classification loss for the first attention map, and trains the deep learning model to minimize the contrastive attention loss and the foreground consistency loss.