KR102540290B1

KR102540290B1 - Apparatus and Method for Person Re-Identification based on Heterogeneous Sensor Camera

Info

Publication number: KR102540290B1
Application number: KR1020210006683A
Authority: KR
Inventors: 함범섭; 박현종; 이상훈
Original assignee: 연세대학교 산학협력단
Priority date: 2021-01-18
Filing date: 2021-01-18
Publication date: 2023-06-02
Also published as: KR20220104426A

Abstract

The present invention performs training based on feature maps extracted from images acquired by heterogeneous sensor cameras, together with warped feature maps obtained by warping mutually corresponding pixels between those feature maps. In particular, a dense triplet loss between the feature maps and the warped feature maps is calculated and reflected in training, so that fine-grained local information can be extracted even from unaligned images. The invention can thereby provide a person re-identification apparatus and method capable of accurately re-identifying a person even from images acquired by various cameras under very different conditions.

Description

Apparatus and Method for Person Re-Identification based on Heterogeneous Sensor Camera

The present invention relates to a person re-identification apparatus and method, and more particularly to a heterogeneous sensor camera-based person re-identification apparatus and method.

Research is actively under way on person re-identification (hereinafter, reID) technology, which searches for the same person photographed in different environments.

FIG. 1 illustrates the concept of person re-identification, and FIG. 2 shows an example of person tracking technology, one of the fields that uses person re-identification.

As shown in FIG. 1, person re-identification is a technique for identifying and detecting the same person photographed under different conditions. Its goal is to accurately detect images containing the same person even when environmental conditions change, such as changes in posture, background, lighting, and shooting distance and angle.

As shown in FIG. 2, such person re-identification technology can be used in various fields that search for and track a specific person across many images, such as searching for a missing person or tracking a criminal.

Such person re-identification technology, mainly using an artificial neural network, extracts from an image features that represent a person's identity rather than a specific part of the person, and identifies the same person based on the extracted features. That is, without being limited to a person's face shape or to a specific posture, background, shooting angle, or lighting, it extracts features that represent each individual's identity and re-identifies the person, so that the same person can be detected in many different images. However, existing person re-identification has mainly focused on re-identifying the same person in images acquired by sensor cameras of the same type.

FIG. 3 shows an example of images acquired by heterogeneous sensor cameras, and FIG. 4 illustrates the concept of person re-identification based on heterogeneous sensor cameras.

Among the three images at the top of FIG. 3, the two on the left are images acquired by a general visible camera (RGB camera) during the day and at night, respectively, and the image on the right is an image acquired at night by an infrared (hereinafter, IR) camera.

The middle-left image shows a person region extracted from an image acquired by a visible camera during the day, and the multiple images to its right are the images to be compared for re-identification. The bottom-right image shows a person region extracted from an image acquired by an IR camera at night, and the multiple images to its left are the images to be compared for re-identification.

As shown in FIG. 3, a visible camera can acquire images in which a person is identifiable during the day, but at night the person cannot be distinguished from the surroundings and therefore cannot be identified. In contrast, an IR camera, which uses an IR sensor unlike a visible camera, can accurately identify a person at night.

Therefore, as shown in FIG. 4, to identify and track a person regardless of environmental changes, it must be possible to re-identify the person across images acquired by different heterogeneous sensor cameras.

FIG. 5 is a diagram for explaining the misalignment problem in person re-identification.

In FIG. 5, images located in the same row at the top and bottom are images of the same object, that is, the same person. As shown in FIG. 5, even in images acquired by the same visible camera, a person can appear very differently depending on conditions, which can make it difficult to identify the same person across the images. However, for images acquired by the same visible camera, a person can still be re-identified based on features such as color distribution.

Unlike a visible camera, however, an IR camera captures no color information, so it is not easy to compare the features of a color image acquired by a visible camera with the features of an IR image acquired by an IR camera. That is, the features that can be compared for re-identification are limited. To overcome this modality difference, conventional methods embed both the RGB image and the IR image in a virtual common feature space, and train by extracting and comparing the embedded features at the image or part level. However, embedding in a common feature space in this way can cause problems from object misalignment due to dense matching errors between feature points.

Conventionally, when re-identifying a person in images acquired by different heterogeneous sensor cameras, the work concentrated on learning features at the image or part level under the assumption that the people in the images are aligned with each other. That is, training was performed on the assumption that the two images are aligned to at least a certain degree, even if only roughly. For example, the images are divided into predetermined horizontal grids, and training assumes that the body parts (for example, shoulders and knees) contained in each grid cell of the two images are aligned so as to correspond to each other.

However, as in FIG. 5, even in images of the same object, the position, size, and pose of the object within the image often differ greatly. If training is performed while the two images are not aligned, the misalignment prevents training from proceeding properly, which can cause large person re-identification errors. As described above, this misalignment problem can be even more severe when re-identifying objects between images of two modalities with different characteristics.

In addition, methods that embed in a common feature space tend to overlook fine-grained local information that could greatly help object re-identification, which limits re-identification performance.

Korean Patent Laid-Open Publication No. 10-2019-0078270 (published July 4, 2019)

An object of the present invention is to provide a person re-identification apparatus and method capable of accurately re-identifying a person from images acquired by heterogeneous sensor cameras.

Another object of the present invention is to provide a person re-identification apparatus and method capable of accurately re-identifying a person even in unaligned images by using dense matching between cross-modality object images.

To achieve the above objects, a person re-identification apparatus according to an embodiment of the present invention includes: a feature pooling unit that receives first and second images acquired using different heterogeneous sensors, extracts features according to a learned method to obtain first and second feature maps, and pools them in a predetermined manner to obtain first and second descriptors; a cross-modal alignment unit, provided during training, that receives the first and second feature maps, estimates mutually corresponding pixels, performs mutual warping based on the positions of the estimated corresponding pixels to obtain a first warped feature map, in which the second feature map is warped to correspond to the first feature map, and a second warped feature map, in which the first feature map is warped to correspond to the second feature map, and pools the first and second warped feature maps to obtain first and second warped descriptors; a re-identification unit that, according to a learned method, obtains first and second score maps and first and second warped score maps representing the probability that each of the first and second descriptors and the first and second warped descriptors corresponds to each of a plurality of predetermined classes, and obtains the identifier of the class with the highest probability in the obtained first and second score maps; and a loss calculation unit, provided during training, that calculates a dense triplet loss by dividing the difference between the first feature map and the first warped feature map and the difference between the second feature map and the second warped feature map into positives, where the identifiers are the same, and negatives, where the identifiers differ.

The cross-modal alignment unit may include: a similarity measurement unit that calculates the pixel-wise similarity between the first and second feature maps in a predetermined manner to obtain a similarity map; a matching probability estimation unit that calculates a matching probability for each pixel of the first and second feature maps based on the similarity map to obtain a matching probability map; a soft warping unit that uses the matching probability map to aggregate the pixels of each of the first and second feature maps, weighted by the probability that they match each pixel of the counterpart feature map, and thereby warps each feature map onto its counterpart to obtain the first and second warped feature maps; and a warping pooling unit that pools the first and second warped feature maps in a predetermined manner to obtain first and second warped descriptors.
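The four steps of the cross-modal alignment unit (similarity map, matching probabilities, soft warping, pooling) can be sketched roughly as follows. This is a minimal pure-Python illustration and not the patent's implementation: the text only says each step follows a predetermined manner, so the cosine similarity, the softmax with a scale parameter `beta`, and the mean pooling used here are assumptions, as are all function names. Feature maps are flattened to lists of per-pixel vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def softmax(scores, beta=1.0):
    """Softmax over similarities; beta is a hypothetical sharpening factor."""
    m = max(scores)
    exps = [math.exp(beta * (s - m)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_warp(target, source, beta=50.0):
    """Warp `source` onto the pixel grid of `target`.

    target, source: lists of per-pixel feature vectors (flattened h*w grid).
    Each output pixel is the probability-weighted sum of all source pixels.
    """
    d = len(source[0])
    warped = []
    for t in target:
        sims = [cosine(t, s) for s in source]   # one row of the similarity map
        probs = softmax(sims, beta)             # matching probabilities for this pixel
        warped.append([sum(p * s[k] for p, s in zip(probs, source)) for k in range(d)])
    return warped

def mean_pool(feature_map):
    """Average over pixel positions to get a d-dimensional descriptor (assumed pooling)."""
    n = len(feature_map)
    d = len(feature_map[0])
    return [sum(px[k] for px in feature_map) / n for k in range(d)]

# Two toy 2-pixel feature maps whose contents are spatially swapped.
f_rgb = [[1.0, 0.0], [0.0, 1.0]]
f_ir = [[0.0, 2.0], [2.0, 0.0]]
f_rgb_warped = soft_warp(f_rgb, f_ir)       # f_IR rearranged to line up with f_RGB
warped_descriptor = mean_pool(f_rgb_warped)
```

In this toy case the IR pixels are reordered so that each one sits at the position of its best-matching RGB pixel, which is the effect the mutual warping is meant to achieve on unaligned images.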

The cross-modal alignment unit may further include a mask providing unit that obtains first and second masks by normalizing the magnitude of the pixel at each position of the first and second feature maps.

The soft warping unit may receive the first and second masks, aggregate the pixels of each of the first and second feature maps weighted by the probability that they match each pixel of the counterpart feature map, and perform the warping by additionally weighting the aggregated warped pixel values by the pixel values of the mask.
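One possible reading of the mask step is sketched below. The text does not say how the per-position magnitude is normalized, so the L2 norm followed by min-max scaling to [0, 1] is an assumption, and `magnitude_mask` and `apply_mask` are hypothetical names.

```python
import math

def magnitude_mask(feature_map):
    """Per-position L2 norm, min-max scaled to [0, 1] (assumed normalization)."""
    norms = [math.sqrt(sum(x * x for x in px)) for px in feature_map]
    lo, hi = min(norms), max(norms)
    if hi == lo:
        return [1.0 for _ in norms]
    return [(n - lo) / (hi - lo) for n in norms]

def apply_mask(warped, mask):
    """Scale each warped pixel vector by the mask value at the same position."""
    return [[m * x for x in px] for px, m in zip(warped, mask)]

# Toy flattened feature map with per-pixel norms 5, 0 and 1.
f_rgb = [[3.0, 4.0], [0.0, 0.0], [0.6, 0.8]]
mask = magnitude_mask(f_rgb)
masked = apply_mask([[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]], mask)
```

Positions with weak activations, which are more likely to be background, contribute less to the warped result under this weighting.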

The loss calculation unit may obtain first and second common attention maps that emphasize the person regions visible in both the first and second images, and may calculate the dense triplet loss by accumulating two sets of differences. In the first common attention map, these are the differences between the first feature map, used as the anchor, and each of a positive, which is a first warped feature map estimated to have the same identifier as the anchor, and a negative, which is a first warped feature map estimated to have a different identifier. In the second common attention map, these are the differences between the second feature map, used as the anchor, and each of a positive, which is a second warped feature map estimated to have the same identifier, and a negative, which is a second warped feature map estimated to have a different identifier.
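A rough sketch of how such a per-pixel triplet term might be accumulated for one modality is given below. The hinge form, the Euclidean distance, and the `margin` value are assumptions chosen for illustration; the text only states that anchor-positive and anchor-negative differences are accumulated within the common attention map.

```python
import math

def l2(u, v):
    """Euclidean distance between two per-pixel feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dense_triplet_loss(anchor, warped_pos, warped_neg, attention, margin=0.3):
    """Accumulate a per-pixel triplet hinge, weighted by the common attention map.

    anchor:      original feature map (list of per-pixel vectors)
    warped_pos:  warped feature map from an image with the same identifier
    warped_neg:  warped feature map from an image with a different identifier
    attention:   per-pixel weights, high where the person is visible in both images
    """
    loss = 0.0
    for a, p, n, w in zip(anchor, warped_pos, warped_neg, attention):
        loss += w * max(0.0, l2(a, p) - l2(a, n) + margin)
    return loss

anchor = [[1.0, 0.0], [0.0, 1.0]]
warped_pos = [[1.0, 0.0], [0.0, 1.0]]   # same identity: warps back onto the anchor
warped_neg = [[0.0, 1.0], [0.0, 1.0]]   # different identity
attention = [1.0, 0.0]                  # only the first pixel is visible in both images
loss = dense_triplet_loss(anchor, warped_pos, warped_neg, attention)
```

The attention weights keep pixels that are occluded or absent in one of the images from contributing to the loss.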

The loss calculation unit may further calculate an identification loss, based on the difference between each of the first and second score maps and the identifiers labeled on the first and second images supplied during training, and a consistency loss, based on the difference between the first and second score maps and the first and second warped score maps. Training may then be performed by backpropagating a total loss calculated as a weighted sum of the identification loss, the consistency loss, and the dense triplet loss.

The loss calculation unit may obtain the identification loss by calculating the negative cross-entropy between the first and second score maps and the identifiers labeled on the first and second images, and may obtain the consistency loss by calculating the negative cross-entropy between each of the first and second score maps and the corresponding one of the first and second warped score maps.
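The loss combination can be illustrated for a single image with the sketch below. The weights `w_id`, `w_ic`, and `w_dt` are hypothetical placeholders (the text does not give values), and the direction of the cross-entropy in the consistency term, with the original score map as the target, is an assumption.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_c target[c] * log(predicted[c])."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def one_hot(label, num_classes):
    """Turn an integer identity label into a one-hot target distribution."""
    return [1.0 if c == label else 0.0 for c in range(num_classes)]

def total_loss(score, warped_score, label, dense_triplet,
               w_id=1.0, w_ic=1.0, w_dt=0.5):
    """Weighted sum of identification, consistency and dense triplet terms."""
    id_loss = cross_entropy(one_hot(label, len(score)), score)   # score map vs. label
    ic_loss = cross_entropy(score, warped_score)                 # score map vs. warped score map
    return w_id * id_loss + w_ic * ic_loss + w_dt * dense_triplet

score = [0.5, 0.25, 0.25]          # class probabilities from the original descriptor
warped_score = [0.5, 0.25, 0.25]   # class probabilities from the warped descriptor
loss = total_loss(score, warped_score, label=0, dense_triplet=0.0)
```

In an actual training loop the same terms would be computed for both modalities and summed before backpropagation.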

To achieve the above objects, a person re-identification method according to another embodiment of the present invention includes: a training step; and a person re-identification step of receiving, using an artificial neural network, images acquired using different heterogeneous sensors and obtaining an identifier for the person contained in each image. The training step includes: receiving, as training data, first and second images that were acquired using different heterogeneous sensors and pre-labeled with identifiers, extracting features according to a learned method to obtain first and second feature maps, and pooling them in a predetermined manner to obtain first and second descriptors; receiving the first and second feature maps, estimating mutually corresponding pixels, performing mutual warping based on the positions of the estimated corresponding pixels to obtain a first warped feature map, in which the second feature map is warped to correspond to the first feature map, and a second warped feature map, in which the first feature map is warped to correspond to the second feature map, and pooling the first and second warped feature maps to obtain first and second warped descriptors; obtaining first and second score maps and first and second warped score maps representing the probability that each of the first and second descriptors and the first and second warped descriptors corresponds to each of a plurality of predetermined classes, and obtaining the identifier of the class with the highest probability in the obtained first and second score maps; and calculating a loss by dividing the difference between the first feature map and the first warped feature map and the difference between the second feature map and the second warped feature map into positives, where the identifiers are the same, and negatives, where the identifiers differ.

Accordingly, the person re-identification apparatus and method according to embodiments of the present invention perform training based on feature maps extracted from images acquired by heterogeneous sensor cameras, together with warped feature maps obtained by warping mutually corresponding pixels between those feature maps. In particular, by calculating a dense triplet loss between the feature maps and the warped feature maps and reflecting it in training, fine-grained local information can be extracted even from unaligned images, so that a person can be accurately re-identified even from images acquired by various cameras under very different conditions.

FIG. 1 illustrates the concept of person re-identification.
FIG. 2 shows an example of person tracking technology, one of the fields that uses person re-identification.
FIG. 3 shows an example of images acquired by heterogeneous sensor cameras.
FIG. 4 illustrates the concept of person re-identification based on heterogeneous sensor cameras.
FIG. 5 is a diagram for explaining the misalignment problem in person re-identification.
FIG. 6 shows the schematic structure of a heterogeneous sensor camera-based person re-identification apparatus according to an embodiment of the present invention.
FIG. 7 is a diagram for explaining the operation of the person re-identification apparatus of FIG. 6.
FIG. 8 is a diagram for explaining the common attention map.
FIG. 9 shows how performance changes with the loss configuration of the person re-identification apparatus.
FIG. 10 shows simulation results for the performance of the person re-identification apparatus according to the present embodiment.
FIG. 11 illustrates a heterogeneous sensor camera-based person re-identification method according to an embodiment of the present invention.

To fully understand the present invention, its operational advantages, and the objects achieved by practicing it, reference should be made to the accompanying drawings, which illustrate preferred embodiments of the present invention, and to the content described in those drawings.

Hereinafter, the present invention is described in detail by describing preferred embodiments with reference to the accompanying drawings. However, the present invention may be embodied in many different forms and is not limited to the embodiments described. To describe the present invention clearly, parts unrelated to the description are omitted, and the same reference numerals in the drawings denote the same members.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "... unit", "... device", "module", and "block" used in the specification mean a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 6 shows the schematic structure of a heterogeneous sensor camera-based person re-identification apparatus according to an embodiment of the present invention, FIG. 7 is a diagram for explaining the operation of the person re-identification apparatus of FIG. 6, and FIG. 8 is a diagram for explaining the common attention map.

Referring to FIGS. 6 and 7, the heterogeneous sensor camera-based person re-identification apparatus of the present embodiment may include a heterogeneous image acquisition unit 100, a feature extraction unit 200, a pooling unit 300, a re-identification unit 400, a cross-modal alignment unit 500, and a loss calculation unit 600.

The heterogeneous image acquisition unit 100 acquires heterogeneous images captured by different sensor cameras. It may acquire a first image and a second image; as an example, the description here assumes that the first image is an RGB image acquired using a visible camera and the second image is an IR image acquired using an IR camera. The heterogeneous image acquisition unit 100 may include a first image acquisition unit 110 that acquires the first image and a second image acquisition unit 120 that acquires the second image. However, the heterogeneous sensor camera-based person re-identification apparatus of the present embodiment may also acquire images using types of sensor cameras other than visible or IR cameras. The first and second images may each be a single-frame image or an image composed of multiple frames, and they may contain the same person or different people.

The first image acquisition unit 110 and the second image acquisition unit 120 may each be implemented as heterogeneous sensor cameras containing different sensors, or as a storage device that stores images acquired using heterogeneous sensor cameras, or as a communication module that receives images acquired externally using heterogeneous sensor cameras.

The feature extraction unit 200 extracts features from each of the heterogeneous images acquired by the heterogeneous image acquisition unit 100 to obtain feature maps. The feature extraction unit 200 may include a first feature extraction unit 210 and a second feature extraction unit 220, each implemented as an artificial neural network such as a convolutional neural network (CNN). The first feature extraction unit 210 receives the first image acquired by the first image acquisition unit 110 and extracts features according to a pre-learned method to obtain a first feature map (f_RGB), and the second feature extraction unit 220 receives the second image acquired by the second image acquisition unit 120 and extracts features according to a pre-learned method to obtain a second feature map (f_IR).

Here, the first feature map (f_RGB ∈ ℝ^{h×w×d}) and the second feature map (f_IR ∈ ℝ^{h×w×d}) can be extracted with size h × w × d, having the same height (h), width (w), and number of channels (d).

The pooling unit 300 receives the first and second feature maps (f_RGB, f_IR) obtained by the feature extraction unit 200 and pools each in a predetermined manner to obtain first and second descriptors. The pooling unit 300 may include a first pooling unit 310 and a second pooling unit 320. The first pooling unit 310 receives and pools the first feature map (f_RGB) to obtain the first descriptor (ø(f_RGB)), and the second pooling unit 320 receives and pools the second feature map (f_IR) to obtain the second descriptor (ø(f_IR)). Each of the first and second descriptors (ø(f_RGB), ø(f_IR)) is a one-dimensional vector whose length corresponds to the number of channels (d), and can be regarded as a feature descriptor that allows the person in the image to be identified.
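As a small illustration of how an h × w × d feature map collapses into a d-dimensional descriptor, the sketch below uses global average pooling over the spatial positions. The specific pooling method is not stated in the text, so average pooling is an assumption, and `global_average_pool` is a hypothetical name.

```python
def global_average_pool(feature_map):
    """Collapse an h x w x d nested list into a d-dimensional descriptor."""
    h = len(feature_map)
    w = len(feature_map[0])
    d = len(feature_map[0][0])
    return [
        sum(feature_map[i][j][k] for i in range(h) for j in range(w)) / (h * w)
        for k in range(d)
    ]

# Toy 2 x 2 x 3 feature map (h = 2, w = 2, d = 3).
f_rgb = [
    [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]],
    [[1.0, 4.0, 2.0], [3.0, 0.0, 2.0]],
]
descriptor = global_average_pool(f_rgb)
```

The output length equals the channel count d regardless of h and w, which is what allows descriptors from differently framed images to be compared directly.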

Here, the feature extraction unit 200 and the pooling unit 300 may be integrated into a single feature pooling unit.

The re-identification unit 400 receives the first and second descriptors (ø(f_RGB), ø(f_IR)), which are feature descriptors of the people in the images, and re-identifies the people contained in the first and second images. As one example, the re-identification unit 400 may measure the similarity between the first and second descriptors (ø(f_RGB), ø(f_IR)) to determine whether the people in the first and second images are the same person. Alternatively, the re-identification unit 400 may be configured to obtain an identifier by determining which of a plurality of classes, set to correspond to a plurality of predetermined people, corresponds to each of the first and second descriptors (ø(f_RGB), ø(f_IR)). That is, the re-identification unit 400 may obtain the identifier of the person contained in each heterogeneous image and check whether they are the same person.
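The similarity-based variant might look like the sketch below. Cosine similarity and the `threshold` value are assumptions for illustration only; the text does not specify the similarity measure or any decision rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two descriptors (assumed similarity measure)."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def is_same_person(desc_a, desc_b, threshold=0.8):
    """Decide identity match by thresholding descriptor similarity (hypothetical rule)."""
    return cosine(desc_a, desc_b) >= threshold

desc_rgb = [1.0, 0.0, 1.0]         # descriptor from the visible image
desc_ir_match = [2.0, 0.1, 2.0]    # points in nearly the same direction
desc_ir_other = [0.0, 1.0, 0.0]    # orthogonal to desc_rgb
match = is_same_person(desc_rgb, desc_ir_match)
mismatch = is_same_person(desc_rgb, desc_ir_other)
```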

Here, as an example, the re-identification unit 400 may be implemented as an artificial neural network that includes a fully connected layer and an activation function such as softmax. It obtains a score map representing the probability that each of the first and second descriptors (ø(f_RGB), ø(f_IR)) corresponds to each of a plurality of predetermined classes, and selects an identifier by finding the class with the highest probability in the obtained score map.
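The classification variant can be sketched as a single fully connected layer followed by softmax and an argmax. The weights below are made-up toy values; in the apparatus they would be learned, and the function names are hypothetical.

```python
import math

def fully_connected(descriptor, weights, biases):
    """One logit per class: weights[c] . descriptor + biases[c]."""
    return [sum(w * x for w, x in zip(row, descriptor)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Convert logits into class probabilities (the score map)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_identity(descriptor, weights, biases):
    """Return the index of the most probable class and the full score map."""
    scores = softmax(fully_connected(descriptor, weights, biases))
    best = max(range(len(scores)), key=lambda c: scores[c])
    return best, scores

# Toy classifier with 3 identity classes over 2-dimensional descriptors.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
identity, scores = predict_identity([0.2, 3.0], weights, biases)
```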

In FIG. 6, the feature extraction unit 200, the pooling unit 300, and the re-identification unit 400 are shown as separate components for convenience of description; however, they may be implemented as layers performing different functions within a single re-identification artificial neural network for re-identifying a person.

In actual use, the person re-identification apparatus may consist only of the heterogeneous image acquisition unit 100, the feature extraction unit 200, the pooling unit 300, and the re-identification unit 400. However, the feature extraction unit 200, the pooling unit 300, and the re-identification unit 400, which are implemented as artificial neural networks, must be trained before the apparatus is deployed. Accordingly, the person re-identification apparatus of this embodiment may further include a cross-modal alignment unit 500 and a loss calculation unit 600 during training, and these may be removed once training is complete.

The cross-modal alignment unit 500 receives the first and second feature maps (f_RGB, f_IR) extracted from the heterogeneous images, estimates mutually corresponding pixels based on the pixel-wise similarity between the first feature map (f_RGB) and the second feature map (f_IR), and performs mutual warping based on the positions of the estimated corresponding pixels to obtain first and second warped feature maps (f̂_RGB, f̂_IR). Here, the first warped feature map (f̂_RGB) is obtained by warping the second feature map (f_IR) extracted from the second image, and the second warped feature map (f̂_IR) is obtained by warping the first feature map (f_RGB) extracted from the first image. That is, each of the first and second warped feature maps (f̂_RGB, f̂_IR) is obtained across modalities, by warping the feature map extracted from the other heterogeneous image.

Specifically, the cross-modal alignment unit 500 may include a similarity measurement unit 510, a matching probability estimation unit 520, a mask providing unit 530, a soft warping unit 540, and a warping pooling unit 550.

The similarity measurement unit 510 receives the first and second feature maps (f_RGB, f_IR) and computes the pixel-wise similarity between them to obtain a similarity map (C(p,q)). Specifically, the similarity measurement unit 510 may obtain the similarity map (C(p,q)) by computing, based on cosine similarity as in Equation 1, the similarity between the pixel vector at each position (p) on the h × w plane of the first feature map (f_RGB) and the pixel vector at each position (q) on the h × w plane of the second feature map (f_IR), where each pixel vector is a one-dimensional vector of length d along the channel direction.

Here, f_RGB(p) denotes the pixel vector at position p of the first feature map (f_RGB), f_IR(q) denotes the pixel vector at position q of the second feature map (f_IR), ∥·∥₂ denotes the L2 norm, and T denotes the transpose.
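Consistent with these definitions (cosine similarity, L2 norm, transpose), Equation 1 can be written in the following form. This is a reconstruction from the surrounding description rather than a verbatim copy of the published equation.

```latex
C(p,q) = \frac{f_{RGB}(p)^{T}\, f_{IR}(q)}
              {\lVert f_{RGB}(p) \rVert_{2}\, \lVert f_{IR}(q) \rVert_{2}}
```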

The matching probability estimation unit 520 receives the similarity map (C(p,q)) obtained by the similarity measurement unit 510 and computes a matching probability for each pixel. Based on the similarity between the pixel vectors of the first feature map (f_RGB) and the second feature map (f_IR), the matching probability estimation unit 520 computes, according to Equation 2, a matching probability map (P(p,q)) of size h × w × h × w that represents the matching probability for each pair of pixel positions in the first feature map (f_RGB) and the second feature map (f_IR).

Here, β is a temperature parameter.
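A matching probability with a temperature parameter is commonly realized as a softmax over the similarity scores. The sketch below assumes that form, with normalization over the positions q of the second feature map; the normalization axis, the value of β, the flattened list-of-pixel-vectors layout, and all function names are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two non-zero pixel vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def matching_probability(f_rgb, f_ir, beta):
    """Return P, where P[p][q] is the softmax over q of beta * C(p, q).

    f_rgb and f_ir are lists of pixel vectors, i.e. the h x w positions
    of each feature map flattened into one list.
    """
    probs = []
    for fp in f_rgb:
        sims = [beta * cosine_similarity(fp, fq) for fq in f_ir]
        m = max(sims)  # subtract the maximum for numerical stability
        exps = [math.exp(s - m) for s in sims]
        z = sum(exps)
        probs.append([e / z for e in exps])
    return probs

# Toy feature maps with two positions and two channels each.
f_rgb = [[1.0, 0.0], [0.0, 1.0]]
f_ir = [[0.9, 0.1], [0.1, 0.9]]
P = matching_probability(f_rgb, f_ir, beta=10.0)
```

A larger β sharpens each row of P toward a single best match, while a smaller β spreads the probability over more candidate positions.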

The mask providing unit 530 receives the first and second feature maps (f_RGB, f_IR), obtains a first mask (M_RGB) corresponding to the first feature map (f_RGB) and a second mask (M_IR) corresponding to the second feature map (f_IR), and outputs them to the soft warping unit 540.

Under the assumption that, as training proceeds, the region corresponding to the person in the first and second images (that is, the foreground) yields more salient features than the remaining region, the mask providing unit 530 may obtain, according to Equation 3, a first mask (M_RGB) emphasizing the person region from the first feature map (f_RGB), which is obtained from a first image that has no label for the actual person region.

Here, f is the min-max normalization function defined by Equation 4.
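Given that the mask is obtained by normalizing the magnitude of each pixel vector, Equations 3 and 4 can be written in the following form. This is a reconstruction from the surrounding description; taking the minimum and maximum over all positions of the feature map is an assumption.

```latex
M_{RGB}(p) = f\left( \lVert f_{RGB}(p) \rVert_{2} \right),
\qquad
f(x) = \frac{x - \min(x)}{\max(x) - \min(x)}
```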

Similarly, the mask providing unit 530 may obtain a second mask (M_IR) corresponding to the person region from the second feature map (f_IR) obtained from the second image.

That is, the mask providing unit 530 obtains each mask by normalizing the magnitudes of the pixel vectors at each position of the first and second feature maps (f_RGB, f_IR). A binary mask would be effective when the foreground and background can be clearly separated, but could cause large errors when they cannot. To avoid such errors, the mask providing unit 530 of this embodiment normalizes the pixel-vector magnitudes at each position of the first and second feature maps (f_RGB, f_IR) to obtain first and second masks (M_RGB, M_IR) in which the foreground region is emphasized relative to the background region.

Meanwhile, based on the pixel-wise matching probability map (P(p,q)) obtained according to Equation 2, the soft warping unit 540 soft-warps each pixel of the first and second feature maps (f_RGB, f_IR) to the position of its corresponding pixel, thereby obtaining the first and second warped feature maps (f̂_RGB, f̂_IR).

At this time, the soft warping unit 540 may use the first and second masks (M_RGB, M_IR) provided by the mask providing unit 530 to concentrate the warping on the foreground region corresponding to the person in the first and second feature maps (f_RGB, f_IR).

The soft warping unit 540 may obtain the first warped feature map (f̂_RGB) by performing soft warping with the first mask (M_RGB) according to Equation 5, so that the pixels of the second feature map (f_IR) are warped with emphasis on the foreground region corresponding to the person in the first feature map (f_RGB).

Here, W is the soft warping function defined by Equation 6.
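Equations 5 and 6 can be written in the following form, in which the mask blends the warped pixel vector with the original one and the soft warping function is a matching-probability-weighted sum. This is a reconstruction from the surrounding description, and the hat notation for the warped feature map is assumed.

```latex
\hat{f}_{RGB}(p) = M_{RGB}(p)\, W\!\left(f_{IR}(p)\right)
                 + \left(1 - M_{RGB}(p)\right) f_{RGB}(p)

W\!\left(f_{IR}(p)\right) = \sum_{q} P(p,q)\, f_{IR}(q)
```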

According to Equations 5 and 6, the soft warping unit 540 uses the soft warping function (W) to aggregate the pixel vectors at each position (q) of the second feature map (f_IR), weighted by the matching probability, and warp the result to the corresponding pixel position (p) of the first feature map (f_RGB). In the person region designated by the first mask (M_RGB), the matching-probability-weighted pixel vectors of the second feature map (f_IR) are emphasized, while in the remaining region the pixel vectors of the first feature map (f_RGB) are preserved as far as possible. That is, the first warped feature map (f̂_RGB) is obtained such that the foreground region corresponding to the person in the first feature map (f_RGB) is dominated by the pixel vectors aggregated and warped from the second feature map (f_IR) according to the matching probability, whereas the remaining background region is dominated by the pixel vectors of the first feature map (f_RGB).
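The masked soft warping described above can be sketched as follows. The code assumes that the mask is the min-max normalized L2 norm of each pixel vector and that the warped map is a per-position blend of the warped and original pixel vectors; the function names and the flattened list layout are hypothetical.

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def min_max_mask(feature_map):
    """Soft mask from min-max normalized L2 norms of the pixel vectors."""
    norms = [l2_norm(v) for v in feature_map]
    lo, hi = min(norms), max(norms)
    span = hi - lo
    # Assumption: a map whose norms are all equal yields an all-zero mask.
    return [(n - lo) / span if span > 0 else 0.0 for n in norms]

def soft_warp(P, source):
    """W(source)(p) = sum over q of P[p][q] * source[q]."""
    dim = len(source[0])
    return [[sum(P[p][q] * source[q][c] for q in range(len(source)))
             for c in range(dim)]
            for p in range(len(P))]

def warped_feature_map(f_target, f_source, P):
    """Blend: mask * warped source + (1 - mask) * original target."""
    mask = min_max_mask(f_target)
    warped = soft_warp(P, f_source)
    return [[m * w + (1.0 - m) * t for w, t in zip(wv, tv)]
            for m, wv, tv in zip(mask, warped, f_target)]

# Toy example: position 0 is "foreground" (large norm), position 1 is not.
f_rgb = [[3.0, 4.0], [0.0, 1.0]]
f_ir = [[1.0, 1.0], [2.0, 2.0]]
P = [[0.5, 0.5], [1.0, 0.0]]
f_hat_rgb = warped_feature_map(f_rgb, f_ir, P)
```

In this toy case the foreground position takes the probability-weighted average of the infrared pixel vectors, while the background position keeps its original RGB pixel vector.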

Likewise, using the soft warping function and the second mask (M_IR) in a manner similar to Equations 5 and 6, the soft warping unit 540 obtains the second warped feature map (f̂_IR) such that the foreground region corresponding to the person in the second feature map (f_IR) is dominated by the pixel vectors warped from the first feature map (f_RGB), whereas the remaining background region is dominated by the pixel vectors of the second feature map (f_IR).

In this embodiment, the cross-modal alignment unit 500 uses the soft warping function (W) rather than the argmax function commonly used for pixel matching between different images. With argmax, matching is performed in a hard manner, so an explicit corresponding pixel can be assigned; however, because of texture differences or occlusion between the images, correspondences are frequently sparse and scattered, or do not exist at all. In particular, when detecting corresponding pixels between the first and second feature maps (f_RGB, f_IR) extracted from heterogeneous images, as in this embodiment, more weight should be placed on the coarse shape of the image than on fine details such as color or texture. For this reason, this embodiment warps pixels to their corresponding positions in the heterogeneous images using the soft warping function (W) instead of the argmax function.

The warping pooling unit 550 pools each of the first warped feature map (f̂_RGB) and the second warped feature map (f̂_IR) obtained by the soft warping unit 540, in the same manner as the first and second pooling units 310 and 320, to obtain a first warped descriptor (ø(f̂_RGB)) and a second warped descriptor (ø(f̂_IR)).

The re-identification unit 400 then receives the first and second warped descriptors (ø(f̂_RGB), ø(f̂_IR)) and, as with the first and second descriptors (ø(f_RGB), ø(f_IR)), may extract an identifier for each of them.

The loss calculation unit 600 calculates the loss arising during training in a predetermined manner and backpropagates the calculated loss, thereby training the feature extraction unit 200, the pooling unit 300, the re-identification unit 400, and the cross-modal alignment unit 500.

In this embodiment, the loss calculation unit 600 may calculate an identity loss (L_ID), an identity consistency loss (L_IC), and a dense triplet loss (L_DT), and perform training by backpropagating a total loss (L) obtained by combining the three.

The loss calculation unit 600 may first obtain the identity loss (L_ID) based on the first and second descriptors (ø(f_RGB), ø(f_IR)). The identity loss (L_ID) may be calculated from the difference between the first and second score maps obtained from the first and second descriptors (ø(f_RGB), ø(f_IR)) and the person identifiers with which the first and second images input during training are labeled in advance. That is, the person re-identification apparatus of this embodiment may be trained by supervised learning, receiving as training data heterogeneous images pre-labeled with person identifiers.

The loss calculation unit 600 may calculate the identity loss (L_ID) as the negative cross-entropy between the first and second score maps and the identifiers supplied as training data. Since calculating an identity loss based on negative cross-entropy is a well-known technique, it is not described in detail here.

Meanwhile, the loss calculation unit 600 calculates the identity consistency loss (L_IC), which expresses the requirement that, if the foreground regions of the first and second feature maps (f_RGB, f_IR) obtained from the heterogeneous images have been warped properly by the cross-modal alignment unit 500, the identifiers extracted from the first and second score maps must be the same as the identifiers extracted from the first and second warped score maps.

Because training ultimately makes the identifiers extracted from the first and second score maps equal to the identifiers labeled in the training data, the loss calculation unit 600 may calculate the identity consistency loss (L_IC) in the same manner as the identity loss (L_ID), based on the first and second warped score maps and the identifiers labeled in the training data.

However, since the identity consistency loss (L_IC) represents the difference between the identifiers extracted from the first and second score maps and those extracted from the first and second warped score maps, rather than the identifiers of the actual images, the loss calculation unit 600 may instead calculate the identity consistency loss (L_IC) as the negative cross-entropy between the first and second score maps and the first and second warped score maps.
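The two classification losses can be sketched as follows. The code reads "negative cross-entropy" as the standard form, the negative sum of target probability times the log of the predicted probability, which is an interpretation. The toy score maps, the label, and the choice of the original score map as the target in the second consistency variant are assumptions.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """-sum_k target[k] * log(predicted[k]); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def one_hot(label, num_classes):
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

score = [0.7, 0.2, 0.1]         # score map from the original descriptor
warped_score = [0.6, 0.3, 0.1]  # score map from the warped descriptor
label = 0                       # identifier labeled in the training data

# Identity loss: score map against the labeled identifier.
loss_id = cross_entropy(one_hot(label, 3), score)

# Consistency loss, variant 1: warped score map against the label.
loss_ic_label = cross_entropy(one_hot(label, 3), warped_score)

# Consistency loss, variant 2: warped score map against the score map.
loss_ic_soft = cross_entropy(score, warped_score)
```

Both variants penalize a warped descriptor that is classified differently from the original; the first anchors it to the ground-truth label, the second to the prediction made from the unwarped descriptor.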

The loss calculation unit 600 also receives the first and second feature maps (f_RGB, f_IR) obtained by the feature extraction unit 200 and the first and second warped feature maps (f̂_RGB, f̂_IR) obtained by the soft warping unit 540, obtains first and second co-attention maps based on each feature map and its corresponding warped feature map, and calculates the dense triplet loss (L_DT) using the obtained first and second co-attention maps.

The most direct way to measure the error between the first and second feature maps (f_RGB, f_IR) and the first and second warped feature maps (f̂_RGB, f̂_IR) reconstructed by soft warping is to apply the L2 norm and compute the L2 distance between corresponding pixels. However, this does not account for occluded regions and the like. Therefore, in this embodiment, the loss calculation unit 600 obtains a co-attention map that emphasizes the person region visible in both heterogeneous images, and calculates the error between each feature map and its warped feature map on the co-attention map to obtain the dense triplet loss (L_DT).

First, the first co-attention map (A_RGB) for the first image may be obtained according to Equation 7.

According to Equation 7, the first co-attention map (A_RGB(p)) is a map formed by the intersection of the first mask (M_RGB) and the warped second mask (M_IR).
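If the intersection of two soft masks is taken as their element-wise product, and the second mask is warped with the same soft warping function W used for the feature maps, Equation 7 can be written as follows. Both choices are assumptions drawn from the surrounding description, so this is a reconstruction rather than the published equation.

```latex
A_{RGB}(p) = M_{RGB}(p)\, W\!\left(M_{IR}(p)\right),
\qquad
W\!\left(M_{IR}(p)\right) = \sum_{q} P(p,q)\, M_{IR}(q)
```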

Likewise, the second co-attention map (A_IR) for the second image may be obtained in the same manner.

In FIG. 8, the left image shows the second mask (M_IR), the right image shows the first mask (M_RGB), and the middle image shows the second co-attention map (A_IR). As shown in FIG. 8, the second co-attention map (A_IR), which represents the intersection of the second mask (M_IR) and the warped first mask (M_RGB), emphasizes only the regions visible in both the first and second images, even though part of the person in the first image is occluded, and suppresses regions that are not visible in at least one of the images (for example, the knees).

Once the first and second co-attention maps (A_RGB, A_IR) are obtained, the loss calculation unit 600 calculates the loss using a triplet learning scheme, which compares, relative to an anchor, a positive representing a feature map with the same identifier and a negative representing a feature map with a different identifier. Here, the positive and the negative are obtained from images of the other modality than the anchor (for example, when the anchor is a first image, the positive and the negative are second images of the other modality).

For example, when the first feature map (f_RGB) is the anchor (f^a_RGB), the positive (f̂^p_RGB) is the first warped feature map reconstructed by warping a positive second feature map (f^p_IR) having the same identifier as the anchor (f^a_RGB), and the negative (f̂^n_RGB) is the first warped feature map reconstructed by warping a negative second feature map (f^n_IR) having a different identifier from the anchor (f^a_RGB).

Similarly, when the second feature map (f_IR) is the anchor (f^a_IR), the positive (f̂^p_IR) is the second warped feature map reconstructed by warping a positive first feature map (f^p_RGB) having the same identifier as the anchor (f^a_IR), and the negative (f̂^n_IR) is the second warped feature map reconstructed by warping a negative first feature map (f^n_RGB) having a different identifier from the anchor (f^a_IR).

Accordingly, the loss calculation unit 600 may calculate the dense triplet loss (L_DT) according to Equation 8, based on each of the first and second co-attention maps (A_i, i ∈ {RGB, IR}).

Here, α denotes a predefined margin, and d_i^+(p) and d_i^-(p) denote, respectively, the error between the anchor and the positive and the error between the anchor and the negative, each calculated according to Equation 9.
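Using the symbols defined above, Equations 8 and 9 can be written in the following form. The hinge [·]₊ = max(0, ·) and the use of the L2 distance for the per-pixel errors follow the usual triplet formulation and are assumptions; this is a reconstruction from the surrounding description.

```latex
L_{DT} = \sum_{i \in \{RGB,\, IR\}} \sum_{p} A_{i}(p)
         \left[ d_{i}^{+}(p) - d_{i}^{-}(p) + \alpha \right]_{+}

d_{i}^{+}(p) = \left\lVert f_{i}^{a}(p) - \hat{f}_{i}^{p}(p) \right\rVert_{2},
\qquad
d_{i}^{-}(p) = \left\lVert f_{i}^{a}(p) - \hat{f}_{i}^{n}(p) \right\rVert_{2}
```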

Once the identity loss (L_ID), the identity consistency loss (L_IC), and the dense triplet loss (L_DT) are calculated, the loss calculation unit 600 may perform training by calculating the total loss (L) according to Equation 10 and backpropagating it.

Here, λ_IC and λ_DT are the weights of the identity consistency loss and the dense triplet loss, respectively.
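The dense triplet term for one modality and the weighted total loss can be sketched as follows. The code assumes a hinge on the margin, L2 distances per position, and a total loss of the form L_ID + λ_IC·L_IC + λ_DT·L_DT; the function names, toy feature maps, margin, and weights are hypothetical values chosen for illustration.

```python
import math

def l2_distance(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dense_triplet_loss(anchor, positive, negative, attention, margin):
    """Attention-weighted hinge triplet loss summed over positions.

    anchor, positive, negative: lists of pixel vectors (one per position).
    attention: co-attention weight per position.
    """
    loss = 0.0
    for a, pos, neg, w in zip(anchor, positive, negative, attention):
        d_pos = l2_distance(a, pos)  # anchor to warped positive
        d_neg = l2_distance(a, neg)  # anchor to warped negative
        loss += w * max(0.0, d_pos - d_neg + margin)
    return loss

def total_loss(l_id, l_ic, l_dt, lambda_ic, lambda_dt):
    """Weighted sum of the three losses (assumed form)."""
    return l_id + lambda_ic * l_ic + lambda_dt * l_dt

# Toy example with two positions and two channels.
anchor = [[1.0, 0.0], [0.0, 1.0]]
positive = [[1.0, 0.0], [0.0, 0.0]]
negative = [[0.0, 0.0], [0.0, 1.0]]
attention = [1.0, 0.5]

l_dt = dense_triplet_loss(anchor, positive, negative, attention, margin=0.3)
l_total = total_loss(l_id=1.0, l_ic=0.5, l_dt=l_dt, lambda_ic=1.0, lambda_dt=0.5)
```

Position 0 already satisfies the margin and contributes nothing, while position 1 violates it and contributes in proportion to its co-attention weight. The full loss would add the same term computed with the infrared feature map as the anchor.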

FIG. 9 shows how the performance of the person re-identification apparatus changes with the loss configuration.

In FIG. 9, (a) shows the mutually matched pixels in heterogeneous images when training uses only the identity loss (L_ID), (b) shows them when training uses the identity loss (L_ID) and the identity consistency loss (L_IC), and (c) shows them when training uses the identity loss (L_ID), the identity consistency loss (L_IC), and the dense triplet loss (L_DT).

As shown in (c) of FIG. 9, when training uses the identity loss (L_ID), the identity consistency loss (L_IC), and the dense triplet loss (L_DT), each pixel of the person region in the heterogeneous images is matched accurately to its corresponding position. This indicates that the person re-identification apparatus of this embodiment can re-identify a person with very high accuracy.

FIG. 10 shows simulation results for the performance of the person re-identification apparatus according to this embodiment.

As described above, each pixel of the person region in heterogeneous images can be matched accurately to its corresponding position. As shown in FIG. 10, corresponding pixels can be matched accurately even under (a) different scales, (b) viewpoint changes, and (c) images containing multiple persons, so that a person can be re-identified.

FIG. 11 shows a heterogeneous sensor camera-based person re-identification method according to an embodiment of the present invention.

Referring to FIG. 11, the heterogeneous sensor camera-based person re-identification method according to this embodiment can be broadly divided into a training step (S10) and a heterogeneous-image person re-identification step (S20). Describing the training step (S10) first with reference to FIGS. 6 and 7, first and second images, which are heterogeneous images acquired by heterogeneous sensor cameras and each labeled with the identifier of the person included, are obtained as training data (S11). Each of the first and second images may be a single-frame image or an image composed of multiple frames. The two images may contain the same person or different persons.

Then, features of each of the first and second images are extracted according to the scheme being learned to obtain first and second training feature maps (S12). Next, the pixel-wise similarity between the obtained first and second training feature maps is computed in a predetermined manner to obtain a similarity map (C(p,q)) (S14). Here, the similarity map (C(p,q)) may be computed, for example, according to Equation 1 based on cosine similarity.

Once the similarity map (C(p,q)) is obtained, the matching probability for each pixel position in the first and second training feature maps is computed from it according to Equation 2 to obtain a matching probability map (P(p,q)) (S15). Then, using the matching probability map (P(p,q)), mutually corresponding pixels in the first and second training feature maps are soft-warped to obtain warped feature maps (S15). For the soft warping, first and second masks in which the person region is emphasized are first obtained from the first and second training feature maps according to Equations 3 and 4. Then, as expressed in Equations 5 and 6, the corresponding mask is used so that the warped pixels are concentrated in the person region of the feature map being warped onto, while the remaining region retains the pixels of the existing feature map. That is, the pixels of the first training feature map are warped, using the second mask, to the positions corresponding to the person region of the second training feature map, while the remaining region retains the pixels of the second training feature map, yielding the second warped feature map. Likewise, the pixels of the second training feature map are warped, using the first mask, to the positions corresponding to the person region of the first training feature map, while the remaining region retains the pixels of the first training feature map, yielding the first warped feature map. Once the first and second warped feature maps are obtained, the first and second training feature maps and the first and second warped feature maps are each pooled to obtain the first and second descriptors and the first and second warped descriptors (S16).

Then, based on each of the obtained first and second descriptors and first and second warped descriptors, classification is performed to obtain the probability of correspondence to each predetermined class, and an identifier corresponding to each descriptor is obtained (S17). For example, first and second score maps and first and second warped score maps, representing the probabilities that the first and second descriptors and the first and second warped descriptors correspond to each of a plurality of predetermined classes, are obtained, and the identifier of the class with the highest probability is extracted from each of them.

Once the identifiers are obtained, the identification loss (L_ID) and the consistency loss (L_IC) are calculated using the obtained first and second score maps, the first and second warped score maps, and the identifiers labeled on the first and second training images; the dense triplet loss (L_DT) is calculated using the first and second feature maps and the first and second warped feature maps; and the total loss is calculated and backpropagated (S18).

Here, the identification loss (L_ID) is calculated as the difference between the first and second score maps obtained from the first and second descriptors and the identifiers labeled on the first and second training images; for example, it may be calculated as the negative cross-entropy between the first and second score maps and the labeled identifiers.

The consistency loss (L_IC) may be obtained by calculating the negative cross-entropy between the first and second score maps and the first and second warped score maps.
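The two classification-based losses can be sketched as below. The code uses the usual cross-entropy form, the negative sum of target probability times log predicted probability, which is assumed to be what the text means. Treating the original score map as the target and the warped score map as the prediction in the consistency loss is likewise an assumption, and the function names and score values are hypothetical.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """-sum_c target[c] * log(predicted[c]); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def one_hot(index, num_classes):
    """Encode a labeled identifier as a one-hot target distribution."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]

def identification_loss(score_maps, labels):
    """Cross-entropy between each score map and its labeled identifier."""
    return sum(cross_entropy(one_hot(y, len(s)), s)
               for s, y in zip(score_maps, labels))

def consistency_loss(score_maps, warped_score_maps):
    """Cross-entropy between each score map and its warped counterpart."""
    return sum(cross_entropy(s, w)
               for s, w in zip(score_maps, warped_score_maps))

# Toy score maps for the first and second images (3 classes).
score_maps = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1]]
warped_score_maps = [[0.6, 0.3, 0.1], [0.3, 0.6, 0.1]]
labels = [0, 1]  # class indices of the labeled identifiers

l_id = identification_loss(score_maps, labels)
l_ic = consistency_loss(score_maps, warped_score_maps)
```

The consistency term is smallest when each warped score map equals its original score map, which is what pushes the warped features to predict the same identity as the original features.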

Meanwhile, for the dense triplet loss (L_DT), first and second common attention maps, which emphasize the person regions visible in both of the heterogeneous training images, are obtained from the mutually corresponding training and warped feature maps among the first and second training feature maps and the first and second warped feature maps. The dense triplet loss (L_DT) is then obtained by calculating, on the common attention maps, the error among an anchor, which is a training feature map; a positive, which is a warped feature map for the same identifier as the anchor; and a negative, which is a warped feature map for a different identifier.
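A per-pixel triplet loss weighted by an attention map can be sketched as follows for a single modality. This is a minimal sketch under assumptions: the hinge form with margin, the L2 distance, and the margin value 0.3 are illustrative choices, the attention map is taken as a given input rather than computed, and the function names are hypothetical.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two per-pixel feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dense_triplet_loss(anchor, positive, negative, attention, margin=0.3):
    """Attention-weighted sum over pixels of max(0, d_pos - d_neg + margin)."""
    loss = 0.0
    for p in range(len(anchor)):
        d_pos = l2_distance(anchor[p], positive[p])  # anchor vs. same identity
        d_neg = l2_distance(anchor[p], negative[p])  # anchor vs. other identity
        loss += attention[p] * max(0.0, d_pos - d_neg + margin)
    return loss

# Two-pixel toy example; pixel 1 is badly matched, e.g. visible in one image only.
anchor = [[1.0, 0.0], [0.0, 1.0]]
positive = [[1.0, 0.1], [5.0, 5.0]]
negative = [[0.0, 1.0], [0.0, 1.1]]

loss_masked = dense_triplet_loss(anchor, positive, negative, attention=[1.0, 0.0])
loss_unmasked = dense_triplet_loss(anchor, positive, negative, attention=[1.0, 1.0])
```

With the attention weight of the unreliable pixel set to zero, only the well-matched pixel contributes and the loss vanishes; without the attention map, the mismatched pixel dominates. The loss for the full method would add the corresponding term for the other modality.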

The total loss is then calculated as a weighted sum of the calculated identification loss (L_ID), consistency loss (L_IC), and dense triplet loss (L_DT), and is backpropagated to perform training. Training may be performed a predetermined number of times, or repeated until the total loss is equal to or less than a predetermined reference loss.
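The weighted sum and the two stopping conditions can be sketched as below. The weight values, the `step` callback, and the function names are hypothetical placeholders; a real training step would run the network, compute the three losses, and backpropagate. The text presents the iteration count and the reference loss as alternatives, and this sketch simply checks both.

```python
def total_loss(l_id, l_ic, l_dt, w_id=1.0, w_ic=1.0, w_dt=0.5):
    """Weighted sum of the three losses; the weights here are illustrative."""
    return w_id * l_id + w_ic * l_ic + w_dt * l_dt

def train(step, max_iters, reference_loss):
    """Call step(iteration) until the iteration cap or the reference loss is reached."""
    loss = float("inf")
    for it in range(1, max_iters + 1):
        loss = step(it)  # one forward/backward pass, returns the total loss
        if loss <= reference_loss:
            return it, loss
    return max_iters, loss

# Dummy step whose losses shrink as 1/iteration, standing in for real training.
def dummy_step(it):
    return total_loss(1.0 / it, 1.0 / it, 1.0 / it)

stopped_at, final_loss = train(dummy_step, max_iters=100, reference_loss=0.6)
capped_at, capped_loss = train(lambda it: 10.0, max_iters=3, reference_loss=0.6)
```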

Once training is complete, the trained apparatus is applied to actual person re-identification and performs the heterogeneous-image person re-identification step (S20).

In the heterogeneous-image person re-identification step, heterogeneous first and second images on which person re-identification is to be performed are first acquired (S21). Features are then extracted from each of the acquired first and second images in the manner learned in the training step to obtain first and second feature maps (S22). The first and second feature maps are then pooled in a predetermined manner to obtain first and second descriptors (S23). Once the first and second descriptors are obtained, first and second score maps are obtained, which indicate the probabilities that the descriptors correspond to each of a plurality of classes predefined to correspond to the persons subject to re-identification, and the person is re-identified by extracting, from each score map, the identifier corresponding to the class with the highest probability.
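The inference path from feature map to identifier (S23 onward) can be sketched as below; no warping or loss calculation is involved at this stage. Global average pooling and the bias-free linear classifier are assumptions, since the text only refers to a predetermined pooling method and a learned classification. Feature extraction (S22) is represented by ready-made toy feature maps, and the weights and class identifiers are made up.

```python
import math

def global_average_pool(feature_map):
    """Average each channel over all pixel positions (assumed pooling method)."""
    n = len(feature_map)
    channels = len(feature_map[0])
    return [sum(pix[c] for pix in feature_map) / n for c in range(channels)]

def re_identify(feature_map, weights, class_ids):
    """Pool to a descriptor, compute the score map, return the top identifier."""
    descriptor = global_average_pool(feature_map)
    logits = [sum(w * x for w, x in zip(row, descriptor)) for row in weights]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    best = max(range(len(scores)), key=scores.__getitem__)
    return class_ids[best], scores

# Toy feature maps for one visible-light image and one infrared image.
rgb_map = [[1.0, 0.0], [0.8, 0.2]]
ir_map = [[0.9, 0.1], [0.7, 0.1]]
weights = [[2.0, -1.0], [-1.0, 2.0]]  # hypothetical trained classifier
class_ids = [7, 12]                   # hypothetical person identifiers

rgb_id, rgb_scores = re_identify(rgb_map, weights, class_ids)
ir_id, ir_scores = re_identify(ir_map, weights, class_ids)
```

Two images are taken to show the same person when they resolve to the same identifier, as both do in this toy example.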

The method according to the present invention may be implemented as a computer program stored in a medium for execution on a computer. Here, the computer-readable medium may be any available medium that can be accessed by a computer, and may include all computer storage media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data, and may include ROM (read-only memory), RAM (random access memory), CD (compact disc)-ROM, DVD (digital video disc)-ROM, magnetic tape, floppy disks, optical data storage devices, and the like.

Although the present invention has been described with reference to the embodiments shown in the drawings, these are merely exemplary, and those of ordinary skill in the art will understand that various modifications and other equivalent embodiments are possible therefrom.

Therefore, the true scope of technical protection of the present invention should be determined by the technical spirit of the appended claims.

100: heterogeneous image acquisition unit 110: first image acquisition unit
120: second image acquisition unit 200: feature extraction unit
210: first feature extraction unit 220: second feature extraction unit
300: pooling unit 310: first pooling unit
320: second pooling unit 400: re-identification unit
500: cross-modal alignment unit 510: similarity measurement unit
520: matching probability estimation unit 530: mask providing unit
540: soft warping unit 550: warping pooling unit
600: loss calculation unit

Claims

1. A person re-identification apparatus comprising:
a feature pooling unit configured to receive first and second images acquired using different types of sensors, extract features according to a learned method to obtain first and second feature maps, and pool the first and second feature maps in a predetermined manner to obtain first and second descriptors;
a cross-modal alignment unit, provided during training, configured to receive the first and second feature maps, estimate mutually corresponding pixels, and perform mutual warping based on the positions of the estimated corresponding pixels, thereby obtaining a first warped feature map in which the second feature map is warped to correspond to the first feature map and a second warped feature map in which the first feature map is warped to correspond to the second feature map, and to pool the first and second warped feature maps to obtain first and second warped descriptors;
a re-identification unit configured to obtain, according to a learned method, first and second score maps and first and second warped score maps indicating the probabilities that the first and second descriptors and the first and second warped descriptors, respectively, correspond to each of a plurality of predefined classes, and to obtain the identifier of the class having the highest probability in the obtained first and second score maps; and
a loss calculation unit, provided during training, configured to calculate a dense triplet loss by dividing the difference between the first feature map and the first warped feature map and the difference between the second feature map and the second warped feature map into positives, for the same identifier, and negatives, for different identifiers,
wherein the cross-modal alignment unit comprises:
a similarity measurement unit configured to calculate the pixel-wise similarity between the first and second feature maps in a predetermined manner to obtain a similarity map;
a matching probability estimation unit configured to calculate, based on the similarity map, a matching probability for each pixel of the first and second feature maps to obtain a matching probability map;
a soft warping unit configured to obtain the first and second warped feature maps by using the matching probability map to weight and aggregate, for each pixel of the counterpart feature map, the probabilities of matching pixels of each of the first and second feature maps, and warping them onto the counterpart feature map; and
a warping pooling unit configured to pool the first and second warped feature maps in a predetermined manner to obtain the first and second warped descriptors.

2. (Deleted)

3. The apparatus of claim 1, wherein the cross-modal alignment unit further comprises
a mask providing unit configured to obtain first and second masks by normalizing the magnitude of the pixel at each position of the first and second feature maps.

4. The apparatus of claim 3, wherein the soft warping unit
receives the first and second masks, weights and aggregates the probabilities that pixels of each of the first and second feature maps match each pixel of the counterpart feature map, and performs warping by additionally weighting the weighted and aggregated warped pixel values with the pixel values of the mask.

5. The apparatus of claim 4, wherein the mask providing unit
obtains the first mask M_RGB from the first feature map f_RGB according to the equation

$$M_{RGB}(p) = f\left(\lVert f_{RGB}(p) \rVert_2\right)$$

(where $\lVert \cdot \rVert_2$ is the L2 norm function and f is the min-max normalization function),
and the soft warping unit
obtains the first warped feature map ($\hat{f}_{RGB}$) from the first feature map f_RGB, the second feature map f_IR, and the first mask M_RGB according to the equation

$$\hat{f}_{RGB}(p) = M_{RGB}(p)\, W(f_{IR}(p)) + \left(1 - M_{RGB}(p)\right) f_{RGB}(p)$$

(where W is the soft warping function, expressed according to the matching probability map P(p,q) as

$$W(f_{IR}(p)) = \sum_{q} P(p,q)\, f_{IR}(q)$$

).

6. The apparatus of claim 4, wherein the loss calculation unit
obtains first and second common attention maps emphasizing the person region visible in both the first and second images, and calculates the dense triplet loss by accumulating, in the first common attention map, the differences between the first feature map serving as an anchor and each of a positive, which is a first warped feature map estimated to have the same identifier as the anchor, and a negative, which is a first warped feature map estimated to have a different identifier, and, in the second common attention map, the differences between the second feature map serving as an anchor and each of a positive, which is a second warped feature map estimated to have the same identifier as the anchor, and a negative, which is a second warped feature map estimated to have a different identifier.

7. The apparatus of claim 6, wherein the loss calculation unit
calculates the dense triplet loss (L_DT), based on the first and second common attention maps (A_i, i ∈ {RGB, IR}), according to the equation

$$L_{DT} = \sum_{i \in \{RGB,\, IR\}} \sum_{p} A_i(p) \left[ d_i^{+}(p) - d_i^{-}(p) + \alpha \right]_{+}$$

(where α is a predefined margin, and $d_i^{+}(p)$ and $d_i^{-}(p)$, calculated as

$$d_i^{+}(p) = \lVert f_i^{a}(p) - f_i^{p}(p) \rVert_2, \qquad d_i^{-}(p) = \lVert f_i^{a}(p) - f_i^{n}(p) \rVert_2,$$

are the error between the anchor ($f_i^{a}$) and the positive ($f_i^{p}$) and the error between the anchor ($f_i^{a}$) and the negative ($f_i^{n}$), respectively).

8. The apparatus of claim 7, wherein the loss calculation unit
further calculates an identification loss according to the difference between each of the first and second score maps and the identifiers labeled on the first and second images and applied during training, and a consistency loss according to the difference between the first and second score maps and the first and second warped score maps, and performs training by backpropagating a total loss calculated as a weighted sum of the identification loss, the consistency loss, and the dense triplet loss.

9. The apparatus of claim 8, wherein the loss calculation unit
obtains the identification loss by calculating the negative cross-entropy between the first and second score maps and the identifiers labeled on the first and second images, and calculates the consistency loss by calculating the negative cross-entropy between each of the first and second score maps and the corresponding one of the first and second warped score maps.

10. The apparatus of claim 1, wherein the first image is an RGB image obtained from a visible-light camera and the second image is an IR image obtained from an IR camera.

11. A person re-identification method comprising:
a training step; and
a person re-identification step of receiving, using an artificial neural network, images acquired using different types of sensors and obtaining an identifier for a person included in each image,
wherein the training step comprises:
receiving, as training data, first and second images acquired using different types of sensors and pre-labeled identifiers, extracting features according to a learned method to obtain first and second feature maps, and pooling them in a predetermined manner to obtain first and second descriptors;
receiving the first and second feature maps, estimating mutually corresponding pixels, and performing mutual warping based on the positions of the estimated corresponding pixels, thereby obtaining a first warped feature map in which the second feature map is warped to correspond to the first feature map and a second warped feature map in which the first feature map is warped to correspond to the second feature map, and pooling the first and second warped feature maps to obtain first and second warped descriptors;
obtaining first and second score maps and first and second warped score maps indicating the probabilities that the first and second descriptors and the first and second warped descriptors, respectively, correspond to each of a plurality of predefined classes, and obtaining the identifier of the class having the highest probability in the obtained first and second score maps; and
calculating a loss by dividing the difference between the first feature map and the first warped feature map and the difference between the second feature map and the second warped feature map into positives, for the same identifier, and negatives, for different identifiers,
wherein obtaining the first and second warped descriptors comprises:
obtaining a similarity map by calculating the pixel-wise similarity between the first and second feature maps in a predetermined manner;
obtaining a matching probability map by calculating, based on the similarity map, a matching probability for each pixel of the first and second feature maps;
obtaining the first and second warped feature maps by using the matching probability map to weight and aggregate, for each pixel of the counterpart feature map, the probabilities of matching pixels of each of the first and second feature maps, and warping them onto the counterpart feature map; and
pooling the first and second warped feature maps in a predetermined manner to obtain the first and second warped descriptors.

12. (Deleted)

13. The method of claim 11, wherein obtaining the first and second warped descriptors further comprises
obtaining first and second masks by normalizing the magnitude of the pixel at each position of the first and second feature maps.

14. The method of claim 13, wherein obtaining the first and second warped feature maps comprises:
receiving the first and second masks and weighting and aggregating the probabilities that pixels of each of the first and second feature maps match each pixel of the counterpart feature map; and
performing warping by additionally weighting the weighted and aggregated warped pixel values with the pixel values of the mask.

15. The method of claim 14, wherein obtaining the masks comprises
obtaining the first mask M_RGB from the first feature map f_RGB according to the equation

$$M_{RGB}(p) = f\left(\lVert f_{RGB}(p) \rVert_2\right)$$

(where $\lVert \cdot \rVert_2$ is the L2 norm function and f is the min-max normalization function),
and performing warping with the additional weighting comprises
obtaining the first warped feature map ($\hat{f}_{RGB}$) from the first feature map f_RGB, the second feature map f_IR, and the first mask M_RGB according to the equation

$$\hat{f}_{RGB}(p) = M_{RGB}(p)\, W(f_{IR}(p)) + \left(1 - M_{RGB}(p)\right) f_{RGB}(p)$$

(where W is the soft warping function, expressed according to the matching probability map P(p,q) as

$$W(f_{IR}(p)) = \sum_{q} P(p,q)\, f_{IR}(q)$$

).

16. The method of claim 14, wherein calculating the loss comprises:
obtaining first and second common attention maps emphasizing the person region visible in both the first and second images; and
obtaining a dense triplet loss calculated by accumulating, in the first common attention map, the differences between the first feature map serving as an anchor and each of a positive, which is a first warped feature map estimated to have the same identifier as the anchor, and a negative, which is a first warped feature map estimated to have a different identifier, and, in the second common attention map, the differences between the second feature map serving as an anchor and each of a positive, which is a second warped feature map estimated to have the same identifier as the anchor, and a negative, which is a second warped feature map estimated to have a different identifier.

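17. The method of claim 16, wherein obtaining the dense triplet loss comprises
calculating the dense triplet loss (L_DT), based on the first and second common attention maps (A_i, i ∈ {RGB, IR}), according to the equation

$$L_{DT} = \sum_{i \in \{RGB,\, IR\}} \sum_{p} A_i(p) \left[ d_i^{+}(p) - d_i^{-}(p) + \alpha \right]_{+}$$

(where α is a predefined margin, and $d_i^{+}(p)$ and $d_i^{-}(p)$, calculated as

$$d_i^{+}(p) = \lVert f_i^{a}(p) - f_i^{p}(p) \rVert_2, \qquad d_i^{-}(p) = \lVert f_i^{a}(p) - f_i^{n}(p) \rVert_2,$$

are the error between the anchor ($f_i^{a}$) and the positive ($f_i^{p}$) and the error between the anchor ($f_i^{a}$) and the negative ($f_i^{n}$), respectively).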

18. The method of claim 17, wherein calculating the loss further comprises:
calculating an identification loss according to the difference between each of the first and second score maps and the identifiers labeled on the first and second images and applied during training;
calculating a consistency loss according to the difference between the first and second score maps and the first and second warped score maps; and
backpropagating a total loss calculated as a weighted sum of the identification loss, the consistency loss, and the dense triplet loss.

19. The method of claim 18, wherein calculating the identification loss comprises
obtaining the identification loss by calculating the negative cross-entropy between the first and second score maps and the identifiers labeled on the first and second images,
and calculating the consistency loss comprises
calculating the consistency loss by calculating the negative cross-entropy between each of the first and second score maps and the corresponding one of the first and second warped score maps.

20. The method of claim 11, wherein the person re-identification step comprises:
obtaining a feature map by extracting features from each of the received heterogeneous images in the manner learned in the training step;
obtaining a corresponding descriptor by pooling the feature map obtained from each of the heterogeneous images in a predetermined manner; and
obtaining a score map corresponding to each of the obtained descriptors in the learned manner, and obtaining, based on the obtained score map, an identifier for the person included in each of the received heterogeneous images.