KR102458783B1

KR102458783B1 - Generalized zero-shot object recognition device and generalized zero-shot object recognizing method

Info

Publication number: KR102458783B1
Application number: KR1020200093744A
Authority: KR
Inventors: 김준태; 서상현
Original assignee: 동국대학교 산학협력단
Priority date: 2020-07-28
Filing date: 2020-07-28
Publication date: 2022-10-26
Also published as: KR20220014461A; KR102458783B9

Abstract

A generalized zero-shot object recognition apparatus according to an embodiment of the present invention includes an object learner that learns input first image data using hierarchical knowledge of the classes stored in a dataset, and an object estimator that adjusts the likelihood that input second image data is classified into a class not learned by the object learner. The object estimator calculates an uncertainty estimate indicating the probability that the input image data was separately learned in advance, and uses the uncertainty estimate to adjust the likelihood that the input image is classified into a class not learned by the object learner.

Description

Generalized zero-shot object recognition device and generalized zero-shot object recognition method

The present invention relates to a supervised-learning-based generalized zero-shot object recognition apparatus and object recognition method and, more specifically, to a generalized zero-shot object recognition apparatus and method that improve zero-shot object recognition performance and prevent performance degradation on data of classes not used during training.

Recently, various learning methodologies centered on deep learning models have been proposed in the field of machine learning, and many of them use supervised learning based on training data with class labels.

Among these, zero-shot learning is a methodology that can recognize classes that were not learned during the training stage of supervised learning. A properly trained zero-shot learning model can recognize classes not included in the training data (unseen classes) at inference time, but if it focuses only on handling data of the unseen classes, its performance on data of the classes used in training (seen classes) may degrade.

Accordingly, generalized zero-shot learning, which considers data of both the classes used during training (seen classes) and the classes not used during training (unseen classes), has been proposed.

There are two main approaches to zero-shot learning for object recognition models.

The first approach uses the attributes of each image for training together with the image data. However, this approach has the disadvantage that attributes must be constructed in addition to the labels for the image data when building the training data.

The second approach embeds images and their class labels in a common vector space, performing a meaningful embedding by appropriately using a kind of textual knowledge about the class labels. However, because zero-shot classification must rely solely on the word embeddings of a language model, this approach shows relatively lower performance than the attribute-based approach.

In addition, all of the approaches above share a problem in generalized zero-shot classification, in which recognition is performed over both the seen classes and the unseen classes: a kind of over-fitting to the seen classes occurs, and classification performance on the unseen classes drops relatively sharply.

An object of the present invention is to improve the recognition performance of an object recognition model on objects it encounters for the first time. Another object is to minimize the degradation of generalized zero-shot object recognition performance on data of classes not used during training.

The object learner includes: a first image embedding unit including a neural network layer that converts the first image data into a first image vector of a specific dimension; a first text embedding unit including a neural network layer that treats the classes stored in the dataset as text and converts the text into a first text vector of a specific dimension, and that converts the hierarchical knowledge into hierarchical semantic vectors; a hierarchical knowledge generation unit that receives text information of the classes from the first text embedding unit, generates the hierarchical knowledge by comparing the text information with a pre-stored hierarchical structure, and transmits it to the first text embedding unit; and an operation unit that receives the first image vector, the first text vector, and the hierarchical semantic vectors and calculates the respective Euclidean distances.

The hierarchical knowledge includes superclass information and sibling class information of the classes, and the object learner learns so that the first image vector and the first text vector move closer to the hierarchical semantic vector of the superclass.

The hierarchical knowledge includes superclass information and sibling class information of the classes, and the object learner learns so that the first image vector and the first text vector move farther from the hierarchical semantic vector of the sibling class.

The first image vector and the first text vector have the same dimension.

The object estimator includes: an uncertainty estimation unit including a neural network layer that classifies the second image data on the basis of separate training data, and that calculates the uncertainty estimate from the distribution of the classification results; a second image embedding unit including a neural network layer that converts the second image data into a second image vector of a specific dimension; a text embedding unit including a neural network layer that treats the classes learned by the object learner as text and converts them into a second text vector of a specific dimension, and treats the classes not learned by the object learner as text and converts them into a third text vector of a specific dimension; and a distance comparison unit that, given a learned distance defined as the distance between the second image vector and the second text vector and an unlearned distance defined as the distance between the third text vector and the second image vector, divides the unlearned distance by the uncertainty estimate.

The uncertainty estimate takes a lower value the more likely the second image data is to be included in the training data, and a higher value the less likely it is to be included.
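As a minimal sketch of this behavior, the snippet below derives an uncertainty value from a classifier's class-probability distribution using Shannon entropy. The choice of entropy and the name `uncertainty_from_probs` are assumptions for illustration; the source states only that the estimate is calculated from the distribution of the classification results.

```python
import math

def uncertainty_from_probs(probs):
    """Shannon entropy (in nats) of a class-probability distribution.

    Assumption: entropy stands in for the uncertainty estimate; the source
    does not name a specific measure.
    """
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A peaked distribution (input resembles a seen class) gives low uncertainty.
confident = uncertainty_from_probs([0.97, 0.01, 0.01, 0.01])
# A flat distribution (input resembles no seen class) gives high uncertainty.
unsure = uncertainty_from_probs([0.25, 0.25, 0.25, 0.25])
```

Any measure that grows as the distribution flattens would match the described behavior equally well; entropy is simply a common choice.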

The second image vector, the second text vector, and the third text vector all have the same dimension.

The training data is learned using the same data as the class data used by the object learner.

A generalized zero-shot object recognition method according to an embodiment of the present invention includes: converting, in a first image embedding unit, first image data into a first image vector and transmitting it to an operation unit; transmitting, from a first text embedding unit, text information of at least some of the classes stored in a dataset to a hierarchical knowledge generation unit; transmitting, by the hierarchical knowledge generation unit, hierarchical knowledge about the class text information to the first text embedding unit; treating, by the first text embedding unit, the classes as text, converting the text into a first text vector of a specific dimension, and transmitting it to the operation unit; converting, by the first text embedding unit, the hierarchical knowledge into hierarchical semantic vectors and transmitting them to the operation unit; calculating, by the operation unit, the respective Euclidean distances among the first image vector, the first text vector, and the hierarchical semantic vectors to obtain a hierarchical semantic loss value; and learning parameters so that the hierarchical semantic loss value is minimized.

Transmitting the hierarchical knowledge to the first text embedding unit includes: setting a hierarchical structure and storing it in the hierarchical knowledge generation unit; and generating, by the hierarchical knowledge generation unit, the hierarchical knowledge by comparing the text information of the classes with the hierarchical structure. The hierarchical knowledge includes superclass information and sibling class information of the classes.

The hierarchical semantic loss value decreases as the first image vector and the first text vector move closer to the hierarchical semantic vector of the superclass.

The hierarchical semantic loss value decreases as the first image vector and the first text vector move farther from the hierarchical semantic vector of the sibling class.

The generalized zero-shot object recognition method according to an embodiment of the present invention further includes: classifying, in an uncertainty estimation unit, second image data on the basis of separate training data and outputting probability values that the second image data belongs to each of the classes included in the training data; calculating, in the uncertainty estimation unit, an uncertainty estimate from the distribution of the probability values and transmitting it to a distance comparison unit; converting, in a second image embedding unit, the second image data into a second image vector and transmitting it to an operation unit; treating, in a second text embedding unit, all classes stored in the dataset as text, converting the learned classes into a second text vector and the unlearned classes into a third text vector; calculating a learned distance, defined as the distance between the second image vector and the second text vector, and an unlearned distance, defined as the distance between the third text vector and the second image vector, and transmitting them to the distance comparison unit; and calculating, by the distance comparison unit, a final classification result by dividing the unlearned distance by the uncertainty estimate.
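The final step of this method can be sketched as follows, assuming the class with the smallest adjusted distance is taken as the classification result. The function `classify`, its dictionary inputs, and the `eps` guard are hypothetical; the source specifies only that the unlearned distance is divided by the uncertainty estimate.

```python
def classify(seen_dists, unseen_dists, uncertainty, eps=1e-8):
    """Return the class name with the smallest adjusted distance.

    seen_dists / unseen_dists: dicts mapping class name -> distance from the
    image vector to that class's text vector.
    Unseen distances are divided by the uncertainty estimate, as the text
    describes; eps (an assumption) guards against division by zero.
    """
    adjusted = dict(seen_dists)
    for name, dist in unseen_dists.items():
        adjusted[name] = dist / max(uncertainty, eps)
    return min(adjusted, key=adjusted.get)

# Illustrative distances only.
seen = {"cat": 0.8, "truck": 2.0}
unseen = {"dog": 1.0}

low_uncertainty_result = classify(seen, unseen, uncertainty=0.5)   # dog -> 2.0
high_uncertainty_result = classify(seen, unseen, uncertainty=2.0)  # dog -> 0.5
```

With a low uncertainty the unseen distance is inflated and the seen class wins; with a high uncertainty the unseen distance shrinks and the unseen class becomes the nearest candidate.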

According to the present invention, recognition performance on objects encountered for the first time can be improved. Zero-shot performance can be increased through the object learner, and generalized zero-shot performance can be increased through the object estimator. In addition, the application efficiency of deep-learning-based object recognition models can be improved.

FIG. 1 is a schematic diagram of an object recognition apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram schematically illustrating the object learner shown in FIG. 1.
FIG. 3 is a diagram schematically illustrating a learning method of the object learner according to an embodiment of the present invention.
FIG. 4 is a graph of classification accuracy measured for images of classes not used during training.
FIG. 5 is a diagram schematically illustrating the object estimator shown in FIG. 1.
FIG. 6 is a diagram illustrating an object recognition method according to an embodiment of the present invention.

All embodiments described below are presented as examples to aid understanding of the present invention, and the invention may be modified and practiced in various forms that differ from the embodiments described here. In describing the present invention, detailed descriptions of related known functions or components are omitted where they could unnecessarily obscure the gist of the invention.

The accompanying drawings are not drawn to actual scale; the dimensions of some components may be exaggerated to aid understanding of the invention. In assigning reference numerals, the same components are given the same numerals wherever possible, even when they appear in different drawings.

In describing the components of the embodiments, terms such as first, second, A, B, (a), and (b) may be used. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being 'connected', 'coupled', or 'joined' to another component, it may be directly connected, coupled, or joined to that component, but it should be understood that yet another component may also be 'connected', 'coupled', or 'joined' between the two.

Therefore, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas, so various modifications are possible.

The terms and words used in this specification and the claims should not be limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

In addition, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only to make the disclosure complete and to fully convey the scope of the invention to those of ordinary skill in the art to which the invention belongs; the invention is defined only by the scope of the claims.

Singular expressions used in this application include plural expressions unless the context clearly indicates otherwise.

FIG. 1 is a schematic diagram of an object recognition apparatus according to an embodiment of the present invention, and FIG. 2 is a diagram schematically illustrating the object learner shown in FIG. 1.

Referring to FIGS. 1 and 2, the object recognition apparatus 1000 according to an embodiment of the present invention may be a generalized zero-shot object recognition apparatus. That is, the object recognition apparatus 1000 applies a generalized zero-shot learning methodology and can recognize objects by considering data of both the classes used during training (seen classes) and the classes not used during training (unseen classes). For convenience of description, the generalized zero-shot object recognition apparatus 1000 is hereinafter referred to simply as the object recognition apparatus 1000.

The object recognition apparatus 1000 includes an object learner 100 and an object estimator 200. The object learner 100 can learn so that input image data becomes more likely to be classified into the correct class among the classes stored in the dataset.

Specifically, the object learner 100 includes a first image embedding unit 110, a first text embedding unit 120, a hierarchical knowledge generation unit 130, and an operation unit 140.

The first image embedding unit 110 includes a neural network layer that converts first image data IM1, which is an input image, into a first image vector (x_s) of a specific dimension. The input image may be data that has not been learned.

The first text embedding unit 120 includes a neural network layer that treats at least some classes TX1 among the classes stored in the dataset as text and converts the text TX1 into a first text vector (y_s^+) of a specific dimension.

According to this embodiment, the vector output dimension of the first image embedding unit 110 is set equal to that of the first text embedding unit 120. That is, the first image vector (x_s) and the first text vector (y_s^+) may be vectors of the same dimension, so the distance between them can be calculated in a common vector space.
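To illustrate why matching dimensions matter, here is a small sketch of a Euclidean distance between an image vector and a text vector in a shared space. The helper `euclidean` and the three-dimensional example values are hypothetical and chosen only for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two vectors of the same dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must share the same dimension")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical 3-dimensional embeddings (illustrative values only).
x_s = [0.9, 0.1, 0.0]      # first image vector
y_s_pos = [1.0, 0.0, 0.0]  # first text vector of the matching class

distance = euclidean(x_s, y_s_pos)
```

If the two embedding units produced vectors of different lengths, the distance would be undefined, which is why the sketch raises an error in that case.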

The hierarchical knowledge generation unit 130 and the operation unit 140 may together be defined as a hierarchical loss function unit LF. Using the hierarchical knowledge (HK) between classes, the hierarchical loss function unit LF can calculate a hierarchical semantic loss value that contains distance information between the input vectors. The object learner 100 according to this embodiment can train the embedding models of the first image embedding unit 110 and the first text embedding unit 120 so that the hierarchical semantic loss value calculated by the hierarchical loss function unit LF becomes smaller.

Specifically, the hierarchical knowledge generation unit 130 stores information on the class texts of the previously learned data, together with hierarchical structure information that contains those class texts. The hierarchical structure information includes information on the superclass and sibling classes of an input class. For example, when the input class text is 'tiger', the superclass may be 'mammal' or 'predator', and the sibling classes may be 'cheetah', 'cat', 'lion', and so on.

According to this embodiment, the hierarchical structure information may be designed in advance by the user as desired. However, the present invention is not limited to any particular way of constructing the hierarchical structure information. For example, according to another embodiment, the hierarchical structure information may be built by the designer when the training data is first designed, or may be borrowed from another model that generalizes the hierarchical structure of class texts (e.g., WordNet).

The hierarchical knowledge generation unit 130 receives text information TI of the classes TX1 from the first text embedding unit 120 and generates hierarchical knowledge HK by comparing the text information TI with the pre-stored hierarchical structure. The hierarchical knowledge HK includes information on what the superclass and sibling classes of the input class TX1 are. The hierarchical knowledge HK is transmitted to the first text embedding unit 120.
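The lookup performed by the hierarchical knowledge generation unit can be sketched with a plain dictionary. The `HIERARCHY` table and the function `hierarchical_knowledge` are hypothetical; the 'mammal' entry follows the tiger example in the text, and the 'vehicle' entry is an added illustration.

```python
# Hypothetical pre-stored hierarchy: superclass -> child classes.
HIERARCHY = {
    "mammal": ["tiger", "cheetah", "cat", "lion"],
    "vehicle": ["truck", "automobile"],
}

def hierarchical_knowledge(class_name, hierarchy=HIERARCHY):
    """Return the superclass and sibling classes of class_name."""
    for superclass, children in hierarchy.items():
        if class_name in children:
            siblings = [c for c in children if c != class_name]
            return {"superclass": superclass, "siblings": siblings}
    raise KeyError(f"class not found in hierarchy: {class_name}")

hk = hierarchical_knowledge("tiger")
```

A hand-written table like this corresponds to the user-designed hierarchy the text mentions; a lexical resource such as WordNet could supply the same two pieces of information.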

The first text embedding unit 120 converts the hierarchical knowledge HK received from the hierarchical knowledge generation unit 130 into hierarchical semantic vectors. Specifically, it generates a superclass vector (y_s^sc), defined as the hierarchical semantic vector of the superclass information, and a sibling class vector (y_s^-), defined as the hierarchical semantic vector of the sibling class information.

The operation unit 140 receives the first image vector (x_s) from the first image embedding unit 110, and receives the first text vector (y_s^+), the superclass vector (y_s^sc), and the sibling class vector (y_s^-) from the first text embedding unit 120. According to this embodiment, the operation unit 140 can calculate the Euclidean distances between the received vectors to obtain the hierarchical semantic loss value (HSL).

FIG. 3 is a diagram schematically illustrating a learning method of the object learner according to an embodiment of the present invention.

Referring to FIG. 3 together with FIGS. 1 and 2, the operation unit 140 can calculate the hierarchical mean squared error loss function HMSE (Hierarchical Mean Squared Loss) and the hierarchical triplet loss function HTL (Hierarchical Triplet Loss) separately, and from them calculate the overall loss value, the hierarchical semantic loss HSL (Hierarchical Semantic Loss).

Equation 1 represents the hierarchical mean squared error loss function (HMSE). It takes as input a first image vector (x_s) belonging to a class of the training data, the first text vector (y_s^+) of the class corresponding to the first image vector (x_s), and the superclass vector (y_s^sc) corresponding to the superclass of the first text vector (y_s^+), and calculates the Euclidean distance between the first image vector (x_s) and the superclass vector (y_s^sc) and the Euclidean distance between the first text vector (y_s^+) and the superclass vector (y_s^sc).
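The exact form of Equation 1 is not reproduced in this text, so the following is only one plausible reading: the mean of the two squared Euclidean distances to the superclass vector. The function names `squared_distance` and `hmse`, and the averaging factor, are assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hmse(x_s, y_pos, y_sc):
    """Assumed HMSE: mean of the squared distances from the image vector
    and from the text vector to the superclass vector."""
    return 0.5 * (squared_distance(x_s, y_sc) + squared_distance(y_pos, y_sc))

# Illustrative 2-dimensional vectors.
loss_far = hmse([0.0, 0.0], [2.0, 0.0], [1.0, 0.0])   # both one unit away
loss_near = hmse([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])  # both on the superclass
```

Under this reading the loss reaches zero only when both the image vector and the text vector coincide with the superclass vector, which matches the stated goal of pulling both toward it.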

Equation 2 represents the hierarchical triplet loss function (HTL). It takes as input the first image vector (x_s), the first text vector (y_s^+) of the class corresponding to the first image vector (x_s), the sibling class vector (y_s^-) corresponding to a sibling class of the first text vector (y_s^+), and a constant that serves as a margin, and calculates a triplet loss in which, for the first image vector (x_s), the first text vector (y_s^+) is set as the positive sample and the sibling class vector (y_s^-) as the negative sample.
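A triplet loss of this kind is commonly written as a hinge on the gap between the positive and negative distances. The sketch below uses that standard form as an assumption, since Equation 2 itself is not reproduced here; `htl`, `euclidean`, and the default `margin` value are hypothetical.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def htl(x_s, y_pos, y_neg, margin=1.0):
    """Assumed standard triplet loss: anchor x_s, positive y_pos (own class),
    negative y_neg (sibling class)."""
    return max(0.0, euclidean(x_s, y_pos) - euclidean(x_s, y_neg) + margin)

# Anchor sits on its own class and far from the sibling: no penalty.
loss_good = htl([0.0, 0.0], [0.0, 0.0], [3.0, 4.0])
# Anchor sits on the sibling and far from its own class: large penalty.
loss_bad = htl([0.0, 0.0], [3.0, 4.0], [0.0, 0.0])
```

The margin sets how much farther the sibling class vector must be than the correct class vector before the loss vanishes.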

Equation 3 represents the hierarchical semantic loss value (HSL), obtained by summing the hierarchical mean squared error loss function (HMSE) and the hierarchical triplet loss function (HTL) after applying a constant that reflects the ratio between the two.
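How the ratio constant enters the sum is not visible in this text. As one hedged possibility, the sketch below scales the HTL term by a weight before adding it to the HMSE term; the function `hsl` and the parameter `ratio` are hypothetical, and a convex combination of the two terms would be an equally plausible reading.

```python
def hsl(hmse_value, htl_value, ratio=1.0):
    """Assumed combination: HSL = HMSE + ratio * HTL."""
    return hmse_value + ratio * htl_value

# Illustrative loss values only.
total = hsl(hmse_value=1.0, htl_value=6.0, ratio=0.5)
```

Whatever the exact form, a larger weight on the triplet term puts more emphasis on separating sibling classes relative to pulling vectors toward the superclass.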

According to this embodiment, the object learner 100 can train the embedding model of the first image embedding unit 110 and the embedding model of the first text embedding unit 120 so that the HSL loss value for the input data becomes smaller. That is, training pulls the first image vector (x_s) and the first text vector (y_s^+) toward the superclass vector (y_s^sc) and pushes them away from the sibling class vector (y_s^-). The hierarchical mean squared error loss function (HMSE) value decreases as the first image vector (x_s) and the first text vector (y_s^+) each move closer to the superclass vector (y_s^sc), and the hierarchical triplet loss function (HTL) value decreases as they move farther from the sibling class vector (y_s^-).

The training data TD for which learning has been completed in the object learner 100 is transmitted to the object estimator 200.

According to an embodiment of the present invention, unlearned first image data IM1 is set as a recognition candidate, and the first image data IM1 input to the object learner 100 is classified using hierarchical knowledge, so zero-shot object recognition performance can be improved.

FIG. 4 is a graph of classification accuracy measured on images of classes not used during training. Specifically, in the CIFAR10 dataset, which consists of color images of 10 classes, class pairs such as "cat-dog", "plane-automobile", "automobile-deer", "deer-ship", and "cat-truck" were each set as unseen classes, and only the other 8 of the 10 classes were set as seen classes for training. After training, classification accuracy was measured on images of the unseen classes, and the graph compares the results.
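The seen/unseen split used for FIG. 4 can be sketched as follows. This is only an illustration of the setup described above, not the inventors' experimental code: the helper name `split_seen_unseen` is hypothetical, the class list uses the standard CIFAR-10 label names, and "plane" in the text is assumed to refer to the `airplane` class.

```python
# Standard CIFAR-10 class names.
CIFAR10_CLASSES = [
    "airplane", "automobile", "bird", "cat", "deer",
    "dog", "frog", "horse", "ship", "truck",
]

# Unseen class pairs named in the text ("plane" assumed to mean "airplane").
UNSEEN_PAIRS = [
    ("cat", "dog"),
    ("airplane", "automobile"),
    ("automobile", "deer"),
    ("deer", "ship"),
    ("cat", "truck"),
]


def split_seen_unseen(unseen_pair):
    """Return (seen, unseen) class lists for one experiment (hypothetical helper)."""
    unseen = list(unseen_pair)
    seen = [c for c in CIFAR10_CLASSES if c not in unseen]
    return seen, unseen


for pair in UNSEEN_PAIRS:
    seen, unseen = split_seen_unseen(pair)
    print(unseen, "held out;", len(seen), "classes used for training")
```

Each experiment trains only on the 8 seen classes and evaluates only on images of the 2 held-out classes.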

Referring to FIG. 4, the object learner 100 according to an embodiment of the present invention achieves higher classification accuracy than the existing embedding-based zero-shot object learner, and the difference in accuracy from the comparative example is especially pronounced when classifying between highly similar classes (for example, "cat-dog"). This can be interpreted as a performance gain from training the embedding model using the hierarchical structure of text vectors, unlike existing methodologies that rely only on the semantic knowledge in a language model's text vectors.

FIG. 5 is a diagram schematically illustrating the object estimator shown in FIG. 1.

Referring to FIG. 5 together with FIG. 1, the data (TD) learned by the object learner 100 is transmitted to the object estimator 200. The object estimator 200 may adjust the possibility that data input to the object estimator 200 is classified into a class that was not learned by the object learner 100.

Specifically, the object estimator 200 includes an uncertainty estimation unit 210, a second image embedding unit 220, a second text embedding unit 230, and a distance comparison unit 240.

The uncertainty estimation unit 210 includes a general neural network layer that classifies the second image data (IM2), which is the input image, according to separate training data learned in advance using the object learner 100.

The classification result may be a set of values quantifying the similarity to each of the classes. This is defined as the connection weight. According to this embodiment, the uncertainty of an image newly input to the object estimator 200 may be estimated from the distribution of the classification results. The value of this uncertainty is defined as the uncertainty estimate (S_uc). The uncertainty estimate (S_uc) may be transmitted to the distance comparison unit 240 in the form of a value calculated by applying the softmax function for the separate training data to the second image data (IM2).

When the distribution of the classification results has a large value for one specific class, the input image is being classified with relative certainty, so the uncertainty estimation unit 210 may calculate a high likelihood that the image is included in the training data. Conversely, when the distribution of the classification results is spread evenly across all classes, the input image is being classified with relative uncertainty, so the uncertainty estimation unit 210 may calculate a low likelihood that the image is included in the training data.

Equation 4 calculates the uncertainty estimate (S_uc). When an input image is fed to the object estimator, which is composed of connection weights, a softmax function with a constant T that smooths the distribution is applied to the output of the object estimator, and the largest of the resulting values is subtracted from the constant 2.

The uncertainty estimate (S_uc) of Equation 4 is configured to take a real value between 1 and 2. It approaches 1 the more confidently the object estimator classifies the input data, and approaches 2 the more uncertainly it classifies the input data. Therefore, it takes a larger value as the likelihood increases that the input data does not belong to a class of the training data.
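The computation described for Equation 4 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `uncertainty_estimate`, the argument `logits` (standing in for the raw classifier outputs), and the example values are all hypothetical; only the rule "2 minus the largest temperature-scaled softmax value" comes from the text.

```python
import math


def uncertainty_estimate(logits, temperature=1.0):
    """Return S_uc = 2 - max(softmax(logits / T)).

    `logits` are assumed raw classifier outputs, one per seen class.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max before exp() for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return 2.0 - max(probs)


# A sharply peaked output gives a value near 1 (confident).
confident = uncertainty_estimate([10.0, 0.0, 0.0, 0.0])
# A flat output over 4 classes gives 2 - 1/4 = 1.75 (uncertain).
uncertain = uncertainty_estimate([1.0, 1.0, 1.0, 1.0])
print(confident, uncertain)
```

Because the largest softmax value over K classes is never below 1/K, the estimate in this sketch stays in the range from 1 up to 2 - 1/K. A larger T flattens the softmax and therefore pushes the estimate upward.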

The second image embedding unit 220 includes a neural network layer that converts the second image data (IM2), which is the input image, into a second image vector (x_q) of a specific dimension. The second image data (IM2) may include unlearned data and may differ from the first image data (IM1). In an embodiment of the present invention, the second image embedding unit 220 may be the same as the first image embedding unit 110 described above.

The second text embedding unit 230 includes a neural network layer that treats the classes (TX2-1) learned by the object learner 100, among the classes stored in the dataset, as text and converts them into a second text vector (y_s) of a specific dimension, and that treats the classes (TX2-2) not learned by the object learner 100 as text and converts them into a third text vector (y_u) of a specific dimension. In an embodiment of the present invention, the second text embedding unit 230 may be the same as the first text embedding unit 120 described above.

According to this embodiment, the vector output dimension of the second image embedding unit 220 is set equal to the vector output dimension of the second text embedding unit 230. That is, the second image vector (x_q), the second text vector (y_s), and the third text vector (y_u) may be vectors of the same dimension. Accordingly, distances among the second image vector (x_q), the second text vector (y_s), and the third text vector (y_u) can be calculated in a common vector space.

The distance comparison unit 240 receives the second image vector (x_q) from the second image embedding unit 220, and receives the second text vector (y_s) and the third text vector (y_u) from the second text embedding unit 230. According to this embodiment, the distance comparison unit 240 may calculate the Euclidean distances between the received vectors. Specifically, the distance comparison unit 240 calculates a learning distance (D_s), defined as the distance between the second image vector (x_q) and the second text vector (y_s), and an unlearned distance (D_u), defined as the distance between the second image vector (x_q) and the third text vector (y_u). Here, the learning distance (D_s) and the unlearned distance (D_u) may be the provisional classification result of the object estimator 200.
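As a rough illustration of the distance calculation, the sketch below computes D_s and D_u for toy vectors in a shared three-dimensional space. The vectors, the class names, and the helper `nearest_distance` are made up for illustration. Taking the minimum over several class vectors is an assumption, since the text only defines each distance between the image vector and a text vector.

```python
import math


def nearest_distance(image_vec, class_vecs):
    """Smallest Euclidean distance from image_vec to any class text vector."""
    return min(math.dist(image_vec, vec) for vec in class_vecs.values())


# Toy embeddings; all vectors share one dimension so distances are comparable.
x_q = [0.0, 0.0, 0.0]                                        # second image vector
seen_text_vecs = {"cat": [3.0, 4.0, 0.0], "ship": [0.0, 0.0, 6.0]}  # y_s
unseen_text_vecs = {"dog": [2.0, 0.0, 0.0]}                         # y_u

d_s = nearest_distance(x_q, seen_text_vecs)    # learning distance D_s
d_u = nearest_distance(x_q, unseen_text_vecs)  # unlearned distance D_u
print(d_s, d_u)
```

`math.dist` requires both points to have the same number of coordinates, which mirrors the requirement that the image and text embedding units share an output dimension.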

According to an embodiment of the present invention, the distance comparison unit 240 divides the unlearned distance (D_u) by the uncertainty estimate (S_uc) received from the uncertainty estimation unit 210, which may increase the possibility that the image data (IM2) input to the object estimator 200 is classified into a class not used during training. That is, by reducing the distance between the second image vector (x_q) and the third text vector (y_u), which is the text vector of the classes (TX2-2) not learned by the object learner 100, the possibility that the input image (IM2) is classified into the unlearned classes (TX2-2) can be increased.

Equation 5 represents the processed unlearned distance (D_u'), the distance between the image vector and the text vector used for zero-shot classification. It is obtained by dividing the Euclidean distance (the unlearned distance, D_u) between the input image vector (x_q) and the third text vector (y_u) of a class not used during training by the uncertainty estimate (S_uc) calculated in Equation 4, that is, D_u' = D_u / S_uc.
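The adjustment in Equation 5 can be combined with a nearest-class decision as sketched below. This is a hedged sketch, not the patented procedure: the function `classify`, the toy vectors, and the rule of picking the class with the smallest (adjusted) distance are assumptions. Only the division of the unlearned distance by S_uc is taken from the text.

```python
import math


def classify(x_q, seen_vecs, unseen_vecs, s_uc):
    """Pick the nearest class after dividing unseen-class distances by s_uc."""
    scores = {}
    for name, vec in seen_vecs.items():
        scores[name] = math.dist(x_q, vec)          # D_s is left unchanged
    for name, vec in unseen_vecs.items():
        scores[name] = math.dist(x_q, vec) / s_uc   # D_u' = D_u / S_uc
    return min(scores, key=scores.get)


x_q = [0.0, 0.0]
seen = {"cat": [2.0, 0.0]}     # D_s = 2.0
unseen = {"dog": [0.0, 3.0]}   # D_u = 3.0

print(classify(x_q, seen, unseen, s_uc=1.2))  # D_u' = 2.5, seen class is nearer
print(classify(x_q, seen, unseen, s_uc=1.9))  # D_u' is about 1.58, unseen class is nearer
```

With the same embeddings, only the uncertainty estimate changes the outcome: a confident classifier output leaves the seen class closest, while an uncertain one pulls the unseen class within reach.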

In general, among the image data (IM2) input to the object estimator 200, at least some images that were not learned are more likely to be classified into a learned class than into an unlearned class.

Unlike the embodiment of the present invention, if the unlearned distance (D_u) is not divided by the uncertainty estimate (S_uc), the processed unlearned distance (D_u') is not reduced even when the image data (IM2) input to the object estimator 200 belongs to a class (TX2-2) not used during training, so the object estimator 200 may classify that image data (IM2) into a class (TX2-1) used during training. That is, the recognition performance of the object estimator 200 for class data (TX2-2) not used during training may be degraded.

However, according to an embodiment of the present invention, the distance comparison unit 240 divides the unlearned distance (D_u) by the uncertainty estimate (S_uc) to produce a processed unlearned distance (D_u') that is smaller than the unlearned distance (D_u), which can increase the possibility that an image input to the object estimator 200 is classified into a class not used during training.

For example, when the image (IM2) input to the uncertainty estimation unit 210 is highly likely to be included in the training data, the uncertainty estimate (S_uc) has a relatively small value. Therefore, even if the distance comparison unit 240 divides the unlearned distance (D_u) by the uncertainty estimate (S_uc), the processed unlearned distance (D_u') still has a relatively larger value than the learning distance (D_s), so the image data (IM2) input to the object estimator 200 is more likely to be classified into a learned class than into an unlearned class.

On the other hand, when the image (IM2) input to the uncertainty estimation unit 210 is unlikely to be included in the training data, the uncertainty estimate (S_uc) has a relatively large value. Therefore, the distance comparison unit 240 divides the unlearned distance (D_u) by the uncertainty estimate (S_uc) to produce a processed unlearned distance (D_u') that is smaller than the original unlearned distance (D_u). That is, the distance comparison unit 240 may increase the possibility that the image data (IM2) input to the object estimator 200 is classified into an unlearned class.
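The two cases can be restated as a single inequality. The derivation below assumes that the final decision simply compares the processed unlearned distance with the learning distance, which the text implies but does not state as a formula. The numeric values are made up for illustration.

```latex
% Assumes D_s > 0 and a decision rule that picks the smaller distance.
\[
D_u' = \frac{D_u}{S_{uc}} < D_s
\iff S_{uc} > \frac{D_u}{D_s},
\qquad 1 \le S_{uc} < 2 .
\]
% Example with D_s = 1.0 and D_u = 1.5 (threshold: S_uc > 1.5):
%   S_uc = 1.1  ->  D_u' = 1.5 / 1.1 \approx 1.36 > D_s   (learned class kept)
%   S_uc = 1.8  ->  D_u' = 1.5 / 1.8 \approx 0.83 < D_s   (unlearned class chosen)
```

Since S_uc stays below 2, under this assumption the adjustment can only flip a decision toward an unlearned class when D_u is less than twice D_s.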

As a result, according to an embodiment of the present invention, the recognition performance of an object recognition model on objects it sees for the first time can be improved, and the practical applicability of deep-learning-based object recognition models in industry can be improved. For example, an object recognition model in an autonomous vehicle may respond better to obstacles it has never seen before, and an intelligent security system based on an object recognition model may handle unidentified objects better.

In addition, among the various approaches to zero-shot learning, the object recognition apparatus 1000 according to an embodiment of the present invention uses a methodology that embeds images and text labels in a common vector space rather than an approach that uses attribute information of the image data, and can thereby reduce the cost of building data for zero-shot learning.

FIG. 6 is a diagram illustrating an object recognition method according to an embodiment of the present invention.

Referring to FIG. 6 together with FIGS. 2 and 5, the object recognition method according to an embodiment of the present invention consists of a learning step (S10) and an inference step (S20). The inference step (S20) follows the learning step (S10), and in an embodiment of the present invention, the inference step (S20) and the learning step (S10) may be performed independently of each other.

First, the learning step (S10) includes: converting, in the first image embedding unit 110 (FIG. 2), the first image data (IM1) into a first image vector (x_s) and transmitting it to the operation unit 140 (S11); transmitting, from the first text embedding unit 120, text information (TX1) of at least some classes stored in the dataset to the hierarchical knowledge generation unit 130 (S12); transmitting, by the hierarchical knowledge generation unit 130, information on each of the superclass and the sibling class to the first text embedding unit 120 (S13); treating, by the first text embedding unit 120, the input class information as text and converting it into a first text vector (y_s⁺), treating the superclass information as text and converting it into a superclass vector (y_s^sc), treating the sibling class information as text and converting it into a sibling class vector (y_s^-), and transmitting these vectors to the operation unit 140 (S14); calculating, by the operation unit 140, the Euclidean distances among the received first image vector (x_s), first text vector (y_s⁺), superclass vector (y_s^sc), and sibling class vector (y_s^-) to obtain the hierarchical semantic loss (HSL) (S15); and learning parameters so that the hierarchical semantic loss (HSL) is minimized (S17). The configuration corresponding to each step (S11 to S17) is the same as described above with reference to FIGS. 1 to 4, so a description thereof is omitted.

The inference step (S20) includes: classifying, in the uncertainty estimation unit 210, the second image data (IM2) according to the separate training data and outputting the distribution of probability values that the second image data (IM2) is classified into each of the classes included in the training data (S21); calculating, in the uncertainty estimation unit 210, the uncertainty estimate (S_uc) using the largest value in the distribution of probability values and transmitting it to the distance comparison unit 240 (S22); converting, in the second image embedding unit 220, the second image data (IM2) into a second image vector (x_q) and transmitting it to the distance comparison unit 240 (S23); treating, in the second text embedding unit 230, all classes stored in the dataset as text, converting the classes (TX2-1) learned by the object learner 100 into a second text vector (y_s) and the classes (TX2-2) not learned by the object learner 100 into a third text vector (y_u), and transmitting them to the distance comparison unit 240 (S24); calculating, by the distance comparison unit 240, a provisional classification result that includes the learning distance (D_s), which is the distance between the second image vector (x_q) and the second text vector (y_s), and the unlearned distance (D_u), which is the distance between the second image vector (x_q) and the third text vector (y_u) (S25); and calculating, by the distance comparison unit 240, the final classification result by dividing the unlearned distance (D_u) by the uncertainty estimate (S_uc) (S26). The configuration corresponding to each step (S21 to S26) is the same as described above with reference to FIGS. 1 and 5, so a description thereof is omitted.

Although embodiments of the present invention have been described above in detail with reference to the accompanying drawings, the present invention is not necessarily limited to these embodiments and may be modified in various ways without departing from its technical spirit.

Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical spirit of the present invention, and the scope of that technical spirit is not limited by these embodiments. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within an equivalent scope should be construed as falling within the scope of the present invention.

1000: object recognition apparatus
100: object learner
110: first image embedding unit
120: first text embedding unit
130: hierarchical knowledge generation unit
140: operation unit
LF: hierarchical semantic loss function unit
200: object estimator
210: uncertainty estimation unit
220: second image embedding unit
230: second text embedding unit
240: distance comparison unit
TD: trained data
HK: hierarchical knowledge
TI: text information

Claims

A generalized zero-shot object recognition apparatus comprising:
an object learner that learns input first image data using hierarchical knowledge of classes stored in a dataset; and
an object estimator that adjusts the possibility that input second image data is classified into a class not learned by the object learner,
wherein the object estimator calculates an uncertainty estimate indicating a probability that the input image data has been separately learned in advance, and uses the uncertainty estimate to adjust the possibility that the input image is classified into a class not learned by the object learner.

The apparatus of claim 1,
wherein the object learner comprises:
a first image embedding unit including a neural network layer that converts the first image data into a first image vector of a specific dimension;
a first text embedding unit including a neural network layer that treats the classes stored in the dataset as text and converts them into a first text vector of a specific dimension, and that converts the hierarchical knowledge into a hierarchical semantic vector;
a hierarchical knowledge generation unit that receives text information of the classes from the first text embedding unit, generates the hierarchical knowledge by comparing the text information with a pre-stored hierarchical structure, and transmits the generated hierarchical knowledge to the first text embedding unit; and
an operation unit that receives the first image vector, the first text vector, and the hierarchical semantic vector, and calculates each Euclidean distance.

The apparatus of claim 2,
wherein the hierarchical knowledge
includes superclass information and sibling class information of the classes, and
the object learner
learns such that the first image vector and the first text vector come close to the hierarchical semantic vector of the superclass.

The apparatus of claim 2,
wherein the hierarchical knowledge
includes superclass information and sibling class information of the classes, and
the object learner
learns such that the first image vector and the first text vector move away from the hierarchical semantic vector of the sibling class.

The apparatus of claim 2,
wherein the first image vector and the first text vector have the same dimension.

The apparatus of claim 2,
wherein the object estimator comprises:
an uncertainty estimation unit including a neural network layer that classifies the second image data according to separate training data, and calculating the uncertainty estimate from the distribution of the classification results;
a second image embedding unit including a neural network layer that converts the second image data into a second image vector of a specific dimension;
a second text embedding unit including a neural network layer that treats the classes learned by the object learner as text and converts them into a second text vector of a specific dimension, and that treats the classes not learned by the object learner as text and converts them into a third text vector of a specific dimension; and
a distance comparison unit that divides the unlearned distance by the uncertainty estimate so that the distance between the third text vector and the second image vector is reduced.

The apparatus of claim 6,
wherein the uncertainty estimate has a lower value the higher the probability that the second image data is included in the training data, and a higher value the lower that probability.

The apparatus of claim 6,
wherein the second image vector, the second text vector, and the third text vector have the same dimension.

The apparatus of claim 7,
wherein the training data is learned from the same data as the class data used by the object learner.

A generalized zero-shot object recognition method comprising:
converting, by a first image embedding unit, first image data into a first image vector and transmitting it to an operation unit;
transmitting text information of at least some classes stored in a dataset from a first text embedding unit to a hierarchical knowledge generation unit;
transmitting, by the hierarchical knowledge generation unit, hierarchical knowledge of the class text information to the first text embedding unit;
treating, by the first text embedding unit, the classes as text, converting them into a first text vector of a specific dimension, and transmitting it to the operation unit;
converting, by the first text embedding unit, the hierarchical knowledge into a hierarchical semantic vector and transmitting it to the operation unit;
calculating, by the operation unit, each Euclidean distance from the first image vector, the first text vector, and the hierarchical semantic vector to obtain a hierarchical semantic loss value; and
learning parameters such that the hierarchical semantic loss value is minimized.

The method of claim 10,
wherein transmitting the hierarchical knowledge to the first text embedding unit includes:
setting a hierarchical structure and storing it in the hierarchical knowledge generation unit; and
generating, by the hierarchical knowledge generation unit, the hierarchical knowledge by comparing the text information of the classes with the hierarchical structure,
wherein the hierarchical knowledge includes superclass information and sibling class information of the classes.

The method of claim 11,
wherein the hierarchical semantic loss value decreases as the first image vector and the first text vector come closer to the hierarchical semantic vector of the superclass.

The method of claim 11,
wherein the hierarchical semantic loss value decreases as the first image vector and the first text vector move farther from the hierarchical semantic vector of the sibling class.

The method of claim 11, further comprising:
classifying, in an uncertainty estimation unit, second image data according to separate training data and outputting probability values that the second image data is classified into each of the classes included in the training data;
calculating, in the uncertainty estimation unit, an uncertainty estimate from the distribution of the probability values and transmitting it to a distance comparison unit;
converting, by a second image embedding unit, the second image data into a second image vector and transmitting it to the distance comparison unit;
treating, by a second text embedding unit, all classes stored in the dataset as text, converting learned classes into a second text vector and unlearned classes into a third text vector;
calculating, by the distance comparison unit, a learning distance defined as the distance between the second image vector and the second text vector and an unlearned distance defined as the distance between the third text vector and the second image vector; and
calculating, by the distance comparison unit, a final classification result by dividing the unlearned distance by the uncertainty estimate.

The method of claim 14,
wherein the uncertainty estimate has a lower value the higher the probability that the second image data is included in the training data, and a higher value the lower that probability.

The method of claim 14,
wherein the second image vector, the second text vector, and the third text vector have the same dimension.