KR20160128869A - Method for visual object localization using privileged information and apparatus for performing the same

Info

Publication number: KR20160128869A
Application number: KR1020150060937A
Authority: KR
Inventors: 한보형; 페예레이즐 얀; 곽수하; 손진희
Original assignee: 포항공과대학교 산학협력단
Priority date: 2015-04-29
Filing date: 2015-04-29
Publication date: 2016-11-08
Anticipated expiration: 2035-04-29
Also published as: KR101700030B1

Abstract

A method for visual object localization using privileged information, and an apparatus for performing the method, are disclosed. The method includes: generating a learning framework that combines privileged information with a structured prediction framework; performing alternating optimization learning on the learning framework; generating a prediction model from the learning framework on which the alternating optimization learning has been performed; and using the prediction model to predict or localize an object in a specific image taken from a test sample or the visual input.

Description

Method for visual object localization using privileged information and apparatus for performing the same

Embodiments of the present invention relate to a learning algorithm that incorporates privileged information and, more particularly, to a method for visual object localization using privileged information and an apparatus for performing the same.

Object localization is often cast as a binary classification problem. A conventional learned classifier determines the presence or absence of the target object within candidate windows of every location, size, and aspect ratio. Recently, structured prediction techniques based on the support vector machine (SVM) have been applied to this object localization problem.

In structured prediction, the optimal bounding box containing the target object is obtained through a trained classifier. This approach provides a unified framework for detection and post-processing and can handle objects with varying aspect ratios.

However, object localization is inherently difficult because of the large variation in objects and scenes, for example shape deformation, color differences, pose changes, occlusion, viewpoint changes, and background clutter. The difficulty is particularly severe when the training dataset is small.

If additional high-level information about the object of interest is available at training time, a more reliable model can be learned even from fewer training samples. Such high-level information may be called privileged information. Privileged information typically describes useful semantic features of an object, such as its parts, attributes, and segmentation. This idea is a typical instance of Learning Using Privileged Information (LUPI). LUPI approaches exploit the additional information to improve the performance of the trained predictive model. The existing LUPI framework is combined with the SVM in the form of the SVM+ algorithm. However, applications of SVM+ have mostly been limited to binary classification problems.

To solve the above problems of the prior art, an object of the present invention is to provide a new structured SVM framework using privileged information (SSVM+) that allows the learning algorithm to be applied to the object localization problem. That is, an object of the present invention is to provide a method for visual object localization using privileged information and an apparatus for performing the same.

Another object of the present invention is to provide a method for visual object localization using privileged information, and an apparatus for performing the same, that improve the performance of an algorithm or apparatus for learning or object localization by combining privileged information with a framework, including a binary learning framework, and tuning the model parameters for better generalization.

To achieve the above objects, one aspect of the present invention provides a visual object localization method including: generating a learning framework that combines privileged information with a structured prediction framework; performing alternating optimization learning on the learning framework using training samples that include the privileged information; generating a prediction model from the learning framework on which the alternating optimization learning has been performed; and using the prediction model to predict or localize an object in a specific image from the visual input (which may include a test sample).

To achieve the above objects, another aspect of the present invention provides a computer-readable medium on which a program for performing the above visual object localization method is recorded.

To achieve the above objects, still another aspect of the present invention provides a visual object localization apparatus including: a framework generation unit that generates a learning framework combining privileged information with a structured prediction framework; a learning unit that performs alternating optimization learning on the learning framework using training samples that include the privileged information; a model generation unit that generates a prediction model from the learning framework on which the alternating optimization learning has been performed; and a localization unit that uses the prediction model to predict or localize an object in a specific image from the visual input.

Here, the framework generation unit may combine a first function on a first space based on the privileged information with a second function on a second space based on the training samples. The privileged information includes the segmentation, parts, or attributes of the training samples, or a combination thereof, and combining the first and second functions may include mapping the space containing the images and attributes of the training samples to the space of bounding box coordinates.

Here, the structured prediction framework may include a structured support vector machine (structured SVM) classifier.

Here, the learning unit may handle the term of the objective function that corresponds to the privileged information through alternating loss-augmented inference.

Here, the learning unit may include a first learning unit that, through alternating loss-augmented inference, alternately performs Efficient Subwindow Search (ESS) in the first space based on the privileged information and in the second space based on the training samples.

Here, the learning unit may further include a second learning unit that, through alternating loss-augmented inference, extracts all possible bounding boxes from the target image of a training sample and estimates the bounding box coordinates of the object.

Here, the learning unit may further include a third learning unit that, through alternating loss-augmented inference, builds a joint feature map that incorporates the bounding box coordinates and models the relationship between the input and output variables.
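The alternation between the two spaces can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `alternating_inference`, the three callables, and the finite candidate list are hypothetical, and a plain exhaustive search over the candidates stands in for Efficient Subwindow Search. The sketch performs coordinate ascent, so it may stop at a local optimum.

```python
def alternating_inference(candidates, score_visual, score_priv, joint_loss, max_iters=10):
    """Coordinate ascent on joint_loss(y, y_star) + score_visual(y) + score_priv(y_star).

    candidates   : finite list of candidate outputs (e.g. bounding boxes); hypothetical
    score_visual : callable, score of a candidate in the visual space
    score_priv   : callable, score of a candidate in the privileged space
    joint_loss   : callable (y, y_star) -> float, loss term coupling the two spaces
    """
    def objective(y, y_star):
        return joint_loss(y, y_star) + score_visual(y) + score_priv(y_star)

    # Start from the best candidate in each space taken on its own.
    y = max(candidates, key=score_visual)
    y_star = max(candidates, key=score_priv)

    for _ in range(max_iters):
        # Visual-space step: privileged output fixed (exhaustive search, not ESS).
        new_y = max(candidates, key=lambda c: objective(c, y_star))
        # Privileged-space step: visual output fixed.
        new_y_star = max(candidates, key=lambda c: objective(new_y, c))
        if (new_y, new_y_star) == (y, y_star):
            break  # neither step changed anything: a fixed point was reached
        y, y_star = new_y, new_y_star
    return y, y_star
```

Replacing each `max(...)` call with a branch-and-bound search would keep the same alternation while avoiding enumeration of every candidate window.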

Here, the localization unit may find the optimal bounding box given by the learned weight vector of the prediction model and the visual features in the specific image from the visual input.

Here, the visual object localization apparatus may further include a verification unit that is coupled to the framework generation unit, or placed between the framework generation unit and the model generation unit, and that verifies the learning framework against ground-truth image information containing a specific object.

Here, the visual object localization apparatus may include a memory system that stores programs or instructions for operating the framework generation unit, the learning unit, the model generation unit, the localization unit, or a combination thereof, and a processor that is connected to the memory system and executes the programs or instructions to localize a predesignated object in the visual input.

With the method for visual object localization using privileged information and a structured support vector machine (SSVM) according to the present invention, and the apparatus for performing it, a new framework can be provided that localizes objects by exploiting privileged information that is not assumed to be needed at test time. That is, by combining privileged information with the initial learning framework and tuning the model parameters for better generalization, the performance of an algorithm or apparatus for learning or object localization can be improved.

In addition, according to the present invention, an SSVM+ framework that can handle privileged information together with conventional visual features can be built by coupling it with an alternating loss-augmented inference method based on efficient subwindow search.

In addition, according to the present invention, a performance gain can be achieved in localizing and classifying objects in images, and the gain is particularly large for small training datasets. As an example, in localizing birds in the CUB-2011 dataset, performance can be improved by using attributes and segmentation masks as privileged information in addition to standard visual features.

In addition, the present invention can improve image classification or localization performance over conventional methods such as transfer learning; learning with side information or domain adaptation; learning based on pairwise constraints, multiple kernels, or metrics; and zero-shot learning.

FIG. 1 is a flowchart of a visual object localization method according to an embodiment of the present invention.
FIG. 2 is a schematic diagram of an object localization framework using privileged information that can be employed in the method of FIG. 1.
FIG. 3 is an enlarged schematic diagram of the process in which SSVM+ learning in the framework of FIG. 2 optimizes loss-augmented inference by alternately performing Efficient Subwindow Search (ESS) in the privileged space and the visual space.
FIGS. 4A and 4B are graphs comparing the performance of the method of FIG. 1 with that of a comparative SSVM over 100 classes, in terms of average overlap ratio and number of detections, respectively.
FIG. 5 is a block diagram of a visual object localization apparatus according to another embodiment of the present invention.
FIG. 6 is a block diagram of a visual object localization apparatus according to still another embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail below. This is not intended to limit the invention to the specific embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope.

Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be called a second component, and similarly a second component may be called a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of them.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that component, but intervening components may also be present. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, no intervening components are present.

The terminology used in this application serves only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "include" or "have" specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the relevant art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in more detail with reference to the accompanying drawings. To aid overall understanding, the same reference numerals are used for the same components in the drawings, and duplicate descriptions of the same components are omitted. Detailed descriptions of related known functions or configurations are also omitted where they could unnecessarily obscure the subject matter of the invention.

FIG. 1 is a flowchart of a visual object localization method according to an embodiment of the present invention. FIG. 2 is a schematic diagram of an object localization framework using privileged information that can be employed in the method of FIG. 1. FIG. 3 is a schematic diagram of the process in which SSVM+ learning in the framework of FIG. 2 optimizes loss-augmented inference by alternately performing Efficient Subwindow Search (ESS) in the privileged space and the visual space. FIGS. 4A and 4B are graphs comparing the performance of the method of FIG. 1 with that of a comparative SSVM over 100 classes, in terms of average overlap ratio and number of detections, respectively.

Referring to FIGS. 1 to 3, the visual object localization method according to this embodiment includes: generating a learning framework that combines privileged information with a structured prediction framework (S11); performing alternating optimization learning on the learning framework (S12); generating a prediction model from the learning framework on which the alternating optimization learning has been performed (S13); and using the prediction model to predict or localize an object in a specific image taken from a test sample or the visual input (S14). The method may be performed by any apparatus capable of digital signal processing.

In this embodiment, the term "privileged information" refers to high-level information that is useful for understanding an image; using it therefore helps in learning a reliable model even from a small amount of training data.

In practice, privileged information is available only during training, since it is hard to obtain efficiently from visual data without human supervision. In this embodiment, therefore, privileged information prepared in advance, for example parts, attributes, and segmentation, is incorporated into the learning of a prediction function for structured object localization (see FIGS. 2 and 3). The high-level information associated with the framework may be used at test time as well as at training time. The learning algorithm under this framework employs an efficient branch-and-bound loss-augmented subwindow search and can perform inference by joint optimization in the original visual space and the privileged space. If the additional information is not used at test time, inference in the test phase can be similar to that of a standard structured SVM (SSVM).

A standard learning algorithm typically requires a large amount of data to build a robust model, whereas zero-shot learning requires no training samples at all. The general Learning Using Privileged Information (LUPI) framework aims to learn a good model from a small amount of training data by taking advantage of privileged information available at training time, and so can be regarded as lying between traditional data-driven learning and zero-shot learning. Privileged information has been considered for face recognition, facial feature detection, and event recognition, but learning or object localization methods using privileged information have not yet become common. In this embodiment, the LUPI framework is applied to SSVM-based object localization. Using an SSVM for object localization is already known, and SSVMs have recently been adopted as a component of localization methods. However, none of these prior techniques incorporates privileged information or uses a similar approach.

Each step of the above visual object localization method is described below in more detail, starting from its background.

Learning using privileged information

The Learning Using Privileged Information (LUPI) paradigm is a framework for incorporating, during training, additional information that is not available at test time. Including this information helps find a better model, which yields lower generalization error. Unlike typical supervised learning, in the LUPI paradigm, when the data pairs are $(x_1, y_1), \ldots, (x_n, y_n)$ with $x_i \in \mathcal{X}$, additional information $x_i^* \in \mathcal{X}^*$ is provided with each training sample.

The training samples are thus, for example, $(x_1, x_1^*, y_1), \ldots, (x_n, x_n^*, y_n)$.

This privileged information, however, is not required at test time. In the LUPI paradigm, the task is to find, among a collection of functions, the one that best approximates the decision function underlying the given data.

In particular, this embodiment performs object localization within the LUPI framework by jointly learning a pair of functions $h : \mathcal{X} \to \mathcal{Y}$ and $\phi : \mathcal{X}^* \to \mathcal{Y}$. Here, only $h$ is used for prediction. These functions map, for example, the space of images and attributes to the space of bounding box coordinates $\mathcal{Y}$. The decision function $h$ and the correcting function $\phi$ depend on each other through the relation given as [Equation 1].
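Written out, the relation of [Equation 1] takes roughly the following form. This is a sketch reconstructed from the definitions in the surrounding text, with $h$ and $\phi$ denoting the decision and correcting functions and $\ell_{\mathcal{X}}$, $\ell_{\mathcal{X}^*}$ the empirical losses on the visual and privileged spaces; the symbol names are assumptions, since the equation itself is not reproduced in this text.

```latex
\forall\, 1 \le i \le n:\qquad
\ell_{\mathcal{X}}\bigl(h(x_i),\, y_i\bigr) \;\le\;
\ell_{\mathcal{X}^*}\bigl(\phi(x_i^*),\, y_i\bigr)
\tag{1}
```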

In Equation 1, $\ell_{\mathcal{X}}$ and $\ell_{\mathcal{X}^*}$ denote the empirical loss functions on the visual space ($\mathcal{X}$) and the privileged space ($\mathcal{X}^*$), respectively. This inequality between the two spaces is inspired by the LUPI paradigm: for every training sample, the model $h$ is always corrected so that its loss on the data is smaller than the loss of the model $\phi$ on the privileged information. The constraint in [Equation 1] is meaningful under the assumption that, for the same number of training samples, the combination of visual and privileged information provides a space in which a better model can be learned than from visual information alone.

To put this general learning idea into practice, the SVM+ algorithm for binary classification has been developed. SVM+ replaces the slack variable $\xi$ in the standard SVM formulation with a correcting function $\xi(x^*)$, whose value is estimated from the privileged information. The resulting formulation is given as [Equation 2].
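For orientation, the SVM+ problem referred to as [Equation 2] is commonly written as follows. This is a sketch under the assumption of a linear correcting function $\langle w^*, x^* \rangle + b^*$ and a trade-off constant $C$; normalization conventions for $C$ vary, and the exact form used in the original equation is not reproduced in this text.

```latex
\begin{aligned}
\min_{w,\, w^*,\, b,\, b^*}\quad
  & \tfrac{1}{2}\lVert w \rVert^2 + \tfrac{\gamma}{2}\lVert w^* \rVert^2
    + C \sum_{i=1}^{n} \bigl(\langle w^*, x_i^* \rangle + b^*\bigr) \\
\text{s.t.}\quad
  & y_i \bigl(\langle w, x_i \rangle + b\bigr) \;\ge\; 1 - \bigl(\langle w^*, x_i^* \rangle + b^*\bigr), \\
  & \langle w^*, x_i^* \rangle + b^* \;\ge\; 0, \qquad \forall\, 1 \le i \le n
\end{aligned}
\tag{2}
```

Compared with the standard soft-margin SVM, each slack $\xi_i$ is replaced by the value of the correcting function at $x_i^*$, so no separate slack variables appear.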

Here, the terms $w^*$, $x^*$, and $b^*$ play the same roles as $w$, $x$, and $b$ in the classical SVM, but within the new correcting space $\mathcal{X}^*$. In addition, $\gamma$ denotes the regularization parameter for $w^*$. It is important to note that the weight vector $w$ depends not only on $x$ but also on $x^*$. For this reason, the function that replaces the slack variable $\xi$ is called the correcting function. Because the privileged information is used only to estimate the values of the slack function, it is required only during training and not at test time. Theoretical analysis shows that the bound on the convergence rate of the SVM+ algorithm can be substantially better than that of the standard SVM.

Structured SVM (SSVM)

A structured support vector machine (SSVM) discriminatively learns a weight vector $w$ for a scoring function $F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ over a dataset of training input/output pairs. Once learned, the prediction function $f$ is obtained by maximizing $F$ over all possible $y \in \mathcal{Y}$, as given in [Equation 3].
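The prediction rule of [Equation 3] can be written as follows. This is a sketch assuming the usual linear scoring function $F(x, y) = \langle w, \Psi(x, y) \rangle$ of a structured SVM, where $\Psi$ is the joint feature map; the equation itself is not reproduced in this text.

```latex
\hat{y} \;=\; f(x) \;=\; \operatorname*{arg\,max}_{y \in \mathcal{Y}} F(x, y)
        \;=\; \operatorname*{arg\,max}_{y \in \mathcal{Y}} \,\langle w, \Psi(x, y) \rangle
\tag{3}
```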

Here, $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$ is the joint feature map that models the relationship between the input $x$ and the structured output $y$. To learn the weight vector $w$, the margin-rescaling optimization problem given as [Equation 4] is solved.
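The margin-rescaling problem referred to as [Equation 4] is usually stated as follows. This is a sketch in the standard structured SVM form, assuming slack variables $\xi_i$, a trade-off constant $C$, the feature difference $\Delta\Psi_i(y) \equiv \Psi(x_i, y_i) - \Psi(x_i, y)$, and a task-specific loss $\Delta(y_i, y)$; the equation itself is not reproduced in this text.

```latex
\begin{aligned}
\min_{w,\, \xi}\quad
  & \tfrac{1}{2}\lVert w \rVert^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \\
\text{s.t.}\quad
  & \langle w, \Delta\Psi_i(y) \rangle \;\ge\; \Delta(y_i, y) - \xi_i,
    \qquad \forall i,\ \forall y \in \mathcal{Y}
\end{aligned}
\tag{4}
```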

Here, $\Delta\Psi_i(y) \equiv \Psi(x_i, y_i) - \Psi(x_i, y)$, and $\Delta(y_i, y)$ is a task-specific loss that measures the quality of the prediction $y$ with respect to the ground truth $y_i$. To obtain a prediction, [Equation 3] is maximized over the response variable $y$ for a given input $x$. The SSVM is a general method for solving a wide variety of prediction tasks. For each application, the joint feature map $\Psi$, the loss function $\Delta$, and an efficient loss-augmented inference technique need to be customized.
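The prediction step can be made concrete with a small Python sketch. It assumes a hypothetical joint feature map that sums per-pixel feature vectors inside a candidate box, and it enumerates every box exhaustively in place of an efficient subwindow search; `joint_feature`, `candidate_boxes`, and `predict` are illustrative names, not part of the described method.

```python
import itertools

import numpy as np


def joint_feature(feat_map, box):
    """Hypothetical Psi(x, y): sum of per-pixel features inside box (x0, y0, x1, y1), half-open."""
    x0, y0, x1, y1 = box
    return feat_map[y0:y1, x0:x1].sum(axis=(0, 1))


def candidate_boxes(height, width):
    """Enumerate every axis-aligned box with non-zero area on the pixel grid."""
    for y0, y1 in itertools.combinations(range(height + 1), 2):
        for x0, x1 in itertools.combinations(range(width + 1), 2):
            yield (x0, y0, x1, y1)


def predict(w, feat_map):
    """Return the box maximizing <w, Psi(x, y)> by exhaustive search (assumed stand-in for ESS)."""
    height, width = feat_map.shape[:2]
    return max(candidate_boxes(height, width),
               key=lambda box: float(w @ joint_feature(feat_map, box)))
```

Exhaustive enumeration grows with the fourth power of the image side length, which is why a branch-and-bound search is preferred for real images.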

Object localization with privileged information

In the object localization method using privileged information according to this embodiment, a set of training images of objects is given together with their locations, attributes, and segmentation information, and the goal is to learn a function that localizes the object of interest in previously unseen images. Unlike existing methods, the learned function requires neither explicit nor inferred attribute and segmentation information at prediction time.

Structured SVM using privileged information (SSVM+)

The structured prediction problem above is now extended to exploit privileged information. Following [Equation 1], the apparatus of this embodiment learns to predict a structured output $y$ from a set of training triplets in order to learn the pair of mutually dependent functions $h$ and $\phi$. The training triplets are assumed to be $(x_1, x_1^*, y_1), \ldots, (x_n, x_n^*, y_n)$ with $x_i \in \mathcal{X}$, $x_i^* \in \mathcal{X}^*$, $y_i \in \mathcal{Y}$, where $x_i$ corresponds to visual features, $x_i^*$ corresponds to attributes and segmentations, and $\mathcal{Y}$ denotes the space of all possible bounding boxes. Once learned, only the function $h$ is used for prediction. As in [Equation 3], $h$ is obtained by maximizing the learned function over all possible joint features based on the input $x \in \mathcal{X}$ and output $y \in \mathcal{Y}$, exactly as in a standard SSVM.

Meanwhile, to learn jointly the two functions subject to the constraint of [Equation 1], this embodiment substantially extends the SSVM framework. The two functions can each be characterized by a parameter vector, as expressed in [Equation 5].

To learn the two weight vectors, that is, the two parameter vectors, simultaneously, this embodiment proposes a new max-margin structured prediction framework that incorporates the constraint of [Equation 1]. This framework, shown in FIG. 2, may be referred to as SSVM+. SSVM+ learns the two models jointly, as in [Equation 6].

Here, the inequality in [Equation 1] is introduced through a surrogate task-specific loss derived in Dmitry Pechyony and Vladimir Vapnik, "On the theory of learning with privileged information," NIPS, pages 1894-1902, 2010. This surrogate loss can be defined as in [Equation 7].

Here, the penalization parameter corresponds to the constraint in [Equation 1], and the two task-specific loss functions, one for each space, are defined in [Equation 10]. Through the surrogate loss, this embodiment can properly incorporate the inequality of [Equation 1] into an ordinary max-margin optimization framework.

As shown in FIG. 3, the framework of this embodiment forces the model learned on attributes and segmentation to always correct the model trained on visual features. As a result, it produces a model that generalizes better than one trained on visual features alone. As with SSVM, the exponential number of possible constraints in the above problem can be handled tractably through loss-augmented inference combined with an optimization method such as the cutting-plane algorithm or the more recent block-coordinate Frank-Wolfe method. Pseudocode for solving [Equation 6] with the cutting-plane method is given as Algorithm 1 in Table 1 below.

Algorithm 1 is shown in Table 1 below.

Table 1 is an example of an algorithm that solves [Equation 6] using the cutting-plane method.

The algorithm of this embodiment has a generic form that follows the SSVM framework. This means that [Equation 6] is independent of the definitions of the joint feature map, the task-specific loss, and loss-augmented inference. The embodiment can therefore be applied to a variety of problems beyond object localization. All that is required is a definition of these three problem-specific components, which a standard SSVM requires as well. As described below, only the loss-augmented inference step becomes harder than in SSVM, because privileged information is included.

Joint Feature Map

The SSVM+ of this embodiment, as an extended structured output regressor, estimates bounding box coordinates by considering all possible bounding boxes in the target images. The structured output space is defined accordingly.

Here, one variable indicates the presence or absence of the object, and four further variables correspond to the top, left, bottom, and right coordinates of the bounding box, respectively. To relate the input and output variables, this embodiment defines a joint feature map, which encodes the features of the input within the bounding box defined by the output. This is modeled as in [Equation 8].

Here, the restricted input denotes the image region inside the bounding box with the given coordinates.
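The joint feature map can be illustrated with a minimal sketch. It assumes that the features are bag-of-visual-words counts of keypoints falling inside the box and that an absent object maps to the all-zero vector; the function name `joint_feature_map` and its argument layout are hypothetical and are not taken from the specification.

```python
import numpy as np

def joint_feature_map(points, codewords, box, omega, vocab_size):
    """Hypothetical joint feature map: histogram of visual words inside `box`.

    points     : (N, 2) array of (row, col) keypoint locations
    codewords  : (N,) integer array of visual-word indices in [0, vocab_size)
    box        : (top, left, bottom, right), inclusive coordinates
    omega      : +1 if the object is present, -1 if it is absent
    """
    if omega != 1:
        # Object absent: return the all-zero feature vector.
        return np.zeros(vocab_size)
    top, left, bottom, right = box
    rows, cols = points[:, 0], points[:, 1]
    inside = (rows >= top) & (rows <= bottom) & (cols >= left) & (cols <= right)
    # Count how often each visual word occurs inside the box.
    return np.bincount(codewords[inside], minlength=vocab_size).astype(float)
```

A linear model then scores a candidate box as the inner product of a weight vector with this histogram, which is what makes per-box scores cheap to evaluate.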

Likewise, the apparatus performing the method of this embodiment defines another joint feature map for the privileged space. Instead of visual features, the privileged space operates on the space of attributes, aided by segmentation information, as in [Equation 9].

The definition of the joint feature map is problem-specific; for object localization it can follow the method proposed in Document 1: Matthew B. Blaschko and Christoph H. Lampert, "Learning to localize objects with structured output regression," in ECCV, pages 2-15, 2008. Detailed embodiments of the two joint feature maps are described below.

Task-Specific Loss

To measure the level of disagreement between a predicted output and the true structured label, this embodiment defines a loss function that measures that disagreement efficiently. For the object localization problem of this embodiment, a task-specific loss based on the Pascal VOC overlap ratio can be obtained in both spaces, as in [Equation 10].

Here, the label indicates the presence (+1) or absence (-1) of the object in the i-th image.

Under the corresponding conditions, 0 denotes the all-zero vector. The loss is 0 when the bounding boxes defined by the predicted and true outputs are identical, and equals 1 when they are disjoint or when the additional condition in [Equation 10] holds.
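A minimal sketch of an overlap-based loss with the properties stated above (0 for identical boxes, 1 for disjoint boxes) is given here. It assumes the form used in Document 1, in which the loss is also 1 when the presence labels of the two outputs differ and 0 when both outputs mark the object as absent; that handling of the labels, the output layout `(omega, top, left, bottom, right)`, and the function names are assumptions rather than a restatement of [Equation 10].

```python
def box_area(box):
    """Area of a (top, left, bottom, right) box; empty boxes have area 0."""
    top, left, bottom, right = box
    return max(0.0, bottom - top) * max(0.0, right - left)

def voc_overlap_loss(y_true, y_pred):
    """Assumed task-specific loss: 1 minus the Pascal VOC overlap ratio.

    Each output is (omega, top, left, bottom, right) with omega in {+1, -1}.
    """
    omega_t, *box_t = y_true
    omega_p, *box_p = y_pred
    if omega_t != omega_p:
        return 1.0            # presence labels disagree (assumed case)
    if omega_t == -1:
        return 0.0            # both outputs say "absent" (assumed case)
    # Intersection rectangle of the two boxes.
    inter_box = (max(box_t[0], box_p[0]), max(box_t[1], box_p[1]),
                 min(box_t[2], box_p[2]), min(box_t[3], box_p[3]))
    inter = box_area(inter_box)
    union = box_area(box_t) + box_area(box_p) - inter
    # Degenerate zero-area boxes are treated as a complete miss.
    return 1.0 - inter / union if union > 0 else 1.0
```

For example, two 2x2 boxes offset by one column share an area of 2 out of a union of 6, so the loss is 1 - 1/3.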

Loss-Augmented Inference

Because of the exponential number of constraints that arise while learning [Equation 6] and the potentially very large search space that must be handled during prediction, training and testing of the SSVM+ framework in this embodiment each require a different efficient inference technique.

Prediction

The goal of the object localization method in this embodiment is to find the optimal bounding box given the learned weight vector and the visual features. Privileged information is not available at test time, so inference is performed on visual features alone. Consequently, the same maximization problem as in a standard SSVM must be solved at prediction time, as expressed in [Equation 11].

This maximization ranges over the space of bounding box coordinates. Because that search space is very large, the problem cannot be solved exhaustively. For object localization, the Efficient Subwindow Search (ESS) algorithm can be adopted to solve the optimization problem efficiently. For the ESS algorithm, see Document 2: Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann, "Efficient subwindow search: A branch and bound framework for object localization," TPAMI, 31(12):2129-2142, 2009.
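To make the size of this search concrete, the following sketch solves the box maximization by brute force on a tiny input. It is a stand-in for ESS, not an implementation of it: it assumes a linear bag-of-visual-words model, so that the score of a box is the sum of per-pixel scores inside it, and the name `best_box_exhaustive` and the `score_map` input are hypothetical. Its cost grows with the square of both image height and width, which is why a branch-and-bound search is preferred in practice.

```python
import numpy as np

def best_box_exhaustive(score_map):
    """Return (best_score, (top, left, bottom, right)) over all inclusive boxes.

    score_map[r, c] holds the summed weights of the features at pixel (r, c).
    """
    h, w = score_map.shape
    # Integral image with a zero border: any box sum becomes four lookups.
    integral = np.zeros((h + 1, w + 1))
    integral[1:, 1:] = score_map.cumsum(axis=0).cumsum(axis=1)
    best_score, best_box = -np.inf, None
    for top in range(h):
        for bottom in range(top, h):
            for left in range(w):
                for right in range(left, w):
                    s = (integral[bottom + 1, right + 1] - integral[top, right + 1]
                         - integral[bottom + 1, left] + integral[top, left])
                    if s > best_score:
                        best_score, best_box = s, (top, left, bottom, right)
    return best_score, best_box
```

ESS reaches the same maximizer without enumerating every box by bounding the score over whole sets of rectangles at once.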

Learning

Compared with the inference problem required during prediction, shown in [Equation 11], the optimization at training time, which is the main challenge of this embodiment, involves a more complex inference procedure. During each iteration, the learning algorithm of this embodiment must perform the maximization of [Equation 12], which contains additional terms corresponding to the privileged space as well as the surrogate loss.

In [Equation 12], two of the terms are constants and do not affect the optimization. The problem in [Equation 12], referred to as loss-augmented inference, is solved during each iteration of the cutting-plane method and is used to learn the two functions, that is, to solve for the two weight vectors.

This embodiment adopts an alternating approach for the inference. First, with the solution in the original space held fixed, the solution in the privileged space is obtained as defined in [Equation 13].

Next, with the privileged-space solution thus obtained, the method of this embodiment performs the optimization in the original space, as expressed in [Equation 14].

In the method of this embodiment, the two sub-procedures of [Equation 13] and [Equation 14] are repeated until convergence, which yields the final solutions in the two spaces.
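The alternating procedure can be sketched as coordinate-wise maximization over two finite candidate sets. The sketch below is illustrative only: `objective` stands in for the loss-augmented objective of [Equation 12], the candidate lists replace the ESS searches, and the function name `alternating_argmax` is hypothetical.

```python
def alternating_argmax(objective, ys, ys_star, y_init, max_iters=100):
    """Alternately maximize objective(y, y_star) over y_star and then over y.

    Returns (y, y_star, value) at a point where neither step changes the
    solution. This is a coordinate-wise optimum, not necessarily a global one.
    """
    y = y_init
    # Privileged-space step: fix y, search over y_star.
    y_star = max(ys_star, key=lambda c: objective(y, c))
    for _ in range(max_iters):
        # Original-space step: fix y_star, search over y.
        new_y = max(ys, key=lambda c: objective(c, y_star))
        # Privileged-space step again with the updated y.
        new_y_star = max(ys_star, key=lambda c: objective(new_y, c))
        if (new_y, new_y_star) == (y, y_star):
            break  # converged: both sub-problems return the same solution
        y, y_star = new_y, new_y_star
    return y, y_star, objective(y, y_star)
```

Each step can only increase the objective, so on finite candidate sets the loop terminates, although the point it reaches depends on the initial solution.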

For object localization, both problems can be solved with ESS, a branch-and-bound optimization technique. Here it is important to derive upper bounds of the above objective functions over sets of rectangles. Only the upper bounds of the surrogate loss terms of [Equation 7] are derived here; the derivations for the remaining terms can be obtained from Document 2.

When the solution in the privileged space is fixed, only the upper bound of the task-specific loss in the original space needs to be considered to obtain an upper bound of the surrogate loss. Since the surrogate loss is a monotonically increasing function of that loss, its upper bound follows directly from the upper bound of that loss, which is given by [Equation 15].

The upper bound of the surrogate loss with the privileged-space solution fixed is then given by [Equation 16].

When the original space is fixed, the problem becomes more complicated, because the surrogate function becomes a V-shaped function of the privileged-space loss. In this case, the output of the function must be checked at both the upper and the lower bound of that loss. The upper bound is derived in the same way as the upper bound of the original-space loss, and the lower bound is given by [Equation 17].

With the upper and lower bounds of the privileged-space loss so obtained, the upper bound of the surrogate loss with the original-space solution fixed is given by [Equation 18].
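The two bounding cases can be checked numerically with a small sketch. It assumes a surrogate of the form delta_star / rho + max(delta - delta_star, 0), which has the two properties the text relies on (non-decreasing in the original-space loss, V-shaped in the privileged-space loss when rho > 1); this form, the parameter name `rho`, and the function names are assumptions, not a restatement of [Equation 7].

```python
def surrogate_loss(delta, delta_star, rho):
    """Assumed surrogate: delta_star / rho + max(delta - delta_star, 0)."""
    return delta_star / rho + max(delta - delta_star, 0.0)

def bound_privileged_fixed(delta_upper, delta_star, rho):
    """Privileged solution fixed: the surrogate is non-decreasing in delta,
    so inserting an upper bound on delta gives an upper bound on the surrogate."""
    return surrogate_loss(delta_upper, delta_star, rho)

def bound_original_fixed(delta, star_lower, star_upper, rho):
    """Original solution fixed: the surrogate is convex (V-shaped) in
    delta_star, so its maximum over an interval lies at one of the endpoints."""
    return max(surrogate_loss(delta, star_lower, rho),
               surrogate_loss(delta, star_upper, rho))
```

Checking both endpoints is needed because the function may decrease before it increases, so neither endpoint alone is guaranteed to give the maximum.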

In the method of this embodiment, by establishing the bounds of the surrogate loss as in [Equation 17] and [Equation 18], the objective function of [Equation 12] can be optimized through an alternating procedure based on the standard ESS algorithm.

Experiments

Dataset

The object localization method of this embodiment was evaluated empirically on Caltech-UCSD Birds 2011 (CUB-2011). CUB-2011 contains 200 categories of different bird species. The location of each bird is specified with a bounding box. In addition, a large amount of privileged information is provided in the form of annotations of 15 different parts, 312 attributes, and segmentation masks, all labeled manually by human annotators for each image. Each category contains 30 training images and about 30 test images.

Visual and Privileged Feature Extraction

In this embodiment, the feature descriptor in the visual space uses a bag-of-visual-words (BoVW) model based on Speeded-Up Robust Features (SURF) (see Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool, "Speeded-up robust features (SURF)," CVIU, 110(3):346-359, 2008). In addition, attributes and segmentation masks are adopted as privileged information. Attribute information is described by a 312-dimensional vector, each element of which corresponds to one attribute and takes a binary value according to its visibility and relevance. The segmentation information is used to fill in the segmentation mask within each image, so that the resulting image contains the original background pixels together with uniform foreground pixels.

Then, in the method of this embodiment, a 3000-dimensional feature descriptor is extracted based on the same BoVW model as in the visual space. The intuition behind this approach is to generate a set of features that give a reliable, strong response in the foreground region. This response can be stronger than in the original space, which makes localization in the privileged space easier. For each subwindow, a histogram is built from the presence of attributes and the frequencies of the privileged codewords corresponding to the additional visual space.

Evaluation

To evaluate the SSVM+ algorithm of this embodiment, it was compared with the original SSVM localization method of Blaschko and Lampert in several training scenarios. In all experiments, the three hyperparameters were tuned over a search space spanning a predefined range of values. For the SSVM localization method, only the single dimension of the search space corresponding to its one parameter was searched.

First, the effect of small training sample sizes on localization performance was investigated. This setup loosely follows the experimental setup of Document 3: Ryan Farrell, Om Oza, Ning Zhang, Vlad I. Morariu, Trevor Darrell, and Larry S. Davis, "Birdlets: Subordinate categorization using volumetric primitives and pose-normalized appearance," in ICCV, pages 161-168, 2011.

For training, the focus was on 14 bird categories corresponding to two major bird groups. Four different models were trained in this embodiment, each on a different number of training images per class, yielding four training sets of correspondingly different total sizes. In addition, a model was trained on 1000 images (n=1000) corresponding to 100 bird classes with 10 training images each. As the validation set, 500 training images randomly selected from categories other than those used for training were used.

For testing, all test images of the entire CUB-2011 dataset were used. The results of this experiment are shown in Table 2 below. Table 2 compares the mean overlap (A) and the detection results (B) of the structured learning method with privileged information of this embodiment (SSVM+) and the standard structured learning method (SSVM) on 100 classes of the CUB-2011 dataset.

As Table 2 shows, in every case the object localization method of this embodiment outperformed the comparative SSVM method in both mean overlap and mean detection (PASCAL VOC overlap ratio > 50%). This indicates that, for the same number of training samples, the method of this embodiment consistently converges to a model with better generalization performance than the comparative SSVM. Table 2 also clearly shows that the benefit of privileged information relative to the comparative example tends to diminish as the training set grows.
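The two reported measures can be computed as in the following sketch, which assumes one predicted box per test image, boxes given as `(top, left, bottom, right)`, and a detection counted when the overlap ratio exceeds 0.5. The function names and the box layout are illustrative assumptions.

```python
def overlap_ratio(box_a, box_b):
    """Pascal VOC overlap ratio (intersection over union) of two boxes."""
    top, left = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    bottom, right = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, bottom - top) * max(0.0, right - left)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def evaluate(ground_truth, predictions, threshold=0.5):
    """Return (mean overlap, number of detections) over paired boxes."""
    overlaps = [overlap_ratio(g, p) for g, p in zip(ground_truth, predictions)]
    mean_overlap = sum(overlaps) / len(overlaps)
    detections = sum(1 for o in overlaps if o > threshold)
    return mean_overlap, detections
```

Dividing the number of detections by the number of test images gives the detection rate used when comparing classes with different numbers of images.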

To assess the advantage of the SSVM+ of this embodiment in more depth, the mean overlap ratio and the number of detections for each of the 100 classes were evaluated using a model trained on 14 classes with 10 images per class (n=140).

As shown in FIGS. 4A and 4B, the SSVM+ of this embodiment performs relatively better in both overlap and detection rate for most bird classes. The difference (diff) between the method of this embodiment (SSVM+), shown in blue, and the comparative method (SSVM), shown in gray, is indicated by the black region below them. Each class typically contains 30 test images, but some classes contain 18 or fewer. The mean overlap of this embodiment is 45.8% and the mean number of detections is 12.1 (that is, 41.5%).

According to the embodiment described above, a structured prediction algorithm for SSVM-based object localization that incorporates privileged information is provided. The algorithm first combines privileged information with a structured prediction framework. It can also exploit various types of additional information during training to improve generalization performance at test time. The algorithm can be applied to the object localization problem, which is solved by a new structured SVM model that uses privileged information. Specifically, the object localization method of this embodiment employs an alternating loss-augmented inference procedure to handle the terms of the objective function that correspond to privileged information. Applying the proposed algorithm to the Caltech-UCSD Birds 200-2011 dataset yielded results demonstrating the benefit of exploiting additional information that is available only at training time. The benefit of privileged information does tend to diminish as the number of training samples increases. Nevertheless, the SSVM+ framework of this embodiment can be particularly useful when little training data is available or when annotation is very costly.

FIG. 5 is a block diagram of a visual object localization apparatus according to another embodiment of the present invention.

Referring to FIG. 5, the visual object localization apparatus 11 according to this embodiment is an apparatus that performs the visual object localization method described above. It includes a framework generation unit 111, a learning unit 112, a model generation unit 113, and a search unit 114, and may further include a verification unit 115 depending on the implementation.

The visual object localization apparatus 11 may be implemented as a processor such as a microprocessor, in which case the framework generation unit 111, the learning unit 112, the model generation unit 113, the search unit 114, and the verification unit 115 may respectively include, or correspond to, a framework generation module, a learning module, a model generation module, a search module, and a verification module, in the order listed.

The framework generation unit 111 generates a learning framework that combines privileged information with a structured prediction framework. Here, the structured prediction framework may include a structured support vector machine (structured SVM) classifier. The framework generation unit 111 may combine a first function of a first space based on the privileged information with a second function of a second space based on the training samples. The privileged information includes segmentation, parts, or attributes of the training samples, or a combination of these, and combining the first and second functions may include linking a space containing the images and attributes of the training samples to the space of bounding box coordinates.

The learning unit 112 performs alternating optimization learning on the learning framework. The learning unit 112 may process the terms of the objective function that correspond to the privileged information through alternating loss-augmented inference. As in [Equation 6], the learning unit 112 may jointly learn a first model based on the training examples and a second model based on the privileged information of the training examples.

The model generation unit 113 generates a prediction model from the learning framework on which alternating optimization learning has been performed. The prediction model may operate such that the model learned from attributes and segmentation always corrects the model trained on visual features. As a result, the model generation unit 113 can handle the exponential number of possible constraints involved in the optimization more easily than existing loss-augmented inference or optimization methods (such as the cutting-plane algorithm or the block-coordinate Frank-Wolfe method).

The search unit 114 uses the prediction model to predict or localize an object in a specific image from a test sample or an input video. The search unit 114 may find the optimal bounding box given the learned weight vector of the prediction model and the visual features in the specific image from the input video.

The verification unit 115 may verify the learning framework of the learning unit 112 on the basis of actual images or ground-truth image information. The verification unit 115 may be coupled to the learning unit 112, or coupled between the learning unit 112 and the model generation unit 113, so as to verify the framework during or after learning.

FIG. 6 is a detailed block diagram of a configuration that can be employed in the learning unit of the visual object localization apparatus of FIG. 5.

Referring to FIG. 6, the visual object localization apparatus according to this embodiment may include, as the learning unit 112, a first learning unit 1121, a second learning unit 1122, and a third learning unit 1123.

The first learning unit 1121 alternately performs Efficient Subwindow Search (ESS) in the first space based on the privileged information and in the second space based on the training samples, through alternating loss-augmented inference. The first learning unit 1121 thereby makes it possible to optimize loss-augmented inference.

The second learning unit 1122 estimates the bounding box coordinates of the object by extracting all possible bounding boxes from the target image of a training sample through alternating loss-augmented inference. The second learning unit 1122 may correspond to a concrete embodiment of the operation of the first learning unit 1121. That is, the operation of the second learning unit 1122 may correspond to repeating [Equation 13] and [Equation 14] until convergence and thereby obtaining the final solutions.

The third learning unit 1123 builds, through alternating loss-augmented inference, a joint feature map that links the bounding box coordinates and relates the input and output variables.

With the first learning unit 1121 and the third learning unit 1123, or the second learning unit 1122 and the third learning unit 1123, described above, the structured prediction model is learned with privileged information through alternating loss-augmented inference, which makes it possible to generate a prediction model that performs well even with few data samples.

Meanwhile, in the visual object localization apparatus 11 according to the embodiment described above, the framework generation unit, the learning unit, the model generation unit, the search unit, the verification unit, or a combination of these may be stored in a memory system in the form of programs or instructions for their operation, and may be used to localize a specific object in an input image efficiently when a processor connected to the memory system executes the programs.

That is, in the embodiment described above, the components of the visual object localization apparatus 11 (including 111 to 114) may be implemented as modules or functional units mounted on the processor of a mobile device or a computer device, but are not limited to this. The components may be stored on a computer-readable medium (recording medium) in the form of software that implements the series of functions they perform (the visual object localization method), or may be transmitted to a remote location in carrier form, so as to operate on a variety of computer devices. In this case, the computer-readable medium may be coupled to a plurality of computer devices or a cloud system connected through a network, and at least one of the computer devices or the cloud system may store, in its memory system, a program or source code for performing the visual object localization method using privileged information of this embodiment.

The computer-readable medium may contain program instructions, data files, data structures, and the like, alone or in combination. The programs recorded on the computer-readable medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. The computer-readable medium may also include hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Program instructions may include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as at least one software module in order to perform the variance estimation method of the present invention, and vice versa.

FIG. 7 is a block diagram of a visual object localization apparatus according to still another embodiment of the present invention.

Referring to FIG. 7, the visual object localization apparatus 10 using privileged information according to this embodiment is one embodiment of an apparatus that performs the visual object localization method described above, and may include a processor 11 and a memory system 12. Depending on the implementation, the visual object localization apparatus 10 may also include a network interface 13, and may further include a display device 14 (hereinafter simply "display") and an interface 15.

The processor 11 may be connected to the memory system 12 and execute a program 12a stored in the memory system 12. The program 12a may implement the visual object localization method using privileged information of this embodiment described above. That is, the processor 11 may be mounted on a mobile terminal or computer device capable of image processing in order to localize objects in images, and may operate to predict or localize a specific object in an input image and output the result.

More specifically, the processor 11 learns the object to be localized in an input image, obtained from a camera provided in the system 10 such as a mobile device (the camera may be included in the interface) or from the memory system 12, from a small number of training samples through alternating optimization learning, and can then effectively localize the desired object in the input image using a prediction model with the learned parameters.
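
The search step described above can be sketched as scoring candidate bounding boxes with a learned weight vector and keeping the highest-scoring one. In this minimal sketch the feature function `box_features`, the exhaustive enumeration of boxes (used here in place of a branch-and-bound search such as ESS), and all names are hypothetical choices for illustration, not the method of this specification.

```python
from itertools import product
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom); right/bottom exclusive
Image = Sequence[Sequence[float]]


def candidate_boxes(width: int, height: int, step: int = 1) -> List[Box]:
    """Enumerate every axis-aligned box on a grid with the given step."""
    xs = range(0, width + 1, step)
    ys = range(0, height + 1, step)
    return [(l, t, r, b)
            for l, r in product(xs, xs) if l < r
            for t, b in product(ys, ys) if t < b]


def box_features(image: Image, box: Box) -> List[float]:
    """Hypothetical features: sum of pixel values inside the box, and box area."""
    l, t, r, b = box
    inside = sum(image[y][x] for y in range(t, b) for x in range(l, r))
    return [float(inside), float((r - l) * (b - t))]


def localize(image: Image, w: Sequence[float]) -> Box:
    """Return the box that maximizes the linear score w . phi(image, box)."""
    height, width = len(image), len(image[0])

    def score(box: Box) -> float:
        return sum(wi * fi for wi, fi in zip(w, box_features(image, box)))

    return max(candidate_boxes(width, height), key=score)


# Toy 4x4 image with a bright 2x2 block; the weights reward bright pixels
# and penalize box area (both are assumptions for the example).
toy_image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
best_box = localize(toy_image, [1.0, -0.5])
```

Exhaustive enumeration grows quickly with image size, which is why a practical system would replace `candidate_boxes` with a more efficient search.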

The processor 11 described above may include a framework generation module, a learning module, a model generation module, a search module, and the like. The processor 11 may include one or more processors on which these modules are mounted. The processor 11 may also include any type of computing circuit, such as a microprocessor, microcontroller, graphics processor, or digital signal processor, or any other type of processing circuit. In addition, the processor 11 may include an embedded controller such as a general-purpose or programmable logic device or array, an application-specific integrated circuit, a single-chip computer, or a smart card.

When the processor 11 is a microprocessor, microcontroller, graphics processor, or digital signal processor, the processor 11 may include an arithmetic logic unit (ALU) that performs computation, registers for temporary storage of data and instructions, and a controller for controlling or managing middleware-to-middleware interface devices. When at least one of the modules described above is mounted on the processor 11 in the form of an application program, the processor 11 may include a high-level command processing unit and a module control unit. The module control unit may include a mapping unit and a module interface unit, and each module may be controlled through the module control unit. Here, the high-level command processing unit converts a signal or instruction received through an API (Application Programming Interface) and outputs a high-level command, the mapping unit maps the high-level command to device-level commands that each module can process, and the module interface unit delivers the device-level commands to the corresponding module.
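
The command flow described above (API input, high-level command, device-level commands, delivery to modules) might be organized as in the following sketch. The class names, the string-based command format, and the lookup table are all hypothetical; the specification does not define a concrete interface.

```python
from typing import Callable, Dict, List, Tuple

DeviceCommand = Tuple[str, str]  # (module name, device-level operation)


class HighLevelCommandProcessor:
    """Converts a raw API input into a normalized high-level command."""

    def process(self, api_call: str) -> str:
        return api_call.strip().upper()


class MappingUnit:
    """Maps a high-level command to device-level commands for each module."""

    def __init__(self, table: Dict[str, List[DeviceCommand]]) -> None:
        self.table = table

    def map(self, command: str) -> List[DeviceCommand]:
        return self.table[command]


class ModuleInterface:
    """Delivers each device-level command to the module it targets."""

    def __init__(self, modules: Dict[str, Callable[[str], str]]) -> None:
        self.modules = modules

    def dispatch(self, commands: List[DeviceCommand]) -> List[str]:
        return [self.modules[name](op) for name, op in commands]


def handle(api_call: str, processor: HighLevelCommandProcessor,
           mapper: MappingUnit, interface: ModuleInterface) -> List[str]:
    """Run one API input through the three stages in order."""
    command = processor.process(api_call)
    return interface.dispatch(mapper.map(command))


# Hypothetical modules that simply report the operation they received.
modules = {
    "learning": lambda op: f"learning:{op}",
    "search": lambda op: f"search:{op}",
}
table = {"LOCALIZE": [("learning", "load_model"), ("search", "run")]}
out = handle(" localize ", HighLevelCommandProcessor(),
             MappingUnit(table), ModuleInterface(modules))
```

Keeping the mapping table separate from the modules means a new module can be supported by adding table entries without changing the API-facing stage.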

The memory system 12 may include storage media in the form of volatile or non-volatile memory, such as RAM (Random Access Memory) and ROM (Read Only Memory), and long-term storage media such as a floppy disk, hard disk, tape, CD-ROM, or flash memory. Depending on the implementation, the memory system 12 may also store a program, data, or a set of instructions that performs the visual object localization algorithm of this embodiment.

The network interface 13 may be connected to a network and perform data communication with other communication devices on the network. Using the network interface 13, and assuming that alternating optimization learning with privileged information has been performed in advance, the visual object localization apparatus 10 of this embodiment can download, or receive in real time, the data, instructions, or signals for performing the steps of the visual object localization method and efficiently localize a desired object in an input image. The network interface 13 may be implemented to support one or more communication protocols for data communication over one or more single or combined networks selected from a wireless network, a wired network, a satellite network, power line communication, and the like.

The display 14 is connected to the processor 11 and refers to a means for displaying on a screen at least part of the process of localizing an object in a training sample, validation sample, test sample, or input image used by the visual object localization apparatus 10, or a component that performs a function corresponding to such a means. The display 14 may be connected directly to the processor 11, but is not limited thereto and may be connected at a remote location through the network interface 13. The display 14 may be an LCD (liquid crystal display) device, an OLED (organic light emitting diode) display device, a PDP (plasma display panel) device, a CRT TV equipped with a modem, or the like.

The interface 15 is connected to the processor 11 and may include a means for communication between the visual object localization apparatus 10 and the outside (including an external user), or a device that performs a function corresponding to such a means. The interface 15 may include a user interface. For example, the interface 15 may include, as input devices, at least one selected from a keyboard, mouse, touch screen, touch panel, microphone, camera, and the like, and may include, as output devices, at least one selected from a speaker, lighting means, display device, and the like.

According to this embodiment, an apparatus that runs a learning algorithm incorporating privileged information can learn a reliable model even from a small amount of training sample data, and the learned model can be used to efficiently predict or localize a target object in an input image.

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the claims below.

10: Visual object localization apparatus
11: Processor
12: Memory system
13: Network interface
14: Display
15: Interface

Claims

A visual object localization method performed by an apparatus capable of digital signal processing, the method comprising:
generating a learning framework in which privileged information is combined with a structured prediction framework;
executing alternating optimization learning on the learning framework using training samples that include the privileged information;
generating a prediction model from the learning framework on which the alternating optimization learning has been executed; and
localizing an object in a specific image from an input image using the prediction model.

The method according to claim 1, wherein generating the learning framework comprises combining a first function on a first space based on the privileged information with a second function on a second space based on the training samples, and wherein combining the first function and the second function comprises connecting a space containing the images of the training samples and the attributes of the objects in the images to a space of bounding box coordinates.

The method according to claim 1, wherein the structured prediction framework comprises a structured SVM (Structured Support Vector Machine) classifier.

The method according to claim 1, wherein executing the alternating optimization learning comprises processing the term of the objective function that corresponds to the privileged information through alternating loss-augmented inference.

The method of claim 4, wherein executing the alternating optimization learning comprises alternately performing, through the alternating loss-augmented inference, Efficient Subwindow Search (ESS) in a second space based on the original images of the training samples and in a first space based on the privileged information.

The method of claim 4, wherein executing the alternating optimization learning further comprises estimating bounding box coordinates of the object by extracting all possible bounding boxes from a target image of the training samples through the alternating loss-augmented inference.

The method of claim 6, wherein executing the alternating optimization learning comprises generating, through the alternating loss-augmented inference, a joint feature map that connects the bounding box coordinates so as to relate input and output variables.

The method according to claim 1, wherein predicting or localizing the object comprises finding an optimal bounding box given the learned weight vector of the prediction model and image features in a specific image from the input image.

The method according to claim 1, further comprising verifying the learning framework based on actual image information, after generating the learning framework and before generating the prediction model.

A computer-readable medium having recorded thereon a program for performing the visual object localization method according to any one of claims 1 to 9.

A visual object localization apparatus comprising:
a framework generation unit that generates a learning framework in which privileged information is combined with a structured prediction framework;
a learning unit that executes alternating optimization learning on the learning framework using training samples that include the privileged information;
a model generation unit that generates a prediction model from the learning framework on which the alternating optimization learning has been executed; and
a search unit that localizes an object in a specific image from an input image using the prediction model.

The apparatus of claim 11, wherein the framework generation unit combines a first function on a first space based on the privileged information with a second function on a second space based on the training samples, and wherein combining the first function and the second function comprises connecting a space containing the images and attributes of the training samples to a space of bounding box coordinates.

The apparatus of claim 11, wherein the structured prediction framework comprises a structured SVM (Structured Support Vector Machine) classifier.

The apparatus of claim 11, wherein the learning unit processes the term of the objective function that corresponds to the privileged information through alternating loss-augmented inference.

The apparatus of claim 14, wherein the learning unit includes a first learning unit that, through the alternating loss-augmented inference, alternately performs Efficient Subwindow Search (ESS) in a first space based on the privileged information and in a second space based on the training samples.

The apparatus of claim 14, wherein the learning unit further includes a second learning unit that extracts all possible bounding boxes from a target image of the training samples through the alternating loss-augmented inference and estimates bounding box coordinates of the object.

The apparatus of claim 16, wherein the learning unit further includes a third learning unit that generates, through the alternating loss-augmented inference, a joint feature map that connects the bounding box coordinates so as to relate input and output variables.

The apparatus of claim 11, wherein the search unit finds an optimal bounding box given the learned weight vector of the prediction model and image features in a specific image from the input image.

The apparatus of claim 11, further comprising a verification unit that is coupled to the framework generation unit, or disposed between the framework generation unit and the model generation unit, and that verifies the learning framework based on actual image information containing a specific object.

The apparatus of claim 11, comprising:
a memory system that stores programs or instructions for the operation of the framework generation unit, the learning unit, the model generation unit, the search unit, or a combination thereof; and
a processor that is connected to the memory system and executes the programs or instructions to localize a pre-designated object in the input image.