KR20010041607A

KR20010041607A - Method and system for generating semantic visual templates for image and video retrieval

Info

Publication number: KR20010041607A
Application number: KR1020007009804A
Authority: KR
Inventors: Shih-Fu Chang; William Chen; Hari Sundaram
Original assignee: The Trustees of Columbia University in the City of New York
Priority date: 1998-03-04
Filing date: 1999-03-04
Publication date: 2001-05-25
Also published as: WO1999045483A9; JP2002506255A; EP1066572A1; WO1999045483A1; CA2322448A1

Abstract

For retrieving images and videos from a database, a semantic visual template (SVT) is a set of icons made up of example scenes or objects that characterize a concept such as skiing or a sunset. Semantic visual templates are built through two-way interaction between the user and the system. The user provides an initial sketch or example image as a seed, from which the system can automatically generate other representations of the same concept. The user then selects the views considered suitable for representing the concept. Once a semantic visual template is established, the database can be searched with it, and the user can provide relevance feedback on the returned results. With established semantic visual templates, the user can interact with the system at the concept level. Existing semantic visual templates can be used when forming new concepts.

For queries to the system, a limited vocabulary of words associated with the semantic visual templates can be parsed component by component.

Description

Method and system for generating semantic visual templates for image and video retrieval

As images and videos are increasingly created, disseminated, and stored in digital form, tools and systems for searching visual information have become important. However, while effective "search engines" have become widely available for text data, comparable tools for searching images and video data remain elusive.

Keyword techniques are commonly used to index and retrieve images in available image and video databases. Known retrieval systems of this type have many shortcomings: text information is often not associated with the images; captions must be entered by hand, which is time-consuming and subjective; and textual annotations are incidental and fail to capture the inherent visual characteristics of a scene. For example, a textual description such as "a person is leaning against a brick wall" conveys very little visual information about either the person or the brick wall. Such visual information is often essential for retrieving a particular video.

More recently, developers have begun building new forms of image and video repository retrieval based on visual attributes such as color, texture, shape, and spatial layout, and on the temporal relationships among the objects that make up a video. In these approaches, a query is typically specified by an example image or a visual sketch.

The present invention relates to database retrieval of still images, video, and audio and, more particularly, to techniques that facilitate access to database items.

FIG. 1 is a block diagram of an interactive technique for generating a library or collection of semantic visual templates according to a preferred embodiment of the present invention;

FIG. 2 is a schematic diagram of a concept with its necessary and sufficient conditions;

FIG. 3 is a schematic diagram illustrating query generation;

FIG. 4 is a block diagram of a semantic visual template system according to another embodiment of the present invention, including audio processing;

FIG. 5 shows a set of icons exemplifying the concept "high jump";

FIG. 6 shows a set of icons exemplifying the concept "sunset"; and

FIG. 7 shows a set of icons exemplifying the concept "slalom".

To make it easy to retrieve images and videos from a database, the database can be indexed using a collection of visual templates. Preferably, in accordance with an aspect of the present invention, the visual templates represent semantic concepts or categories, such as skiing or a sunset. An architecture based on semantic visual templates (SVTs) is created, in which each semantic visual template stands for a concept and consists of the queries that describe that concept well.

Semantic visual templates can be established through an interactive process between the user and the system. The user provides an initial sketch or example image as a seed, from which the system can automatically generate other representations of the same concept. The user then selects the views considered suitable for representing the concept. Once a semantic visual template is established, the database can be searched with it, and the user can provide relevance feedback on the returned results. With established semantic visual templates, the user can interact with the system at the concept level. Existing semantic visual templates can be used when forming new concepts.

Further provided is a technique for parsing, component by component, a limited vocabulary of words associated with the semantic visual templates, for use in querying the system.

The present invention incorporates by reference U.S. Provisional Patent Application No. 60/045,637 (filed May 5, 1997) and International Patent Application No. PCT/US98/09124 (filed May 5, 1998; designated countries: Canada, Japan, Republic of Korea, United States). These applications describe an object-based spatio-temporal visual search technique that uses visual templates to retrieve images and videos in different categories. That search technique, called VideoQ, can be used together with the present invention.

VideoQ can be used to determine the beginning and end of scenes in a video stream. It can further compensate for camera motion in order to extract objects from a scene, and it can characterize each object by salient attributes such as color, texture, size, shape, and motion. The resulting video object database consists of all the objects extracted from the scenes, together with their attributes.

Visual Template

A visual template represents an idea in the form of a sketch or an animated sketch. Because a single visual template is rarely enough to represent a class of interest, a library of visual templates can be assembled that contains representative templates for the different semantic classes. For example, to search for video clips of the sunset class, the user can select one or more visual templates corresponding to that class and use similarity-based querying to find video clips of sunsets.

A key advantage of a visual template library is that it links low-level visual feature representations to high-level semantic concepts. For example, if a user enters a query in a constrained natural language form, as described in the above-referenced patent applications, visual templates can be used to convert the natural language query into automated queries specified by visual attributes and constraints. When the visual content in a repository or database is not indexed textually, customary textual search methods cannot be applied directly.

Semantic Visual Template (SVT)

A semantic visual template is a set of visual templates associated with a particular semantic. The notion of a semantic visual template (SVT) has the following key properties:

Semantic visual templates are general in nature. For a given concept, there should be a set of visual templates that covers the concept well. Examples of successful semantic visual templates include sunset, high jump, and downhill skiing.

A semantic visual template for a concept should be small, yet it should cover a large share of the relevant images and videos in the collection in order to achieve high precision-recall performance.

The process for finding semantic visual templates for different concepts is systematic, efficient, and robust. Efficiency refers to convergence to a small set of visual templates. Robustness is demonstrated by applying a new template library to new image and video collections.

In connection with VideoQ, a semantic visual template can also be understood as a set of icons or example scenes/objects that represent the semantic associated with it. Feature vectors for querying can be extracted from a semantic visual template. The icons are animated sketches. In VideoQ, the features associated with each object and the spatio-temporal relationships among the objects are what matter. Histograms, texture, and structural information are examples of global features that can be part of such a template. The choice between an icon-based realization and a set of feature vectors formed from global features depends on the semantic to be represented. For example, a sunset scene may be adequately represented by a pair of objects, whereas a waterfall or a crowd is better represented by a global feature set. Each template therefore contains multiple icons, such as example scenes or objects, that represent one semantic. The elements of the set may overlap in their coverage. It is desirable to maximize coverage with a minimal template set.

Each icon for a concept (e.g., downhill skiing, a sunset, a beach crowd) is a visual representation made up of graphic objects that resemble the real objects in a scene. Each object is associated with a set of visual attributes such as color, shape, texture, and motion. The relevance of each attribute and each object to the concept is also specified. For "sunset", for example, the color and spatial structure of objects such as the sun and the sky are the more relevant attributes. In a sunset scene the sun object may be optional, because there can be sunset videos in which the sun is not visible. For the concept "high jump", the motion attribute of the foreground object is mandatory and the texture attribute is non-mandatory, and both are more relevant than the remaining attributes. Some concepts may need only a single object to represent the global attributes of a scene.

FIG. 5 shows several possible icons for "high jump", and FIG. 6 for "sunset". The optimal set of icons can be selected on the basis of relevance feedback and maximum coverage in terms of recall, as described in more detail below.

The present invention provides an efficient technique for generating semantic visual templates for a variety of concepts. Each semantic concept can have a few representative visual templates that retrieve a significant portion of the relevant images and videos in the collection, that is, with positive coverage or high recall. The positive coverage sets of different visual templates may overlap. The aim is therefore to find a small set of visual templates whose positive coverage is large and whose overlap is minimal.

Users can provide the initial conditions for effective visual templates. For example, a user may use a yellow circle (foreground) and a light-red rectangle (background) as an initial template for retrieving sunset scenes. By answering an interactive questionnaire, users can also indicate context-dependent necessary conditions, as well as the weights and relevance of the different objects and attributes. The questionnaire is sensitive to the current query that the user has sketched, e.g., on a sketchpad.

Given the initial visual template and the relevance of all visual attributes in the template, the search system returns the set of most similar images/videos to the user. The user can then give a subjective evaluation of the returned results. The precision of the results and their positive coverage, i.e., recall, can then be computed.
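The precision and recall mentioned above can be computed from the user's relevance labels. The following is a minimal sketch under stated assumptions: shots are identified by hypothetical string ids, and `precision_recall` is an illustrative helper name, not something defined in the patent.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned: ids of the shots the system retrieved
    relevant: ids of the shots the user regards as matching the concept
    """
    returned = set(returned)
    relevant = set(relevant)
    hits = len(returned & relevant)
    # Precision: fraction of the returned shots that are relevant.
    precision = hits / len(returned) if returned else 0.0
    # Recall (positive coverage): fraction of the relevant shots returned.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four shots returned, three relevant shots in total.
p, r = precision_recall(["s1", "s2", "s3", "s4"], {"s2", "s4", "s9"})
print(p, r)
```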

The system can determine an optimal strategy for altering the initial visual query and generate modified queries based on:

1. The relevance factor of each visual attribute, obtained from the user's questionnaire.

2. The precision-recall performance of the previous query.

3. Information about the feature-level distribution of the images and videos in the collection.

A technique embodying these features is illustrated conceptually in FIG. 1, which shows a query specified for the concept "high jump". The query includes three objects: two stationary rectangular background regions and one object moving toward the lower right. For each object in the query, four attributes (color, texture, shape, and size) are specified together with their weights, shown as vertical bars in FIG. 1. New queries can be formed by stepping at least one of the attributes, and at each step the user can be consulted on whether to include the result as an icon in the template. Once a suitable number of icons has been collected in a tentative template, the template can be used to search the database. The search results can be evaluated for recall and precision. Acceptable templates can be stored as the semantic visual template for "high jump".

Template Metric

The basic unit of video data is called a video shot, which contains a number of segmented video objects. The lifetime of a given video object may be equal to or less than the duration of the video shot. The similarity measure D between a member of the semantic visual template (SVT) set and a video shot can be defined as follows.

$$D = \min\left\{\omega_r \sum_{i} d_f(O_i, O'_i) + \omega_s\, d_s\right\} \qquad (1)$$

Here, O_i are the objects specified in the template, O'_i are the objects matched to O_i, d_f is the feature distance between its arguments, d_s is the dissimilarity between the spatio-temporal structure in the template and that among the matched objects in the video shot, and ω_r and ω_s are the normalized weights for the feature distance and the structural dissimilarity. The query procedure generates a candidate list for each object in the query. The distance D is then the minimum over all possible sets of matched objects that satisfy the spatio-temporal constraints. For example, if the semantic template has three objects and two candidate objects are retained for each single-object query, there are at most eight (2 × 2 × 2) potential candidate sets to consider when computing the minimum distance in Equation 1.
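The minimization in Equation 1 can be sketched as an exhaustive search over one candidate per template object. This is only an illustration: objects are reduced to single numbers, and the distance functions `d_f` and `d_s`, the default weights, and the name `template_distance` are all assumptions made for the example rather than definitions from the patent.

```python
from itertools import product


def template_distance(template_objs, candidates, d_f, d_s, w_r=0.5, w_s=0.5):
    """Evaluate Equation 1 by brute force.

    template_objs: the objects O_i of the template
    candidates:    one list of candidate matches per template object
    d_f(o, m):     feature distance between an object and its match
    d_s(objs, ms): structural dissimilarity of a complete matching
    Returns (D, best_matching).
    """
    best = None
    for matched in product(*candidates):
        feature_term = sum(d_f(o, m) for o, m in zip(template_objs, matched))
        total = w_r * feature_term + w_s * d_s(template_objs, matched)
        if best is None or total < best[0]:
            best = (total, matched)
    return best


# Toy data: each object is a single feature value in [0, 1] (an assumption).
template = [0.1, 0.5, 0.9]
candidates = [[0.2, 0.4], [0.5, 0.9], [0.6, 1.0]]  # two candidates per object


def feature_dist(o, m):
    return abs(o - m)


def order_dissimilarity(objs, matched):
    # Stand-in for d_s: penalize matchings that break the left-to-right order.
    return 0.0 if list(matched) == sorted(matched) else 1.0


D, matching = template_distance(template, candidates,
                                feature_dist, order_dissimilarity)
print(len(list(product(*candidates))), "candidate sets;", D, matching)
```

With three template objects and two candidates each, the loop visits the eight candidate sets mentioned in the text.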

Given N objects in a query, this would appear to require searching over all sets of N objects that can appear together in a video shot. For computational economy, however, the following less costly process can be adopted:

1. Each video object O_i is used to query the entire object database, producing a list of matched objects that can be kept short by applying a threshold. Only the objects in this list are then considered as candidates for matching O_i.

2. The candidate objects in the lists are then joined to form the final sets of matched objects, on which the spatio-temporal structure can be verified.
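The two steps above can be sketched as follows. The record layout (`id`, `shot`, `feature`), the scalar feature, and the rule that joined objects must be distinct and belong to the same shot are assumptions introduced for the illustration; the text only states that the lists are thresholded and then joined.

```python
from itertools import product


def candidate_list(query_feature, database, threshold):
    """Step 1: keep the database objects whose feature distance to the
    query object is within the threshold."""
    return [obj for obj in database
            if abs(obj["feature"] - query_feature) <= threshold]


def join_candidates(lists):
    """Step 2: join the per-object lists, keeping only combinations of
    distinct objects that occur in the same video shot."""
    joined = []
    for combo in product(*lists):
        same_shot = len({obj["shot"] for obj in combo}) == 1
        distinct = len({obj["id"] for obj in combo}) == len(combo)
        if same_shot and distinct:
            joined.append(combo)
    return joined


# Hypothetical object database.
database = [
    {"id": "a", "shot": 1, "feature": 0.10},
    {"id": "b", "shot": 1, "feature": 0.80},
    {"id": "c", "shot": 2, "feature": 0.12},
    {"id": "d", "shot": 3, "feature": 0.79},
    {"id": "e", "shot": 3, "feature": 0.50},
]

query = [0.1, 0.8]  # one feature value per query object
lists = [candidate_list(q, database, threshold=0.05) for q in query]
matches = join_candidates(lists)
print([tuple(obj["id"] for obj in combo) for combo in matches])
```

The spatio-temporal structure would then be checked only on the surviving combinations, which is where the saving over an exhaustive search comes from.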

Template Generation

Two-way interaction between the user and the system is used to generate a template. Given an initial scenario, the technique uses relevance feedback to converge on a small set of icons that gives maximum recall. The user provides an initial query, a sketch of the concept for which a template is to be generated, consisting of objects with spatial and temporal constraints. The user also specifies whether each object is mandatory. Each object has features to which the user assigns relevance weights.

The initial query can be regarded as a point in a high-dimensional feature space into which all the videos in the database can be mapped. To generate a set of test icons automatically, jumps are made in each feature of each object after the space has been quantized. For quantization, the step size is determined with the help of the weight that the user specified with the initial query. The weight can be regarded as a measure of the degree of relevance the user attributes to a feature of an object. Accordingly, a low weight results in coarse quantization, and vice versa. For example:

$$\Delta(\omega) = \frac{1}{a\,\omega + b}$$

Here, Δ is the jump distance for a feature, ω is the weight associated with the feature, and a and b are parameters chosen such that Δ(0) = 1 and Δ(1) = d₀, where d₀ is a system parameter related to thresholding (set to 0.2 in a prototype system). Using the jump distance, the feature space is quantized into hyper-rectangles. For color, for example, cuboids can be generated using the metric of the LUV space together with Δ(ω).
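The two boundary conditions fix the parameters: Δ(0) = 1 gives b = 1, and Δ(1) = d₀ gives a = 1/d₀ − 1, so a = 4 for d₀ = 0.2. A small sketch of the resulting jump distance follows; the function names are illustrative only.

```python
def jump_parameters(d0=0.2):
    """Solve Delta(0) = 1 and Delta(1) = d0 for the parameters a and b."""
    b = 1.0                # from Delta(0) = 1 / b = 1
    a = 1.0 / d0 - 1.0     # from Delta(1) = 1 / (a + b) = d0
    return a, b


def jump_distance(weight, d0=0.2):
    """Jump distance Delta(w) = 1 / (a * w + b) for a weight in [0, 1]."""
    a, b = jump_parameters(d0)
    return 1.0 / (a * weight + b)


# A low weight gives a large jump (coarse quantization), a high weight a
# small jump (fine quantization).
for w in (0.0, 0.5, 1.0):
    print(w, jump_distance(w))
```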

To keep the total number of possible icons from growing explosively, joint variation of features is not allowed. For example:

1. For each feature of an object, the user selects a plausible set of values for that feature.

2. The system then performs a join of the feature sets associated with the object.

3. The user selects the joins that best represent variations of the object, producing a list of candidate icons.
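The three steps above might be sketched as follows, with features normalized to [0, 1] and the user's choices replaced by simple stand-in filters. The helper names, the normalization, and the example step sizes are assumptions for illustration and do not come from the patent.

```python
import math
from itertools import product


def vary_feature(value, step, lo=0.0, hi=1.0):
    """Values reachable from `value` by whole jumps of size `step`
    while staying inside [lo, hi]. Only this one feature is varied."""
    k_min = math.ceil((lo - value) / step)
    k_max = math.floor((hi - value) / step)
    return [round(value + k * step, 6) for k in range(k_min, k_max + 1)]


def join_feature_sets(selected):
    """Step 2: join the per-feature sets the user kept in step 1."""
    names = list(selected)
    return [dict(zip(names, combo))
            for combo in product(*(selected[n] for n in names))]


# One object of the initial query and the jump size of each feature.
initial = {"color": 0.5, "size": 0.5}
steps = {"color": 0.25, "size": 0.5}

# Each feature is varied on its own, never jointly with another one.
variations = {name: vary_feature(initial[name], steps[name]) for name in initial}

# Step 1 (stand-in for the user): keep the plausible values per feature.
selected = {
    "color": [v for v in variations["color"] if v >= 0.5],
    "size": variations["size"],
}

# Step 2: the system joins the selected sets.
joins = join_feature_sets(selected)

# Step 3 would let the user pick the joins that become candidate icons.
print(len(joins), "joined variations offered to the user")
```

Because the user prunes each feature before the join, the join is taken over the reduced sets rather than over every quantized value of every feature.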

For multiple objects, an additional join over the candidate lists of the individual objects can be included in step 2. Once a list of plausible scenarios has been generated, the system is queried using the icons the user has selected. Relevance feedback on the returned results, with the user labeling each result as positive or negative, is used to determine the icons that yield maximum recall.

Concept Covers

When a user searches a database, a single concept can have many plausible "covers". Each cover may lie in a different feature space. For example, "sunset" can be described at the global level as well as at the object level. A global-level description can take the form of a color or texture histogram. An object-level description can be a collection of two objects, such as the sky and the sun. These objects can in turn be quantified using feature-level descriptors.

As shown in FIG. 2, a concept (e.g., sunset) has two different kinds of conditions: necessary conditions (N) and sufficient conditions (S). A semantic visual template is a sufficient condition for a concept, not a necessary one, so a particular semantic visual template need not cover the concept completely. Additional templates can be generated manually, i.e., by the user entering additional queries. This task is carried out for each concept. Necessary conditions can be imposed on a concept, so that additional templates are generated automatically from a given initial query template.

The user interacts with the system through a "concept questionnaire" to specify the necessary conditions for the semantic being searched. These conditions can also be global, such as the global color distribution or relative spatial and temporal relationships. Once the necessary and sufficient conditions for the concept are established, the system moves through feature space to generate additional templates, using the user's original template as the starting point. This generation is also modified by the relevance feedback that the user gives to the system. By analyzing the relevance feedback, new rules pertaining to the necessary conditions can be determined. These can in turn be used to modify the template generation procedure. The rules are generated by examining, across the videos the user has marked as relevant, the correlation among the conditions deemed necessary for the concept. The principles for determining such rules (or implications) are similar to techniques found in "data mining". They are described in the above-referenced patent applications; in S. Brin et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data", ACM SIGMOD Conference on Management of Data, 1997, pp. 225-246; and in S. Brin et al., "Beyond Market Baskets: Generalizing Association Rules to Correlations", ACM SIGMOD Conference on Management of Data, 1997, pp. 265-276.

Rule Generation Example

In VideoQ, a query for a "crowd of people" takes the form of a sketch. The user has specified the visual query by assigning weights to the color and size of the objects, but is unable to give a more detailed description in terms of the texture (of the crowd) or the relative spatial and temporal movement that characterize the concept of a crowd of people. However, because the user feels that the idea of a "crowd" is strongly characterized by texture and by the relative spatial and temporal arrangement of people, the user can list these as necessary conditions.

Through the feedback process, the system identifies the video clips that are relevant to the concept the user is interested in. Since the system now knows that texture and the spatio-temporal arrangement are necessary to the concept, it seeks to determine, across the relevant videos, consistent patterns among the features deemed necessary. These patterns are returned to the user, who is asked whether they are consistent with the concept being searched for. If the user accepts the patterns as consistent with the concept, they are used to generate new query templates, as shown in FIG. 3. Including such new rules has a two-fold effect on query template generation: it improves the speed of the search and increases the precision of the returned results.

Generating Concept Covers

The query defines the feature space in which the search is performed. The feature space is defined by the attributes and relevance weights of the visual template. In particular, the attributes define the axes of the feature space, and the relevance weights stretch or compress the associated axes. Within this composite feature space, each video shot can be represented as a point. A visual template covers a portion of this space. Because visual templates can differ in their features and character (global versus object level), the spaces defined by the templates differ from one another and do not overlap.
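As an illustration of how relevance weights stretch or compress the axes, the following minimal sketch uses a weighted Euclidean distance. The patent text does not fix a distance function, so this form, the feature names (`hue`, `size`), and all numeric values are assumptions chosen only for the example.

```python
import math


def weighted_distance(query, shot, weights):
    """Weighted Euclidean distance between two points in feature space.

    Each weight scales one axis: a large weight makes differences along
    that feature count more, a small weight makes them count less.
    (Assumed form; the source does not specify the metric.)
    """
    return math.sqrt(
        sum(weights[name] * (query[name] - shot[name]) ** 2 for name in weights)
    )


# Hypothetical feature values for one query and one video shot.
query = {"hue": 0.10, "size": 0.50}
shot = {"hue": 0.20, "size": 0.90}

# The same pair of points is near or far depending on the weights.
color_focused = weighted_distance(query, shot, {"hue": 1.0, "size": 0.1})
size_focused = weighted_distance(query, shot, {"hue": 0.1, "size": 1.0})
print(color_focused, size_focused)
```

The shot differs from the query mostly in size, so it appears closer when size is given a low weight than when size is given a high weight.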

Selecting only a few features may not be sufficient to determine a concept, but the concept may be adequately represented by, for example, a suitable choice of features with different weights. In this way, a concept can be mapped to a feature space.

A concept is not limited to a single feature space or to a single cluster. For example, for the class of sunset video sequences, sunsets cannot be fully characterized by a single color or a single shape. It is therefore important to determine not only the global static features and weights associated with a concept, but also the features and weights that can vary.

The search for a concept begins by specifying global constants. Through a context questionnaire, the number of objects in the search and the global features required for each object are determined. These represent constraints on the search process that do not change.

The user provides an initial query that specifies the features and sets the weights. A set intersection is taken with the set of necessary conditions defined by the user. The necessary conditions are left unchanged. Changes are made to the template based on changes to the features deemed sufficient. If the sets do not intersect, rules are derived that characterize the concept based on the necessary conditions and the relevance feedback.

The relevance weight of each feature indicates the tolerance the user desires along that feature. This tolerance is mapped to a distance threshold along each feature, for example d(ω) = 1/(a·ω + c), which defines a hyper-ellipsoid in the feature space being searched. The threshold determines the number of possible non-overlapping covers. The number of covers determines the size and number of jumps possible along a particular feature. The algorithm performs a breadth-first search and is guided by the following three criteria.
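The mapping from weight to threshold can be sketched as follows. The formula d(ω) = 1/(a·ω + c) is taken from the text; the constants `a` and `c`, the feature names, and the reading of each threshold as a semi-axis of the hyper-ellipsoid are assumptions made for illustration.

```python
def distance_threshold(weight, a=1.0, c=0.1):
    """Distance threshold d(w) = 1 / (a*w + c) for one feature.

    A higher relevance weight gives a smaller threshold (tighter tolerance).
    The default values of a and c are illustrative assumptions.
    """
    denominator = a * weight + c
    if denominator <= 0:
        raise ValueError("a*weight + c must be positive")
    return 1.0 / denominator


def inside_cover(center, point, weights, a=1.0, c=0.1):
    """Return True if `point` lies inside the hyper-ellipsoid around `center`.

    Assumption: each per-feature threshold is used as one semi-axis.
    """
    total = 0.0
    for name, weight in weights.items():
        semi_axis = distance_threshold(weight, a, c)
        total += ((point[name] - center[name]) / semi_axis) ** 2
    return total <= 1.0


# Hypothetical template: color matters a lot, size not at all.
weights = {"hue": 1.0, "size": 0.0}
center = {"hue": 0.5, "size": 0.5}
near = {"hue": 0.6, "size": 0.5}
far = {"hue": 1.5, "size": 0.5}
print(inside_cover(center, near, weights), inside_cover(center, far, weights))
```

Because a heavily weighted feature has a small threshold, fewer shots fall inside the cover along that axis, which is consistent with the statement that the threshold governs how many non-overlapping covers fit along a feature.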

First, a greedy algorithm that moves in the direction of increasing recall:

Compute all possible initial jumps.

Convert each jump into the corresponding visual template.

Execute the queries and collate all the results.

Show the results to the user for relevance feedback, and select the results that maximize incremental recall as possible points for the subsequent queries.

Second, a logarithmic search, in which each subsequent query takes a smaller jump within the local region. This is based on the premise that the current query point is yielding good results and that its neighborhood should be searched closely for additional templates. The search stops once enough visual templates have been generated to cover at least 70% of the concept (i.e., a recall of 70% or more).

Third, because a breadth-first search often yields too many possibilities to examine at once, feature-level distributions are used to guide the search. The distribution along each feature is precomputed. This information is used to select jumps into regions where video shots are concentrated and to avoid regions where they are sparse.
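The greedy criterion and the 70% stopping rule can be sketched as a standard greedy set-cover loop. This is only an approximation of the procedure described above: it omits the logarithmic jump sizes and the density guidance, and it replaces the interactive relevance feedback with a precomputed set of relevant shots. The names `candidates`, `relevant`, and the icon labels are hypothetical.

```python
def build_concept_cover(candidates, relevant, target_recall=0.70):
    """Greedily pick templates that add the most new relevant shots.

    candidates: dict mapping a template name to the set of shot ids it returns.
    relevant:   set of shot ids judged relevant (stand-in for user feedback).
    Stops once recall reaches target_recall or no template adds anything.
    """
    if not relevant:
        raise ValueError("relevant set must not be empty")
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(covered) / len(relevant) < target_recall:
        # Incremental recall of a template = newly covered relevant shots.
        best = max(
            remaining,
            key=lambda name: len((remaining[name] & relevant) - covered),
        )
        gain = (remaining.pop(best) & relevant) - covered
        if not gain:
            break  # no remaining template improves recall
        covered |= gain
        chosen.append(best)
    return chosen, len(covered) / len(relevant)


# Hypothetical data: ten relevant shots and four candidate templates.
relevant = set(range(10))
candidates = {
    "icon_a": {0, 1, 2, 3},
    "icon_b": {2, 3, 4},
    "icon_c": {5, 6, 7, 20},
    "icon_d": {8},
}
chosen, achieved = build_concept_cover(candidates, relevant)
print(chosen, achieved)
```

In this example the loop stops after two templates, because together they already cover 7 of the 10 relevant shots.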

Language Integration with SVT

Existing text-based queries on images and video rely on matching keywords that accompany the image or video. The keywords accompanying the data can be generated manually or obtained by association; that is, the keywords are extracted from the accompanying text (in the case of an image) or from the captions that accompany the video.

This approach rules out practical systems involving large video or image databases, for several reasons:

It is not feasible to generate annotations manually for existing video databases.

Most videos do not contain captions.

There may be no direct correlation between the accompanying caption and the video. For example, during a baseball game the commentator may talk about the exploits of Babe Ruth, who is not playing in the game in progress; a text-based keyword search for "Babe Ruth" would then wrongly locate and display this video.

Generating the semantic content of a video by analyzing the video stream alone amounts to the computer vision problem, which is known to be hard. A more practical approach is to use visual content, such as color and texture attributes or the motion of objects, together with the descriptive power of natural language.

When the user enters a string, the system parses it into a video model. VideoQ provides a "language" for entering a query as a sketch.

There is a simple correspondence between the elements of VideoQ and their natural language counterparts, as shown in Table 1.

TABLE 1

Attribute            NL type
Motion               Verb
Color, texture       Adjective
Shape                Noun
Spatial/temporal     Preposition/conjunction

A constrained language set may be used, with a set of allowable words. To generate a motion model of the video sequence, the sentence is parsed into classes such as nouns, verbs, adjectives and adverbs.

For example, the system can parse the phrase "Bill walked slowly towards the sunset" as shown in Table 2.

TABLE 2

Word        NL type
Bill        Noun
walked      Verb
slowly      Adverb
towards     Preposition
sunset      Noun

For verbs, adverbs, adjectives and prepositions, a small but fixed database can be used, since these are modifiers (or descriptors) of nouns (objects). The noun (i.e., scenario/object) database may initially contain a hundred or so scenes, and may be extended through interaction with the user.

Each object may have a shape description that is modified by various modifiers such as adjectives (color, texture), verbs (walked) and adverbs (slowly). These are then inserted into the VideoQ palette, where they may need further refinement.

When the parser encounters a word that is not in its modifier databases (i.e., the databases corresponding to verbs, adverbs, prepositions and adjectives, respectively), it searches a thesaurus to determine whether synonyms of that word are present in the databases, and uses them instead. If this fails, the parser returns a message indicating an invalid string.
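The lookup order described above (databases first, then thesaurus synonyms, then an invalid-string message) can be sketched as follows. The tiny dictionaries stand in for the noun database, the modifier databases and the thesaurus; their contents, the function names, and the whitespace tokenization are all illustrative assumptions.

```python
# Hypothetical stand-ins for the noun database, modifier databases, thesaurus.
NOUNS = {"bill": "noun", "sunset": "noun"}
MODIFIERS = {
    "walked": "verb",
    "slowly": "adverb",
    "towards": "preposition",
    "red": "adjective",
}
THESAURUS = {"strolled": ["walked"], "crimson": ["red"]}


def classify(word):
    """Return (word_used, class), falling back to a known synonym."""
    word = word.lower()
    for lexicon in (NOUNS, MODIFIERS):
        if word in lexicon:
            return word, lexicon[word]
    # Not in any database: try synonyms that are in the modifier databases.
    for synonym in THESAURUS.get(word, []):
        if synonym in MODIFIERS:
            return synonym, MODIFIERS[synonym]
    raise ValueError(f"invalid string: cannot classify {word!r}")


def parse(sentence):
    """Tag each whitespace-separated word with its class."""
    return [classify(word) for word in sentence.split()]


print(parse("Bill strolled slowly towards sunset"))
```

Here "strolled" is not in the verb database, so the sketch substitutes its synonym "walked"; a word with no known synonym raises the invalid-string error, at which point the user would be asked to revise the text or supply the word's class.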

When the parser encounters a word it cannot classify, the user must either modify the text or, if the word is a noun such as "Bill", indicate its class to the system (in this case, a noun); the user may additionally indicate that the word refers to a human being. If the user indicates a noun that is not in the system database, the user is then prompted to draw the object on the sketch pad so that the system can learn about it. Because attributes such as motion, color, texture and shape are generated in the database at the object level, the object level can serve as one level of matching.

As shown in FIG. 4, the audio stream accompanying a video can be used as an additional source of information. Indeed, if the audio is closely correlated with the video, it may be the single most important source of the video's semantic content. For example, a set of 10 to 20 keywords per video sequence can be generated from the audio stream. The search at the keyword level can then be combined with the search at the model level. Videos that match at the keyword (semantic) level as well as at the motion-model level can then be ranked highest.
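One simple way to combine the two levels is to rank first by how many levels a shot matches and then by model distance. The text does not specify a combination rule, so the scoring below, the `model_threshold` parameter, and the sample shots are assumptions for illustration only.

```python
def rank_shots(shots, query_keywords, model_threshold=0.5):
    """Rank shots so that those matching at both levels come first.

    shots: list of dicts with 'id', 'keywords' (set) and 'model_distance'.
    A shot matches at the model level if its distance is within the
    (assumed) threshold, and at the keyword level if it shares a keyword.
    """
    def score(shot):
        keyword_match = bool(query_keywords & shot["keywords"])
        model_match = shot["model_distance"] <= model_threshold
        # More matched levels first; ties broken by smaller model distance.
        return (keyword_match + model_match, -shot["model_distance"])

    return sorted(shots, key=score, reverse=True)


# Hypothetical shots with keywords derived from their audio streams.
shots = [
    {"id": "b", "keywords": {"beach"}, "model_distance": 0.1},
    {"id": "c", "keywords": {"ski"}, "model_distance": 0.9},
    {"id": "a", "keywords": {"ski", "snow"}, "model_distance": 0.2},
]
ranked_ids = [shot["id"] for shot in rank_shots(shots, {"ski"})]
print(ranked_ids)
```

Shot "a" matches at both the keyword and the model level and is therefore ranked first, ahead of shots that match at only one level.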

Examples

Semantic visual template for retrieving video shots of slalom skiers

1. The system asks, and the user answers, questions about the context. The semantic visual template is labeled "slalom". The query is specified as object-based and comprises two objects.

2. The user sketches the initial query. The large white background represents the ski slope, and the smaller foreground object represents the skier with its characteristic zigzag motion trajectory.

3. Maximum relevance weights are assigned to all features associated with the skier and the background.

4. The system generates a set of test icons, from which the user selects the desirable feature variations in the skier's color and motion trajectory.

5. The four selected colors and the three selected motion trajectories are combined to form 12 possible skiers. The list of skiers is combined with the single background to produce the 12 icons of FIG. 7, in which each group of three adjacent icons shares the same color.

6. The user selects a candidate set with which to query the system. The system retrieves the 20 closest video shots. The user provides relevance feedback that guides the system to a small set of exemplar icons for slalom skiers.
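Step 5 above is a Cartesian product of the selected feature values. The sketch below shows how 4 colors and 3 trajectories yield 12 candidate icons; the specific color and trajectory names and the dictionary layout are placeholders, since the source does not list them.

```python
from itertools import product

# Placeholder feature values; the source states only the counts (4 and 3).
colors = ["red", "blue", "yellow", "black"]
trajectories = ["zigzag_wide", "zigzag_narrow", "zigzag_steep"]
background = {"color": "white", "size": "large"}

# One icon per (color, trajectory) pair, all sharing the single background.
icons = [
    {"background": background, "skier": {"color": c, "trajectory": t}}
    for c, t in product(colors, trajectories)
]

print(len(icons))
```

Because `product` varies the last argument fastest, every run of three consecutive icons shares one color, which matches the grouping described for FIG. 7.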

Sunsets. A database containing 72 sunsets among more than 1952 video shots was used. Using only the initial sketch, without semantic visual templates, a recall of 10% and a precision of 35% were achieved. Using semantic visual templates, 8 icons were generated and 36 sunsets were found, for a recall of 50% and a precision of 24%.

High Jumpers. The database contains 9 high jumpers among 2589 video shots. Without semantic visual templates, a recall of 44% and a precision of 20% were achieved. With semantic visual templates, recall improved to 56% and precision to 25%. The system converged to a single icon that differed from the initial sketch provided by the user.
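The reported figures can be checked with the usual definitions of recall (relevant shots found divided by relevant shots in the database) and precision (relevant shots found divided by shots returned). In the sketch below, only the counts 36, 72 and 9 and the percentages come from the text; the high-jumper hit counts of 4 and 5 and the implied number of returned shots are inferences from those percentages, not reported values.

```python
def recall(found_relevant, total_relevant):
    """Fraction of the relevant shots in the database that were retrieved."""
    return found_relevant / total_relevant


def precision(found_relevant, total_returned):
    """Fraction of the returned shots that are relevant."""
    return found_relevant / total_returned


# Sunsets with semantic visual templates: 36 of 72 sunsets found.
sunset_recall = recall(36, 72)

# Inferred, not reported: a precision of 24% with 36 hits implies
# roughly this many returned shots.
implied_returned = round(36 / 0.24)

# High jumpers: hit counts inferred from 44% and 56% recall over 9 shots.
high_jump_before = round(recall(4, 9) * 100)
high_jump_after = round(recall(5, 9) * 100)

print(sunset_recall, implied_returned, high_jump_before, high_jump_after)
```

Under these inferred counts, the templates trade some precision for a large gain in recall on sunsets, while finding one additional high jumper out of nine.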

Claims

A computer-implemented method for generating a visual template for a concept, comprising:

(a) obtaining at least one initial query for the concept;

(b) generating at least one additional query related to the initial query;

(c) checking the appropriateness of the additional query for the concept; and

(d) including the additional query in the visual template for the concept if the additional query is appropriate.

The method of claim 1, wherein each query is represented by an icon/example image.

The method of claim 1, wherein the initial query is obtained through a sketch pad.

The method of claim 1, wherein generating the additional query comprises advancing a query feature by a step size that is inversely related to the weight associated with the query feature.

The method of claim 1, wherein generating the additional icon comprises forming a combination of desirable feature values.

The method of claim 1, wherein the appropriateness is checked through two-way interaction with a user.

A computer-implemented method for querying a video database for a concept by linking a natural language subset to semantic visual templates, comprising:

(a) obtaining a text query;

(b) analyzing the query to generate image attributes;

(c) generating an image query using the image attributes;

(d) retrieving information using the image query; and

(e) displaying the information.

The method of claim 7, wherein the text query is obtained from a keyboard.

The method of claim 7, wherein the natural language subset comprises a small set of nouns, verbs, prepositions, adjectives and adverbs.

The method of claim 7, further comprising interactively expanding the subset.

The method of claim 7, wherein generating the image attributes comprises:

(i) establishing a correspondence between the query and the natural language subset;

(ii) marking different parts of the query as nouns, verbs, adjectives or prepositions; and

(iii) when a word in the query is not in the natural language subset, obtaining a description of the word and marking the word accordingly.

The method of claim 7, wherein generating the image query comprises establishing a correspondence between the natural language subset and a set of semantic visual templates generated by the method of claim 1,

wherein a semantic visual template is a visual realization of a noun in the query, an adjective is used to modify the visual realization of the noun, a verb is used to realize an action, and a preposition is used to establish the spatio-temporal ordering needed to generate the image query.

A computer system for generating a visual template for a concept, comprising:

(a) means for obtaining at least one initial query for the concept;

(b) means for generating at least one additional query related to the initial query;

(c) means for checking the appropriateness of the additional query for the concept; and

(d) means for including the additional query in the visual template for the concept if the additional query is appropriate.

The system of claim 13, wherein each query is represented by an icon/example image.

The system of claim 13, wherein the initial query is obtained through a sketch pad.

The system of claim 13, wherein the means for generating the additional query comprises means for advancing a query feature by a step size that is inversely related to the weight associated with the query feature.

The system of claim 13, wherein the means for generating the additional icon comprises means for forming a combination of desirable feature values.

The system of claim 13, wherein the appropriateness is checked through two-way interaction with a user.

A computer system for querying a video database for a concept by linking a natural language subset to semantic visual templates, comprising:

(a) means for obtaining a text query;

(b) means for analyzing the query to generate image attributes;

(c) means for generating an image query using the image attributes;

(d) means for retrieving information using the image query; and

(e) means for displaying the information.

The system of claim 19, wherein the text query is obtained from a keyboard.

The system of claim 19, further comprising means for interactively expanding the subset.

The system of claim 19, wherein the means for generating the image attributes comprises:

(i) means for establishing a correspondence between the query and the natural language subset;

(ii) means for marking different parts of the query as nouns, verbs, adjectives or prepositions; and

(iii) means for obtaining, when a word in the query is not in the natural language subset, a description of the word and for marking the word accordingly.

The system of claim 19, wherein the means for generating the image query comprises means for establishing a correspondence between the natural language subset and a set of semantic visual templates generated by the system of claim 13.