KR20010012261A

KR20010012261A - Algorithms and system for object-oriented content-based video search

Info

Publication number: KR20010012261A
Application number: KR1019997010212A
Authority: KR
Inventors: Shih-Fu Chang; William Chen; Horace J. Meng; Hari Sundaram; Di Zhong
Original assignee: The Trustees of Columbia University in the City of New York
Priority date: 1997-05-05
Filing date: 1998-05-05
Publication date: 2001-02-15
Also published as: EP1008064A4; CA2288811A1; EP1008064A1; JP2002513487A; WO1998050869A1

Abstract

An object-oriented method and system are disclosed that allow a user to locate one or more video objects in one or more video clips over an interactive network. The system includes a server computer 110 with storage 111 for the video clips and for databases of video object attributes, a communications network 120, and a client computer 130. The client computer includes a query interface for specifying video object attribute information, including motion trajectory information 134; a browser interface for browsing through the video object attributes stored in the server computer; and an interactive video player.

Description

ALGORITHMS AND SYSTEM FOR OBJECT-ORIENTED CONTENT-BASED VIDEO SEARCH

Over the last few years, as the Internet has matured and multimedia applications have come into widespread use, the amount of readily available digital video information has grown steadily. To reduce bandwidth requirements to manageable levels, such video information is generally stored in the digital environment as compressed bitstreams in a standard format, e.g., JPEG, Motion JPEG, MPEG-1, MPEG-2, MPEG-4, H.261, or H.263. At present, tens of thousands of different still and moving images, depicting everything from oceans and mountains to skiing and baseball, are available over the Internet.

As the amount of video information available in digital form grows, the need to organize and search that information meaningfully has become pressing. In particular, users increasingly demand content-based video search engines that can, in response to a user-defined query, search for and retrieve specific pieces of video information meeting predetermined criteria, such as the shape or motion characteristics of video objects embedded within stored video information.

In response to this need, there have been several attempts to develop video search and retrieval applications. Existing techniques fall into two distinct categories: query by example ("QBE") and visual sketching.

In the context of image retrieval, examples of QBE systems include QBIC, PhotoBook, VisualSEEk, Virage, and FourEyes, as discussed in T. Minka, "An Image Database Browser that Learns from User Interaction," MIT Media Laboratory Perceptual Computing Section, TR #365 (1996). These systems work under the premise that several satisfactory matches must already exist in the database. Under this premise, the search begins with an element in the database itself, and the user is guided toward the desired image through a succession of query examples. Unfortunately, such "guiding" wastes substantial time, because the user must continually refine the search.

Although space partitioning schemes that precompute hierarchical groupings can speed up the database search, such groupings are static and must be recomputed whenever a new video is inserted into the database. Likewise, although QBE can in principle be extended to video, a video shot generally contains a large number of objects, each described by a complex multi-dimensional feature vector. The complexity arises partly from the difficulty of describing shape and motion characteristics.

The second category of search and retrieval systems, sketch-based query systems, locates visual information by computing the correlation between a user-drawn sketch and the edge map of each image in the database. One such system is described in Hirata et al., "Query by Visual Example, Content Based Image Retrieval, Advances in Database Technology - EDBT," 580 Lecture Notes on Computer Science (1992, A. Pirotte et al. eds.). In A. Del Bimbo et al., "Visual Image Retrieval by Elastic Matching of User Sketches," 19 IEEE Trans. on PAMI 121-131 (1997), a technique is described that achieves a match by minimizing an energy functional. In C.E. Jacobs et al., "Fast Multiresolution Image Querying," Proc. of SIGGRAPH 277-286, Los Angeles (Aug. 1995), the authors compute the distance between the wavelet signature of the sketch and that of each image in the database.

Although there have been some attempts to index video shots, none has sought to represent a video shot as a dynamic collection of video objects. Instead, prior techniques simply apply image retrieval algorithms to index video, on the assumption that a video clip is a collection of image frames.

In particular, the techniques developed by Zhang and Smoliar, as well as those developed at QBIC, apply image retrieval methods (e.g., color histograms) to video. A "key frame" is selected from each shot, e.g., the r-frame in the QBIC method. In the case of Zhang and Smoliar, the key frame is extracted from the video clip by choosing a single frame: all frames in the shot are averaged, and the frame in the clip closest to that average is selected. The key frames are then used to index the video by conventional image searching, e.g., with color histograms.
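The averaging rule attributed above to Zhang and Smoliar can be sketched as follows. This is only an illustration under stated assumptions: each frame is represented as a flat list of numbers (pixel intensities or histogram bins), closeness is measured by squared Euclidean distance, and the name `select_key_frame` is hypothetical. None of these details come from the cited work.

```python
from typing import List, Sequence

Frame = Sequence[float]  # assumed representation: flattened pixels or a histogram


def select_key_frame(frames: List[Frame]) -> int:
    """Return the index of the frame closest to the mean of all frames in the shot."""
    if not frames:
        raise ValueError("shot contains no frames")
    n = len(frames)
    dim = len(frames[0])
    # Component-wise average of every frame in the shot.
    mean = [sum(f[i] for f in frames) / n for i in range(dim)]

    def sq_dist(f: Frame) -> float:
        # Squared Euclidean distance to the average frame (an assumed metric).
        return sum((a - b) ** 2 for a, b in zip(f, mean))

    return min(range(n), key=lambda i: sq_dist(frames[i]))
```

The returned index identifies the frame that would then be handed to an ordinary image indexer, which is why this approach treats the shot as a single still image.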

Likewise, in the QBIC project the r-frame is selected by taking an arbitrary frame, such as the first frame, as the representative frame. Where the video clip contains motion, a mosaic representation is used as the representative frame for the shot. QBIC then applies its image retrieval techniques to these r-frames in order to index the video clips.

To index video clips, the Informedia project generates a transcript of the video by running a speech recognition algorithm on the audio stream. Recognized words are aligned with the video frames in which they were spoken, and a user searches for video clips by keyword. However, speech-to-text conversion has proved to be a major stumbling block: the accuracy of the conversion algorithm is low (roughly 20-30%), which significantly degrades retrieval quality.

The prior art described above fails to meet the growing need for an effective content-based video search engine capable of searching for and retrieving specific pieces of video information that meet predetermined criteria. These techniques either cannot search motion video information at all, or can search it only in terms of global parameters such as panning or zooming. Likewise, the prior art does not describe techniques for retrieving video information based on spatial and temporal characteristics. Thus, the existing techniques cannot, in response to a user-defined query, search for and retrieve specific pieces of video information that meet predetermined criteria such as the shape or motion characteristics of video objects embedded within stored video information.

This application is related to U.S. Provisional Application No. 60/045,637, filed May 5, 1997, from which priority is claimed.

The present invention relates to techniques for searching and retrieving visual information, and more particularly to the use of content-based search queries to search for and retrieve moving visual information.

FIG. 1 is a diagram of a system for searching and retrieving video information in accordance with one aspect of the present invention.

FIG. 2 illustrates a query interface useful in the system of FIG. 1.

FIG. 3 illustrates a video object search method performed in the system of FIG. 1.

FIG. 4 is a flowchart of a method for extracting video objects from a sequence of frames of video information in accordance with one aspect of the present invention.

FIG. 5 is a flowchart of a preferred method of region projection and interframe labeling useful in the method shown in FIG. 4.

FIG. 6 is a flowchart of a preferred method of interframe region merging useful in the method shown in FIG. 4.

FIG. 7 illustrates an alternative video object search method performed in the system of FIG. 1.

One object of the present invention is to provide a truly content-based video search engine.

Another object of the present invention is to provide a search engine that can search for and retrieve video objects embedded in video information.

Another object of the present invention is to provide a mechanism for filtering identified video objects so that only the objects that best match a user's search query are retrieved.

Another object of the present invention is to provide a video search engine that can, in response to a user-defined query, search for and retrieve specific pieces of video information that meet predetermined criteria.

Another object of the present invention is to provide a search engine that can extract video objects from video information based on integrated feature characteristics of the video objects, including motion, color, and edge information.

To meet these and other objects, which will become apparent with reference to the further disclosure below, the present invention provides a system that permits a user to search for and retrieve video objects from one or more sequences of frames of video data over an interactive network. The system includes one or more server computers, which comprise storage for one or more databases of video object attributes and storage for one or more sequences of frames of video data to which the video object attributes correspond; a communications network that permits transmission of the one or more sequences of frames of video data from the server computers; and a client computer. The client computer includes a query interface for receiving selected video object attribute information, including motion trajectory information; a browser interface that receives the selected video object attribute information and browses through the video object attributes stored in the server computers via the communications network, in order to determine one or more video objects whose attributes match the selected video object attributes within a predetermined threshold; and an interactive video player that receives, from the server computers, one or more transmitted sequences of frames of video data corresponding to the determined video objects.

In a preferred arrangement, the databases stored on the server computers include a motion trajectory database, a spatio-temporal database, a shape database, a color database, and a texture database. The one or more sequences of frames of video data are stored on the server computers in a compressed format such as MPEG-1 or MPEG-2.

The system may also include a mechanism for comparing each selected video object attribute with the corresponding stored video object attributes in the server computers, so as to generate lists of candidate video sequences, one list for each video object attribute. Likewise, a mechanism is advantageously provided for determining, from the candidate lists, one or more video objects whose collective attributes match the selected video object attributes within a predetermined threshold. The system also includes a mechanism for matching the spatial and temporal relations among multiple objects in the query against those of a group of video objects in a video clip.

In accordance with a second aspect of the present invention, a method is provided for extracting video objects from a sequence of frames of video data that includes at least one recognizable attribute. The method calls for quantizing a current frame of video data by determining and assigning values to different variations of at least one attribute represented by the video data, thereby generating quantized frame information; performing edge detection on the frame of video data based on the attribute to determine edge points in the frame, thereby generating edge information; receiving one or more segmented regions of video information from a previous frame; and extracting regions of video information that share the attribute by comparing the received segmented regions with the quantized frame information and the generated edge information.

Preferably, the extracting step consists of performing interframe projection, in which one of the received regions is projected onto the current quantized, edge-detected frame so as to temporally track any movement of the region and thereby extract regions in the current frame of video data; and performing intraframe segmentation to merge neighboring extracted regions in the current frame under certain conditions. The extracting step also includes labeling all edges in the current frame that remain between neighboring regions after intraframe segmentation, so that each labeled edge defines the boundary of a video object in the current frame.

In a particularly preferred technique, a future frame of video information is also received; the optical flow of the current frame is determined by performing hierarchical block matching between blocks of video information in the current frame and blocks of video information in the future frame; and motion estimation is performed on the extracted regions of video information by way of an affine matrix based on the optical flow. The extracted regions of video information may then be grouped based on the affine model of each region as well as on size and temporal duration.
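One way to obtain an affine motion model for a region from its optical flow is an ordinary least-squares fit, sketched below. The text above states only that an affine matrix is estimated from the optical flow; the least-squares formulation, the six-parameter form u = a0 + a1·x + a2·y, v = b0 + b1·x + b2·y, and the function names are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = List[float]


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(m: List[List[float]], rhs: Vec3) -> Vec3:
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("degenerate region: points are collinear or too few")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(points: List[Tuple[float, float]],
               flows: List[Tuple[float, float]]) -> Tuple[Vec3, Vec3]:
    """Least-squares fit of (a0, a1, a2) and (b0, b1, b2) to per-pixel flow."""
    ata = [[0.0] * 3 for _ in range(3)]  # normal-equation matrix A^T A
    atu = [0.0] * 3                      # A^T u (horizontal flow)
    atv = [0.0] * 3                      # A^T v (vertical flow)
    for (x, y), (u, v) in zip(points, flows):
        row = (1.0, x, y)
        for i in range(3):
            atu[i] += row[i] * u
            atv[i] += row[i] * v
            for j in range(3):
                ata[i][j] += row[i] * row[j]
    return solve3(ata, atu), solve3(ata, atv)
```

Regions whose fitted parameters are close to one another are natural candidates for being grouped into the same video object, which matches the grouping criterion described above.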

In another aspect of the present invention, a method is provided for locating the video clip that best matches a user-entered search query from a sequence of frames of video data that includes one or more video clips containing predetermined video object trajectories. The method includes receiving a search query that defines at least one video object; determining the total distance between the received query and at least some of the one or more predefined video object trajectories; and selecting the one or more predefined video object trajectories having the least total distance from the received query, thereby locating the best-matched video clip or clips.

Both the search query and the predefined video object trajectories are normalized. The query normalizing step preferably entails mapping the received query to each normalized video clip and scaling the received, mapped query to each video object trajectory defined by the normalized video clips. The determining step is carried out by either a spatial distance comparison or a spatio-temporal distance comparison.
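A purely spatial trajectory comparison of this kind might look like the sketch below. The specifics are assumptions, not taken from the text: trajectories are lists of (x, y) points, normalization divides by the frame width and height, both trajectories are resampled to the same number of points, and the distance is the mean Euclidean distance between corresponding points. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def normalize(traj: List[Point], width: float, height: float) -> List[Point]:
    """Map pixel coordinates into the unit square so clips of any size compare."""
    return [(x / width, y / height) for x, y in traj]


def resample(traj: List[Point], n: int) -> List[Point]:
    """Linearly interpolate a trajectory to exactly n points (n >= 2)."""
    if len(traj) == 1:
        return [traj[0]] * n
    out = []
    for k in range(n):
        t = k * (len(traj) - 1) / (n - 1)
        i = min(int(t), len(traj) - 2)
        frac = t - i
        (x0, y0), (x1, y1) = traj[i], traj[i + 1]
        out.append((x0 + frac * (x1 - x0), y0 + frac * (y1 - y0)))
    return out


def spatial_distance(a: List[Point], b: List[Point], n: int = 16) -> float:
    """Mean point-to-point distance after resampling; timing is ignored."""
    ra, rb = resample(a, n), resample(b, n)
    return sum(math.dist(p, q) for p, q in zip(ra, rb)) / n


def best_match(query: List[Point], stored: Dict[str, List[Point]]) -> str:
    """Return the key of the stored trajectory with the least distance to the query."""
    return min(stored, key=lambda name: spatial_distance(query, stored[name]))
```

Because resampling discards how fast the object moves along the path, this corresponds to the spatial comparison only; a spatio-temporal comparison would instead pair points that occur at the same normalized time.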

In yet another aspect of the present invention, a method is provided for locating the video clip that best matches a user-entered search query from one or more video clips, each made up of one or more video objects having predetermined characteristics. The method includes receiving a search query that defines one or more characteristics for one or more different video objects in a video clip; searching the video clips to locate video objects that match at least one of the defined characteristics to within a predetermined threshold; determining, from the located video objects, the video clips that contain the one or more different video objects; and determining the best-matched video clip from the determined video clips by calculating the distance between the located video objects and the one or more video objects defined by the search query. The characteristics may include color, texture, motion, size, or shape.

In a highly preferred arrangement, the video clips include associated text information, the search query further includes a definition of text characteristics corresponding to the one or more different video objects, and the method further includes searching the associated text information to locate text that matches the text characteristics. The best-matched video clip is then determined from the determined video clips and the located text.

Referring to FIG. 1, an exemplary embodiment is provided of a system that, in response to a user-defined query, searches for and retrieves specific pieces of video information meeting predetermined criteria, such as the shape or motion characteristics of video objects embedded within stored video information. The architecture of the system 100 falls broadly into three components: the server computer 110, the communications network 120, and the client computer 130.

The server computer 110 includes a database 111 that stores metadata for video objects and visual features, and a storage subsystem 112 that stores the original audio-visual information together with any text information associated with the extracted video objects and visual features. The communications network 120 may be based on the Internet or on a broadband network. Thus, although shown in FIG. 1 as a single computer, the server computer 110 may be a number of computers spread across the World Wide Web, all of which can communicate with the client computer 130 via the communications network 120.

The client computer 130 includes a keyboard 131, a mouse 132, and a monitor 133, which together form both a query interface and a browser interface that allow a user to enter search queries into the computer 130 and to browse the network 120 for audio-visual information. Although not shown in FIG. 1, other query input devices, such as light pens and touch screens, could also be readily incorporated into the client computer 130. The monitor 133 is used to display the visual information retrieved from the server computer 110 via the network 120, as well as to depict the search queries entered by the user of the computer 130. Such information is preferably retrieved in compressed form, e.g., as an MPEG-2 bitstream, and the computer 130 includes suitable commercially available hardware or software, e.g., an MPEG-2 decoder, to decompress the retrieved information into a displayable format.

Using the keyboard 131, the mouse 132, and the like, a user may enter a search query on the computer 130 that specifies one or more searchable attributes of one or more video objects embedded in a clip of video information. Thus, for example, if the user wishes to search for a video clip containing a baseball that has traveled along a certain trajectory, the user may sketch the motion 134 of the object to be included in the query and select additional searchable attributes such as size, shape, color, and texture. An exemplary query interface is depicted in FIG. 2.

As used herein, a "video clip" refers to a sequence of frames of video information containing one or more video objects with identifiable attributes, such as, by way of example and not limitation, a baseball player swinging a bat, a surfboard moving across the ocean, or a horse running across a prairie. A "video object" is a contiguous set of pixels that is homogeneous in one or more features of interest, e.g., texture, color, motion, or shape. Thus, a video object is formed by one or more video regions that exhibit consistency in at least one feature. For example, a shot of a person walking (the person being the "object" here) would be segmented into a collection of adjoining regions that differ in criteria such as shape, color, and texture, but all of the regions would exhibit consistency in their motion attribute.

Referring to FIG. 3, a search query 300 may include color 301, texture 302, motion 303, shape 304, size 305, and other attributes of the desired video objects, such as global parameters like pan and zoom. Various weights indicating the relative importance of each attribute may also be incorporated into the search query 306. Upon receiving the search query, the browser in the computer 130 searches, via the network 120, for similar attributes stored in the database 111 of the server computer 110. The server 110 contains several feature databases, one for each individual feature on which the system indexes, e.g., a color database 311, a texture database 312, a motion database 313, a shape database 314, and a size database 315. Each database is stored in storage 112 as a compressed MPEG bitstream. Of course, other compression formats or compressed data may be used.

At the server, each queried attribute is compared with the stored attributes, as described in detail below. Thus, the queried color 301 is matched against the color database 311, and likewise for texture 322, motion 323, shape 324, size 325, and any other attributes. A list of candidate video shots is generated for each object specified in the query, e.g., a color object list 331, a texture object list 332, a motion object list 333, a shape object list 334, and a size object list 335. In the server computer 110, each list is cut off at a preselected rank threshold or at a feature distance threshold, so that only the most likely candidate shots survive.
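Generating one thresholded candidate list per attribute could be sketched as follows. This is an illustration only: the per-attribute feature vectors, the use of Euclidean distance, and the name `candidate_list` are assumptions, since the text does not state which distance measure each feature database uses.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def candidate_list(
    query_feature: Sequence[float],
    database: Dict[str, Sequence[float]],
    max_distance: Optional[float] = None,
    max_rank: Optional[int] = None,
) -> List[Tuple[str, float]]:
    """Rank stored objects by distance to the query and keep the likely ones.

    `database` maps an object id to its stored feature vector for ONE
    attribute (e.g. color). Either threshold may be left unset.
    """
    scored = sorted(
        (math.dist(query_feature, feat), obj_id)
        for obj_id, feat in database.items()
    )
    if max_distance is not None:
        # Feature-distance threshold: drop anything too far from the query.
        scored = [(d, oid) for d, oid in scored if d <= max_distance]
    if max_rank is not None:
        # Rank threshold: keep only the top-ranked survivors.
        scored = scored[:max_rank]
    return [(oid, d) for d, oid in scored]
```

Calling this once against each feature database yields the separate color, texture, motion, shape, and size lists that the next step combines.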

Next, the candidate lists for each object, each at its predetermined minimum threshold, are merged 350 to form a single list of video shots. The merging process entails comparing the generated candidate lists 331, 332, 333, 334, and 335 with one another, so that video objects that do not appear on every candidate list are screened out. The candidate video objects that remain after this screening are sorted by their global weighted distance from the queried attributes. Finally, a global threshold, which is based on the predetermined individual thresholds and is preferably modified by the user-defined weights recorded in the query 306, is used to prune the object list to the best-matched candidate or candidates. The preferred global threshold is 0.4.
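The merge step can be sketched as an intersection followed by a weighted sort and a cutoff. The 0.4 default comes from the text; everything else is an assumption made for illustration, including the weighted average used as the "global weighted distance", the dictionary layout, and the name `merge_candidates`.

```python
from typing import Dict, List, Tuple


def merge_candidates(
    lists: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
    global_threshold: float = 0.4,
) -> List[Tuple[str, float]]:
    """Merge per-attribute candidate lists into one ranked list.

    `lists` maps an attribute name (e.g. "color") to {object_id: distance}.
    `weights` maps the same attribute names to user-defined importance.
    """
    if not lists:
        return []
    # Screen out any object that is missing from at least one candidate list.
    common = set.intersection(*(set(c) for c in lists.values()))
    total_weight = sum(weights[attr] for attr in lists)
    ranked = []
    for obj_id in common:
        # Assumed definition: weight-normalized sum of per-attribute distances.
        dist = sum(weights[attr] * lists[attr][obj_id] for attr in lists) / total_weight
        if dist <= global_threshold:
            ranked.append((obj_id, dist))
    ranked.sort(key=lambda pair: pair[1])
    return ranked
```

Requiring membership in every list makes the merge strict: an object that matches well on color but falls below the motion threshold never reaches the weighted ranking at all.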

For each of the video shots in the merged list, key frames are dynamically extracted from the video shot database and returned to the client 130 over the network 120. If the user is satisfied with the results, the video shot corresponding to a key frame can be extracted from the video database in real time by "cutting out" that video shot from the database. The video shots are extracted from the video database using compressed-domain video editing schemes, e.g., the techniques described in Chang et al., PCT Patent Application No. PCT/US97/08266, filed May 16, 1997, which is incorporated herein by reference.

Those skilled in the art will appreciate that the matching technique of FIG. 3 may be performed at the object level or at the region level.

The various techniques used in the system described herein in connection with FIG. 1 are described next. To generate a meaningful search query, the client computer 130 may limit or quantize the attributes to be searched. Thus, for color, the set of allowable colors may be obtained by uniformly quantizing the HSV color space, although the use of true color is preferred; true color is already quantized in the sense that only certain colors are permissible on modern computers.

For texture, the well-known MIT texture database may be used to assign texture attributes to the various objects. Thus, the user must select from the 56 textures available in the database to form a search query. Of course, other texture sets could readily be used.

The shape of a video object may be an arbitrary polygon, as well as an oval of arbitrary shape and size. The user may thus sketch out an arbitrary polygon with the help of the cursor, and other well-known shapes such as circles, ellipses and rectangles may be predefined and easily inserted and manipulated. The query interface translates the shape into a set of numbers that accurately represent it. For example, a circle is represented by a center point and a radius, and an ellipse by its focal points and a distance.

For motion, two alternative models are used. First, the search may be based on the perceived motion of video objects, as derived from the optical flow of the pixels within the video objects. Optical flow is the combined effect of both global motion (i.e., camera motion) and local motion (i.e., object motion). For example, if the camera tracks the motion of a car, the car appears static in the video sequence.

Second, the search may be based on the "true" motion of the video objects. True motion refers to the local motion of an object after the global motion has been compensated. In the case of a moving car, the true motion of the car is the actual physical motion of the car as it is driven.

The global motion of the dominant background scene may be estimated using the well-known six-parameter affine model, while a hierarchical pixel-domain motion estimation method is used to extract the optical flow. The affine model of the global motion is used to compensate the global motion of all objects in the same scene. The six-parameter model is as follows.

where a_i are the affine parameters, x and y are the coordinates, and dx, dy are the displacement, or optical flow, at each pixel.

Classification of global camera motion, e.g., zoom, pan, or tilt, is based on the global affine estimation. To detect panning, a histogram of the global motion velocity field is computed in eight directions, as those skilled in the art will appreciate. If one direction holds a dominant number of the moving pixels, camera panning in that direction is declared. Camera zooming is detected by examining the average magnitude of the global motion velocity field and the two scaling parameters (a₁ and a₂) of the above affine model. When there is sufficient motion (i.e., the average magnitude is above a given threshold) and a₁ and a₂ are both positive and above a certain threshold, camera zooming in is declared. Otherwise, if a₁ and a₂ are both negative and below a certain value, camera zooming out is declared. Such information may be included in a search query to indicate the presence or absence of camera panning or zooming.
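The classification just described could be sketched roughly as below. This is a minimal illustration, not the claimed method: the function name, the threshold values, the 50% dominance rule, and the order in which zoom and pan are tested are all assumptions, and `scale_x` and `scale_y` simply stand in for the two scaling parameters of the affine model.

```python
import math

def classify_camera_motion(flow, scale_x, scale_y,
                           motion_thresh=1.0, zoom_thresh=0.01, dominance=0.5):
    """flow: list of (dx, dy) global-motion vectors, one per pixel or block.
    All threshold defaults are illustrative assumptions."""
    mean_mag = sum(math.hypot(dx, dy) for dx, dy in flow) / len(flow)
    if mean_mag < motion_thresh:
        return ("static", None)  # not enough motion to classify

    # Zoom: both scaling parameters share a sign and exceed the threshold.
    if scale_x > zoom_thresh and scale_y > zoom_thresh:
        return ("zoom in", None)
    if scale_x < -zoom_thresh and scale_y < -zoom_thresh:
        return ("zoom out", None)

    # Pan: histogram of flow directions in eight 45-degree bins.
    hist = [0] * 8
    moving = 0
    for dx, dy in flow:
        if dx == 0 and dy == 0:
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)
        hist[int((angle + math.pi / 8) // (math.pi / 4)) % 8] += 1
        moving += 1
    best = max(range(8), key=hist.__getitem__)
    if moving and hist[best] / moving > dominance:
        return ("pan", best)  # bin 0 is the +x direction, counter-clockwise
    return ("unclassified", None)
```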

A search may also include temporal information relating to one or more objects. Such information may define the overall duration of the object either in relative terms, i.e., long or short, or in absolute terms, i.e., in seconds. In the case of multiple-object queries, the user is given the flexibility of specifying the overall temporal order of the scene by specifying the "arrival" order of the various objects in the scene, or the disappearance order, i.e., the order in which video objects vanish from the video clip. Another useful time-related attribute is the scaling factor, or the rate at which the size of the object changes over the duration of the object's existence. Likewise, acceleration may be a suitable attribute for searching.

Prior to forming the actual query for the browser to search, the various attributes may be weighted to reflect their relative importance in the query. The feature weighting is global to the entire animated sketch; for example, the color attribute has the same weight across all objects. The final ranking of the video shots returned by the system is affected by the weights that the user has assigned to the various attributes.

Referring to FIG. 4, a technique for extracting video objects from a video clip will now be described. A video clip formed by a sequence of frames of compressed video information 400, including a current frame 401, is illustratively analyzed in FIG. 4.

Prior to any video object extraction, raw video is preferably split into video clips such as video clip 400. Video clip separation may be achieved by scene change detection algorithms such as those described in the aforementioned Chang et al. PCT Patent Application No. PCT/US97/08266, filed May 16, 1997. Chang et al. describe techniques for detecting both abrupt and transitional (e.g., dissolve, fade in/out, wipe) scene changes in compressed MPEG-1 or MPEG-2 bitstreams, using the motion vectors and discrete cosine transform coefficients from the MPEG bitstream to compute statistical measures. These measures are then used to verify heuristic models of abrupt or transitional scene changes.

To segment and track video objects, the concept of an "image region" is utilized. An image region is a contiguous region of pixels with consistent features such as color, texture, or motion, and generally corresponds to part of a physical object, such as a car, a person, or a house. A video object consists of a sequence of instances of the tracked image region in consecutive frames.

The technique illustrated in FIG. 4 segments and tracks video objects by considering static attributes, edge information, and motion information in the video shot. The current frame n 401 is preferably used in both a projection and segmentation technique 430 and a motion estimation technique 440, both described below.

Prior to projection and segmentation, the information is preprocessed in two different ways in order to achieve consistent results. In parallel, the current frame n is both quantized 410 and used to generate an edge map 420, based on one or more recognizable attributes of the information. In the preferred embodiment described below, color is chosen as that attribute because of its consistency under varying conditions. However, other attributes of the information, such as texture, could likewise form the basis for the projection and segmentation process, as those skilled in the art will appreciate.

As illustrated in FIG. 4, the current frame (i.e., frame n) is converted 411 into a perceptually uniform color space, e.g., CIE L*u*v* space. Non-uniform color spaces such as RGB are not suitable for color segmentation, because distance measures in these spaces are not proportional to perceptual differences. The CIE L*u*v* color space divides color into one luminance channel and two chrominance channels, permitting variation in the weight given to luminance and chrominance. This is an important option, since it lets users assign differing weights according to the characteristics of a given video shot. Indeed, it is generally better to assign more weight to the chrominance channels, e.g., twice as much.

The information converted to the L*u*v* color space is then adaptively quantized 412. Preferably, a clustering-based quantization technique, such as the well-known K-Means or Self Organization Map clustering algorithms, is used to produce quantization palettes from the actual video data in L*u*v* space. More common fixed-level quantization techniques can also be used.

After adaptive quantization 412, non-linear median filtering 413 is preferably used to eliminate insignificant details and outliers in the image while preserving edge information. Quantization and median filtering thus simplify the image by removing possible noise as well as tiny details.
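To make the edge-preserving behavior of median filtering concrete, here is a minimal sketch on a single channel. The 3×3 window, the untouched border pixels, and the function name are assumptions; the specification does not state a window size.

```python
from statistics import median_low

def median_filter_3x3(channel):
    """Apply a 3x3 median filter to a 2-D list of numbers.

    Border pixels are copied unchanged (an assumption made for brevity).
    """
    height, width = len(channel), len(channel[0])
    out = [row[:] for row in channel]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            window = [channel[y + j][x + i]
                      for j in (-1, 0, 1) for i in (-1, 0, 1)]
            out[y][x] = median_low(window)  # always an existing pixel value
    return out
```

An isolated outlier is replaced by its neighborhood value, while a straight step edge between two flat areas passes through unchanged, which is why the filter suits this preprocessing stage.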

Simultaneously with quantization 410, an edge map of frame n is generated 420 using an edge detection algorithm. The edge map is a binary mask in which edge pixels are set to 1 and non-edge pixels are set to 0. It is generated through the well-known Canny edge detection algorithm, which performs 2-D Gaussian pre-smoothing on the image and then takes directional derivatives in the horizontal and vertical directions. The derivatives are in turn used to compute a gradient, with local gradient maxima taken as candidate edge pixels. This output is run through a two-level thresholding synthesis process to produce the final edge map. A simple algorithm may be used to automatically choose the two threshold levels in the synthesis process based on the histogram of the gradient.

Both the quantized attribute information and the edge map are utilized in the projection and segmentation step 430, where regions having a consistent attribute, e.g., color, are merged. Projection and segmentation consists of four sub-steps: interframe projection 431, intraframe projection 432, edge point labeling 433, and simplification 434.

The interframe projection step 431 projects and tracks previously segmented regions determined from the previous frame, i.e., frame n-1 in FIG. 4. Referring to FIG. 5, in the affine projection step 510, existing regions from frame n-1 are first projected into frame n according to their affine parameters, discussed below. If the current frame is the first frame in the sequence, this step is simply skipped. Next, a modified pixel labeling process 520 is applied. For every non-edge pixel in frame n, if it is covered by a projected region and the weighted Euclidean distance between the color of the pixel and the mean color of the region (with default weights WL=1, Wu=2, and Wv=2) is under a given threshold, e.g., 256, the pixel is labeled consistently with the old region. If the pixel is covered by more than one projected region under the given threshold, it is labeled as the region with the nearest distance. If, however, no region satisfies the condition, a new label is assigned to the pixel. Note that edge pixels are not processed and therefore are not labeled at this time. Finally, a connection graph 530 is built among all labels, i.e., regions: two regions are linked as neighbors if pixels in one region have neighboring pixels (4-connection mode) in the other region.
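The labeling rule of process 520 can be illustrated with a short sketch. The exact form of the weighted distance is not reproduced in this text, so the squared form used here is an assumption, as are the function names and the dictionary of covering regions; the weights (1, 2, 2) and the threshold of 256 are the defaults quoted above.

```python
def weighted_color_distance(c1, c2, weights=(1.0, 2.0, 2.0)):
    """Weighted distance between two (L, u, v) colors.

    Assumed form: the weighted sum of squared channel differences.
    """
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, c1, c2))

def label_pixel(pixel_luv, covering_regions, next_label, threshold=256.0):
    """Label one non-edge pixel of frame n.

    covering_regions: {label: mean (L, u, v)} for the projected regions
    that cover this pixel. Returns (label, is_new_label).
    """
    best_label, best_dist = None, threshold
    for label, mean_color in covering_regions.items():
        dist = weighted_color_distance(pixel_luv, mean_color)
        if dist < best_dist:  # under the threshold and nearest so far
            best_label, best_dist = label, dist
    if best_label is None:
        return next_label, True  # no region qualifies: start a new one
    return best_label, False
```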

In the intraframe projection step 432, the tracked and new labels (regions) described above are merged into larger regions. Referring to FIG. 6, an iterative spatially constrained clustering algorithm 610 is utilized, in which two adjoining regions with a color distance smaller than a given threshold, preferably 225, are merged into one new region 620, until the color distance between any two adjoining regions is larger than the threshold. When a new region is generated from two adjoining regions, its mean color is computed 630 as the weighted average of the mean colors of the two old regions, with the sizes of the two old regions used as weights. The region connections are then updated 640 for all neighbors of the two old regions. The new region is assigned one label 650 from the labels of the two old regions: if both old labels are tracked from the previous frame, the label of the larger region is chosen; if one old label is tracked and the other is not, the tracked label is chosen; otherwise, the label of the larger region is chosen. The two old regions are dropped 660, and the process is repeated until no new regions are determined 670.
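A single merge of two adjoining regions, covering the mean-color update 630 and the label choice 650, might look like the following sketch. The `Region` data class and its field names are hypothetical, and treating the merged region as tracked when either parent was tracked is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Region:
    label: int
    size: int          # number of pixels
    mean_color: tuple  # mean (L, u, v)
    tracked: bool      # True if the label came from the previous frame

def merge_regions(a: Region, b: Region) -> Region:
    """Merge two adjoining regions into one new region."""
    total = a.size + b.size
    # Size-weighted average of the two old mean colors (step 630).
    mean = tuple((a.size * ca + b.size * cb) / total
                 for ca, cb in zip(a.mean_color, b.mean_color))
    # Label choice (step 650): prefer the tracked label when only one
    # is tracked; in every other case take the larger region's label.
    if a.tracked != b.tracked:
        keep = a if a.tracked else b
    else:
        keep = a if a.size >= b.size else b
    return Region(keep.label, total, mean, a.tracked or b.tracked)
```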

Returning to FIG. 4, edge points may be assigned 433 to their neighboring regions according to color measures, to ensure the accuracy of region boundaries. In both the interframe and intraframe segmentation processes discussed above, only non-edge pixels are processed and labeled. Edge pixels are not merged into any region. This ensures that regions clearly separated by long edges are not spatially connected and thus are not merged with each other. After all non-edge pixels have been labeled, edge pixels are assigned to their neighboring regions according to the same color distance measure. The connection graph mentioned above may be updated during the labeling process.

Finally, a simplification process 434 is applied to eliminate small regions, i.e., regions with fewer than a given number of pixels. The threshold parameter depends on the frame size of the images. For QCIF size (176×120) images, the preferred default value is 50. If a small region is close to one of its neighboring regions, i.e., the color distance is below the color threshold, the small region is merged with the neighboring region. Otherwise, the small region is dropped.

Concurrently with the projection and segmentation process 430, the optical flow of the current frame n is derived from frames n and n+1 in the motion estimation step 440, using a hierarchical block matching method such as the technique described in M. Bierling, "Displacement Estimation by Hierarchical Block Matching," 1001 SPIE Visual Comm. & Image Processing (1988), which is incorporated herein by reference. Unlike ordinary block matching techniques, in which the minimum mean absolute luminance difference is searched using a fixed measurement window size, this method uses distinct measurement window sizes at different levels of a hierarchy to estimate the dense displacement vector field (optical flow). It yields relatively reliable and homogeneous results. Preferably, a three-level hierarchy is used.

After color or other attribute regions have been extracted and a measure of the optical flow in the frame has been generated, a standard linear regression algorithm is used to estimate the affine motion of each region 450. For each region, linear regression is used to determine the six parameters of the affine motion equation, i.e., the equation that best fits the dense motion field inside the region.
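One way to picture the regression step 450 is an ordinary least-squares fit of a six-parameter affine model to the flow vectors inside a region. The sketch below assumes the parameterization dx = tx + sxx·x + sxy·y and dy = ty + syx·x + syy·y; these parameter names and their ordering are illustrative and may not match the a_i indexing of the specification. It uses only the standard library and solves the 3×3 normal equations directly.

```python
def solve3(matrix, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with pivoting."""
    a = [row[:] + [v] for row, v in zip(matrix, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):  # back substitution
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x

def fit_affine(points, flows):
    """Least-squares affine fit to a region's dense motion field.

    points: [(x, y)] pixel coordinates; flows: [(dx, dy)] at those pixels.
    Returns (tx, sxx, sxy, ty, syx, syy) under the assumed model above.
    """
    rows = [(1.0, x, y) for x, y in points]
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    params = []
    for k in (0, 1):  # fit dx first, then dy
        atb = [sum(r[i] * f[k] for r, f in zip(rows, flows)) for i in range(3)]
        params.extend(solve3(ata, atb))
    return tuple(params)
```

The fit needs at least three non-collinear pixels; a degenerate region would lead to a division by zero in `solve3`, which a fuller implementation would have to guard against.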

The affine motion parameters are preferably further refined 460 using a three-step region matching method in the six-dimensional affine space, which is an extension of the common three-step block matching technique used in motion estimation for MPEG compression. A description of this well-known technique can be found in Arun N. Netravali et al., "Digital Pictures: Representation, Compression and Standards, Second Edition," pp. 340-344 (Plenum Press, New York and London, 1995), which is incorporated herein by reference. For each region, the initial affine model is used to search for a new model that projects the region with the minimum mean absolute luminance error. The search range along each dimension is defined as 10% of the initial parameter on that dimension.

Through affine motion estimation 450 and refinement 460, homogeneous color regions with affine motion parameters are generated for frame n. Similarly, these regions will be tracked in the segmentation process of frame n+1.

Finally, region grouping 470 may be applied at the final stage of the process to avoid over-segmentation and to obtain higher-level video objects. Several criteria may be adopted to group or identify the major regions.

First, the size of the determined regions, i.e., the average number of pixels, and their duration, i.e., the number of successive frames over which a region is tracked, can be used to eliminate minor, unimportant regions. Regions with small size and/or short duration may be dropped.

Second, adjoining regions with similar motion may be grouped into one moving object. This is applied to video sequences containing moving objects in order to detect those objects. To realize such grouping, a spatially constrained clustering process may be used to group adjoining regions based on their affine motion parameters in individual frames. Next, a temporal searching process may be used to link region groups in different frames together as one video object if these region groups contain at least one common region. For each region group in the starting frame, the search begins with the region that has the longest duration inside the group. If a region group is successfully tracked for more than a certain amount of time, e.g., 1/3 of a second, a new object label is assigned to this region group. Finally, a temporal alignment process may be applied to ensure the consistency of the regions contained in a video object. If a region exists only briefly, e.g., for less than 10% of the duration of the video object itself, it is considered an error of the region grouping process and is dropped from the video object.

As discussed in connection with FIG. 3, the server computer 110 includes a plurality of feature databases, e.g., a color database 311, a texture database 312, a motion database 313, a shape database 314, and a size database 315, each of which is associated with the original video information. For each video object extracted from the parsed video clips, e.g., a video object extracted by the method explained with reference to FIG. 4, the attendant features are advantageously stored in the databases of the server computer 110.

For the color database 311, the representative color of the video object is stored in the quantized CIE-LUV space. Quantization is not a static process; the quantization palette changes with each video shot, depending on the color variation. Although the preferred arrangement uses a representative color, the color database may also include a single color, an average color, a color histogram, and/or color pairs for the video object.

For the texture database 312, three so-called Tamura texture measures, i.e., coarseness, contrast and orientation, are computed as a measure of the textural content of the object. Alternatively, wavelet-domain textures, texture histograms, and/or Laws Filter-based textures may be utilized to develop the database 312.

For the motion database 313, the motion of each video object is stored as a list of N-1 vectors, where N is the number of frames in the video clip. Each vector is the average translation of the centroid of the object between successive frames after global motion compensation. Along with this information, the frame rate of the video shot sequence is also stored, so that both the "speed" of the object and its duration can be established.
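The stored motion representation can be illustrated as follows. This sketch assumes the centroids have already been global-motion compensated; the function names and the definition of speed as mean displacement per second are illustrative choices rather than details from the specification.

```python
import math

def motion_vectors(centroids):
    """Turn N per-frame centroids into N-1 frame-to-frame translations."""
    return [(x1 - x0, y1 - y0)
            for (x0, y0), (x1, y1) in zip(centroids, centroids[1:])]

def duration_seconds(num_frames, frame_rate):
    """Duration of the object, derived from the stored frame rate."""
    return num_frames / frame_rate

def mean_speed(vectors, frame_rate):
    """Average speed in pixels per second (assumed definition)."""
    per_frame = sum(math.hypot(dx, dy) for dx, dy in vectors) / len(vectors)
    return per_frame * frame_rate
```

For example, three centroids at (0, 0), (3, 4) and (6, 8) give two vectors of (3, 4), i.e., 5 pixels per frame, or 150 pixels per second at 30 frames per second.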

For the shape database 314, the principal components of the shape of each video object are determined by a well-known eigenvalue analysis, such as that described in E. Saber et al., "Region-based affine shape matching for automatic image annotation and query-by-example," 8 Visual Comm. and Image Representation 3-20 (1997). Two other new features, the normalized area and the percentage area, are also computed. The normalized area is the area of the object divided by the area of a circumscribed circle. If a region can be fairly approximated by a circle, that approximation is made. For example, if the axis ratio of the object is greater than 0.9 and the normalized area is also greater than 0.9, the shape is classified as a circle. Alternatively, geometric invariants, moments of different orders in each dimension, polynomial approximations, spline approximations, and/or algebraic invariants could be utilized.
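The circle test in the example above reduces to two comparisons, sketched here with hypothetical function names. How the axis ratio and the circumscribed radius are obtained is outside the scope of this sketch.

```python
import math

def normalized_area(object_area, circumscribed_radius):
    """Object area divided by the area of its circumscribed circle."""
    return object_area / (math.pi * circumscribed_radius ** 2)

def is_circle(axis_ratio, norm_area, threshold=0.9):
    """Classify a shape as a circle when both measures exceed 0.9."""
    return axis_ratio > threshold and norm_area > threshold
```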

Finally, for the size database 315, the size in terms of pixels is stored.

The evolution of spatial relationships over time may be indexed as a succession of edits to the original spatial graph. Other databases, such as spatio-temporal databases, may be used where the spatial relationships among the objects in a frame are indexed by a spatial graph or by 2-D strings.

Next, the techniques for comparing a search query with the information stored in the feature databases 111 of the server computer 110 will be described. As discussed with reference to FIG. 3, the server 110 performs the matching tasks 321, 322, 323, 324, 325, matching the queried color 301, texture 322, motion 323, shape 324, size 325, and other attributes against the information stored in the databases 311, 312, 313, 314, 315, etc., to generate lists of candidate video shots 331, 332, 333, 334, 335.

For matching motion trajectories 323, the three-dimensional trajectory of the video object is optimally utilized. The three dimensions, consisting of the two spatial dimensions x, y and a temporal dimension t normalized to the frame number, are represented by the sequence {x[i], y[i], where i = 1, ..., N}. The frame rate provides true time information.

At the client computer 130, the user may sketch an object trajectory as a sequence of vertices in the x-y plane, and may also specify the duration of the object in the video clip. The duration is quantized, in terms of the frame rate, into three levels: long, medium, and short. The entire trajectory may readily be computed by uniformly sampling the motion trajectory based on the frame rate, e.g., 30 frames per second.
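A sketched polyline can be expanded into per-frame positions roughly as follows. The sketch assumes the object moves at constant speed along the polyline, which the specification does not state; the function name and the arc-length interpolation are likewise illustrative. The number of frames would be the chosen duration multiplied by the frame rate.

```python
import math

def resample_trajectory(vertices, num_frames):
    """Uniformly sample num_frames points along a polyline by arc length.

    vertices: [(x, y)] as sketched by the user (at least two points).
    """
    if num_frames < 2 or len(vertices) < 2:
        raise ValueError("need at least two vertices and two frames")
    seg = [math.dist(p, q) for p, q in zip(vertices, vertices[1:])]
    total = sum(seg)
    samples = []
    for i in range(num_frames):
        target = total * i / (num_frames - 1)  # arc length for frame i
        k = 0
        while k < len(seg) - 1 and target > seg[k]:
            target -= seg[k]
            k += 1
        t = target / seg[k] if seg[k] else 0.0
        (x0, y0), (x1, y1) = vertices[k], vertices[k + 1]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples
```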

In accordance with a preferred aspect of the invention, two major modes of matching trails are described: a spatial mode and a spatio-temporal mode. In the spatial mode, the motion trails are projected onto the x-y plane, resulting in an ordered contour. Candidate trajectories are determined by measuring the distance between the query contour and the corresponding contour for each object in the database. This kind of matching provides "time-scale invariance" and is useful when the user is unsure of the time taken by an object to execute the trajectory.
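
The spatial mode can be illustrated with the following Python sketch. The description does not state how the distance between two ordered contours is computed, so the arc-length resampling and the mean point-to-point distance used here, as well as the names `resample` and `spatial_distance`, are assumptions made for illustration only.

```python
import math

def resample(points, n):
    """Return n points spaced evenly by arc length along an ordered contour.

    Assumes at least two input points and n >= 2.
    """
    cum = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cum.append(cum[-1] + math.hypot(x1 - x0, y1 - y0))
    out, seg = [], 0
    for k in range(n):
        target = cum[-1] * k / (n - 1)
        while seg < len(points) - 2 and cum[seg + 1] < target:
            seg += 1
        span = cum[seg + 1] - cum[seg]
        u = 0.0 if span == 0 else (target - cum[seg]) / span
        (x0, y0), (x1, y1) = points[seg], points[seg + 1]
        out.append((x0 + u * (x1 - x0), y0 + u * (y1 - y0)))
    return out

def spatial_distance(query, target, n=32):
    """Mean point-to-point distance between two arc-length-resampled contours."""
    q, t = resample(query, n), resample(target, n)
    return sum(math.hypot(qx - tx, qy - ty)
               for (qx, qy), (tx, ty) in zip(q, t)) / n
```

Because both contours are resampled by arc length rather than by frame, a path traversed slowly and the same path traversed quickly yield a distance of approximately zero, which corresponds to the time-scale invariance of the spatial mode.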

In the spatio-temporal mode, the entire motion trail is used to compute the distance according to the following metric:

where the subscripts q and t refer to the query and the target trajectories, respectively, and the index i runs over the frame numbers. Alternatively, the index may run over a set of subsamples.

Because the duration of the query object will generally differ from that of the objects in the database, some further refinements may be beneficial. First, when the durations differ, the two trajectories may be matched only over the shorter of the two durations, i.e., the index i runs up to the minimum of the query duration and the database duration.
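
A minimal sketch of the spatio-temporal comparison follows. The metric itself does not appear in this text, so the sum of squared coordinate differences is an assumed form that is merely consistent with the surrounding definitions (subscripts q and t, index i over frames, optional subsampling). The function name and the `step` parameter are hypothetical.

```python
def spatiotemporal_distance(query, target, step=1):
    """Sum of squared (x, y) differences, frame by frame.

    query, target: lists of (x, y) points, one per frame.
    Only the shorter of the two durations is compared; step > 1
    restricts the index to a subsampled set of frames.
    """
    n = min(len(query), len(target))
    return sum(
        (query[i][0] - target[i][0]) ** 2 + (query[i][1] - target[i][1]) ** 2
        for i in range(0, n, step)
    )
```

Truncating to the shorter trajectory keeps the comparison defined for every query and target pair, at the cost of ignoring the tail of the longer trajectory.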

Alternatively, the query and the stored trajectory durations may each be normalized to a canonical duration prior to performing the matching. For example, if each video clip is normalized so that the playback frame rate is time-scaled to a predetermined time scale, the search query should be normalized to the same predetermined time scale by mapping the query to the video clip and then scaling the mapped query to the video object trajectory defined by the normalized video clip.
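
One possible way to bring a trajectory to a canonical duration is linear interpolation over the frame index, sketched below. The function name `normalize_duration`, the parameter `n_canonical`, and the choice of linear interpolation are assumptions and are not prescribed by this description.

```python
def normalize_duration(traj, n_canonical):
    """Linearly resample a per-frame trajectory to n_canonical frames.

    traj: list of (x, y) points, one per frame. Assumes n_canonical >= 2.
    """
    if len(traj) == 1:
        return [traj[0]] * n_canonical
    out = []
    for k in range(n_canonical):
        # Fractional position of output frame k on the input frame axis.
        pos = k * (len(traj) - 1) / (n_canonical - 1)
        i = min(int(pos), len(traj) - 2)
        u = pos - i
        (x0, y0), (x1, y1) = traj[i], traj[i + 1]
        out.append((x0 + u * (x1 - x0), y0 + u * (y1 - y0)))
    return out
```

Once the query and every stored trajectory have the same number of frames, they can be compared index by index without truncation.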

As in the case of motion, the task of matching the queried color 201, texture 222, shape 224, size 225 and other attributes against the information stored in the databases involves an optimized comparison process. For color, the color of the query object is matched with the mean color of the candidate tracked objects in the database in accordance with Equation 4:

where C_d is the weighted Euclidean color distance in the CIE-LUV space, and the subscripts q and t refer to the query and the target, respectively.
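
Equation 4 does not appear in this text. The sketch below shows one weighted Euclidean distance in CIE-LUV space that is consistent with the definition of C_d; the weights and the function name are hypothetical.

```python
import math

def color_distance(query_luv, target_luv, weights=(1.0, 1.0, 1.0)):
    """Weighted Euclidean distance between two mean colors given as (L, U, V)."""
    return math.sqrt(sum(
        w * (q - t) ** 2
        for w, q, t in zip(weights, query_luv, target_luv)
    ))
```

Setting a weight to zero removes that channel from the comparison, for example to ignore lightness differences.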

For texture, the three Tamura texture parameters of each tracked object are compared with the parameters stored in the database. The distance metric is the Euclidean distance weighted along each texture feature by the variance along that channel, as shown in Equation 5:

where α, β, and Φ denote coarseness, contrast, and orientation, respectively, and σ(α), σ(β), and σ(Φ) denote the variances of the corresponding features.
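
Equation 5 does not appear in this text. The following sketch assumes a Euclidean distance in which each squared difference is divided by the variance of the corresponding Tamura feature, which is one reading of the weighting described above; the function and parameter names are hypothetical.

```python
import math

def texture_distance(query, target, variances):
    """Variance-weighted Euclidean distance over Tamura features.

    query, target: (coarseness, contrast, orientation) tuples.
    variances: per-feature variances over the database (all non-zero).
    """
    return math.sqrt(sum(
        (q - t) ** 2 / var
        for q, t, var in zip(query, target, variances)
    ))
```

Dividing by the variance puts the three features on a comparable scale, so that a feature with a naturally wide spread does not dominate the distance.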

For shape, the metric may simply involve only the principal components of the shape, as shown in Equation 6:

where the two quantities compared are the eigenvalues along the principal axes of the object, i.e., their ratio is the aspect ratio. Other, more complex algorithms, such as geometric invariants, may be used.
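
Equation 6 does not appear in this text. The sketch below assumes that the two eigenvalues are those of the covariance matrix of the object's pixel coordinates, and that the distance is the absolute difference between the query and target eigenvalue ratios. Both assumptions, and all names, are illustrative only.

```python
import math

def principal_eigenvalues(pixels):
    """Eigenvalues (larger, smaller) of the 2x2 covariance of pixel coordinates."""
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    mean = (sxx + syy) / 2
    half_gap = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    return mean + half_gap, mean - half_gap

def shape_distance(query_pixels, target_pixels):
    """Absolute difference of eigenvalue ratios (assumes non-degenerate objects)."""
    q_big, q_small = principal_eigenvalues(query_pixels)
    t_big, t_small = principal_eigenvalues(target_pixels)
    return abs(q_small / q_big - t_small / t_big)
```

Because eigenvalues do not change under rotation, a rectangle and the same rectangle rotated by 90 degrees give a distance of zero under this sketch.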

Size is implemented as a distance on the area ratios, as shown in Equation 7:

where A_q and A_t denote the percentage areas of the query and the target, respectively.
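
Equation 7 does not appear in this text. One distance on area ratios that is bounded by [0, 1] is sketched below; the specific form, one minus the ratio of the smaller to the larger percentage area, is an assumption.

```python
def size_distance(area_q, area_t):
    """Distance on area ratios: 0 for equal areas, approaching 1 as they diverge.

    area_q, area_t: positive percentage areas of the query and target objects.
    """
    return 1.0 - min(area_q, area_t) / max(area_q, area_t)
```

This form is symmetric in the query and the target, so the result does not depend on which of the two objects is larger.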

The total distance is simply the weighted sum of these distances, after the dynamic range of each metric has been normalized to lie in [0, 1], in accordance with Equation 8:
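
Equation 8 does not appear in this text. The sketch below assumes min-max normalization of each metric across the candidate set, followed by a weighted sum; the normalization method, the weights, and the data layout are hypothetical.

```python
def rank_candidates(distances, weights):
    """Rank candidates by the weighted sum of normalized feature distances.

    distances: {candidate_id: {feature_name: raw_distance}}
    weights:   {feature_name: weight}
    Returns a list of (candidate_id, total_distance), smallest first.
    """
    totals = {cid: 0.0 for cid in distances}
    for feature, w in weights.items():
        raw = {cid: d[feature] for cid, d in distances.items()}
        lo, hi = min(raw.values()), max(raw.values())
        span = hi - lo
        for cid, value in raw.items():
            # Map each raw distance into [0, 1] before weighting.
            norm = 0.0 if span == 0 else (value - lo) / span
            totals[cid] += w * norm
    return sorted(totals.items(), key=lambda item: item[1])
```

Normalizing before summing prevents a metric with a large numeric range, such as a trajectory distance in squared pixels, from overwhelming the others.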

Referring to FIG. 7, a combined video and text based search technique is described, which locates video clips based on both the embedded video object information and the associated audio or text information. This technique makes simultaneous use of visual content, such as object attributes like motion, color, and texture, as well as the descriptive power of natural language.

In addition to entering one or more visual attributes such as color 701, texture 702, motion 703, and shape 704, the user entering a search query 700 is permitted to enter a string of text information 710. The information may be entered directly through the keyboard 131, through a microphone in connection with commercially available speech recognition software, or through any other human-to-computer interfacing technique.

The visual information is matched 730 against the stored library 720 of visual attribute information, as discussed in connection with FIG. 3, to generate the video clips that best match to within a predetermined threshold. The architecture of FIG. 7, however, expands on FIG. 3 by performing a text match 750 against extracted keywords 740 associated with the same video clips that were used to generate the visual library 720. The result of the text match 750 is one or more best-matching video clips based on text alone. Finally, the results of the visual match 730 and the text match 750 are combined 760 to determine, with a high degree of accuracy, the video clip sought by the original search query 700.
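
The description does not specify how the combination 760 is computed. As one hypothetical possibility, the sketch below keeps only the clips returned by both matches and orders them by visual distance; the keyword matching by simple word overlap and all names are assumptions.

```python
def text_match(query_text, keywords_by_clip):
    """Return the ids of clips whose keywords share a word with the query text."""
    terms = set(query_text.lower().split())
    return {
        clip_id
        for clip_id, keywords in keywords_by_clip.items()
        if terms & {k.lower() for k in keywords}
    }

def combine_matches(visual_distances, text_hits):
    """Keep clips found by both matches, ordered by increasing visual distance.

    visual_distances: {clip_id: total visual distance}
    text_hits: set of clip ids returned by the text match
    """
    both = [(cid, d) for cid, d in visual_distances.items() if cid in text_hits]
    return [cid for cid, _ in sorted(both, key=lambda item: item[1])]
```

Intersecting the two result sets favors precision; a weighted blend of visual and text scores would be an alternative when recall matters more.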

In the case of MPEG compressed audiovisual information, the library of extracted keywords 740 may be created either by manual annotation, or by first extracting the audio information from the compressed bitstream in order to transcribe the audio, and then reducing the amount of transcribed text by a keyword spotting technique.

The foregoing merely illustrates the principles of the invention. Other modifications of the invention will be apparent to those skilled in the art, and the scope of the invention is limited only as set forth in the appended claims.

Claims

1. An object-oriented system for permitting a user to locate one or more video objects from one or more video clips over an interactive network, comprising:

a. one or more server computers including storage for the one or more video clips and storage for one or more databases of video object attributes corresponding to the video clips;

b. a communications network, coupled to the one or more server computers, for permitting transmission of the one or more video clips from the server computers; and

c. a client computer, coupled to the communications network, comprising:

i. a query interface for specifying video object attribute information, including motion trajectory information;

ii. a browser interface that accesses the query interface, receives the selected video object attribute information, and browses through the stored video object attributes within the server computers over the communications network to determine one or more video objects having attributes that best match the specified video object attributes; and

iii. an interactive video player that receives, from the server computers, one or more transmitted sequences of frames of video data corresponding to the determined one or more video objects.

2. The system of claim 1, wherein one of the one or more databases stored on the server computer comprises a motion trajectory database.

3. The system of claim 1, wherein one of the one or more databases stored on the server computer comprises a spatio-temporal database.

4. The system of claim 1, wherein one of the one or more databases stored on the server computer comprises a shape database.

5. The system of claim 1, wherein one of the one or more databases stored on the server computer comprises a color database.

6. The system of claim 1, wherein one of the one or more databases stored on the server computer comprises a texture database.

7. The system of claim 1, wherein one of the one or more databases stored on the server computer comprises a pan database.

8. The system of claim 1, wherein one of the one or more databases stored on the server computer comprises a zoom database.

9. The system of claim 1, wherein the one or more sequences of frames of video data are stored on the server computer in a compressed format.

10. The system of claim 1, wherein at least one of the server computers includes means for comparing each of the one or more specified video object attributes with the corresponding stored video object attributes in the server computer to generate lists of candidate video sequences, one for each video object attribute.

11. The system of claim 10, wherein the server computer further comprises determining means, coupled to the comparison means, for receiving the candidate lists and determining, based on the candidate lists, one or more video objects having collective attributes that best match the specified video object attributes.

12. The system of claim 11, wherein the queried video object attribute information includes attributes for one or more video objects;

the comparison means compares each of the one or more specified video object attributes for each video object with the corresponding stored video object attributes in the server computer to generate lists of candidate video sequences, one for each video object attribute of each video object; and

the determining means determines, based on the candidate lists for each queried video object, one or more video objects having collective attributes that best match the specified video object attributes.

13. A method for extracting video objects from a video clip that includes at least one recognizable attribute, comprising the steps of:

a. quantizing a current frame of video data by determining and assigning values to different variations of the at least one attribute represented by the video data, to generate quantized frame information;

b. performing edge detection on the frame of video data based on the at least one attribute to determine edge points in the frame, thereby generating edge information;

c. receiving information defining one or more segmented regions from a previous frame; and

d. extracting regions of video information sharing the at least one attribute from the current frame by comparing the quantized frame information and the generated edge information with the received segmented regions.

14. The method of claim 13, wherein the attribute is color, and the quantizing step comprises converting the current frame into uniform color space information, adaptively quantizing the color space information into a palette, and filtering the palette to remove noise therefrom.

15. The method of claim 14, wherein said adaptive quantization step comprises quantization by a clustering algorithm.

16. The method of claim 13, wherein the edge detection step comprises performing Canny edge detection on the current frame to generate the edge information as an edge map.

17. The method of claim 13, wherein the extracting step comprises:

a. performing inter-frame projection to extract regions in the current frame of video data by projecting one of the received regions onto the current quantized, edge-detected frame to temporally track any movement of the region; and

b. performing intra-frame segmentation to merge neighboring extracted regions in the current frame.

18. The method of claim 17, wherein the attribute is color and the inter-frame projection step comprises:

a. projecting the received regions from the previous frame onto the current frame to temporally track the regions;

b. labeling each non-edge pixel in the current frame as matching a received region or a new region; and

c. linking neighboring regions by generating a connection graph from the labels.

19. The method of claim 18, wherein the intra-frame segmentation step comprises:

a. merging all adjacent regions having a color distance smaller than a predetermined threshold into a new region;

b. determining a mean color for the new region;

c. updating the connection graph;

d. assigning to the new region a new label derived from the labels previously assigned to the merged regions; and

e. dropping the merged regions.

20. The method of claim 17, wherein the extracting step further comprises labeling all edges remaining in the current frame after the intra-frame segmentation to neighboring regions, so that each labeled edge defines a boundary of a video object in the current frame.

21. The method of claim 20, wherein said extracting step further comprises simplifying said extracted region by removing any region having a size below a predetermined threshold.

22. The method of claim 13, further comprising:

e. receiving a future frame of video information;

f. determining an optical flow of the current frame of video information by performing hierarchical block matching between blocks of video information in the current frame and blocks of video information in the future frame; and

g. performing motion estimation on the extracted regions of video information based on the optical flow.

23. The method of claim 22, further comprising grouping the determined regions within the current frame by size and duration.

24. The method of claim 22, further comprising grouping the determined regions within the current frame by determining an internal moving object.

25. A method for locating a video clip that best matches a user-input search query from one or more video clips, wherein each video clip includes one or more video objects that move temporally along predetermined trajectories, comprising the steps of:

a. receiving a search query defining at least one video object trajectory;

b. determining the total distance between the received query and at least a portion of one or more of the predetermined video object trajectories; and

c. locating the best-matching video clip or clips by selecting the one or more predetermined video object trajectories having the smallest total distance from the received query.

26. The method of claim 25, wherein the stored video clips are normalized such that the playback frame rate is scaled to a predetermined time scale,

the method further comprising normalizing the received search query by mapping the received query to each normalized video clip and scaling the mapped query to each video object trajectory defined by the normalized video clips,

wherein the determining step determines the total distance between the normalized received query and the normalized video object trajectories.

27. The method of claim 25, wherein the determining step comprises comparing a spatial distance between the received video object trajectory and at least a portion of the one or more predetermined video object trajectories.

28. The method of claim 25, wherein the determining step comprises comparing a spatio-temporal distance between the received video object trajectory and at least a portion of the one or more predetermined video object trajectories.

29. A method for locating a video clip that best matches a user-input search query from one or more video clips, wherein each video clip includes one or more video objects each having predetermined attributes, comprising the steps of:

a. receiving a search query defining one or more attributes for one or more different video objects in a video clip;

b. searching the video clips to locate one or more video objects that match at least one of the defined attributes to within a predetermined threshold;

c. determining, from the located video objects, one or more video clips that contain the one or more different video objects; and

d. determining the best-matching video clip from the determined video clips by computing the distance between the one or more video objects defined by the search query and the located video objects.

30. The method of claim 29, wherein the one or more attributes include color, and the matching step comprises determining a mean color for each of the queried video objects and comparing the mean color with color information stored in a database.

31. The method of claim 29, wherein the one or more attributes include texture, and the matching step comprises determining coarseness, contrast, and orientation for each of the queried video objects and comparing the coarseness, contrast, and orientation information with coarseness, contrast, and orientation information stored in a database.

32. The method of claim 29, wherein the one or more attributes include shape, and the matching step comprises determining the eigenvalues along the principal axes for each of the queried video objects and comparing the eigenvalues with shape information stored in a database.

33. The method of claim 29, wherein the one or more attributes include size, and the matching step comprises determining a percentage area for each of the queried video objects and comparing the percentage area with area information stored in a database.

34. The method of claim 29, wherein the video clips include associated text information, the search query further includes a definition of text characteristics corresponding to the one or more different video objects, and the method further comprises searching the associated text information to locate text that best matches the text characteristics.

35. The method of claim 34, wherein the best-matching video clip is determined from the determined video clips and the located text.