JP2002506255A - Method and system for generating semantic visual templates for image and video retrieval

Info

Publication number: JP2002506255A
Application number: JP2000534957A
Authority: JP
Inventors: Shih-Fu Chang; William Chen; Hari Sundaram
Original assignee: The Trustees of Columbia University in the City of New York
Priority date: 1998-03-04
Filing date: 1999-03-04
Publication date: 2002-02-26
Also published as: EP1066572A1; KR20010041607A; CA2322448A1; WO1999045483A1; WO1999045483A9

Abstract

(57) [Abstract] For database image and video retrieval, a semantic visual template (SVT) is a set of icons or example scenes/objects that characterize a concept, e.g., skiing or sunset. SVTs provide two-way interaction between the user and the system. The user supplies the system with an initial sketch or example image as a seed, from which the system automatically generates other views of the same concept. The user can then select the plausible views for inclusion in the representation of the concept. Once an SVT has been established, the database can be searched with it, and the user provides relevance feedback on the returned results. An established SVT lets the user interact with the system at the concept level. Existing SVTs can be used in forming new concepts. A limited vocabulary of words can be parsed in conjunction with semantic visual templates to query the system.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] The present invention relates to the retrieval of still images, video and audio from a database and, more particularly, to techniques that facilitate access to database items.

[Industrial applications]

[0002] As image, audio and video content is increasingly created, disseminated and stored in digital form, tools and systems for searching and retrieving visual information are becoming important. However, while efficient "search engines" for text data are widely available, corresponding tools for searching visual image and video data remain underdeveloped.

[0003]

[Prior art]

Typically, for image and video databases, keyword techniques are used to index and retrieve images. Known retrieval systems of this kind have a number of drawbacks: images may not be associated with any textual information, manual inclusion of captions is time-consuming and subjective, and textual annotations are usually extrinsic, failing to represent the intrinsic visual characteristics of a scene. For example, a textual description such as "a man is leaning against a brick wall" conveys little visual information about either the man or the brick wall. Such visual information is often important in retrieving a particular video.

[0004] In recent years, researchers have begun to explore new forms of retrieval from image and video repositories. These approaches are based on the similarity of visual attributes, e.g., color, texture, shape, and the spatial and temporal relationships among the objects that make up a video. Typically, a query is specified by providing an example image or a visual sketch. The retrieval system then ranks and returns the images and videos that are most similar to the given example or sketch.

[0005]

[Problems to be solved by the invention]

To make it easier to retrieve images and videos from a database, the database is indexed with a collection of visual templates. Preferably, in accordance with an aspect of the invention, the visual templates represent semantic concepts or categories, e.g., skiing, sunset and the like. This yields an architecture based on semantic visual templates (SVTs), each of which describes a concept well.

[0006] A semantic visual template (hereinafter also "SVT") is established through an interactive process between the user and the system. The user supplies the system with an initial sketch or example image as a seed, from which the system automatically generates other views of the same concept. The user then picks the views that most plausibly represent the concept for inclusion. Once an SVT has been established, the database is searched with it, and the user provides relevance feedback on the returned results. An established SVT lets the user interact with the system at the concept level. Existing SVTs can be used in forming new concepts.

[0007] The invention further provides a technique for parsing a limited vocabulary of words in conjunction with semantic visual templates in order to query the system.

[0008]

[Means for Solving the Problems]

The computerized method of the invention for generating a visual template for a concept comprises the steps of:
a. obtaining at least one initial query for the concept;
b. generating at least one additional query based on the initial query;
c. testing the additional query for relevance with respect to the concept; and
d. if it is relevant, including the additional query as an additional template in the visual template for the concept.
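Steps a to d above can be read as a simple generate-and-test loop. The following is a minimal sketch under stated assumptions: `build_visual_template`, `generate_additional` and `is_relevant` are hypothetical names, queries are treated as opaque values, and the relevance test stands in for whatever user feedback or evaluation the system actually applies. Whether the initial query itself is stored in the template is not stated in the text, so the sketch leaves it out.

```python
from typing import Callable, Iterable, List, TypeVar

Q = TypeVar("Q")  # opaque query type (hypothetical)


def build_visual_template(
    initial_queries: Iterable[Q],
    generate_additional: Callable[[Q], Iterable[Q]],
    is_relevant: Callable[[Q], bool],
) -> List[Q]:
    """Collect the additional queries that pass the relevance test."""
    template: List[Q] = []
    for seed in initial_queries:                      # step a: initial query
        for candidate in generate_additional(seed):   # step b: additional queries
            if is_relevant(candidate):                # step c: relevance test
                template.append(candidate)            # step d: include in template
    return template


# Toy usage: integers stand in for queries, odd numbers count as relevant.
result = build_visual_template(
    [10],
    generate_additional=lambda q: [q - 1, q + 1, q + 2],
    is_relevant=lambda q: q % 2 == 1,
)
```

In a real system the relevance test would probably involve running the candidate query and asking the user to judge the results, as the later paragraphs describe.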

[0009]

【Example】

The invention is described below with reference to the drawings. A technique for computer-based spatial and temporal visual search that uses visual templates to search images and videos in different categories is disclosed in US Provisional Patent Application No. 60/045,637 (May 5, 1997) and PCT International Application No. PCT/US98/09124 (May 5, 1998; Canada, Japan, Korea and the United States). This search technique, referred to as VideoQ, can be used in conjunction with the present invention.

[0010] In a video stream, VideoQ can be used to determine the beginning and end of a scene and, for example, to compensate for camera motion when extracting the objects in the scene. Also using VideoQ, each object can be characterized by salient attributes such as color, texture, size, shape and motion. This yields a video object database consisting of all the objects extracted from the scenes, together with their attributes.

Visual Templates

[0011] A visual template represents an idea in the form of a sketch or an animated sketch. Because a single visual template is rarely sufficient to represent a class of interest, a library of visual templates can be assembled that contains representative templates for different semantic classes. For example, when searching for video clips of the class "sunset", one or more visual templates associated with that class are selected, and similarity-based querying is used to find "sunset" video clips.

[0012] An important advantage of a visual template library is that it links low-level visual feature representations to high-level semantic concepts. For example, if the user enters a query in constrained natural language, as described in the above-identified patent applications, visual templates can be used to convert the natural-language query into automatic queries specified by visual attributes and constraints. When the visual content of a repository or database is not indexed textually, customary textual search methods cannot be applied directly.

Semantic Visual Templates (SVT)

[0013] A semantic visual template is a set of visual templates associated with a particular semantic. The notion of an SVT has certain key properties, as follows.

[0014] Semantic visual templates are general in nature. For a given concept, there should be a set of visual templates that covers the concept well. Examples of successful SVTs are Sunset, High Jump and Downhill Skiing.

[0015] The semantic visual template for a concept is small, yet it covers a large percentage of the relevant images and videos in the collection, for high precision-recall performance.

[0016] The procedure for finding semantic visual templates for different concepts is systematic, efficient and robust. Efficiency refers to convergence to a small set of visual templates. Robustness is demonstrated by applying a new library of templates to new image and video collections.

[0017] In connection with VideoQ, a semantic visual template can be further understood as a set of icons or example scenes/objects that represent the semantic with which the template is associated. Feature vectors for querying are extracted from the semantic visual template. The icons are animated sketches. In VideoQ, the features associated with each object, and the objects' spatial and temporal relationships, are important. Histograms, texture and structural information are examples of global features that can be part of such a template. The choice between an icon-based realization and a feature vector formed entirely from global features depends on the semantic to be represented. For example, a sunset scene may be adequately represented by a pair of objects, whereas a waterfall or a crowd is better represented by a set of global features. Each template thus contains multiple icons, i.e., example scenes/objects, and the elements of this set may overlap in their coverage. Desirably, coverage is maximized with a minimal set of templates.

[0018] Each icon for a concept, e.g., downhill skiing, sunset or a beach crowd, is a visual representation made up of graphic objects that resemble the actual objects in a scene. Each object is associated with a set of visual attributes, e.g., color, shape, texture and motion. The relevance of each attribute and of each object to the concept is also specified. For example, for "sunset", the color and spatial structure of objects such as the sun and the sky are more relevant. For a sunset scene, the sun object may be optional, because there are sunset videos in which the sun is not visible. For the concept "high jump", the motion attribute of the foreground object is mandatory and the texture attribute of the background is not, and both are more relevant than the other attributes. Some concepts need just one object to represent the global attributes of the scene.

[0019] FIG. 5 shows several possible icons for "high jump", and FIG. 6 shows several possible icons for "sunset". The optimal set of icons should be chosen on the basis of relevance feedback and maximal coverage in terms of recall, as described in more detail below.

[0020] An efficient technique has been devised for generating semantic visual templates for various concepts. Each semantic concept has a few representative visual templates, which are used to retrieve a significant portion of the relevant images and videos from the repository, i.e., to achieve positive coverage or high recall. The positive coverage sets of different visual templates may overlap. The objective is therefore to find a small set of visual templates whose positive coverage is large and minimally overlapping.

[0021] The user can supply the initial conditions for effective visual templates. For example, the user can use a yellow circle (foreground) and a light-red rectangle (background) as an initial template for retrieving sunset scenes. By answering interactive questions, the user can also indicate the weights and relevance of the different objects and attributes, as well as any requirements relating to context. The questions adapt to the current query that the user has sketched, e.g., on a sketchpad.

[0022] Given the initial visual template and the relevance of all visual attributes in the template, the search system returns a set of the most similar images/videos to the user. Given the returned results, the user can provide a subjective evaluation of them. The precision of the results and their positive coverage, i.e., recall, can then be computed.
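The precision and recall mentioned above can be computed from the user's relevance labels. The following is a minimal sketch, assuming that returned items and relevant items are available as sets of identifiers; the function name `precision_recall` and the set-based representation are illustrative assumptions, and computing recall presupposes that the relevant items in the repository are known or estimated.

```python
from typing import Hashable, Set, Tuple


def precision_recall(
    returned: Set[Hashable], relevant: Set[Hashable]
) -> Tuple[float, float]:
    """Return (precision, recall) for one query result."""
    hits = len(returned & relevant)  # returned items the user marked relevant
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy usage: 4 items returned, 5 relevant items in the repository, 2 in common.
p, r = precision_recall({1, 2, 3, 4}, {2, 4, 6, 8, 10})
```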

[0023] The system determines the best strategy for altering the initial visual query and generates modified queries based on:
1. the relevance of each visual attribute, obtained from the user's answers to the questions;
2. the precision-recall performance of the previous query; and
3. information about the feature-level distribution of the images and videos in the repository.

[0024] These features are realized in a technique illustrated conceptually by FIG. 1, which shows a specific representation of a query for the concept "high jump". The query includes three objects: two stationary rectangular background fields and one object that moves to the lower right. For each object in the query, four qualities with associated weights are specified, e.g., color, texture, shape and size, shown as vertical bars in FIG. 1. A new query can be formed by stepping at least one of these qualities, at which point the user can be asked to decide whether the result is plausible enough to be included as an icon in the template. Once a suitable number of icons has been assembled into a tentative template, the template can be used for a database search. The results of the search can be evaluated for recall and precision. If they are acceptable, the template can be stored as the semantic visual template for "high jump".

Template Metric

[0025] The basic unit of video data is the video shot, which comprises multiple segmented video objects. The lifetime of any particular video object is equal to or shorter than the duration of the video shot. A similarity measure D between a member of the SVT set and a video shot is defined by

$$D = \min\Big\{\, \omega_f \sum_i d_f(O_i, O'_i) + \omega_s\, d_s \Big\} \tag{1}$$

where $O_i$ are the objects specified in the template, $O'_i$ are the objects matched to $O_i$, $d_f$ is the feature distance between its arguments, $d_s$ is the dissimilarity between the spatio-temporal structure of the template and that among the matched objects in the video shot, and $\omega_f$ and $\omega_s$ are normalized weights for the feature distance and the structure dissimilarity. The query procedure generates a candidate list for each object in the query. The distance D is then the minimum over all possible sets of matched objects that satisfy the spatio-temporal constraints. For example, if the semantic template has three objects and two candidate objects are kept for each single-object query, at most eight candidate sets of objects are considered in computing the minimal distance of Equation (1).
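The minimization in Equation (1) can be sketched as an exhaustive search over one candidate per template object. This is an illustrative sketch only: the function name `template_distance`, the callables `feature_dist`, `structure_dist` and `satisfies_constraints`, and the scalar toy features are assumptions made for the example, not details given in the text.

```python
from itertools import product
from typing import Callable, Sequence


def template_distance(
    template_objects: Sequence,
    candidate_lists: Sequence[Sequence],
    feature_dist: Callable,
    structure_dist: Callable,
    w_f: float = 0.5,
    w_s: float = 0.5,
    satisfies_constraints: Callable = lambda combo: True,
) -> float:
    """Equation (1): minimum weighted distance over all admissible matched sets."""
    best = float("inf")
    # One candidate per template object; 3 objects x 2 candidates gives 2**3 = 8 sets.
    for combo in product(*candidate_lists):
        if not satisfies_constraints(combo):
            continue  # skip sets that violate the spatio-temporal constraints
        feature_term = sum(
            feature_dist(o, m) for o, m in zip(template_objects, combo)
        )
        d = w_f * feature_term + w_s * structure_dist(template_objects, combo)
        best = min(best, d)
    return best


# Toy data: each object is reduced to one scalar feature (an assumption).
template = [0.0, 1.0, 2.0]
candidates = [[0.1, 0.5], [1.2, 0.9], [2.0, 3.0]]
D = template_distance(
    template,
    candidates,
    feature_dist=lambda a, b: abs(a - b),
    structure_dist=lambda t, c: 0.0,  # structure term ignored in this toy case
    w_f=1.0,
    w_s=1.0,
)
```

With three template objects and two candidates each, the loop visits the eight candidate sets mentioned in the text; the best set here pairs each template object with its nearest candidate.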

[0026] Given N objects in the query, a search would in principle have to be carried out over all sets of N objects that appear together in a video shot. To save computation, however, the following more economical procedure can be adopted:
1. Each template object O_i is used to query the database of all objects, yielding a list of matched objects that can be kept short by means of a threshold. Only the objects in this list are then considered as candidate matches for O_i.
2. The candidate objects in the lists are then joined, resulting in the final set of matched objects, on which the spatio-temporal structural relationships are verified.
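The two-step procedure above can be sketched as a thresholded per-object query followed by a join. The sketch below is an assumption-laden illustration: `VideoObject`, `candidate_lists` and `join_candidates` are hypothetical names, each object carries a single scalar feature in place of a real feature vector, and the structural check is passed in as a callable.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class VideoObject:
    shot_id: int    # shot in which the object appears
    feature: float  # stand-in for a real feature vector


def candidate_lists(
    template_features: Sequence[float],
    database: Sequence[VideoObject],
    threshold: float,
) -> List[List[VideoObject]]:
    """Step 1: query the object database once per template object, keep close matches."""
    return [
        [obj for obj in database if abs(t - obj.feature) <= threshold]
        for t in template_features
    ]


def join_candidates(
    lists: Sequence[Sequence[VideoObject]],
    structure_ok: Callable[[Tuple[VideoObject, ...]], bool] = lambda combo: True,
) -> List[Tuple[VideoObject, ...]]:
    """Step 2: join the lists, keeping sets that share a shot and pass the structure check."""
    matched = []
    for combo in product(*lists):
        same_shot = len({obj.shot_id for obj in combo}) == 1
        if same_shot and structure_ok(combo):
            matched.append(combo)
    return matched


# Toy usage: two template objects, four database objects in two shots.
db = [VideoObject(1, 0.1), VideoObject(1, 1.1), VideoObject(2, 0.05), VideoObject(2, 5.0)]
lists = candidate_lists([0.0, 1.0], db, threshold=0.2)
matches = join_candidates(lists)
```

The threshold keeps each list short, so the join examines far fewer combinations than a search over every set of N co-occurring objects.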

Template Generation

[0027] Two-way interaction is used between the user and the system that generates the templates. Given an initial scenario, and using relevance feedback, the technique converges on a small set of icons that gives maximum recall. The user provides an initial query as a sketch of the concept for which a template is to be generated; the sketch consists of objects with spatial and temporal constraints. The user can also specify whether each object is mandatory. Each object has features to which the user assigns relevance weights.

[0028] The initial query can be regarded as a point in a high-dimensional feature space into which all videos in the database can be mapped. To generate a set of test icons automatically, the space is quantized, and jumps are then made along each of the features of each of the objects. For quantization, the step size can be determined from the weights that the user specifies with the initial query; a weight can be regarded as a measure of the degree of relevance that the user attributes to a feature of an object. Accordingly, a low weight results in coarse quantization and vice versa, e.g., when

$$\Delta(\omega) = \frac{1}{a\,\omega + b} \tag{2}$$

where $\Delta$ is the jump distance for the feature, $\omega$ is the weight associated with the feature, and $a$ and $b$ are parameters chosen such that $\Delta(0) = 1$ and $\Delta(1) = d_0$. Here $d_0$ is a system parameter related to the threshold, set to 0.2 in a typical system. Using the jump distance, the feature space is quantized into hyper-rectangles. For color, for example, cuboids can be generated using the metric for the LUV color space together with $\Delta(\omega)$.
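The two boundary conditions fix the parameters of Equation (2): Δ(0) = 1/b = 1 gives b = 1, and Δ(1) = 1/(a + b) = d₀ gives a = 1/d₀ − 1, so a = 4 for d₀ = 0.2. The sketch below works this out and lays a grid along one feature axis. It assumes the feature is normalized to the interval [0, 1]; the function names and the grid construction are illustrative, not taken from the text.

```python
import math
from typing import List, Tuple


def jump_parameters(d0: float = 0.2) -> Tuple[float, float]:
    """Solve Delta(0) = 1 and Delta(1) = d0 for the parameters a and b."""
    b = 1.0             # Delta(0) = 1/b = 1
    a = 1.0 / d0 - b    # Delta(1) = 1/(a + b) = d0
    return a, b


def jump_distance(weight: float, d0: float = 0.2) -> float:
    """Equation (2): jump distance for a feature with the given weight."""
    a, b = jump_parameters(d0)
    return 1.0 / (a * weight + b)


def quantize_axis(
    center: float, weight: float, lo: float = 0.0, hi: float = 1.0, d0: float = 0.2
) -> List[float]:
    """Grid points on [lo, hi] reachable from `center` in steps of Delta(weight)."""
    step = jump_distance(weight, d0)
    k = math.ceil((lo - center) / step)  # most negative step still inside [lo, hi]
    points = []
    while center + k * step <= hi + 1e-12:
        points.append(center + k * step)
        k += 1
    return points


fine = quantize_axis(0.5, weight=1.0)    # high weight: step 0.2, several grid points
coarse = quantize_axis(0.5, weight=0.0)  # zero weight: step 1.0, only the query point
```

A heavily weighted feature is therefore explored in small steps around the query point, while a feature with zero weight is left at the user's original value.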

[0029] To prevent the total number of possible icons from growing rapidly, joint variation of the features is avoided, e.g., as follows:
1. For each feature of the object, the user picks a plausible set of values for that feature.
2. The system then performs a join on the sets of features associated with the object.
3. The user then picks the joins that best represent variations of the object.
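The three steps above can be sketched as a Cartesian product over the user's per-feature selections, followed by a user filter. This is a minimal sketch under stated assumptions: the name `join_feature_sets`, the dictionary layout and the feature values are hypothetical, and the user's choice in step 3 is simulated by a predicate.

```python
from itertools import product
from typing import Dict, List, Sequence


def join_feature_sets(plausible: Dict[str, Sequence]) -> List[Dict[str, object]]:
    """Step 2: join the plausible per-feature value sets of one object."""
    names = list(plausible)
    return [
        dict(zip(names, values))
        for values in product(*(plausible[n] for n in names))
    ]


# Step 1: the user has already narrowed each feature to a few plausible values.
plausible_sun = {
    "color": ["yellow", "orange"],
    "size": ["small", "large"],
    "shape": ["circle"],
}
joins = join_feature_sets(plausible_sun)

# Step 3: the user keeps only the joins that look like variations of the object
# (simulated here by a simple predicate).
picked = [j for j in joins if not (j["color"] == "yellow" and j["size"] == "large")]
```

Because each feature is narrowed before the join, the number of icons to review is the product of the small per-feature sets rather than of the full quantized axes.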

[0030] In the case of multiple objects, an additional join over the candidate lists of the individual objects can be included in step 2. Once a list of plausible scenes has been generated, the system is queried using the icons that the user has picked. Using relevance feedback on the returned results, with the user labeling each returned result as positive or negative, the icons that yield maximum recall are determined.

Concept Covers

[0031] For a user to search a database, sufficiently many covers of a concept are needed. Each cover can belong to a different feature space; for example, "sunset" may be described at the object level as well as at the global level. An object-level description may be a collection of two objects, such as the sky and the sun. These objects may be further quantified using feature-level descriptions.

[0032] As shown in FIG. 2, a concept (e.g., sunset) has two different kinds of conditions, necessary (N) and sufficient (S). A semantic visual template is a sufficient condition for the concept, not a necessary one, so a particular SVT need not cover the concept to its full extent. Additional templates may be generated manually, i.e., as additional queries entered by the user, but that task would have to be carried out for each concept. If necessary conditions can be imposed on a concept, additional templates can be generated automatically, given an initial query template.

[0033] The user interacts with the system through a "concept questionnaire" to specify the necessary conditions for the semantic being searched for. These conditions may be global, e.g., the global color distribution, relative spatial and temporal relationships, and so on. Once the necessary and sufficient conditions for the concept have been established, the system moves through the feature space and generates additional templates, with the user's original template as the starting point. The generation is also modified by the relevance feedback that the user gives to the system. By analyzing the relevance feedback, new rules pertaining to the necessary conditions can be determined, and these can be used to modify the template generation procedure further. The rules are generated by finding correlations, among the videos that the user has marked as relevant, between the conditions deemed necessary for the concept. This principle of determining rules (or associations) is akin to the "data mining" described in the above-identified patent applications and to the techniques found in S. Brin et al., "Dynamic Itemset Counting and Implication Rules for Market Basket Data", ACM SIGMOD Conference on Management of Data, 1997, pp. 255-264, and S. Brin et al., "Beyond Market Baskets: Generalizing Association Rules to Correlations", ACM SIGMOD Conference on Management of Data, 1997, pp. 265-276.

Rule Generation Example

[0034] Consider a query for "crowds" in VideoQ, given in the form of a sketch. The user specifies a visual query with an object and assigns weights for color and size, but cannot give a more detailed description in terms of the composition (of the crowd) or the relative spatial and temporal movements that characterize the concept of a crowd. However, because the user feels that the concept of "crowd" is strongly characterized by composition and by the relative spatial and temporal arrangement of people, the user lists these as necessary conditions.

[0035] Through the feedback process, the system identifies the video clips that are relevant to the concept the user is interested in. Because the system now knows that composition and spatio-temporal arrangement are necessary to the concept, it tries to determine consistent patterns among the features deemed necessary across the relevant videos. These patterns are then returned to the user, who is asked whether they are consistent with the concept being sought. If the user accepts the patterns as consistent with the concept, they are used to generate new query templates, as shown in FIG. 3. Including such new rules has a twofold effect on query template generation: it improves the speed of the search, and it increases the precision of the returned results.
Attempts to determine a matching pattern among the features deemed necessary in the relevant video. The patterns are then returned to the user asking if they match the concept they are looking for. If the user accepts these patterns as consistent with the concept, they are used to generate a new query template as shown in FIG. Including this new rule has a dual effect on query template generation, i.e., improves search speed and improves the accuracy of returned results.

Generating Concept Covers

[0036] The query defines the feature space in which the search is carried out. The feature space is defined by the attributes of the visual template and their associated weights. Specifically, the attributes define the axes of the feature space, and the associated weights stretch or compress the corresponding axes. In this composite feature space, each video shot can be represented as a point. A visual template covers a portion of this space. Because visual templates may differ in their features and in their character (global versus object level), the spaces defined by the templates differ and do not overlap.
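One way to picture weights that stretch or compress feature axes is a weighted distance between two points in the feature space. The text does not name a metric, so the weighted Euclidean distance below is an assumption chosen for illustration, as are the function name and the two-feature toy vectors.

```python
import math
from typing import Sequence


def weighted_distance(
    x: Sequence[float], y: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted Euclidean distance (assumed metric): a larger weight stretches its axis."""
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))


# Toy usage: two features per shot, e.g. (color, size) after normalization.
query, shot = [0.0, 0.0], [3.0, 4.0]
d_equal = weighted_distance(query, shot, [1.0, 1.0])       # both axes count
d_color_only = weighted_distance(query, shot, [1.0, 0.0])  # second axis collapsed
```

Setting a weight to zero removes that attribute from the comparison entirely, which corresponds to a fully compressed axis.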

[0037] A particular selection of a few features may be insufficient to determine a concept, yet the concept may be represented well enough by, e.g., a different, suitable choice of weights. In this way, a concept can be mapped into a feature space.

[0038] A concept is not limited to a single feature space or to a single cluster. For example, for the class of sunset video sequences, a sunset cannot be completely characterized by a single color or a single shape. It is therefore important to determine not only the global static features and weights relevant to a concept, but also those features and weights that may vary.

【００３９】The search for a concept starts by specifying certain global constants. A context questionnaire determines the number of objects in the search and the global features required for each object. These are the constraints on the search process that do not vary.

【００４０】The user provides an initial query that specifies features and sets weights. A set intersection is taken with the set of necessary conditions defined by the user. The necessary conditions remain unchanged. The template is varied by changing those features deemed sufficient. If the sets do not intersect, rules characterizing the concept are derived from the necessary conditions and relevance feedback.

【００４１】The weight associated with each feature indicates the tolerance that the user wants along that feature. This tolerance is mapped to a distance threshold along each feature, e.g., d(ω) = 1/(a·ω + c), which defines a hyper-ellipsoid in the feature space being searched. The threshold determines the number of non-overlapping ranges possible. The number of ranges determines the size and number of jumps along that particular feature. The algorithm performs a breadth-first search and is governed by three criteria.
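As a rough illustration of the weight-to-threshold mapping above, the following Python sketch computes d(ω) = 1/(a·ω + c) and derives a count of non-overlapping ranges along one feature axis. The constants a and c, the normalization of each feature to the interval [0, 1], and the rule that each range spans 2·d are assumptions made for illustration only; the text does not fix them.

```python
import math

def distance_threshold(weight, a=10.0, c=2.0):
    """Map a feature weight to a distance threshold d(w) = 1 / (a*w + c).

    a and c are illustrative constants (assumed, not given in the text).
    """
    return 1.0 / (a * weight + c)

def non_overlapping_ranges(weight, axis_length=1.0, a=10.0, c=2.0):
    """Count ranges of width 2*d that fit along one axis (assumed tiling rule)."""
    d = distance_threshold(weight, a, c)
    return max(1, math.floor(axis_length / (2.0 * d)))

def jump_size(weight, axis_length=1.0, a=10.0, c=2.0):
    """Size of one jump along the axis, taking one range width per jump."""
    return axis_length / non_overlapping_ranges(weight, axis_length, a, c)
```

Under these assumptions a larger weight gives a smaller threshold, more ranges, and therefore smaller jumps along that feature.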

【００４２】First, the algorithm is greedy and moves in the direction of increasing recall: it computes all possible initial jumps, converts each jump into a corresponding visual template, executes the queries, collates all results, shows the results to the user for relevance feedback, and selects those results that maximize incremental recall as possible starting points for subsequent queries.

【００４３】Second, a logarithmic search is used: subsequent queries are obtained by taking smaller jumps within the local region. The rationale is that the current query point produces good results, so its neighborhood should be searched carefully for other templates. The search stops when enough visual templates have been generated to cover about 70% of the concept (i.e., recall of 70% or more).

【００４４】Third, because a breadth-first search often yields too many possibilities to explore at once, feature-level distributions are used to guide the search. The distribution along each feature is computed in advance. This information is used to select jumps toward regions with a high concentration of video shots.
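The first criterion and the 70% stopping rule can be sketched as a greedy selection over candidate templates. This is a minimal sketch, not the patented procedure: the user's relevance feedback is simulated by precomputed sets of shot ids, and the function and variable names are hypothetical.

```python
def select_templates(candidates, relevant, target_recall=0.7):
    """Greedily pick candidate templates that add the most new relevant shots.

    candidates: dict mapping a template name to the set of shot ids it returns
                (hypothetical query results).
    relevant:   non-empty set of shot ids the user considers part of the concept.
    """
    covered, chosen = set(), []
    remaining = dict(candidates)
    while remaining and len(covered) / len(relevant) < target_recall:
        # Incremental recall: relevant shots not already covered.
        name, hits = max(remaining.items(),
                         key=lambda kv: len((kv[1] & relevant) - covered))
        gain = (hits & relevant) - covered
        if not gain:  # no candidate improves recall, so stop early
            break
        covered |= gain
        chosen.append(name)
        del remaining[name]
    return chosen, len(covered) / len(relevant)
```

For example, with ten relevant shots and candidates covering {0,1,2,3}, {2,3,4}, {5,6,7,8}, and {9}, the sketch picks the first and third candidates and stops at a recall of 0.8.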

Language Integration with SVT

【００４５】Known text-based queries on images and video rely on matching keywords that accompany the image or video. Keywords accompanying the data can be generated manually or obtained by association, i.e., extracted from accompanying text (for images) or from captions accompanying the video.

【００４６】Such an approach rules out practical systems with very large video or image databases, for the following reasons. It is not feasible to generate annotations manually for an existing video database. Many videos have no captions. There is no immediate correlation between the accompanying captions and the video. For example, during a baseball game a commentator may talk about the feats of Babe Ruth, who is not present in the game being played, so a text-based keyword search for videos containing “Babe Ruth” would incorrectly return this video.

【００４７】Generating the semantic content of a video by analyzing the video stream alone runs into computer vision problems that are known to be difficult. A more practical approach is to use the descriptive power of natural language together with visual content such as object motion and attributes such as color and texture.

【００４８】The user types in a string, which the system parses into a video model. VideoQ provides a “language” for entering a query by sketch. As shown in Table 1, there is a simple correspondence between what exists in VideoQ and its natural language counterparts.

Table 1
Attribute            NL type
Motion               Verb
Color, texture       Adjective
Shape                Noun
Spatial/temporal     Preposition/conjunction

【００４９】A constrained language with a restricted set of permitted words can be used. A sentence is parsed into classes such as nouns, verbs, adjectives, and adverbs to generate a motion model of the video sequence.

【００５０】For example, when parsing the sentence “Bill walked slowly toward the sunset,” the system can parse it as shown in Table 2.

Table 2
Word       NL type
walked     Verb
slowly     Adverb
toward     Preposition
sunset     Noun
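The parse of the example sentence can be sketched with a small lookup table. This is only an illustration: the lexicon entries, the stop-word list, and the function name are hypothetical, and the mapping from NL type to visual attribute follows Table 1, which does not list adverbs.

```python
# Tiny illustrative lexicon (hypothetical entries, not the patent's database).
LEXICON = {
    "walked": "verb",
    "slowly": "adverb",
    "toward": "preposition",
    "sunset": "noun",
}

# Correspondence between NL type and visual attribute, following Table 1.
ATTRIBUTE_FOR = {
    "verb": "motion",
    "adjective": "color/texture",
    "noun": "shape",
    "preposition": "spatial/temporal",
    "conjunction": "spatial/temporal",
}

STOP_WORDS = {"the", "a", "an"}  # assumed to be ignored by the parser

def tag_sentence(sentence):
    """Return (word, NL type, visual attribute) triples; unknown parts are None."""
    triples = []
    for raw in sentence.lower().split():
        word = raw.strip(".,!?\"'")
        if not word or word in STOP_WORDS:
            continue
        nl_type = LEXICON.get(word)
        triples.append((word, nl_type, ATTRIBUTE_FOR.get(nl_type)))
    return triples
```

Calling `tag_sentence("Bill walked slowly toward the sunset.")` leaves "bill" untyped, which corresponds to the case where the system must ask the user how to classify a word.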

【００５１】For verbs, adverbs, adjectives, and prepositions, which are modifiers (or descriptors) of nouns (objects), a small but fixed database can be used. The database of nouns (i.e., scenarios/objects) initially contains about 100 scenes and can be expanded through interaction with the user.

【００５２】Each object can have a shape description that is modified by various modifiers such as adjectives (color, texture), verbs (walked), and adverbs (slowly). The object is inserted into the VideoQ palette, where it can be refined.

【００５３】If the parser encounters a word that is not in its modifier databases (i.e., the databases corresponding to verbs, adverbs, prepositions, and adjectives), it looks up synonyms, determines whether any synonym of the word is present in the databases, and uses it instead. If none is present, the parser returns a message indicating an invalid string.
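The synonym fallback described above might look like the following sketch. The modifier database, the thesaurus entries, and the use of an exception to signal an invalid string are all assumptions for illustration.

```python
# Hypothetical modifier database and thesaurus, for illustration only.
MODIFIERS = {"walked": "verb", "slowly": "adverb", "red": "adjective"}
THESAURUS = {"strolled": ["ambled", "walked"], "crimson": ["red"]}

def resolve_modifier(word):
    """Return (known_word, NL type), falling back to a known synonym.

    Raises ValueError to signal an invalid string when nothing matches.
    """
    if word in MODIFIERS:
        return word, MODIFIERS[word]
    for synonym in THESAURUS.get(word, []):
        if synonym in MODIFIERS:
            return synonym, MODIFIERS[synonym]
    raise ValueError(f"invalid string: unknown word {word!r}")
```

Here `resolve_modifier("strolled")` falls back to "walked", while a word with no known synonym raises the error.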

【００５４】If the parser encounters a word that it cannot classify, the user must either modify the text or, if the word is a noun (such as “Bill”), tell the system its class (in this example, noun) and additionally indicate that the word refers to a human. If the user indicates a noun that is not in the system's database, the user is asked to draw the object in the sketchpad so that the system can learn about it. Because attributes such as motion, color, texture, and shape can be generated at the object level in the database, matching can be performed at that level.

【００５５】As shown in FIG. 4, the audio stream accompanying a video can serve as another source of information. When the audio correlates closely with the video, it can be the single most important source of the video's semantic content. A set of keywords can be generated from the audio stream, for example 10 to 20 keywords per video sequence. Search at the keyword level can then be combined with search at the model level. Videos that match at both the keyword (semantic) level and the motion-model level are ranked highest.
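One way to combine the two levels is sketched below. The text states only that videos matching at both levels rank highest; the tie-breaking order, the score range, and all names here are assumptions.

```python
def rank_videos(videos, query_keywords, model_scores):
    """Rank videos so that matches at both levels come first.

    videos:         dict of video id -> set of keywords from its audio stream
    query_keywords: set of keywords taken from the user's query
    model_scores:   dict of video id -> motion-model similarity in [0, 1]
    """
    def key(video_id):
        keyword_hit = bool(videos[video_id] & query_keywords)
        model_score = model_scores.get(video_id, 0.0)
        both = keyword_hit and model_score > 0.0
        # Sort by: both levels matched, then model score, then keyword hit.
        return (both, model_score, keyword_hit)
    return sorted(videos, key=key, reverse=True)
```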

Examples

【００５６】Semantic visual template for retrieving video clips of slalom skiers:
1. The system asks questions about the context and the user answers. The semantic visual template is labeled “slalom”. The query is specified as object-based, containing two objects.
2. The user sketches the initial query. A large blank background represents the ski slope, and a small foreground object with a characteristic zigzag motion trajectory represents the skier.
3. The highest relevance weights are assigned to all features associated with the background object and the skier. The background features are specified as static, while the skier features may vary during template generation.
4. The system generates a set of test icons from which the user can select feature variations in the skier's color and motion trajectory.
5. The four selected colors and three selected motion trajectories are combined to form 12 possible skiers. The list of skiers is combined with the single background to give the 12 icons of FIG. 7, in which each group of three adjacent icons has the same color.
6. The user selects a set of candidates with which to query the system. The system retrieves the 20 closest video clips. The user provides relevance feedback, guiding the system to a small set of slalom skiers.
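Step 5 amounts to a Cartesian product of the selected feature values. The sketch below shows the count and the grouping by color; the specific color and trajectory names are hypothetical placeholders, since the text does not list them.

```python
from itertools import product

colors = ["red", "blue", "yellow", "green"]                        # hypothetical
trajectories = ["tight zigzag", "wide zigzag", "diagonal zigzag"]  # hypothetical
background = "blank ski slope"

# Colors vary slowest, so each run of three adjacent icons shares one color.
icons = [
    {"background": background, "color": c, "trajectory": t}
    for c, t in product(colors, trajectories)
]
```

With four colors and three trajectories the list holds 4 × 3 = 12 icons, all sharing the single background.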

【００５７】Sunset. A database of 1,952 video clips containing 72 sunsets was used. Using only the initial sketch, without a semantic visual template, recall was 10% and precision was 35%. With the semantic visual template, 8 icons were generated, and with these icons 36 sunsets were found, for a recall of 50% and a precision of 24%.

【００５８】High jumper. The database contains 9 high jumpers among 2,589 video clips. Without a semantic visual template, recall was 44% and precision was 20%. With the semantic visual template, recall improved to 56% and precision improved to 25%. The system converged on a single icon that differed from the initial sketch provided by the user.
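The reported percentages can be checked with the usual definitions of recall and precision. Only the totals (72 sunsets, 36 found, 9 high jumpers) come from the text; the retrieved counts and the hit counts of 4 and 5 are inferred here from the stated percentages and should be read as assumptions.

```python
def recall(relevant_retrieved, relevant_total):
    """Fraction of all relevant clips that were retrieved."""
    return relevant_retrieved / relevant_total

def precision(relevant_retrieved, retrieved):
    """Fraction of retrieved clips that are relevant."""
    return relevant_retrieved / retrieved

# Sunset with templates: 36 of 72 sunsets found (stated in the text).
sunset_recall = recall(36, 72)
# Number of retrieved clips implied by 24% precision (inferred, not stated).
sunset_retrieved = round(36 / 0.24)

# High jumper: 4 and 5 hits among 20 returned clips are inferred values.
high_jump_before = (recall(4, 9), precision(4, 20))
high_jump_after = (recall(5, 9), precision(5, 20))
```

Under these inferred counts, 4 of 9 gives about 44% recall and 5 of 9 about 56%, in line with the figures reported for the high jumper.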

[Brief description of the drawings]

FIG. 1 is an illustration of an interactive technique for generating a library or collection of semantic visual templates according to a preferred embodiment of the present invention.

FIG. 2 is an explanatory diagram showing a concept having necessary and sufficient conditions.

FIG. 3 is an explanatory diagram illustrating query generation.

FIG. 4 is an explanatory diagram of an interactive system, including audio processing, according to another preferred embodiment of the present invention.

FIG. 5 is an illustration of an initial sketch and a set of icons exemplifying the concept “high jump”.

FIG. 6 is an illustration of an initial sketch and a set of icons exemplifying the concept “sunset”.

FIG. 7 is an explanatory diagram showing a semantic visual template for the concept “slalom”.

Continuation of front page
(72) Inventor: Shih-Fu Chang, 560 Riverside Drive, Apartment 18K, New York, New York 10027, United States
(72) Inventor: William Chen, 423 West 112th Street, Apartment 34A, New York, New York 10027, United States
(72) Inventor: Hari Sundaram, 434 West 120th Street, Apartment 9D, New York, New York 10027, United States
F-term (reference): 5B075 ND06 NK02 NK46 PP02 PP13 PQ02 QM08; 5C052 AB04 AC08 DD04 EE03

Claims

1. A computerized method for generating a visual template for a concept, comprising: a. obtaining at least one initial query for the concept; b. generating at least one additional query related to the initial query; c. testing the additional query for relevance to the concept; and d. including the additional query in the visual template for the concept if it is appropriate.

2. The computerized method for generating a visual template for a concept according to claim 1, wherein each query is an icon / example image.

3. The method according to claim 1, wherein the initial query is obtained via a sketchpad.

4. The computerized method of claim 1, wherein generating the additional query comprises stepping a query feature with a step size inversely related to a weight associated with the query feature.

5. The computerized method for generating a visual template for a concept according to claim 1, wherein generating an additional icon comprises forming a join of plausible feature values.

6. The computerized method for generating a visual template for a concept according to claim 1, wherein appropriateness is confirmed through interaction with the user.

7. A computerized method for querying a concept video database using a subset of natural language associated with semantic visual templates, comprising: a. obtaining a textual query; b. parsing the query to obtain visual attributes; c. using the visual attributes to form a visual query; d. retrieving information using the visual query; and e. displaying the information.

8. The computerized method for querying a concept video database according to claim 1, wherein said textual query is obtained from a keyboard.

9. The method of claim 7, wherein the natural language subset comprises a small set of nouns, verbs, prepositions, adjectives, and adverbs.

10. The method of claim 7, further comprising the step of interactively expanding the subset.

11. The computerized method for querying a concept video database, wherein the parsing step includes: (i) establishing a match between the query and the natural language subset; (ii) labeling different parts of the query as nouns, verbs, adjectives, or prepositions; and (iii) asking for clarification when a word of the query is not present in the natural language subset, and labeling the word accordingly.

12. The computerized method for querying a concept video database according to claim 7, wherein the step of forming a visual query comprises establishing a match between the natural language subset and a set of semantic visual templates generated by the method of claim 1, the semantic visual templates being visual examples of the nouns of the query, of adjectives that modify the visual examples of those nouns, of verbs that express actions, and of prepositions that establish the spatial and temporal order needed to form the visual query.

13. A computerized system for generating a visual template for a concept, comprising: a. means for obtaining at least one initial query for the concept; b. means for generating at least one additional query related to the initial query; c. means for testing the additional query for relevance to the concept; and d. means for including the additional query in the visual template for the concept if it is appropriate.

14. The computerized system for generating a visual template for a concept according to claim 13, wherein each query is represented by an icon / example image.

15. The computerized system for generating a concept visual template according to claim 13, wherein the means for obtaining the initial query comprises a sketchpad.

16. The computerized system for generating a visual template for a concept according to claim 13, wherein said means for generating an additional query comprises means for stepping a query feature with a step size inversely related to a weight associated with the query feature.

17. The computerized system for generating a visual template for a concept according to claim 13, wherein the means for generating additional icons comprises means for forming a join of plausible feature values.

18. The computerized system for generating a visual template for a concept according to claim 13, wherein appropriateness is confirmed through interaction between the system and a user.

19. A computerized system for querying a concept video database using a subset of natural language associated with semantic visual templates, comprising: a. means for obtaining a textual query; b. means for parsing the query to obtain visual attributes; c. means for using the visual attributes to form a visual query; d. means for retrieving information using the visual query; and e. means for displaying the information.

20. The computerized system of claim 19, wherein the means for obtaining the textual query comprises a keyboard.

21. The computerized system for querying a concept video database according to claim 19, wherein the natural language subset comprises a small set of nouns, verbs, prepositions, adjectives, and adverbs.

22. The computerized system of claim 19, further comprising means for interactively extending the subset.

23. The computerized system for querying a concept video database according to claim 19, wherein the parsing means includes: (i) means for establishing a match between the query and the natural language subset; (ii) means for labeling different parts of the query as nouns, verbs, adjectives, or prepositions; and (iii) means for asking for clarification when a word of the query is not present in the natural language subset, and for labeling the word accordingly.

24. The computerized system for querying a concept video database according to claim 19, wherein the means for forming a visual query comprises means for establishing a match between the natural language subset and a set of semantic visual templates generated according to claim 13, the semantic visual templates being visual examples of the nouns of the query, of adjectives that modify the visual examples of those nouns, of verbs that express actions, and of prepositions that establish the spatial and temporal order needed to form the visual query.