JP2007513398A

JP2007513398A - Method and apparatus for identifying high-level structure of program

Info

Publication number: JP2007513398A
Application number: JP2006530944A
Authority: JP
Inventors: Agnihotri, Lalitha; Dimitrova, Nevenka
Original assignee: Koninklijke Philips NV; Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 2003-09-30
Filing date: 2004-09-28
Publication date: 2007-05-24
Also published as: WO2005031609A1; KR20060089221A; EP1671246A1; US20070124678A1; CN1860480A

Abstract

An apparatus and method are provided for recovering the high-level structure of a program, such as a television or video program, using an unsupervised clustering algorithm together with human analysis. The method consists of three stages: a first stage, referred to herein as the text type clustering stage; a second stage, the genre/subgenre identification stage, in which the genre/subgenre type of the target program is detected; and a third and final stage, referred to herein as the structure recovery stage. The structure recovery stage uses a graphical model to represent the program structure. Once recovered, the high-level structure of the program may be used to recover further information including, but not limited to, temporal events, text events, and program events.

Description

The present invention relates generally to the field of video analysis, and more particularly to identifying the high-level structure of a program, such as a television or video program, using classifiers for the appearance of the various types of video text that occur in the program.

As video becomes more widespread, more efficient methods of analyzing its content become increasingly necessary and important. Video inherently involves a complexity and a volume of data that make analysis difficult. An important analysis is understanding the high-level structure of the video, which can provide the basis for further, more detailed analysis.

Several analysis methods are known. See Yeung et al., "Video Browsing using Clustering and Scene Transitions on Compressed Sequences" (Multimedia Computing and Networking 1995, Vol. SPIE 2417, pp. 399-413, February 1995); Yeung et al., "Time-constrained Clustering for Segmentation of Video into Story Units" (ICPR, Vol. C, pp. 375-380, August 1996); Zhong et al., "Clustering Methods for Video Browsing and Annotation" (SPIE Conference on Storage and Retrieval for Image and Video Databases, Vol. 2670, February 1996); Chen et al., "ViBE: A New Paradigm for Video Database Browsing and Search" (Proc. IEEE Workshop on Content-Based Access of Image and Video Databases, 1998); and Gong et al., "Automatic Parsing of TV Soccer Programs" (Proceedings of the International Conference on Multimedia Computing and Systems (ICMCS), May 1995).

Gong et al. describe a system that uses domain knowledge and domain-specific models for the structural analysis of soccer video. As in other prior-art systems, the video is first segmented into shots. A shot is defined as all frames between a shutter opening and closing. Spatial features (playing field lines) extracted from the frames within each shot are used to classify each shot into categories such as penalty area, midfield, corner area, corner kick, and shot at goal. Note that this work relies heavily on accurate segmentation of the video into shots before feature extraction. Moreover, individual shots are not representative of the events that occur in a soccer video.

Zhong et al. also describe a system for analyzing sports video. The system detects the boundaries of high-level semantic units, such as a pitch in baseball or a serve in tennis. Each semantic unit is further analyzed to extract interesting events, such as the number of strokes, the type of play, returns to the net, and baseline returns in tennis. A color-based adaptive filtering method is applied to the key frame of each shot to detect specific views. Complex features, such as edges and moving objects, are used to verify and refine the detection results. Note that this work, too, depends heavily on accurate segmentation of the video into shots before feature extraction. In short, Gong and Zhong both treat video as a concatenation of basic units, each of which is a shot. The resolution of the feature analysis is no finer than the shot level. The work is very detailed and relies heavily on color-based filtering to detect specific views. Furthermore, the system becomes useless if the color palette of the video changes.

In general, therefore, the prior art proceeds as follows. The video is first segmented into shots. Key frames are then extracted from each shot and grouped into scenes. Scene transition graphs and hierarchical trees are used to represent these data structures. The problem with these approaches is the mismatch between low-level shot information and high-level scene information. They work only when interesting changes in content coincide with shot changes.

In many applications, such as soccer video, interesting events such as "plays" cannot be defined by shot changes. Each play may contain multiple shots with similar color distributions. Transitions between plays are difficult to detect by simple frame clustering based only on shot features.

In many situations with substantial camera motion, the shot detection process tends to segment incorrectly, because this type of segmentation is derived from low-level features without considering the domain-specific high-level syntax and content model of the video. It is therefore difficult to bridge the gap between low-level and high-level features on the basis of shot-level segmentation. In addition, much information is lost during the shot segmentation process.

Videos from different domains have very different characteristics and structures. Domain knowledge can greatly simplify the analysis. In sports video, for example, there are usually a fixed number of cameras and views, camera control rules, and a transition syntax imposed by the rules of the game, such as play by play in soccer, serve by serve in tennis, and inning by inning in baseball.

Tan et al., in "Rapid estimation of camera motion from compressed video with application to video annotation" (IEEE Trans. on Circuits and Systems for Video Technology, 1999), and Zhang et al., in "Automatic Parsing and Indexing of News Video" (Multimedia Systems, Vol. 2, pp. 256-266, 1995), describe video analysis for news and baseball. For more complex and more varied video, however, very few systems take high-level structure into account.

In soccer video, for example, the problem is that a soccer game has a relatively loose structure compared with other video such as news and baseball. Apart from the play-by-play structure, the content flow is quite unpredictable and can occur randomly. Video of a soccer game contains a great deal of motion and many view changes. Solving this problem is useful for automatic content filtering for soccer fans and professionals.

The problem is more interesting in the broader context of video structure analysis and content understanding. The main structural issue is the temporal sequence of high-level video states, such as the play and break game states in a soccer game. It is desirable to parse a continuous video stream automatically into an alternating sequence of these two game states.

Most prior-art structural analysis methods focus on detecting domain-specific events. Analyzing structure independently of event detection has the following advantages. Typically, no more than 60% of the content corresponds to play, so a considerable reduction in information can be achieved by discarding the portions of the video that correspond to breaks. In addition, the content characteristics of play and break differ, so event detectors can be optimized using knowledge of the prior state.

Related work on structural analysis mostly concerns sports video analysis, including soccer and various other games, and general video segmentation. For soccer video, the prior art addresses shot classification. See the scene reconstruction by Gong cited above; Yow et al., "Analysis and Presentation of Soccer Highlights from Digital Video" (Proc. ACCV, 1995, December 1995); and the rule-based semantic classification of Tovinkere et al., "Detecting Semantic Events in Soccer Games: Towards A Complete Solution" (Proc. ICME 2001, August 2001).

Hidden Markov models (HMMs) have been used for general video classification and for distinguishing between various types of programs, such as news and commercials. See Huang et al., "Joint video scene segmentation and classification based on hidden Markov model" (Proc. ICME 2000, pp. 1551-1554, Vol. 3, July 2000).

Heuristic rules based on domain-specific features and the dominant color ratio have also been used to segment play and break. See Xu et al., "Algorithms and system for segmentation and structure analysis in soccer video" (Proc. ICME 2001, August 2001), and U.S. Patent Application No. 09/839,924, "Method and System for High-Level Structure Analysis and Event Detection in Domain Specific Videos", filed April 20, 2001 by Xu et al. Variations in these features, however, are difficult to quantify with explicit low-level decision rules.

A framework is therefore needed in which all the information in the low-level features of the video is retained and the feature sequences are better represented. It then becomes possible to incorporate domain-specific syntax and content models in order to identify high-level structure, enabling video classification and segmentation in terms of the high-level program structure rather than shots alone.

The main idea of the present invention is to identify the high-level structure of a program, such as a television or video program, using an unsupervised clustering algorithm in cooperation with a human analyst.

More specifically, the present invention provides an apparatus and method for automatically determining the high-level structure of a program such as a television or video program. The method consists of three stages: a first stage, referred to herein as the text type clustering stage; a second stage, the genre/subgenre identification stage, in which the genre/subgenre type of the target program is detected; and a third and final stage, referred to herein as the structure recovery stage. The structure recovery stage uses a graphical model to represent the program structure. The graphical model used for training can be a manually constructed Petri net, or a hidden Markov model constructed automatically using the Baum-Welch training algorithm. The Viterbi algorithm may be used to reveal the structure of the target program.

In the first stage (text type clustering), overlaid text is detected in the frames of a target program, such as a television or video program of interest to the user. For each line of text detected in the target program, various text features are extracted, such as position (row, column), height, font type, and color. A feature vector is formed from the text features extracted for each detected line of text. The feature vectors are then grouped into clusters using an unsupervised clustering technique. Each cluster is then labeled according to the type of text its feature vectors describe (nameplate, score, opening credits, and so on).

In the second stage (genre/subgenre identification), a training process is carried out in which training videos representing various genre/subgenre types are analyzed, using the method described for the first stage, to determine their respective cluster distributions. Once obtained, the cluster distributions serve as genre/subgenre identifiers for the various genre/subgenre types. For example, a comedy movie has one cluster distribution and a baseball game has a different one, yet each adequately characterizes its own genre/subgenre type. As a result of the training process, the genre/subgenre type of the target program may be determined by comparing its cluster distribution, obtained earlier in the first stage (text type clustering), with the cluster distributions of the various genre/subgenre types obtained in the second stage.
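
The comparison of cluster distributions described for the second stage can be illustrated with a minimal sketch. The text does not say how distributions are compared, so the L1 (sum of absolute differences) measure below is an assumption, as are the function names, the cluster labels, and the trained distributions in `GENRES`, which are made-up illustrative values.

```python
from collections import Counter

def cluster_distribution(labels):
    """Return the fraction of detected text lines that fall in each cluster."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}

def l1_distance(p, q):
    """Sum of absolute differences between two distributions (assumed measure)."""
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def identify_genre(target_labels, genre_distributions):
    """Pick the trained genre whose cluster distribution is closest to the target's."""
    target = cluster_distribution(target_labels)
    return min(genre_distributions,
               key=lambda g: l1_distance(target, genre_distributions[g]))

# Hypothetical distributions that a training pass might have produced.
GENRES = {
    "sports/baseball": {"score": 0.6, "nameplate": 0.3, "credits": 0.1},
    "movie/comedy": {"credits": 0.8, "title": 0.2},
}

# Cluster labels assigned to the text lines of a hypothetical target program.
target = ["score", "score", "nameplate", "score", "credits"]
print(identify_genre(target, GENRES))
```

With these illustrative values the target program is closest to the baseball distribution. Any other divergence between distributions could be substituted for `l1_distance` without changing the overall flow.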

In the third and final stage (high-level program structure recovery), the high-level structure of the target program is recovered by first building a database of higher-order graphical models, each of which graphically represents the flow of video text through programs of a given genre/subgenre type. Once the graphical model database has been built, a single graphical model is identified and retrieved from the stored models using the text detection results determined in step 140 and the cluster distribution results determined in step 160. The selected graphical model is used, together with the text detection and cluster information, to recover the high-level structure of the program.
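
As an illustration of how a selected model could yield program structure, the sketch below runs Viterbi decoding over a small hidden Markov model whose observations are text-type cluster labels. It assumes the selected graphical model is an HMM, which is one of the options mentioned above. The states, labels, and probabilities are invented for the example and are not taken from the source.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations."""
    if not observations:
        return []

    def log(p):
        # Log-probabilities avoid underflow; zero probability maps to -inf.
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: log(start_p.get(s, 0.0)) + log(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = []  # back[t][s]: predecessor of s on the best path at time t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states,
                       key=lambda r: best[-1][r] + log(trans_p[r].get(s, 0.0)))
            scores[s] = (best[-1][prev] + log(trans_p[prev].get(s, 0.0))
                         + log(emit_p[s].get(obs, 0.0)))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical model: program segments emit text-type cluster labels.
STATES = ["opening", "play", "closing"]
START = {"opening": 0.9, "play": 0.1, "closing": 0.0}
TRANS = {
    "opening": {"opening": 0.5, "play": 0.5},
    "play": {"play": 0.8, "closing": 0.2},
    "closing": {"closing": 1.0},
}
EMIT = {
    "opening": {"title": 0.8, "nameplate": 0.2},
    "play": {"score": 0.7, "nameplate": 0.3},
    "closing": {"credits": 1.0},
}

observed = ["title", "score", "score", "credits"]
print(viterbi(observed, STATES, START, TRANS, EMIT))
```

In practice the transition and emission tables would come from training (for example Baum-Welch, as mentioned above) rather than being written by hand.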

The high-level structure of a program, such as a video or television program, can be put to effective use in a wide range of applications including, but not limited to, searching for temporal events, text events, and/or program events in the target program, serving as a recommender, and building a multimedia summary of the target program.

The above features of the present invention will become more readily apparent and better understood by reference to the following detailed description of exemplary embodiments of the invention, taken together with the accompanying drawings.

In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the present invention, but the invention may be practiced without these specific details. In some instances, well-known structures and devices are shown in block diagram form rather than in detail, to avoid obscuring the invention. Furthermore, FIGS. 1-6, discussed below, and the various embodiments used in this patent document to explain the principles of the invention are illustrative only and should not be construed as limiting the scope of the invention.

The following description sets out a preferred embodiment of the present invention, which would ordinarily be implemented as a software program. Those skilled in the art will readily recognize that an equivalent of such software may also be constructed in hardware. Because video processing algorithms and systems are well known, this description is directed in particular to algorithms and systems that form part of, or cooperate more directly with, the system and method of the invention. Other aspects of such algorithms and systems, and of the hardware and/or software for producing and otherwise processing the video signals involved, that are not specifically shown or described here may be selected from such systems, algorithms, components, and elements known in the art. Given the system and method described below, software that is useful for implementing the invention but is not specifically shown, suggested, or described here is conventional and within the ordinary skill in the art.

Further, as used herein, the computer program may be stored on a computer-readable storage medium, which may comprise, for example, magnetic storage media such as a magnetic disk (such as a hard drive or a floppy disk) or magnetic tape; optical storage media such as an optical disc, optical tape, or machine-readable bar code; solid-state electronic storage devices such as random access memory (RAM) or read-only memory (ROM); or any other physical device or medium used to store a computer program.

The following description uses the terms defined below.

Genre/Subgenre
A genre is a kind, category, or type of literary or artistic work, and a subgenre is a category within a genre. One example of a genre is "sports", with subgenres such as basketball, baseball, soccer, and tennis. Another example is "movies", with subgenres such as comedy, tragedy, musical, and step. Other examples of genres include "news", "music show", "nature", "talk show", and "children's show".

Target Program
A target program is a video or television program of interest to an end user. It is provided as an input to the process of the present invention. Processing a target program in accordance with the principles of the invention makes the following possible: (1) allowing the end user to receive a multimedia summary of the target program; (2) recovering the high-level structure of the target program; (3) determining the genre/subgenre of the target program; (4) detecting predetermined content within the target program, which may be desired or undesired content; and (5) receiving information about the target program (that is, acting as a recommender).

Clustering
Clustering partitions vectors so that vectors with similar content fall into the same group and the groups are as different from one another as possible.

Clustering Algorithm
A clustering algorithm works by finding groups of similar items and grouping them into categories. When the categories are not specified in advance, this is sometimes called unsupervised clustering. When the categories are specified in advance, it is sometimes called supervised clustering.

Turning now to FIGS. 1-3, the method of the present invention according to one embodiment is shown.

FIG. 1 is a flowchart of the first stage of the invention according to one embodiment, referred to herein as the text type clustering stage 100, in which overlaid or superimposed text is detected in the frames of a target program, such as a television or video program of interest to the user.

FIG. 2 is a flowchart of the second stage of the invention according to one embodiment, referred to herein as genre/subgenre identification. During this stage a training process is carried out in which training videos representing various genre/subgenre types are analyzed to determine their respective cluster distributions. Once obtained, the cluster distributions serve as genre/subgenre identifiers for the various genre/subgenre types. As a result of the training process, the genre/subgenre type of the target program may be determined by comparing its cluster distribution with the cluster distributions of the various genre/subgenre types obtained during training.

FIG. 3 is a flowchart of the third stage of the invention according to one embodiment, referred to as the target program structure recovery stage. During this stage the high-level structure of the target program is determined by first building a database of higher-order graphical models, each of which graphically represents the flow of video text through a program of a particular genre/subgenre type. Once the database has been built, the results obtained in the earlier stages of processing, such as the text detection and cluster distribution results for the target program, are used to identify and select one graphical model from among those stored in the database in order to recover the high-level structure of the program.

Note that not all of the steps described in the process flow diagrams below are required, and that further steps may be performed in addition to those shown. Some steps may also be performed substantially simultaneously with, or during, other steps. After reading this specification, those skilled in the art will be able to determine which steps to use for their particular needs.

I. First Stage - Text Type Clustering
As shown in the flowchart of FIG. 1, the first stage, the text type clustering stage 100, generally comprises the following steps.

110 - Detecting the presence of text in a "target program" of interest to the end user, such as a television or video program
120 - Identifying and extracting text features for each line of video text detected in the target program
130 - Forming feature vectors from the identified and extracted features
140 - Organizing the feature vectors into clusters
150 - Labeling each cluster according to the type of video text present in the cluster
Each of these general steps is described in more detail below.

In step 110, the process begins by analyzing the "target" television or video program to detect the presence of text within each of its video frames. A more detailed description of video text detection is given in U.S. Patent No. 6,608,930, "Method and System for Analyzing Video Content Using Detected Text in Video Frames", issued August 19, 2003 to Agnihotri et al., the entire contents of which are incorporated herein by reference. The types of text that can be detected in the target program may include, for example, starting and ending credits, scores, title text, and nameplates. Alternatively, text detection may be carried out in accordance with the MPEG-7 standard, which describes a method for segmenting still or moving video objects.

In step 120, text features are identified and extracted from the text detected in step 110. Examples of text features include position (row and column), height (h), font type (f), and color (r, g, b). Others are possible. For the position feature, the video frame is treated, for the purposes of the invention, as divided into a 3×3 grid, giving nine distinct regions. The row and column parameters of the position feature specify the region in which the text is located. For the font type feature, "f" denotes the font type in use.

In step 130, for each line of detected text, the extracted text features are grouped into a single feature vector F_V.
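
Steps 120 and 130 can be sketched as follows. This is a minimal illustration under stated assumptions: the text line is given as a pixel bounding box, its center decides the 3×3 grid cell, and the font type is encoded as an integer. The class and function names (`TextFeatureVector`, `grid_region`, `make_feature_vector`) are hypothetical and do not come from the source.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TextFeatureVector:
    row: int     # grid row of the text, 0 (top) to 2 (bottom)
    col: int     # grid column of the text, 0 (left) to 2 (right)
    height: int  # text height in pixels
    font: int    # font type, encoded as an integer id (assumption)
    r: int       # text color, red component
    g: int       # text color, green component
    b: int       # text color, blue component

def grid_region(x, y, frame_width, frame_height):
    """Map a pixel position to its (row, col) cell in a 3x3 grid over the frame."""
    col = max(0, min(int(3 * x // frame_width), 2))
    row = max(0, min(int(3 * y // frame_height), 2))
    return row, col

def make_feature_vector(box, font_id, color, frame_size):
    """Build one feature vector for a detected line of text.

    box is (left, top, width, height) in pixels; the box center picks the cell.
    """
    left, top, width, height = box
    frame_width, frame_height = frame_size
    row, col = grid_region(left + width / 2, top + height / 2,
                           frame_width, frame_height)
    r, g, b = color
    return TextFeatureVector(row, col, height, font_id, r, g, b)

# A yellow text line near the lower right of a 720x480 frame.
fv = make_feature_vector(box=(500, 420, 200, 24), font_id=1,
                         color=(255, 255, 0), frame_size=(720, 480))
print(fv)
```

Here the text center falls in the bottom-right cell, so the vector carries row 2 and column 2. Using the box center is a design choice for the sketch; a text line that spans two cells could equally be assigned by its top-left corner.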

In step 140, the feature vectors F_V are organized into clusters {C1, C2, C3, ...}. Grouping is performed using a distance metric between a feature vector F_V1 and the clusters {C1, C2, C3, ...} and F_V2, and associating the feature vector F_V1 with the cluster of highest similarity. An unsupervised clustering algorithm may be used to cluster the feature vectors F_V on the basis of this similarity.
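
One simple way to realize step 140 is sketched below. The text does not name a specific unsupervised algorithm, so this threshold-based "leader" scheme is an assumption: each vector joins the nearest existing cluster if it is close enough to that cluster's first member, and otherwise starts a new cluster. The unweighted Manhattan distance, the threshold value, and the sample vectors are also illustrative.

```python
def manhattan(a, b):
    """Sum of absolute differences between two equal-length numeric tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def leader_cluster(vectors, threshold):
    """Group vectors into clusters; each cluster's first member is its representative."""
    clusters = []
    for v in vectors:
        # Find the existing cluster whose representative is nearest to v.
        nearest = min(clusters, key=lambda c: manhattan(v, c[0]), default=None)
        if nearest is not None and manhattan(v, nearest[0]) <= threshold:
            nearest.append(v)
        else:
            clusters.append([v])
    return clusters

# Feature vectors as (row, col, height, font, r, g, b) tuples.
v1 = (2, 2, 20, 1, 255, 255, 0)  # yellow text, lower right
v2 = (2, 2, 22, 1, 250, 255, 0)  # nearly identical to v1
v3 = (0, 0, 30, 2, 0, 0, 255)    # blue text, upper left

clusters = leader_cluster([v1, v2, v3], threshold=20)
print(len(clusters))
```

The result depends on the order in which vectors arrive and on the threshold, which is a known limitation of this scheme; k-means or hierarchical clustering would be alternatives.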

In one embodiment, the distance metric used is the Manhattan distance Dist(F_V1, F_V2), calculated as the sum of the absolute values of the differences between the respective text features of the two feature vectors.

It should be noted that the weighting factors w1 through w4 and "Dist" may be determined empirically.
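As a worked illustration, one weighted Manhattan form consistent with the four weighting factors and with the text features listed in step 120 could be written as follows. The pairing of each weight with a feature group (position, height, color, font type) is an assumption made here and is not confirmed by the text.

```latex
\mathrm{Dist}(F_{V1}, F_{V2}) =
    w_1 \left( |row_1 - row_2| + |col_1 - col_2| \right)
  + w_2 \, |h_1 - h_2|
  + w_3 \left( |r_1 - r_2| + |g_1 - g_2| + |b_1 - b_2| \right)
  + w_4 \, |f_1 - f_2|
```

Here the subscripts 1 and 2 denote the features of F_V1 and F_V2, respectively.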

In step 150, each cluster {C1, C2, C3, ...} formed in step 140 is labeled according to the type of text in the cluster. For example, cluster C1 may contain feature vectors describing text that is always displayed in yellow and always positioned in the lower right portion of the screen. Cluster C1 would therefore be labeled "upcoming program announcements," because the features described are those of text that announces upcoming shows. As another example, cluster C2 may contain feature vectors describing text that is always displayed in blue with a black banner around it and always positioned in the upper left portion of the screen. Cluster C2 would therefore be labeled "sports scores," because these text features are the ones always used to display scores.

The cluster labeling of step 150 may be performed manually or automatically. An advantage of the manual approach is that the cluster labels are more intuitive, such as "title text" or "news update." Automatic labeling produces labels such as "text type 1" and "text type 2."

II. Second Stage - Genre/Sub-Genre Identification
The second stage, the genre/sub-genre identification stage 200 shown in the flowchart of FIG. 2, generally comprises the following steps.

210 - Perform genre/sub-genre identification training
210.a - A number N of training videos of a given genre/sub-genre type are provided as input.
210.b - Text detection is performed on each of the N training videos.
210.c - Text features are identified and extracted for each line of detected text in each training video.
210.d - Feature vectors are formed from the text features extracted in step 210.c.
210.e - Using a distance metric, each feature vector formed in step 210.d is associated with one of the cluster types {C1, C2, C3, ...} derived in step 140, thereby deriving cluster types for the feature vectors.

220 - A genre feature vector is constructed for the genre/sub-genre type of the target program.

Table I is provided by way of example to help explain how genre feature vectors are used to define the various genre/sub-genre types in step 210. The rows of Table I show the various genre/sub-genre types, and columns 2 through 5 show the cluster distributions (counts) that result from performing genre/sub-genre identification.

The genre feature vectors determined from the genre/sub-genre identification characterize each genre/sub-genre type, for example movie/western = {13, 44, 8, 43} and sports/baseball = {5, 33, 8, 4}.
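By way of illustration only, a genre feature vector of this kind can be viewed as a per-cluster count of the text lines detected in a video. The sketch below assumes four cluster types indexed 0 through 3 (standing for C1 through C4). The function name and the sample labels are hypothetical.

```python
from collections import Counter


def cluster_distribution(cluster_labels, num_clusters):
    """Count how many detected text lines fall into each cluster type."""
    counts = Counter(cluster_labels)
    return [counts.get(c, 0) for c in range(num_clusters)]


# Cluster index of each detected text line in one training video.
labels = [0, 1, 1, 3, 1, 2, 3, 3, 0]
genre_vector = cluster_distribution(labels, num_clusters=4)
```

For the sample labels above, the resulting vector holds the counts for C1 through C4 in order.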

In step 220, the genre/sub-genre type of the target program is determined. Here, the cluster distribution of the target program (previously computed in step 140) is compared with the cluster distributions determined in step 210 for the various genre/sub-genre types. The genre/sub-genre type of the target program is determined by finding which of the cluster distributions from step 210 is closest to the target program's cluster distribution from step 140. A threshold may be applied to ensure sufficient similarity. For example, the target program's cluster distribution may be required to have a similarity score of at least 80% with the closest cluster distribution determined in step 210 before genre/sub-genre identification of the target program is declared successful.
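The comparison of step 220 might be sketched as follows. The specification does not state how the similarity score is computed. Cosine similarity is used here purely as an assumption, and the function names and the target distribution are hypothetical. The two trained distributions are the example vectors given above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (assumed score)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_genre(target, trained, threshold=0.8):
    """Return (genre, score) for the closest trained distribution,
    or (None, score) if the best score is below the threshold."""
    best_genre, best_score = None, -1.0
    for genre, dist in trained.items():
        score = cosine_similarity(target, dist)
        if score > best_score:
            best_genre, best_score = genre, score
    if best_score < threshold:
        return None, best_score
    return best_genre, best_score


trained = {
    "movie/western": [13, 44, 8, 43],
    "sports/baseball": [5, 33, 8, 4],
}
genre, score = identify_genre([12, 40, 9, 45], trained)
unknown, low_score = identify_genre([0, 0, 50, 0], trained)
```

The second call shows the threshold at work: a distribution that resembles neither trained genre yields no identification.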

Overview of Petri Nets
Before describing the third stage, the high-level structure restoration stage 300, an overview of some basic principles of graphical modeling is given below, using Petri net theory as the basis.

The fundamentals of Petri nets are well known and are presented in detail in "Petri Net Theory and the Modeling of Systems" by James L. Peterson of the University of Texas at Austin. The book is published by Prentice-Hall, Inc. of Englewood Cliffs, New Jersey, and is incorporated herein by reference.

Briefly, a Petri net is a particular type of directed graph made up of two kinds of nodes, called places and transitions, with directed arcs running either from a place to a transition or from a transition to a place. Places collect tokens, which are the elements that represent what flows through the system, and transitions move tokens between places.
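The token-moving behavior described above can be sketched in a few lines. The class below is a minimal illustration, assuming arcs of weight one. The class name, method names, and the three-place example net are hypothetical and do not reproduce the topology of any figure.

```python
class PetriNet:
    """Minimal place/transition net with integer token counts."""

    def __init__(self, marking):
        self.marking = dict(marking)   # place name -> token count
        self.transitions = {}          # name -> (input places, output places)

    def add_transition(self, name, inputs, outputs):
        self.transitions[name] = (list(inputs), list(outputs))

    def enabled(self, name):
        """A transition is enabled when every input place holds a token."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) > 0 for p in inputs)

    def fire(self, name):
        """Consume one token per input place, produce one per output place."""
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:
            self.marking[p] -= 1
        for p in outputs:
            self.marking[p] = self.marking.get(p, 0) + 1


net = PetriNet({"P1": 1})
net.add_transition("t1", inputs=["P1"], outputs=["P2"])
net.add_transition("t2", inputs=["P2"], outputs=["P3"])
before = (net.enabled("t1"), net.enabled("t2"))
net.fire("t1")
```

After t1 fires, the token has moved from P1 to P2, which in turn enables t2.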

FIG. 4 shows an exemplary Petri net system with its places, transitions, arcs, and tokens. The Petri net shown in FIG. 4 is a graphical model of the introductory segment of the movie "The Player." In this movie, the opening credits are displayed at three text positions, referred to here as L1, L2, and L3. The appearance and subsequent disappearance of text at positions L1, L2, and L3 during the introductory segment are modeled graphically by the Petri net in terms of system states and their changes. More specifically, as described below, system states are modeled as one or more conditions, and changes of system state are modeled as transitions.

With continued reference to FIG. 4, the "places" of the exemplary Petri net are drawn as circles labeled P1 through P6 and, in this example, represent "conditions." For example, one condition of the Petri net of FIG. 4 is "text appears at movie screen position L1." For modeling purposes, this condition is associated with place P5. Transitions are drawn as rectangles labeled t1 through t8 and represent events. For example, one event in the Petri net of FIG. 4 is "text starts at movie screen position L1." For modeling purposes, this event is associated with transition t2.

The concept of conditions and events is one interpretation of places and transitions as used in Petri net theory. As shown, each of the transitions t1 through t8 has a certain number of input and output places, which represent the pre-conditions and post-conditions of the event, respectively. For an event to occur, its pre-conditions must be satisfied.

FIG. 5 summarizes the events and the pre- and post-conditions they link for the exemplary Petri net of FIG. 4. Pre-conditions are listed in column 1, post-conditions in column 3, and the events linking them in column 2.

The Petri net of FIG. 4 is an example of the systematic flow of text describing a small segment of a television or video program. The Petri net of FIG. 4 can therefore properly be characterized as a "lower-order" Petri net. As described below, the present application makes use of "higher-order" Petri nets that are made up in part of "lower-order" Petri nets.

III. Third Stage - Restoring the High-Level Structure of the Target Program
The third stage, the high-level structure restoration stage 300 shown in the flowchart of FIG. 3, generally comprises the following steps.

310 - Objective: restore the high-level structure of the target program
310.a - Construct a database of higher-order graphical models.
310.b - Identify hot spots within each higher-order graphical model.
310.c - Retrieve the text detection results previously generated for the target program in step 140 (see FIG. 1).
310.d - Retrieve the cluster distribution results previously generated for the target program in step 160 (see FIG. 1).
310.e - Using the target program's cluster distribution results, identify and retrieve a subset of higher-order graphical models from the plurality of higher-order graphical models stored in the database.
310.f - Using the subset of higher-order graphical models identified in step 310.e and the text detection results, identify, from that subset, the single higher-order graphical model that most closely matches the text detection event sequence of the target program retrieved in step 310.c.
This single higher-order graphical model graphically represents the high-level structure of the target program.

Each of the general steps above is now described in more detail.

In step 310.a, a plurality of higher-order graphical models (such as Petri nets) are constructed, each describing the systematic flow of video text throughout an entire program. Each graphical model uniquely describes the flow of video text for a particular genre/sub-genre type. The models are stored in a database for later reference, to help determine the genre/sub-genre type of a target program of interest to the user.

In one embodiment, the graphical models are manually constructed higher-order Petri nets. To construct such models manually, a system designer analyzes the video text detection and cluster mapping of programs for a variety of program genre/sub-genre types.

In another embodiment, the graphical models are constructed automatically as hidden Markov models using the Baum-Welch algorithm.

Regardless of whether they are constructed manually or automatically, higher-order graphical models have several key characteristics: (1) a higher-order graphical model models flow at the program level, and (2) a higher-order graphical model has transitions that are, in effect, shorthand representations of lower-order graphical models. In other words, a higher-order model is made up in part of lower-order graphical models. This key characteristic is illustrated further with reference to FIG. 6.

FIG. 6 illustrates a higher-order Petri net, which is one type of higher-order graphical model. The higher-order Petri net of FIG. 6 graphically depicts the systematic flow of video text in a figure skating program; that is, it models systematic flow at the program level. As is well known, a figure skating program is made up of a number of program events, as listed in Table II below.

Pre-conditions are required to trigger an event, and post-conditions arise as a result of the event. The conditions in this example may be defined as: condition a, the program has started; condition b, a skater has been introduced; condition c, the skater's scores are present; and condition d, the final standings are shown.

It should be understood that events 1 through 5 of the higher-order net of FIG. 6 are in fact shorthand representations of lower-order Petri nets. For example, the first event, the opening credits, can be expanded into a lower-order Petri net such as the one shown in FIG. 4.

Step 310.b - Within each higher-order graphical model constructed in step 310.a, a number of regions of interest (hot spots) may be identified. These hot spots may vary in scope. Hot spot regions correspond to events of particular interest to the end user. For example, event 2, "skater performance," may carry greater importance as a program event of interest than event 1, the opening credits. Each hot spot may be assigned a rank according to its relative importance. Hot spots may also be identified in the lower-order Petri nets that make up a higher-order Petri net.
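As a minimal sketch of hot spot ranking, program events could carry an importance value and be sorted on it. The importance values, the name of event 3, and the function name below are invented for illustration. Only events 1 and 2 are named in the description.

```python
def rank_hot_spots(events, top_n=None):
    """Sort program events by descending importance; optionally keep top_n."""
    ranked = sorted(events, key=lambda e: e["importance"], reverse=True)
    return ranked if top_n is None else ranked[:top_n]


# Importance values are assumptions, not taken from the specification.
events = [
    {"event": 1, "name": "opening credits", "importance": 1},
    {"event": 2, "name": "skater performance", "importance": 5},
    {"event": 3, "name": "skater scores", "importance": 4},
]
top_two = rank_hot_spots(events, top_n=2)
```

Keeping only the highest-ranked events is one way a summary could favor the hot spots over, for instance, the credits.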

Step 310.c - The text detection results previously generated for the target program in step 140 are retrieved (see FIG. 1).
Step 310.d - The cluster distribution results previously generated for the target program in step 160 are retrieved (see FIG. 1).
Step 310.e - Using the cluster distribution for the target program retrieved in step 310.d, a subset of the higher-order graphical models constructed in step 310.a is identified and selected from the database. The subset is selected by determining which higher-order models have the same clusters as those identified for the target program.
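By way of illustration only, selecting models that "have the same clusters" as the target program could be reduced to a set comparison, as sketched below. The model database contents, the cluster names, and the function name are hypothetical, and treating a zero count as "cluster absent" is an assumption.

```python
def select_candidate_models(target_distribution, model_db):
    """Keep models whose cluster types match those present in the target."""
    target_clusters = {c for c, n in target_distribution.items() if n > 0}
    return [name for name, clusters in model_db.items()
            if set(clusters) == target_clusters]


# Hypothetical database: model name -> cluster types the model uses.
model_db = {
    "sports/figure skating": {"C1", "C2", "C3"},
    "movie/western": {"C1", "C4"},
    "sports/baseball": {"C1", "C2", "C3"},
}
target = {"C1": 5, "C2": 12, "C3": 3, "C4": 0}
candidates = select_candidate_models(target, model_db)
```

More than one model can survive this filter, which is why a further step is needed to single out one model.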

Step 310.f - Using the text detection data for the target program retrieved in step 310.c, a single higher-order Petri net is identified from the subset of nets identified in step 310.e. To identify the single higher-order Petri net, the text detection data is compared with the systematic flow of each Petri net in the subset, so as to find the one Petri net that satisfies the text event sequence of the target program.
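The comparison of a text event sequence against each candidate's flow might be sketched as follows. As a deliberate simplification, each model is represented as a state machine (a Petri net in which every transition has exactly one input place and one output place) rather than as a general Petri net. The model contents, state names, and function names are assumptions.

```python
def accepts(model, start, events):
    """Return True if every event, in order, has a transition in the model."""
    state = start
    for ev in events:
        nxt = model.get(state, {}).get(ev)
        if nxt is None:
            return False
        state = nxt
    return True


def pick_models(candidates, events):
    """Return the names of candidate models whose flow admits the events."""
    return [name for name, (model, start) in candidates.items()
            if accepts(model, start, events)]


# state -> {text cluster type -> next state}; contents are hypothetical.
skating = {
    "start": {1: "intro"},
    "intro": {1: "intro", 2: "score"},
    "score": {2: "score", 1: "intro", 3: "final"},
}
baseball = {
    "start": {2: "inning"},
    "inning": {2: "inning", 3: "end"},
}
candidates = {
    "sports/figure skating": (skating, "start"),
    "sports/baseball": (baseball, "start"),
}
matches = pick_models(candidates, [1, 1, 2, 2, 3])
```

A general Petri net would track a marking over several places instead of a single current state, but the acceptance test has the same shape.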

Having identified the single graphical model that most closely approximates the high-level structure of the target program, information about the target program can readily be obtained. Such information may include, for example, temporal events, text events, program events, program structure, and summaries.

As one example, program event information can be identified using the text detection data from the target program together with the single identified higher-order graphical model. Table III represents fictitious text detection data for a target program.

As shown in the first row of Table III, the text detection data yields the cluster type of each detected text event (column 1), the time at which the text event occurs (column 2), the duration of the text event (column 3), and time boundary information specifying the lower and upper limits on the time at which the text event must occur (column 4). It should be understood that, for ease of explanation, the table represents a greatly reduced version of the sequence of text events that occurs over the duration of a program.

It should be understood that information about the target program can be extracted directly from the text detection data, as shown in Table III. Such information includes, for example, the number of occurrences, the duration, and/or the time of occurrence of a particular text cluster type. Those skilled in the art can envision other combinations of data extractable from the text detection data. Further, when the text detection data is compared with the identified higher-order graphical model that best represents the structure of the target program, additional information about the target program, such as program events and program structure, may be derived. For example, referring to Table III, the first three rows describe the occurrence of text cluster types in the following order: text cluster type 1, followed by text cluster type 1 again, followed by text cluster type 2. This sequence, or any other sequence from the table, may be used together with the high-level graphical model to determine whether the sequence {1, 2, 2} constitutes a program event in the graphical model. If the sequence constitutes a program event, the program event may, in certain applications, be extracted for inclusion in a multimedia summary. The determination of whether any selected sequence, such as {1, 2, 2}, constitutes a program event is based on whether the sequence occurs within the time boundaries specified in the fourth column of the table. This time boundary information is compared with the time boundaries built into the higher-order graphical model. One example of such a model is a timed Petri net.
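The time-boundary test described above might be sketched as follows. The TextEvent fields mirror the four columns of the text detection data. The sample rows, the function name, and the simplification of storing the model's time bounds directly on each event (rather than on nodes of a timed Petri net) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class TextEvent:
    cluster_type: int
    start: float      # time of occurrence, in seconds
    duration: float   # in seconds
    earliest: float   # lower time bound
    latest: float     # upper time bound


def find_program_events(events, pattern):
    """Return start indices where `pattern` occurs consecutively and every
    matched event starts within its own [earliest, latest] bounds."""
    hits = []
    n = len(pattern)
    for i in range(len(events) - n + 1):
        window = events[i:i + n]
        types_match = [e.cluster_type for e in window] == list(pattern)
        in_bounds = all(e.earliest <= e.start <= e.latest for e in window)
        if types_match and in_bounds:
            hits.append(i)
    return hits


# Fictitious rows; none of these values come from Table III.
events = [
    TextEvent(1, 10, 5, 0, 60),
    TextEvent(1, 40, 5, 0, 60),
    TextEvent(2, 90, 8, 60, 120),
    TextEvent(2, 130, 8, 100, 180),
    TextEvent(1, 400, 5, 0, 60),     # right type, outside its time bounds
    TextEvent(2, 420, 8, 400, 500),
    TextEvent(2, 450, 8, 400, 500),
]
hits = find_program_events(events, (1, 2, 2))
```

The second occurrence of the pattern is rejected because its first event falls outside its time bounds, so only the first occurrence would be reported as a program event.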

It should be understood that the embodiments and variations shown and described herein are merely illustrative of the principles of the present invention, and that various modifications may be implemented by those skilled in the art without departing from the scope and spirit of the invention.

In interpreting the appended claims, it should be understood that: a) the word "comprising" does not exclude the presence of elements or steps other than those listed in a given claim; b) the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements; c) any reference signs in the claims do not limit their scope; d) several "means" may be represented by the same item, or by the same hardware- or software-implemented structure or function; and e) each of the disclosed elements may be composed of hardware portions (e.g., discrete electronic circuitry), software portions (e.g., computer programming), or any combination thereof.

FIG. 1 is a flow diagram illustrating the text type clustering stage of the present invention according to one embodiment.
FIG. 2 is a flow diagram illustrating the genre/sub-genre identification stage of the present invention according to one embodiment.
FIG. 3 is a flow diagram illustrating the high-level structure restoration stage of the present invention according to one embodiment.
FIG. 4 is an exemplary graphical model showing a program event of a movie.
FIG. 5 is a summary of the pre- and post-conditions for the graphical model of FIG. 4.
FIG. 6 is an example of a higher-order Petri net.

Claims

1. A method for restoring the high-level structure of a target program, the method comprising the steps of:
a) generating text detection data for the target program;
b) generating a genre/sub-genre feature vector for the target program using the text detection data generated in step (a);
c) constructing a plurality of higher-order graphical models;
d) identifying a subset of the higher-order graphical models using cluster distribution data of the target program; and
e) identifying a single higher-order graphical model from the subset using the text detection data of the target program,
wherein the single higher-order graphical model corresponds to the high-level structure of the target program.

2. The method of claim 1, further comprising the step of constructing a program summary using the single higher-order graphical model together with the text detection data.

3. The method of claim 2, wherein the step of constructing the program summary further comprises:
determining one or more events that are important to a viewer;
searching the text detection data for the important events;
extracting the important events from the text detection data; and
including the extracted events in the program summary.

4. The method of claim 1, further comprising the step of constructing a program summary using the single higher-order graphical model together with the text detection data, wherein the constructing step comprises:
searching for program events;
ranking the program events identified in the searching step according to a predetermined ranking; and
selecting a subset of the identified program events based on the ranking.

5. The method of claim 4, wherein the step of searching for program events comprises:
determining a text event sequence that collectively defines a program event;
searching the text detection data for the text event sequence;
upon identifying the text event sequence in the text detection data, comparing the text event sequence with corresponding nodes of the higher-order graphical model; and
determining whether the sequence of occurrence times of the text event sequence obeys the time constraints associated with the corresponding nodes of the higher-order graphical model.

6. The method of claim 1, further comprising the step of searching the target program for information including text type, similarity to programs other than the target program, text patterns, program events, and program event patterns.

7. The method of claim 6, wherein the information is searched for in the target program using information provided by the text detection data and the higher-order graphical model.

8. The method of claim 1, wherein the graphical model is one of a Petri net model, a hidden Markov model, and a combination of a Petri net model and a hidden Markov model.

9. The method of claim 1, wherein the target program is one of a television program and a video program.

10. The method of claim 1, wherein the step of generating text detection data for the target program comprises:
i) detecting the presence of text in the target program;
ii) identifying and extracting text features of the detected text; and
iii) forming text feature vectors from the identified and extracted features.

11. The method of claim 10, wherein the step of detecting the presence of text in the target program is performed in accordance with the MPEG-7 standard.

12. The method of claim 10, wherein the identified and extracted text features include text position, text height, text font type, and text color.

13. The method of claim 10, wherein the step of detecting the presence of text in the target program further comprises detecting the presence of text in a video feature of the target program.

14. The method of claim 10, wherein the step of generating a genre/sub-genre feature vector for the target program comprises:
comparing the text feature vectors for the target program generated in step iii) with a plurality of predetermined genre/sub-genre feature vectors for various genre/sub-genre types;
associating the text feature vectors for the target program with the genre/sub-genre feature vector having the highest similarity; and
defining the group of genre/sub-genre feature vectors identified in the associating step.

15. The method of claim 1, wherein the plurality of higher-order graphical models graphically model program genre/sub-genre types at the program level.

16. The method of claim 12, wherein transition elements of the higher-order graphical models can be constructed from lower-order graphical models having program text and timing information.

17. The method of claim 16, wherein the lower-order graphical models are modeled as Petri nets.

18. The method of claim 17, wherein each transition element is assigned a priority relative to the other transition elements of the higher-order model.

19. The method of claim 1, wherein the step of generating genre feature vector cluster data for the target program is performed in accordance with an unsupervised clustering algorithm.

20. The method of claim 19, wherein the unsupervised clustering algorithm compares corresponding text features based on a distance metric.

21. The method of claim 20, wherein the distance metric is calculated as:
[equation]

22. A system for restoring the high-level structure of a target program, the system comprising:
a memory for storing computer-readable code;
a database for storing a plurality of higher-order Petri nets; and
a processor operatively coupled to the memory and configured to: generate text detection data for the target program; generate a genre/sub-genre feature vector for the target program using the text detection data; construct a plurality of higher-order graphical models; identify a subset of the higher-order graphical models using cluster distribution data of the target program; and identify a single higher-order graphical model from the subset using the text detection data of the target program,
wherein the single higher-order graphical model corresponds to the high-level structure of the target program.