
JP5360979B2 - Important information extraction method and apparatus

Info

Publication number: JP5360979B2
Application number: JP2009151022A
Authority: JP
Inventors: 知彦高橋; 勝菅野; 茂之酒澤
Original assignee: KDDI Corp
Current assignee: KDDI Corp
Priority date: 2009-06-25
Filing date: 2009-06-25
Publication date: 2013-12-04
Anticipated expiration: 2029-06-25
Also published as: JP2011008509A

Description

The present invention relates to an important information extraction method and apparatus that analyze a moving image such as a program video to extract important information, and more particularly to a method and apparatus that extract, from a video, important frames in which an important object appears.

Conventionally, objects such as things and people in a video have generally been recognized using detailed feature quantities of the target object. Non-Patent Document 1 discloses a technique in which video and audio feature quantities are learned in advance for individual objects such as national flags, mountains, and police officers and stored in a database. Feature quantities extracted from the video under analysis are then compared with the stored feature quantities of each object, and objects are recognized based on the similarity between the two.

Patent Document 1 discloses a technique that detects, based on image feature quantities, scenes in which a specific person appears in program videos produced and broadcast by broadcasting stations or video production companies, and presents those scenes as a list of thumbnail images.

Patent Document 1: JP 2009-110460 A

Non-Patent Document 1: "High-Level Feature Extraction Experiments for TRECVID 2007", Proc. of TRECVID 2007

However, the above prior art attends only to whether an object is present, and does not consider whether it is an object that the program producer wants to impress on viewers (hereinafter sometimes referred to as an "important object"). Consequently, even when the prior art can tell that, for example, a national flag is shown, it does not determine whether the flag is important information that should be specifically conveyed to the viewer.

An object of the present invention is to solve the above problem of the prior art and to provide an important information extraction method and apparatus that extract, from a program video or the like, important frames in which an important object appears.

To achieve the above object, the present invention is characterized in that an important information extraction apparatus, which analyzes a video to extract important information, is provided with the following means.

(1) The apparatus comprises: means for extracting feature quantities from a video; means for dividing the video into a plurality of shots based on the feature quantities; a database that stores representative feature quantities of important frames in which important objects appear; means for extracting a representative frame from each shot; means for extracting important frame candidates from the representative frames based on the feature quantities; means for acquiring meta information of the video; means for evaluating each important frame candidate by comparing its feature quantities with the representative feature quantities of the important frames; and means for extracting important frames from among the important frame candidates based on the meta information and the evaluation value of each important frame candidate.

(2) The database stores unique feature quantities for each object type, and the evaluating means compares the feature quantities of each important frame candidate with the unique feature quantities stored for each object type to calculate an evaluation value for each object type. The meta information describes the number of appearances, the order of appearance, and the object type of the important objects. The means for extracting important frames comprises: means for extracting, from the set of important frame candidates, combinations of important frame candidates that match the order and number of appearances of the important objects; and means for summing, according to the type of each important object, the per-type evaluation values of the extracted candidates and extracting as important frames the candidates of the combination whose sum is largest.

(3) The means for extracting a representative frame extracts the representative frame from a shot of a still image section.

(4) The means for extracting a representative frame extracts the representative frame from a shot of a moving-object follow section.

(5) The apparatus further comprises means for converting the important frames into thumbnails and displaying them as a list, and each thumbnail image is associated with information for playing back the video from the position of the corresponding important frame.

The present invention achieves the following effects.

(1) Important objects in a video are extracted based not only on evaluation of their feature quantities but also on the result of matching against meta information that describes the context of the video, so important objects can be extracted from the video with high accuracy.

(2) Each important object candidate extracted from the video is evaluated for each object type, and whether each candidate is an important object is judged by matching these per-type evaluation values against the number of appearances, order of appearance, and object types of the important objects described in the meta information, so important objects can be extracted with high accuracy.

(3) Still image sections are treated as important shots and representative frames are extracted from shots of still image sections, so representative frames in which an important object is likely to appear can be extracted efficiently.

(4) Moving-object follow sections are treated as important shots and representative frames are extracted from shots of moving-object follow sections, so representative frames in which an important object is likely to appear can be extracted efficiently.

(5) The important frames extracted from the video are converted into thumbnails and listed, and each thumbnail image is associated with information on the playback position of its important frame, so the program video can be played back from a desired important position simply by selecting the thumbnail of an important frame.

FIG. 1 is a block diagram showing the configuration of a system that includes the important information extraction apparatus according to the present invention.
FIG. 2 is a diagram showing an example of the meta information.
FIG. 3 is a flowchart showing the operation of one embodiment of the present invention.
FIG. 4 is a flowchart showing the procedure for calculating the inter-frame correlation.
FIG. 5 is a flowchart showing the procedure for calculating the information density distribution.
FIG. 6 is a diagram schematically showing the method of extracting representative frames.
FIG. 7 is a diagram schematically showing the method of identifying important frame candidates.
FIG. 8 is a diagram schematically showing the method of evaluating important frame candidates.
FIG. 9 is a diagram showing a display example of thumbnail images of important frames.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a functional block diagram showing the overall configuration of a system that includes the important information extraction apparatus according to the present invention. The main components are a video distribution device 2 that distributes the program video to be analyzed; a meta information providing device 3 that provides meta information created separately for each program video; and an important information extraction apparatus 1 that extracts important frames in the video by matching the likelihood that each frame extracted from the program video is an important frame against the meta information of that video.

The important information extraction apparatus 1 is, for example, an STB (set-top box), and outputs thumbnail images of the important frames extracted from the program video to a television/monitor device 4. As the meta information, context information that is created manually by metadata providers or viewers before or after the program is broadcast, and that is published on the Web or elsewhere, can be used.

In the important information extraction apparatus 1, the video receiving unit 101 includes a video storage unit 101a, and receives and stores the program video and its audio distributed from the video distribution device 2. The feature quantity extraction unit 102 includes a camera motion extraction unit 102a, a shot boundary extraction unit 102b, a color histogram distribution extraction unit 102c, an edge information extraction unit 102d, a feature point (corner) information extraction unit 102e, a telop (on-screen caption) information extraction unit 102f, a face recognition extraction unit 102g, an audio extraction unit 102h, and a volume extraction unit 102i, and extracts various image and audio feature quantities from the temporarily stored video and its audio.

The shot division unit 103 divides the program video into a plurality of shots based on the image and audio feature quantities. The appearance time length detection unit 104 targets important shots that are likely to contain important frames (the "camera still sections" and "moving-object follow sections" described later) and detects the total length of time for which the same object appears across shot boundaries.

In the appearance time length detection unit 104, the important shot detection unit 104a detects camera still sections and moving-object follow sections based on the camera motion feature quantity. A camera still section is a shot in which there is no camera work for a certain time or longer, and a moving-object follow section is a shot in which the camera tracks a moving object for a certain time or longer.

In a shot that focuses on an important object, the subject stays in view of the camera for a certain time or longer. If the subject is a stationary object, the camera remains still for a certain time or longer; if the subject is a moving object, the camera tracks it for a certain time or longer. In this embodiment, therefore, still shots and moving-object follow shots lasting a certain time or longer are detected based on the camera motion feature quantity and extracted as important shot candidates that are likely to contain important frames. The moving-object follow shots can be extracted using, for example, the method introduced in Torii et al., "Moving-object up-shot and follow-shot detection using video motion", Symposium on Image Recognition and Understanding 2005 (July 2005).
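The detection of camera still sections can be sketched as follows. This is a minimal illustration, assuming that a per-frame camera-motion magnitude has already been computed; the function name `find_still_sections`, the tolerance `motion_eps`, and the minimum length `min_len` are hypothetical and do not come from the specification.

```python
def find_still_sections(motion, motion_eps=0.5, min_len=30):
    """Return (start, end) frame-index pairs (end exclusive) in which the
    camera-motion magnitude stays below motion_eps for at least min_len frames.

    motion_eps and min_len are illustrative assumptions, not patent values.
    """
    sections = []
    start = None
    for i, m in enumerate(motion):
        if m < motion_eps:
            if start is None:
                start = i  # a still run begins here
        else:
            if start is not None and i - start >= min_len:
                sections.append((start, i))
            start = None
    # close a still run that lasts until the final frame
    if start is not None and len(motion) - start >= min_len:
        sections.append((start, len(motion)))
    return sections
```

A moving-object follow section would need a different cue (for example, sustained camera motion with a stable subject), which this sketch does not attempt.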

The representative frame extraction unit 104b extracts representative frames from the shots of the camera still sections and the moving-object follow sections based on the various feature quantities extracted from the video. As described in detail later, the inter-frame correlation calculation unit 104c calculates the inter-frame correlation of the image feature quantities between the representative frames N and M of two important shot candidates that are consecutive in time. The appearance time length measurement unit 104d measures the total time length of the shots containing representative frames N and M whose inter-frame correlation exceeds a predetermined threshold and, when this total time length exceeds a predetermined reference value, outputs it as one indicator of the likelihood that the representative frame is an important frame.

The information density distribution calculation unit 105 operates on the representative frame of each important shot candidate whose total time length exceeded the threshold. It handles the central image, obtained by trimming the central part of the frame, and the remaining peripheral image separately, and includes a detection unit 105a that detects Harris feature quantities, a detection unit 105b that detects the luminance average, a detection unit 105c that detects the standard deviation of hue, and a DCT coefficient detection unit 105d that detects DCT coefficients.

The database (DB) 108 stores representative feature quantities of important frames for each object type. In this embodiment, as described in detail later, it stores various object feature quantities, including a Shop feature quantity 108a representative of images of shops, a Product feature quantity 108b representative of images of products, and a Person feature quantity 108c representative of images of people.

The important frame candidate evaluation unit 106 compares the time length of each important shot candidate, together with the Harris feature quantity, luminance average, standard deviation of hue, and DCT coefficients, with the training data for important frames stored in the database 108, and calculates for each representative frame an evaluation value expressing how likely it is to be an important frame. The meta information acquisition unit 110 acquires the meta information on the video under analysis from the meta information providing device 3.

FIG. 2 shows an example of the meta information. Using the broadcast channel (<channel>), broadcast start time (<Time>), and program name (<Title>) of the program video as an index, it describes a plurality of subsections (<Subsection>), each representing a period in which important objects appear.

In this example, the program contains three subsections. The first subsection indicates that, from 13:02, "ABC Restaurant (Shop)", "Special curry rice (Product)", and "Special hamburger steak (Product)" appear as important objects in that order. The second indicates that, from 13:15, "General store Iroha (Shop)" appears as an important object. The third indicates that, from 13:30, "○× Coffee Shop (Shop)", "Homemade cheesecake (Product)", and "Store manager Mr. Sato (Person)" appear as important objects in that order.

Thus, in this embodiment each important object is classified into one of the object types Shop, Product, or Person, and the meta information describes the number of appearances, the order of appearance, and the object type of the important objects in the video as the context of the video.
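The example meta information can be modeled as follows. This is only a sketch: the class names, field names, and the `expected_types` helper are assumptions made for illustration, and the channel, start time, and title are left as placeholders because their values are not given in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subsection:
    start: str                       # time at which the subsection begins, "HH:MM"
    objects: List[Tuple[str, str]]   # (object name, object type) in order of appearance


@dataclass
class MetaInfo:
    channel: str
    time: str
    title: str
    subsections: List[Subsection]

    def expected_types(self) -> List[str]:
        """Object types in the order the important objects are expected to appear."""
        return [obj_type for s in self.subsections for _, obj_type in s.objects]


# The three subsections of the example; channel/time/title are placeholders.
meta = MetaInfo(
    channel="(not given)", time="(not given)", title="(not given)",
    subsections=[
        Subsection("13:02", [("ABC Restaurant", "Shop"),
                             ("Special curry rice", "Product"),
                             ("Special hamburger steak", "Product")]),
        Subsection("13:15", [("General store Iroha", "Shop")]),
        Subsection("13:30", [("○× Coffee Shop", "Shop"),
                             ("Homemade cheesecake", "Product"),
                             ("Store manager Mr. Sato", "Person")]),
    ],
)
```

The flattened type sequence returned by `expected_types` captures the number of appearances, the order, and the type of each important object, which is the context the later matching step relies on.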

Returning to FIG. 1, the important frame determination unit 107 includes, as described in detail later, a combination extraction unit 107a and an evaluation unit 107b. The combination extraction unit 107a extracts, from the set of important frame candidates extracted from the program video, combinations of candidates that match the order and number of appearances of the important objects described in the meta information. The evaluation unit 107b sums, according to the type of each important object, the per-type evaluation values of the extracted candidates and determines the candidates of the combination whose sum is largest to be the important frames. The information providing unit 109 generates thumbnail images of the determined important frames and provides them to the television/monitor device 4.
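The combination search described for units 107a and 107b can be sketched with an exhaustive enumeration. The function name and the data layout (one dictionary of per-type scores per candidate, in temporal order) are assumptions; the specification does not say how the combinations are enumerated.

```python
from itertools import combinations


def select_important_frames(candidates, expected_types):
    """Pick the order-preserving combination of candidates with the largest total.

    candidates: list of dicts mapping object type -> evaluation value,
                ordered by time of appearance.
    expected_types: object types from the meta information, in order of appearance.
    Returns (indices, total); indices is None if there are too few candidates.
    """
    best_idx, best_total = None, float("-inf")
    k = len(expected_types)
    # combinations() yields index tuples in increasing order, so the temporal
    # order of the candidates is preserved within each combination.
    for idx in combinations(range(len(candidates)), k):
        total = sum(candidates[i][t] for i, t in zip(idx, expected_types))
        if total > best_total:
            best_idx, best_total = idx, total
    return best_idx, best_total
```

Exhaustive enumeration grows quickly with the number of candidates; a dynamic-programming formulation would give the same maximum more cheaply, but the brute-force form stays closest to the wording of the text.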

Next, the operation of one embodiment of the present invention is described in detail with reference to the flowcharts of FIGS. 3, 4, and 5.

In step S1, the video temporarily stored in the video storage unit 101a of the video receiving unit 101 is read into the feature quantity extraction unit 102, and various feature quantities are extracted, including the shot boundary, camera motion, telop, and face recognition feature quantities.

A shot boundary is a point at which the camera viewpoint is switched by video editing (a point at which the video is cut), and can be found from the amount of difference between consecutive frames, as disclosed for example in JP 2007-134986 A. Camera motion refers to the vertical movement (tilt), horizontal movement (pan), and zoom operation of the camera shooting the video, and its feature quantity can be obtained, for example, from motion vectors in the MPEG coding information or by computing optical flow.
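As a rough illustration of finding shot boundaries from inter-frame differences, the sketch below thresholds the mean absolute pixel difference between consecutive frames. It is a generic approach and does not reproduce the method of the cited publication; the function name and the `threshold` parameter are assumptions.

```python
def detect_shot_boundaries(frames, threshold):
    """Return the indices i at which frame i differs strongly from frame i-1.

    frames: list of equal-length sequences of pixel intensities (flattened frames).
    threshold: mean absolute difference above which a cut is declared (assumed).
    """
    boundaries = []
    for i in range(1, len(frames)):
        prev, cur = frames[i - 1], frames[i]
        diff = sum(abs(a - b) for a, b in zip(prev, cur)) / len(cur)
        if diff > threshold:
            boundaries.append(i)
    return boundaries
```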

The telop feature quantity can be detected, as disclosed for example in Japanese Patent Application Laid-Open No. Hei 12-23062, by using two characteristics: (1) a telop is displayed in a predetermined area at the top or bottom of the screen, and (2) a luminance change occurs when a telop appears and when it disappears. The face recognition feature quantity can be extracted by a conventional method such as that disclosed in Japanese Patent Application Laid-Open No. 2006-508461.

In step S2, the shot division unit 103 divides the program video into a plurality of shots based on the feature quantities. In step S3, telop shots in which a telop is the main content, such as title screens, are removed from the program video. In step S4, the important shot detection unit 104a of the appearance time length detection unit 104 obtains camera still sections and moving-object follow sections from the shots as important shot candidates, based on the camera motion feature quantity.

In step S5, as illustrated in FIG. 6, the representative frame extraction unit 104b obtains representative frames from the extracted camera still sections and moving-object follow sections. One way to obtain a representative frame is to take the frame in the middle of each section, as shown on the right side of FIG. 6. Alternatively, as shown on the left side of FIG. 6, the method can exploit the fact that a telop is very often tied to the image being displayed on the screen: when a new telop appears within a section, the moment at which that telop has fully appeared is taken as the representative frame. In step S6, the inter-frame correlation calculation unit 104c calculates the inter-frame correlation of the image feature quantities between the representative frames N and M of two consecutive important shot candidates n and m.
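The two ways of choosing a representative frame in step S5 can be sketched as a single helper. The function and parameter names are hypothetical, and the sketch assumes that the frame at which a telop has fully appeared is already known from the telop feature quantity.

```python
def pick_representative_frame(start, end, telop_complete_frame=None):
    """Choose the representative frame of a section [start, end).

    telop_complete_frame: index of the frame at which a new telop has fully
    appeared, or None if no new telop appears in the section (assumed input).
    """
    if telop_complete_frame is not None and start <= telop_complete_frame < end:
        return telop_complete_frame   # telop-based choice
    return (start + end) // 2         # fall back to the middle of the section
```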

FIG. 4 is a flowchart showing the procedure for calculating the inter-frame correlation. In step S21, from among the representative frames extracted from the important shot candidates, the representative frames of two consecutive shots n and m are selected as the current frames of interest N and M. In step S22, the preceding frame of interest N is trimmed, randomly or according to a predetermined rule, to generate multiple partial images Nk (N1, N2, N3, ...). In step S23, the histogram correlation values r(N,M), r(N1,M), r(N2,M), r(N3,M), ... for each feature quantity are calculated between the frame of interest N and its partial images Nk on the one hand and the subsequent frame of interest M on the other.

In step S24, the maximum Max{r(N,M), r(N1,M), ...} of all the histogram correlation values obtained in step S23 is compared with a reference correlation value Rref, and if the maximum is greater than or equal to Rref the process proceeds to step S25. In step S25, the sum [tn+tm] of the time length tn of shot n, whose representative frame is the frame of interest N, and the time length tm of shot m, whose representative frame is the frame of interest M, is compared with a reference time length tref. If the sum [tn+tm] is greater than or equal to tref, the process proceeds to step S26. In step S26, the two current frames of interest N and M are both judged to be important frame candidates and are associated with the sum [tn+tm].

If, on the other hand, the maximum is judged in step S24 to be less than the reference correlation value Rref, the process proceeds to step S27, where the time length tn of the shot whose representative frame is the frame of interest N is compared with the reference time length tref. If tn is greater than or equal to tref, the process proceeds to step S28, where the current frame of interest N is judged to be an important frame candidate and is associated with the time length tn. In step S29, it is determined whether the above judgment has been completed for all representative frames. If not, the process returns to step S21 and the above steps are repeated for the remaining representative frames.
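The decision logic of steps S23 to S28 for one pair of frames can be sketched as follows. The text does not specify which correlation measure is used, so the Pearson correlation here is an assumption, as are the function names and the labels "N" and "M" used in the return value. Histograms of N and of its partial images Nk are taken as precomputed inputs.

```python
from math import sqrt


def correlation(h1, h2):
    """Pearson correlation between two equal-length histograms (assumed measure)."""
    n = len(h1)
    m1, m2 = sum(h1) / n, sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    v1 = sum((a - m1) ** 2 for a in h1)
    v2 = sum((b - m2) ** 2 for b in h2)
    if v1 == 0 or v2 == 0:
        return 0.0  # a flat histogram carries no correlation information
    return cov / sqrt(v1 * v2)


def judge_pair(hists_n, hist_m, tn, tm, r_ref, t_ref):
    """Apply steps S23-S28 to one pair of frames of interest (N, M).

    hists_n: histograms of frame N and of its partial images Nk.
    Returns a list of (frame label, associated time length) for the frames
    judged to be important frame candidates.
    """
    r_max = max(correlation(h, hist_m) for h in hists_n)   # S23, S24
    if r_max >= r_ref:
        if tn + tm >= t_ref:                                # S25
            return [("N", tn + tm), ("M", tn + tm)]         # S26
        return []
    if tn >= t_ref:                                         # S27
        return [("N", tn)]                                  # S28
    return []
```

Taking the maximum over the partial images lets a zoomed-in shot of the same object still correlate highly with the following frame, since one of the crops of N may resemble M even when the full frame does not.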

FIG. 7 shows the relationship among the transition of camera motion, the positions of shot boundaries, and the representative frames in a video sequence.

In this example, different objects appear in representative frame 1 and representative frame 2, so their histogram correlation value is low, and representative frame 1 is therefore not classified as an important frame candidate. In contrast, representative frames 2 and 3 are a still image and a zoomed image of the same object, so their histogram correlation value is high. If the sum [t2+t3] of the time lengths t2 and t3 of the two shots containing representative frames 2 and 3 exceeds the predetermined threshold tref, both representative frames 2 and 3 are treated as important frame candidates.

Returning to FIG. 3, in step S7 the information density distribution calculation unit 105 calculates, for each extracted important frame candidate, how strongly the amount of information in the image is concentrated in the center, and thereby finally identifies whether the candidate is an important frame containing an important object.

FIG. 5 is a flowchart showing the procedure for calculating the information density distribution. In step S41, a central image, obtained by trimming a fixed region (for example, about 60%) at the center of the image, and the corresponding peripheral image are extracted from each representative frame. In step S42, the Harris feature quantities of the central and peripheral images are calculated for each important frame candidate. In step S43, the luminance averages of the central and peripheral images are calculated for each important frame candidate. In step S44, the standard deviation (or variance) of hue of the central and peripheral images is calculated for each important frame candidate. In step S45, the DCT coefficients of the central and peripheral images are calculated for each important frame candidate.
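Steps S41 and S43 can be sketched in plain Python as below. The text says only that about 60% of the image is trimmed; applying the ratio to each side length is an assumption, as are the function names. The Harris, hue, and DCT computations of steps S42, S44, and S45 are omitted.

```python
def split_center_periphery(image, ratio=0.6):
    """Split a 2D image into central and peripheral pixel values (step S41).

    image: list of rows of luminance values.
    ratio: fraction of each side length covered by the central region (assumed).
    Returns (center_values, periphery_values) as flat lists.
    """
    h, w = len(image), len(image[0])
    ch, cw = round(h * ratio), round(w * ratio)
    top, left = (h - ch) // 2, (w - cw) // 2
    center, periphery = [], []
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            inside = top <= y < top + ch and left <= x < left + cw
            (center if inside else periphery).append(value)
    return center, periphery


def mean(values):
    """Average of a non-empty list, used here for the luminance average (step S43)."""
    return sum(values) / len(values)
```

Comparing `mean(center)` with `mean(periphery)`, and likewise for the other features, gives a simple view of how much of the image content is concentrated in the middle of the frame.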

Returning to FIG. 3, in step S8 an evaluation value is calculated for each important frame candidate and for each object type (here, "Person", "Product", and "Shop"). In this embodiment the object types are divided into the three classes Person, Product, and Shop in order to focus on the three elements that make up a video (people, objects, and background), but the present invention is not limited to this. For example, an object type for artificially processed content (telops or animation) may be added or substituted.

The evaluation values are calculated using the representative color of the screen obtained from the feature quantities extracted by the feature quantity extraction unit 102, the color histogram distribution of each region when the screen is divided into 3x3, the edge distribution, and so on. In addition, the following parameters may be added as feature quantities unique to each type.

(1) Calculation of the Person evaluation value

The likelihood is evaluated based on the area of the face region obtained by face recognition and the position of the centroid of the face region; a higher evaluation value is given the larger and more central the face appears.

For important objects other than persons, two different assumptions are used depending on the size of the target. That is, if an important object is the subject of shooting, it can be assumed that the object is presented to the viewer in an emphasized form within the frame. The following evaluations are therefore performed according to the size of the object to obtain the likelihood Pp that the important object is a Product and the likelihood Ps that it is a Shop.

(2) Evaluation of the likelihood of Product

Here, it is assumed that the object is small relative to the shooting range of the camera. That is, the object is displayed near the center of the screen, and a background region exists at one or more, or all, of the top, bottom, left, and right edges of the frame. If the object being shot is an important object, the videographer composes the screen so as to emphasize it.

(2.1) In this embodiment, the frame screen is therefore divided into a central region C and a peripheral region O. Letting FC be the number of feature points in the central region C and FO be the number of feature points in the peripheral region O, the evaluation value Pp is increased if the following expression holds.

FC > FO × K

(2.2) Further, letting Dev(Hc) be the standard deviation of hue in the central region C of the frame screen and Dev(Ho) be the standard deviation of hue in the peripheral region O, the evaluation value Pp is increased if the following expression holds.

Dev(Hc) > Dev(Ho) × L

(2.3) Further, letting Ave(Vc) be the average luminance in the central region C of the frame screen and Ave(Vo) be the average luminance in the peripheral region O, the evaluation value Pp is increased if the following expression holds.

Ave(Vc) > Ave(Vo) × M

Here, K, L, and M are variables each determined by the area ratio between the central region C and the peripheral region O of the image.
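Conditions (2.1) to (2.3) can be read as three independent tests that each raise Pp when satisfied. The sketch below illustrates that reading. The dictionary keys, the per-condition weights, and the coefficient values are hypothetical; the specification states only that K, L, and M depend on the area ratio between the two regions and does not give the amount by which Pp is increased.

```python
def product_score(center, periphery, k_coef, l_coef, m_coef,
                  weights=(1.0, 1.0, 1.0)):
    """Accumulate a Product likelihood Pp from conditions (2.1)-(2.3).

    `center` and `periphery` hold per-region measurements. Each satisfied
    condition adds its weight to Pp (the weights are an assumption).
    """
    pp = 0.0
    # (2.1) FC > FO x K: more feature points in the central region
    if center["feature_points"] > periphery["feature_points"] * k_coef:
        pp += weights[0]
    # (2.2) Dev(Hc) > Dev(Ho) x L: larger hue deviation in the central region
    if center["hue_std"] > periphery["hue_std"] * l_coef:
        pp += weights[1]
    # (2.3) Ave(Vc) > Ave(Vo) x M: brighter central region
    if center["mean_luma"] > periphery["mean_luma"] * m_coef:
        pp += weights[2]
    return pp


# Illustrative measurements and coefficients (not taken from the patent).
center = {"feature_points": 40, "hue_std": 30.0, "mean_luma": 120.0}
periphery = {"feature_points": 50, "hue_std": 10.0, "mean_luma": 130.0}
pp = product_score(center, periphery, k_coef=0.6, l_coef=1.0, m_coef=1.0)
```

In this example the first two conditions hold and the third does not, so two of the three weights contribute to Pp.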

The central region C may be the region obtained by trimming the top, bottom, left, and right edges of the frame screen at a fixed ratio about its center. Alternatively, the screen may be scanned with a window of fixed area, and the window region containing the largest number of feature points may be used as the central region C.
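The window-scanning alternative could look roughly like the following brute-force search. It assumes that feature points are already available as (x, y) coordinates and that ties are resolved in favor of the first window found; neither detail is specified in the text, and the function name is hypothetical.

```python
def densest_window(points, frame_w, frame_h, win_w, win_h, step=1):
    """Scan a fixed-size window over the frame and return (left, top, count)
    for the window position that contains the most feature points."""
    best_count, best_left, best_top = -1, 0, 0
    for top in range(0, frame_h - win_h + 1, step):
        for left in range(0, frame_w - win_w + 1, step):
            count = sum(
                1 for x, y in points
                if left <= x < left + win_w and top <= y < top + win_h
            )
            # Strict comparison keeps the first window on ties (an assumption).
            if count > best_count:
                best_count, best_left, best_top = count, left, top
    return best_left, best_top, best_count


# Hypothetical feature points clustered toward the lower right of a 10x10 frame.
points = [(7, 7), (8, 7), (8, 8), (1, 1)]
left, top, count = densest_window(points, 10, 10, 3, 3)
```

A production implementation would more likely use an integral image of feature-point counts rather than recounting the points for every window position.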

(3) Evaluation of the likelihood of Shop

Here, on the assumption that the object is large relative to the shooting range of the camera, the fact that a building occupies at least a certain portion of the image is used. Unlike the Product case above, however, the building may fill the entire image. In that case, there is not necessarily a difference in the number of feature points or the hue variance between the central region and the peripheral region of the frame screen. On the other hand, the region in which the building appears is contiguous, and a building contains many straight lines because of its structural characteristics. In this embodiment, the following two feature amounts are therefore used to evaluate Shop.

(3.1) Region continuity

The unit regions of the frame screen are clustered according to their color distribution and frequency characteristics. For the clustering, a technique is used in which adjacent regions with similar features are merged by, for example, repeatedly replacing the feature amount of the pixel of interest with the average of the color feature amounts of the pixels using the mean shift method. The likelihood Ps that the image contains a Shop as an important object is increased according to the size and centroid of the regions merged in this way.

The mean shift method is discussed in detail in D. Comaniciu et al., "Mean Shift: A Robust Approach Toward Feature Space Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 24, no. 5, May 2002.

(3.2) Edge distribution

Edge extraction is performed on the binarized important frame candidate, and the likelihood Ps that the image contains a Shop as an important object is increased according to the number of extracted vertical edges that are at least a certain length.
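One simple way to count long vertical edges, assuming the edge extraction step has already produced a binary edge map, is to look for vertical runs of edge pixels in each column. The run-based definition of a "vertical edge", the function name, and the `min_length` parameter are assumptions made for illustration.

```python
def count_long_vertical_edges(edge_map, min_length):
    """Count vertical runs of edge pixels (value 1) that span at least
    `min_length` consecutive rows within a single column."""
    height, width = len(edge_map), len(edge_map[0])
    count = 0
    for x in range(width):
        run = 0
        for y in range(height):
            if edge_map[y][x]:
                run += 1
            else:
                if run >= min_length:
                    count += 1
                run = 0
        # A run that reaches the bottom row is closed here.
        if run >= min_length:
            count += 1
    return count


# Hypothetical 5x4 binary edge map (rows listed top to bottom).
edge_map = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 1],
]
long_edges = count_long_vertical_edges(edge_map, min_length=3)
```

How the resulting count maps to an increase in Ps is not stated in the specification, so that step is left out.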

In this embodiment, as described above, the values of Pp and Ps are each calculated as a weighted sum of the individual evaluation values and compared. Alternatively, the calculated feature amounts may be applied to statistical processing such as classification by an SVM (Support Vector Machine). In either case, an evaluation value of Person-likeness, an evaluation value of Product-likeness, and an evaluation value of Shop-likeness are calculated for each important frame.

In this embodiment, for each of the object types "Person", "Product", and "Shop", data for each of the above evaluation items is obtained in advance from positive and negative samples of representative important frames, and an SVM is constructed using this data as training data. The shot duration and the various information amounts extracted in the same way from each important frame candidate are then applied to the SVM, whereby the likelihood that each important frame candidate is an important frame is calculated for each object type.

In step S9, the combination extraction unit 107a of the important frame determination unit 107 extracts important frame candidates from the set of important frame candidates in combinations that match the appearance order and the number of appearances of the important objects described in the meta information. In step S10, the evaluation unit 107b adds up the evaluation values for each object type of the extracted important frame candidates according to the type of each important object described in the meta information, and the important frame candidates of the combination with the largest sum are determined to be the important frames.

FIG. 8 is a diagram showing how the important frame determination unit 107 evaluates each important frame candidate. Here, the description takes as an example the case where four important frame candidates A, B, C, and D, extracted from the video period corresponding to the first subsection of the meta information shown in FIG. 2, are evaluated based on that subsection.

In this embodiment, important frame candidate A has a "Shop"-likeness evaluation value of 0.95 and a "Product"-likeness evaluation value of 0.21. Similarly, the evaluation values of important frame candidate B are 0.51 and 0.41, those of candidate C are 0.01 and 0.91, and those of candidate D are 0.34 and 0.85, respectively.

According to the corresponding subsection, three important objects appear, in the order "Shop" → "Product" → "Product". Therefore, in step S9, all combinations that select three of the four candidates A, B, C, and D while preserving their order ([A, B, C], [A, B, D], [A, C, D], ...) are extracted, and in step S10 the total score is calculated for each combination on the assumption that the first candidate is a "Shop" and the second and third candidates are "Product".

For example, for the combination [A, B, C], the total score is 2.27 (= 0.95 + 0.41 + 0.91). Similarly, for the combination [A, B, D], the total score is 2.21 (= 0.95 + 0.41 + 0.85). In this embodiment, the combination [A, C, D] gives the highest total score, 2.71 (= 0.95 + 0.91 + 0.85).

In step S11, the candidates [A, C, D], whose total score is the highest at 2.71, are adopted as the combination of important frames. Accordingly, candidate A is taken as the important frame corresponding to "ABC Restaurant", candidate C as the important frame corresponding to "special curry rice", and candidate D as the important frame corresponding to "special hamburger steak". The remaining candidate B is discarded as a false candidate.
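The selection in steps S9 to S11 can be reproduced with a short enumeration over order-preserving combinations. The sketch below uses the evaluation values from the example above; the function and variable names are illustrative and do not come from the specification.

```python
from itertools import combinations

# Evaluation values per candidate and object type, as given in the example.
scores = {
    "A": {"Shop": 0.95, "Product": 0.21},
    "B": {"Shop": 0.51, "Product": 0.41},
    "C": {"Shop": 0.01, "Product": 0.91},
    "D": {"Shop": 0.34, "Product": 0.85},
}

# Appearance order of important objects described in the meta information.
expected_types = ["Shop", "Product", "Product"]


def best_combination(candidates, scores, expected_types):
    """Enumerate order-preserving combinations (step S9), sum the evaluation
    values for the expected object types (step S10), and return the
    combination with the highest total (step S11)."""
    best_combo, best_total = None, float("-inf")
    # combinations() keeps the input order, so temporal order is preserved.
    for combo in combinations(candidates, len(expected_types)):
        total = sum(scores[c][t] for c, t in zip(combo, expected_types))
        if total > best_total:
            best_combo, best_total = combo, total
    return best_combo, best_total


combo, total = best_combination(["A", "B", "C", "D"], scores, expected_types)
```

Exhaustive enumeration is adequate for a handful of candidates per subsection; for much larger candidate sets a dynamic-programming formulation would avoid the combinatorial growth.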

The above processing is executed for every subsection. When all important frames have been extracted from the program video, in step S12 the information presentation unit 109 creates thumbnails of the important frames for each object type.

FIG. 9 is a diagram showing an example of how the images of the important frames are displayed as a list of thumbnails. In this embodiment, "Store", "Product", and "Performer" tabs are provided. When the "Store" tab is clicked by remote control operation or the like, thumbnail images of the many important frames identified as "Shop" are displayed. Likewise, when the "Product" tab is clicked, thumbnail images of the many important frames identified as "Product" are displayed, and when the "Performer" tab is clicked, thumbnail images of the many important frames identified as "Person" are displayed.

Each thumbnail image is linked to information indicating the playback position of that image and to the text data described in the meta information. When the viewing user selects a desired tab and clicks one of the listed thumbnail images, the text information linked to that thumbnail image is displayed at the bottom of the screen. Further, when the thumbnail image is double-clicked or the like, the video is played back from the frame position of that thumbnail image.

The above embodiment has been described using the example of extracting important frames from the video of an information program that introduces a town. For a home shopping program, for example, images of the products introduced in the program can be extracted with high accuracy and presented to the viewer as a list, so that the viewer can quickly obtain information on a desired product. Furthermore, because text information is linked to the thumbnail images of the important frames in this embodiment, a viewer can quickly watch the video of a favorite product or performer by, for example, entering its name as text and running a search.

DESCRIPTION OF SYMBOLS: 1 ... important information extraction device, 2 ... video distribution device, 3 ... meta information providing device, 4 ... television/monitor, 101 ... video reception unit, 102 ... feature amount extraction unit, 103 ... shot division unit, 104 ... appearance duration detection unit, 105 ... in-screen information density distribution calculation unit, 106 ... important frame candidate evaluation unit, 107 ... important frame determination unit, 108 ... database, 109 ... information presentation unit, 110 ... meta information acquisition unit

Claims

1. An important information extraction device that analyzes video and extracts important information, comprising:
means for extracting feature amounts from the video;
means for dividing the video into a plurality of shots based on the feature amounts;
a database that stores representative feature amounts of important frames in which important objects appear;
means for extracting a representative frame from each shot;
means for extracting important frame candidates from the representative frames based on the feature amounts;
means for obtaining meta information of the video;
means for evaluating each important frame candidate by comparing each feature amount of the important frame candidate with the representative feature amounts of the important frames; and
means for extracting important frames from the important frame candidates based on the meta information and the evaluation value of each important frame candidate,
wherein the database stores unique feature amounts for each object type,
the evaluating means calculates an evaluation value for each object type by comparing the feature amounts of each important frame candidate with the unique feature amounts stored for each object type,
the meta information describes the number of appearances, the appearance order, and the object type of the important objects, and
the means for extracting the important frames includes:
means for extracting important frame candidates from the set of important frame candidates in combinations according to the appearance order and the number of appearances of the important objects; and
means for adding the evaluation values for each object type of the extracted important frame candidates based on the type of each important object, and extracting the important frame candidates of the combination that maximizes the sum as the important frames.

2. The important information extraction device according to claim 1, wherein the means for extracting the representative frame extracts a representative frame from a shot of a still image section.

3. The important information extraction device according to claim 1, wherein the means for extracting the representative frame extracts a representative frame from a shot of a moving-object tracking section.

4. The important information extraction device according to any one of claims 1 to 3, further comprising means for displaying the important frames as thumbnails in a list,
wherein each thumbnail image is linked to information for reproducing the video from the position of the corresponding important frame.

5. An important information extraction method for analyzing video and extracting important information, comprising:
a step of extracting feature amounts from the video;
a step of dividing the video into a plurality of shots based on the feature amounts;
a step of extracting a representative frame from each shot;
a step of extracting, from the representative frames, important frame candidates in which an important object appears, based on the feature amounts;
a step of obtaining meta information of the video;
a step of evaluating each important frame candidate by comparing the feature amounts of the important frame candidate with representative feature amounts of important frames; and
a step of extracting important frames from the important frame candidates based on the meta information and the evaluation value of each important frame candidate,
wherein the database stores unique feature amounts for each object type,
in the evaluating step, the feature amounts of each important frame candidate are compared with the unique feature amounts stored for each object type, and an evaluation value is calculated for each object type,
the meta information describes the number of appearances, the appearance order, and the object type of the important objects, and
the step of extracting the important frames includes:
a step of extracting important frame candidates from the set of important frame candidates in combinations according to the appearance order and the number of appearances of the important objects; and
a step of adding the evaluation values for each object type of the extracted important frame candidates based on the type of each important object, and extracting the important frame candidates of the combination that maximizes the sum as the important frames.