JP2004260847A

JP2004260847A - Multimedia data processing apparatus, and recording medium

Info

Publication number: JP2004260847A
Application number: JP2004116921A
Authority: JP
Inventors: Hisashi Aoki; 恒青木; Osamu Hori; 修堀; Toshimitsu Kaneko; 敏充金子
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2004-04-12
Filing date: 2004-04-12
Publication date: 2004-09-16

Abstract

PROBLEM TO BE SOLVED: To provide a data processing apparatus that determines and outputs the content category (such as drama or variety) of input multimedia data.

SOLUTION: Focusing on the video in the multimedia data, the apparatus successively stores the times at which scenes switch. When the interval between two of the last several scene switches is a multiple of 15 seconds, that time block is judged to be a commercial. A device connected downstream of this multimedia data processing apparatus can then process the multimedia data according to the program type.

COPYRIGHT: (C)2004, JPO&NCIPI

Description

The present invention relates to a multimedia data processing apparatus and method that determine and output the content of multimedia data, and to a recording medium.

With the spread of satellite and cable television broadcasting, the number of video channels that can be viewed at home keeps increasing. In addition to conventional video recording media such as video tapes and laser discs, DVDs, which record and reproduce video and audio information as digital data, have been commercialized, and it is fair to say that ordinary users increasingly have access to large amounts of video.

On the other hand, technology for efficiently accessing a desired video among this large amount of video has not advanced. For example, television viewers choose the video (program) they want to watch by consulting the program listings published in newspapers and the like, and consumers buying a packaged video choose it by consulting the title, cast, synopsis, and so on printed on the package.

These are textual descriptions of video information, but not every part of a video is described, and it goes without saying that describing everything would take a great deal of effort.

Moreover, which part of a video a user wants to watch naturally differs from user to user. Creating text descriptions that can serve every user increases the amount of additional information, which is impractical mainly for sources such as broadcast media, where the amount of information that can be supplied is limited. As a result, a television viewer who wants to watch only a particular item in a news program must leave the television on and wait for the topic of interest to come up, or is made to watch again news already seen on another program, which wastes a considerable amount of time.

If the video access environment remains this poor, then even if users gain access to large amounts of video information in the near future, they will not be able to take advantage of it smoothly, and there is a concern that more users will instead be confused by the sheer volume of information.

In this connection, the present applicant has already filed an application for a moving image processing method that lets a video be surveyed with a small number of icons (Japanese Patent Application Laid-Open No. 9-270006). That method estimates the scene structure of a video and presents a list of the video content scene by scene. When a drama is processed, for example, a scene consisting of a dialogue or the like can be displayed as a single icon.

However, scene structure differs between program genres such as drama, movies, and news. In news, for example, similar shots rarely reappear. Shots of the newscaster do appear repeatedly, but they usually mark the beginning of a news item, so applying the display method used for drama unchanged does not give good visibility.

Conversely, with the above invention, if the input video is known to be news, the same technique of finding repeatedly appearing shots can be used unchanged to present a list divided by news item.
JP-A-9-270006

The present invention is characterized by having means for inputting multimedia data, means for analyzing the data and estimating its category, and means for outputting the estimated category.

In other words, an object of the present invention is to determine the program type of multimedia data automatically and thereby facilitate media analysis that takes the obtained program type into account.

The multimedia data processing apparatus of the present invention is characterized by having content type estimating means for analyzing input multimedia data including video and estimating the content type to which the multimedia data belongs, and means for outputting the content type estimated by the content type estimating means.

The apparatus is also characterized by having content type estimating means for estimating the content type to which input multimedia data including video belongs from the history of the time intervals at which the picture switched in the video, and means for outputting the content type estimated by the content type estimating means.

The apparatus is also characterized by having holding means for holding in advance multimedia data whose content type is known; content type estimating means for comparing the known multimedia data held by the holding means with input multimedia data including video and estimating the content type to which the input multimedia data belongs; and means for outputting the content type estimated by the content type estimating means.

Further, the known multimedia data held by the holding means is a language dictionary, and the content type estimating means extracts character information from the input multimedia data and estimates that the content type to which that character information belongs in the dictionary is the content type of the input multimedia data.

Further, the invention is characterized by analyzing input multimedia data including video, estimating the content type to which the multimedia data belongs, and outputting the estimated content type.

Further, the invention provides a recording medium on which a program for processing multimedia data is recorded in computer-executable form, the program analyzing input multimedia data including video, estimating the content type to which the multimedia data belongs, and outputting the estimated content type.

As described above, the multimedia data processing apparatus of the present invention, which has video recognition, speech recognition, character recognition, text recognition, and dictionary management capabilities, can estimate and output the type of the input multimedia data (program), so a device connected downstream of the apparatus can process the multimedia data in a manner suited to that program type.

The multimedia data processing apparatus of the present invention is intended to have its output terminal connected to a multimedia data analysis device, so that the downstream analysis device automatically selects the most suitable analysis technique according to the program type information output by the apparatus.

Embodiments of the present invention will be described with reference to the drawings.

FIG. 1 shows the configuration of the multimedia processing apparatus of the present invention. FIG. 2 is a flowchart outlining the flow of the present invention: the multimedia data is read (S21), the content type of the data is estimated (S22), it is determined whether the content type could be estimated (S23), and if so, the content type is output (S24). Details follow.

(Embodiment 1) Multimedia data enters the input terminal 101 from a recording medium such as a DVD, CD, or hard disk, or from an upstream information processing device such as a personal computer. In the multimedia processing apparatus of the present invention, the cut detection unit 102 continuously checks the input multimedia data for scene changes (cuts) while in operation. When a scene change occurs, its time is recorded in the temporary storage unit 103.

The present invention exploits the fact that commercials (CMs) in current broadcasts usually last 15 seconds, 30 seconds, or 1 minute. The content type estimating unit 106 examines the times of the most recent several to several dozen cuts and compares, for example, the times of two of them; if the two cuts are separated by an exact multiple of 15 seconds, it determines that the time section between them was a CM and outputs the result from the output terminal 107.

FIG. 3 is a flowchart of this procedure. A frame of the multimedia data input from the input terminal 101 is read (S31), and the cut detection unit 102 detects whether it is a cut (S32). If it is, the times of the most recent several to several dozen cuts are examined (S33), and it is determined whether they occurred at an interval that is an exact multiple of 15 seconds (S34). If so, the section between those cuts is estimated to be a CM (S35).

FIG. 4 illustrates this technique. The short vertical lines crossing the time axis in FIG. 4 indicate cuts, and the numbers between cuts indicate the time between them in seconds. Because the interval between cut 201 and cut 202 is exactly 15 seconds, the time section from cut 201 to cut 202 is estimated to have been a CM. Using this technique, the content type estimating unit 106 in FIG. 1 outputs the time of cut 201, the time of cut 202, and the program type "CM" estimated for that section to the output terminal 107.
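
The interval test of steps S33 to S35 can be sketched in Python as follows. This is only an illustrative sketch, not the apparatus's actual implementation: the function name `commercial_spans`, the `tolerance` parameter (the text requires an exact multiple of 15 seconds, but measured cut times are rarely exact), and the default `history` of 20 cuts are all assumptions.

```python
def commercial_spans(cut_times, unit=15.0, tolerance=0.1, history=20):
    """Return (start, end) pairs of cut times whose spacing is a multiple of `unit`.

    cut_times: cut timestamps in seconds, in the order they were detected.
    history:   how many earlier cuts each new cut is compared against
               (the text says "several to several dozen"; 20 is an assumption).
    """
    spans = []
    for i, end in enumerate(cut_times):
        # Compare the newest cut only against the most recent `history` cuts.
        for start in cut_times[max(0, i - history):i]:
            gap = end - start
            multiple = round(gap / unit)
            if multiple >= 1 and abs(gap - multiple * unit) <= tolerance:
                spans.append((start, end))
    return spans


# Hypothetical cut times: 4.2 -> 19.2 is 15 s, 19.2 -> 49.2 is 30 s.
cuts = [0.0, 4.2, 19.2, 26.0, 49.2]
print(commercial_spans(cuts))
```

Each returned pair corresponds to a time section that would be reported together with the program type "CM". A real detector would probably also cap the span length, since two unrelated cuts can be 15 seconds apart by chance.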

(Embodiment 2) Next, a second embodiment of the multimedia processing apparatus will be described with reference to FIG. 1, already shown, and FIG. 5. Multimedia information input from the input terminal 101 (S51) is sent to the content type estimating unit 106, where it is compared with multimedia information whose program type is already known and which is stored in the temporary storage unit 103. The temporary storage unit 103 holds multimedia information that appeared in the past, or a compressed form of it, together with the program type information for that multimedia information.

The content type estimating unit 106 compares the multimedia information in the temporary storage unit 103 with the newly input multimedia information for similarity (S53), and if the two can be judged similar (S53), it outputs from the output terminal 107 the same program type as the one stored for the stored multimedia information (S54).

For example, suppose a drama begins with the same opening sequence every week, and the program type "drama" was already output for last week's broadcast. The temporary storage unit 103 then holds the video and audio of that program together with the program type "drama". When this week's broadcast is input, comparing its picture with the opening of last week's drama yields a match, and at that point the output "this broadcast is a drama" is produced.

Examples of the similarity judgment performed by the content type estimating unit 106 include, for video, examining the similarity in shape of the hue histograms of the entire screen, or taking the exact difference pixel by pixel and using the fact that the sum is at or below a fixed value as the criterion. These similarity comparison methods can be realized by applying techniques such as those introduced in Nagasaka et al., "Automatic Indexing Method and Object Search Method in Color Video Images" (Transactions of the Information Processing Society of Japan, Vol. 33, No. 4, 1992).

Instead of video, the similarity of audio waveforms may be examined. When the name of the program currently on air or similar information is transmitted as text data accompanying the broadcast, that text may be checked against stored text. Similarity comparisons on video, audio, and text may also be combined. The present invention is not limited to these similarity comparison techniques.
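
The two video similarity criteria mentioned above (hue histogram shape and summed pixel differences) might be sketched as follows. The text does not specify how histogram shapes are compared, so histogram intersection is used here purely as an assumed stand-in; the function names, the 16-bin default, and the representation of a frame as a flat list of values are likewise assumptions for illustration.

```python
def hue_histogram(hues, bins=16):
    """Normalized histogram of hue values given in degrees."""
    counts = [0] * bins
    for h in hues:
        counts[int((h % 360) * bins / 360)] += 1
    total = len(hues) or 1
    return [c / total for c in counts]


def histogram_similarity(hues_a, hues_b, bins=16):
    """Histogram intersection: 1.0 for identical distributions, 0.0 for disjoint ones."""
    hist_a = hue_histogram(hues_a, bins)
    hist_b = hue_histogram(hues_b, bins)
    return sum(min(a, b) for a, b in zip(hist_a, hist_b))


def pixel_difference(frame_a, frame_b):
    """Sum of absolute per-pixel differences between two equally sized frames."""
    if len(frame_a) != len(frame_b):
        raise ValueError("frames must have the same number of pixels")
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b))


def frames_match(frame_a, frame_b, max_difference):
    """Treat two frames as similar when the summed difference is at or below a fixed value."""
    return pixel_difference(frame_a, frame_b) <= max_difference
```

The histogram test tolerates small shifts of the picture, while the pixel-difference test is stricter; which one suits a given broadcast, and what threshold to use, would have to be determined experimentally.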

(Embodiment 3) Next, a third embodiment of the present invention will be described with reference to the drawings. The operation of the apparatus is again described using FIG. 1. FIG. 6 is a schematic diagram of the dictionary data stored in the program type dictionary 105 in FIG. 1. The figure is intended to make the explanation easier to follow; the dictionary data is not necessarily stored in this form. Multimedia data input from the input terminal 101 is passed to the dictionary word translation unit 104.

A dictionary word is a thesaurus expression in the language used in the program type dictionary 105. Suppose, for example, that the audio data in the multimedia contains the utterance "The XXth extraordinary Diet session was convened today...". The dictionary word translation unit 104 divides this sentence into phrases and extracts only those that are meaningful as dictionary words. From the words "XXth", "extraordinary", "Diet", "today", and "convened", the word characteristic of this sentence, "Diet", is selected.

Next, the word is matched against the thesaurus in FIG. 6, and the program type genre closest to "Diet" 302 in the dictionary is determined to be "politics" 301. This result is sent to the content type estimating unit 106, which outputs the result "politics" from the output terminal 107.

Alternatively, the apparatus may wait until several sentences have been input and then search for the program type genre that is closest overall to the characteristic words extracted from those sentences. The multimedia data compared against the dictionary is not limited to audio as in the example above: text information such as teletext may be used if it accompanies the broadcast, and captions superimposed on the screen may be recognized and their words used.
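
The dictionary lookup and the multi-sentence variant can be illustrated with a small Python sketch. Only the pairing of "Diet" with "politics" comes from the text; the other entries in `THESAURUS`, the function name `estimate_genre`, and the use of a simple majority vote as the measure of "closest overall" are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical thesaurus: each dictionary word maps to its nearest genre.
THESAURUS = {
    "Diet": "politics",      # the example given in the text
    "cabinet": "politics",   # assumed entry
    "election": "politics",  # assumed entry
    "home run": "sports",    # assumed entry
    "typhoon": "weather",    # assumed entry
}


def estimate_genre(words, thesaurus=THESAURUS):
    """Return the genre that most extracted words map to, or None if no word is known."""
    votes = Counter(thesaurus[w] for w in words if w in thesaurus)
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Words extracted from the example utterance; only "Diet" is in the dictionary.
print(estimate_genre(["XXth", "extraordinary", "Diet", "today", "convened"]))
```

Passing the words from several sentences in one list gives the "wait for multiple sentences" behaviour; words from speech, teletext, or recognized captions can be mixed freely because the function only sees strings.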

FIG. 7 illustrates a different application of the same kind of dictionary comparison. Personal names may be registered in the program type dictionary 105.

For example, if performers address each other by name in the program, the cast may be estimated from that audio data, or the captions superimposed on the screen when a performer appears may be recognized. For a movie or the like, the text of the end credits shown after the finale may be recognized. If cast information is available as text accompanying the program, it may be used directly. FIG. 7 shows an image of dictionary data that records, for each performer name obtained in this way, the program types in which that person is likely to appear. For example, the names of performers who often appear in variety programs are placed near the program type "variety" 401 (that is, they are strongly linked to it). Accordingly, if many of the performers appearing in a program are placed near "variety" 401, the program is judged to be "variety".

However, if such performers appear but performers belonging to the category "singer" are even more numerous, the program is judged to be a "song program". The technique is not limited to living people. For example, if the dictionary is designed so that names such as "Iemitsu", "Ooka Tadasuke", and "Saikaku" are placed near "history" and "period drama", multimedia data of those genres can also be expected to be classified correctly.
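
One way to model the "nearness" of a name to a program type is a weighted link, summed over everyone who appears. The sketch below assumes that model; the performer names, the weights, and the function name `genre_from_performers` are invented for illustration and do not come from the dictionary of FIG. 7.

```python
# Hypothetical link strengths between performer names and program types.
PERFORMER_LINKS = {
    "Talent A": {"variety": 0.9, "drama": 0.2},
    "Talent B": {"variety": 0.8},
    "Singer C": {"song program": 0.9, "variety": 0.3},
    "Singer D": {"song program": 0.8},
    "Singer E": {"song program": 0.7},
}


def genre_from_performers(names, links=PERFORMER_LINKS):
    """Sum link strengths over all recognized performers and return the strongest genre."""
    scores = {}
    for name in names:
        for genre, weight in links.get(name, {}).items():
            scores[genre] = scores.get(genre, 0.0) + weight
    return max(scores, key=scores.get) if scores else None


print(genre_from_performers(["Talent A", "Talent B"]))
print(genre_from_performers(["Talent A", "Singer C", "Singer D", "Singer E"]))
```

In the second call a variety performer is present, but the singers outweigh that link, which mirrors the "song program" case described above. Historical or fictional names can be added to the same table without changing the function.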

The dictionary may also store place names. If the dictionary word translation unit 104 can determine the shooting location from the video, words such as "Aizu Bandai-san", "Oirase", and "Kirishima" can be placed near "travelogue" and "travel". The shooting location may be determined, for example, by similarity comparison against a database (not shown) of characteristic landscapes, or extracted from on-screen text or audio. If information such as the shooting location is broadcast as accompanying text data, it may be used directly.

(Embodiment 4) Other program type determination techniques will now be described. In a news program, for example, a bust shot of one or two newscasters often marks the beginning of a news item, as in 501 in FIG. 8 (A and B denote newscaster shots; X denotes a shot that is neither A nor B, and each X is different; the same applies below). Multimedia data may therefore be judged to be "news" when shots showing a person exist in one to three patterns, each pattern appears two or more times, and the longest interval between appearances is one minute or more.
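
The news rule can be sketched as a check over a labelled shot list. The sketch assumes that an upstream step has already grouped visually similar shots under the same label and flagged whether a person is visible; it also reads the rule as applying only to person shots that recur, so that one-off X shots are ignored. Both points, and the name `looks_like_news`, are interpretations rather than something the text states.

```python
def looks_like_news(shots, min_gap=60.0):
    """Apply the news heuristic to a time-ordered shot list.

    shots: list of (label, start_time_sec, shows_person) tuples, where shots
           judged visually similar share the same label (assumed upstream step).
    """
    appearances = {}
    for label, start, shows_person in shots:
        if shows_person:
            appearances.setdefault(label, []).append(start)
    # Keep only person-shot patterns that appear two or more times.
    recurring = {k: v for k, v in appearances.items() if len(v) >= 2}
    if not 1 <= len(recurring) <= 3:
        return False
    # Each pattern's longest gap between appearances must reach min_gap.
    return all(
        max(b - a for a, b in zip(times, times[1:])) >= min_gap
        for times in recurring.values()
    )


news_like = [
    ("A", 0, True), ("X1", 20, False), ("A", 90, True), ("X2", 110, True),
    ("B", 200, True), ("X3", 220, False), ("B", 300, True),
]
print(looks_like_news(news_like))
```

A rapid A, B, A, B exchange such as a drama dialogue fails the gap test, which is what separates it from newscaster shots that return only at the start of each item.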

News program determination has also been proposed in Sakauchi, "Image and Multimedia Database" (Journal of the Institute of Television Engineers of Japan, Vol. 46, No. 11, pp. 1474) and elsewhere. The technique of the present invention differs in that it uses the pattern in which shots appear. Combining several kinds of information in this way yields more reliable program type determination.

A program type determination technique that makes use of the dialogue scene detection used in the applicant's pending Japanese Patent Application Laid-Open No. 9-270006 is also conceivable. 502, 503, and 504 in FIG. 8 each contain dialogue scenes (sections in which the letters repeat). A long dialogue scene lasting five minutes or more, as in 502, can be judged to be a "variety" program, because variety programs are often shot with three or four fixed cameras.

A program in which dialogue scenes with the same screen composition continue throughout, as in 503, can be judged to be a "talk" program. In a variety program the picture changes from segment to segment, so dialogue scenes with the same pattern rarely recur, whereas a talk program is often shot on the same set throughout. Further, when dialogue scenes with the same pattern do not recur but dialogue scenes with several different patterns appear intermittently, as in 504, the program can be judged to be a "drama". Combining these with person detection and the like can be expected to improve accuracy further.
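
The three dialogue-scene cases (502, 503, and 504) could be combined into a single classifier along the following lines. The five-minute figure comes from the text; the 80% coverage threshold for "throughout the program", the order in which the rules are tried, and the function name `classify_by_dialogue` are assumptions made only for this sketch.

```python
def classify_by_dialogue(scenes, program_length, talk_coverage=0.8, long_scene=300.0):
    """Guess a program type from detected dialogue scenes.

    scenes: list of (pattern_id, start_sec, end_sec); scenes with the same
            screen composition share a pattern_id (assumed upstream step).
    """
    if not scenes:
        return None
    coverage = {}
    for pattern, start, end in scenes:
        coverage[pattern] = coverage.get(pattern, 0.0) + (end - start)
    # 503: one dialogue pattern spans most of the program.
    if max(coverage.values()) >= talk_coverage * program_length:
        return "talk"
    # 502: at least one dialogue scene lasts five minutes or more.
    if any(end - start >= long_scene for _, start, end in scenes):
        return "variety"
    # 504: several dialogue scenes, none of whose patterns recurs.
    if len(scenes) >= 2 and len(coverage) == len(scenes):
        return "drama"
    return None


print(classify_by_dialogue([("p1", 0, 60), ("p2", 300, 380), ("p3", 900, 960)], 1800))
```

The talk test is tried first because a talk program would otherwise also satisfy the long-scene test; a different ordering or thresholds may work better on real broadcasts.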

A program that consists largely of landscape shots and has slow camera work can also be judged to be a "travelogue".

Even when the program type has been judged to be "news" by the above procedure, a more reliable determination is possible if the content type estimating unit 106 has a music recognition function. For example, if a program has the structure of "news" but the proportion of time in which music or sound effects are used exceeds a set proportion, it can be estimated to be a "documentary".
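
The refinement from "news" to "documentary" amounts to a single ratio test, sketched below. The text only says that the proportion is "set", so the default of 0.3 and the name `refine_news` are assumptions.

```python
def refine_news(program_type, music_seconds, total_seconds, max_music_ratio=0.3):
    """Reclassify a "news" result as "documentary" when music or effects dominate.

    music_seconds: time during which music or sound effects were detected
                   (assumed to come from a separate music recognition step).
    """
    if program_type != "news" or total_seconds <= 0:
        return program_type
    if music_seconds / total_seconds > max_music_ratio:
        return "documentary"
    return program_type


print(refine_news("news", 600, 1800))  # music for a third of the running time
```

Results other than "news" pass through unchanged, so the function can be applied after any of the earlier determination steps.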

"Sports" programs can be estimated as follows. For example, a program in which similar camera-angle patterns appear throughout, panning (shooting while moving the camera left and right) is used frequently, and scenes showing several people at small size appear often can be estimated to be "sports".

The content type estimating unit 106 may also analyze the audio pattern and use the intermittent appearance of cheering from the beginning to the end of the program, together with the above, to estimate "sports".

Further, if captions that change with a period longer than 30 seconds appear frequently at a fixed position in a corner of the screen, they can be presumed to be a game score and the program estimated to be "sports".

The present invention can also be realized as software, by storing the procedures shown in FIGS. 2, 3, and 5 as a computer-executable program on an FD or CD-ROM.

FIG. 1 is a block diagram of the multimedia data processing apparatus of the present invention.
FIG. 2 is a diagram showing the procedure of a processing method of the multimedia data processing apparatus of the present invention.
FIG. 3 is a diagram showing the procedure of a processing method of the multimedia data processing apparatus of the present invention.
FIG. 4 is a diagram showing the CM detection method of the multimedia data processing apparatus of the present invention.
FIG. 5 is a diagram showing the procedure of a processing method of the multimedia data processing apparatus of the present invention.
FIG. 6 is a diagram showing dictionary data of the multimedia data processing apparatus of the present invention.
FIG. 7 is a diagram showing dictionary data of the multimedia data processing apparatus of the present invention.
FIG. 8 is a flowchart showing the arrangement of shots used when the multimedia data processing apparatus of the present invention determines the program type.

Explanation of reference numerals

101 ... Multimedia data input terminal
102 ... Cut detection unit
103 ... Temporary storage unit
104 ... Dictionary word translation unit
105 ... Program type dictionary
106 ... Content type estimating unit
107 ... Multimedia data output terminal

Claims

A multimedia data processing apparatus comprising:
first input means for inputting first multimedia data including video;
second input means for inputting second multimedia data whose content type is known;
similarity comparing means for comparing whole-screen similarity between one or more video frames of each of the first multimedia data and the second multimedia data;
content type estimating means for estimating, when the similarity comparing means determines that the similarity is equal to or greater than a predetermined value, that the content type of the first multimedia data is the same as the content type of the second multimedia data; and
content type output means for outputting the content type estimated by the content type estimating means.

The multimedia data processing apparatus according to claim 1, wherein
the second multimedia data is language dictionary data,
the similarity comparing means comprises:
means for extracting linguistic information by caption recognition from a video signal in the first multimedia data; and
means for comparing and matching the extracted linguistic information against the dictionary data, and
the content type estimating means estimates the content type to which the linguistic information belongs in the dictionary data as the content type of the first multimedia data.

A computer-readable recording medium on which is recorded a multimedia data processing program for causing a computer to function as:
first input means for inputting first multimedia data including video;
second input means for inputting second multimedia data whose content type is known;
similarity comparing means for comparing whole-screen similarity between one or more video frames of each of the first multimedia data and the second multimedia data;
content type estimating means for estimating, when the similarity comparing means determines that the similarity is equal to or greater than a predetermined value, that the content type of the first multimedia data is the same as the content type of the second multimedia data; and
content type output means for outputting the content type estimated by the content type estimating means.

A computer-readable recording medium on which a multimedia data processing program is recorded, wherein
the second multimedia data is language dictionary data,
the similarity comparing means comprises:
means for extracting linguistic information by caption recognition from a video signal in the first multimedia data; and
means for comparing and matching the extracted linguistic information against the dictionary data, and
the content type estimating means estimates the content type to which the linguistic information belongs in the dictionary data as the content type of the first multimedia data.