JP2006120018A

JP2006120018A - Image data processor, image processing method, program and recording medium

Info

Publication number: JP2006120018A
Application number: JP2004308928A
Authority: JP
Inventors: Hiroko Ida; 裕子井田
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2004-10-22
Filing date: 2004-10-22
Publication date: 2006-05-11

Abstract

PROBLEM TO BE SOLVED: To enable flexible video retrieval by reflecting the intent of the person who inputs information when the contents of video data are described.
SOLUTION: An image data editing/retrieving device 1 for editing and retrieving image data is provided with: a scene change detecting part 308, which contains a characteristic analyzing part 309 that analyzes image feature quantities and audio features in the video and detects scene changes on the basis of the analysis result; a message preparing part 301 for outputting question items from the device as text and voice; and an inputting part 312 for receiving content description input from a user, as text and voice, in the form of answers to the question items. In addition, a question item preparing part 310 adds free description items as question items.
COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to video data processing, and more particularly to video data processing for structuring video so that a desired video can be selectively viewed from among a large amount of video.

In recent years, with the development of communication technology and the increasing capacity of hard disks, systems have been developed that accumulate large amounts of video data and allow a desired video to be selected and viewed from that data. Specifically, systems have been proposed in which content descriptions are attached to video information so that captured video can be reused, and video search is performed on the basis of this description information. In such systems, a description is generally made by playing the video, stopping at a point where the content forms a coherent unit, and describing that portion. A video that has been shot is usually watched at least once with the people involved in the shooting present; on that occasion they talk among themselves about the circumstances of the shooting, or explain them to a third party who does not know those circumstances.
In such a system, in order to select desired video data, the video data must be divided in advance into a structure on a semantic basis, and information about each division must be described. To describe the contents of video data in this way, it is necessary first to rewind the video, detect semantically separable scenes while playing it back, and then input content matching each scene. This takes at least several times the running time of the video data and requires a great deal of labor.
To address this problem, a video data editing/retrieval device has been proposed (see Patent Document 1) that reduces the time and labor of content description by automatically detecting the scene changes that serve as units of content description. In that device, a character appearing on the monitor plays the role of a third party who does not know the video content and to whom the circumstances of the shooting are explained, so that even a child watching a video shot by the family can describe its content while enjoying a conversation with the character.
Patent Document 1: JP 2003-178076 A
Non-patent documents:
ITU-T G.729 Annex B, “A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70”
Nishida et al., “Examination of speaker indexing in drama,” Acoustical Society of Japan, Spring 1999
Fujisaki, Ohno et al., “Individual differences in fundamental frequency pattern parameters,” Acoustical Society of Japan, Spring 1999
“Latest Trends in Image Processing Algorithms,” New Technology Communications
“Practical Image Processing Learned in C,” Ohmsha
“Human Detection in Unstable Brightness Environments,” Proceedings of the 1996 IEICE Society Conference
Toshiaki Yokoyama, “Program Generation / Interactive Editing System Based on TV Program Description Language TVML,” 3rd Intelligent Information Media Symposium, December 1997
Toshiaki Yokoyama, “Development of Machine Interface for TV Program Description Language TVML,” Electronic Information Society Conference, 1997

However, the above conventional technique generates questions using feature quantities of the video data, and content can be described only in the form of answers to those questions, so the intention of the person who inputs the information is not reflected.
An object of the present invention is to solve this problem and to enable flexible video search by reflecting the intention of the person who inputs information when the contents of video data are described.

To solve the above problem, the invention according to claim 1 is a video data processing apparatus that adds video content information, scene by scene, to video data recorded together with audio, comprising question item creating means for creating question items concerning a scene, and input means for accepting content description input from a user by text or voice in correspondence with the question items, wherein the question item creating means adds a description item input as a free description to the question items.
Claim 2 is the video data processing apparatus according to claim 1, wherein the question item creating means deletes any question item for which no information was input from the input unit.
Claim 3 is the video data processing apparatus according to claim 1 or 2, further comprising search means for searching for the content-described portion, wherein the video data is reproduced from the scene, or the portion of a scene, found by the search means.
Claim 4 is the video data processing apparatus according to claim 3, wherein the search means includes communication means that enables a search of the content-described portions of video data held in another video data processing apparatus on a network.
Claim 5 is a video data processing method that adds video content information, scene by scene, to video data recorded together with audio, comprising a question item creating step of creating question items concerning a scene, and an input step of accepting content description input from a user by text or voice in correspondence with the question items, wherein the question item creating step adds a description item input as a free description to the question items.
Claim 6 is the video data processing method according to claim 5, wherein the question item creating step deletes any question item for which no information was input from the input unit.
Claim 7 is the video data processing method according to claim 5 or 6, further comprising a search step of searching for the content-described portion, wherein the video data is reproduced from the scene, or the portion of a scene, found by the search.
Claim 8 is the video data processing method according to claim 7, wherein the search step has a communication function that enables a search of the content-described portions of video data held in another video data processing apparatus on a network.
Claim 9 is a program for causing a computer to function as the video data processing apparatus according to claim 1, 2, 3, or 4.
Claim 10 is a computer-readable recording medium on which the video data processing program according to claim 9 is recorded.

As described above, in addition to the content description items prompted by the text and voice questions presented each time the video scene changes, the editor can freely describe content according to his or her intention. Because freely described items are added as question items, description items that match the editor's intention can be presented at the next editing session.
In addition, because description items for which nothing was input in response to the questions are deleted, only the necessary content description items need be input, which is efficient.
Furthermore, video data can be processed and searched on a stand-alone device, and a desired video can also be searched for and reproduced using information registered in video data processing apparatuses connected over a network.
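The handling of question items summarized above (free-description items are added, unanswered items are deleted) can be illustrated with a minimal sketch. It is only one possible reading of the text: the class name `QuestionList`, the method `record_session`, the initial item names, and the choice to drop an item after a single unanswered session are all assumptions made for illustration and do not appear in the specification.

```python
from dataclasses import dataclass, field


@dataclass
class QuestionList:
    """Hypothetical container for the question items asked at each scene change."""
    items: list = field(default_factory=lambda: ["when", "where", "who", "what"])

    def record_session(self, answers, free_items):
        """Store one scene's input and update the question items for next time.

        answers    -- question item -> answer text ("" or missing = not answered)
        free_items -- free-description item name -> text entered by the editor
        """
        # Keep only answers that belong to current question items and are non-empty.
        description = {q: a for q, a in answers.items() if q in self.items and a}
        description.update({k: v for k, v in free_items.items() if v})
        # Delete question items for which no information was input (assumed policy).
        self.items = [q for q in self.items if answers.get(q)]
        # Add free-description items so that they are asked next time.
        for name, text in free_items.items():
            if text and name not in self.items:
                self.items.append(name)
        return description
```

For example, if the editor leaves “where” blank and adds a free item “weather”, the next session would ask “when”, “who”, “what”, and “weather”.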

Embodiments of the present invention will now be described in detail. First, an outline of the invention is given. As mentioned above, a video that has been shot is usually watched at least once with the people involved in the shooting present, and on that occasion they talk among themselves about the circumstances of the shooting or explain them to a third party who does not know those circumstances.
In the video data processing apparatus of the present invention, a character appearing on the monitor (realized with TVML or the like) plays the role of this third party who does not know the circumstances. As a result, even a child watching a video shot by the family can describe its content while enjoying a conversation with the character on the monitor. In this embodiment, the video data processing apparatus does not edit or process the video data itself, but because it adds and searches information associated with the video data, it is described as a video data editing/retrieval system (apparatus).
To describe content, the content description mode is first selected on the video data editing/retrieval apparatus. A message announcing that content description is about to begin is then output by voice or shown on the display. The tape is played back, and the video data is analyzed for scene changes by the scene change detection device. When a change point is detected, the tape stops and questions are asked to obtain the information (5W1H and so on) needed to describe the content corresponding to that change point.
For example, the character asks, “When did you go?” The user answers into the microphone, “December ○, 20XX.” The character then asks further questions to elicit the description, such as “Where did you go?”, “Who with?”, and “What did you do?” Once all the questions for the content description items of one scene have been asked, the character asks whether input for the current scene may be ended, for example: “That sounds like fun. Is there anything else you want to tell me?”

If the user replies that input is finished, for example “That's all,” the next scene is played back, the questions for describing the video content are repeated as above, and the described results are recorded. If there is more to input, the user replies with something like “There's more.” The character then asks by voice, “What is it?”, a list of input items is shown on the monitor, and the user is instructed to choose one of the displayed items and answer. If the user selects “what”, the character asks, “What did you do?” After the answer, the character again confirms whether content description may be ended. In addition to the question items, the editor can also freely input anything he or she wishes to describe about the scene.
Content description proceeds through this dialogue, and when the video has been played back to the end, an end message is output and the content description menu is closed.
When the user inputs a search request against data described in this way, the search unit of the video data retrieval apparatus searches the content description data for entries that match the request and, if a match is found, displays its information.
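The matching performed by the search unit is not specified beyond finding descriptions that agree with the request. The following sketch assumes a simple case-insensitive substring match per description item; the record layout (keys `start` and `fields`) and the function name `search_scenes` are hypothetical.

```python
def search_scenes(descriptions, query):
    """Return the scene records whose description matches every query term.

    descriptions -- list of dicts with hypothetical keys "start" (seconds into
                    the tape) and "fields" (description item -> text)
    query        -- description item -> term that must occur in that item's text
    """
    hits = []
    for scene in descriptions:
        fields = scene["fields"]
        if all(term.lower() in fields.get(item, "").lower()
               for item, term in query.items()):
            hits.append(scene)
    return hits
```

The `start` value of each hit could then be used to begin playback from the matching scene, in line with the search-and-reproduce behaviour described above.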

The details of the present invention will now be described with reference to the drawings. FIG. 1 shows an example of the system configuration of an interactive video data editing/retrieval apparatus that edits and retrieves video data according to the present invention. FIG. 2 shows the processing flow in the video data editing/retrieval apparatus. FIG. 3 shows the configuration of the video data editing/retrieval apparatus. FIG. 4 shows an example of the content description interface screen displayed when video is edited with the video data editing/retrieval apparatus. FIG. 5 shows the scene change detection processing block. FIG. 6 shows the flow from video changes to question generation. FIG. 7 shows the question items and flag bit patterns. FIG. 8 shows an example of the TVML creation block. FIG. 9 shows the configuration of the meta information file creation unit. FIG. 10 shows an example of a TVML script. FIGS. 11 to 13 show examples of MPEG-7 content description files.
First, the system configuration is described with reference to FIG. 1. The system comprises: a video data editing/retrieval device 1; a display device 2 for displaying the visual information (image information) of the video data and of the generated TVML information; a speaker 3 for presenting the audio information of the video data and of the TVML information; a disk drive device (hereinafter simply “disk device”) 4 that drives a disk serving as the recording medium for the structure information and content description information created for the video data; and an input device 5 for inputting descriptions and search conditions by voice.
The video data editing/retrieval device 1 and the display device 2 are connected by a video signal cable 9, and the video data editing/retrieval device 1 and the speaker 3 are connected by an audio signal cable 6. The video data editing/retrieval device 1 and the disk device 4 are connected by a data cable 7 conforming to, for example, the SCSI (Small Computer System Interface) specification. The input device 5 is connected to the video data editing/retrieval device 1 by an audio signal cable 8.

The internal configuration of the video data editing/retrieval device 1 is shown in FIG. 3, described later. The display device 2 is, for example, a 19-inch CRT (cathode ray tube). The disk device 4 is, for example, a device that records and reproduces data using a magnetic disk as the recording medium. The magnetic disk stores one or more meta information files associated with the video file. The meta information file is described later.
The recording medium is not limited to a magnetic disk and may be another recording medium, for example an optical disc such as a CD-R (writable CD), CD-RW (rewritable CD), or DVD (digital video disc), or a magneto-optical disc. The disk device 4 may also be built into the main body.
The audio signal cable 8 that connects the input device 5 to the video data editing/retrieval device 1 is connected directly to the audio signal input/output unit 312 of FIG. 3 inside the device, so that audio input from the input device 5 is supplied to the video data editing/retrieval device 1 as is.

FIG. 3 shows the internal configuration of the video data editing/retrieval device 1. As shown in the figure, the device comprises: a CPU 300 that controls the operation of the entire device; a video deck 320 that plays back video tapes; a scene change detection unit 308 that automatically detects scene change points using image and audio feature quantities; a video drive control unit 314; a video signal input/output unit 313; an A/D converter 311 that converts the audio signal input from an audio signal input/output unit 312 into a digital signal; a video data search unit 315 that searches the video data for desired data; a message creation unit 301 that creates the message output program for the interactive interface; a meta information creation unit 302 that creates, as a meta information file, the structure and content description information of the video based on scene change points; a memory 306; a hard disk drive (HDD) 307 storing the OS, application programs, and the like; and a display connection unit 317 that functions as the interface to the display device 2.
The OS stored in the HDD 307 provides a GUI, so that the user can operate the device interactively, mainly through the input device 5 and the screen of the display device 2. The video drive control unit 314 inputs and outputs control signals for driving the video deck. The units of the video data editing/retrieval device are interconnected by a system bus 316.

The operation of the video data editing/retrieval apparatus configured as above is described using the processing flow shown in FIG. 2. First, video playback starts at 201 in FIG. 2, and the tape is played on the video deck 320 shown in FIG. 3. At 203 in FIG. 2, the audio and image feature quantities of the video are analyzed. As shown in FIG. 3, the video is separated into a video signal and an audio signal, which are output from the video signal input/output unit 313 and the audio signal input/output unit 312, respectively. The methods for analyzing the separated audio and video signals are described next.
The audio signal output from the audio signal input/output unit 312 is converted from analog to digital by the A/D converter 311. Because the ordinary speech band is taken to be about 4 kHz, sampling is performed at 8 kHz. The digitized audio signal is input to the scene change detection unit 308.
Next, the flow from video changes to question generation is described with reference to FIG. 6. At 603, the audio signal is divided into frames of roughly a few ms to 40 ms. The framed digital audio signal is then used to detect video changes by detecting sound/silence intervals, detecting speech intervals, and determining speaker changes.
To detect sound resuming after a period of silence, the power of the sound (the zeroth-order autocorrelation function) is calculated at 604, and sound/silence intervals are detected at 605 from its magnitude. If the time T−t between the time T at which a silent interval was detected and the time t at which a sound interval was detected exceeds a threshold τ, the speech interval flag 640 is set to 1.
Alternatively, the VAD method described in “ITU-T G.729 Annex B, A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70” can be applied; it identifies sound intervals by comparing the spectral distortion, energy difference, low-band energy difference, and zero-crossing count difference against preset thresholds. For detecting speech intervals, a method such as that shown in “Examination of speaker indexing in drama, Nishida et al., Acoustical Society of Japan, Spring 1999” can be applied.
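The power-based detection of steps 603 to 605 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of the embodiment: the function names, the 20 ms frame length, the power threshold, and the value of τ (`tau`) are hypothetical, and the G.729 Annex B VAD mentioned above is not reproduced here.

```python
def frame_power(frame):
    """Zeroth-order autocorrelation R(0) of one frame, normalised by frame length."""
    return sum(s * s for s in frame) / len(frame)


def sound_after_silence(samples, rate=8000, frame_ms=20, power_thresh=1e-4, tau=0.5):
    """Return the times (s) at which sound resumes after more than tau s of silence."""
    n = rate * frame_ms // 1000           # samples per frame (assumed 20 ms frames)
    events = []
    silence_start = None                  # start time of the current silent run
    for i in range(0, len(samples) - n + 1, n):
        t = i / rate
        sounding = frame_power(samples[i:i + n]) > power_thresh
        if not sounding:
            if silence_start is None:
                silence_start = t
        else:
            # Sound detected: flag it only if the preceding silence was long enough.
            if silence_start is not None and t - silence_start > tau:
                events.append(t)
            silence_start = None
    return events
```

Each returned time corresponds to a point at which the flag described above would be set to 1; how the flag is stored and consumed is left open here.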

At 608, the variance of the first-order cepstral coefficient is calculated from the digital audio that was divided into frames at 603 in FIG. 6. For speech, the variance of the first-order cepstral coefficient is larger than for music or noise, so speech intervals can be detected at 609 using a preset threshold. When a speech interval is detected, the speech interval flag is set to 1 at 611.
Speaker changes are then determined within the speech interval. The method described in “Individual differences in fundamental frequency pattern parameters, Fujisaki, Ohno et al., Acoustical Society of Japan, Spring 1999” can be applied. It exploits the fact that the fundamental frequency of speech and the time constants of the phrase control mechanism and the accent control mechanism, which are closely tied to the physical and physiological characteristics of the model, take nearly constant values for a given speaker. Steps 612 to 615 are the analysis using this method.
First, the fundamental frequency is calculated at 612. It can be calculated by the autocorrelation method, the modified correlation method, or the cepstrum. The cepstrum is defined as the inverse Fourier transform of the logarithm of the short-time amplitude spectrum of the waveform, and allows the spectral envelope and the fine structure to be approximately separated. Details are given in “Digital Speech Processing,” Tokai University Press.
At 613, successive approximation by the Analysis-by-Synthesis method, based on a model of the fundamental frequency pattern generation process, yields the base frequency, the magnitude and onset time of each phrase command, the magnitude and onset time of each accent command, and the time constants of the phrase and accent control mechanisms. Speaker change is judged at 614 from the magnitudes of these values. If the speaker is judged to have changed, the speaker change flag is set to 1 at 615.
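Of the options named for step 612, the autocorrelation method is the simplest to illustrate. The sketch below estimates the fundamental frequency of one frame as the lag at which the autocorrelation is largest. The function name `estimate_f0` and the search range of 60 to 400 Hz are assumptions; the Analysis-by-Synthesis fitting of step 613 and the speaker-change decision of step 614 are not covered.

```python
def estimate_f0(frame, rate=8000, fmin=60.0, fmax=400.0):
    """Estimate the fundamental frequency (Hz) of one frame by autocorrelation.

    The lag search range follows from the assumed pitch range fmin..fmax.
    Returns 0.0 if no positive autocorrelation peak is found.
    """
    lag_min = int(rate / fmax)
    lag_max = min(int(rate / fmin), len(frame) - 1)
    best_lag, best_r = 0, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag over the overlapping part of the frame.
        r = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return rate / best_lag if best_lag else 0.0
```

With 8 kHz sampling and 40 ms frames (320 samples), the lag range covers 20 to 133 samples, which is enough to include at least one full pitch period for the assumed range.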

Next, the video signal analysis method is described. The video signal from the video signal input/output unit 313 of FIG. 3 is input to the scene change detection unit 308 and divided into frames at 617 in FIG. 6. Scene changes are detected first in the framed images. One method detects scene changes from the average luminance. Each pixel on the display has a luminance value that varies over time with scanning; at 618, the average luminance of one frame is calculated as the arithmetic mean over the coordinates of all pixels (x, y) in that frame.
Because this average can change greatly when the scene changes abruptly, the average luminance is calculated for every frame; if at 639 its change is equal to or greater than a predetermined threshold α, a scene change is judged to have occurred and the scene change flag is set to 1 at 638.
Edge extraction is then performed at 619 to extract candidate regions where objects exist. Applicable edge extraction methods include the zero-crossing method, which uses the property that the zero crossing of the second derivative coincides with the position of a true step edge, and the mask operator method of Jacobus et al., which creates a threshold image taking into account the variance and mean within the mask. Details of the edge extraction methods are given in “Latest Trends in Image Processing Algorithms, New Technology Communications”.
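The luminance-based scene change test (steps 618, 639, and 638) amounts to comparing per-frame mean luminance against the threshold α. A minimal sketch follows, assuming each frame is given as a list of rows of luminance values; the function names and the comparison of each frame with its immediate predecessor are assumptions.

```python
def mean_luminance(frame):
    """Arithmetic mean of the luminance over all pixels (x, y) of one frame."""
    total = sum(sum(row) for row in frame)
    return total / (len(frame) * len(frame[0]))


def scene_change_flags(frames, alpha):
    """Return one flag per frame: 1 if mean luminance changed by at least alpha."""
    if not frames:
        return []
    flags = [0]                            # the first frame has no predecessor
    prev = mean_luminance(frames[0])
    for frame in frames[1:]:
        cur = mean_luminance(frame)
        flags.append(1 if abs(cur - prev) >= alpha else 0)
        prev = cur
    return flags
```

A single global mean is cheap to compute but cannot distinguish two different scenes of similar overall brightness, which is one reason the text goes on to use edge and gradient features as well.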

Furthermore, the presence of an object can be detected by calculating the curvature and the RGB color distribution of the edge-extracted region and cutting out the image using changes in these values. Details are described in “Practical Image Processing Learned in C”, Ohmsha. To determine whether a person is present, the method described in “Person Detection Method in an Environment with Unstable Brightness”, Proceedings of the 1996 IEICE Society Conference, can be applied.
At 621, a gradient direction difference is obtained for the edge-extracted region as the amount of change in the direction vector at each coordinate position. First, two images are compared and the region in which the luminance difference is equal to or greater than a certain value is extracted. First-order differentiation with the Sobel operator is then applied to this changed region, and a gradient vector representing the edge direction is created from the horizontal and vertical components. Gradient vectors at the same coordinate position are then compared to detect the gradient direction difference, that is, the difference between their directions, and a gradient direction difference histogram is created at 624.
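The steps above (changed-region extraction, Sobel differentiation, direction comparison, histogram) can be sketched as follows. This is an illustrative sketch under assumptions: images are lists of rows of luminance values of equal size, border pixels are skipped, and the bin count and the names `gradient`, `direction_difference_histogram`, and `diff_threshold` are hypothetical.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))   # horizontal component
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))   # vertical component


def gradient(img, x, y):
    """Sobel gradient vector (gx, gy) at an interior pixel (x, y)."""
    gx = gy = 0.0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            v = img[y + dy][x + dx]
            gx += SOBEL_X[dy + 1][dx + 1] * v
            gy += SOBEL_Y[dy + 1][dx + 1] * v
    return gx, gy


def direction_difference_histogram(img_a, img_b, diff_threshold, bins=8):
    """Histogram of the angle (0..pi) between the gradient vectors of two
    images, counted only where the luminance differs by diff_threshold or more."""
    hist = [0] * bins
    height, width = len(img_a), len(img_a[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            if abs(img_a[y][x] - img_b[y][x]) < diff_threshold:
                continue  # pixel is outside the changed region
            ax, ay = gradient(img_a, x, y)
            bx, by = gradient(img_b, x, y)
            if (ax == 0 and ay == 0) or (bx == 0 and by == 0):
                continue  # no edge direction defined at this pixel
            d = abs(math.atan2(ay, ax) - math.atan2(by, bx))
            d = min(d, 2 * math.pi - d)  # fold into [0, pi]
            hist[min(int(d / math.pi * bins), bins - 1)] += 1
    return hist
```

In the example used for checking, a vertical edge in one image and a horizontal edge in the other give a direction difference of 90 degrees, which falls into the middle bin.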
In this histogram, when a person is present, the component change (625 in FIG. 6) becomes large in the region occupied by the person. An evaluation value is then obtained by finding the straight line parallel to the horizontal axis that minimizes the sum of the distances from the points on the histogram. The presence of a person is determined from the magnitude of this evaluation value. If a person is confirmed to be present, the person presence flag is set to 1 at 634.
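One possible reading of the evaluation value is sketched below. The text does not define the distance measure or the exact evaluation value, so this sketch assumes vertical (absolute) distances, for which the minimizing horizontal line is the median of the histogram values, and it assumes that the evaluation value is the resulting minimum sum. The names `evaluation_value`, `person_present`, and `threshold` are hypothetical.

```python
from statistics import median


def evaluation_value(hist):
    """Sum of absolute distances from the histogram values to the horizontal
    line that minimizes that sum (the median level). Assumed definition."""
    level = median(hist)
    return sum(abs(v - level) for v in hist)


def person_present(hist, threshold):
    """Person presence flag: 1 if the evaluation value reaches the threshold."""
    return 1 if evaluation_value(hist) >= threshold else 0
```

Under this reading a flat histogram gives an evaluation value of 0, and a histogram with a pronounced component change gives a large value.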
Further, it is determined whether a different person is present. The difference between the component change value β of the current frame at 625 and the component change value γ calculated for the previous frame is compared with a predetermined threshold κ to determine whether the person is a different one. If the person is determined to be different, the different-person presence flag 631 is set to 1.
The method described in “Latest Trends in Image Processing Algorithms”, New Technology Communications, can be applied to determine the movement of an object. At 617 in FIG. 6, the inter-frame difference method is used: subtracting the gray value of each pixel of the previous frame from the gray value of the corresponding pixel of the current frame produces a dark region and a bright region in front of and behind the object, from which the direction and amount of its movement can be estimated. Furthermore, the contour of a moving object can be detected from the difference images by using consecutive difference images. Only the positive (or negative) component of the difference is used, and a logical OR is taken over consecutive frames. This removes the effect that arises when there are several moving objects and the next object enters the region occupied by one object in the previous frame. Moving object candidates are detected in this way.
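The combination of positive difference components over consecutive frames described above can be sketched as follows. This is a minimal illustration that assumes frames are equally sized lists of rows of gray values; the names `positive_difference_mask`, `moving_object_candidates`, and `threshold` are hypothetical.

```python
def positive_difference_mask(prev, cur, threshold):
    """Binary mask of pixels whose gray value increased by more than
    threshold from the previous frame to the current frame."""
    return [[1 if c - p > threshold else 0 for p, c in zip(prev_row, cur_row)]
            for prev_row, cur_row in zip(prev, cur)]


def moving_object_candidates(frames, threshold):
    """Logical OR of the positive difference masks of consecutive frames."""
    masks = [positive_difference_mask(a, b, threshold)
             for a, b in zip(frames, frames[1:])]
    if not masks:
        return []
    result = [row[:] for row in masks[0]]
    for mask in masks[1:]:
        for y, row in enumerate(mask):
            for x, value in enumerate(row):
                result[y][x] |= value
    return result
```

For a bright pixel that moves one position to the right in each frame, the result marks the positions the pixel moved into, while the positions it left (negative differences) are ignored.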

Next, feature point association 629 is performed. Feature-point-based methods can be used to determine where a part of one frame corresponds to in the next frame. Examples are Moravec's interest operator and the method of Nagel et al., which extracts spots using Gaussian curvature. Moravec's operator calculates, within a small region, the intensity variance between adjacent pixels in four directions (vertical, horizontal, and the two diagonals), takes the minimum of these as the initial value of the operator, and then takes the maximum points of these values as feature points.
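The Moravec operator as described above can be sketched as follows. This is a simplified illustration, not a reference implementation: the variance between adjacent pixels is approximated by a sum of squared differences, only centers whose window and neighbors lie fully inside the image are evaluated, ties between equal maxima are all kept, and the names `interest_value`, `feature_points`, `half`, and `threshold` are hypothetical.

```python
DIRECTIONS = ((1, 0), (0, 1), (1, 1), (1, -1))  # horizontal, vertical, two diagonals


def interest_value(img, cx, cy, half=1):
    """Minimum over the four directions of the sum of squared differences
    between adjacent pixels inside the window centered on (cx, cy)."""
    sums = []
    for dx, dy in DIRECTIONS:
        total = 0
        for y in range(cy - half, cy + half + 1):
            for x in range(cx - half, cx + half + 1):
                diff = img[y + dy][x + dx] - img[y][x]
                total += diff * diff
        sums.append(total)
    return min(sums)


def feature_points(img, half=1, threshold=0):
    """Centers whose interest value exceeds threshold and is not smaller
    than that of any of the eight neighboring centers."""
    height, width = len(img), len(img[0])
    margin = half + 1  # keeps every accessed pixel inside the image
    values = {(x, y): interest_value(img, x, y, half)
              for y in range(margin, height - margin)
              for x in range(margin, width - margin)}
    points = []
    for (x, y), value in values.items():
        if value <= threshold:
            continue
        neighbours = [values.get((x + i, y + j), 0)
                      for i in (-1, 0, 1) for j in (-1, 0, 1) if (i, j) != (0, 0)]
        if all(value >= n for n in neighbours):
            points.append((x, y))
    return points
```

Taking the minimum over directions means that a straight edge, which varies in only some directions, scores low, whereas a spot or corner, which varies in all four, scores high.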
In the method of Nagel et al., the intensity is regarded as the height along the z-axis, giving a surface in xyz space, and feature points are obtained from the properties of its Gaussian curvature. Once the feature points have been found, the feature point corresponding to each feature point in the current frame is searched for in the next frame. If the maximum moving speed of the moving object is known, the search range can be narrowed.
In the relaxation-based method of Barnard et al., a list of candidate feature points in the second frame is created for the neighborhood of each feature point in the first frame. The probability that each candidate is the correct match is determined from a correlation value, weighted to allow for the case where no corresponding feature point exists. Because the vectors of the feature points are expected to show similar tendencies locally, the probability that neighboring velocity vectors share the same tendency is calculated. The tendency of neighboring velocity vectors is propagated in this way, and the computation is stopped when, for every feature point, the probability of one label exceeds a certain threshold.
Feature points are associated in this way to detect a moving object. If a moving object is detected, the moving object flag is set to 1 at 636. When any of these flags changes, the conditions shown in FIG. 7 are evaluated using the flag bit pattern 632, and the question items are determined.

Next, the table of question items and flag bit patterns shown in FIG. 7 will be described. When the flag bit pattern in a column of the table is satisfied, the question shown in the question item is asked. For example, the “Who” question is asked when the voice section detection flag and the scene change flag are both 1, when the different-person presence flag is 1, when the person presence flag and the scene change flag are both 1, or when the speaker change flag is 1.
The “WhatObject” question is asked when the voice section detection flag and the scene change flag are both 1, when the speaker change flag is 1, when the person presence flag and the scene change flag are both 1, or when the scene change flag and the object detection flag are both 1.
The “WhatAction” question is asked when the speaker change flag is 1, when the moving object detection flag is 1, when the different-person presence flag is 1, when the voice section detection flag and the scene change flag are both 1, or when the person presence flag and the scene change flag are both 1.
The “When” and “Where” questions are asked when the speaker change flag is 1, when the sound presence flag is 1, or when the scene change flag is 1.
The “Why” and “How” questions are asked when the speaker change flag is 1, when the object detection flag is 1, when the voice section detection flag and the scene change flag are both 1, or when the person presence flag and the scene change flag are both 1.
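The conditions listed above can be written as a lookup table in which each question item has several alternative flag patterns, and a question is asked when all flags of at least one pattern are 1. The sketch below encodes only the conditions stated in the text; the flag identifiers and the names `RULES` and `questions_for` are hypothetical, and the actual table of FIG. 7 is not reproduced here.

```python
VOICE, SCENE, PERSON, OTHER = "voice_section", "scene_change", "person", "other_person"
SPEAKER, OBJECT, MOVING, SOUND = "speaker_change", "object", "moving_object", "sound"

WHEN_WHERE = [{SPEAKER}, {SOUND}, {SCENE}]
WHY_HOW = [{SPEAKER}, {OBJECT}, {VOICE, SCENE}, {PERSON, SCENE}]

# Each question item maps to alternative flag patterns (sets of flags that must all be 1).
RULES = {
    "Who": [{VOICE, SCENE}, {OTHER}, {PERSON, SCENE}, {SPEAKER}],
    "WhatObject": [{VOICE, SCENE}, {SPEAKER}, {PERSON, SCENE}, {SCENE, OBJECT}],
    "WhatAction": [{SPEAKER}, {MOVING}, {OTHER}, {VOICE, SCENE}, {PERSON, SCENE}],
    "When": WHEN_WHERE,
    "Where": WHEN_WHERE,
    "Why": WHY_HOW,
    "How": WHY_HOW,
}


def questions_for(flags):
    """Question items whose conditions are met by the set of flags that are 1."""
    active = set(flags)
    return [question for question, patterns in RULES.items()
            if any(pattern <= active for pattern in patterns)]
```

For example, a change of speaker alone triggers every question item, whereas a detected moving object alone triggers only “WhatAction”.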
Through the audio and video analysis described above, a feature amount change is detected at 204 in FIG. 2. The CPU 300 in FIG. 3 sends a control signal for video playback to the video drive control unit 314, and video playback on the video deck 320 is stopped (205 in FIG. 2). Further, information on the changed scene, for example information on the video data structure such as the first and last frame numbers of the video block and the time, is stored together with the question items by the structure storage unit 502 in the meta information file creation unit 302 shown in FIG. 9 (206 in FIG. 2).

Next, TVML creation 207 in FIG. 2 is executed in order to ask the questions described above. TVML is a text-based language devised so that an entire television program can be described. A program script written in TVML is interpreted by a TVML player, which generates the television program in real time. The TVML player is software that runs on a personal computer or graphics workstation; given a TVML script and various data (video, audio, and so on), it can create a television program using the following functions.
The functions of TVML include a CG presenter that speaks with a synthesized voice and moves against a studio shot rendered in real-time full CG, camera work, superimposed text, title display described in HTML, playback of video files, playback of audio files, and speech-synthesized narration.
For details, see http://www.strl.nhk.or.jp/TVML/Japanese/Jsitemap.html; Toshiaki Yokoyama, “Program Generation / Interactive Editing System Based on the TV Program Description Language TVML”, 3rd Intelligent Information Media Symposium, December 1997; and Toshiaki Yokoyama, “Development of a Machine Interface for the TV Program Description Language TVML”, Electronic Information Society Conference, 1997.

The TVML creation unit 319, shown in FIG. 8(a) (display of question items), will now be described. When a scene change point is detected in the video data by the above processing, the CPU 300 activates the TVML creation unit 319 in the message creation unit 301. The TVML program creation unit 500 in FIG. 8 queries the scene change detection unit 308 for information on the structure of the video data, such as scene change points, and for the question items, and based on this information creates a TVML script for putting the question items about the video to the user. FIG. 10 shows an example of a TVML script.
The script created by the TVML program creation unit 500 is used to display the question items at 208 in FIG. 2. From the TVML player 501 shown in FIG. 8, the audio signal passes through the audio signal input/output unit 312 and the video signal passes through the video signal input/output unit 313; they are reproduced by the speaker 3 and the display device 2, and the question is put to the content describer.
In response to the TVML output, the content describer speaks the information into the input device 5 shown in FIG. 3 (209 in FIG. 2). To recognize the input information at 210 in FIG. 2, the audio signal acquired by the input device 5 is converted by the A/D converter 311 into a digital signal with a predetermined sampling frequency. The digital signal output by the A/D converter 311 is supplied to the digital speech processing circuit 304 in the speech recognition processing unit 303 in FIG. 3.
The digital speech processing circuit 304 converts the digital speech signal into vector data by processing such as band division and filtering, and supplies this vector data to the speech recognition circuit 305. A ROM that stores speech recognition data is connected to the speech recognition circuit 305. The speech recognition data stored in the ROM is compared with the vector data supplied from the digital speech processing circuit, and when a match is detected under predetermined conditions, the text data stored in association with that vector data is read out.
For this comparison, a hidden Markov model (HMM), for example, is used. A DSP (digital signal processor), for example, is used for these processes. Such speech recognition technology is described in Sadaoki Furui, “Digital Speech Processing”.

To display the input information at 211 in FIG. 2, the text data that has been read out is input to the TVML creation unit 319 in the message creation unit 301 shown in FIG. 3, and the TVML program creation unit 500, shown in FIG. 8 (display of input information), creates a TVML script for displaying the input information for confirmation. The TVML script is read by the TVML player 501; the audio signal passes through the audio signal input/output unit 312 and the video signal passes through the video signal input/output unit 313, and they are output from the speaker 3 and the display device 2.
Further, at 212 in FIG. 2, the content describer confirms whether the output information is correct. When the content describer enters a confirmation of the input information through the input device 5, it is read by the speech recognition circuit in the speech recognition processing unit 303 in FIG. 3. If there is no problem with the input information, meta information file creation 213 in FIG. 2 is performed. If there is a problem with the input information, processing returns to the display of question items at 208 in FIG. 2 and is performed again.
The case where there is no problem with the input information will now be described. In this case, processing proceeds to meta information file creation 213 in FIG. 2. The text data read out by the speech recognition circuit 305 in the speech recognition processing unit 303 in FIG. 3 is stored by the content storage unit 503 in FIG. 9.

Next, the meta information file creation unit (shown in FIG. 9) will be described. This unit converts the structure information (number of frames, time, and so on) for each group of frames delimited by the detected scene change points, together with the meta information consisting of the question items and the input information, into a meta information file having a predetermined data structure, and stores it.
The structure information and question items stored in the structure storage unit 502 in the meta information file creation unit 302, and the input information stored in the content storage unit 503, are converted into a data structure that follows the data structure information stored in the meta information schema storage unit 504, and are stored by the meta information storage unit 506. The information stored by the meta information storage unit 506 is further sent to the disk device 4 and stored there.
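The conversion described above can be illustrated with a small sketch that combines per-segment structure information with the question items and answers and serializes them as XML. The element and attribute names (`VideoDescription`, `Segment`, `Annotation`) and the record layout are hypothetical placeholders chosen for illustration; they are not taken from the MPEG-7 schema or from the figures of this document.

```python
import xml.etree.ElementTree as ET


def build_meta_info(segments):
    """Serialize segments (structure information plus question/answer pairs)
    into an XML string. Element names are illustrative, not MPEG-7."""
    root = ET.Element("VideoDescription")
    for seg in segments:
        node = ET.SubElement(root, "Segment",
                             start=str(seg["start_frame"]),
                             end=str(seg["end_frame"]))
        for question, answer in seg["answers"].items():
            item = ET.SubElement(node, "Annotation", question=question)
            item.text = answer
    return ET.tostring(root, encoding="unicode")


example = build_meta_info([
    {"start_frame": 0, "end_frame": 120,
     "answers": {"Who": "father", "Where": "amusement park"}},
])
```

Keeping the schema separate from the stored structure and content, as the meta information schema storage unit 504 does, means the same records could be re-serialized if the target schema changes.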
As a meta information schema, MPEG-7 is known; it is currently being standardized as a standard for expressing the structure of video and its content description information. FIGS. 11 to 13 show examples of data structure information described in MPEG-7.
In this way, when a change point is detected in the video, the TVML creation program is started, and execution of the TVML program and creation of the meta information file are repeated until playback of the tape ends. FIG. 4 shows the content description interface provided by these TVML programs.
Further, when no information is input for a question item, the question item creation unit 310 deletes that question item. Once the meta information file for the content description information obtained from the question items has been created, the prompt for free description items in FIG. 2 is displayed (214), and the content description items are input. As with the input obtained through the question items, input information recognition 216, input information display 217, and input information confirmation 218 are performed. If there is no problem with the input information, meta information file creation 219 is performed. In addition, the question item creation unit 310 adds the free description item as a question item.
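The deletion and addition of question items described above can be sketched as a single update step. This is a hedged illustration only: the text does not specify how a free description item is represented, so the sketch assumes it is given as a label to be asked next time, and the name `update_question_items` and its parameters are hypothetical.

```python
def update_question_items(items, answers, free_description=None):
    """Drop question items that received no input and append the free
    description item (if any) as a new question item."""
    kept = [q for q in items if answers.get(q, "").strip()]
    if free_description and free_description not in kept:
        kept.append(free_description)
    return kept
```

For example, if “Where” and “Why” were left unanswered and the describer added a free description labeled “Weather”, the next question list would be “Who” followed by “Weather”.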
Once the meta information file has been created, the tape end check 220 in FIG. 2 is performed. The CPU 300 detects whether a synchronization signal is present in the received video signal. If the synchronization signal is absent for a certain period, the video is regarded as having ended, a signal to stop video playback is sent to the video drive control unit 314, and playback on the video deck 320 is stopped. Then, as the description end message 221 in FIG. 2, a message that the apparatus is terminating is displayed through TVML, and processing ends (222 in FIG. 2).
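The end-of-tape check described above amounts to watching for a sufficiently long run of missing synchronization signals. The sketch below assumes the presence of the synchronization signal is sampled at regular intervals and expressed as booleans; the names `tape_ended`, `sync_present`, and `max_gap` are hypothetical.

```python
def tape_ended(sync_present, max_gap):
    """Return the sample index at which the synchronization signal has been
    absent for max_gap consecutive samples, or None if that never happens."""
    gap = 0
    for index, present in enumerate(sync_present):
        gap = 0 if present else gap + 1
        if gap >= max_gap:
            return index
    return None
```

Requiring several consecutive missing samples keeps a brief dropout in the signal from being mistaken for the end of the tape.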
For video data that has been structured and described as explained above, the system further provides a search interface. When the user inputs a search condition through the input device 5 in FIG. 3, the video data search unit 315 searches the disk device 4 for video data matching the request, and any match is presented through, for example, the display 2 or the speaker 3. Although question items are added and deleted when the meta information file is created in this embodiment, this may instead be done at the end of the tape (220) or at the start (200) of the next video data editing session.
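A search over the stored content descriptions might look like the following sketch. It is illustrative only: it assumes each stored record holds a frame range and a mapping from question item to answer text, and it uses simple case-insensitive substring matching. The name `search_segments` and the record layout are hypothetical.

```python
def search_segments(records, **conditions):
    """Return (start_frame, end_frame) of every record whose answers contain
    all requested terms, e.g. search_segments(records, Who="father")."""
    matches = []
    for record in records:
        answers = record["answers"]
        if all(term.lower() in answers.get(question, "").lower()
               for question, term in conditions.items()):
            matches.append((record["start_frame"], record["end_frame"]))
    return matches


records = [
    {"start_frame": 0, "end_frame": 120,
     "answers": {"Who": "Father and mother", "Where": "amusement park"}},
    {"start_frame": 121, "end_frame": 300,
     "answers": {"Who": "mother", "Where": "amusement park"}},
]
```

The returned frame ranges are what a playback device would need in order to start reproducing the video from the matching scene.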

One embodiment of the present invention has been described above, but the system of the present invention may also be configured as follows. The device for playing back the video may be, for example, a video deck when the video is recorded on videotape, a CD-R drive for a CD-ROM, a DVD drive for a DVD, or a PC or hard disk recorder when the video is stored on a hard disk in a format such as MPEG.
In this embodiment the input device was described using a microphone as an example, but a keyboard, touch panel, electronic pen, or the like may be used instead. In that case, the audio signal input/output unit 312 in FIG. 3 is supplemented or replaced by a unit appropriate to the device. The storage device may be a storage unit inside the video playback device or a storage device such as a hard disk. The disk device may be local, or it may be placed on a network to manage the content description information of the video data. Furthermore, only the search interface may be placed on the network. FIG. 14 shows an example of a network configuration. A video data editing device 1101, a video data editing/retrieval device 1102, a video data management device 1103, and a video data retrieval device 1104, each with a different functional configuration, can each access the disk devices of the other devices via the network 1100. Information can of course also be exchanged between video data editing/retrieval devices 1102 of the same kind.

Optional settings may also be made available in advance: turning the addition and deletion of question items ON or OFF, configuring the response conditions, and segmentation settings that help improve the accuracy with which changes of video scene are detected. In that case, the response conditions for outputting content description messages allow the following to be set: whether a character appears, whether messages are output by voice or as displayed text, which character appears, information about the respondent (name, age, gender, and so on) if the person who will respond to the character is known in advance, the content description items (when, where, who, and so on), and the response time allowed for questions from the character.
The segmentation settings allow numerical values to be set for video features (color, position, texture, time code discontinuity, and so on). Information about the video can also be entered here in advance. When no optional settings are made, processing follows the mode set as the default.
A case will be described in which “8 years old, ○男-kun, male” is registered as the respondent information set through the options, and “place: amusement park; who: father, mother, ○男-kun” is registered as the video information. When a videotape containing the video is inserted and the play button is pressed, a character appears on the monitor and a message prompting content description plays, such as: “Hello, ○男-kun. I hear you went to the amusement park with your father and mother. That sounds nice. I'm going to ask you some questions, so will you tell me what happened?”

FIG. 1 is a configuration diagram of the video data editing/retrieval system.
FIG. 2 is a diagram showing the processing flow in the video data editing/retrieval device.
FIG. 3 is a configuration diagram of the video data editing/retrieval device according to an embodiment of the present invention.
FIG. 4 is a diagram showing an example of the content description interface screen displayed on the display when video is edited with the video data editing/retrieval device.
FIG. 5 is a block diagram of the scene change detection processing.
FIG. 6 is a flow diagram of video changes and the generation of question items.
FIG. 7 is a diagram showing the question items and flag bit patterns.
FIG. 8 is a block diagram of TVML creation.
FIG. 9 is a block diagram of meta information file creation.
FIG. 10 is a diagram showing an example of a TVML script.
FIG. 11 is a diagram showing an example of an MPEG-7 content description file.
FIG. 12 is a diagram showing an example of an MPEG-7 content description file.
FIG. 13 is a diagram showing an example of an MPEG-7 content description file.
FIG. 14 is a diagram showing an example of a network configuration.

Explanation of symbols

1 video data editing/retrieval device, 2 display device, 3 speaker, 4 disk device, 5 input device, 6 audio signal cable, 7 data cable, 8 voice signal cable, 9 video signal cable, 300 CPU, 301 message creation unit, 302 meta information file creation unit, 303 speech recognition processing unit, 304 digital speech processing circuit, 305 speech recognition circuit, 306 memory, 307 HDD, 308 scene change detection unit, 309 feature amount analysis unit, 310 question item creation unit, 311 A/D converter, 312 audio signal input/output unit (input unit), 313 video signal input/output unit, 314 video drive control unit, 315 video data search unit, 316 system bus, 317 display connection unit, 319 TVML creation unit, 320 video deck, 500 TVML program creation unit, 501 TVML player, 502 structure storage unit, 503 content storage unit, 504 meta information schema storage unit, 506 meta information storage unit, 632 flag bit pattern, 640 voice section flag, 1100 network

Claims

1. A video data processing apparatus for adding video content information for each scene to video data recorded together with audio, comprising: question item creating means for creating a question item related to a scene; and input means for receiving input from a user by text or voice in response to the question item, wherein the question item creating means adds a description item input as a free description to the question items.

2. The video data processing apparatus according to claim 1, wherein the question item creating means deletes a question item for which no information has been input from the input means.

3. The video data processing apparatus according to claim 1, further comprising search means for searching for a portion whose content has been described, wherein the video data is reproduced from the scene or scene portion found by the search means.

4. The video data processing apparatus according to claim 3, wherein the search means includes communication means that enables searching for a portion in which the content of video data has been described in another video data processing apparatus on a network.

5. A video data processing method for adding video content information for each scene to video data recorded together with audio, comprising: a question item creating step of creating a question item related to a scene; and an input step of receiving input from a user by text or voice in response to the question item, wherein the question item creating step adds a description item input as a free description to the question items.

6. The video data processing method according to claim 5, wherein the question item creating step deletes a question item for which no information has been input from the input unit.

7. The video data processing method according to claim 5, further comprising a search step of searching for a portion whose content has been described, wherein the video data is reproduced from the scene or scene portion found in the search step.

8. The video data processing method according to claim 7, wherein the search step includes a communication function that enables searching for a portion in which the content of video data has been described in another video data processing apparatus on a network.

9. A program for causing a computer to function as the video data processing apparatus according to claim 1, 2, 3, or 4.

10. A computer-readable recording medium on which the video data processing program according to claim 9 is recorded.