JP2007101945A

JP2007101945A - Apparatus, method, and program for processing video data with audio

Info

Publication number: JP2007101945A
Application number: JP2005292485A
Authority: JP
Inventors: Sunao Terayoko; 素寺横
Original assignee: Fujifilm Corp
Current assignee: Fujifilm Corp
Priority date: 2005-10-05
Filing date: 2005-10-05
Publication date: 2007-04-19

Abstract

PROBLEM TO BE SOLVED: To provide an apparatus, a method, and a program for processing video data with audio that can automatically display text transcribed from speech in a style matching the spoken words and the content of the scene.

SOLUTION: A video/audio signal analysis part 48 transcribes human speech that can be converted into text from audio data 62 in video data 60 with audio by speech recognition processing, and outputs the text as utterance content information. The video/audio signal analysis part 48 also acquires voice feature quantity information and utterance time information, and obtains speaker information by acquiring a speaker identifier for identifying the speaker and the position coordinates of the speaker on the screen. A metadata generation part 50 stores the utterance time information, utterance content information, speaker information, voice feature quantity information, and the like in metadata of a specified format (e.g., XML). The metadata are saved in a video/audio signal recording part 46 in a specified format (e.g., MPEG-2 or AVI).

COPYRIGHT: (C)2007, JPO&INPIT

Description

The present invention relates to an apparatus, a method, and a program for processing video data with audio, and more particularly to a technique for converting the audio of video data with audio into text.

Conventionally, techniques for displaying a person in a video in association with audio have been proposed. For example, Patent Document 1 discloses a video display method that displays text transcribed from speech so that it corresponds accurately to the speaker in the video.
Patent Document 1: JP 2004-56286 A

However, the video display method disclosed in Patent Document 1 performs face recognition on the video information to detect the timing at which the speaker of a line of dialogue appears in the video, and inserts the subtitle corresponding to that line into the video information based on this appearance timing. It does not disclose how to synchronize video and audio when no speaker is on the screen. In addition, the video display method of Patent Document 1 cannot recognize the content or mood of the dialogue and the scene and automatically convert the text into a style suited to the scene for display.

The present invention has been made in view of these circumstances, and its object is to provide an apparatus, a method, and a program for processing video data with audio that can automatically display text transcribed from speech in a style matching the dialogue and the content of the scene.

To achieve the above object, the video data processing apparatus with audio according to claim 1 comprises: data acquisition means for acquiring video data with audio that includes video data and audio data synchronized with the video data; utterance content information generation means for converting the audio data into text to generate utterance content information; utterance time information acquisition means for acquiring utterance time information indicating the time at which the audio data is uttered in the video data; metadata creation means for creating metadata including the utterance content information and the utterance time information; and recording means for recording the video data with audio and the metadata in association with each other.

With the video data processing apparatus with audio according to claim 1, the utterance content information obtained by transcribing the audio data included in the video data with audio, together with the utterance time information, can be saved as metadata. This metadata makes it easy to create telops (on-screen captions), scripts, scenarios, minutes of meetings, and the like.

The video data processing apparatus with audio according to claim 2 is the apparatus of claim 1, further comprising speaker identification means for analyzing the video data and the audio data to identify the speaker who uttered the audio, wherein the metadata creation means records the utterance content information in the metadata in association with identification information of the speaker.

With the video data processing apparatus with audio according to claim 2, in addition to the effect described above, identification information of the speaker who made the utterance can be saved as metadata.

The video data processing apparatus with audio according to claim 3 is the apparatus of claim 1 or 2, wherein the speaker identification means further comprises speaker position information acquisition means for analyzing the video data and the audio data to acquire position information, on the screen displaying the video data, of the speaker who uttered the audio, and the metadata creation means records the utterance content information in the metadata in association with the position information of the speaker.

With the video data processing apparatus with audio according to claim 3, in addition to the effect described above, the position information of the speaker can be saved as metadata.

The video data processing apparatus with audio according to claim 4 is the apparatus of any one of claims 1 to 3, further comprising voice feature quantity acquisition means for analyzing the audio data to acquire a feature quantity of the voice, wherein the metadata creation means records the utterance content information in the metadata in association with the voice feature quantity.

With the video data processing apparatus with audio according to claim 4, the voice feature quantity can be saved as metadata in addition to the utterance content.

The video data processing apparatus with audio according to claim 5 is the apparatus of claim 4, wherein the voice feature quantity acquisition means acquires at least one of the loudness, pitch, intonation, and tone of the voice. Claim 5 enumerates the voice feature quantities of claim 4.

The video data processing apparatus with audio according to claim 6 is the apparatus of any one of claims 1 to 5, further comprising: playback display means for playing back and displaying the video data with audio; information acquisition means for acquiring the utterance content information and the utterance time information from the metadata; telop creation means for creating a telop based on the acquired utterance content information; and telop insertion means for inserting the telop during playback of the video data with audio based on the acquired utterance time information.

With the video data processing apparatus with audio according to claim 6, a telop can be created from the metadata of the video data with audio and displayed when the video data with audio is played back.
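The telop insertion of claim 6 can be pictured as a lookup from the current playback frame to the utterances whose time interval covers it. The following Python fragment is only a minimal sketch under that reading; the `Utterance` record and the function name `telops_for_frame` are hypothetical and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    """Hypothetical record holding one metadata entry (frame-based times)."""
    start_frame: int
    end_frame: int
    text: str


def telops_for_frame(utterances, frame):
    """Return the telop texts whose utterance interval covers `frame`."""
    return [u.text for u in utterances if u.start_frame <= frame <= u.end_frame]


# Illustrative data only; the second entry is invented for the example.
utterances = [Utterance(0, 10, "It was summer"), Utterance(25, 40, "Hello")]
print(telops_for_frame(utterances, 5))
print(telops_for_frame(utterances, 15))
```

A player would call such a lookup once per displayed frame and draw whatever texts it returns; an empty list simply means no telop is shown.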

The video data processing apparatus with audio according to claim 7 is the apparatus of claim 3, comprising: playback display means for playing back and displaying the video data with audio; information acquisition means for acquiring the utterance content information, the utterance time information, and the position information of the speaker from the metadata; telop creation means for creating a telop based on the acquired utterance content information; telop insertion means for inserting the telop during playback of the video data with audio based on the acquired utterance time information; and insertion position adjustment means for adjusting the insertion position of the telop based on the acquired position information of the speaker.

With the video data processing apparatus with audio according to claim 7, the insertion position of the telop is adjusted based on the position information of the speaker acquired from the metadata, so that the display makes the correspondence between each inserted telop and its speaker easy to understand.
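One way to picture the position adjustment of claim 7 is to place the telop just below the speaker and clamp it to the screen. The sketch below is an assumption-laden illustration, not the patented method: the placement rule, the `margin` parameter, the bottom-center fallback for an off-screen speaker, and the function name `telop_position` are all hypothetical.

```python
def telop_position(speaker_pos, telop_size, screen_size, margin=8):
    """Return the top-left (x, y) of a telop box.

    speaker_pos: (x, y) of the speaker on screen, or None if no speaker
                 is visible (assumed fallback: bottom center).
    telop_size:  (width, height) of the telop box in pixels.
    screen_size: (width, height) of the screen in pixels.
    """
    tw, th = telop_size
    sw, sh = screen_size
    if speaker_pos is None:
        return (sw - tw) // 2, sh - th - margin
    sx, sy = speaker_pos
    # Center the box horizontally on the speaker, then keep it on screen.
    x = max(0, min(sx - tw // 2, sw - tw))
    y = max(0, min(sy + margin, sh - th))
    return x, y


print(telop_position((320, 100), (200, 40), (640, 480)))
print(telop_position(None, (200, 40), (640, 480)))
```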

The video data processing apparatus with audio according to claim 8 is the apparatus of claim 4 or 5, comprising: playback display means for playing back and displaying the video data with audio; information acquisition means for acquiring the utterance content information, the utterance time information, and the voice feature quantity from the metadata; telop creation means for creating a telop based on the acquired utterance content information; telop insertion means for inserting the telop during playback of the video data with audio based on the acquired utterance time information; and character attribute changing means for changing the character attributes of the telop according to the voice feature quantity.

With the video data processing apparatus with audio according to claim 8, expressive telops that reflect the voice feature quantity can be created, for example by enlarging (reducing) the font size of a telop corresponding to a loud (quiet) voice, or by making the font of a telop corresponding to a forceful (feeble) voice bolder (thinner).
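The attribute change of claim 8 amounts to a mapping from voice features to text styling. The following Python sketch simplifies this to a single loudness value that drives both size and weight; the decibel thresholds, scale factors, and the function name `telop_style` are illustrative assumptions, not values from the patent.

```python
def telop_style(loudness_db, base_size=24):
    """Map a loudness value to font size and weight.

    The thresholds (75 dB, 45 dB) and scale factors are assumed for
    illustration only.
    """
    if loudness_db >= 75:
        return {"font_size": int(base_size * 1.5), "weight": "bold"}
    if loudness_db <= 45:
        return {"font_size": int(base_size * 0.75), "weight": "light"}
    return {"font_size": base_size, "weight": "normal"}


print(telop_style(80))
print(telop_style(40))
```

A fuller version might treat loudness and forcefulness as separate inputs, as the text distinguishes them, and add the other attributes that claim 9 lists.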

The video data processing apparatus with audio according to claim 9 is the apparatus of claim 8, wherein the character attribute changing means changes, according to the voice feature quantity, at least one of the font, font size, character color, background color, character decoration, column layout, and brackets of the telop, and symbols attached to the telop such as a speech balloon, an exclamation mark, or a question mark. Claim 9 enumerates the character attributes of claim 8.

The video data processing method with audio according to claim 10 comprises: a data acquisition step of acquiring video data with audio that includes video data and audio data synchronized with the video data; an utterance content information generation step of converting the audio data into text to generate utterance content information; an utterance time information acquisition step of acquiring utterance time information indicating the time at which the audio data is uttered in the video data; a metadata creation step of creating metadata including the utterance content information and the utterance time information; and a recording step of recording the video data with audio and the metadata in association with each other.

The program for processing video data with audio according to claim 11 causes a computer to implement: a data acquisition function of acquiring video data with audio that includes video data and audio data synchronized with the video data; an utterance content information generation function of converting the audio data into text to generate utterance content information; an utterance time information acquisition function of acquiring utterance time information indicating the time at which the audio data is uttered in the video data; a metadata creation function of creating metadata including the utterance content information and the utterance time information; and a recording function of recording the video data with audio and the metadata in association with each other.

By applying software or firmware that includes the program for processing video data with audio according to claim 11 to a personal computer (PC) or to another device with a video playback function, such as a video playback device (video deck, television), a digital camera, or a mobile phone, the video data processing apparatus with audio and the video data processing method with audio of the present invention can be realized.

According to the present invention, the utterance content information obtained by transcribing the audio data included in video data with audio, together with the utterance time information, can be saved as metadata. This metadata makes it easy to create telops, scripts, scenarios, minutes of meetings, and the like.

Preferred embodiments of the apparatus, method, and program for processing video data with audio according to the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the main configuration of an imaging apparatus that includes a video data processing apparatus with audio according to an embodiment of the present invention. The imaging apparatus 10 shown in FIG. 1 is, for example, an electronic camera, digital camera, or digital video camera with a moving image shooting function.

The CPU 12 is connected to each part of the imaging apparatus 10 via a bus 14 and serves as the overall control unit that controls the operation of the imaging apparatus 10 based on operation input from the operation switch 16 and the like. The operation switch 16 includes a power switch, a release switch 16A, a cross key, and the like, and accepts operation input from the user. The release switch 16A is a two-stage switch. In the "half-press (S1 = ON)" state, in which the release switch 16A is pressed lightly and held, automatic focusing (AF) and automatic exposure control (AE) operate and AF and AE are locked. In the "full-press (S2 = ON)" state, in which the switch is pressed further from the half-press, shooting is executed.

The memory 18 includes a ROM that stores the programs executed by the CPU 12 and various data needed for control, and an SDRAM or the like that serves as a work area in which the CPU 12 performs various arithmetic processes and as a video processing area.

The external communication interface (external communication I/F) 20 is a device for connecting to a network or to external output equipment (for example, a personal computer, television, display, printer, or external recording device), and sends and receives various data according to a predetermined protocol. Data may be sent and received via, for example, the Internet, a wireless LAN, a wired LAN, IrDA, or Bluetooth.

The image sensor 24 is an element, for example a CCD, that receives light entering through the optical system (lens) 22 and converts it into an electrical signal. This electrical signal is amplified by a preamplifier (not shown), converted into a digital signal by the A/D converter 26, and input to the video processing unit 28.

The imaging apparatus 10 of this embodiment has multiple operation modes, namely a shooting mode for shooting video (still images and moving images) and a playback mode for displaying and playing back video. The user sets the operation mode by operation input from the operation switch 16.

In the shooting mode, the video processing unit 28 processes the electrical signal output from the image sensor 24 to create video data for checking the angle of view (a through image), which is displayed on the video display unit (monitor) 30. When a still image is shot by operating the release switch 16A, the video processing unit 28 processes the electrical signal output from the image sensor 24 to create still image data for storage. This still image data is saved to the recording medium 32 in a predetermined file format. The recording medium 32 is, for example, a semiconductor memory, video tape, hard disk drive (HDD), or DVD. Audio can also be input through the microphone 34 and saved in association with the still image data.

When a moving image is shot, on the other hand, audio acquisition through the microphone 34 begins when shooting is started with the release switch 16A. The video processing unit 28 creates moving image data for storage, and the audio processing circuit 36 creates audio data for storage. The moving image data and audio data are converted into video data with audio in a predetermined file format (for example, MPEG or AVI) and saved to the recording medium 32.

In the playback mode, when a still image is played back, the video processing unit 28 reads the still image data saved on the recording medium 32, creates still image data for display, and displays it on the monitor 30. When a moving image is played back, the video processing unit 28 reads the moving image data saved on the recording medium 32, creates moving image data for display, and displays it on the monitor 30, while the audio data associated with the moving image data is read and output from the speaker 38. As described above, the monitor 30 serves both as an electronic viewfinder for checking the angle of view during shooting and as a display for the captured video data (still image data and moving image data).

Next, the process of generating metadata from the video data with audio captured by the imaging apparatus 10 and attaching it to that data is described with reference to FIG. 2. FIG. 2 is a functional block diagram showing the flow of video data processing with audio in the imaging apparatus 10. The recording instruction/control unit 40 shown in FIG. 2 is a functional block that includes the CPU 12 and the release switch 16A, which issues the instruction to start recording. In response to operation input from the release switch 16A, the CPU 12 outputs a signal to start moving image shooting to the external video/audio input unit 42 and the video/audio signal encoding unit 44. The external video/audio input unit 42 is a functional block that includes the optical system 22, the image sensor 24, and the microphone 34. The video/audio signal encoding unit 44 is a functional block that includes the video processing unit 28 and the audio processing circuit 36. The video and audio electrical signals output from the external video/audio input unit 42 are converted by the video/audio signal encoding unit 44 (a video codec) into video data 60 with audio in a predetermined format that contains audio data 62 and video data 64, as shown in FIG. 3, and are saved in the video/audio signal storage unit 46 (memory 18, recording medium 32).

Next, the video data 60 with audio is read from the video/audio signal storage unit 46, and the video/audio signal analysis unit 48 extracts the audio data 62 from the video data 60 with audio. The video/audio signal analysis unit 48 uses speech recognition to transcribe any human speech in the extracted audio data 62 that can be converted into text, and outputs the result as utterance content information. It also recognizes voice feature quantities such as loudness, pitch, intonation, and tone, classifies the voice into one of a set of predetermined textures, and outputs the result as voice feature quantity information.

The video/audio signal analysis unit 48 also acquires utterance time information indicating when the transcribed speech is uttered. This time information is, for example, numbers identifying the frames of the video data (moving image) at the start and end of the utterance, or the start time and end time of the utterance. In addition, the video/audio signal analysis unit 48 analyzes the video data 64 to detect the speaker corresponding to the utterance content, acquires a speaker identifier for identifying the speaker and the position coordinates of the speaker on the screen, and outputs these as speaker information. The metadata generation unit 50 stores the utterance time information, utterance content information, speaker information, voice feature quantity information, and the like in metadata of a predetermined file format (for example, xml). This metadata contains the information shown in FIG. 4 and is saved in a predetermined format (for example, MPEG-2 or AVI) to the recording medium 32 of the video/audio signal recording unit 46.

FIG. 5 shows an example of metadata in xml format, and FIG. 6 shows the xml schema. In the example of FIG. 5, the utterance time information is written in the voice tag as the frame numbers of the start and end of the transcribed utterance, and the utterance content information is written in the text tag. The speaker information (person) identifies the speaker in the name attribute, for example by a person's name ("○×△男"). In the example of FIG. 5, no speaker is on the screen, so the pos attribute is omitted or left blank. The type attribute of the voice feature quantity information (tone) can be set not only to naration (narration) but also, for example, to laughter, crying, a loud voice, or a whisper.
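As a rough illustration of the kind of record described for FIG. 5, the following Python sketch builds one utterance entry with the standard `xml.etree.ElementTree` module. FIG. 5 and the FIG. 6 schema are not reproduced in this text, so the element nesting, the root element name `metadata`, the attribute names `start` and `end`, and the `x,y` encoding of `pos` are assumptions. Only the tag names `voice`, `text`, `person`, and `tone` and the attributes `name`, `pos`, and `type` come from the description.

```python
import xml.etree.ElementTree as ET


def build_utterance(start_frame, end_frame, text, speaker, tone_type, pos=None):
    """Build one <voice> element; the layout is assumed, not taken from FIG. 5."""
    voice = ET.Element("voice", {"start": str(start_frame), "end": str(end_frame)})
    ET.SubElement(voice, "text").text = text
    person_attrs = {"name": speaker}
    if pos is not None:
        # Omit pos entirely when no speaker is on screen, as the text allows.
        person_attrs["pos"] = f"{pos[0]},{pos[1]}"
    ET.SubElement(voice, "person", person_attrs)
    ET.SubElement(voice, "tone", {"type": tone_type})
    return voice


root = ET.Element("metadata")  # hypothetical root element name
# "naration" is spelled as it is written in the description.
root.append(build_utterance(0, 10, "It was summer", "narrator", "naration"))
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```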

Next, the storage format of the metadata is described. FIG. 7 shows an example of saving the metadata in MPEG format. As shown in FIG. 7, in the MPEG-2 format, a video stream 64′ containing the video data 64, an audio stream 62′ containing the audio data 62, and a metadata stream 66′ containing the metadata 66 are interleaved and recorded as a single file 68 using the recording scheme defined by the standard (a sequence of data units called packs, each 2,048 kb in one example).

FIG. 8 shows an example of saving the metadata in AVI format. In FIG. 8, "RIFF AVI" denotes the entire AVI file. "LIST hdrl" is the header area of the AVI file and contains two "LIST strl" header areas, one for video and one for audio. In this embodiment, streams for proprietary extension data named "strd" and "strn", shown with thick borders, are provided in the video "LIST strl" header area, and the xml-format metadata shown in FIG. 5 is saved in these streams as binary data without modification. This allows the metadata to be saved inside the AVI file.
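For orientation, a RIFF chunk consists of a four-character code, a little-endian 32-bit payload size, and the payload padded to an even length. The Python sketch below serializes xml bytes into a single `strd` chunk under that general layout. It does not build a complete AVI file, and the helper name `riff_chunk` and the placeholder payload are hypothetical.

```python
import struct


def riff_chunk(fourcc: bytes, payload: bytes) -> bytes:
    """Serialize one RIFF chunk: FourCC, uint32 LE size, data, even padding."""
    if len(fourcc) != 4:
        raise ValueError("FourCC must be exactly 4 bytes")
    # The size field counts the payload only; a pad byte follows odd payloads.
    pad = b"\x00" if len(payload) % 2 else b""
    return fourcc + struct.pack("<I", len(payload)) + payload + pad


xml_bytes = b"<metadata/>"  # placeholder standing in for the xml metadata
chunk = riff_chunk(b"strd", xml_bytes)
print(len(chunk))
```

In a real file the chunk would be placed inside the video `LIST strl` list, and the sizes of the enclosing lists would be updated accordingly.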

The method of processing video data with audio is described below with reference to FIG. 9. FIG. 9 is a flowchart showing a method of processing video data with audio according to an embodiment of the present invention. First, the video data 60 with audio is read from the video/audio signal storage unit 46, a certain amount is buffered, and analysis of the audio data 62 begins (step S10). In step S10, the amount of video data 60 with audio to buffer is adjustable. The amount of data to buffer is best chosen with reference to suitable contextual breaks in the text obtained by transcribing the audio data 62 under analysis. For example, based on data such as the fact that Japanese spoken at normal speed runs to about 400 to 500 words per minute, the amount of data that can contain one syllable can be calculated backward and used as the initial value of the amount of data to buffer.
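The back-calculation mentioned for step S10 can be illustrated with simple arithmetic: divide the audio data rate by the speech rate. The Python sketch below assumes uncompressed PCM audio at 48 kHz, 16 bits, mono, and treats the quoted 400 to 500 per minute figure as a count of spoken units. None of these audio parameters is stated in the patent, and the function name `bytes_per_unit` is hypothetical.

```python
def bytes_per_unit(units_per_minute, sample_rate_hz, bytes_per_sample, channels=1):
    """Approximate PCM bytes spanned by one spoken unit (integer arithmetic)."""
    bytes_per_minute = 60 * sample_rate_hz * bytes_per_sample * channels
    return bytes_per_minute // units_per_minute


# Assumed audio format: 48 kHz, 16-bit (2 bytes), mono.
for rate in (400, 450, 500):
    print(rate, bytes_per_unit(rate, 48_000, 2))
```

The result would only be a starting point; the text notes that the buffered amount is adjustable.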

Next, the utterance content of the audio data 62 is converted into text by speech recognition (step S12). In step S12, for example, human (speaker) voices, animal sounds, ambient sounds, sound effects, and the like are extracted from the audio data 62 in the video data 60 with audio, and the extracted human voice data and sound effect data are converted into text using a human voice dictionary and a sound effect dictionary, respectively. The human voice data is further analyzed, the speech is classified by speaker based on voice feature quantities such as voiceprint and speaking speed, and the result is saved as utterance content information. The method of converting the audio data 62 into text is not limited to the one described above.

Because the transcribed audio data (utterance content information) must be synchronized with frame-level time information, it is further analyzed and broken down into suitable units such as one syllable or one sound. The frame numbers or time information synchronized with each unit of the utterance content information are stored as utterance time information; for example, the utterance "It was summer" is recorded as spanning frames 0 to 10, or 0 min 00 s to 0 min 05 s (see FIG. 5).
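One possible way to hold utterance time information is sketched below. The class name, field names, and the frame rate parameter are assumptions made for illustration; the specification states only that frame numbers or time information are stored for each unit of utterance content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtteranceTime:
    """A unit of transcribed speech with the frames it spans (hypothetical record)."""
    text: str
    start_frame: int
    end_frame: int  # inclusive

    def as_seconds(self, fps: float) -> tuple[float, float]:
        """Return (start, end) in seconds; the end is when the last frame finishes."""
        return (self.start_frame / fps, (self.end_frame + 1) / fps)


def format_mmss(seconds: float) -> str:
    """Format a time in seconds as M:SS, truncating fractions of a second."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes}:{secs:02d}"
```

Storing frame numbers and converting to time on demand keeps the record independent of any particular frame rate.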

In step S12, if the utterance end time of the transcribed audio data is earlier than the end time of the buffered video data 64, so that part of the buffered video data with audio remains unprocessed, the buffering start position of the video data with audio 60 in the next loop iteration may be aligned with the end time of the audio data transcribed this time.

If the audio data 62 analyzed in step S12 contains no utterance (No in step S14), there is no transcribed audio data, so the process returns to step S10 and continues with the remaining video data with audio 60.

On the other hand, if the audio data 62 analyzed in step S12 contains an utterance (Yes in step S14), tone analysis is performed on the transcribed audio data using the voice feature quantities of the audio signal (loudness, pitch, intonation, tone, and the like) (step S16). Here, tone analysis means classifying the voice into one of a set of voice tone textures (categories) prepared in advance, for example laughter, whispering, or shouting. In step S16, numerical data representing the tone of the voice (loudness, frequency, and the like) is also recorded.
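The classification into prepared tone categories in step S16 could be sketched as nearest-template matching, as below. This is only an illustration: the template table, its numeric values, the scaling constants, and the function name are all hypothetical placeholders, and the specification does not say how the classification is carried out.

```python
import math

# Hypothetical tone templates: (loudness in dB, mean pitch in Hz).
# The values are placeholders, not measured data.
TONE_TEMPLATES = {
    "whisper": (35.0, 150.0),
    "normal": (60.0, 180.0),
    "laughter": (70.0, 320.0),
    "shout": (85.0, 260.0),
}

# Assumed scales that put loudness and pitch differences on a comparable footing.
SCALES = (10.0, 50.0)


def classify_tone(loudness_db: float, pitch_hz: float) -> dict:
    """Pick the closest prepared tone category and keep the raw numbers too."""
    def distance(template):
        return math.hypot((loudness_db - template[0]) / SCALES[0],
                          (pitch_hz - template[1]) / SCALES[1])

    label = min(TONE_TEMPLATES, key=lambda name: distance(TONE_TEMPLATES[name]))
    # Step S16 also records the numerical tone data alongside the category.
    return {"tone": label, "loudness_db": loudness_db, "pitch_hz": pitch_hz}
```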

Next, the speaker is analyzed based on the transcribed audio data (step S18). In step S18, for example, the video data 64 is analyzed and a person region, in which a person appears, is extracted for each frame of the video data 64. Video feature quantities of the person are calculated, and the person is estimated from them. Examples of video feature quantities include average density, highlight (minimum density), shadow (maximum density), and histograms. The person is estimated by estimating the demographic group to which the person belongs, such as gender, age, and occupation.

For example, for gender estimation, a face region (hair) is extracted from the person region. The person can be estimated to be female based on an overall judgment of cues such as the following: the hair region has a large volume; the hair region is elongated, indicating long hair; the shape of the clothing, extracted by pattern matching of the contour below the torso, appears to be a skirt; the clothing is predominantly red or pink; or the face region extraction result indicates makeup, lipstick, or accessories. For age estimation, the height of the subject is calculated from the displayed video, and the subject is estimated from that size to be an adult, a junior or senior high school student, an elementary school student, an infant, or the like. Alternatively, if the extracted hair region has little volume or the hair is white, the person is estimated to be elderly. Occupation can be estimated from clothing, for example. If the shape, density, and color of the clothing strongly suggest a suit, the person can be estimated to be an office worker; if the shape and color of the clothing suggest a uniform, the person can be estimated, in combination with the gender and age estimates, to be a student, including junior and senior high school students. The methods of estimating the demographic group given here are only examples, and the estimation is not limited to them.

Next, for the number N of person regions estimated from the video data 64 and the number M of persons estimated from the human voices in the audio data 62, statistics are collected on how often they occur together in the same scene. If the demographic estimate based on the video feature quantities contradicts the estimate based on the voice feature quantities, for example when analysis of the video data 64 indicates a man but the audio data indicates a woman, the pair is not counted. If both a male candidate and a female candidate are extracted from the video data 64 and the voice is that of a female candidate only, only the female candidate in the video is counted. The movement of the mouth of a person in the video may also be detected, and the degree to which it matches the utterance timing may be used to weight the match between video and audio when collecting the statistics.

This statistical processing is carried out and tallied over fixed intervals. The intervals may be actual time intervals, for example every 10 minutes, or, if the video data 64 is a recorded TV program, they may be delimited by program, commercial break, or chapter. From the resulting statistics, the association between a speaker detected in the video data 64 and the utterance content is determined based on the most strongly correlated combination of video-based and voice-based person estimates, and the speaker who made the utterance is identified. A consistency check between the person estimate obtained from the video data 64 and that obtained from the audio data 62 may also be performed at this stage.
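The co-occurrence statistics described above can be illustrated with a small counting sketch. The function name, the input layout, and the use of gender as the only demographic attribute are simplifying assumptions; mouth-movement weighting and interval boundaries are omitted.

```python
from collections import Counter


def associate_speakers(samples, video_gender, voice_gender):
    """Match each voice to the on-screen person it co-occurs with most often.

    samples: iterable of (visible_person_ids, voice_id), one per scene sample.
    video_gender / voice_gender: estimated gender per person id / voice id.
    (All names here are hypothetical.)
    """
    counts = Counter()
    for visible_people, voice_id in samples:
        for person_id in visible_people:
            # Contradictory estimates (e.g. male on video, female voice) are not counted.
            if video_gender[person_id] != voice_gender[voice_id]:
                continue
            counts[(voice_id, person_id)] += 1

    best = {}
    for (voice_id, person_id), n in counts.items():
        current = best.get(voice_id)
        if current is None or n > counts[(voice_id, current)]:
            best[voice_id] = person_id
    return best
```

In this sketch the pair with the highest count stands in for the "most strongly correlated combination"; a fuller implementation might normalize the counts per interval.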

The identified speaker is then given a speaker identifier, a character string that identifies the speaker by name, gender, age, or the like (for example, "Woman A" or "Old woman A"), and speaker information containing the speaker identifier and the position coordinates of the person region to which the identified speaker belongs is stored.

In this embodiment, a speaker database (DB) may be provided in the memory 18, with speakers' face regions, names, nicknames, voiceprints, and the like stored in it in advance, and the speaker may be identified by matching the stored face regions against the video feature quantities of the extracted person.

Next, metadata containing the utterance time information, utterance content information, speaker information (speaker identifier and speaker position coordinates), voice feature quantity information, and the like is generated (step S20). In step S20, the metadata is first generated from the utterance content information and the utterance time information, and the speaker information and voice feature quantity information are then also written into the metadata.
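A minimal sketch of step S20 using Python's standard `xml.etree.ElementTree` module follows. The specification mentions a `text` tag and a `name` attribute; every other element and attribute name below (`metadata`, `utterance`, `speaker`, `voice`, `start_frame`, and so on) is a hypothetical choice, since the actual schema of FIGS. 5 and 6 is not reproduced in this section.

```python
import xml.etree.ElementTree as ET


def build_metadata(utterances) -> str:
    """Serialize utterance records to an XML string (illustrative schema only)."""
    root = ET.Element("metadata")
    for u in utterances:
        # Utterance time information.
        node = ET.SubElement(root, "utterance",
                             start_frame=str(u["start_frame"]),
                             end_frame=str(u["end_frame"]))
        # Speaker information: identifier and, if known, on-screen position.
        speaker = ET.SubElement(node, "speaker", name=u["speaker"])
        if u.get("position") is not None:
            x, y = u["position"]
            speaker.set("x", str(x))
            speaker.set("y", str(y))
        # Voice feature quantity information.
        ET.SubElement(node, "voice", tone=u["tone"], loudness=str(u["loudness"]))
        # Utterance content information.
        ET.SubElement(node, "text").text = u["text"]
    return ET.tostring(root, encoding="unicode")
```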

Next, if unprocessed video data with audio remains (Yes in step S22), the process returns to step S10 and continues. When no unprocessed video data with audio remains (No in step S22), metadata generation ends and the generated metadata is stored by a suitable method (step S24). The metadata may be stored, for example, in the same file as the video data with audio 60 in MPEG-2 or AVI format, as shown in FIGS. 7 and 8, or as an XML file separate from the video data with audio 60 and associated with it.
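The overall loop of FIG. 9 (steps S10 to S24) might be sketched as follows. The function name and the `analyze` callback are hypothetical: the callback stands in for speech recognition and the later analysis steps, and positions are plain seconds rather than real stream offsets.

```python
def generate_metadata(total_len: float, buffer_len: float, analyze) -> list:
    """Walk through the stream in buffered windows and collect utterance records.

    analyze(start, end) is an assumed callback returning a list of
    (text, utterance_start, utterance_end) tuples found inside the window.
    """
    records = []
    pos = 0.0
    while pos < total_len:                       # S22: unprocessed data remains?
        end = min(pos + buffer_len, total_len)   # S10: buffer the next window
        utterances = analyze(pos, end)           # S12: transcribe
        if not utterances:                       # S14 No: nothing spoken
            pos = end
            continue
        records.extend(utterances)               # stand-in for S16 to S20
        last_end = max(u[2] for u in utterances)
        # If speech ended before the buffer did, restart buffering from the
        # end of the transcribed speech so the remainder is not skipped.
        pos = last_end if pos < last_end < end else end
    return records                               # S24: caller stores the result
```

The window realignment in the last step of the loop is what lets an utterance that straddles a buffer boundary be picked up whole in the next iteration.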

According to this embodiment, metadata containing utterance content information obtained by transcribing the audio data, among other information, can be attached to the video data with audio and stored. Using this metadata, telops (on-screen captions), scripts, scenarios, minutes of meetings, and the like can be created easily.

Next, the video playback function of the imaging apparatus 10 equipped with the above video data processing apparatus with audio will be described with reference to FIGS. 10 and 11. FIG. 10 is a block diagram showing the main configuration of the video playback function section of the video data processing apparatus with audio. As shown in FIG. 10, the video playback function section comprises a video/audio signal recording unit 46, a playback instruction control unit 70, a video/audio signal decoding and playback unit 72, a metadata reading unit 74, a telop generation and display unit 76, and an external video/audio output unit 78.

The playback instruction control unit 70 includes operation members with which the user enters playback operations, such as a play switch for instructing playback of video data, a stop switch for instructing that playback stop, a pause switch, rewind and fast-forward switches, a menu switch, and a remote control. It sends control signals to the blocks of the video playback function section in response to operation inputs from these members.

The video/audio signal decoding and playback unit 72 reads the video data with audio designated by an operation input from the playback instruction control unit 70 from the video/audio signal recording unit 46 and decodes the video and audio signals. The metadata reading unit 74 reads the metadata of the designated video data with audio and outputs it to the telop generation and display unit 76. The telop generation and display unit 76 reads the utterance content information and utterance time information from the metadata and outputs to the video/audio signal decoding and playback unit 72 a command to insert a telop of the utterance content information into the frames corresponding to the utterance time information. The telop generation and display unit 76 also reads the speaker information from the metadata, determines the position of the speaker in every frame corresponding to the utterance time information, and outputs a command designating the telop insertion position. In addition, based on the voice feature quantity information, the telop generation and display unit 76 outputs a command designating the telop's font, font size, character color, background color, character decoration, and column layout, or symbols attached to the telop such as brackets, speech balloons, exclamation marks, and question marks. The video/audio signal decoding and playback unit 72 inserts the telop into the decoded video signal in accordance with the commands from the telop generation and display unit 76 and outputs it, together with the decoded audio signal, to the external video/audio output unit 78. The external video/audio output unit 78 includes the image display unit 30 for displaying video, the loudspeaker 38 for outputting audio, video/audio output terminals, and the like, and plays back the video and audio input from the video/audio signal decoding and playback unit 72.
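To illustrate how a telop can be matched to the frames given by the utterance time information, the sketch below looks up the caption for a frame with a binary search. The class and method names are hypothetical, and the cues are assumed not to overlap; the specification does not describe a lookup structure.

```python
import bisect


class TelopTrack:
    """Frame-indexed caption cues (illustrative helper, not from the specification)."""

    def __init__(self, cues):
        # cues: (start_frame, end_frame, text) tuples, assumed non-overlapping.
        self._cues = sorted(cues)
        self._starts = [cue[0] for cue in self._cues]

    def text_at(self, frame: int):
        """Return the telop text to overlay on the given frame, or None."""
        # Find the last cue that starts at or before this frame.
        i = bisect.bisect_right(self._starts, frame) - 1
        if i >= 0:
            start, end, text = self._cues[i]
            if start <= frame <= end:
                return text
        return None
```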

Next, the flow of processing for playing back video data with audio to which metadata has been attached will be described with reference to FIG. 11. FIG. 11 is a flowchart showing this processing.

First, when the video data with audio to be played back is selected via the playback instruction control unit 70, the metadata reading unit 74 reads the metadata (step S30). If the metadata is stored in the same file as the video data with audio, as shown in FIGS. 7 and 8, the metadata reading unit 74 reads the metadata from the video data with audio. If the metadata is recorded in a separate file associated with the video data with audio by a URL or the like, the metadata reading unit 74 obtains the metadata file associated with the designated video data with audio.

Next, the transcribed utterance content is read from the utterance content information (the text tag in FIG. 5) in the metadata, and the character data of the telop is generated (step S32). The speaker information and voice feature quantity information are also obtained from the metadata, and the character attributes of the telop are set (step S34). In step S34, for example, the character color of the telop is changed for each speaker (speaker identifier) so that utterances by the same speaker can be recognized by the telop color. The character attributes are also changed according to the voice feature quantities. For example, depending on the loudness, pitch, and intonation of the voice, the font, font size, character color, background color, character decoration, or column layout of the telop is changed, or symbols such as brackets, speech balloons, exclamation marks, and question marks are added to the telop. This makes it possible to create effective telops suited to the character of the voice. In steps S32 and S34, the user may also be allowed to operate the operation switch 16 while viewing the screen to manually correct or add telop text and change the character attribute settings.
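The attribute selection in step S34 could look something like the sketch below. The palette, the loudness thresholds, the font sizes, and the function name are all made-up values for illustration; the specification names the kinds of attributes that may change but gives no concrete mapping.

```python
# Hypothetical palette: one color per speaker, reused cyclically.
PALETTE = ("white", "yellow", "cyan", "lime")


def telop_style(speaker_index: int, loudness_db: float) -> dict:
    """Choose telop character attributes from the speaker and the voice loudness."""
    style = {
        "color": PALETTE[speaker_index % len(PALETTE)],  # same speaker, same color
        "font_size": 32,
        "suffix": "",
    }
    if loudness_db >= 75.0:
        # Loud speech: larger text and an exclamation mark (assumed threshold).
        style["font_size"] = 48
        style["suffix"] = "!"
    elif loudness_db <= 40.0:
        # Quiet speech: smaller text (assumed threshold).
        style["font_size"] = 24
    return style
```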

The insertion position and size of the telop are also adjusted according to the speaker information in the metadata (step S36). For example, if the metadata records the speaker's position on the screen, the insertion position and size of the telop are adjusted according to that position so that it is clear who spoke. For example, according to the speaker's position coordinates, lines spoken by a speaker on the left of the screen are inserted on the left, and lines spoken by a speaker on the right are inserted on the right. The positions of the speaker's face and mouth may be stored in the metadata or detected by the video data processing apparatus with audio 120, and a speech balloon containing the telop may be displayed near the speaker's face region. The positions of other speakers and the sizes of their person regions may also be recorded in the metadata so that the telop does not overlap other people in the same frame. If the speaker information does not include the speaker's position coordinates, that is, if the speaker is not on screen, the background region may, for example, be detected by video analysis and the position and size of the telop calculated so that it fits within the background region. In step S36, the user may be allowed to operate the operation switch 16 to manually change the insertion position and size of the telop. The telop may also be displayed together with the speaker information (name attribute information).
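A simplified sketch of the horizontal placement in step S36 is shown below. The function name, the margin, and the fallback of centering the telop when the speaker is off screen are assumptions; the text instead suggests detecting a background region for that case, which is beyond this sketch.

```python
def place_telop(speaker_x, frame_width: int, telop_width: int, margin: int = 16) -> int:
    """Return the left edge (in pixels) at which to draw the telop.

    speaker_x is the speaker's horizontal position from the metadata, or None
    when no position coordinates are recorded (speaker off screen).
    """
    if speaker_x is None:
        # Simplification: center the telop instead of analyzing the background.
        return (frame_width - telop_width) // 2
    # Center the telop on the speaker, then clamp it inside the frame so that
    # left-side speakers get left-side telops and right-side speakers right-side ones.
    left = speaker_x - telop_width // 2
    return max(margin, min(left, frame_width - telop_width - margin))
```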

Next, the telop is inserted into the video based on the character attributes, insertion position, size, and other settings determined as described above, and the video data with audio is played back (step S38). During playback of the video data with audio, the processing of FIG. 12 above may be continued so that telops are created and displayed in real time, or the metadata may be read ahead before playback and the telops cached and then displayed during playback.

According to the video data processing apparatus with audio of this embodiment, the insertion position, size, and character attributes of telops are adjusted according to the position of the speaker in the video data and the characteristics of the voice, so that intelligent telops can be created and displayed automatically.

In the above embodiment, the metadata is used to create telops easily, but use of the metadata is not limited to this. For example, a printer can be connected to the video data processing apparatus with audio, and the metadata can be used to easily produce telops, scripts, scenarios, minutes of meetings, and the like.

Although this embodiment has been described for an imaging apparatus equipped with the video data processing apparatus with audio, the video data processing apparatus with audio of the present invention can also be applied to other apparatuses that have a function for playing back images, such as personal computers, video recorders, and hard disk recorders.

FIG. 12 is a block diagram showing another example of the video data processing apparatus with audio. The video data processing apparatus with audio 100 shown in FIG. 12 is, for example, a personal computer, video recorder, or hard disk recorder, and generates and attaches metadata to video data with audio, television programs, and the like input via a recording medium 114, a video input terminal, or an audio input terminal (not shown).

As shown in FIG. 12, the CPU 102 is connected to each block in the video data processing apparatus with audio 100 via a bus 104 and serves as an overall control unit that controls each block based on operation inputs from the operation unit 106 and the like. The operation unit 106 includes a keyboard, a mouse, and other operation members, and outputs signals to the CPU 102 in response to operation inputs from them. The external storage device 108 stores the programs executed by the CPU 102 and the various data needed for control, and is, for example, a hard disk drive (HDD). The memory control unit 110 is controlled by the CPU 102 and writes data to, and reads data from, the main memory 112 and the recording medium 114. The main memory 112 is the main storage device of the video data processing apparatus with audio 100 and is, for example, a semiconductor memory. It comprises SDRAM, which serves as a work area when the CPU 102 reads programs and data from the external storage device 108 and performs various computations, VRAM, which serves as a storage area for content displayed on the display monitor, and the like. The recording medium 114 records video. The user can input a desired video to the video data processing apparatus with audio 100 via the recording medium 114. The video/audio signal analysis unit 116 and the metadata generation unit 118 are the same as those in FIG. 2, and their description is omitted.

FIG. 1 is a block diagram showing the main configuration of an imaging apparatus equipped with a video data processing apparatus with audio according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the flow of processing of video data with audio in the imaging apparatus 10.
FIG. 3 is a block diagram showing video data with audio.
FIG. 4 is a table showing examples of information included in the metadata.
FIG. 5 is a diagram showing an example of metadata in XML format.
FIG. 6 is a diagram showing an XML schema.
FIG. 7 is a diagram showing an example of storing metadata in MPEG format.
FIG. 8 is a diagram showing an example of storing metadata in AVI format.
FIG. 9 is a flowchart showing a method for processing video data with audio according to an embodiment of the present invention.
FIG. 10 is a block diagram showing the main configuration of the video playback function section of the video data processing apparatus with audio.
FIG. 11 is a flowchart showing the flow of processing for playing back video data with audio to which metadata has been attached.
FIG. 12 is a block diagram showing another example of the video data processing apparatus with audio.

Explanation of symbols

10: imaging apparatus, 12: CPU, 14: bus, 16: operation switch, 18: memory, 20: external communication interface (external communication I/F), 22: optical system (lens), 24: image sensor, 26: A/D converter, 28: video processing unit, 30: video display unit (monitor), 32: recording medium, 34: microphone, 36: audio processing circuit, 38: loudspeaker, 40: recording instruction and control unit, 42: external video/audio input unit, 44: video/audio signal encoding unit, 46: video/audio signal storage unit, 48: video/audio signal analysis unit, 50: metadata generation unit, 60: video data with audio, 62: audio data, 64: video data, 70: playback instruction control unit, 72: video/audio signal decoding and playback unit, 74: metadata reading unit, 76: telop generation and display unit, 78: external video/audio output unit, 100: video data processing apparatus with audio, 102: CPU, 104: bus, 106: operation unit, 108: external storage device, 110: memory control unit, 112: main memory, 114: recording medium, 116: video/audio signal analysis unit, 118: metadata generation unit

Claims

A video data processing apparatus with audio, comprising:
data acquisition means for acquiring video data with audio that includes video data and audio data synchronized with the video data;
utterance content information generation means for generating utterance content information by converting the audio data into text;
utterance time information acquisition means for acquiring utterance time information indicating the time at which the audio data is uttered in the video data;
metadata creation means for creating metadata including the utterance content information and the utterance time information; and
recording means for recording the video data with audio and the metadata in association with each other.

The video data processing apparatus with audio according to claim 1, further comprising speaker identification means for analyzing the video data and the audio data and identifying the speaker who uttered the voice,
wherein the metadata creation means records the utterance content information and identification information of the speaker in association with each other.

The video data processing apparatus with audio according to claim 1 or 2, wherein the speaker identification means further comprises speaker position information acquisition means for analyzing the video data and the audio data and acquiring position information, on the screen on which the video data is displayed, of the speaker who uttered the voice, and
wherein the metadata creation means records the utterance content information and the position information of the speaker in association with each other.

The video data processing apparatus with audio according to claim 1, further comprising voice feature quantity acquisition means for analyzing the audio data and acquiring a voice feature quantity,
wherein the metadata creation means records the utterance content information and the voice feature quantity in association with each other.

The video data processing apparatus with audio according to claim 4, wherein the voice feature quantity acquisition means acquires at least one of the loudness, pitch, intonation, and tone of the voice.

The video data processing apparatus with audio according to any one of claims 1 to 5, further comprising:
playback display means for playing back and displaying the video data with audio;
information acquisition means for acquiring the utterance content information and the utterance time information from the metadata;
telop creation means for creating a telop based on the acquired utterance content information; and
telop insertion means for inserting the telop, based on the acquired utterance time information, when the video data with audio is played back.

The video data processing apparatus with audio according to claim 3, further comprising:
playback display means for playing back and displaying the video data with audio;
information acquisition means for acquiring the utterance content information, the utterance time information, and the position information of the speaker from the metadata;
telop creation means for creating a telop based on the acquired utterance content information;
telop insertion means for inserting the telop, based on the acquired utterance time information, when the video data with audio is played back; and
insertion position adjustment means for adjusting the insertion position of the telop based on the acquired position information of the speaker.

The video data processing apparatus with audio according to claim 4 or 5, further comprising:
playback display means for playing back and displaying the video data with audio;
information acquisition means for acquiring the utterance content information, the utterance time information, and the voice feature quantity from the metadata;
telop creation means for creating a telop based on the acquired utterance content information;
telop insertion means for inserting the telop, based on the acquired utterance time information, when the video data with audio is played back; and
character attribute changing means for changing the character attributes of the telop according to the voice feature quantity.

The video data processing apparatus with audio according to claim 8, wherein the character attribute changing means changes, according to the voice feature quantity, at least one of the telop's font, font size, character color, background color, character decoration, and column layout, and symbols attached to the telop such as brackets, speech balloons, exclamation marks, and question marks.

A method for processing video data with audio, comprising:
a data acquisition step of acquiring video data with audio that includes video data and audio data synchronized with the video data;
an utterance content information generation step of generating utterance content information by converting the audio data into text;
an utterance time information acquisition step of acquiring utterance time information indicating the time at which the audio data is uttered in the video data;
a metadata creation step of creating metadata including the utterance content information and the utterance time information; and
a recording step of recording the video data with audio and the metadata in association with each other.

A program for processing video data with audio, the program causing a computer to realize:
a data acquisition function of acquiring video data with audio that includes video data and audio data synchronized with the video data;
an utterance content information generation function of generating utterance content information by converting the audio data into text;
an utterance time information acquisition function of acquiring utterance time information indicating the time at which the audio data is uttered in the video data;
a metadata creation function of creating metadata including the utterance content information and the utterance time information; and
a recording function of recording the video data with audio and the metadata in association with each other.