JP2010134507A - Reproduction device

Info

Publication number: JP2010134507A
Application number: JP2008307089A
Authority: JP
Inventors: Katsumi Saito; 勝美齊藤
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2008-12-02
Filing date: 2008-12-02
Publication date: 2010-06-17
Anticipated expiration: 2028-12-02
Also published as: JP5111343B2

Abstract

PROBLEM TO BE SOLVED: To allow a viewer to visually identify a person whose voice is heard but whose face does not appear in the image.

SOLUTION: Face image data and its feature values, and voice data and its feature values, are registered in a database 16 for a plurality of persons. A human voice recognition unit 32 and a human voice matching unit 34 check the feature values of a human voice contained in the reproduced audio signal against the database 16 to find the persons who are speaking. A face image recognition unit 20 and a face image matching unit 22 check the feature values of a face image contained in the reproduced image signal against the database 16 to find the persons inside the screen. An additional image display determination unit 40 excludes the persons found inside the screen from the persons found to be speaking, and thereby identifies a speaker outside the screen. An additional image generation unit 24 reads an additional image of the identified speaker from the database 16 and supplies it to a display image generation unit 26. The display image generation unit 26 combines the additional image with the reproduced image signal.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a playback apparatus that, when reproducing recorded image and audio signals, displays on the reproduced image information showing who a person is when that person's voice is heard but their face does not appear in the image.

Systems are known that, when reproducing and displaying recorded video or video transmitted from a remote location, additionally display information about a person shown on the screen. For example, Patent Document 1 describes a technique for a video conference system that generates a video signal in which an image of a speaker is supplemented with information about the speaker that is of interest to the viewer. Information on persons who may be photographed is registered in advance, and at the time of shooting the person is identified and the information is displayed additionally. This method is useful when information about a person shown in the video is wanted.

Patent Document 1: Japanese Patent Laid-Open No. 2-67889

With a consumer video camera, the persons being photographed are often already known to the viewer, such as family members or friends. In such cases there is little need to additionally display information about the subject.

However, it often happens that only the voice of a person outside the field of view is recorded. In such a case, the only way to identify the person is to rely on one's memory of the voice. Unless the voice is familiar or the person has a distinctive way of speaking, identification is difficult. Moreover, searching one's memory to guess who is speaking reduces concentration on the video and gets in the way of simply enjoying it.

An object of the present invention is to provide a playback apparatus that eliminates these inconveniences.

To achieve the above object, a playback apparatus according to the present invention comprises: image processing means for reproducing image data read from a recording medium and outputting a reproduced image signal; audio processing means for reproducing audio data read from the recording medium and outputting a reproduced audio signal; a database in which voice data and face image data of a plurality of persons are recorded; human voice recognition means for detecting and recognizing a human voice in the reproduced audio signal; human voice matching means for matching the human voice recognized by the human voice recognition means against the voice data registered in the database to identify the person; face image recognition means for detecting and recognizing a person's face in the reproduced image signal; face image matching means for matching the face recognized by the face image recognition means against the face image data registered in the database to identify the person; additional image display determination means for determining, as the target of additional image display, a person who remains after the persons identified by the face image matching means are excluded from the persons identified by the human voice matching means in the same scene; additional image generation means for reading from the database information indicating the target person determined by the additional image display determination means and generating an additional image to be combined with the reproduced image signal; and display image generation means for combining the additional image with the reproduced image signal.

According to the present invention, information about the owner of a voice who is not shown in the video is displayed in the reproduced video, so the viewer can recognize the speaker visually. Concentration on the video is no longer reduced by the effort of searching one's memory to guess who is speaking.

Embodiments of the present invention are described in detail below with reference to the drawings.

FIG. 1 is a schematic block diagram of an embodiment of a playback apparatus according to the present invention. A video signal with audio is recorded on the recording medium 12 of the playback apparatus 10. The recording medium 12 is, for example, an optical disc typified by a DVD (Digital Versatile Disc), a magnetic tape, a hard disk, or a memory card with built-in flash memory. The recording medium drive 14 drives the recording medium 12 and reads signals from and writes signals to it.

Personal information data for a plurality of persons can be registered in the database 16, and such data is in fact registered. Each personal information record contains face image data used by the face image matching function together with face image feature data representing its feature values, voice data used by the human voice matching function together with voice feature data representing its feature values, and various other data.

The image processing unit 18 decodes the compressed image data read from the recording medium 12 and applies various processing to generate reproduced image data. The face image recognition unit 20 determines whether the image data processed by the image processing unit 18 contains a region corresponding to a person's face. The face image matching unit 22 compares each region that the face image recognition unit 20 has judged to contain a face with the face image data registered in the database 16. This makes it possible to determine whether a person in the reproduced image is registered in the database 16 and, if so, which registered person it is.

The additional image generation unit 24 generates additional image data to be superimposed on the reproduced image signal reproduced from the recording medium 12. The display image generation unit 26 combines the additional image data output from the additional image generation unit 24 with the reproduced image data output from the image processing unit 18, producing a composite image in which the additional image is superimposed on the reproduced image. The output of the display image generation unit 26 is not limited to the composite image, however; it may also consist of the reproduced image alone or the additional image alone.

The display unit 28 displays the image data generated by the display image generation unit 26 as an image that the user can view. The display unit 28 is, for example, built into the playback apparatus 10, and consists of a display device such as a liquid crystal display (LCD) or an organic EL display and its drive circuit.

The audio processing unit 30 decodes the compressed audio data read from the recording medium 12 and applies various processing. The human voice recognition unit 32 determines whether the reproduced audio signal obtained by the audio processing unit 30 contains sound corresponding to human speech. The human voice matching unit 34 compares the portion that the human voice recognition unit 32 has judged to be a human voice with the voice data registered in the database 16. The volume detection unit 36 detects the volume level when the human voice recognition unit 32 has judged the sound to be a person's voice. The audio output device 38 outputs the reproduced audio generated by the audio processing unit 30 as sound, and consists of, for example, a speaker built into the playback apparatus 10 and its drive circuit.

The additional image display determination unit 40 determines the content of the additional image to be generated by the additional image generation unit 24, based on the matching results of the face image matching unit 22 and the human voice matching unit 34.

The external output unit 42 outputs the image signal generated by the display image generation unit 26 and the reproduced audio signal from the audio processing unit 30 to the outside. It consists of, for example, a drive circuit for external output and a connection terminal or a transmitting antenna.

The CPU 44 is a central processing unit that not only controls the additional image generation unit 24 and the display image generation unit 26 but also controls the playback apparatus 10 as a whole so that image processing and audio processing operate in synchronization.

The input device 46 is used by the user to enter operating modes, operating conditions, and the like into the playback apparatus 10. It consists of various switches or buttons, or of operable elements displayed on a menu screen.

To illustrate the characteristic operation of this embodiment, assume the following situation. Mr. A and Mr. B are having a conversation and the camera is pointed at Mr. A. As a result, as shown in FIG. 2, only Mr. A is within the shooting angle of view, and only the voice of Mr. B is recorded.

FIG. 3 shows an example of the playback screen when video recorded in this situation is reproduced by a conventional method. A reproduced image containing person A is displayed on the screen, and the reproduced audio is output from the speaker beside the screen. The group of symbols drawn at the right corner of the screen and at the speaker depicts Mr. B's voice.

In contrast, FIG. 4 shows an example of the playback screen according to this embodiment. The reproduced image is displayed on the screen of the display unit 28, and the reproduced audio is output from the speaker beside the screen (audio output device 38). The group of symbols drawn at the right corner of the screen and at the speaker depicts Mr. B's voice. Unlike FIG. 3, an additional image showing person B is superimposed on the screen. From this additional image the viewer can recognize or infer that the speaker is Mr. B.

The operation by which this embodiment identifies a speaker who is not within the shooting angle of view and composites an additional image showing that speaker is described next. FIG. 5 is a flowchart of this operation. Here the recording medium 12 is assumed to be a DVD.

First, the personal information data of the relevant persons is registered in the database 16 in advance (S1). As described above, each person's personal information data consists of face image data and face image feature data representing its feature values, voice data and voice feature data representing its feature values, and various other attribute data. The attribute data includes, for example, name, nickname, sex, age, date of birth, and a portrait drawing.

The audio processing unit 30 decompresses the compressed audio data read from the recording medium 12 and supplies the reproduced audio signal to the human voice recognition unit 32, the audio output device 38, and the external output unit 42. The human voice recognition unit 32 analyzes the reproduced audio signal and determines whether it contains a human voice (S2). If it does, the human voice recognition unit 32 supplies the extracted human voice data to the human voice matching unit 34.

The human voice matching unit 34 extracts feature values from the human voice data supplied by the human voice recognition unit 32, checks them against the voice feature data in the database 16, and searches for persons whose feature values have a correlation at or above a certain level (S3). If voice feature data with a correlation at or above the threshold exists, the human voice matching unit 34 notifies the additional image display determination unit 40 of the persons having that voice feature data as candidates for additional image display (S4).
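Steps S3 and S4 can be sketched as a threshold search over the registered feature data. The description does not say how the correlation is computed, so the cosine similarity, the 0.8 threshold, and the names `cosine_similarity` and `match_voice` below are all assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed stand-in for the unspecified correlation measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_voice(features: Sequence[float],
                registered: Dict[str, Sequence[float]],
                threshold: float = 0.8) -> Dict[str, float]:
    """Return {person_id: score} for every registered voice at or above the threshold (S3).

    The returned persons are the candidates for additional image display (S4).
    """
    scores = {pid: cosine_similarity(features, ref) for pid, ref in registered.items()}
    return {pid: score for pid, score in scores.items() if score >= threshold}
```

The same search shape would apply to the face image feature data in steps S6 and S7, with the result treated as exclusion candidates instead.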

Meanwhile, the image processing unit 18 decompresses the compressed image data reproduced from the recording medium 12 and generates reproduced image data. The face image recognition unit 20 extracts human face images from the reproduced image and supplies the extracted face image data to the face image matching unit 22 (S5). The face image matching unit 22 extracts feature values from the face image data supplied by the face image recognition unit 20, checks them against the face image feature data in the database 16, and searches for persons whose face image feature values have a correlation at or above a certain level (S6). If face image feature data with a correlation at or above the threshold exists, the face image matching unit 22 notifies the additional image display determination unit 40 of the persons having that face image feature data as candidates to be excluded from additional image display (S7).

The additional image display determination unit 40 removes the exclusion candidates extracted from the reproduced image (notified by the face image matching unit 22) from the candidates extracted from the reproduced audio (notified by the human voice matching unit 34) (S8). In this way the persons who are inside the screen (S7) are removed from the persons who are likely to be the speaker (S4). In other words, a person who is not inside the shooting frame but whose voice is recorded can be identified. If more than one person remains, the person with the highest voice feature correlation is determined to be the speaker. The additional image display determination unit 40 notifies the additional image generation unit 24 of the person finally determined.
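Step S8 amounts to a set difference followed by picking the highest-scoring remaining candidate. The following is a minimal sketch under the assumption that the voice candidates arrive as a mapping from a person identifier to a correlation score; the function name and data shapes are hypothetical.

```python
from typing import Dict, Optional, Set


def select_offscreen_speaker(voice_candidates: Dict[str, float],
                             on_screen: Set[str]) -> Optional[str]:
    """Exclude persons identified in the image, then pick the best voice match."""
    remaining = {pid: score for pid, score in voice_candidates.items()
                 if pid not in on_screen}
    if not remaining:
        return None  # every voice candidate is visible, so nothing is displayed
    # If several persons remain, the highest voice correlation wins.
    return max(remaining, key=remaining.get)
```

With candidates `{"A": 0.95, "B": 0.90}` and person A on screen, the function returns `"B"`, which corresponds to the situation of FIG. 2.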

The additional image generation unit 24 reads from the database 16 the face image data of the person notified by the additional image display determination unit 40, and generates an additional image of a predetermined size containing this face image (S9). The additional image generation unit 24 supplies the generated additional image to the display image generation unit 26.

The display image generation unit 26 superimposes the additional image from the additional image generation unit 24 on the reproduced image data from the image processing unit 18 to generate composite image data (S10). The composite image data is supplied to the display unit 28 and displayed as shown in FIG. 4. It may, of course, also be output externally through the external output unit 42. The displayed image looks as if a pictograph has been drawn over the image recorded on the recording medium 12. For this reason, the additional image superimposed on the reproduced image is also referred to below as a pictograph.
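The superimposition in step S10 can be illustrated with a simple opaque overlay. This sketch assumes frames are plain nested lists of pixel values and that the pictograph replaces the pixels underneath it; the actual compositing method (blending, transparency, pixel format) is not specified in the description, and `superimpose` is a hypothetical name.

```python
from typing import List

Image = List[List[int]]


def superimpose(frame: Image, overlay: Image, top: int, left: int) -> Image:
    """Return a copy of frame with overlay drawn at (top, left), clipped to the frame."""
    out = [row[:] for row in frame]  # leave the reproduced image itself untouched
    for dy, overlay_row in enumerate(overlay):
        for dx, pixel in enumerate(overlay_row):
            y, x = top + dy, left + dx
            if 0 <= y < len(out) and 0 <= x < len(out[y]):
                out[y][x] = pixel
    return out
```

Returning a copy keeps the reproduced image available unchanged, which matches the case where the display output consists of the reproduced image alone.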

The database 16 is described next. The database 16 may be stored on the recording medium 12 or built into the playback apparatus 10. If the playback apparatus 10 supports networking, the database 16 may be provided on a server connected through the network. If the recording medium 12 is removable from the playback apparatus 10, either of two methods may be used: loading the database from the recording medium 12 into a storage area of the playback apparatus 10, or referring directly to the database on the recording medium 12. In the former case, a removable recording medium on which only the database is recorded may be provided.

The personal information data of each person registered in the database 16 always contains face image data, used for face image matching and pictograph display, and voice data, used for voice matching. In addition, data such as name, nickname, sex, age, date of birth, and a portrait drawing is held as additional attribute information. This additional attribute information may be used for pictograph display together with, or in place of, the face image data.
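One way to picture such a record is as a small data class with two mandatory items and a free-form attribute map. The field names and the `is_registrable` helper below are hypothetical and only illustrate the mandatory/optional split described above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PersonRecord:
    """One entry of database 16 (field names are illustrative, not from the source)."""
    person_id: str
    face_image: bytes            # mandatory: face matching and pictograph display
    face_features: List[float]   # feature values derived from face_image
    voice_sample: bytes          # mandatory: voice matching
    voice_features: List[float]  # feature values derived from voice_sample
    attributes: Dict[str, str] = field(default_factory=dict)  # optional: name, nickname, ...


def is_registrable(record: PersonRecord) -> bool:
    """A record is complete only if both mandatory items are present."""
    return bool(record.face_image) and bool(record.voice_sample)
```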

As the number of persons registered in the database 16 grows, a distinction arises between persons for whom the user wants a pictograph displayed and persons for whom the user does not. To handle this, it is preferable to provide, for each personal information record, an item that sets whether pictograph display is permitted. Which value is set by default when a record is newly registered may be decided according to how the apparatus is used. The CPU 44 causes the display image generation unit 26 to combine the additional image with the reproduced image signal when the person whose additional image is to be displayed is set in the database 16 to permit composite display.

The database 16 is preferably general-purpose enough that a database created on another device can also be read and used.

The pictograph display of this embodiment is described next. Displaying information about a person whose voice alone is heard as a pictograph is useful because it adds a visual cue. The effect can be increased further by refining how the pictograph is displayed.

If the user can enable or disable the pictograph display function, that is, set whether it is needed, the apparatus can also serve a user who wants to see only the original recorded image. This is useful when the user does not care who the person not shown in the image is, or when the voice being heard is already familiar. For example, if the user has disabled the pictograph display function, the CPU 44 does not cause the additional image generation unit 24 to generate an additional image even if the voice of a speaker outside the screen is recorded.
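The global enable setting and the per-person display permission described earlier combine into a single decision before compositing. The following is a minimal sketch of that decision; the function name, the mapping of person identifiers to flags, and the default for persons without an explicit setting are assumptions.

```python
from typing import Dict


def should_compose(person_id: str,
                   display_allowed: Dict[str, bool],
                   feature_enabled: bool,
                   default_allowed: bool = True) -> bool:
    """Decide whether a pictograph is composited for the given off-screen speaker."""
    if not feature_enabled:
        return False  # user disabled the function: no additional image at all
    # Per-person permission; default_allowed stands in for the registration default.
    return display_allowed.get(person_id, default_allowed)
```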

The information displayed as a pictograph is one or more of the items held in the database 16. If the user can select which items are displayed, the display can follow the information the user wants to know. For example, displaying the name together with the face image makes it easier to identify the person even when the face image alone is not enough to tell who it is. The same effect can, of course, be obtained by displaying the name alone.

If the same data item is displayed for every person subject to pictograph display, that item may be unregistered for some persons. To handle this, it is advisable to give each item of the pictograph display data a priority and to display the next item in order when a higher-priority item is unregistered. For example, suppose "nickname" is set as the first priority and "name" as the second. Then "nickname" is displayed for persons whose nickname is registered, and "name" is displayed for persons whose nickname is not registered. That is, when the items for each person in the database 16 include several that can be used to generate the additional image, means is provided for the user to set priorities for all or some of these items.
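The fallback rule can be sketched as a walk down the priority list that returns the first registered item. The function and key names are hypothetical; an empty string is treated as unregistered, which is an assumption.

```python
from typing import Dict, Optional, Sequence


def pick_display_item(attributes: Dict[str, str],
                      priority: Sequence[str]) -> Optional[str]:
    """Return the value of the highest-priority item that is registered."""
    for key in priority:
        value = attributes.get(key)
        if value:  # missing or empty means "unregistered": fall through to the next item
            return value
    return None
```

With the priority `["nickname", "name"]`, a person with both items registered is shown by nickname, and a person with only a name is shown by name.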

Setting such priorities becomes more tedious as the number of items that can be registered in each person's personal information data grows. To address this, the user may be allowed to set only a limited number of top ranks, with the playback apparatus 10 automatically assigning the items below them. This reduces the burden on the user. Even if the user has not added any data to the items of a personal information record in the database 16, the face image data, which is a mandatory registration item, is assigned to one of the priority ranks, so data to display in the pictograph is always available.
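The automatic assignment of lower ranks could look like the sketch below. The rule used here (append the remaining items in their database order and make sure the face image appears somewhere) is an assumption, since the description leaves the automatic rule to the apparatus; `complete_priority` and the item names are hypothetical.

```python
from typing import List, Sequence


def complete_priority(user_order: Sequence[str],
                      all_items: Sequence[str],
                      mandatory: str = "face_image") -> List[str]:
    """Extend the user's top ranks into a full priority list."""
    # Keep the user's ranks first, dropping accidental duplicates.
    order = list(dict.fromkeys(user_order))
    # Assumed automatic rule: remaining items follow in database order.
    order += [item for item in all_items if item not in order]
    # The mandatory face image always receives some rank.
    if mandatory not in order:
        order.append(mandatory)
    return order
```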

The pictograph on the screen should, as far as possible, not obscure the parts of the reproduced image that attract the most attention. In general, the center of an image tends to attract more attention than the periphery, so the pictograph is basically placed in the periphery of the reproduced image. In a scene where the subject is moving, however, attention may shift to the periphery as the subject moves. It is undesirable for the place where attention is high to overlap the place where the pictograph is displayed. Allowing the user to set the display position of the pictograph avoids this situation.

It is also convenient if the user can set the size of the pictograph. For example, the size may be selected from predefined levels expressed as "large", "medium", and "small" or as a number of dots, or selected using a chart annotated with such expressions. This is useful when the number of pixels of the display screen differs greatly, as between viewing on the relatively small display screen built into the playback apparatus 10 and viewing on a relatively large screen such as a television through the external connection function, because the pictograph size needed for adequate visibility differs by an amount that cannot be ignored.

For the shape as well, providing options such as rectangular and round lets the user's preferences be reflected. However, the data that can be displayed may then need to be restricted according to the shape that is set. For example, only limited data can be presented with adequate visibility inside small outer dimensions. How the restriction is actually applied may be whatever suits the individual playback apparatus. When a shape subject to such a restriction is set as the pictograph shape, the displayable data candidates are displayed starting from the one with the highest priority. For both size and shape, it is preferable that the actual pictograph change at the same time as the setting is changed.

If matching the reproduced audio signal against the voice data in the database 16 by their feature values fails to select the correct person, incorrect data is used for the pictograph. In this embodiment, when the user notices the error on seeing the displayed pictograph, the content can be corrected. FIG. 6 is a flowchart of this correction operation.

The user watches the reproduced display image and checks whether the person information displayed in the pictograph is correct for the voice being heard (S21). If the user finds an error and wants to correct it, the user invokes the correction function (S22). While watching the screen of the display unit 28, the user operates the input device 46, such as a switch or touch panel, to correct the display so that the correct information appears as the pictograph (S23 to S31).

In detail, when the user selects correction (S22), the apparatus enters correction mode and shows a screen for selecting a pictograph on the screen (S23). At this point it is desirable that, for a voice that was excluded from display because no matching person was found in the database 16, a pictograph indicating "no matching person" is displayed and can be selected. This makes it possible to correct errors in which the human voice matching unit 34 failed to match the reproduced audio signal to voice data in the database 16.

On the pictograph selection screen, the currently selected pictograph is highlighted in some way, for example by a change in shape, color, or border. The user selects the pictograph to be corrected with the input device 46 (S24). Immediately after the selection, a dialog asking whether the user really wants to make the correction is displayed to confirm the intent (S25). The user then searches for and selects the data to be used after the correction (S26).

The user also selects the range of pictographs to be corrected at the same time (S27). Providing this selection step reduces the chance of an erroneous selection. For example, the first option limits the correction to "the selected pictograph only". The second option limits it to "all pictographs for which the same person as the selected pictograph was recognized". The third option limits it to "those pictographs recognized as the same person as the selected pictograph whose voice-matching correlation is lower than that of the selected scene". How these options are presented is chosen to suit each playback device.
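The three range options of step S27 can be sketched as follows. This is only an illustrative sketch: the names `Scope`, `Pictograph`, and `correction_targets`, the `scene_id` field, and the choice to always include the selected pictograph itself in the third option are assumptions, not details given in this description.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Scope(Enum):
    SELECTED_ONLY = auto()      # first option
    SAME_PERSON_ALL = auto()    # second option
    SAME_PERSON_LOWER = auto()  # third option


@dataclass
class Pictograph:
    scene_id: int       # hypothetical scene identifier
    person_id: str      # person recognized by voice matching
    correlation: float  # voice-matching correlation for this scene


def correction_targets(pictographs, selected, scope):
    """Return the pictographs to be replaced together with `selected`."""
    if scope is Scope.SELECTED_ONLY:
        return [selected]
    same = [p for p in pictographs if p.person_id == selected.person_id]
    if scope is Scope.SAME_PERSON_ALL:
        return same
    # Third option: the selected pictograph plus same-person pictographs
    # whose correlation is lower than that of the selected scene.
    return [p for p in same
            if p is selected or p.correlation < selected.correlation]
```

The narrower the scope, the less likely a correct pictograph elsewhere in the recording is overwritten by mistake.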

The device then asks whether to execute the correction (S28). When the user decides to execute it, the pictograph display data is actually replaced (S29). If the target pictograph and the range corrected together with it are also reflected in the database 16, the accuracy of subsequent human voice matching improves.

If there are other pictographs to be corrected (S30), they are corrected by the same procedure. If there are none (S30), the device exits the correction mode (S31) and the series of correction operations ends.

With the above configuration and operation, when an image signal and an audio signal are reproduced, information about a person who does not appear in the image and whose voice alone is heard can be checked visually.

A second embodiment of the present invention will now be described. FIG. 7 is a schematic block diagram of the second embodiment. Components identical to those of the embodiment shown in FIG. 1 are denoted by the same reference numerals.

In the second embodiment, a function for recording images and audio in the personal information database is added. That is, a camera unit 50 and a microphone 52 are added to the playback device 10a. The CPU 44a, the image processing unit 18a, and the audio processing unit 30a have, in addition to the functions of the CPU 44, the image processing unit 18, and the audio processing unit 30 respectively, a function for recording images and audio in the database 16a.

The changed parts are described in detail. The camera unit 50 comprises a lens and an image sensor and can be used to capture the face image data to be registered in the database 16a. Specifically, the image processing unit 18a applies color balance and gamma correction to the image signal captured by the camera unit 50, adjusts its size and the like, and registers it in the database 16a as face image data.

The microphone 52 can be used to pick up the human voice on which the voice data in the database 16a is based. The audio processing unit 30a applies the necessary processing to the audio signal picked up by the microphone 52 and then registers it in the database 16a as sound data. For this purpose, the audio processing unit 30a includes a preamplifier that amplifies the output of the microphone 52 and an A/D converter that digitizes the analog output of the preamplifier.

A method of registering the captured image data and audio data in the database 16a is described with reference to FIG. 8.

The user selects the database registration function from the function menu of the playback device 10a (S41) and creates a new personal information entry in the database 16a (S42).

First, face image data is registered (S43). Specifically, the camera unit 50 starts up and becomes ready to shoot (S44). The person to be registered is the subject, and an image including that person's face is framed (S45). At this time, the image being captured by the camera unit 50 is shown on the display unit 28. Shooting is more efficient if the subject can check the positional relationship between the camera unit 50 and his or her own face while watching the image on the display unit 28. Once the subject is properly framed within the angle of view, a switch of the input device 46 is pressed to take the shot and capture the subject's face image (S46). The image processing unit 18a processes the image data from the camera unit 50 into a format and size suitable for registration in the database 16a. In doing so, the image processing unit 18a calculates a face image feature value from the captured face image data (S47) and registers it in the database 16a together with the face image data (S48). This calculation can reuse the function that calculates the face image feature value of a person detected in a reproduced image.

Next, voice data is registered (S49). The audio processing unit 30a activates the function for processing the audio signal picked up by the microphone 52 (S50). When sound pickup is ready, the display unit 28 shows a prompt to input a voice. It is preferable to display a phrase suited to registration and have the registrant read it aloud, so that the registrant does not have to wonder what to say. To record the voice, a switch of the input device 46 is either pressed at the start and at the end of recording, or held down for the duration of the recording.

The audio processing unit 30a temporarily holds the audio from the microphone 52 (S51). It then plays this audio back, together with an on-screen message, and asks the user whether it may be registered (S52). If the user rejects the registration, the audio is captured again. If the user approves it, the audio processing unit 30a processes the captured audio into a format and size suitable for registration in the database 16a, calculates a voice feature value (S53), and registers the voice data and the voice feature value in the database 16a (S54). This calculation can reuse the function that calculates a voice feature value from reproduced audio.
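The capture, confirm, and register loop of steps S51 to S54 might be sketched as below. The function name, the injected callables `capture`, `confirm`, and `extract_features`, the dictionary standing in for the database 16a, and the `max_attempts` limit are all assumptions for illustration; the description itself only says that capture is repeated when the user rejects the recording.

```python
def register_voice(capture, confirm, extract_features, database, person_id,
                   max_attempts=3):
    """Capture a voice sample and store it once the user approves it.

    `capture()` returns the held audio (S51), `confirm(audio)` plays it back
    and returns the user's decision (S52), and `extract_features(audio)`
    computes the voice feature value (S53). All three are hypothetical hooks.
    """
    for _ in range(max_attempts):
        audio = capture()                      # S51: hold the audio
        if confirm(audio):                     # S52: user approves?
            database[person_id] = {            # S53, S54: compute and store
                "voice_data": audio,
                "voice_features": extract_features(audio),
            }
            return True
        # Rejected: loop around and capture again.
    return False
```

Passing the hardware-dependent steps in as callables keeps the flow itself easy to exercise without a microphone.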

Other items, such as the name, are then entered in the database 16a (S55).

The registration order shown in FIG. 8 is only an example; clearly, the "name" may be entered first and the face image data registered afterwards, for instance.

The personal information data registered in the database 16a in this way can be used in the playback device 10 of the first embodiment when reproducing the image and audio signals recorded on the recording medium 12.

FIG. 9 is a schematic block diagram of a third embodiment of the present invention. A video signal with audio is recorded on the recording medium 112 of the playback device 110. The recording medium 112 is, for example, an optical disc such as a DVD, a magnetic tape, a hard disk, or a memory card with built-in flash memory. The recording medium drive device 114 drives the recording medium 112 and reads and writes signals from and to it.

Like the database 16, the database 116 can hold personal information data for a plurality of persons, and such data is in fact registered in it. Each person's data includes face image data used by the face image matching function together with face image feature value data representing its features, voice data used by the human voice matching function together with voice feature value data representing its features, and various other data.

The image processing unit 118 decodes the compressed image data read from the recording medium 112 and applies various processing to generate reproduced image data. The face image recognition unit 120 determines whether the image data processed by the image processing unit 118 contains a region corresponding to a human face. The face image matching unit 122 compares each region that the face image recognition unit 120 judged to contain a face with the face image data registered in the database 116. This makes it possible to determine whether a person in the reproduced image is registered in the database 116 and, if so, which registered person it is.

The additional image generation unit 124 generates additional image data to be superimposed on the reproduced image signal from the recording medium 112. The display image generation unit 126 combines the additional image data output from the additional image generation unit 124 with the reproduced image data output from the image processing unit 118, producing a composite image in which the additional image is superimposed on the reproduced image. However, the output of the display image generation unit 126 may also consist of the reproduced image alone or of the additional image alone.

The display unit 128 displays the image data generated by the display image generation unit 126 as an image that the user can view. The display unit 128 is, for example, built into the playback device and comprises a display device such as a liquid crystal display (LCD) or an organic EL display, together with its drive circuit.

The audio processing unit 130 decodes the compressed audio data read from the recording medium 112 and applies various processing. The human voice recognition unit 132 determines whether the reproduced audio signal obtained by the audio processing unit 130 contains sound corresponding to human speech. The human voice matching unit 134 compares the portion judged by the human voice recognition unit 132 to be a human voice with the voice data registered in the database 116. The volume detection unit 136 detects the volume of the sound when the human voice recognition unit 132 judges it to be a person's voice. The audio output device 138 outputs the reproduced audio generated by the audio processing unit 130 as sound and comprises, for example, a speaker built into the playback device 110 and its drive circuit.

The additional image display determination unit 140 decides the content of the additional image to be generated by the additional image generation unit 124, based on the matching results of the face image matching unit 122 and the human voice matching unit 134.

The external output unit 142 outputs the image signal generated by the display image generation unit 126 and the audio signal reproduced by the audio processing unit 130 to the outside. It comprises, for example, a drive circuit for external output and either a connection terminal or a transmitting antenna.

The camera unit 150 comprises a lens and an image sensor and can be used to capture the face image data to be registered in the database 116. Specifically, the image processing unit 118 applies color balance and gamma correction to the image signal captured by the camera unit 150, adjusts its size and the like, and registers it in the database 116 as face image data. The image processing unit 118 also extracts feature values from the image captured by the camera unit 150 and registers them in the database 116 as face image feature value data.

The microphone 152 can be used to pick up the human voice on which the voice data in the database 116 is based. The audio processing unit 130 applies the necessary processing to the audio signal picked up by the microphone 152 and then registers it in the database 116 as sound data. For this purpose, the audio processing unit 130 includes a preamplifier that amplifies the output of the microphone 152 and an A/D converter that digitizes the analog output of the preamplifier. The audio processing unit 130 also extracts feature values from the picked-up sound data and registers them in the database 116 as voice feature value data.

The CPU 144 is a central processing unit that controls the additional image generation unit 124 and the display image generation unit 126 and also controls the playback device 110 as a whole so that image processing and audio processing operate in synchronization.

The input device 146 lets the user enter operation modes, operating conditions, and the like into the playback device 110. It comprises various switches or buttons, or operable elements displayed on a menu screen.

The authoring processing unit 160 takes the output signals of the image processing unit 118 and the audio processing unit 130, adds the additional image generated by the additional image generation unit 124 as a sub-picture, and converts the result into data conforming to a prescribed format. It records the result on the recording medium 164 via the recording medium drive device 162. The recording medium 164 is where the data generated by the authoring processing unit 160 is stored. The recording medium drive device 162 drives the recording medium 164 using a control method suited to it and reads and writes signals from and to it.

Clearly, the recording medium 112 may itself serve as the recording destination for the output of the authoring processing unit 160. In that case the recording medium 112 must of course be a recordable medium rather than a read-only one. In the embodiment of FIG. 9, the recording media 112 and 164 are drawn separately only to make the storage location of each kind of data easier to understand.

FIG. 10 is a flowchart of the characteristic operation of this embodiment, which is described below with reference to that figure.

First, the personal information data of the required persons is registered in the database 116 in advance (S101). As described above, each person's data consists of face image data and face image feature value data representing its features, voice data and voice feature value data representing its features, and various other attribute data. The attribute data includes, for example, name, nickname, sex, age, date of birth, and a portrait.

Of the data in the database 116, the face image data, the voice data, and the data representing their feature values can be registered using the camera unit 150, the microphone 152, the image processing unit 118, and the audio processing unit 130. The specific method is described later.

The audio processing unit 130 decompresses the compressed audio data read from the recording medium 112 and supplies the reproduced audio signal to the human voice recognition unit 132, the audio output device 138, the external output unit 142, and the authoring processing unit 160. The human voice recognition unit 132 extracts and analyzes human voice from the reproduced audio signal and determines whether a human voice is present (S102). If so, it supplies the extracted human voice data to the human voice matching unit 134.

The human voice matching unit 134 extracts feature values from the human voice data supplied by the human voice recognition unit 132, matches them against the voice feature value data in the database 116, and searches for persons whose feature values show a correlation at or above a certain level (S103). If voice feature value data with a correlation at or above that level exists, the human voice matching unit 134 notifies the additional image display determination unit 140 of the persons having that data as candidates for additional image display (S104).
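The threshold search of steps S103 and S104 can be illustrated with a small sketch. The description does not say how the correlation is computed; cosine similarity between feature vectors is used here purely as a stand-in, and the function names, the dictionary layout of the registered features, and the default threshold of 0.8 are assumptions.

```python
import math


def correlation(a, b):
    """Cosine similarity, used here as a stand-in for the correlation degree."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_candidates(feature, registered, threshold=0.8):
    """Return the IDs of registered persons whose stored feature vector
    correlates with `feature` at or above `threshold` (S103, S104)."""
    return [person_id for person_id, ref in registered.items()
            if correlation(feature, ref) >= threshold]
```

The same search shape applies to the face image matching of steps S106 and S107, with face feature vectors in place of voice feature vectors.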

Meanwhile, the image processing unit 118 decompresses the compressed image data reproduced from the recording medium 112 to generate reproduced image data. The face image recognition unit 120 extracts human face images from the reproduced image and supplies the extracted face image data to the face image matching unit 122 (S105). The face image matching unit 122 extracts feature values from that face image data, matches them against the face image feature value data in the database 116, and searches for persons whose face image feature values show a correlation at or above a certain level (S106). If such face image feature value data exists, the face image matching unit 122 notifies the additional image display determination unit 140 of the persons having that data as candidates for exclusion from additional image display (S107).

The additional image display determination unit 140 removes the exclusion candidates extracted from the reproduced image (those notified by the face image matching unit 122) from the candidates extracted from the reproduced audio (those notified by the human voice matching unit 134) (S108). In this way, within a given scene, persons who are on screen (S107) are removed from the candidates likely to be the speaker (S104). That is, persons who are not within the shot but whose voices are recorded can be identified. The additional image display determination unit 140 notifies the additional image generation unit 124 of the persons so identified.
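Step S108 amounts to a set difference between the two candidate lists. A minimal sketch, in which the function name and the use of string person IDs are assumptions:

```python
def off_screen_speakers(voice_candidates, face_candidates):
    """Remove persons found on screen (S107) from the persons found by
    voice matching (S104), keeping the order of the voice candidates."""
    on_screen = set(face_candidates)
    return [person for person in voice_candidates if person not in on_screen]
```

For instance, with voice candidates `["alice", "bob", "carol"]` and `"bob"` visible on screen, the function returns `["alice", "carol"]`, the persons who are heard but not seen.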

The additional image generation unit 124 obtains from the database 116 the face image data of the person notified by the additional image display determination unit 140 (S109) and generates an additional image of a predetermined size containing that face image (S110). It supplies the generated additional image to the display image generation unit 126 and the authoring processing unit 160.

The display image generation unit 126 superimposes the additional image from the additional image generation unit 124 on the reproduced image data from the image processing unit 118 to generate composite image data. The composite image data is supplied to the display unit 128 and displayed as shown in FIG. 4. It may of course also be output externally from the external output unit 142.

The authoring processing unit 160 multiplexes the reproduced image signal from the image processing unit 118, the reproduced audio signal from the audio processing unit 130, and the additional image from the additional image generation unit 124 into a single piece of video content. In doing so, it generates a video signal in which the additional image is overlaid on the reproduced image signal as a pictogram-like sub-picture (S111). The authoring processing unit 160 records the video content so generated on the recording medium 164 by means of the recording medium drive device 162 (S112). For example, a DVD in the DVD-VIDEO format is created. With the DVD-VIDEO format, turning on the "subtitle" function lets the viewer watch the main video with information about persons speaking outside the angle of view overlaid on it.

The database 116 is described next. Like the database 16, the database 116 may be stored on the recording medium 112 or built into the playback device 110. If the playback device 110 supports networking, the database 116 may be provided on a server reached over the network. If the recording medium 112 is removable from the playback device 110, the database recorded on it may either be loaded into a storage area of the playback device 110 for use or be referenced directly on the recording medium 112. In the former case, a removable recording medium containing only the database may also be provided.

The database 116 has the same structure as the database 16. Each person's personal information data registered in the database 116 always includes face image data, used for face image matching and pictograph display, and voice data, used for voice matching. In addition, data such as name, nickname, sex, age, date of birth, and a portrait is held as supplementary attribute information. This supplementary attribute information may be used for pictograph display together with, or in place of, the face image data.

As the number of persons registered in the database 116 grows, a distinction arises between persons the user wants shown as pictographs and persons the user does not. To handle this, each person's personal information data preferably has an item that sets whether pictograph display is permitted. Which value is set by default when personal information data is newly registered may be decided according to how the device is used.
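One possible shape for a personal information record, with mandatory face and voice data, optional attributes, and the per-person display setting, is sketched below. The class name, field names, and field types are assumptions, and the default of `True` for `show_pictograph` is only a placeholder, since the description leaves the default to the usage situation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PersonRecord:
    # Mandatory items: used for matching and for pictograph display.
    face_image: bytes
    face_features: List[float]
    voice_data: bytes
    voice_features: List[float]
    # Supplementary attribute information (all optional).
    name: Optional[str] = None
    nickname: Optional[str] = None
    sex: Optional[str] = None
    age: Optional[int] = None
    date_of_birth: Optional[str] = None
    portrait: Optional[bytes] = None
    # Per-person setting for whether a pictograph may be displayed.
    show_pictograph: bool = True


def visible(records):
    """Keep only the records whose pictograph display is permitted."""
    return [r for r in records if r.show_pictograph]
```

Filtering with `visible` before generating additional images keeps persons the user does not want shown out of the display entirely.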

Like the database 16, the database 116 is preferably general-purpose, so that databases created on other devices can also be loaded and used.

If the user can enable or disable the pictograph display function, a wish to view only the original recorded image can also be accommodated. This is useful, for example, when the user does not care who the off-screen persons are, or when the voices heard are already familiar.

The information displayed as a pictograph is one or more of the items held in the database 116. Letting the user choose which items to display allows the display to follow what the user wants to know. For example, displaying the "name" together with the face image makes it easier to identify a person even when the face image alone does not make clear who it is. Displaying only the "name" of course achieves the same effect.

If the data items displayed are the same for every person subject to pictograph display, the item in question may not have been registered for some persons. To handle this, it is advisable to assign a priority to each pictograph display item and to display the next-ranked item when a higher-ranked one is not registered. Suppose, for example, that "nickname" is given first priority and "name" second priority. Then the "nickname" is displayed for persons who have one registered, and the "name" is displayed for persons who do not. Setting such priorities becomes more tedious as the number of items that can be registered per person in the database 116 grows. This can be addressed by letting the user set only a finite number of top ranks and having the playback device 110 assign the lower ranks automatically, which reduces the burden on the user. Even if the user has not added any data to the items of the personal information data in the database 116, the face image data, which is a mandatory item, is always assigned some priority rank, so data to display in the pictograph is always available.
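The priority fallback described above can be sketched as a lookup that walks the ranked items and ends at the mandatory face image. The function name, the dictionary representation of a record, and the key names are assumptions for illustration.

```python
def display_item(record, priority=("nickname", "name"), mandatory="face_image"):
    """Return (item, value) for the highest-priority registered item.

    `priority` holds the user-set ranks; the mandatory face image is
    appended as the last resort so that something is always available.
    """
    for key in (*priority, mandatory):
        value = record.get(key)
        if value:
            return key, value
    return None  # only reachable if the mandatory item is missing
```

With "nickname" ranked first and "name" second, a record lacking a nickname falls through to the name, and a record with neither falls through to the face image.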

Setting an upper limit on the number of pictographs displayed on screen at the same time prevents a large number of pictographs from covering the main video or one another. If the user can set the maximum number of pictographs displayed on one screen or for one scene, the limit can be adjusted to suit the size of the television monitor used for playback and the viewer's visual ability. When the number displayed is limited in this way, which pictographs are given display priority becomes important. The simplest method is to update the pictographs one after another each time a speaker is recognized. This is a so-called "latest-first" scheme based on the chronological order of utterances. The display always covers the most recent speaker and the speakers before that, going back as many speakers as the display limit allows. The display is not updated while the same person keeps speaking, but it is updated as soon as another person speaks.
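The latest-first scheme with a display limit behaves like a small recency list. The sketch below is one way to model it; the class and method names are assumptions, and moving a returning speaker to the newest position is an interpretation of "going back from the most recent speaker".

```python
from collections import OrderedDict


class RecentSpeakers:
    """Track the most recent distinct speakers, up to a display limit."""

    def __init__(self, limit):
        self.limit = limit
        self._shown = OrderedDict()  # oldest first, newest last

    def on_speaker(self, person_id):
        """Record a recognized speaker; return True if the display changes."""
        if self._shown and next(reversed(self._shown)) == person_id:
            return False  # same person keeps speaking: no update
        self._shown.pop(person_id, None)   # re-insert as the newest entry
        self._shown[person_id] = True
        while len(self._shown) > self.limit:
            self._shown.popitem(last=False)  # drop the oldest speaker
        return True

    def displayed(self):
        return list(self._shown)
```

With a limit of two, the sequence a, a, b, c leaves b and c displayed, and the second utterance by a causes no update.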

Another possible method is linked to utterance volume. The volume detection unit 136 measures each person's utterance volume, and the display priority of the pictographs is determined from the result. If priority increases with volume, the pictographs of persons who are easier to hear are displayed preferentially. Conversely, if priority increases as volume decreases, the pictographs of persons who are harder to hear and easier to miss are displayed preferentially.
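The two volume-linked orderings can be illustrated with a short sketch. The function name `select_by_volume`, the dictionary of per-speaker volumes and the numeric values are assumptions for the example; the text does not specify how the volume detection unit 136 reports its measurements.

```python
def select_by_volume(volumes: dict[str, float], limit: int,
                     loudest_first: bool = True) -> list[str]:
    """Pick up to `limit` speakers, ranked by measured utterance volume.

    loudest_first=True  favors voices that are easy to hear;
    loudest_first=False favors quiet voices that are easy to miss.
    """
    ranked = sorted(volumes, key=volumes.get, reverse=loudest_first)
    return ranked[:limit]


measured = {"A": 60.0, "B": 72.0, "C": 48.0}  # illustrative values
print(select_by_volume(measured, limit=2))
print(select_by_volume(measured, limit=2, loudest_first=False))
```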

It is also effective to decide in advance which persons should be displayed preferentially. This can be realized by including a pictograph display priority setting among the registration items of the database 116 and having the additional image display determination unit 140 perform processing according to that setting.

For example, when the video source was shot with a home video camera, the photographer is likely to speak frequently and from the closest distance. If the priorities described above are applied in that case, the photographer's pictograph will be displayed far more often than any other. Although the photographer often does not appear in the video, who it was is relatively easy to tell. Therefore, by allowing the user to enter who the photographer is, and then either rendering only that person's pictograph differently from the others or not displaying it at all, the clutter of the pictograph display can be reduced.
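As a rough sketch, a registered display priority and the special handling of the photographer could be combined as below. The function `order_for_display`, the integer priority values and the default priority of 0 for persons without a setting are all assumptions for illustration. Only the "do not display" option is shown; rendering the photographer differently would be handled at drawing time.

```python
from typing import Optional


def order_for_display(speakers: list[str],
                      registered_priority: dict[str, int],
                      photographer: Optional[str] = None,
                      hide_photographer: bool = True) -> list[str]:
    """Order off-screen speakers by their registered display priority.

    Higher values are shown first; ties keep the input order because
    Python's sort is stable.
    """
    candidates = [s for s in speakers
                  if not (hide_photographer and s == photographer)]
    return sorted(candidates,
                  key=lambda s: registered_priority.get(s, 0),
                  reverse=True)


print(order_for_display(["dad", "kid", "grandma"],
                        {"kid": 5, "grandma": 3},
                        photographer="dad"))
```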

Interfaces that are entertaining as well as more visible are also conceivable. In the first method, the size of the pictograph is changed according to the volume of the uttered voice. The volume data detected by the volume detection unit 136 is correlated with the outer dimensions of the pictograph. For example, when the size is changed in three steps, two threshold levels are set for the volume data: the pictograph is made small if the volume is at or below the lower threshold, medium if it is between the two thresholds, and large if it is at or above the higher threshold.
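The three-step size rule can be written directly as a threshold comparison. The function name and the threshold values in the usage lines are illustrative assumptions; the text does not give concrete levels or units.

```python
def pictograph_size(volume: float, low: float, high: float) -> str:
    """Map a volume reading to one of three pictograph sizes."""
    if low >= high:
        raise ValueError("low threshold must be below high threshold")
    if volume <= low:
        return "small"   # at or below the lower threshold
    if volume >= high:
        return "large"   # at or above the higher threshold
    return "medium"      # strictly between the two thresholds


for reading in (40.0, 60.0, 80.0):
    print(reading, pictograph_size(reading, low=50.0, high=70.0))
```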

In the second method, the display position of the pictograph is changed according to the directivity angle of the uttered voice. The volume detection unit 136 analyzes the reproduced audio and determines from which direction the viewer will perceive the voice to come, given the speaker output. When the reproduced audio is in two-channel mode, the pictograph is displayed at a horizontal position that follows the directivity angle of the voice. In surround mode, the pictograph is placed by treating the left-right direction of the sound as left-right on the screen and the front-back direction as up-down. For example, the pictograph corresponding to a voice heard from the front right is displayed at the upper right of the reproduced image.
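One way to picture this mapping is the sketch below. It assumes the perceived direction has already been reduced to two normalized values, `pan` (-1 for full left, +1 for full right) and `depth` (-1 for rear, +1 for front); that representation, the function name, and the choice to pin two-channel pictographs to the bottom row are assumptions for the example, not details from the text.

```python
def pictograph_position(pan: float, depth: float,
                        width: int, height: int,
                        surround: bool) -> tuple[int, int]:
    """Return the (x, y) pixel position for a pictograph.

    pan   -- -1.0 (left) to +1.0 (right)
    depth -- -1.0 (rear) to +1.0 (front); ignored in two-channel mode
    """
    x = round((pan + 1.0) / 2.0 * (width - 1))
    if surround:
        # Front maps to the top of the image, rear to the bottom.
        y = round((1.0 - depth) / 2.0 * (height - 1))
    else:
        # Two-channel mode: only the horizontal position varies
        # (placing the pictograph on the bottom row is an assumption).
        y = height - 1
    return x, y


# A voice from the front right lands at the upper right of the image.
print(pictograph_position(1.0, 1.0, 1920, 1080, surround=True))
```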

Since utterance volume and directivity angle change constantly, the pictograph display should follow those changes. Updating the pictograph display at suitable time intervals makes the changes in the audio visible.

In addition, color-coding according to gender, age, or the like, based on the data registered in the database 116, is also conceivable.

Paying attention to display timing can make the display easier to see. For example, consider the case where a short utterance such as “Ah!”, made in surprise, is detected. If the display reacts promptly to such a detection result, the pictograph appears only for an instant, which makes it very hard to confirm who the displayed person was. Therefore, the authoring processing unit 160 is set to start displaying the pictograph corresponding to a given voice a fixed time in advance. At the end of the utterance, the display is held for a fixed time before being erased. That is, the authoring processing unit 160 performs authoring so that display of the sub-picture starts before the utterance starts and ends after the utterance ends. This provides time before and after the actual utterance in which the pictograph can be checked, improving visibility for short utterances. Alternatively, if the utterance duration can be detected in advance, the timing adjustment may be applied only to utterances no longer than a prescribed duration.
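The timing adjustment amounts to widening each utterance interval before it is authored as a sub-picture. The sketch below assumes times in seconds; the function name, the default lead and hold values, and the `short_only_below` parameter are hypothetical.

```python
from typing import Optional


def subpicture_interval(start: float, end: float,
                        lead: float = 1.0, hold: float = 2.0,
                        short_only_below: Optional[float] = None
                        ) -> tuple[float, float]:
    """Return the (show, hide) times for a pictograph sub-picture.

    lead -- how long before the utterance the display starts
    hold -- how long after the utterance the display is kept
    short_only_below -- if given, adjust only utterances whose duration
                        does not exceed this value
    """
    duration = end - start
    if short_only_below is not None and duration > short_only_below:
        return start, end  # long enough to be seen without adjustment
    return max(0.0, start - lead), end + hold


# A half-second exclamation becomes a 3.5-second display.
print(subpicture_interval(10.0, 10.5))
```

Clamping the start time at zero keeps the display from being scheduled before the beginning of the recording.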

When sub-pictures are set up by combining the various pictograph display methods described above, the following approach can be used. If the authoring process can generate multiple sub-picture channels, multiple channels with different combinations are provided. For example, one channel may use last-arrival priority display, another volume priority display, and yet another directivity angle with color-coding by gender. If the result is authored in DVD-Video format, the viewer can enjoy the different displays by switching the “subtitle” selection. FIG. 11 shows a pictograph display example in which the maximum number of simultaneously displayed pictographs is three and the display is linked to utterance volume and utterance directivity angle.

Thus, in this embodiment, an information medium can be created that can display, as a sub-picture, a pictograph showing information about a person whose voice is heard during playback but who does not appear in the reproduced image.

If the reproduced audio signal and the voice data in the database 116 are matched by their feature values and the correct person is not selected, wrong data is used for the pictograph. In this embodiment, when the user notices an error on seeing the displayed pictograph, it can be corrected. FIG. 12 shows a flowchart of the correction operation.

The user views the reproduced image and checks whether the person information displayed in the pictograph is correct for the voice being heard (S121). If the user finds an error and wants to correct it, the user invokes the correction function (S122). While viewing the screen of the display unit 128, the user uses the input device 146, such as switches or a touch panel, to make the correction so that the correct information is displayed as the pictograph (S123 to S131).

In detail, when the user selects correction (S122), the apparatus enters correction mode and shows a screen for selecting a pictograph on the screen (S123). In this pictograph selection, for a voice that was excluded from display because no matching person was found in the database 116, it is desirable to display a pictograph indicating “no matching person” so that it can be selected. This makes it possible to correct errors in which the human voice matching unit 134 failed to match the reproduced audio signal to voice data in the database 116.

On the pictograph selection screen, the currently selected pictograph is highlighted in some way, for example by a change in shape, color, or border. The user selects the pictograph to be corrected with the input device 146 (S124). Immediately after the selection, a dialog asking “Are you sure you want to correct this?” is displayed to confirm the user's intent (S125). Next, the data to be used after correction is searched for and selected (S126).

The range of pictographs to be corrected at the same time is also selected (S127). Providing this selection step reduces the likelihood of erroneous selection. For example, the first option limits the correction to “the selected pictograph only”. The second option limits it to “all pictographs recognized as the same person as the selected pictograph”. The third option limits it to “those pictographs recognized as the same person as the selected pictograph whose voice-matching correlation rate is lower than that of the selected scene”. How these options are presented is chosen to suit each playback apparatus.
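The three correction ranges of step S127 can be sketched as a filter over the displayed pictographs. The `Shown` record, the scope labels and the decision to always include the selected pictograph itself in the third option are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Shown:
    scene: int     # hypothetical scene index
    person: str    # person the voice was matched to
    score: float   # voice-matching correlation rate (higher = closer match)


def correction_targets(items: list[Shown], selected: Shown,
                       scope: str) -> list[Shown]:
    """Return the pictographs that a correction should be applied to."""
    if scope == "selected_only":
        return [selected]
    same = [p for p in items if p.person == selected.person]
    if scope == "same_person":
        return same
    if scope == "same_person_lower_score":
        # Assumption: the selected pictograph is corrected as well.
        return [p for p in same if p is selected or p.score < selected.score]
    raise ValueError(f"unknown scope: {scope}")


shown = [Shown(1, "A", 0.90), Shown(2, "A", 0.60),
         Shown(3, "B", 0.80), Shown(4, "A", 0.95)]
print(correction_targets(shown, shown[0], "same_person_lower_score"))
```

The third option leaves higher-scoring matches untouched, on the reasoning that a match more confident than the one the user flagged is less likely to be wrong.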

Whether to execute the correction is then confirmed (S128). Once the user decides to execute it, the pictograph display data is actually replaced (S129). If the corrected pictograph and the range corrected along with it are reflected in the database 116 at this point, the accuracy of subsequent human voice matching improves.

If there are other pictographs to correct (S130), they are corrected by the same procedure. If there are no others (S130), the apparatus exits correction mode (S131), and the series of correction operations ends.

In this embodiment, the camera unit 150 and the microphone 152 can be used to register face image data and voice data in the database 116. A method of registering captured image data and audio data in the database 116 is described with reference to FIG. 13.

The user selects the database registration function from the function menu of the playback apparatus 110 (S141) and creates a new personal information entry in the database 116 (S142).

First, face image data is registered (S143). Specifically, the camera unit 150 starts up and becomes ready to shoot (S144). The person to be registered is the subject, and an image including that person's face is framed (S145). At this time, the image captured by the camera unit 150 is displayed on the display unit 128. Shooting is more efficient if the subject can check the position of his or her face relative to the camera unit 150 while viewing the image on the display unit 128. Once the subject is properly framed within the angle of view, a switch on the input device 146 is pressed to shoot and capture the subject's face image (S146). The image processing unit 118 processes the image data from the camera unit 150 into a format and size suitable for registration in the database 116. The image processing unit 118 also calculates a face image feature value from the captured face image data (S147) and registers it in the database 116 together with the face image data (S148). The function that calculates the face image feature value of a person detected in a reproduced image can be reused for this calculation.

Next, voice data is registered (S149). The audio processing unit 130 starts the function that processes the audio signal picked up by the microphone 152 (S150). When sound pickup is ready, a prompt asking the user to input a voice is shown on the display unit 128. It is preferable to display a phrase suitable for registration and have the registrant say it, so that the registrant does not have to decide what to say. To record the voice, a switch on the input device 146 is either pressed at the start and at the end of recording, or held down for the duration of recording.

The audio processing unit 130 temporarily holds the audio from the microphone 152 (S151). It then plays back this audio, together with an on-screen display, and asks the user to confirm whether it may be registered (S152). If the user rejects it, the audio is captured again. If the user approves, the audio processing unit 130 processes the captured audio into a format and size suitable for registration in the database 116, calculates a voice feature value (S153), and registers the voice data and the voice feature value in the database 116 (S154). The function that calculates a voice feature value from reproduced audio can be reused for this calculation.

Subsequently, other items, such as a name, are entered in the database 116 (S155).

The registration order shown in FIG. 13 is only an example; clearly, the face image data may instead be registered after “name” is entered first, for example.

FIG. 1 is a schematic block diagram of the playback apparatus according to the first embodiment of the present invention.
FIG. 2 is an explanatory diagram of the shooting situation.
FIG. 3 is a schematic diagram showing the playback state in a conventional example.
FIG. 4 is a schematic diagram showing the playback state in this embodiment.
FIG. 5 is a flowchart explaining the additional image generation process in this embodiment.
FIG. 6 is a flowchart explaining the pictograph correction process in this embodiment.
FIG. 7 is a schematic block diagram of the second embodiment of the present invention.
FIG. 8 is a flowchart explaining the process of registering personal information in the database in the second embodiment.
FIG. 9 is a schematic block diagram of the third embodiment of the present invention.
FIG. 10 is a flowchart explaining the additional image generation process in the third embodiment.
FIG. 11 is a diagram showing a pictograph display example in the third embodiment.
FIG. 12 is a flowchart explaining the pictograph correction process in the third embodiment.
FIG. 13 is a flowchart explaining the process of registering personal information in the database in the third embodiment.

Explanation of symbols

10, 10a, 110: Playback apparatus
12, 112: Recording medium
14, 114: Recording medium drive device
16, 16a, 116a: Database
18, 118: Image processing unit
20, 120: Face image recognition unit
22, 122: Face image matching unit
24, 124: Additional image generation unit
26, 126: Display image generation unit
28, 128: Display unit
30, 130: Audio processing unit
32, 132: Human voice recognition unit
34, 134: Human voice matching unit
36, 136: Volume detection unit
38, 138: Audio output device
40, 140: Additional image display determination unit
42, 142: External output unit
44, 144: CPU
46, 146: Input device
50, 150: Camera unit
52, 152: Microphone
160: Authoring processing unit
162: Recording medium drive device
164: Recording medium

Claims

A playback apparatus comprising:
image processing means for reproducing image data read from a recording medium and outputting a reproduced image signal;
audio processing means for reproducing audio data read from the recording medium and outputting a reproduced audio signal;
a database in which voice data and face image data of a plurality of persons are recorded;
human voice recognition means for discriminating and recognizing a human voice in the reproduced audio signal;
human voice matching means for matching the voice recognized by the human voice recognition means against the voice data registered in the database to identify a person;
face image recognition means for discriminating and recognizing a person's face in the reproduced image signal;
face image matching means for matching the face recognized by the face image recognition means against the face image data registered in the database to identify a person;
additional image display determination means for determining, as additional image display targets, the persons identified by the human voice matching means, excluding the persons identified by the face image matching means in the same scene;
additional image generation means for reading information indicating the target person determined by the additional image display determination means from the database and generating an additional image to be combined with the reproduced image signal; and
display image generation means for combining the additional image with the reproduced image signal.

The playback apparatus according to claim 1, wherein the database includes data supplied to the additional image generation unit separately from the face image data used by the face image matching unit.

The playback apparatus according to claim 1, wherein the database has, for each person, information indicating whether the additional image may be combined with the reproduced image signal in the display image generation means, and
the display image generation means combines the additional image with the reproduced image signal when information permitting the combination is registered in the database.

The playback apparatus according to any one of claims 1 to 3, further comprising:
a camera unit for capturing images;
a microphone for picking up sound; and
means for registering, in the database, a face image captured by the camera unit and sound picked up by the microphone.

The playback apparatus according to claim 1, further comprising means for a user to set whether the additional image generation means is to generate the additional image.

The playback apparatus according to claim 1, further comprising means for a user to set which information is used to generate the additional image when the information on each person in the database includes a plurality of pieces of information usable for generating the additional image.

The playback apparatus as described, wherein a priority order can be set for the plurality of pieces of information when the information on each person in the database includes a plurality of pieces of information usable for generating the additional image.

The playback apparatus according to claim 7, further comprising means for a user to set the priority order of all or some of the plurality of pieces of information when the information on each person in the database includes a plurality of pieces of information usable for generating the additional image.

The playback apparatus according to claim 1, wherein the display image generation means arranges the additional image around the image represented by the reproduced image signal.

The playback apparatus according to claim 1, further comprising authoring means for generating information that includes the reproduced audio signal and an image signal in which the additional image is combined with the reproduced image signal as a sub-picture, and for recording the information on the recording medium or on a recording medium different from the recording medium.

The reproduction apparatus according to claim 10, wherein the number of the additional images generated by the additional image generation unit for the same scene is limited.

The playback apparatus according to claim 10, wherein, when a plurality of the additional images exist for the same scene, the display priority order of the plurality of additional images is determined by order of utterance.

The playback apparatus according to claim 10 or 11, wherein, when a plurality of the additional images exist for the same scene, the display priority order of the plurality of additional images is determined by order of utterance volume.

The playback apparatus according to claim 10 or 11, wherein, when a plurality of the additional images exist for the same scene, the display priority order of the plurality of additional images is determined according to the display priority of each person registered in the database.

The playback apparatus according to claim 10, further comprising means for detecting the direction of the reproduced audio signal,
wherein the authoring means links the display position of the sub-picture to the direction.

The playback apparatus according to claim 10, wherein the authoring means performs the authoring process so that display of the sub-picture starts before the utterance starts.

The playback apparatus according to claim 10, wherein the authoring means performs the authoring process so that display of the sub-picture ends after the utterance ends.