JP2011014985A

JP2011014985A - Imaging apparatus, imaging method and program

Info

Publication number: JP2011014985A
Application number: JP2009154924A
Authority: JP
Inventors: Kazumi Aoyama; 一美青山; Kotaro Sabe; 浩太郎佐部; Masato Ito; 真人伊藤
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2009-06-30
Filing date: 2009-06-30
Publication date: 2011-01-20

Abstract

PROBLEM TO BE SOLVED: To capture an image in response to a subject uttering a prescribed keyword. SOLUTION: A digital still camera 60 includes an imaging part 61, an image processing part 62, a recording part 63, a U/I part 64, an imaging control part 65, and an automatic shutter control part 66. The imaging part 61 acquires an optical image of a subject, converts it into an electric signal, and outputs the resulting image signal to the next stage. The imaging control part 65 controls the imaging part 61 to perform imaging according to a shutter operation signal from the U/I part 64 or an automatic shutter signal from the automatic shutter control part 66. When the automatic shutter control part 66 detects, on the basis of a finder moving image input from the imaging part 61, that the person who is the subject has uttered a shutter keyword, it outputs the automatic shutter signal to the imaging control part 65. The invention is applicable to digital cameras and digital video cameras.

Description

The present invention relates to an imaging apparatus, an imaging method, and a program, and more particularly to an imaging apparatus, an imaging method, and a program that perform imaging in response to a subject uttering a predetermined keyword.

Cameras having an auto shutter function are known.

Older examples include timer-operated cameras and cameras that use a remote controller. More recent examples include cameras with a so-called smile shutter function, which captures an image in response to a subject's smile (see, for example, Patent Document 1), and cameras with a so-called wink shutter function, which captures an image in response to a subject's wink (see, for example, Patent Document 2).

Patent Document 2 also describes performing imaging according to the movement of the subject's mouth.

Patent Document 1: JP 2005-56387 A
Patent Document 2: JP 2008-72183 A

With the smile shutter function described above, the camera detects the degree of the subject's smile and captures an image when that degree exceeds a predetermined threshold. Therefore, even if the subject smiles deliberately, it is difficult for the subject to adjust the imaging timing at will.

With the wink shutter function described above, an image is captured when the camera detects a wink (the opening and closing of one eye) by the subject. If the subject can wink intentionally and reliably, imaging can be performed at the timing the subject intends. However, when the subject merely blinks, the camera may erroneously detect the blink as a wink. In addition, if winking is a movement the subject does not normally make, the subject may be unable to wink out of embarrassment or the like.

Furthermore, when a group photograph of several people is taken, the others in the shot still have to be told the imaging timing by a spoken cue such as "Say cheese," in addition to the wink.

It is conceivable to give the camera a voice recognition function so that it captures an image in response to a keyword such as "Say cheese." However, this cannot be used when the camera is placed too far away for the voice to reach it, or in environments with a lot of non-speech noise, so its practicality is low.

Moreover, the method described in Patent Document 2 and elsewhere, which performs imaging according to the subject's utterance (mouth movement), does not detect the utterance of a predetermined keyword but merely detects movement of the mouth. The subject therefore cannot say anything until the desired imaging timing.

The present invention has been made in view of such circumstances. It identifies the utterance content of a subject based on a finder image and performs imaging in response to a predetermined keyword being uttered.

An imaging apparatus according to one aspect of the present invention includes: imaging means that outputs a finder image during composition and outputs a recorded image at the time of imaging; a multi-class classifier that, in response to the input of a lip image, outputs a multidimensional score vector indicating how similar the lip image is to each of a plurality of types of visemes; a registration database in which modeled registration time-series feature quantities are registered in association with keywords; generating means that generates the lip image, which includes the lip region of the subject, from the finder image, inputs it to the multi-class classifier, and arranges the resulting multidimensional score vectors in time series to generate a recognition time-series feature quantity; and auto shutter control means that controls the imaging means to execute imaging processing based on the result of comparing the generated recognition time-series feature quantity with the models registered in the registration database.

The multi-class classifier may be generated by AdaBoostECOC learning using image feature quantities of lip images to which class labels indicating visemes have been added.

The image feature quantity may be a pixel difference feature.

The imaging apparatus according to one aspect of the present invention may further include registration means that generates registration lip images from a registration finder image whose subject is a person uttering an arbitrary keyword, inputs the registration lip images to the multi-class classifier, arranges the resulting multidimensional score vectors in time series to generate a registration time-series feature quantity, models the registration time-series feature quantity, and registers it in the registration database in association with the arbitrary keyword.

The registration means may model the registration time-series feature quantity using an HMM.

An imaging method according to one aspect of the present invention is an imaging method for an imaging apparatus that includes imaging means that outputs a finder image during composition and outputs a recorded image at the time of imaging, a multi-class classifier that, in response to the input of a lip image, outputs a multidimensional score vector indicating how similar the lip image is to each of a plurality of types of visemes, and a registration database in which modeled registration time-series feature quantities are registered in association with keywords. The method includes the following steps performed by the imaging apparatus: a generating step of generating the lip image, which includes the lip region of the subject, from the finder image, inputting it to the multi-class classifier, and arranging the resulting multidimensional score vectors in time series to generate a recognition time-series feature quantity; and an auto shutter control step of controlling the imaging means to execute imaging processing based on the result of comparing the generated recognition time-series feature quantity with the models registered in the registration database.

A program according to one aspect of the present invention is for a computer of an imaging apparatus that includes imaging means that outputs a finder image during composition and outputs a recorded image at the time of imaging, a multi-class classifier that, in response to the input of a lip image, outputs a multidimensional score vector indicating how similar the lip image is to each of a plurality of types of visemes, and a registration database in which modeled registration time-series feature quantities are registered in association with keywords. The program causes the computer to function as: generating means that generates the lip image, which includes the lip region of the subject, from the finder image, inputs it to the multi-class classifier, and arranges the resulting multidimensional score vectors in time series to generate a recognition time-series feature quantity; and auto shutter control means that controls the imaging means to execute imaging processing based on the result of comparing the generated recognition time-series feature quantity with the models registered in the registration database.

In one aspect of the present invention, a lip image including the lip region of a subject is generated from a finder image and input to a multi-class classifier. The resulting multidimensional score vectors are arranged in time series to generate a recognition time-series feature quantity, and the imaging means is controlled to execute imaging processing based on the result of comparing the generated recognition time-series feature quantity with the models registered in the registration database.

According to one aspect of the present invention, the utterance content of a subject can be identified based on a finder image, and imaging can be performed in response to a predetermined keyword being uttered.

A block diagram showing a configuration example of an utterance recognition device to which the present invention is applied.
A diagram showing examples of a face image, a lip region, and a lip image.
A diagram showing an example of a conversion table for converting phoneme labels into viseme labels.
A diagram showing an example of learning samples.
A diagram showing an example of a time-series feature quantity.
A flowchart explaining the utterance recognition process.
A flowchart explaining the learning process.
A flowchart explaining the processing of the learning utterance moving image.
A flowchart explaining the processing of the learning utterance voice.
A flowchart explaining the AdaBoostECOC learning process.
A flowchart explaining the learning process of a binary weak classifier.
A flowchart explaining the registration process.
A flowchart explaining the K-dimensional score vector calculation process.
A flowchart explaining the recognition process.
A diagram showing examples of registration utterance words.
A diagram showing recognition performance.
A block diagram showing a configuration example of a digital still camera to which the present invention is applied.
A block diagram showing a configuration example of the auto shutter control unit.
A flowchart explaining the auto shutter registration process.
A flowchart explaining the auto shutter execution process.
A block diagram showing a configuration example of a computer.

Hereinafter, the best mode for carrying out the invention (hereinafter referred to as an embodiment) will be described in detail with reference to the drawings. The description will be given in the following order.
1. First embodiment
2. Second embodiment

<1. First Embodiment>
[Configuration example of utterance recognition device]
FIG. 1 shows a configuration example of an utterance recognition device 10 according to the first embodiment. The utterance recognition device 10 identifies the utterance content of a subject based on a moving image obtained by video-recording a speaker as the subject.

The utterance recognition device 10 consists of a learning system 11 that performs learning processing, a registration system 12 that performs registration processing, and a recognition system 13 that performs recognition processing.

The learning system 11 comprises an image/sound separation unit 21, a face region detection unit 22, a lip region detection unit 23, a lip image generation unit 24, a phoneme label assignment unit 25, a phoneme dictionary 26, a viseme label conversion unit 27, a viseme label addition unit 28, a learning sample holding unit 29, a viseme classifier learning unit 30, and a viseme classifier 31.

The registration system 12 comprises the viseme classifier 31, a face region detection unit 41, a lip region detection unit 42, a lip image generation unit 43, an utterance period detection unit 44, a time-series feature quantity generation unit 45, a time-series feature quantity learning unit 46, and an utterance recognizer 47.

The recognition system 13 comprises the viseme classifier 31, the face region detection unit 41, the lip region detection unit 42, the lip image generation unit 43, the utterance period detection unit 44, the time-series feature quantity generation unit 45, and the utterance recognizer 47.

That is, the viseme classifier 31 belongs to all three of the learning system 11, the registration system 12, and the recognition system 13, and the recognition system 13 is the registration system 12 with the time-series feature quantity learning unit 46 removed.
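The membership relations among the three systems can be checked with a small Python sketch. The reference numerals come from the description above; the variable names are illustrative only.

```python
# Reference numerals of the units in each system, as listed in the description.
LEARNING_SYSTEM_11 = {21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31}
REGISTRATION_SYSTEM_12 = {31, 41, 42, 43, 44, 45, 46, 47}
RECOGNITION_SYSTEM_13 = {31, 41, 42, 43, 44, 45, 47}

# The viseme classifier 31 is the only unit shared by all three systems.
shared_by_all = LEARNING_SYSTEM_11 & REGISTRATION_SYSTEM_12 & RECOGNITION_SYSTEM_13

# Removing the time-series feature quantity learning unit 46 from the
# registration system yields the recognition system.
derived_recognition = REGISTRATION_SYSTEM_12 - {46}
```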

The image/sound separation unit 21 receives a moving image with sound obtained by video-recording a speaker saying arbitrary words (hereinafter referred to as a learning utterance moving image with sound) and separates it into a learning utterance moving image and a learning utterance voice. The separated learning utterance moving image is input to the face region detection unit 22, and the separated learning utterance voice is input to the phoneme label assignment unit 25.

The learning utterance moving image with sound may be prepared by recording video specifically for this learning, or existing content such as a television program may be used.

The face region detection unit 22 divides the learning utterance moving image into frames, detects a face region containing a human face in each frame as shown in FIG. 2A, and outputs the position information of the face region of each frame, together with the learning utterance moving image, to the lip region detection unit 23.

The lip region detection unit 23 detects, from the face region of each frame of the learning utterance moving image, a lip region that includes the end points of the mouth corners as shown in FIG. 2B, and outputs the position information of the lip region of each frame, together with the learning utterance moving image, to the lip image generation unit 24.

Any existing method can be used to detect the face region and the lip region, for example the methods disclosed in JP 2005-284348 A and JP 2009-49489 A.

The lip image generation unit 24 applies rotation correction to each frame of the learning utterance moving image as needed so that the line connecting the end points of the mouth corners becomes horizontal. The lip image generation unit 24 then extracts the lip region from each rotation-corrected frame and, as shown in FIG. 2C, resizes the extracted lip region to a predetermined image size (for example, 32 × 32 pixels) to generate a lip image. The lip image generated for each frame in this way is supplied to the viseme label addition unit 28.
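The rotation correction and resizing described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the method of the description: the function names, the choice of rotating about the midpoint of the mouth corners, and the nearest-neighbour resampling are all assumptions, and only the corner points (not the whole frame) are rotated here.

```python
import math


def mouth_roll_angle(left_corner, right_corner):
    """Angle in radians of the line through the two mouth corners, given as (x, y)."""
    dx = right_corner[0] - left_corner[0]
    dy = right_corner[1] - left_corner[1]
    return math.atan2(dy, dx)


def rotate_point(point, center, angle):
    """Rotate point about center by angle radians (standard 2-D rotation formula)."""
    x, y = point[0] - center[0], point[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + x * c - y * s, center[1] + x * s + y * c)


def level_mouth_corners(left_corner, right_corner):
    """Rotate both corners about their midpoint so the connecting line is horizontal."""
    center = ((left_corner[0] + right_corner[0]) / 2.0,
              (left_corner[1] + right_corner[1]) / 2.0)
    angle = mouth_roll_angle(left_corner, right_corner)
    return (rotate_point(left_corner, center, -angle),
            rotate_point(right_corner, center, -angle))


def resize_nearest(image, size=32):
    """Nearest-neighbour resize of a 2-D list of pixel values to size x size."""
    height, width = len(image), len(image[0])
    return [[image[r * height // size][c * width // size] for c in range(size)]
            for r in range(size)]


# Hypothetical mouth corners and a dummy 48 x 64 lip region.
left, right = level_mouth_corners((10.0, 20.0), (30.0, 40.0))
lip_image = resize_nearest([[(r + c) % 256 for c in range(64)] for r in range(48)])
```

In a real pipeline the same rotation would be applied to every pixel of the frame before cropping; an image library would normally handle that step.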

The phoneme label assignment unit 25 refers to the phoneme dictionary 26, assigns phoneme labels indicating the phonemes to the learning utterance voice, and outputs the result to the viseme label conversion unit 27. Phoneme labels can be assigned by, for example, the method known in the speech recognition field as automatic phoneme labeling.

The viseme label conversion unit 27 converts the phoneme labels assigned to the learning utterance voice into viseme labels, which indicate the shape of the lips during utterance, and outputs them to the viseme label addition unit 28. A conversion table prepared in advance is used for this conversion.

FIG. 3 shows an example of a conversion table for converting phoneme labels into viseme labels. With the conversion table in this figure, phoneme labels classified into 40 types are converted into viseme labels classified into 19 types. For example, the phoneme labels [a] and [a:] are converted into the viseme label [a], and the phoneme labels [by], [my], and [py] are converted into the viseme label [py]. The conversion table is not limited to the one shown in FIG. 3, and other conversion tables may be used.
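A conversion table of this kind is a many-to-one lookup. The Python sketch below contains only the five entries quoted in the text; the full 40-to-19 table of FIG. 3 is not reproduced, and the names `PHONEME_TO_VISEME` and `to_visemes` are illustrative.

```python
# Partial phoneme-to-viseme table: only the entries quoted in the description.
PHONEME_TO_VISEME = {
    "a": "a",
    "a:": "a",
    "by": "py",
    "my": "py",
    "py": "py",
}


def to_visemes(phoneme_labels):
    """Convert a sequence of phoneme labels to viseme labels via the table."""
    return [PHONEME_TO_VISEME[label] for label in phoneme_labels]


visemes = to_visemes(["a:", "my", "a"])
```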

The viseme label addition unit 28 takes the viseme labels assigned to the learning utterance voice, which are input from the viseme label conversion unit 27, and attaches them to the lip images corresponding to the frames of the learning utterance moving image, which are input from the lip image generation unit 24. It outputs the lip images with viseme labels attached to the learning sample holding unit 29.

The learning sample holding unit 29 holds a plurality of lip images to which viseme labels have been attached (hereinafter referred to as viseme-labeled lip images) as learning samples.

More specifically, as shown in FIG. 4, M learning samples (x_i, y_k) are held, in which the M lip images x_i (i = 1, 2, ..., M) each carry a class label y_k (k = 1, 2, ..., K) corresponding to a viseme label. In this case, the number K of class label types is 19.
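The pairing of lip images with class labels can be sketched as follows. This assumes that one viseme label is available per frame and that class indices run from 1 to K in the order of a given class list; both are assumptions made for illustration, and the function name is hypothetical.

```python
def build_learning_samples(lip_images, viseme_labels, class_names):
    """Pair each lip image x_i with the 1-based class index k of its viseme label."""
    if len(lip_images) != len(viseme_labels):
        raise ValueError("one viseme label is needed per lip image")
    class_index = {name: k for k, name in enumerate(class_names, start=1)}
    return [(image, class_index[label])
            for image, label in zip(lip_images, viseme_labels)]


# Placeholder strings stand in for lip images; only two of the K = 19 classes are used.
class_names = ["a", "py"]
samples = build_learning_samples(["img0", "img1", "img2"], ["a", "py", "a"], class_names)
```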

The viseme classifier learning unit 30 obtains image feature quantities from the viseme-labeled lip images held as learning samples in the learning sample holding unit 29, learns a plurality of weak classifiers by AdaBoostECOC, and generates the viseme classifier 31 composed of these weak classifiers.

As the image feature quantity of the lip image, for example, the PixDif Feature (pixel difference feature) proposed by the present inventors can be used.

The PixDif Feature (pixel difference feature) is disclosed in Sabe and Hidai, "Learning of a real-time arbitrary-posture face detector using pixel difference features," Proceedings of the 10th Image Sensing Symposium, pp. 547-552, 2004, and in JP 2005-157679 A, among others.

The pixel difference feature is obtained by calculating the difference $I_1 - I_2$ between the pixel values (luminance values) $I_1$ and $I_2$ of two pixels on an image (here, a lip image). The binary weak classifier $h(x)$ corresponding to each combination of two pixels outputs true (+1) or false (−1) from this pixel difference feature $I_1 - I_2$ and a threshold $Th$, as shown in equation (1).

$$h(x) = \begin{cases} -1 & \text{if } I_1 - I_2 \le Th \\ +1 & \text{if } I_1 - I_2 > Th \end{cases} \qquad (1)$$
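Equation (1) translates directly into code. In the Python sketch below, an image is a 2-D list of luminance values and each pixel is addressed as (row, column); these conventions and the function names are assumptions for illustration.

```python
def pixel_difference(image, p1, p2):
    """Pixel difference feature I1 - I2 for two (row, column) positions."""
    return image[p1[0]][p1[1]] - image[p2[0]][p2[1]]


def weak_classify(image, p1, p2, threshold):
    """Binary weak classifier of equation (1): +1 if I1 - I2 > Th, otherwise -1."""
    return 1 if pixel_difference(image, p1, p2) > threshold else -1


# Tiny 2 x 2 example image of luminance values.
image = [[10, 200],
         [50, 50]]
bright_minus_dark = weak_classify(image, (0, 1), (0, 0), threshold=100)
dark_minus_bright = weak_classify(image, (0, 0), (0, 1), threshold=100)
equal_pixels = weak_classify(image, (1, 0), (1, 1), threshold=0)
```

The case of equal pixels with a zero threshold falls on the "less than or equal" branch and therefore yields −1.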

For example, when the lip image is 32 × 32 pixels, 1024 × 1023 different pixel difference features are obtained. The combination of two pixels and the threshold Th are the parameters of each binary weak classifier, and the optimum ones among them are selected by boosting learning.
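The count 1024 × 1023 corresponds to the number of ordered pairs of distinct pixels, which can be worked out as follows (reading the pairs as ordered is an interpretation of the figure given in the text):

```latex
N = 32 \times 32 = 1024, \qquad
N(N-1) = 1024 \times 1023 = 1{,}047{,}552
```

Swapping the two pixels of a pair only negates the difference $I_1 - I_2$, so the number of unordered pairs is half of this, 523,776.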

During the utterance period notified by the utterance period detection unit 44, the viseme classifier 31 calculates a K-dimensional score vector for each lip image input from the lip image generation unit 43 and outputs it to the time-series feature quantity generation unit 45.

Here, the K-dimensional score vector is an index indicating which of the K types of visemes (here, K = 19) the input lip image corresponds to. It consists of K scores, each indicating the possibility that the lip image corresponds to one of the K visemes.

The face region detection unit 41, the lip region detection unit 42, and the lip image generation unit 43 of the registration system 12 and the recognition system 13 are the same as the face region detection unit 22, the lip region detection unit 23, and the lip image generation unit 24 of the learning system 11 described above.

The registration system 12 receives a plurality of items of registration data, each combining a predetermined utterance content (a registration utterance word) with a moving image obtained by video-recording a speaker uttering it (hereinafter referred to as a registration utterance moving image).

The recognition system 13 receives a moving image obtained by video-recording a speaker uttering the content to be recognized (hereinafter referred to as a recognition utterance moving image).

That is, during registration processing, the face region detection unit 41 divides the registration utterance moving image into frames, detects the face region in each frame, and outputs the position information of the face region of each frame, together with the registration utterance moving image, to the lip region detection unit 42.

The lip region detection unit 42 detects the lip region from the face region of each frame of the registration utterance moving image and outputs the position information of the lip region of each frame, together with the registration utterance moving image, to the lip image generation unit 43.

The lip image generation unit 43 applies rotation correction to each frame of the registration utterance moving image as needed, extracts the lip region from each frame, resizes it to generate a lip image, and outputs the lip image to the viseme classifier 31 and the utterance period detection unit 44.

During recognition processing, the face region detection unit 41 divides the recognition utterance moving image (a moving image in which the speaker's utterance content is unknown) into frames, detects the face region in each frame, and outputs the position information of the face region of each frame, together with the recognition utterance moving image, to the lip region detection unit 42.

The lip region detection unit 42 detects the lip region from the face region of each frame of the recognition utterance moving image and outputs the position information of the lip region of each frame, together with the recognition utterance moving image, to the lip image generation unit 43.

The lip image generation unit 43 applies rotation correction to each frame of the recognition utterance moving image as needed, extracts the lip region from each frame, resizes it to generate a lip image, and outputs the lip image to the viseme classifier 31 and the utterance period detection unit 44.

Based on the lip image of each frame of the registration utterance moving image or the recognition utterance moving image input from the lip image generation unit 43, the utterance period detection unit 44 identifies the period during which the speaker is speaking (hereinafter referred to as the utterance period) and notifies the viseme classifier 31 and the time-series feature quantity generation unit 45 whether or not the lip image of each frame belongs to the utterance period.

The time-series feature quantity generation unit 45 generates a time-series feature quantity by arranging in time series the K-dimensional score vectors input from the viseme classifier 31 during the utterance period notified by the utterance period detection unit 44.

FIG. 5 shows the time-series feature quantity corresponding to the utterance period when a speaker says "omoshiroi" ("interesting"). Assuming that this utterance period is 1 second and the frame rate is 60 frames per second, a time-series feature quantity consisting of 60 × K scores is generated. The generated time-series feature quantity is output to the time-series feature quantity learning unit 46 during registration processing and to the utterance recognizer 47 during recognition processing.
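The arrangement of score vectors into a time-series feature quantity, and the 60 × K count, can be sketched in Python. The per-frame boolean flags stand in for the notification from the utterance period detection unit 44, and the score vectors are dummy zeros; both are assumptions for illustration.

```python
K = 19            # number of viseme classes in the description
FRAME_RATE = 60   # frames per second assumed in the description's example


def build_time_series_feature(score_vectors, in_utterance):
    """Keep, in frame order, the K-dimensional score vectors of frames flagged as speech."""
    return [vec for vec, speaking in zip(score_vectors, in_utterance) if speaking]


# 90 frames of dummy score vectors; the first 60 frames (one second) are speech.
frames = [[0.0] * K for _ in range(90)]
flags = [i < FRAME_RATE for i in range(90)]
feature = build_time_series_feature(frames, flags)
total_scores = sum(len(vec) for vec in feature)
```

With K = 19, a one-second utterance at 60 frames per second therefore yields 60 × 19 = 1140 scores.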

During registration processing, the time-series feature quantity learning unit 46 models the time-series feature quantity input from the time-series feature quantity generation unit 45 using an HMM (Hidden Markov Model), in association with the input registration utterance word (the speaker's utterance content in the registration utterance moving image). The modeling method is not limited to an HMM; any method capable of modeling a time-series feature quantity may be used. The modeled time-series feature quantity is held in a learning database 48 built into the utterance recognizer 47.

During recognition processing, the utterance recognizer 47 identifies, among the models held in the learning database 48, the one most similar to the time-series feature quantity input from the time-series feature quantity generation unit 45. The utterance recognizer 47 then outputs the registration utterance word associated with the identified model as the utterance recognition result for the recognition utterance moving image.
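The selection of the most similar model amounts to an arg-max over per-model scores. The Python sketch below shows only that selection step. The scoring function is a deliberately simple stand-in (negative summed squared difference against a stored template) and is not an HMM likelihood; the keywords, templates, and function names are hypothetical.

```python
def template_score(feature, template):
    """Stand-in similarity: negative summed squared difference (NOT an HMM likelihood)."""
    return -sum((a - b) ** 2
                for frame, ref in zip(feature, template)
                for a, b in zip(frame, ref))


def recognize(feature, models, score=template_score):
    """Return the registered word whose model scores highest for the feature."""
    return max(models, key=lambda word: score(feature, models[word]))


# Two hypothetical registered keywords with toy two-frame, two-class templates.
models = {
    "keyword_a": [[1.0, 0.0], [0.0, 1.0]],
    "keyword_b": [[0.0, 1.0], [1.0, 0.0]],
}
result = recognize([[0.9, 0.1], [0.2, 0.8]], models)
```

With HMM-based models, `score` would instead return the likelihood of the feature sequence under each word's model, and the arg-max step would stay the same.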

［動作説明］ [Description of operation]
図６は、発話認識装置１０の動作を説明するフローチャートである。 FIG. 6 is a flowchart explaining the operation of the utterance recognition device 10.

ステップＳ１において、発話認識装置１０の学習系１１は、学習処理を実行することによって口形素判別器３１を生成する。 In step S1, the learning system 11 of the utterance recognition device 10 generates a viseme discriminator 31 by executing a learning process.

ステップＳ２において、発話認識装置１０の登録系１２は、登録処理を実行することによって、登録用発話動画像に対応する時系列特徴量を生成し、HMMを用いてモデル化し、これに登録用発話単語を対応付けて学習データベース４８に登録する。 In step S2, the registration system 12 of the utterance recognition device 10 generates a time-series feature amount corresponding to the utterance moving image for registration by executing a registration process, models it using the HMM, and registers the utterance for registration. The words are associated and registered in the learning database 48.

ステップＳ３において、発話認識装置１０の認識系１３は、認識処理を実行することによって、認識用発話動画像における話者の発話内容を認識する。 In step S 3, the recognition system 13 of the utterance recognition device 10 recognizes the utterance content of the speaker in the recognition utterance moving image by executing recognition processing.

以下、上述したステップＳ１乃至Ｓ３の処理の詳細について説明する。 Hereinafter, the details of the processing in steps S1 to S3 described above will be described.

［学習処理の詳細］ [Details of learning process]
図７は、ステップＳ１の学習処理を詳細に説明するフローチャートである。 FIG. 7 is a flowchart explaining the learning process of step S1 in detail.

ステップＳ１１において、学習用音声付発話動画像が画音分離部２１に入力される。画音分離部２１は、学習用音声付発話動画像を学習用発話動画像と学習用発話音声とに分離し、学習用発話動画像を顔領域検出部２２に、学習用発話音声を音素ラベル付与部２５に出力する。 In step S11, a learning utterance moving image with audio is input to the image sound separation unit 21. The image sound separation unit 21 separates it into a learning utterance moving image and learning utterance speech, and outputs the learning utterance moving image to the face area detection unit 22 and the learning utterance speech to the phoneme label assigning unit 25.

ステップＳ１２において、学習用発話動画像の処理が行われる。また、ステップＳ１３において、学習用発話音声の処理が行われる。なお、ステップＳ１２とステップＳ１３とは、実際には並行して同時に実行される。そして、学習用発話動画像の処理の出力（唇画像）と、それに対応する学習用発話音声の処理の出力（口形素ラベル付き学習用発話音声）が口形素ラベル付加部２８に同時に供給されることになる。 In step S12, the learning utterance moving image is processed. In step S13, the learning utterance speech is processed. Steps S12 and S13 are actually executed in parallel. The output of the learning utterance moving image processing (lip images) and the corresponding output of the learning utterance speech processing (learning utterance speech with viseme labels) are therefore supplied to the viseme label adding unit 28 at the same time.

図８は、ステップＳ１２における学習用発話動画像の処理を詳細に説明するフローチャートである。 FIG. 8 is a flowchart explaining in detail the processing of the learning utterance moving image in step S12.

ステップＳ２１において、顔領域検出部２２は、学習用発話動画像を各フレームに分割し、１フレームずつ処理対象とする。ステップＳ２２において、顔領域検出部２２は、処理対象のフレームから顔領域を検出し、ステップＳ２３において、顔領域を検出できたか否か判定する。顔領域を検出できたと判定された場合、処理はステップＳ２４に進められる。反対に、顔領域を検出できなかったと判定された場合、処理はステップＳ２６に進められる。 In step S 21, the face area detection unit 22 divides the learning utterance moving image into frames, and sets each frame as a processing target. In step S22, the face area detection unit 22 detects a face area from the frame to be processed, and determines in step S23 whether the face area has been detected. If it is determined that the face area has been detected, the process proceeds to step S24. On the other hand, if it is determined that the face area cannot be detected, the process proceeds to step S26.

ステップＳ２４において、顔領域検出部２２は、処理対象としている１フレーム分の学習用発話動画像とともに顔領域の位置情報を唇領域検出部２３に出力する。唇領域検出部２３は、処理対象のフレームの顔領域から唇領域を検出し、ステップＳ２５において、唇領域を検出できたか否か判定する。唇領域を検出できたと判定された場合、処理はステップＳ２７に進められる。反対に、唇領域を検出できなかったと判定された場合、処理はステップＳ２６に進められる。 In step S 24, the face area detection unit 22 outputs position information of the face area to the lip area detection unit 23 together with the learning utterance moving image for one frame to be processed. The lip area detection unit 23 detects a lip area from the face area of the processing target frame, and determines whether the lip area has been detected in step S25. If it is determined that the lip region has been detected, the process proceeds to step S27. On the other hand, if it is determined that the lip region has not been detected, the process proceeds to step S26.

なお、ステップＳ２３またはステップＳ２５から、処理がステップＳ２６に進められた場合、処理対象としているフレームの１フレーム前の顔領域または唇領域の少なくとも一方の位置情報が流用される。 When the process proceeds from step S23 or step S25 to step S26, the position information of at least one of the face area or the lip area one frame before the frame to be processed is used.
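The fallback of step S26 can be sketched as carrying the last known position forward. The names below are hypothetical, and the behavior for a failure on the very first frame (no earlier position exists) is not specified in the text; this sketch simply leaves it as `None`.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height), an assumed representation

def fill_missing(detections: List[Optional[Box]]) -> List[Optional[Box]]:
    """Replace failed detections (None) with the position used for the previous frame."""
    filled: List[Optional[Box]] = []
    previous: Optional[Box] = None
    for box in detections:
        if box is None:
            box = previous  # step S26: reuse the previous frame's position
        filled.append(box)
        previous = box
    return filled

# Detection fails on frames 0, 2 and 3.
boxes = fill_missing([None, (1, 2, 3, 4), None, None, (5, 6, 7, 8)])
```

Because `previous` is updated with the filled value, consecutive failures keep reusing the last successfully detected position.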

ステップＳ２７において、唇領域検出部２３は、処理対象としている１フレーム分の学習用発話動画像とともに唇領域の位置情報を唇画像生成部２４に出力する。唇画像生成部２４は、処理対象としている学習用発話動画像の１フレームを、唇の口角の端点を結ぶ線が水平になるように、適宜、回転補正を行う。さらに、唇画像生成部２４は、回転補正後の各フレームから唇領域を抽出し、抽出した唇領域を予め定められた画像サイズにリサイズすることにより唇画像を生成して口形素ラベル付加部２８に出力する。 In step S27, the lip region detection unit 23 outputs the position information of the lip region to the lip image generation unit 24 together with the one frame of the learning utterance moving image being processed. The lip image generation unit 24 applies rotation correction, as needed, to that frame so that the line connecting the end points of the mouth corners becomes horizontal. The lip image generation unit 24 then extracts the lip region from each rotation-corrected frame, resizes the extracted lip region to a predetermined image size to generate a lip image, and outputs it to the viseme label adding unit 28.
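The rotation correction can be illustrated with plain geometry: measure the tilt of the line through the two mouth corners, then rotate by the opposite angle about their midpoint. The function names are hypothetical, and only the corner points are rotated here; an actual implementation would apply the same rotation to the whole frame before cropping and resizing.

```python
import math
from typing import Tuple

Point = Tuple[float, float]

def mouth_tilt_degrees(left_corner: Point, right_corner: Point) -> float:
    """Angle of the line joining the mouth corners, measured from the horizontal."""
    dx = right_corner[0] - left_corner[0]
    dy = right_corner[1] - left_corner[1]
    return math.degrees(math.atan2(dy, dx))

def rotate_about(point: Point, center: Point, degrees: float) -> Point:
    """Rotate a point about a center by the given angle."""
    rad = math.radians(degrees)
    x, y = point[0] - center[0], point[1] - center[1]
    return (center[0] + x * math.cos(rad) - y * math.sin(rad),
            center[1] + x * math.sin(rad) + y * math.cos(rad))

left, right = (10.0, 20.0), (30.0, 40.0)
tilt = mouth_tilt_degrees(left, right)
center = ((left[0] + right[0]) / 2, (left[1] + right[1]) / 2)
# Rotating by -tilt brings both corners to the same height.
left_fixed = rotate_about(left, center, -tilt)
right_fixed = rotate_about(right, center, -tilt)
```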

この後、ステップＳ２１に戻り、学習用発話動画像信号の入力が終わるまで、ステップＳ２１乃至Ｓ２７の処理が繰り返される。 Thereafter, the process returns to step S21, and the processes of steps S21 to S27 are repeated until the input of the learning utterance moving image signal is completed.

次に、図９は、ステップＳ１３における学習用発話音声の処理を詳細に説明するフローチャートである。 Next, FIG. 9 is a flowchart for explaining in detail the processing of the learning speech voice in step S13.

ステップＳ３１において、音素ラベル付与部２５は、音素辞書２６を参照することにより、学習用発話音声に対してその音素を示す音素ラベルを付与して口形素ラベル変換部２７に出力する。 In step S31, the phoneme label assigning unit 25 refers to the phoneme dictionary 26, assigns to the learning utterance speech a phoneme label indicating its phoneme, and outputs the labeled speech to the viseme label conversion unit 27.

ステップＳ３２において、口形素ラベル変換部２７は、予め保持する変換テーブルを用い、学習用発話音声に付与されている音素ラベルを、発話時の唇の形を示す口形素ラベルに変換して口形素ラベル付加部２８に出力する。 In step S32, the viseme label conversion unit 27 uses a conversion table held in advance to convert the phoneme labels assigned to the learning utterance speech into viseme labels indicating the shape of the lips during utterance, and outputs them to the viseme label adding unit 28.
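The conversion of step S32 amounts to a table lookup. The sketch below assumes the phoneme labels arrive as (start time, end time, phoneme) segments; the table contents and viseme names are hypothetical, since the actual conversion table held by the viseme label conversion unit 27 is not given in the text.

```python
from typing import Dict, List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, label), an assumed format

# Hypothetical phoneme-to-viseme table; several phonemes share one lip shape.
PHONEME_TO_VISEME: Dict[str, str] = {
    "a": "open", "i": "spread", "u": "rounded",
    "p": "closed", "b": "closed", "m": "closed",
}

def to_viseme_labels(phoneme_segments: List[Segment]) -> List[Segment]:
    """Replace each phoneme label with the viseme label from the conversion table."""
    return [(start, end, PHONEME_TO_VISEME[phoneme])
            for start, end, phoneme in phoneme_segments]

visemes = to_viseme_labels([(0.0, 0.1, "m"), (0.1, 0.3, "a"), (0.3, 0.4, "p")])
```

The mapping is many-to-one, which is why the number of visemes (K) is smaller than the number of phonemes.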

この後、ステップＳ３１に戻り、学習用発話音声の入力が終わるまで、ステップＳ３１およびＳ３２の処理が繰り返される。 Thereafter, the process returns to step S31, and the processes of steps S31 and S32 are repeated until the input of the learning speech voice is completed.

図７に戻る。ステップＳ１４において、口形素ラベル付加部２８は、唇画像生成部２４から入力された学習用発話動画像の各フレームに対応する唇画像に対し、口形素ラベル変換部２７から入力された学習用発話音声に付与された口形素ラベルを流用して付加し、口形素ラベルが付加された唇画像を学習サンプル保持部２９に出力する。学習サンプル保持部２９は、口形素ラベル付唇画像を学習サンプルとして保持する。学習サンプル保持部２９に所定の数Ｍの学習サンプルが保持された後、ステップＳ１５以降の処理が行われる。 Returning to FIG. 7. In step S14, the viseme label adding unit 28 takes the viseme labels assigned to the learning utterance speech, input from the viseme label conversion unit 27, and attaches them to the lip images corresponding to the respective frames of the learning utterance moving image, input from the lip image generation unit 24. It then outputs the viseme-labeled lip images to the learning sample holding unit 29. The learning sample holding unit 29 holds the viseme-labeled lip images as learning samples. After a predetermined number M of learning samples have been stored in the learning sample holding unit 29, the processing from step S15 onward is performed.

ステップＳ１５において、口形素判別器学習部３０は、学習サンプル保持部２９に保持されている複数の学習サンプルとしての唇画像の画像特徴量を求め、AdaBoostECOCにより複数の弱判別器を学習し、これら複数の弱判別器からなる口形素判別器３１を生成する。 In step S15, the viseme discriminator learning unit 30 obtains image feature amounts of lip images as a plurality of learning samples held in the learning sample holding unit 29, learns a plurality of weak discriminators using AdaBoostECOC, and A viseme classifier 31 including a plurality of weak classifiers is generated.

図１０は、ステップＳ１５の処理(AdaBoostECOC学習処理)を詳細に説明するフローチャートである。 FIG. 10 is a flowchart for explaining in detail the process of step S15 (AdaBoostECOC learning process).

ステップＳ４１において、口形素判別器学習部３０は、図４に示されたように、Ｍ個の学習サンプル（ｘ_i，ｙ_k）を学習サンプル保持部２９から取得する。 In step S41, the viseme discriminator learning unit 30 acquires M learning samples (x_i, y_k) from the learning sample holding unit 29, as illustrated in FIG. 4.

ステップＳ４２において、口形素判別器学習部３０は、次式（２）に従い、Ｍ行Ｋ列で表されるサンプル重みＰ_t（ｉ，ｋ）を初期化する。具体的には、サンプル重みＰ_t（ｉ，ｋ）の初期値Ｐ₁（ｉ，ｋ）を、実在する学習サンプル（ｘ_i，ｙ_k）に対応するものは０に、それ以外はそれらの総和が１となるような一様な値に設定する。 In step S42, the viseme discriminator learning unit 30 initializes the sample weights P_t(i, k), represented as an M-row, K-column matrix, according to the following equation (2). Specifically, the initial value P_1(i, k) is set to 0 for entries that correspond to an actually existing learning sample (x_i, y_k), and to a uniform value for all other entries such that their sum is 1.
$$P_1(i,k) = \frac{1}{M(K-1)} \quad \text{for } y_k \neq k \tag{2}$$
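The initialization of equation (2) can be checked numerically. The sketch below assumes that class labels are integers 0 to K−1 and that `labels[i]` is the true class of sample i; the function name is hypothetical.

```python
from typing import List

def init_sample_weights(labels: List[int], K: int) -> List[List[float]]:
    """M x K weights: 0 at each sample's true class, 1 / (M (K - 1)) elsewhere."""
    M = len(labels)
    uniform = 1.0 / (M * (K - 1))
    return [[0.0 if k == y else uniform for k in range(K)] for y in labels]

# Three samples (M = 3) with true classes 0, 2 and 1, and K = 3 classes.
P1 = init_sample_weights([0, 2, 1], K=3)
total = sum(sum(row) for row in P1)
```

Each row has K−1 nonzero entries, so the M(K−1) nonzero weights sum to 1, as the text requires.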

以下に説明するステップＳ４３乃至ステップＳ４８の処理は任意の数Ｔだけ繰り返される。なお、任意の繰り返し回数Ｔは、最大で唇画像上で得られるピクセル差分特徴の数とすることができ、この繰り返し回数Ｔと同じ数だけ弱判別器が生成される。 The processes of steps S43 to S48 described below are repeated an arbitrary number of times T. The number of repetitions T can be at most the number of pixel difference features obtainable on the lip image, and as many weak discriminators are generated as there are repetitions.

ステップＳ４３において、口形素判別器学習部３０は、１行Ｋ列のECOCテーブルを生成する。なお、ECOCテーブルのｋ列の値μ_t（ｋ）は−１または＋１であり、−１と＋１の数が同数となるようにランダムに割り振られる。 In step S43, the viseme discriminator learning unit 30 generates an ECOC table with 1 row and K columns. The value μ_t(k) in column k of the ECOC table is −1 or +1, and the values are assigned at random so that the numbers of −1 and +1 entries are equal.
$$\mu_t(k) \in \{-1, +1\} \tag{3}$$
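A balanced random ECOC row can be generated by shuffling a fixed multiset of −1 and +1 values. The text does not say what happens when K is odd (for example K = 19), where exactly equal counts are impossible; giving the extra entry to +1 is an assumption of this sketch, as is the function name.

```python
import random
from typing import List

def make_ecoc_row(K: int, rng: random.Random) -> List[int]:
    """Random 1 x K row of -1/+1 values with (nearly) equal counts."""
    half = K // 2
    row = [-1] * half + [+1] * (K - half)  # for odd K the extra entry is +1 (assumed)
    rng.shuffle(row)
    return row

rng = random.Random(0)
row_even = make_ecoc_row(4, rng)
row_odd = make_ecoc_row(19, rng)
```

A new row is drawn in every boosting round t, so each weak discriminator solves a different two-group split of the K classes.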

ステップＳ４４において、口形素判別器学習部３０は、次式（４）に従い、Ｍ行１列で表される２値判別用重みＤ_t（ｉ）を計算する。なお、式（４）において、[]内は論理式であり、真であれば１、偽であれば０とする。 In step S44, the viseme discriminator learning unit 30 calculates the binary discrimination weights D_t(i), represented as an M-row, 1-column vector, according to the following equation (4). In equation (4), the expression inside [] is a logical expression that evaluates to 1 if true and 0 if false.

... (4)
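Equation (4) itself is not reproduced in this text. As a hedged illustration only, the sketch below uses the form common in the AdaBoost output-code literature, D_t(i) ∝ Σ_k P_t(i, k)·[μ_t(y_i) ≠ μ_t(k)], which matches the description of a bracketed logical expression; whether the patent's equation has exactly this form, and whether it normalizes the result, cannot be confirmed from the text. The function name and the `normalize` flag are hypothetical.

```python
from typing import List

def binary_weights(P: List[List[float]], labels: List[int],
                   ecoc_row: List[int], normalize: bool = True) -> List[float]:
    """Sum each sample's weights over classes coded differently from its true class."""
    D = []
    for i, y in enumerate(labels):
        D.append(sum(P[i][k] for k in range(len(ecoc_row))
                     if ecoc_row[k] != ecoc_row[y]))
    if normalize:
        total = sum(D)
        if total > 0:
            D = [d / total for d in D]
    return D

P = [[0.0, 0.25, 0.25],
     [0.25, 0.0, 0.25]]
raw = binary_weights(P, labels=[0, 1], ecoc_row=[+1, -1, +1], normalize=False)
normalized = binary_weights(P, labels=[0, 1], ecoc_row=[+1, -1, +1])
```

Sample 1 receives more weight because two of the other classes fall on the opposite side of the split from its true class.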

ステップＳ４５において、口形素判別器学習部３０は、ステップＳ４４で得られた２値判別用重みＤ_t（ｉ）の下、次式（５）に示す重み付き誤り率ε_tを最小とする２値判別弱判別器ｈ_tを学習する。 In step S45, the viseme discriminator learning unit 30 learns, under the binary discrimination weights D_t(i) obtained in step S44, the binary discriminant weak discriminator h_t that minimizes the weighted error rate ε_t given by the following equation (5).

... (5)

図１１は、ステップＳ４５の処理を詳細に説明するフローチャートである。 FIG. 11 is a flowchart for explaining the process of step S45 in detail.

ステップＳ６１において、口形素判別器学習部３０は、唇画像の全画素からランダムに２画素を選択する。例えば、唇画像を３２×３２画素とした場合、２画素の選択は、１０２４×１０２３通りのうちの１つを選ぶことになる。ここで、選択した２画素の画素位置をＳ₁，Ｓ₂とし、その画素値（輝度値）をＩ₁，Ｉ₂とする。 In step S61, the viseme discriminator learning unit 30 randomly selects two pixels from all the pixels of the lip image. For example, when the lip image is 32 × 32 pixels, selection of 2 pixels selects one of 1024 × 1023 patterns. Here, the pixel positions of the selected two pixels are S ₁ and S ₂ , and the pixel values (luminance values) are I ₁ and I ₂ .

ステップＳ６２において、口形素判別器学習部３０は、全ての学習サンプルについて、ステップＳ６１で選択した２画素の画素値Ｉ₁，Ｉ₂を用いたピクセル差分特徴（Ｉ₁−Ｉ₂）を算出し、その頻度分布を求める。 In step S62, the viseme discriminator learning unit 30 calculates pixel difference features (I ₁ -I ₂ ) using the pixel values I ₁ and I ₂ of the two pixels selected in step S61 for all learning samples. The frequency distribution is obtained.
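Steps S61 and S62 can be sketched as follows, assuming each lip image is a list of rows of integer luminance values and pixel positions are (row, column) pairs. The helper names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Image = List[List[int]]
Pos = Tuple[int, int]  # (row, column), an assumed convention

def pixel_difference(image: Image, s1: Pos, s2: Pos) -> int:
    """Pixel difference feature I1 - I2 for the two selected positions."""
    return image[s1[0]][s1[1]] - image[s2[0]][s2[1]]

def difference_histogram(images: List[Image], s1: Pos, s2: Pos) -> Counter:
    """Frequency distribution of the feature over all learning samples."""
    return Counter(pixel_difference(img, s1, s2) for img in images)

# Number of ordered pixel pairs in a 32 x 32 lip image, as stated in step S61.
n_pixels = 32 * 32
n_pairs = n_pixels * (n_pixels - 1)

samples = [[[10, 4], [0, 0]], [[7, 1], [0, 0]], [[3, 9], [0, 0]]]
hist = difference_histogram(samples, (0, 0), (0, 1))
```

The feature needs only two memory reads and one subtraction per weak discriminator, which is what keeps the later recognition step cheap.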

ステップＳ６３において、口形素判別器学習部３０は、ピクセル差分特徴の頻度分布に基づき、式（５）に示された重み付き誤り率ε_tを最小ε_minにする閾値Ｔｈ_minを求める。 In step S63, the viseme discriminator learning unit 30 obtains a threshold value Th _min that makes the weighted error rate ε _t shown in Equation (5) the minimum ε _min based on the frequency distribution of the pixel difference features.

ステップＳ６４において、口形素判別器学習部３０は、ピクセル差分特徴の頻度分布に基づき、式（５）に示された重み付き誤り率ε_tを最大ε_maxにする閾値Ｔｈ_maxを求める。さらに、口形素判別器学習部３０は、次式（６）に従い、閾値Ｔｈ_maxなどを反転する。 In step S64, the viseme discriminator learning unit 30 obtains, based on the frequency distribution of the pixel difference feature, the threshold Th_max at which the weighted error rate ε_t given by equation (5) takes its maximum value ε_max. The viseme discriminator learning unit 30 then inverts the threshold Th_max and the related quantities according to the following equation (6).
$$\begin{aligned}
\varepsilon'_{max} &= 1 - \varepsilon_{max} \\
S'_1 &= S_2 \\
S'_2 &= S_1 \\
Th'_{max} &= -Th_{max}
\end{aligned} \tag{6}$$

ステップＳ６５において、口形素判別器学習部３０は、上述した重み付き誤り率ε_tの最小ε_minと最大ε_maxの大小関係に基づいて、２値判別弱判別器のパラメータである２画素の位置Ｓ₁，Ｓ₂と閾値Ｔｈを決定する。 In step S65, the viseme discriminator learning unit 30 determines the position of two pixels that are parameters of the binary discriminant weak discriminator based on the magnitude relationship between the minimum ε _min and the maximum ε _max of the weighted error rate ε _t described above. S ₁ and S ₂ and a threshold value Th are determined.

すなわち、ε_min＜ε’_maxの場合、２画素の位置Ｓ₁，Ｓ₂と閾値Ｔｈ_minをパラメータとして採用する。また、ε_min≧ε’_maxの場合、２画素の位置Ｓ’₁，Ｓ’₂と閾値Ｔｈ’_maxをパラメータとして採用する。 That is, when ε_min < ε'_max, the two pixel positions S₁, S₂ and the threshold Th_min are adopted as the parameters. When ε_min ≥ ε'_max, the two pixel positions S'₁, S'₂ and the threshold Th'_max are adopted as the parameters.
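Steps S63 to S65 can be sketched as a threshold search with an optional inversion. Several details are assumptions: the weak discriminator is taken to output +1 when I₁ − I₂ exceeds the threshold (the text does not state the direction), the weights are assumed to sum to 1, and candidate thresholds are placed between observed feature values so that no sample lies exactly on a threshold.

```python
from typing import Dict, List

def weighted_error(diffs: List[float], targets: List[int],
                   weights: List[float], th: float) -> float:
    """Sum of weights of samples misclassified by h(x) = +1 if diff > th else -1."""
    return sum(w for d, y, w in zip(diffs, targets, weights)
               if (1 if d > th else -1) != y)

def fit_stump(diffs: List[float], targets: List[int],
              weights: List[float]) -> Dict[str, object]:
    values = sorted(set(diffs))
    candidates = ([values[0] - 0.5]
                  + [(a + b) / 2 for a, b in zip(values, values[1:])]
                  + [values[-1] + 0.5])
    errors = [(weighted_error(diffs, targets, weights, th), th) for th in candidates]
    eps_min, th_min = min(errors)  # step S63
    eps_max, th_max = max(errors)  # step S64
    eps_inv = 1.0 - eps_max        # equation (6): error after inversion
    if eps_min < eps_inv:          # step S65
        return {"swap": False, "threshold": th_min, "error": eps_min}
    # Swapping the two pixels negates the feature, so the threshold is negated too.
    return {"swap": True, "threshold": -th_max, "error": eps_inv}

diffs = [-3, -1, 2, 4]
weights = [0.25, 0.25, 0.25, 0.25]
inverted = fit_stump(diffs, [+1, +1, -1, -1], weights)
direct = fit_stump(diffs, [-1, -1, +1, +1], weights)
```

The inversion lets a single search cover both polarities: a threshold that is maximally wrong becomes maximally right once the pixel pair is swapped.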

ステップＳ６６において、口形素判別器学習部３０は、上述したステップＳ６１乃至Ｓ６５の処理を所定の回数繰り返したか否かを判定し、所定の回数繰り返したと判定するまでステップＳ６１に戻り、それ以降を繰り返す。そして、ステップＳ６１乃至Ｓ６５の処理を所定の回数繰り返したと判定した場合、処理をステップＳ６７に進める。 In step S66, the viseme discriminator learning unit 30 determines whether or not the processing in steps S61 to S65 described above has been repeated a predetermined number of times, returns to step S61 until it is determined that the predetermined number of times has been repeated, and the subsequent steps are repeated. . If it is determined that the processes in steps S61 to S65 have been repeated a predetermined number of times, the process proceeds to step S67.

ステップＳ６７において、口形素判別器学習部３０は、上述したように所定の回数繰り返されるステップＳ６５の処理において決定された２値判別弱判別器（のパラメータ）のうち、重み付き誤り率ε_tが最小となるものを１つの２値判別弱判別器ｈ_t（のパラメータ）として最終的に採用する。 In step S67, the viseme discriminator learning unit 30 has a weighted error rate ε _t among the binary discriminant weak discriminators (parameters) determined in the process of step S65 repeated a predetermined number of times as described above. The smallest one is finally adopted as one binary discriminant weak discriminator h _t (parameter thereof).

以上説明したように、１つの２値判別弱判別器ｈ_tが決定された後、処理は図１０のステップＳ４６にリターンする。 As described above, once one binary discriminant weak discriminator h_t has been determined, the process returns to step S46 in FIG. 10.

ステップＳ４６において、口形素判別器学習部３０は、ステップＳ４５の処理で決定した２値判別弱判別器ｈ_tに対応する重み付き誤り率ε_tに基づき、次式（７）に従い信頼度α_tを計算する。 In step S46, the viseme discriminator learning unit 30 calculates the reliability α_t according to the following equation (7), based on the weighted error rate ε_t corresponding to the binary discriminant weak discriminator h_t determined in step S45.

... (7)
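Equation (7) is not reproduced in this text. For orientation only, the sketch below uses the reliability commonly used in AdaBoost, α = ½·ln((1 − ε)/ε); this is an assumption about the form of equation (7), not a reconstruction confirmed by the source, and the function name is hypothetical.

```python
import math

def reliability(eps: float) -> float:
    """Standard AdaBoost confidence for a weak classifier with weighted error eps."""
    if not 0.0 < eps < 1.0:
        raise ValueError("the weighted error rate must lie strictly between 0 and 1")
    return 0.5 * math.log((1.0 - eps) / eps)

alpha_chance = reliability(0.5)  # a coin-flip classifier gets zero reliability
alpha_good = reliability(0.1)
alpha_bad = reliability(0.9)
```

Under this definition a lower weighted error rate yields a larger reliability, so accurate weak discriminators contribute more to the final score.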

ステップＳ４７において、口形素判別器学習部３０は、次式（８）に示すように、ステップＳ４５の処理で決定した２値判別弱判別器ｈ_tと、ステップＳ４６の処理で計算した信頼度α_tを乗算することにより、信頼度付き２値判別弱判別器ｆ_t（ｘ_i）を求める。 In step S47, as shown in the following equation (8), the viseme discriminator learning unit 30 obtains the binary discriminant weak discriminator with reliability, f_t(x_i), by multiplying the binary discriminant weak discriminator h_t determined in step S45 by the reliability α_t calculated in step S46.
$$f_t(x_i) = \alpha_t h_t \tag{8}$$

ステップＳ４８において、口形素判別器学習部３０は、次式（９）に従い、Ｍ行Ｋ列で表されるサンプル重みＰ_t（ｉ，ｋ）を更新する。 In step S48, the viseme discriminator learning unit 30 updates the sample weights P_t(i, k), represented as an M-row, K-column matrix, according to the following equation (9).

... (9)

ただし、式（９）のＺ_iは次式（１０）に示すとおりである。 Here, Z_i in equation (9) is as given by the following equation (10).

... (10)

ステップＳ４９において、口形素判別器学習部３０は、上述したステップＳ４３乃至Ｓ４８の処理を所定の回数Ｔだけ繰り返したか否かを判定し、所定の回数Ｔだけ繰り返したと判定するまでステップＳ４３に戻り、それ以降を繰り返す。そして、ステップＳ４３乃至Ｓ４８の処理を所定の回数Ｔだけ繰り返したと判定した場合、処理をステップＳ５０に進める。 In step S49, the viseme discriminator learning unit 30 determines whether or not the processes in steps S43 to S48 described above have been repeated a predetermined number of times T, and returns to step S43 until it is determined that the predetermined number of times T has been repeated. Repeat after that. If it is determined that the processes in steps S43 to S48 have been repeated a predetermined number of times T, the process proceeds to step S50.

ステップＳ５０において、口形素判別器学習部３０は、所定の数Ｔと同じ数だけ得られた信頼度付き２値判別弱判別器ｆ_t（ｘ）、およびそれぞれに対応するECOCテーブルに基づき、次式（１１）に従って最終判別器Ｈ_k（ｘ）、すなわち口形素判別器３１を得る。 In step S50, the viseme discriminator learning unit 30 obtains the final discriminator H_k(x), that is, the viseme discriminator 31, according to the following equation (11), based on the T binary discriminant weak discriminators with reliability f_t(x) and the ECOC table corresponding to each of them.

... (11)

なお、得られた口形素判別器３１はパラメータとして、クラスの数（口形素の数）Ｋ、および弱判別器の数Ｔを有する。また、各弱判別器はパラメータとして、唇画像上の２画素の位置Ｓ₁，Ｓ₂、ピクセル差分特徴の判別用の閾値Ｔｈ、信頼度α、およびECOCテーブルμを有する。 The obtained viseme discriminator 31 has the number of classes (number of visemes) K and the number T of weak discriminators as parameters. Each weak discriminator has, as parameters, positions S ₁ and S ₂ of two pixels on the lip image, a threshold Th for discriminating pixel difference features, a reliability α, and an ECOC table μ.

以上説明したように最終判別器Ｈ_k（ｘ）、すなわち口形素判別器３１を得て、当該AdaBoostECOC学習処理は終了される。 As described above, the final discriminator H _k (x), that is, the viseme discriminator 31, is obtained, and the AdaBoostECOC learning process is ended.

以上のように生成された口形素判別器３１によれば、入力される唇画像の画像特徴量をＫ次元スコアベクトルで表現できる。すなわち、登録用発話動画像の各フレームから生成される唇画像がＫ（いまの場合、１９）種類の口形素のそれぞれに対してどの程度似ているかを数値化して表すことができる。また、認識用発話動画像の各フレームから生成される唇画像に対しても同様に、Ｋ種類の口形素のそれぞれに対してどの程度似ているかを数値化して表すことができる。 According to the viseme discriminator 31 generated as described above, the image feature amount of the input lip image can be expressed by a K-dimensional score vector. That is, it is possible to numerically express how much the lip image generated from each frame of the registration utterance moving image is similar to each of K (19 in this case) types of visemes. Similarly, a lip image generated from each frame of the recognition utterance moving image can be expressed numerically as to how much it resembles each of the K types of visemes.

［登録処理の詳細］ [Details of registration process]
図１２は、ステップＳ２の登録処理を詳細に説明するフローチャートである。 FIG. 12 is a flowchart explaining the registration process of step S2 in detail.

ステップＳ７１において、登録系１２は、図７を参照して上述した学習系１１による学習用発話動画像の処理と同様の処理を実行することにより、登録用発話動画像の各フレームに対応する唇画像を生成する。生成された唇画像は、口形素判定器３１および発話期間検出部４４に入力される。 In step S 71, the registration system 12 executes the same processing as the learning speech moving image processing by the learning system 11 described above with reference to FIG. 7, so that the lips corresponding to each frame of the registration speech moving image. Generate an image. The generated lip image is input to the viseme determination unit 31 and the utterance period detection unit 44.

ステップＳ７２において、発話期間検出部４４は、登録用発話動画像の各フレームの唇画像に基づき発話期間を特定し、各フレームの唇画像が発話期間に対応するものであるか否かを口形素判別器３１および時系列特徴量生成部４５に通知する。口形素判定器３１は、順次入力される唇画像のうち、発話期間に対応するものについて対応するＫ次元スコアベクトルを演算する。 In step S72, the utterance period detection unit 44 specifies the utterance period based on the lip image of each frame of the registration utterance moving image, and determines whether the lip image of each frame corresponds to the utterance period. Notify the discriminator 31 and the time-series feature value generation unit 45. The viseme determination unit 31 calculates a K-dimensional score vector corresponding to the lip image sequentially input corresponding to the speech period.

図１３は、口形素判定器３１によるＫ次元スコアベクトル演算処理を詳細に説明するフローチャートである。 FIG. 13 is a flowchart for explaining in detail the K-dimensional score vector calculation process by the viseme determiner 31.

ステップＳ８１において、口形素判定器３１は、クラスを示すパラメータｋ（ｋ＝１，２，・・・，Ｋ）を１に初期化する。ステップＳ８２において、口形素判定器３１は、各クラスのスコアＨ_kを０に初期化する。 In step S81, the viseme determiner 31 initializes the parameter k indicating the class (k = 1, 2, ..., K) to 1. In step S82, the viseme determiner 31 initializes the score H_k of each class to 0.

ステップＳ８３において、口形素判定器３１は、弱判別器を特定するためのパラメータｔ（ｔ＝１，２，・・・，Ｔ）を１に初期化する。 In step S83, the viseme determiner 31 initializes a parameter t (t = 1, 2,..., T) for specifying a weak classifier to 1.

ステップＳ８４において、口形素判定器３１は、２値判別弱判別器ｈ_tのパラメータ、すなわち、唇画像ｘ上の２画素の位置Ｓ₁，Ｓ₂、ピクセル差分特徴の判別用の閾値Ｔｈ、信頼度α、およびECOCテーブルμを設定する。 In step S84, the viseme determiner 31 sets the parameters of the binary discriminant weak discriminator h_t, namely the two pixel positions S₁, S₂ on the lip image x, the threshold Th for discriminating the pixel difference feature, the reliability α, and the ECOC table μ.

ステップＳ８５において、口形素判定器３１は、唇画像ｘ上の２画素の位置Ｓ₁，Ｓ₂から画素値Ｉ₁，Ｉ₂を読み出し、ピクセル差分特徴（Ｉ₁−Ｉ₂）を算出して閾値Ｔｈと比較することにより、２値判別弱判別器ｈ_tの判別値（−１または＋１）を得る。 In step S85, the viseme determiner 31 reads the pixel values I₁, I₂ from the two pixel positions S₁, S₂ on the lip image x, calculates the pixel difference feature (I₁ − I₂), and compares it with the threshold Th to obtain the discrimination value (−1 or +1) of the binary discriminant weak discriminator h_t.

ステップＳ８６において、口形素判定器３１は、ステップＳ８５で得た２値判別弱判別器ｈ_tの判別値に信頼度α_tを乗算し、さらに１行Ｋ列のECOCテーブルの値μ_t（ｋ）を乗算することにより、パラメータｔに対応する１行Ｋ列のクラススコアＨ_kを得る。 In step S86, the viseme determiner 31 multiplies the discrimination value of the binary discriminant weak discriminator h_t obtained in step S85 by the reliability α_t and then by the value μ_t(k) of the 1-row, K-column ECOC table, thereby obtaining the 1-row, K-column class score H_k corresponding to the parameter t.

ステップＳ８７において、口形素判定器３１は、ステップＳ８６で得た、パラメータｔに対応する１行Ｋ列のクラススコアＨ_kを、前回（すなわち、ｔ−１）までのクラススコアＨ_kの累計値に加算することにより、１行Ｋ列のクラススコアＨ_kを更新する。 In step S87, the viseme determiner 31 updates the 1-row, K-column class score H_k by adding the class score H_k corresponding to the parameter t, obtained in step S86, to the cumulative class score H_k up to the previous iteration (that is, t−1).

ステップＳ８８において、口形素判定器３１は、パラメータｔ＝Ｔであるか否かを判定し、否と判定した場合、処理をステップＳ８９に進めてパラメータｔを１だけインクリメントする。そして、処理はステップＳ８４に戻され、それ以降の処理が繰り返される。その後、ステップＳ８８において、パラメータｔ＝Ｔであると判定された場合、処理はステップＳ９０に進められる。 In step S88, the viseme determiner 31 determines whether or not the parameter t = T, and if not, the process proceeds to step S89 to increment the parameter t by 1. Then, the process returns to step S84, and the subsequent processes are repeated. Thereafter, when it is determined in step S88 that the parameter t = T, the process proceeds to step S90.

ステップＳ９０において、口形素判定器３１は、パラメータｋ＝Ｋであるか否かを判定し、パラメータｋ＝Ｋではないと判定した場合、処理をステップＳ９１に進めてパラメータｋを１だけインクリメントする。そして、処理はステップＳ８３に戻され、それ以降の処理が繰り返される。その後、ステップＳ９０において、パラメータｋ＝Ｋであると判定された場合、処理はステップＳ９２に進められる。 In step S90, the viseme determination unit 31 determines whether or not the parameter k = K. If it is determined that the parameter k = K is not satisfied, the process proceeds to step S91 and the parameter k is incremented by one. Then, the process returns to step S83, and the subsequent processes are repeated. Thereafter, when it is determined in step S90 that the parameter k = K, the process proceeds to step S92.

ステップＳ９２において、口形素判定器３１は、その時点で得られている１行Ｋ列のクラススコアＨ_kを口形素判定器３１の出力、すなわち、Ｋ次元スコアベクトルとして後段（いまの場合、時系列特徴量生成部４５）に出力する。以上で、Ｋ次元スコアベクトル演算処理は終了される。 In step S92, the viseme determiner 31 outputs the 1-row, K-column class score H_k obtained at that point as its output, that is, as the K-dimensional score vector, to the subsequent stage (in this case, the time-series feature amount generation unit 45). This completes the K-dimensional score vector calculation process.
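The score vector computation of steps S81 to S92 can be sketched as an accumulation of α_t · h_t(x) · μ_t(k) over all weak discriminators. The data layout (`WeakClassifier`) is hypothetical, pixel positions are assumed to be (row, column) pairs, and the weak discriminator is assumed to output +1 when the pixel difference exceeds the threshold. The loops are nested with t outside and k inside, which yields the same sums as the k-outer order of the flowchart.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class WeakClassifier:
    s1: Tuple[int, int]   # first pixel position (row, column)
    s2: Tuple[int, int]   # second pixel position
    threshold: float      # Th
    alpha: float          # reliability
    ecoc_row: List[int]   # 1 x K row of -1/+1 values

def score_vector(image: List[List[int]],
                 classifiers: List[WeakClassifier], K: int) -> List[float]:
    """Accumulate alpha_t * h_t(x) * mu_t(k) over all weak classifiers."""
    scores = [0.0] * K
    for wc in classifiers:
        diff = image[wc.s1[0]][wc.s1[1]] - image[wc.s2[0]][wc.s2[1]]
        h = 1 if diff > wc.threshold else -1          # step S85
        for k in range(K):
            scores[k] += wc.alpha * h * wc.ecoc_row[k]  # steps S86 and S87
    return scores

image = [[10, 0], [0, 0]]
classifiers = [
    WeakClassifier((0, 0), (0, 1), 5.0, 0.5, [+1, -1, +1]),
    WeakClassifier((1, 0), (0, 0), 0.0, 0.25, [-1, -1, +1]),
]
scores = score_vector(image, classifiers, K=3)
```

Each entry of the result indicates how strongly the lip image resembles the corresponding viseme class, and the whole vector is what the text calls the K-dimensional score vector.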

図１２に戻る。ステップＳ７３において、時系列特徴量生成部４５は、発話期間検出部４４から通知される発話期間に、口形素判別器３１から順次入力されたＫ次元スコアベクトルを時系列に配置することにより、登録用発話動画像の発話期間に対応した時系列特徴量を生成する。 Returning to FIG. 12. In step S73, the time-series feature amount generation unit 45 arranges, in time series, the K-dimensional score vectors sequentially input from the viseme discriminator 31 during the utterance period notified by the utterance period detection unit 44, thereby generating the time-series feature amount corresponding to the utterance period of the registration utterance moving image.

ステップＳ７４において、時系列特徴量学習部４６は、登録用発話動画像とともに外部から供給された登録用発話単語（登録用発話動画像における話者の発話内容）に対応付けて、時系列特徴量生成部４５から入力された時系列特徴量をHMMによりモデル化する。モデル化された時系列特徴量は、発話認識器４７に内蔵された学習データベース４８に保持される。以上で、登録処理は終了される。 In step S74, the time-series feature amount learning unit 46 associates the registration-speech moving image with the registration utterance word (the utterance content of the speaker in the registration utterance video image) supplied from the outside in association with the time-series feature amount. The time series feature value input from the generation unit 45 is modeled by the HMM. The modeled time series feature amount is held in a learning database 48 built in the utterance recognizer 47. This completes the registration process.

[認識処理の詳細] [Details of recognition process]
図１４は、認識処理を詳細に説明するフローチャートである。 FIG. 14 is a flowchart explaining the recognition process in detail.

認識系１３は、入力された認識用発話動画像に対し、ステップＳ１０１乃至Ｓ１０３の処理として、図１２を参照して上述した登録系１２による登録処理のステップＳ７１乃至Ｓ７３と同様の処理を行う。この結果、認識用発話動画像の発話期間に対応した時系列特徴量が生成される。生成された認識用発話動画像の発話期間に対応した時系列特徴量は、発話認識器４７に入力される。 The recognition system 13 performs the same processing as steps S71 to S73 of the registration processing by the registration system 12 described above with reference to FIG. 12 as the processing of steps S101 to S103 on the input speech moving image for recognition. As a result, a time-series feature amount corresponding to the utterance period of the recognition utterance moving image is generated. A time-series feature amount corresponding to the utterance period of the generated utterance moving image for recognition is input to the utterance recognizer 47.

ステップＳ１０４において、発話認識器４７は、時系列特徴量生成部４５から入力された時系列特徴量に対して、学習データベース４８に保持されているモデルのうちで最も類似しているものを特定する。さらに、発話認識器４７は、特定したモデルに対応付けられている登録用発話単語を、認識用発話動画像に対応する発話認識結果として出力する。以上で、認識処理は終了される。 In step S 104, the utterance recognizer 47 identifies the most similar model stored in the learning database 48 with respect to the time series feature quantity input from the time series feature quantity generation unit 45. . Further, the utterance recognizer 47 outputs the registered utterance word associated with the identified model as an utterance recognition result corresponding to the recognition utterance moving image. This completes the recognition process.

[認識実験の結果] [Results of recognition experiment]
次に、発話認識装置１０による認識実験の結果について説明する。 Next, the results of a recognition experiment using the utterance recognition device 10 are described.

この認識実験では、学習処理において、２１６単語を発声する７３人の被験者（話者）をそれぞれ個別にビデオ撮影した学習用音声付発話動画像を用いた。また、登録処理においては、学習処理時の２１６単語のうちの、図１５に示す２０単語を登録発話単語に選択し、それに対応する学習用発話動画像を登録用発話動画像に流用した。また、HMMを用いたモデル化では、遷移確率をleft-to-rightに制約し、４０状態の遷移モデルとした。 In this recognition experiment, the learning process used learning utterance moving images with audio, obtained by individually video-recording each of 73 subjects (speakers) uttering 216 words. In the registration process, the 20 words shown in FIG. 15 were selected from the 216 words of the learning process as the registration utterance words, and the corresponding learning utterance moving images were reused as the registration utterance moving images. In the modeling with the HMM, the transition probabilities were constrained to left-to-right and a 40-state transition model was used.
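The left-to-right constraint can be illustrated by the shape of the transition matrix. The text only states the constraint and the number of states (40); allowing just a self-loop and a step to the next state, and the initial probability 0.5, are assumptions of this sketch.

```python
from typing import List

def left_to_right_transitions(n_states: int, stay: float = 0.5) -> List[List[float]]:
    """Transition matrix in which a state can only stay or move to the next state."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay            # remain in the same state
        A[i][i + 1] = 1.0 - stay  # advance one state to the right
    A[n_states - 1][n_states - 1] = 1.0  # the final state is absorbing
    return A

A = left_to_right_transitions(40)
```

Every entry below the diagonal is zero, so the model can never return to an earlier state, which suits a word that is uttered once from start to end.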

そして、認識処理では、学習処理および登録処理と同じ被験者の認識用発話動画像を用いたクローズ評価と、学習処理および登録処理とは異なる被験者の認識用発話動画像を用いたオープン評価を行い、図１６に示す認識率を得ることができた。 In the recognition process, a close evaluation using the same subject utterance moving image as the learning process and the registration process, and an open evaluation using the subject utterance moving image different from the learning process and the registration process, The recognition rate shown in FIG. 16 could be obtained.

図１６は、ある登録用発話単語Ｗを発話している認識用発話動画像に対応する時系列特徴量が、２０種類の各登録用発話単語にそれぞれ対応する各HMMにどの程度類似しているかを順位付けした際に、正解（登録用発話単語Ｗに対応するHMM）がＭ番目（横軸）までに入っている確率（縦軸）を示している。 FIG. 16 shows how similar the time-series feature amount corresponding to the recognition utterance moving image uttering a certain utterance word W for registration is to each HMM corresponding to each of the 20 types of utterance words for registration. The probability (vertical axis) that the correct answer (HMM corresponding to the utterance word W for registration) is included up to the Mth (horizontal axis) is shown.
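The quantity plotted in FIG. 16 is a cumulative rank rate: the fraction of test utterances whose correct word appears among the top-ranked candidates up to a given rank. A minimal sketch, with hypothetical names and toy data rather than the experiment's results:

```python
from typing import List

def top_m_rate(rankings: List[List[str]], truths: List[str], m: int) -> float:
    """Fraction of utterances whose correct word is within the first m candidates."""
    if len(rankings) != len(truths) or not truths:
        raise ValueError("one non-empty ranking is needed per utterance")
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:m])
    return hits / len(truths)

# Toy rankings of three registered words for three test utterances of "a".
rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truths = ["a", "a", "a"]
curve = [top_m_rate(rankings, truths, m) for m in (1, 2, 3)]
```

The rate at rank 1 corresponds to the ordinary recognition rate, and the curve reaches 1 once the rank equals the number of registered words.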

同図によれば、クローズ評価の場合には９６％の識別率を得ることができた。また、オープン評価の場合には８０％の識別率を得ることができた。 According to the figure, in the case of close evaluation, an identification rate of 96% could be obtained. In the case of open evaluation, an identification rate of 80% was obtained.

なお、上述した認識実験では、学習処理と登録処理の被験者（話者）を共通とし、登録用発話動画像に学習用発話動画像を流用したが、学習処理と登録処理の被験者（話者）を別人としてもよく、さらに、認識処理の被験者（話者）をさらに別人としてもよい。 In the above-described recognition experiment, the learning process and the registration process subject (speaker) are shared, and the learning utterance moving image is diverted to the registration utterance moving image, but the learning process and registration process subject (speaker). May be another person, and the subject (speaker) of the recognition process may be another person.

以上説明した、第１の実施の形態である発話認識装置１０によれば、入力された画像（いまの場合、唇画像）の特徴量を演算するための判別器を学習により生成するので、認識したい対象に対して、その都度、判別器を新たに設計する必要がない。したがって、ラベルの種類を変更することにより、例えば動画像からジェスチャや手書き文字を識別したりする認識装置にも容易に適用できる。 According to the utterance recognition device 10 of the first embodiment described above, the discriminator for calculating the feature amount of the input image (in this case, the lip image) is generated by learning, so there is no need to design a new discriminator for each target to be recognized. Therefore, by changing the type of label, the invention can easily be applied to recognition devices that identify, for example, gestures or handwritten characters from moving images.

また、学習処理によって、個人差の大きい部位の画像に対して汎用性のある特徴量を抽出することができる。 In addition, by the learning process, it is possible to extract a versatile feature amount for an image of a part having a large individual difference.

さらに、画像特徴量に比較的演算量が少ないピクセル差分を用いたので、リアルタイムな認識処理が可能になる。 Furthermore, since the pixel difference having a relatively small amount of calculation is used as the image feature amount, real-time recognition processing can be performed.

＜２．第２の実施の形態＞ <2. Second Embodiment>
［デジタルスチルカメラの構成例］ [Configuration example of digital still camera]
次に、図１７は、第２の実施の形態であるデジタルスチルカメラ６０の構成例を示している。このデジタルスチルカメラ６０は、読唇技術を応用したオートシャッタ機能を有している。具体的には、被写体となる人物が「ハイ、チーズ」などと所定のキーワード（以下、シャッタキーワードと称する）を発話したことを検出した場合、これに応じてシャッタをきる（静止画像を撮像する）ようにしたものである。 Next, FIG. 17 shows a configuration example of a digital still camera 60 according to the second embodiment. The digital still camera 60 has an auto shutter function based on lip-reading technology. Specifically, when it detects that the person who is the subject has uttered a predetermined keyword (hereinafter referred to as a shutter keyword) such as "Hai, chiizu" ("Say cheese"), it releases the shutter (captures a still image) in response.

このデジタルスチルカメラ６０は、撮像部６１、画像処理部６２、記録部６３、Ｕ／Ｉ部６４、撮像制御部６５、およびオートシャッタ制御部６６から構成される。 The digital still camera 60 includes an imaging unit 61, an image processing unit 62, a recording unit 63, a U / I unit 64, an imaging control unit 65, and an auto shutter control unit 66.

撮像部６１は、レンズ群、CMOS(Complementary Metal-Oxide Semiconductor)等の撮像素子（いずれも図示せず）から構成され、被写体の光学像を取得して電気信号に変換し、その結果得られる画像信号を後段に出力する。 The imaging unit 61 includes a lens group and an imaging element (none of which is shown) such as a CMOS (Complementary Metal-Oxide Semiconductor), acquires an optical image of a subject, converts it into an electrical signal, and an image obtained as a result The signal is output to the subsequent stage.

すなわち、撮像部６１は、撮像制御部６５からの制御に従い、撮像前の段階において画像信号を撮像制御部６５およびオートシャッタ制御部６６に出力する。また、撮像部６１は、撮像制御部６５からの制御に従って撮像を行い、その結果得られる画像信号を画像処理部６２に出力する。 That is, the imaging unit 61 outputs an image signal to the imaging control unit 65 and the auto shutter control unit 66 in a stage before imaging in accordance with control from the imaging control unit 65. The imaging unit 61 performs imaging in accordance with control from the imaging control unit 65 and outputs an image signal obtained as a result to the image processing unit 62.

Hereinafter, the moving image that is output to the imaging control unit 65 for composition determination before imaging and displayed on a display (not shown) included in the U/I unit 64 is referred to as a finder image. The finder image is also output to the auto shutter control unit 66. The image signal output from the imaging unit 61 to the image processing unit 62 as a result of imaging is referred to as a recorded image.

The image processing unit 62 performs predetermined image processing (for example, camera shake correction, white balance correction, and pixel interpolation) on the recorded image input from the imaging unit 61, encodes it according to a predetermined encoding method, and outputs the resulting encoded image data to the recording unit 63. The image processing unit 62 also decodes encoded image data input from the recording unit 63 and outputs the resulting image signal (hereinafter referred to as a reproduced image) to the imaging control unit 65.

The recording unit 63 records the encoded image data input from the image processing unit 62 on a recording medium (not shown). The recording unit 63 also reads encoded image data recorded on the recording medium and outputs it to the image processing unit 62.

The imaging control unit 65 controls the entire digital still camera 60. In particular, the imaging control unit 65 controls the imaging unit 61 to execute imaging in accordance with a shutter operation signal from the U/I unit 64 or an auto shutter signal from the auto shutter control unit 66.

The U/I (user interface) unit 64 includes various input devices, typified by a shutter button that accepts a shutter operation by the user, and a display that shows finder images, reproduced images, and the like. In particular, the U/I unit 64 outputs a shutter operation signal to the imaging control unit 65 in response to a shutter operation by the user.

The auto shutter control unit 66 outputs an auto shutter signal to the imaging control unit 65 when, based on the finder image input from the imaging unit 61, it detects that a person who is the subject has uttered the shutter keyword.

FIG. 18 shows a detailed configuration example of the auto shutter control unit 66.

As is clear from a comparison of FIG. 18 with FIG. 1, the auto shutter control unit 66 has the same configuration as the registration system 12 and the recognition system 13 of the utterance recognition device 10 of FIG. 1, with an auto shutter signal output unit 71 added. Components of the auto shutter control unit 66 that are common to the utterance recognition device 10 of FIG. 1 are given the same reference numerals, and their description is omitted.

Note, however, that the viseme discriminator 31 in the auto shutter control unit 66 has already been trained.

The auto shutter signal output unit 71 generates an auto shutter signal and outputs it to the imaging control unit 65 when the utterance recognition result from the utterance recognizer 47 indicates a shutter keyword that has been registered in advance.

[Description of operation]
Next, the operation of the digital still camera 60 will be described. The operating modes of the digital still camera 60 include a normal shooting mode, a normal playback mode, an auto shutter registration mode, and an auto shutter execution mode.

In the normal shooting mode, shooting is performed in response to a shutter operation by the user. In the normal playback mode, captured images are played back and displayed in response to a playback operation by the user.

In the shutter keyword registration mode (the auto shutter registration mode listed above), an HMM of the time-series feature amount representing the lip movement of a speaker (such as the user) uttering an arbitrary word chosen as the shutter keyword is registered. A shutter keyword and the HMM of the time-series feature amount representing the corresponding lip movement may also be registered in advance when the digital still camera 60 is shipped as a product.

In the auto shutter execution mode, a time-series feature amount representing the lip movement of the person who is the subject is detected from the finder image, and shooting is performed when the subject is recognized, based on the detected time-series feature amount, as uttering the shutter keyword.

[Details of shutter keyword registration process]
Next, FIG. 19 is a flowchart explaining the shutter keyword registration process.

The shutter keyword registration process starts when the shutter keyword registration mode is entered in response to a predetermined operation by the user, and ends in response to a predetermined operation by the user.

After instructing the start of the shutter keyword registration process, the user frames the finder image so that it shows the face of the speaker uttering the word to be registered as the shutter keyword. The speaker is preferably the person who will be the subject during the auto shutter execution process, but may be someone else, for example the user. After the shutter keyword has been uttered, the user instructs the end of the registration process.

In step S121, the imaging control unit 65 determines whether the end of the shutter keyword registration process has been instructed. If not, the process proceeds to step S122.

In step S122, the face region detection unit 41 of the registration system 12 divides the finder image into frames, takes the frames one at a time as the processing target, and detects face regions from the target frame. In step S123, the face region detection unit 41 determines whether exactly one face region has been detected from the target frame. If a plurality of face regions have been detected, or none at all, the process proceeds to step S124.

In step S124, the U/I unit 64 prompts the user to make sure that only the one speaker uttering the word to be registered as the shutter keyword appears in the finder image. The process then returns to step S121, and the subsequent steps are repeated.

If exactly one face region has been detected from the target frame in step S123, the process proceeds to step S125.

In step S125, the face region detection unit 41 outputs the position information of the face region, together with the one frame of the finder image being processed, to the lip region detection unit 42. The lip region detection unit 42 detects a lip region from the face region of the target frame and outputs the position information of the lip region, together with that frame of the finder image, to the lip image generation unit 43.

The lip image generation unit 43 applies rotation correction, as needed, to the frame of the finder image being processed so that the line connecting the two corners of the mouth becomes horizontal. The lip image generation unit 43 then extracts the lip region from the rotation-corrected frame and resizes it to a predetermined image size to generate a lip image. The generated lip image is input to the viseme discriminator 31 and the utterance period detection unit 44.
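The geometry of this rotation correction can be illustrated with a minimal sketch. It assumes the two mouth-corner coordinates are already available from lip region detection; the function names, the crop aspect ratio, and the margin are hypothetical and are not taken from the specification, which only states that the line connecting the mouth corners is made horizontal and that the lip region is resized to a predetermined size.

```python
import math

def level_mouth_corners(left_corner, right_corner):
    """Rotate both mouth corners about their midpoint so that the line
    joining them becomes horizontal.
    Returns (tilt_in_degrees, new_left, new_right)."""
    dx = right_corner[0] - left_corner[0]
    dy = right_corner[1] - left_corner[1]
    angle = math.atan2(dy, dx)  # tilt of the mouth line
    cx = (left_corner[0] + right_corner[0]) / 2.0
    cy = (left_corner[1] + right_corner[1]) / 2.0

    def rotate(point):
        # Rotate by -angle about the midpoint to cancel the tilt.
        x, y = point[0] - cx, point[1] - cy
        c, s = math.cos(-angle), math.sin(-angle)
        return (cx + x * c - y * s, cy + x * s + y * c)

    return math.degrees(angle), rotate(left_corner), rotate(right_corner)

def lip_crop_box(left_corner, right_corner, aspect=0.5, margin=0.2):
    """Axis-aligned crop box (x0, y0, x1, y1) around leveled mouth corners.
    `aspect` (height / width) and `margin` are illustrative values only."""
    width = (right_corner[0] - left_corner[0]) * (1.0 + 2.0 * margin)
    height = width * aspect
    cx = (left_corner[0] + right_corner[0]) / 2.0
    cy = (left_corner[1] + right_corner[1]) / 2.0
    return (cx - width / 2.0, cy - height / 2.0,
            cx + width / 2.0, cy + height / 2.0)
```

In a real pipeline the same angle would be applied to the whole frame with an image library before cropping and resizing; that step is omitted here to keep the sketch free of external dependencies.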

In step S126, the utterance period detection unit 44 determines, based on the lip image of the target frame, whether the frame belongs to an utterance period, and notifies the viseme discriminator 31 and the time-series feature amount generation unit 45 of the result. If the frame is determined to belong to an utterance period, the process proceeds to step S127. Otherwise, step S127 is skipped.

In step S127, the viseme discriminator 31 computes a K-dimensional score vector for each of the sequentially input lip images that corresponds to the utterance period and outputs it to the time-series feature amount generation unit 45. The process then returns to step S121, and steps S121 to S127 are repeated until the end of the shutter keyword registration process is instructed.

If it is determined in step S121 that the end of the shutter keyword registration process has been instructed, the process proceeds to step S128.

In step S128, the time-series feature amount generation unit 45 arranges in time series the K-dimensional score vectors sequentially input from the viseme discriminator 31 during the utterance period notified by the utterance period detection unit 44, thereby generating the time-series feature amount corresponding to the shutter keyword to be registered.

In step S129, the time-series feature amount learning unit 46 models the time-series feature amount input from the time-series feature amount generation unit 45 with an HMM, in association with the text data of the shutter keyword input from the U/I unit 64. The modeled time-series feature amount is held in the learning database 48 built into the utterance recognizer 47. This completes the shutter keyword registration process.
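The control flow of steps S121 to S128 can be summarized in a short sketch. It is only an illustration under stated assumptions: the detectors and the viseme discriminator are passed in as hypothetical callables, and the end of the frame iterable stands in for the user's instruction to end registration. HMM modeling (step S129) is not covered.

```python
def collect_registration_sequence(frames, detect_faces, make_lip_image,
                                  is_utterance, score_vector, warn):
    """Return the time-ordered list of K-dimensional score vectors for the
    frames in which a single speaker is visible and speaking.
    All callables are hypothetical stand-ins for units 41 to 44 and 31."""
    sequence = []
    for frame in frames:                       # S121: loop until told to stop
        faces = detect_faces(frame)            # S122: detect face regions
        if len(faces) != 1:                    # S123: exactly one face required
            warn("Frame only the one speaker")  # S124: prompt the user
            continue
        lip = make_lip_image(frame, faces[0])  # S125: lip region -> lip image
        if is_utterance(lip):                  # S126: inside an utterance period?
            sequence.append(score_vector(lip))  # S127: K-dimensional score vector
    return sequence                            # S128: vectors in time order
```

The returned list corresponds to the registration time-series feature amount that step S129 would then model with an HMM.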

[Details of auto shutter execution process]
Next, FIG. 20 is a flowchart explaining the auto shutter execution process.

The auto shutter execution process starts when the auto shutter execution mode is entered in response to a predetermined operation by the user, and ends in response to a predetermined operation by the user.

In step S141, the face region detection unit 41 of the recognition system 13 divides the finder image into frames, takes the frames one at a time as the processing target, and detects face regions from the target frame.

In step S142, the face region detection unit 41 determines whether a face region has been detected from the target frame; the process returns to step S141 until a face region is detected. Once a face region has been detected from the target frame, the process proceeds to step S143.

Unlike in the shutter keyword registration process, a plurality of face regions may be detected from one frame here. When a plurality of face regions are detected from one frame, the subsequent processing is executed in parallel for each detected face region.

In step S143, the face region detection unit 41 outputs the position information of the face region, together with the one frame of the finder image being processed, to the lip region detection unit 42. The lip region detection unit 42 detects a lip region from the face region of the target frame and outputs the position information of the lip region, together with that frame of the finder image, to the lip image generation unit 43.

In step S144, the utterance period detection unit 44 determines the utterance period based on the lip image of the target frame. If the target frame is determined to be the start point of an utterance period or to lie within one, the process proceeds to step S145.

In step S145, the viseme discriminator 31 computes a K-dimensional score vector for each of the sequentially input lip images that corresponds to the utterance period and outputs it to the time-series feature amount generation unit 45. The process then returns to step S141, and the subsequent steps are repeated.

If it is determined in step S144 that the target frame is the end point of an utterance period, the process proceeds to step S146.

In step S146, the time-series feature amount generation unit 45 arranges in time series the K-dimensional score vectors sequentially input from the viseme discriminator 31 during the utterance period notified by the utterance period detection unit 44, thereby generating the time-series feature amount corresponding to the subject's lip movement.

In step S147, the time-series feature amount generation unit 45 inputs the generated time-series feature amount to the utterance recognizer 47. In step S148, the utterance recognizer 47 compares the time-series feature amount input from the time-series feature amount generation unit 45 with the HMM corresponding to the shutter keyword held in the learning database 48, and determines whether the subject's lip movement corresponds to the shutter keyword. If it does, the process proceeds to step S149. If not, the process returns to step S141, and the subsequent steps are repeated.
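The specification does not state how the comparison in step S148 is carried out. One common way to compare an observation sequence with an HMM is to compute its log-likelihood with the forward algorithm and apply a threshold. The sketch below does this under assumptions that are not in the source: diagonal-covariance Gaussian emissions, a per-frame normalized score, and a hand-chosen threshold. All names are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) that tolerates -inf entries."""
    top = max(values)
    if top == float("-inf"):
        return top
    return top + math.log(sum(math.exp(v - top) for v in values))

def hmm_log_likelihood(sequence, log_start, log_trans, means, variances):
    """Forward algorithm in the log domain; returns log P(sequence | HMM)."""
    n = len(log_start)
    alpha = [log_start[s] + log_gauss(sequence[0], means[s], variances[s])
             for s in range(n)]
    for x in sequence[1:]:
        alpha = [logsumexp([alpha[p] + log_trans[p][s] for p in range(n)])
                 + log_gauss(x, means[s], variances[s])
                 for s in range(n)]
    return logsumexp(alpha)

def matches_keyword(sequence, model, threshold):
    """Assumed decision rule: per-frame log-likelihood at or above threshold."""
    return hmm_log_likelihood(sequence, **model) / len(sequence) >= threshold
```

Here `sequence` plays the role of the recognition time-series feature amount (a list of K-dimensional score vectors) and `model` is a dictionary holding the parameters of the HMM registered for the shutter keyword. Dividing by the sequence length is a design choice that keeps the threshold comparable across utterances of different durations.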

In step S149, the utterance recognizer 47 notifies the auto shutter signal output unit 71 that the subject's lip movement corresponds to the shutter keyword. In response to this notification, the auto shutter signal output unit 71 generates an auto shutter signal and outputs it to the imaging control unit 65. In accordance with the auto shutter signal, the imaging control unit 65 controls the imaging unit 61 and other units to perform imaging. The user can set the imaging timing freely, for example to a predetermined time (for example, one second) after the shutter keyword is uttered. The process then returns to step S141, and the subsequent steps are repeated.
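Steps S144 to S149 amount to a small per-subject state machine: buffer score vectors while the subject is speaking, run recognition at the end of the utterance, and schedule the shutter after a user-set delay. The following sketch illustrates this under assumptions: the class name and its interface are hypothetical, the recognizer is any callable returning True or False, and the first non-speaking frame after speech is treated as the end point of the utterance period.

```python
class UtteranceTracker:
    """Tracks one subject's utterance and decides when to fire the shutter."""

    def __init__(self, recognizer, delay_s=1.0):
        self.recognizer = recognizer  # stand-in for utterance recognizer 47
        self.delay_s = delay_s        # user-set delay after the keyword
        self.buffer = []              # score vectors of the current utterance

    def push(self, timestamp, speaking, score_vector):
        """Feed one frame; return the time to fire the shutter, or None."""
        if speaking:                          # S144: start or middle of utterance
            self.buffer.append(score_vector)  # S145: keep the score vector
            return None
        if not self.buffer:                   # silence with nothing pending
            return None
        sequence, self.buffer = self.buffer, []  # S146: utterance ended
        if self.recognizer(sequence):         # S147 to S148: compare with model
            return timestamp + self.delay_s   # S149: schedule the shutter
        return None
```

When several face regions are detected in a frame, one tracker per face (for example, a dictionary keyed by a face identifier) would mirror the parallel per-face processing described above.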

In the description above, when a plurality of face regions (subjects) are detected from the finder image, any one of the subjects may utter the shutter keyword.

This specification may be changed, however, so that, for example, imaging is performed when a majority of the subjects have uttered the shutter keyword. Such a specification adds an element of play when taking group photographs. In addition, because recognition is performed on a plurality of faces, the recognition result becomes more robust, and erroneous detection of the shutter keyword can be expected to decrease.

Furthermore, by combining a personal identification technique that identifies individual faces, the shutter keyword may be detected by attending only to a specific person among the plurality of subjects. There may be more than one such specific person. If the shutter keyword registration process described above is performed with this specific person as the speaker, more robust and accurate utterance recognition becomes possible.
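The three triggering variants described above (any subject, a majority of subjects, or designated persons only) can be expressed as a single decision function. This is a minimal sketch: the function name, the policy labels, and the subject identifiers are hypothetical, and for the designated-person variant it assumes that an utterance by any one designated person is sufficient, which the specification does not spell out.

```python
def should_release_shutter(recognized, policy="any", targets=()):
    """Decide whether to fire the shutter.

    recognized: dict mapping a subject id to True if that subject was
                recognized as uttering the shutter keyword.
    policy:     "any", "majority", or "targets" (hypothetical labels).
    targets:    ids of designated persons, used only by "targets".
    """
    if policy == "any":
        return any(recognized.values())
    if policy == "majority":
        # Strictly more than half of the detected subjects.
        return sum(recognized.values()) * 2 > len(recognized)
    if policy == "targets":
        # Assumption: one designated person uttering the keyword is enough.
        return any(recognized.get(t, False) for t in targets)
    raise ValueError(f"unknown policy: {policy}")
```

With three detected subjects of whom one has uttered the keyword, the "any" policy fires and the "majority" policy does not.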

As described above, with the digital still camera 60 of the second embodiment, a subject at a distance from the camera can indicate the imaging timing simply by uttering the shutter keyword, without using a remote controller or the like, even in a noisy environment. The shutter keyword can be set arbitrarily.

Note that the present invention can be applied not only to digital still cameras but also to digital video cameras.

The series of processes described above can be executed by hardware or by software. When the series of processes is executed by software, the program constituting the software is installed from a program recording medium onto a computer built into dedicated hardware or onto, for example, a general-purpose personal computer that can execute various functions when various programs are installed.

FIG. 21 is a block diagram showing a configuration example of the hardware of a computer that executes the series of processes described above by means of a program.

In the computer 200, a CPU (Central Processing Unit) 201, a ROM (Read Only Memory) 202, and a RAM (Random Access Memory) 203 are connected to one another via a bus 204.

An input/output interface 205 is further connected to the bus 204. Connected to the input/output interface 205 are an input unit 206 including a keyboard, a mouse, and a microphone; an output unit 207 including a display and a speaker; a storage unit 208 including a hard disk or nonvolatile memory; a communication unit 209 including a network interface; and a drive 210 that drives a removable medium 211 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory.

In the computer configured as described above, the CPU 201 loads, for example, a program stored in the storage unit 208 into the RAM 203 via the input/output interface 205 and the bus 204 and executes it, whereby the series of processes described above is performed.

The program executed by the computer (CPU 201) is provided either recorded on the removable medium 211, which is a package medium such as a magnetic disk (including a flexible disk), an optical disc (CD-ROM (Compact Disc-Read Only Memory), DVD (Digital Versatile Disc), etc.), a magneto-optical disk, or a semiconductor memory, or via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

The program can be installed in the storage unit 208 via the input/output interface 205 by mounting the removable medium 211 in the drive 210. The program can also be received by the communication unit 209 via a wired or wireless transmission medium and installed in the storage unit 208. Alternatively, the program can be installed in the ROM 202 or the storage unit 208 in advance.

The program executed by the computer may be a program whose processes are performed in time series in the order described in this specification, or a program whose processes are performed in parallel or at necessary timing, such as when a call is made.

The program may be processed by a single computer or processed in a distributed manner by a plurality of computers. Furthermore, the program may be transferred to a remote computer and executed there.

The embodiments of the present invention are not limited to those described above, and various modifications are possible without departing from the gist of the present invention.

DESCRIPTION OF SYMBOLS: 10 utterance recognition device, 21 image/sound separation unit, 22 face region detection unit, 23 lip region detection unit, 24 lip image generation unit, 25 phoneme label assignment unit, 26 phoneme dictionary, 27 viseme label conversion unit, 28 viseme label addition unit, 29 learning sample holding unit, 30 viseme discriminator learning unit, 31 viseme discriminator, 41 face region detection unit, 42 lip region detection unit, 43 lip image generation unit, 44 utterance period detection unit, 45 time-series feature amount generation unit, 46 time-series feature amount learning unit, 47 utterance recognizer, 48 learning database, 60 digital still camera, 61 imaging unit, 62 image processing unit, 63 recording medium, 64 U/I unit, 65 imaging control unit, 66 auto shutter control unit, 71 auto shutter signal output unit, 200 computer, 201 CPU

Claims

An imaging apparatus comprising:
imaging means for outputting a finder image at the time of composition determination and outputting a recorded image at the time of imaging;
a multi-class classifier that, in response to input of a lip image, outputs a multi-dimensional score vector indicating how similar the lip image is to each of a plurality of types of visemes;
a registration database in which modeled registration time-series feature amounts are registered in association with keywords;
generating means for generating, from the finder image, the lip image including the lip region of a subject, inputting the lip image to the multi-class classifier, and arranging in time series the resulting multi-dimensional score vectors corresponding to the lip images based on the finder image, thereby generating a recognition time-series feature amount; and
auto shutter control means for controlling the imaging means to execute an imaging process based on a result of comparing the generated recognition time-series feature amount with a model registered in the registration database.

The imaging apparatus according to claim 1, wherein the multi-class classifier is generated by AdaBoostECOC (Error Correct Output Coding) learning using an image feature amount of a lip image to which a class label indicating a viseme is added.

The imaging apparatus according to claim 2, wherein the image feature amount is a pixel difference feature.

The imaging apparatus according to claim 2, further comprising registration means for generating a registration lip image from a registration finder image whose subject is a speaker uttering an arbitrary keyword, inputting the registration lip image to the multi-class classifier, arranging in time series the resulting multi-dimensional score vectors corresponding to the registration lip image to generate the registration time-series feature amount, modeling the registration time-series feature amount in association with the arbitrary keyword, and registering it in the registration database.

The imaging apparatus according to claim 4, wherein the registration means models the registration time-series feature amount using an HMM (Hidden Markov Model).

An imaging method for an imaging apparatus including:
imaging means for outputting a finder image at the time of composition determination and outputting a recorded image at the time of imaging;
a multi-class classifier that, in response to input of a lip image, outputs a multi-dimensional score vector indicating how similar the lip image is to each of a plurality of types of visemes; and
a registration database in which modeled registration time-series feature amounts are registered in association with keywords,
the method comprising, by the imaging apparatus:
a generating step of generating, from the finder image, the lip image including the lip region of a subject, inputting the lip image to the multi-class classifier, and arranging in time series the resulting multi-dimensional score vectors corresponding to the lip images based on the finder image, thereby generating a recognition time-series feature amount; and
an auto shutter control step of controlling the imaging means to execute an imaging process based on a result of comparing the generated recognition time-series feature amount with a model registered in the registration database.

A program for causing a computer of an imaging apparatus including:
imaging means for outputting a finder image at the time of composition determination and outputting a recorded image at the time of imaging;
a multi-class classifier that, in response to input of a lip image, outputs a multi-dimensional score vector indicating how similar the lip image is to each of a plurality of types of visemes; and
a registration database in which modeled registration time-series feature amounts are registered in association with keywords,
to function as:
generating means for generating, from the finder image, the lip image including the lip region of a subject, inputting the lip image to the multi-class classifier, and arranging in time series the resulting multi-dimensional score vectors corresponding to the lip images based on the finder image, thereby generating a recognition time-series feature amount; and
auto shutter control means for controlling the imaging means to execute an imaging process based on a result of comparing the generated recognition time-series feature amount with a model registered in the registration database.