JP4904691B2

JP4904691B2 - Camera device and photographing method

Info

Publication number: JP4904691B2
Application number: JP2004378386A
Authority: JP
Inventors: 滋加福
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2004-12-28
Filing date: 2004-12-28
Publication date: 2012-03-28
Anticipated expiration: 2024-12-28
Also published as: JP2006184589A

Description

The present invention relates to a camera device having a voice shutter function, and to a photographing method.

Conventionally, camera devices having a voice shutter function have been described in, for example, Patent Document 1 below. Such a device performs a series of photographing operations, consisting of autofocus focusing and exposure, triggered by recognition of the spoken form of a registered command word (recognition target word).
Patent Document 1: JP-A-11-194392

However, in the above technique, the focusing operation and the exposure operation are performed only after recognition of the whole command word, for example "Hai, chiizu" (the Japanese counterpart of "Say cheese"), has been completed. As a result, there is a slight time lag between the moment the user utters the command word and the moment the actual photographing operation takes place.

The present invention has been made in view of this conventional problem, and its object is to provide a camera device and a photographing method that can almost eliminate the time lag between the utterance of a command word for photographing and the actual photographing operation.

To solve the above problem, the invention of claim 1 is a camera device having an automatic focus adjustment function that performs a series of photographing operations consisting of operations in a plurality of stages, triggered by voice input of a predetermined command word. The camera device comprises: voice recognition means for recognizing input voice; command word storage means for storing the predetermined command word; correspondence information storage means for storing correspondence information indicating the correspondence between a plurality of utterance stages set for the command word stored in the command word storage means and the operations of the respective stages in the series of photographing operations; and control means for starting, in order, each time the recognition stage of the voice recognized by the voice recognition means reaches one of the utterance stages of the command word, the operation of the stage in the series of photographing operations that corresponds to that utterance stage, as indicated by the stage information stored in the correspondence information storage means. The operations of the plurality of stages in the series of photographing operations include a focusing operation by the automatic focus adjustment function and an exposure operation. The voice recognition means has an acoustic model that includes a noise component generated during automatic focus adjustment in the photographing operation and an acoustic model that does not include it. In response to recognition of the command word that instructs automatic focus adjustment in the series of photographing operations, the voice recognition means switches from the acoustic model without the noise component to the acoustic model that includes it, and recognizes the voice of the utterance stage that follows the utterance stage of the command word that instructed automatic focus adjustment.

With this configuration, when the predetermined command word is input by voice, the operations of the respective stages in the series of photographing operations are started in order from a point before voice recognition of the command word is complete.

In the invention of claim 2, the operation of any one of the stages in the series of photographing operations consists of a plurality of operations performed in succession.

In the invention of claim 4, a plurality of command words are stored in the command word storage means, and the correspondence information storage means stores a plurality of pieces of correspondence information indicating the correspondence between the utterance stages set for each of the plurality of command words stored in the command word storage means and the operations of the respective stages in each of a plurality of series of photographing operations.

In the invention of claim 5, the camera device further comprises: registration means for storing, in the command word storage means, a phrase consisting of voice recognized by the voice recognition means as a new command word; and generation means for generating new correspondence information, which indicates the correspondence between a plurality of utterance stages in the new command word stored in the command word storage means by the registration means and the operations of the respective stages in the series of photographing operations, and storing it in the correspondence information storage means.

In the invention of claim 5, there is provided a photographing method for a camera device having an automatic focus adjustment function that performs a series of photographing operations consisting of operations in a plurality of stages, triggered by voice input of a predetermined command word. The method comprises: a step of sequentially recognizing input voice; and a step of starting, in order, each time the recognition stage of the recognized voice reaches one of a plurality of utterance stages set for the command word, the operation of the stage in the series of photographing operations that is associated with that utterance stage. The operations of the plurality of stages in the series of photographing operations include a focusing operation by the automatic focus adjustment function and an exposure operation. In the step of sequentially recognizing voice, in response to recognition of the command word that instructs automatic focus adjustment in the series of photographing operations, the acoustic model is switched from the one without the noise component to the one that includes the noise component, and the voice of the utterance stage that follows the utterance stage of the command word that instructed automatic focus adjustment is recognized.

With this method, when the predetermined command word is input by voice, the operations of the respective stages in the series of photographing operations are started in order from a point before voice recognition of the command word is complete.

In the invention of claim 6, there is provided a program for a computer of a camera device that has an automatic focus adjustment function and performs a series of photographing operations consisting of operations in a plurality of stages, triggered by voice input of a predetermined command word. The program causes the computer to execute: processing for causing voice recognition means to sequentially recognize input voice; and processing for causing the respective parts of the device to start, in order, each time the recognition stage of the voice recognized by the voice recognition means reaches one of a plurality of utterance stages set for the command word, the operation of the stage in the series of photographing operations that is associated with that utterance stage. The operations of the plurality of stages in the series of photographing operations include a focusing operation by the automatic focus adjustment function and an exposure operation. In the processing for causing the voice recognition means to sequentially recognize voice, in response to recognition of the command word that instructs automatic focus adjustment in the series of photographing operations, the acoustic model is switched from the one without the noise component to the one that includes the noise component, and the voice of the utterance stage that follows the utterance stage of the command word that instructed automatic focus adjustment is recognized.

As described above, in the present invention, when a predetermined command word is input by voice, the operations of the respective stages in a series of photographing operations are started in order from a point before voice recognition of the command word is complete. This makes it possible to almost eliminate the time lag between the utterance of the command word for photographing and the actual photographing operation.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the schematic electrical configuration of a digital camera according to the present invention.

This digital camera has a voice shutter function and includes a key input unit 1, a focusing unit 2, an exposure unit 3, an image input unit 4, an image compression unit 5, an image storage unit 6, an image display unit 7, a microphone 8, a voice input unit 9, an A/D conversion unit (A/D) 10, a work memory 11, a voice recognition unit 12, and a program memory 13. All of these units other than the key input unit 1 and the microphone 8 (that is, units 2 to 7 and 9 to 13) are driven and controlled by a control unit 14.

The key input unit 1 consists of the various keys the user uses to operate the digital camera, such as a power key, a shutter key, a mode switching key for switching between shooting and playback modes, and operation keys used for setting various functions. When any key is operated, the control unit 14 detects the operation. The focusing unit 2 includes a focus motor that, in shooting mode, drives the focus lens of an optical system (not shown) to a position corresponding to the subject distance, together with its drive circuit.

The exposure unit 3 includes an image sensor such as a CCD that captures the subject image formed by the optical system, its drive circuit, and an A/D converter that converts the analog imaging signal output from the image sensor into a digital signal, and it outputs a digital imaging signal. The image input unit 4 consists of various signal processing circuits that apply signal processing to the digitized imaging signal. The image compression unit 5 consists of a circuit that compresses the image data processed by the image input unit 4 and decompresses compressed image data. The image storage unit 6 consists of a memory card or similar medium that stores the compressed image data. The image display unit 7 consists of a liquid crystal display, its drive circuit, and related components, and displays on the liquid crystal display the subject image processed by the image input unit 4 or a recorded image read from the image storage unit 6.

The voice input unit 9 consists of an amplifier that amplifies the voice input from the microphone 8 and other audio processing circuits, and outputs the processed audio signal. The A/D conversion unit 10 converts the analog audio signal output from the voice input unit 9 into a digital signal. The work memory 11 is a RAM that sequentially stores the converted audio signal (audio data) and stores, as needed, the various data that the control unit 14 generates or uses when controlling the other units.

In the photographing standby state with the voice shutter function turned on, the voice recognition unit 12 performs feature extraction and recognition processing based on the Viterbi algorithm on the input voice that is sequentially stored in the work memory 11, using the acoustic models recorded in the program memory 13. It does so at a speed equal to or higher than the rate at which the data accumulates (higher meaning that it can catch up quickly if it falls behind), and sends the recognition results to the control unit 14 one after another.

The control unit 14 consists mainly of a CPU and peripheral circuits including an input/output interface. The program memory 13 is a non-volatile memory such as an EEPROM. It stores data such as the acoustic models that the voice recognition unit 12 uses for voice recognition, various programs that cause the control unit 14 to control the other units, for example control programs for AE (automatic exposure) and AF (automatic focus adjustment), and programs that cause the control unit 14 to function as the control means, registration means, and generation means of the present invention.

The program memory 13 also serves as the command word storage means and the correspondence information storage means of the present invention. In addition to the programs and data described above, it stores the data making up the registration table T shown in FIG. 2. The registration table T consists of a plurality of kinds of reserved words (voice patterns) 101 and correspondence information 102, which indicates the relationship between the plurality of utterance stages set for each reserved word 101 and the operations of the respective stages in a series of photographing operations. A reserved word 101 is a command word that can be used when photographing with the voice shutter function, and each reserved word 101 has a plurality of utterance stages set according to its content.

Every reserved word 101 has three utterance stages, except "Hai, chiizu", which has two: up to "Hai" (first utterance stage) and up to "... chiizu" (second utterance stage). In this embodiment, three kinds of operation can be assigned to an utterance stage: automatic focus adjustment, exposure, and recording. The largest number of operations assigned to a single stage is three, namely the recording, exposure, and recording assigned to the third utterance stage ("... mou ichimai") of the reserved word 101 "Hai, torimasu, mou ichimai". Among the acoustic models mentioned above, those for the "chiizu" and "torimasu" portions of the reserved words 101 are provided as HMMs (Hidden Markov Models) trained on PCM audio on which the noise that accompanies automatic focus adjustment (drive sounds of the optical system and motor) has been superimposed.
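The registration table T described above can be pictured as a small data structure. The following is a minimal sketch, not the patent's actual storage format: the names `Action`, `ReservedWord`, and `REGISTRATION_TABLE` are hypothetical, the reserved words are romanized ("ハイ、チーズ" as "hai, chiizu", and so on), and only the three reserved words named in the text are listed. The operations assigned to the "chiizu" stage are an assumption, since the text only says that this portion instructs exposure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Action(Enum):
    AUTOFOCUS = "automatic focus adjustment"
    EXPOSE = "exposure"
    RECORD = "recording"


@dataclass(frozen=True)
class ReservedWord:
    """One row of the table: a reserved word 101 plus its correspondence information 102."""
    stages: Tuple[str, ...]                  # spoken element of each utterance stage
    actions: Tuple[Tuple[Action, ...], ...]  # operations started when each stage is reached

    def __post_init__(self):
        if len(self.stages) != len(self.actions):
            raise ValueError("each utterance stage needs its own operation list")


REGISTRATION_TABLE = [
    # Second-stage operations for "chiizu" are assumed, not stated in the text.
    ReservedWord(("hai", "chiizu"),
                 ((Action.AUTOFOCUS,), (Action.EXPOSE, Action.RECORD))),
    ReservedWord(("hai", "torimasu", "ookee"),
                 ((Action.AUTOFOCUS,), (Action.EXPOSE,), (Action.RECORD,))),
    ReservedWord(("hai", "torimasu", "mou ichimai"),
                 ((Action.AUTOFOCUS,), (Action.EXPOSE,),
                  (Action.RECORD, Action.EXPOSE, Action.RECORD))),
]
```

Keeping the operations as one tuple per stage makes the "at most three operations per stage" property easy to check and lets a stage carry several operations that run in succession.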

Next, the operation according to the present invention of the digital camera configured as above is described. FIG. 3 is a flowchart showing the operation in voice shutter mode, that is, when the voice shutter function is turned on.

In voice shutter mode, voice input processing from the microphone 8 is started, for example, when the user presses the shutter key (step SA1). The voice recognition unit 12 then sequentially recognizes each newly input syllable portion, and the voice pattern being recognized is accumulated (step SA2). The voice recognition unit 12 performs voice recognition as follows.

The reserved words 101 described above are set in the program memory 13 using a grammar like the one shown in FIG. 4, which takes "Hai, chiizu" as its example. The label "無" (none) denotes a silence model, and the grammar accepts the utterance whether or not there is silence before and after "Hai, chiizu" and after "Hai". For simplicity, the explanation here uses acoustic models (HMMs) whose unit is the syllable. The acoustic model may instead be a word HMM for "hai chiizu", or an HMM of a finer unit such as a phoneme, half-phoneme, or triphone.

Here, the silence model is assumed to be a one-state model and each syllable model a three-state model. Taking the syllable "ha" as an example, the model is trained so that its states are as follows:
First state of "ha": an HMM with a high probability of outputting the acoustic features of the consonant "h" of "ha".
Second state of "ha": an HMM with a high probability of outputting the acoustic features of the transition between the consonant and the vowel of "ha".
Third state of "ha": an HMM with a high probability of outputting the acoustic features of the vowel "a" of "ha".

The cumulative likelihood is calculated as shown in FIG. 5. For example, the cumulative likelihood of the first state of "i" at time N is obtained from the following two quantities:
A = (cumulative likelihood of the third state of "ha" at time N-1) × (output probability of the feature value at time N in the first state of "i") × (state transition probability of leaving "ha" from the third state of "ha")
B = (cumulative likelihood of the first state of "i" at time N-1) × (output probability of the feature value at time N in the first state of "i") × (state transition probability from the first state of "i" to itself, i.e. the loop transition)
A and B are compared, and the larger of the two is taken as the cumulative likelihood of the first state of "i" at time N.
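The comparison of A and B is one step of the Viterbi recursion. The sketch below shows that step for a simple left-to-right chain of states, working with log probabilities so that the products become sums. It is an illustration under simplifying assumptions: the function name `viterbi_step` and its parameters are hypothetical, and the optional silence states of the grammar in FIG. 4 are left out.

```python
import math

NEG_INF = float("-inf")


def viterbi_step(prev, log_out, log_loop, log_next):
    """Update the cumulative log likelihood of every state for one time step.

    prev[s]     : cumulative log likelihood of state s at time N-1
    log_out[s]  : log output probability of the time-N feature in state s
    log_loop[s] : log probability of staying in state s (loop transition)
    log_next[s] : log probability of moving from state s to state s+1
    """
    cur = []
    for s in range(len(prev)):
        # Candidate B: we were already in state s and looped.
        stay = prev[s] + log_loop[s]
        # Candidate A: we arrived from the preceding state.
        enter = prev[s - 1] + log_next[s - 1] if s > 0 else NEG_INF
        # Keep the more likely path, then account for the observed feature.
        cur.append(max(stay, enter) + log_out[s])
    return cur


# Two adjacent states: index 0 = third state of "ha", index 1 = first state of "i".
prev = [math.log(0.4), math.log(0.1)]
log_out = [math.log(0.2), math.log(0.5)]
log_loop = [math.log(0.6), math.log(0.7)]
log_next = [math.log(0.4), math.log(0.3)]
cur = viterbi_step(prev, log_out, log_loop, log_next)
```

With these made-up numbers, A = 0.4 × 0.5 × 0.4 = 0.08 and B = 0.1 × 0.5 × 0.7 = 0.035, so the first state of "i" takes the value of A.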

In this way, the cumulative likelihoods of all states are updated at every time step. For example, if time N is the instant just after "i" has been uttered and just before "chi" is uttered in the utterance "hai chiizu", the highest cumulative likelihood should belong either to the third state of "i" or to the silence ("無") model between the third state of "i" and the first state of "chi".

The problem, however, is what happens when an unrelated utterance is made or noise is picked up. Some state always has the highest cumulative likelihood, so the likelihood of the final state of "i" may become high by chance. It is therefore necessary to calculate the reliability of the utterance. The output probability is computed in every frame also for syllables that are not used in "hai chiizu" (for example "a", "u", and "n"), and the value of the model with the highest output probability in each frame is multiplied into a running product (with the Viterbi algorithm, logarithms are used, so the values are added instead). If the utterance really was "hai", the model with the highest likelihood in each frame is very likely to be one of the states of "ha" or "i", so the difference from the cumulative likelihood of "hai" will be small. The reliability is obtained on this basis. The reliability does not, of course, have to be calculated by this particular method.
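One way to turn this idea into a number is sketched below. The text does not give a formula, so the per-frame normalization, the function names `reliability` and `is_reliable`, and the threshold value are all assumptions made for illustration. The score is the gap between the cumulative log likelihood of the hypothesis and the sum of the best per-frame log output probabilities over all models; a gap close to zero means the hypothesis explains the audio about as well as any model could.

```python
def reliability(path_log_likelihood, frame_log_outputs):
    """Average per-frame gap between the hypothesis and the best unconstrained score.

    path_log_likelihood : cumulative log likelihood of the hypothesis (e.g. "hai")
    frame_log_outputs   : one dict per frame mapping every model state, including
                          syllables outside the reserved word, to its log output probability
    """
    best_sum = sum(max(frame.values()) for frame in frame_log_outputs)
    return (path_log_likelihood - best_sum) / len(frame_log_outputs)


def is_reliable(path_log_likelihood, frame_log_outputs, threshold=-1.0):
    # The threshold is a hypothetical tuning value, not taken from the patent.
    return reliability(path_log_likelihood, frame_log_outputs) >= threshold


frames = [
    {"ha-1": -1.0, "a-1": -3.0},   # frame 1: a state of "ha" scores best
    {"i-1": -1.5, "u-1": -0.5},    # frame 2: an unrelated syllable scores best
]
```

Because the cumulative likelihood also contains transition probabilities, the gap is never exactly comparable across utterances of different length, which is why this sketch divides by the number of frames.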

Using the above method, when the final state of "i", or the silence ("無") model between "i" and "chi", has the highest cumulative likelihood and the reliability is at or above a threshold, the voice recognition unit 12 sends the control unit 14 a signal indicating that the utterance of "hai" has just ended. Similarly, when the final state of "zu", or the silence model that follows it, has the highest cumulative likelihood and the reliability is at or above the threshold, it sends a signal indicating that the utterance of "hai chiizu" has ended.

Meanwhile, while the voice recognition unit 12 performs voice recognition, the control unit 14 accumulates the sequentially recognized voice pattern in its internal memory. Each time one of the above signals arrives from the voice recognition unit 12, it compares the voice pattern being recognized with the contents of the registration table T, identifies the candidate reserved words 101 and their utterance stage (step SA3), and updates the recognition status information that records them (step SA4). FIG. 6 shows the contents of the recognition status information 103 at particular points in time; FIG. 6(a) shows the contents at the point when the utterance has been recognized up to "Hai".

Next, if candidates for the reserved word 101 remain (YES in step SA5), the control unit 14 determines whether recognition has advanced to the next utterance stage (step SA6). If it has not yet advanced (NO in step SA6), the processing from step SA2 onward is repeated. Once it has advanced to the next utterance stage, that is, once it is certain that the previous utterance stage has ended (YES in step SA6), the control unit 14 performs, based on the recognition status information 103 at that point, the processing for carrying out the current-stage operation that corresponds to the remaining reserved word candidates (step SA7). That is, when the recognition status information 103 has the contents shown in FIG. 6(a) and the user has spoken as far as "Hai", automatic focus adjustment is started. Thereafter, until the current utterance stage is the final stage of the reserved word (candidate) being recognized (NO in step SA8), the processing returns to step SA2 and the above steps are repeated.

After that, when the recognition status information 103 has, for example, the contents shown in FIG. 6(b) and the user has spoken as far as "Hai, torimasu" ("Yes, taking the picture"), exposure is performed in step SA7. When the recognition status information 103 then has, for example, the contents shown in FIG. 6(c) and the user has spoken as far as "Hai, torimasu, ookee" ("... OK"), recording is performed in step SA7. If, while the above processing is repeated, no reserved word candidates remain (NO in step SA5), or the current utterance stage is the final stage of the reserved word 101 being recognized (YES in step SA8), the contents of the recognition status information 103 are cleared (step SA9). The processing then returns to step SA2 and starts over from the beginning.
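The control flow of steps SA3 to SA9 can be sketched as a small prefix-matching state machine. This is an illustration only: the class name `VoiceShutterController`, the romanized stage strings, the operation names, and the operations assigned to "chiizu" are assumptions, and the sketch relies on the property (true of the examples in the text) that reserved words sharing a prefix also share the operations of those stages.

```python
TABLE = {
    # reserved word (one element per utterance stage) -> operations per stage
    ("hai", "chiizu"): (("autofocus",), ("expose", "record")),
    ("hai", "torimasu", "ookee"): (("autofocus",), ("expose",), ("record",)),
    ("hai", "torimasu", "mou ichimai"):
        (("autofocus",), ("expose",), ("record", "expose", "record")),
}


class VoiceShutterController:
    def __init__(self, table):
        self.table = table
        self.heard = ()          # recognition status: stages recognized so far

    def clear(self):             # step SA9
        self.heard = ()

    def on_stage_recognized(self, element):
        """Called when the recognizer signals that one utterance stage has ended.

        Returns the operations to start now (empty if nothing matches).
        """
        heard = self.heard + (element,)
        # Step SA3: reserved words whose leading stages match what was heard.
        candidates = [w for w in self.table if w[:len(heard)] == heard]
        if not candidates:                       # step SA5: NO
            self.clear()
            return ()
        self.heard = heard                       # step SA4
        # Step SA7: operations of the current stage (shared by all candidates).
        actions = self.table[candidates[0]][len(heard) - 1]
        if any(len(w) == len(heard) for w in candidates):   # step SA8: YES
            self.clear()
        return actions
```

Feeding the controller "hai", "torimasu", and "ookee" in turn starts automatic focus adjustment, exposure, and recording as each stage ends, rather than after the whole phrase.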

As described above, in voice shutter mode a pre-registered reserved word (command word) is recognized by voice and a series of photographing operations is performed. Each time the recognition stage reaches one of the utterance stages set for the reserved word, partway through recognition and before the whole reserved word has been recognized, the corresponding operation needed for photographing is started. This almost eliminates the time lag between the user's utterance of the reserved word and the exposure, which is the actual photographing operation.

Moreover, because a plurality of reserved words whose series of photographing operations differ are registered, the user can call for photographing operations with different overall content by choosing among them. In addition, the "chiizu" and "torimasu" portions, that is, the portions that instruct exposure immediately after automatic focus adjustment, are recognized using acoustic models that include the noise component that accompanies automatic focus adjustment. A high recognition rate can therefore be maintained for those portions even if they are spoken while automatic focus adjustment is in progress.
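The switch between acoustic models can be sketched as a selector that the recognizer consults before decoding each utterance stage. Everything named here is hypothetical: the class `AcousticModelSelector`, the operation name "autofocus", and the use of plain strings as stand-ins for trained HMMs. The sketch assumes that a clean model exists for every element and a noise-trained model only for the elements spoken right after the stage that instructs automatic focus adjustment.

```python
class AcousticModelSelector:
    """Chooses which HMM the recognizer uses for the next utterance stage."""

    def __init__(self, clean_models, af_noise_models):
        self.clean_models = clean_models        # trained without AF drive noise
        self.af_noise_models = af_noise_models  # trained with AF drive noise superimposed
        self.af_commanded = False

    def on_stage_recognized(self, started_operations):
        # Use the noise-trained models only for the stage that follows the one
        # whose recognition started automatic focus adjustment.
        self.af_commanded = "autofocus" in started_operations

    def model_for(self, element):
        if self.af_commanded and element in self.af_noise_models:
            return self.af_noise_models[element]
        return self.clean_models[element]
```

A recognizer would call `on_stage_recognized` with the operations just started and then ask `model_for` for the model of each candidate next element.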

FIG. 7 is a flowchart showing the processing procedure of the control unit 14 in the command registration mode provided in advance in the digital camera.

When the command registration mode is set, the control unit 14 displays a command input method selection screen on the LCD screen of the image display unit 7 and has the user choose an input method (step SB1). In this embodiment, two input methods are available: "Select" and "Free". If the user chooses "Select" (YES in step SB2), the user is asked to pick, as a command element, one of the words that make up the utterance stages of the already registered reserved words (see FIG. 2) (step SB3). If the user chooses "Free" (NO in step SB2), the user is asked to enter any word as a command element by a predetermined key operation (step SB4). An arbitrary word is entered, for example, by displaying a character selection screen such as the Japanese syllabary (gojūon) on the LCD screen of the image display unit 7 and having the user select characters by key operation.

Next, the operations that can be specified for photographing, which in this embodiment are the three operations "automatic focus adjustment", "exposure", and "recording", are displayed, and the user is asked to select from them one or more operations to associate with the selected or entered command element (step SB5). The number of utterance stages is then incremented (step SB6). If there is no instruction to end command input (NO in step SB7), the processing returns to step SB1, and the same processing as above has the user select or enter the command element of the next utterance stage and select the operations that correspond to it.

When, after one or more operations have been selected (step SB5), there is an instruction to end command input (YES in step SB7), a "、" (utterance stage delimiter) is automatically inserted between the selected or entered command elements to generate a new reserved word (step SB8). Although not shown in the figure, if the user selected or entered only one command element, that command element is used as the reserved word as it is. The generated reserved word is then added to the registration table T, which is thereby updated (step SB9).
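Steps SB8 and SB9 amount to joining the collected command elements with the stage delimiter and storing the result together with its per-stage operations. The sketch below assumes a dictionary as the table and uses hypothetical names (`build_reserved_word`, `register`, `ALLOWED_OPERATIONS`) and example elements; it is not the patent's storage format.

```python
ALLOWED_OPERATIONS = {"autofocus", "expose", "record"}


def build_reserved_word(elements, operations_per_stage, delimiter="、"):
    """Step SB8: combine command elements into one reserved word."""
    if not elements:
        raise ValueError("at least one command element is required")
    if len(elements) != len(operations_per_stage):
        raise ValueError("each command element needs its own operation list")
    for ops in operations_per_stage:
        if not ops or not set(ops) <= ALLOWED_OPERATIONS:
            raise ValueError("each stage needs one or more known operations")
    # A single element is used as the reserved word unchanged, because
    # join() inserts the delimiter only between elements.
    word = delimiter.join(elements)
    return word, tuple(tuple(ops) for ops in operations_per_stage)


def register(table, elements, operations_per_stage):
    """Step SB9: add the new reserved word to the registration table."""
    word, correspondence = build_reserved_word(elements, operations_per_stage)
    table[word] = correspondence
    return word


table = {}
new_word = register(table, ["ハイ", "パチリ"], [["autofocus"], ["expose", "record"]])
```

The number of utterance stages never has to be stored separately in this sketch, because it equals the number of collected elements.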

Therefore, by selecting the command registration mode, the user can freely add command words that can be used when shooting in the voice shutter mode described above.

Although the above description concerns the case where the present invention is applied to a dedicated digital camera, the invention is not limited to this and can also be applied to other camera devices, such as camera-equipped mobile phones and silver halide film cameras.

A block diagram showing the schematic configuration of a digital camera according to the present invention.
A conceptual diagram showing the data stored in the program memory.
A flowchart showing the operation in the voice shutter mode.
A diagram showing the grammar of the recorded reserved words.
A diagram showing the method of calculating the cumulative likelihood during voice recognition.
A diagram showing an example of changes in the recognition status information during voice recognition.
A flowchart showing the processing procedure of the control unit in the command registration mode.

Explanation of symbols

1 Key input unit
2 Focusing unit
3 Exposure unit
7 Image display unit
8 Microphone
9 Voice input unit
10 A/D conversion unit
11 Work memory
12 Voice recognition unit
13 Program memory
14 Control unit
101 Reserved word
102 Correspondence information
T Registration table

Claims

A camera device having an automatic focus adjustment function that performs a series of photographing operations consisting of operations in a plurality of stages, triggered by voice input of a predetermined command word, the camera device comprising:
voice recognition means for recognizing input voice;
command word storage means for storing the predetermined command word;
correspondence information storage means for storing correspondence information indicating a correspondence relationship between a plurality of utterance stages set in the command word stored in the command word storage means and the operation of each stage in the series of photographing operations; and
control means for sequentially starting, each time the recognition stage of the voice recognized by the voice recognition means reaches each utterance stage of the command word, the operation of each stage in the series of photographing operations that corresponds to that utterance stage as indicated by the stage information stored in the correspondence information storage means,
wherein the operations in the plurality of stages in the series of photographing operations include a focusing operation by the automatic focus adjustment function and an exposure operation, and the voice recognition means has an acoustic model that includes a noise component generated during the automatic focus adjustment operation in the photographing operation and an acoustic model that does not include it, and, in response to recognition of the command word instructing automatic focus adjustment in the series of photographing operations, changes from the acoustic model that does not include the noise component to the acoustic model that includes it and recognizes the voice of the utterance stage following the utterance stage of the command word that instructed automatic focus adjustment.

The camera device according to claim 1, wherein the operation of any one of the stages in the series of photographing operations includes a plurality of operations performed in succession.

The camera device according to claim 1, wherein a plurality of command words are stored in the command word storage means, and the correspondence information storage means stores a plurality of pieces of correspondence information, each indicating a correspondence relationship between the utterance stages set in a respective one of the plurality of command words stored in the command word storage means and the operation of each stage in the series of photographing operations.

The camera device according to claim 1, further comprising:
registration means for storing, in the command word storage means as a new command word, a word composed of voice recognized by the voice recognition means; and
generation means for generating new correspondence information indicating a correspondence relationship between a plurality of utterance stages in the new command word stored in the command word storage means by the registration means and the operation of each stage in the series of photographing operations, and for storing the new correspondence information in the correspondence information storage means.

A photographing method in a camera device having an automatic focus adjustment function that performs a series of photographing operations consisting of operations in a plurality of stages, triggered by voice input of a predetermined command word, the method comprising:
a step of sequentially recognizing input voice; and
a step of starting in order, each time the recognition stage of the recognized voice reaches each of a plurality of utterance stages set in the command word, the operation of each stage in the series of photographing operations associated with that utterance stage,
wherein the operations in the plurality of stages in the series of photographing operations include a focusing operation by the automatic focus adjustment function and an exposure operation, and the step of sequentially recognizing voice, in response to recognition of the command word instructing automatic focus adjustment in the series of photographing operations, changes from an acoustic model that does not include a noise component to an acoustic model that includes the noise component and recognizes the voice of the utterance stage following the utterance stage of the command word that instructed automatic focus adjustment.

A program for causing a computer included in a camera device having an automatic focus adjustment function that performs a series of photographing operations consisting of operations in a plurality of stages, triggered by voice input of a predetermined command word, to execute:
a process of causing voice recognition means to sequentially recognize input voice; and
a process of causing each part of the device to start in order, each time the recognition stage of the voice recognized by the voice recognition means reaches each of a plurality of utterance stages set in the command word, the operation of each stage in the series of photographing operations associated with that utterance stage,
wherein the operations in the plurality of stages in the series of photographing operations include a focusing operation by the automatic focus adjustment function and an exposure operation, and the process of causing the voice recognition means to sequentially recognize voice, in response to recognition of the command word instructing automatic focus adjustment in the series of photographing operations, changes from an acoustic model that does not include a noise component to an acoustic model that includes the noise component and causes recognition of the voice of the utterance stage following the utterance stage of the command word that instructed automatic focus adjustment.