JP2015055653A

JP2015055653A - Speech recognition device and method and electronic apparatus

Info

Publication number: JP2015055653A
Application number: JP2013187147A
Authority: JP
Inventors: 文仁倍賞; Fumihito Baisho
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2013-09-10
Filing date: 2013-09-10
Publication date: 2015-03-23

Abstract

PROBLEM TO BE SOLVED: To prevent a reduction in the recognition rate of speech recognition processing regardless of differences in conditions such as the contents of choices or voice quality. SOLUTION: A speech recognition device includes: an acoustic model storage unit that stores a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distribution states of frequency components of a plurality of phonemes used in a predetermined language; a choice list storage unit that stores a choice list including choice data and acoustic model specification information; an acoustic model selection unit that selects the set of acoustic models specified by the acoustic model specification information; a signal processing unit that extracts frequency components of a voice signal to generate a feature pattern; and a coincidence detection unit that performs speech recognition processing by comparing a feature pattern generated from at least a part of the voice signal with the acoustic models that, within the selected set, correspond to at least a part of the choice data, and outputs the speech recognition result.

Description

The present invention relates to a speech recognition device and a speech recognition method that recognize speech uttered by a speaker and perform a response or processing corresponding to the speech recognition result. The invention further relates to an electronic apparatus and the like equipped with such a speech recognition device.

For example, a speech recognition device having a voice playback function and a speech recognition function includes a host interface that communicates with a host system, and generates speech upon receiving a command and voice playback data from the host system. The speech recognition device also receives a command and either a choice list or choice list designation information from the host system, performs speech recognition by detecting, among the plurality of choices included in the choice list, the choice closest to the uttered speech, and transmits the speech recognition result to the host system.

A speech recognition device usually performs speech recognition processing by pattern matching between an acoustic model (also called a "standard pattern" or "template"), which is prepared in a storage unit such as a memory on the basis of prerecorded speech, and a feature pattern obtained by analyzing the speech uttered by the speaker.

Depending on the speaker, however, the recognition rate of the speech recognition processing may be poor when a prepared acoustic model is used. In such a case, a speaker adaptation function that trains the acoustic model to adapt to the speaker's voice can raise the recognition rate and reduce misrecognition.

In general, however, speech recognition processing uses the same acoustic model for unspecified speakers regardless of the contents of the choices included in the choice list, so the recognition rate may fall depending on conditions such as the contents of the choices or the speaker's voice quality. It is therefore conceivable to perform speech recognition processing using a plurality of acoustic models.

As related art, Patent Document 1 discloses a speech recognition system intended to improve recognition accuracy and to simplify the operations needed to obtain a correct recognition result. This speech recognition system includes: speech storage means that stores speech uttered by a speaker; first speech recognition means that performs speech recognition processing on the speech stored in the speech storage means using a first recognition dictionary; second speech recognition means that performs speech recognition processing on the stored speech using a second recognition dictionary different from the first recognition dictionary; and recognition result determination means that determines a recognition candidate corresponding to the stored speech on the basis of the recognition results of the first and second speech recognition means.

Patent Document 1: JP 2012-168349 A (paragraphs 0005-0006 and 0011, FIG. 1)

In a speech recognition system that thus processes the same speech with first and second recognition dictionaries and first and second speech recognition means, however, it is difficult to judge which recognition result is correct when the result obtained with the first recognition dictionary differs from the result obtained with the second recognition dictionary.

Patent Document 1 also discloses that, when the speech recognition processing by the second speech recognition means narrows the possibilities down to a plurality of recognition candidates, a first recognition dictionary corresponding to those candidates is created, and the speech recognition processing of the first speech recognition means using that first recognition dictionary extracts the candidate closest to the input speech. However, even if the input speech could be recognized correctly using the first recognition dictionary, a correct recognition result is unlikely to be obtained when it is difficult to recognize the input speech correctly using the second recognition dictionary.

In view of the above, one object of the present invention is to prevent a drop in the recognition rate, even when conditions such as the contents of the choices or the speaker's voice quality differ, in speech recognition processing that detects a choice close to the uttered speech from among a plurality of choices included in a choice list.

To solve the above problem, a speech recognition device according to a first aspect of the present invention includes: an acoustic model storage unit that stores a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language; a choice list storage unit that stores a choice list including choice data representing choices and acoustic model specification information specifying one set of acoustic models among the plurality of sets; an acoustic model selection unit that selects the set of acoustic models specified by the acoustic model specification information; a signal processing unit that extracts the frequency components of an input voice signal and generates a feature pattern representing the distribution state of those frequency components; and a coincidence detection unit that performs speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models that, within the selected set, correspond to at least a part of the choice data, and outputs a speech recognition result.

A speech recognition method according to the first aspect of the present invention includes: step (a) of reading choice data and acoustic model specification information from a choice list storage unit that stores a choice list including the choice data, which represents choices, and the acoustic model specification information, which specifies a set of acoustic models; step (b) of selecting, from among a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language, the set of acoustic models specified by the acoustic model specification information; step (c) of extracting the frequency components of an input voice signal and generating a feature pattern representing the distribution state of those frequency components; and step (d) of performing speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models that, within the selected set, correspond to at least a part of the choice data read in step (a), and outputting a speech recognition result.

According to the first aspect of the present invention, a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language are prepared, and the set specified by the acoustic model specification information included in the choice list is selected from among them. Speech recognition processing is thus performed using acoustic models set according to conditions such as the contents of the choice data included in the choice list, so the recognition rate of the speech recognition processing can be improved.
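The selection mechanism of the first aspect can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the names `ChoiceList`, `recognize`, and `MODEL_STORE`, the representation of an acoustic model as a single template vector per phoneme, and the squared-distance comparison are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass
class ChoiceList:
    model_set_id: str  # acoustic model specification information (hypothetical form)
    choices: list      # choice data: each choice is a list of phoneme labels


def distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recognize(features, choice_list, model_store):
    """Return the choice whose phoneme templates are closest to `features`."""
    # Acoustic model selection: pick the one set named by the choice list.
    models = model_store[choice_list.model_set_id]
    best, best_score = None, float("inf")
    for phonemes in choice_list.choices:
        if len(phonemes) != len(features):
            continue  # toy assumption: exactly one feature vector per phoneme
        score = sum(distance(f, models[p]) for f, p in zip(features, phonemes))
        if score < best_score:
            best, best_score = phonemes, score
    return best


# Two hypothetical sets of acoustic models collected under different conditions.
MODEL_STORE = {
    "general": {"a": [1.0, 0.0], "i": [0.0, 1.0]},
    "numbers": {"a": [0.8, 0.2], "i": [0.2, 0.8]},
}
number_list = ChoiceList(model_set_id="numbers", choices=[["a", "i"], ["i", "a"]])
result = recognize([[0.7, 0.3], [0.1, 0.9]], number_list, MODEL_STORE)
print(result)
```

Because the set identifier travels with the choice list, switching to another choice list can switch the acoustic models without any change to the matching code.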

In the speech recognition device according to the first aspect of the present invention, the plurality of sets of acoustic models may include a set of acoustic models obtained by collecting the distribution states of the frequency components of phonemes contained in terms of unspecified types used in the predetermined language, and a set of acoustic models obtained by collecting the distribution states of the frequency components of phonemes contained in terms of a specific type used in the predetermined language. In that case, speech recognition processing can be performed using acoustic models set according to the type of term.

A speech recognition device according to a second aspect of the present invention includes: an acoustic model storage unit that stores a plurality of sets of acoustic models obtained by collecting, for a plurality of different types of voice quality, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language; a choice list storage unit that stores a choice list including choice data representing choices; a signal processing unit that extracts the frequency components of an input voice signal and generates a feature pattern representing the distribution state of those frequency components; a voice quality determination unit that determines the type of voice quality of the voice signal; an acoustic model selection unit that selects, from among the plurality of sets of acoustic models, the set corresponding to the type of voice quality determined by the voice quality determination unit; and a coincidence detection unit that performs speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models that, within the selected set, correspond to at least a part of the choice data, and outputs a speech recognition result.

A speech recognition method according to the second aspect of the present invention includes: step (a) of reading choice data from a choice list storage unit that stores a choice list including the choice data, which represents choices; step (b) of extracting the frequency components of an input voice signal and generating a feature pattern representing the distribution state of those frequency components; step (c) of determining the type of voice quality of the voice signal; step (d) of selecting, from among a plurality of sets of acoustic models obtained by collecting, for a plurality of different types of voice quality, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language, the set corresponding to the type of voice quality determined in step (c); and step (e) of performing speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models that, within the selected set, correspond to at least a part of the choice data read in step (a), and outputting a speech recognition result.

According to the second aspect of the present invention, a plurality of sets of acoustic models obtained by collecting, for a plurality of different types of voice quality, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language are prepared, and the set corresponding to the type of voice quality determined from the input voice signal is selected from among them. Speech recognition processing is thus performed using acoustic models set according to the type of the speaker's voice quality, so the recognition rate of the speech recognition processing can be improved.
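As a rough sketch of the second aspect, the voice quality type can serve as the key that selects a set of acoustic models. The passage above does not say how the voice quality is determined, so the fundamental-frequency rule below, its thresholds, the three voice quality labels, and the set names are all illustrative assumptions.

```python
# Hypothetical mapping from voice quality type to a stored set of acoustic models.
MODEL_SETS = {
    "male": "models_male",
    "female": "models_female",
    "child": "models_child",
}


def determine_voice_quality(f0_hz):
    """Classify a voice by fundamental frequency (thresholds are assumptions)."""
    if f0_hz < 165.0:
        return "male"
    if f0_hz < 255.0:
        return "female"
    return "child"


def select_model_set(f0_hz):
    """Voice quality determination followed by acoustic model selection."""
    return MODEL_SETS[determine_voice_quality(f0_hz)]


print(select_model_set(120.0))
print(select_model_set(300.0))
```

Any other determination method could replace `determine_voice_quality` as long as it returns one of the keys of `MODEL_SETS`.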

In the speech recognition device according to the second aspect of the present invention, the choice list storage unit may store a choice list that further includes control information permitting or prohibiting speech recognition processing for a specific type of voice quality, and the coincidence detection unit may determine whether to start speech recognition processing for the choice list on the basis of the control information and the type of voice quality determined by the voice quality determination unit. In that case, the security level of operations such as operating an electronic apparatus by speech recognition can be controlled. For example, prohibiting children from setting the temperature of an air conditioner prevents dangerous operation of the air conditioner.
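The control information can be pictured as a per-list set of prohibited voice quality types. This is a minimal sketch under that assumption; the field name `prohibited_voice_qualities`, the labels "adult" and "child", and the example lists are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ChoiceList:
    choices: list
    # Control information: voice quality types for which recognition is prohibited.
    prohibited_voice_qualities: set = field(default_factory=set)


def may_start_recognition(choice_list, voice_quality):
    """Decide whether speech recognition may start for this choice list."""
    return voice_quality not in choice_list.prohibited_voice_qualities


# Hypothetical lists for an air conditioner: children may not set the temperature.
temperature_list = ChoiceList(["raise", "lower"], {"child"})
power_list = ChoiceList(["on", "off"])

print(may_start_recognition(temperature_list, "child"))
print(may_start_recognition(temperature_list, "adult"))
```

Storing the restriction in the choice list, rather than in the matching code, lets each list carry its own security level.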

A speech recognition device according to a third aspect of the present invention includes: an acoustic model storage unit that stores a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language; a choice list storage unit that stores a choice list including choice data representing choices; a signal processing unit that extracts the frequency components of an input voice signal and generates a feature pattern representing the distribution state of those frequency components; a coincidence detection unit that performs speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models that, within the plurality of sets, correspond to at least a part of the choice data, and outputs a plurality of speech recognition results and a plurality of recognition certainties corresponding to the plurality of sets of acoustic models; and a recognition certainty determination unit that outputs, as the final speech recognition result, the speech recognition result for which the highest of the plurality of recognition certainties was obtained.

A speech recognition method according to the third aspect of the present invention includes: step (a) of reading choice data from a choice list storage unit that stores a choice list including the choice data, which represents choices; step (b) of extracting the frequency components of an input voice signal and generating a feature pattern representing the distribution state of those frequency components; step (c) of performing speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models that, within a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language, correspond to at least a part of the choice data read in step (a), and outputting a plurality of speech recognition results and a plurality of recognition certainties corresponding to the plurality of sets of acoustic models; and step (d) of outputting, as the final speech recognition result, the speech recognition result for which the highest of the plurality of recognition certainties was obtained.

According to the third aspect of the present invention, a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distribution states of the frequency components of a plurality of phonemes used in a predetermined language are prepared, and speech recognition processing is performed simultaneously using the plurality of sets, yielding a plurality of speech recognition results and a plurality of recognition certainties corresponding to the sets. The speech recognition result for which the highest recognition certainty was obtained is then output as the final speech recognition result. Because the final result is the one with the highest recognition certainty among the simultaneous recognition runs that use sets of acoustic models set for a plurality of different conditions, the recognition rate of the speech recognition processing can be improved.
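The certainty-based arbitration of the third aspect can be sketched as follows. The certainty measure `1 / (1 + distance)`, the single-vector feature pattern, and all names are assumptions for illustration, and the sets are evaluated one after another here rather than simultaneously.

```python
def match(features, choices, models):
    """Return (best choice, recognition certainty) for one set of acoustic models."""
    scored = []
    for choice in choices:
        d = sum((f - t) ** 2 for f, t in zip(features, models[choice]))
        scored.append((1.0 / (1.0 + d), choice))  # toy certainty in (0, 1]
    certainty, choice = max(scored)
    return choice, certainty


def recognize_with_all_sets(features, choices, model_sets):
    """Run matching against every set and keep the most certain result."""
    results = []
    for name, models in model_sets.items():
        choice, certainty = match(features, choices, models)
        results.append((certainty, name, choice))
    certainty, name, choice = max(results, key=lambda r: r[0])
    return choice, certainty, name


# Hypothetical template vectors for two sets of acoustic models.
MODEL_SETS = {
    "male": {"on": [1.0, 0.0], "off": [0.0, 1.0]},
    "female": {"on": [2.0, 0.0], "off": [0.0, 2.0]},
}
choice, certainty, used_set = recognize_with_all_sets([1.9, 0.1], ["on", "off"], MODEL_SETS)
print(choice, used_set)
```

Unlike the selection-based aspects, nothing has to decide in advance which set applies; the certainties themselves arbitrate between the sets.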

In the speech recognition device according to the third aspect of the present invention, the plurality of sets of acoustic models may include a set of acoustic models obtained by collecting the distribution states of the frequency components of phonemes contained in terms of unspecified types used in the predetermined language, and a set of acoustic models obtained by collecting the distribution states of the frequency components of phonemes contained in terms of a specific type used in the predetermined language. In that case, speech recognition processing can be performed using acoustic models set according to the type of term.

Alternatively, the plurality of sets of acoustic models may consist of a plurality of acoustic models obtained by collecting, for a plurality of different types of voice quality, the distribution states of the frequency components of a plurality of phonemes used in the predetermined language. In that case, speech recognition processing can be performed using acoustic models set according to the type of voice quality.

An electronic apparatus according to one aspect of the present invention includes any of the speech recognition devices described above. This makes it possible to operate, by speech recognition, electronic apparatuses such as home appliances, housing equipment, in-vehicle devices (such as navigation devices), vending machines, and mobile terminals.

FIG. 1 is a diagram of an electronic apparatus equipped with a speech recognition device according to a first embodiment of the present invention.
FIG. 2 is a diagram of an electronic apparatus equipped with a speech recognition device according to a modification of the first embodiment.
FIG. 3 is a diagram showing a speech recognition method performed by the speech recognition device shown in FIG. 1 or FIG. 2.
FIG. 4 is a diagram showing an example of a choice list stored in the choice list storage unit shown in FIG. 1.
FIG. 5 is a diagram of an electronic apparatus equipped with a speech recognition device according to a second embodiment of the present invention.
FIG. 6 is a diagram showing part of the configuration of a speech recognition device according to a modification of the second embodiment.
FIG. 7 is a diagram showing a speech recognition method performed by the speech recognition device shown in FIG. 5 or FIG. 6.
FIG. 8 is a diagram of an electronic apparatus equipped with a speech recognition device according to a third embodiment of the present invention.
FIG. 9 is a diagram showing a speech recognition method performed by the speech recognition device shown in FIG. 8.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings. The same components are denoted by the same reference numerals, and redundant description is omitted.

<Speech Recognition Device According to First Embodiment>

FIG. 1 is a block diagram showing part of the configuration of an electronic apparatus equipped with the speech recognition device according to the first embodiment of the present invention. The electronic apparatus is, for example, a home appliance, housing equipment, an in-vehicle device (such as a navigation device), a vending machine, or a mobile terminal, and FIG. 1 shows only the parts related to the speech recognition function and the voice playback function.

As shown in FIG. 1, the electronic apparatus includes a speech recognition device 100 and a control unit 200 of a host system. In accordance with commands and the like received from the control unit 200, the speech recognition device 100 issues a question or message to the user, recognizes the user's speech, and transmits the speech recognition result to the control unit 200.

The speech recognition device 100 includes a command analysis unit 10, a voice input unit 20, an A/D converter 30, a speech recognition processing unit 40, a memory 50, a voice playback unit 60, a D/A converter 70, and a voice output unit 80. Some of the components from the command analysis unit 10 through the voice output unit 80 may be built into a semiconductor integrated circuit device.

The control unit 200 includes a host CPU (central processing unit) 91 and a storage unit 92. The host CPU 91 operates on the basis of software (a speech recognition control program) recorded on a recording medium of the storage unit 92. The recording medium may be a hard disk, a flexible disk, an MO, an MT, a CD-ROM, a DVD-ROM, or the like.

Following a scenario preset in the speech recognition control program, the host CPU 91 transmits a voice playback start command and voice playback data to the speech recognition device 100, thereby causing it to perform a voice playback operation. The host CPU 91 also transmits a speech recognition start command and choice list designation information to the speech recognition device 100, thereby causing it to perform a speech recognition operation.

In the speech recognition device 100, the command analysis unit 10 analyzes commands transmitted from the host CPU 91 and, according to those commands, can control the voice playback operation and the speech recognition operation either independently or as a unit. For example, performing the voice playback operation and the speech recognition operation along a scenario preset in the speech recognition control program limits the number of choices (words or sentences) that are candidates for speech recognition, which can improve the recognition rate of the speech recognition processing.

The voice input unit 20 includes a microphone that converts sound into an electrical signal (voice signal), an amplifier that amplifies the voice signal output from the microphone, and a low-pass filter that limits the band of the amplified voice signal. The A/D converter 30 samples the analog voice signal output from the voice input unit 20 and converts it into a digital voice signal (voice data). For example, the audio frequency band of the voice data is 12 kHz and the bit depth is 16 bits.

The speech recognition processing unit 40 is implemented by a CPU and software, a digital circuit, or an analog circuit, and includes a signal processing unit 41, an acoustic model selection unit 42, and a coincidence detection unit 43. The memory 50 is implemented by, for example, a ROM (read-only memory) or a flash memory, and includes an acoustic model storage unit 51 and a choice list storage unit 52.

The signal processing unit 41 applies a Fourier transform to the input voice signal to extract its frequency components and generates a feature pattern representing the distribution state of those frequency components. The generated feature pattern is output to the coincidence detection unit 43. In addition, when the level of the input voice signal exceeds a predetermined value, the signal processing unit 41 activates a voice detection signal and outputs it to the coincidence detection unit 43 and the host CPU 91. This makes it possible to determine whether the user has made a request or given an answer.
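The voice detection signal can be illustrated with a simple level check. The passage does not say how the signal level is measured, so the use of the RMS value over a short block of samples, the threshold of 1000, and the 24 kHz sampling rate in the example are assumptions.

```python
import math


def rms_level(samples):
    """Root-mean-square level of a block of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def voice_detected(samples, threshold):
    """Activate the voice detection signal when the level exceeds the threshold."""
    return rms_level(samples) > threshold


# 16-bit style sample values: silence and a 440 Hz tone sampled at an assumed 24 kHz.
silence = [0] * 160
tone = [8000 * math.sin(2 * math.pi * 440 * n / 24000) for n in range(160)]

print(voice_detected(silence, threshold=1000))
print(voice_detected(tone, threshold=1000))
```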

Here, an example of a method for obtaining a feature pattern from a voice signal will be described. The signal processing unit 41 first filters the input voice signal to emphasize its high-frequency components. Next, the signal processing unit 41 applies a Hamming window to the voice waveform represented by the voice signal, dividing the time-series voice signal at predetermined time intervals into a plurality of frames. The signal processing unit 41 then extracts a plurality of frequency components by applying a Fourier transform to the voice signal of each frame. Because each frequency component is a complex number, the signal processing unit 41 obtains the absolute value of each frequency component.

The signal processing unit 41 multiplies the absolute values of these frequency components by frequency-domain windows defined on the mel scale (a perceptual scale of pitch) and integrates the result, obtaining one numerical value per window. The signal processing unit 41 then takes the logarithm of each of those values and applies a discrete cosine transform to the logarithmic values. Thus, if there are 20 frequency-domain windows, 20 numerical values are obtained.

Among the numerical values obtained in this way, the lower-order ones (for example, 12 values) are called MFCCs (mel-frequency cepstral coefficients). The signal processing unit 41 calculates the MFCCs for each frame and concatenates the MFCCs of a plurality of frames in accordance with an HMM (hidden Markov model). The concatenated MFCCs correspond to the feature pattern and are represented as a point in a multidimensional space (for example, a 12-dimensional space).
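The per-frame steps described above (high-frequency emphasis, Hamming window, Fourier transform, absolute value, mel-scale windows, logarithm, discrete cosine transform, keeping the low-order values) can be sketched in plain Python as follows. Several details are assumptions rather than values from the description: the pre-emphasis coefficient 0.97, the triangular window shape, the common mel formula, the 24 kHz sampling rate, and the 256-sample frame. The HMM-based concatenation of frames is not shown.

```python
import cmath
import math


def pre_emphasis(x, alpha=0.97):
    # High-frequency emphasis filter; alpha is an assumed value.
    return [x[0]] + [x[n] - alpha * x[n - 1] for n in range(1, len(x))]


def hamming(n_len):
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_len - 1))
            for n in range(n_len)]


def magnitude_spectrum(frame):
    # Naive DFT; returns absolute values for bins 0 .. N/2.
    n_len = len(frame)
    return [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_len)
                    for n in range(n_len)))
            for k in range(n_len // 2 + 1)]


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_bins, sample_rate):
    # Triangular windows whose edges are equally spaced on the mel scale.
    top = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(top * i / (n_filters + 1)) for i in range(n_filters + 2)]
    bin_hz = [k * (sample_rate / 2) / (n_bins - 1) for k in range(n_bins)]
    bank = []
    for i in range(1, n_filters + 1):
        lo, mid, hi = edges[i - 1], edges[i], edges[i + 1]
        row = []
        for f in bin_hz:
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def dct2(values):
    # Unnormalized DCT-II.
    n_len = len(values)
    return [sum(v * math.cos(math.pi * k * (n + 0.5) / n_len)
                for n, v in enumerate(values))
            for k in range(n_len)]


def mfcc_frame(frame, sample_rate, n_filters=20, n_coeffs=12):
    emphasized = pre_emphasis(frame)
    windowed = [s * w for s, w in zip(emphasized, hamming(len(frame)))]
    mags = magnitude_spectrum(windowed)
    bank = mel_filterbank(n_filters, len(mags), sample_rate)
    energies = [sum(w * m for w, m in zip(row, mags)) for row in bank]
    log_energies = [math.log(e + 1e-10) for e in energies]  # floor avoids log(0)
    return dct2(log_energies)[:n_coeffs]  # keep the low-order values


# Hypothetical test frame: 440 Hz sine, 256 samples at an assumed 24 kHz rate.
frame = [math.sin(2 * math.pi * 440 * n / 24000) for n in range(256)]
coeffs = mfcc_frame(frame, sample_rate=24000)
print(len(coeffs))  # 12
```

With 20 windows the transform yields 20 values, of which the first 12 are kept, matching the example figures in the text. Whether the zeroth coefficient is counted among the 12 is not stated in the description; this sketch simply keeps the first 12.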

Here, a "phoneme" means a sound element that is regarded as the same sound in a given language. The following description assumes that the language is Japanese. The phonemes of Japanese are the vowels "a", "i", "u", "e", and "o"; consonants such as "k", "s", "t", and "n"; the semivowels "j" and "w"; and the special morae "N", "Q", and "H".

The acoustic model storage unit 51 stores a plurality of sets of acoustic models. An acoustic model here represents the distribution of frequency components of a phoneme used in a predetermined language; in other words, it corresponds to the feature pattern of a phoneme. Different phonemes have different frequency components and therefore different acoustic models. Collecting acoustic models for a plurality of phonemes under a given condition yields one set of acoustic models. Collecting acoustic models under a plurality of different conditions therefore yields a plurality of sets of acoustic models. For example, when the sets of acoustic models are organized by type of term, the acoustic model storage unit 51 stores, for the different types of terms, a general-purpose set of acoustic models 0 and at least one set of acoustic models for specific terms (FIG. 1 shows a plurality of such sets, acoustic models 1, 2, ...).

The general-purpose set of acoustic models 0 contains a plurality of acoustic models obtained by collecting the distributions of frequency components of a plurality of phonemes contained in terms of unspecified types used in the predetermined language. Each set of acoustic models 1, 2, ... for specific terms contains a plurality of acoustic models obtained by collecting the distributions of frequency components of a plurality of phonemes contained in terms of a specific type used in the predetermined language. Speech recognition processing can thus be performed using acoustic models chosen according to the type of term.

For example, the set of acoustic models 1 for number recognition contains a plurality of acoustic models trained specifically to recognize terms containing numbers, such as those used in temperature settings and time settings. The set of acoustic models 2 for loanwords contains a plurality of acoustic models trained specifically to recognize terms containing loanwords, such as katakana terms in Japanese and Japanese-coined English (wasei-eigo).

Each set of acoustic models is created in advance, for a plurality of phonemes used in the predetermined language, from speech uttered by a large number of speakers (for example, about 200). To create an acoustic model, MFCCs are obtained from the voice signal representing each phoneme. However, because the MFCCs are created from speech uttered by many speakers, their individual values vary.

Accordingly, the acoustic model for each phoneme has a spread, including this variation, in the multidimensional space that represents MFCCs. If the feature pattern generated from the voice signal input to the signal processing unit 41 falls within the spread of an acoustic model, the feature pattern is determined to match that acoustic model.

The option list storage unit 52 stores option lists, each containing a plurality of items of option data that respectively represent a plurality of options, together with acoustic model specifying information that specifies a set of acoustic models. In FIG. 1, the option list storage unit 52 stores a plurality of option lists A, B, C, ....

For example, an option list whose options are terms of unspecified types carries acoustic model specifying information that specifies the general-purpose set of acoustic models 0. An option list whose options are terms of a specific type carries acoustic model specifying information that specifies one of the sets of acoustic models 1, 2, ... for specific terms.

On receiving a speech recognition start command and option list designation information from the host CPU 91, the command analysis unit 10 designates, in accordance with the option list designation information, one option list from among the plurality of option lists A, B, C, ... stored in the option list storage unit 52.

The acoustic model selection unit 42 reads, from the option list storage unit 52, the acoustic model specifying information contained in the option list designated by the command analysis unit 10, and selects, from among the plurality of sets of acoustic models stored in the acoustic model storage unit 51, the one set of acoustic models specified by that information.
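As a rough illustration of how an option list can carry its own acoustic model specifying information, the sketch below models the two storage units as dictionaries. The class name `OptionList`, the dictionary layout, the string placeholders for the model sets, and the romanized entries of list B are assumptions; only the list names A and B and the model numbers 0, 1, and 2 follow the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionList:
    options: tuple          # option data, e.g. romanized readings
    acoustic_model_id: int  # acoustic model specifying information


# Stand-ins for the acoustic model storage unit 51 (placeholders only).
ACOUSTIC_MODEL_SETS = {
    0: "general-purpose set",
    1: "number-recognition set",
    2: "loanword set",
}

# Stand-in for the option list storage unit 52.
OPTION_LISTS = {
    "A": OptionList(("reibou", "danbou"), acoustic_model_id=0),
    "B": OptionList(("nijuudo", "nijuuichido"), acoustic_model_id=1),
}


def select_model_set(list_id):
    """Read the model ID from the designated list and pick that model set."""
    option_list = OPTION_LISTS[list_id]
    return ACOUSTIC_MODEL_SETS[option_list.acoustic_model_id]


print(select_model_set("A"))  # general-purpose set
print(select_model_set("B"))  # number-recognition set
```

Keeping the model ID inside the list means the host only has to name a list; the choice of acoustic models follows from the list itself.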

The coincidence detection unit 43 sequentially reads, from the option list storage unit 52, the plurality of items of option data contained in the option list designated by the command analysis unit 10, and reads, from the acoustic model storage unit 51, those acoustic models in the set selected by the acoustic model selection unit 42 that correspond to at least part of the option data (for example, the acoustic models for one syllable).

Accordingly, while the voice detection signal is active, the coincidence detection unit 43 performs speech recognition processing (pattern matching) by comparing the feature pattern generated from at least part of the voice signal input to the signal processing unit 41 with those acoustic models, in the set selected by the acoustic model selection unit 42, that correspond to at least part of each item of option data contained in the option list designated by the command analysis unit 10, and outputs a speech recognition result.

Here, the coincidence detection unit 43 may first read, from the set of acoustic models selected by the acoustic model selection unit 42, the plurality of acoustic models corresponding to the plurality of items of option data contained in the option list, and then compare the feature pattern with those acoustic models. Alternatively, the coincidence detection unit 43 may read the one acoustic model corresponding to one item of option data, compare the feature pattern with that acoustic model, and repeat this operation for each of the items of option data.

When a match between the feature pattern and an acoustic model is detected, the coincidence detection unit 43 may determine that the option corresponding to that acoustic model matches the input voice signal. When no match between the feature pattern and any acoustic model is detected, the coincidence detection unit 43 may output a speech recognition result indicating that the speech could not be recognized. Alternatively, the coincidence detection unit 43 may obtain the distance between the position of the feature pattern and the center of the spread of each acoustic model in the multidimensional space representing MFCCs, and determine that the option corresponding to the acoustic model closest to the position of the feature pattern matches the input voice signal.
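The two decision rules described here (a match when the feature pattern lies inside a model's spread, or selection of the model whose center is nearest) can be sketched as follows. The description does not define the shape of the spread or the distance measure, so the spherical spread with a single radius, the Euclidean distance, and the two-dimensional toy coordinates are assumptions for illustration.

```python
import math


def match_within_spread(feature, models):
    """Return labels of models whose (assumed spherical) spread contains feature.

    models maps a label to (centre, radius). An empty result corresponds to
    a "recognition not possible" outcome.
    """
    return [label for label, (centre, radius) in models.items()
            if math.dist(feature, centre) <= radius]


def match_nearest(feature, models):
    """Return the label of the model whose centre is closest to feature."""
    return min(models, key=lambda label: math.dist(feature, models[label][0]))


# Toy 2-D models; real feature patterns would have, e.g., 12 dimensions.
models = {"r": ((0.0, 0.0), 1.0), "d": ((5.0, 5.0), 1.0)}

print(match_within_spread((0.5, 0.5), models))  # ['r']
print(match_within_spread((3.0, 3.0), models))  # []
print(match_nearest((3.0, 3.0), models))        # d
```

The nearest-center rule always returns some option, whereas the spread rule can report that nothing matched.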

For example, the coincidence detection unit 43 compares the feature pattern generated from the first syllable of the input voice signal with the acoustic model corresponding to the first syllable of the option represented by each item of option data contained in the option list. If the option list contains only one option that begins with the syllable for which a match was detected, the coincidence detection unit 43 may determine that this option matches the input voice signal. If the option list contains a plurality of options that begin with the syllable for which a match was detected, the coincidence detection unit 43 may extend the range of syllables over which a match is to be detected until the options are narrowed down to one.
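The narrowing procedure can be sketched as a loop that extends the compared range one syllable at a time. In this sketch, string equality stands in for the acoustic comparison, and the function name `narrow_options` and the syllable segmentation of the example options are assumptions made for illustration.

```python
def narrow_options(heard, options):
    """Return the number of the single option consistent with the heard syllables.

    heard:   list of syllables recognized so far, in order.
    options: {option number: list of syllables}.
    Returns None if no option matches or the options cannot be narrowed to one.
    """
    candidates = dict(options)
    for depth, syllable in enumerate(heard):
        # Keep only the options whose syllable at this position matches.
        candidates = {number: syllables
                      for number, syllables in candidates.items()
                      if depth < len(syllables) and syllables[depth] == syllable}
        if len(candidates) == 1:
            return next(iter(candidates))
        if not candidates:
            return None
    return None


list_a = {1: ["re", "i", "bo", "u"], 2: ["da", "n", "bo", "u"]}
list_b = {1: ["ni", "ju", "u", "do"], 2: ["ni", "ju", "u", "i", "chi", "do"]}

print(narrow_options(["re", "i", "bo", "u"], list_a))             # 1
print(narrow_options(["ni", "ju", "u", "i", "chi", "do"], list_b))  # 2
```

For list A the first syllable already decides the result, while for list B the range has to be extended to the fourth syllable before only one option remains.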

Here, a "syllable" means a unit of sound that has one vowel as its nucleus and consists of that vowel alone or of that vowel preceded or followed by one or more consonants. Semivowels and special morae can also form syllables. That is, one syllable consists of one or more phonemes. Japanese syllables include "a" (あ), "i" (い), "u" (う), "e" (え), "o" (お), "ka" (か), "ki" (き), "ku" (く), "ke" (け), and "ko" (こ).

For example, the acoustic model corresponding to the syllable "a" (あ) is the acoustic model representing the phoneme "a" that constitutes that syllable. The acoustic model corresponding to the syllable "ka" (か) is the combination of the acoustic model representing the first phoneme "k" and the acoustic model representing the second phoneme "a" of that syllable.

When a syllable of the input voice signal consists of a single phoneme, a match for the syllable is detected once a match for that phoneme is detected. When a syllable of the input voice signal consists of a plurality of phonemes, a match for the syllable is detected once matches for all of those phonemes are detected.
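The relation between syllables and phonemes can be illustrated with a small lookup table. The table contents follow the examples in the text ("a", "ka") plus two syllables used later in the air conditioner example; the names `SYLLABLE_TO_PHONEMES` and `syllable_matches` are hypothetical, and equality of phoneme labels stands in for the acoustic comparison.

```python
# Each syllable maps to the ordered phonemes that constitute it.
SYLLABLE_TO_PHONEMES = {
    "a": ("a",),
    "ka": ("k", "a"),
    "re": ("r", "e"),
    "da": ("d", "a"),
}


def syllable_matches(heard_phonemes, syllable):
    """A syllable matches only if every one of its phonemes matches, in order."""
    expected = SYLLABLE_TO_PHONEMES[syllable]
    heard = tuple(heard_phonemes[:len(expected)])
    return heard == expected


print(syllable_matches(["r", "e", "i"], "re"))  # True
print(syllable_matches(["r", "e", "i"], "da"))  # False
print(syllable_matches(["a"], "ka"))            # False
```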

When a match as described above is detected between one option and the input voice signal, the coincidence detection unit 43 outputs a speech recognition result containing information that identifies, among the plurality of options in the option list, the option for which the match was detected (for example, the number of that option or the option data representing it). The host CPU 91 can thereby recognize the option corresponding to at least part of the voice signal input to the speech recognition processing unit 40.

Based on the speech recognition result, and following a scenario set in advance in the speech recognition control program, the host CPU 91 transmits voice reproduction data representing a question or message to the command analysis unit 10 together with a voice reproduction start command. The host CPU 91 also transmits new option list designation information to the command analysis unit 10 together with a speech recognition start command.

In accordance with the voice reproduction start command, the command analysis unit 10 supplies the voice reproduction data to the voice reproduction unit 60. The voice reproduction data may be voice data in a predetermined compression format or may be text data. Based on the voice reproduction data supplied from the command analysis unit 10, the voice reproduction unit 60 generates an output voice signal representing the voice to be output.

When the voice reproduction data is text data, the voice reproduction unit 60 synthesizes the output voice signal by using a speech synthesis database, which contains voice data representing the voice waveforms of various phonemes, to join together the voice data for the phonemes contained in the word or sentence represented by the text data.

The D/A converter 70 converts the digital output voice signal supplied from the voice reproduction unit 60 into an analog output voice signal. The voice output unit 80 includes a power amplifier that amplifies the power of the analog output voice signal supplied from the D/A converter 70, and a speaker that emits voice in accordance with the power-amplified output voice signal.

The speaker outputs, as voice, the question or message represented by the voice reproduction data transmitted from the host CPU 91. This creates a situation in which the user's answer to the question or message is expected to be one of several options, so that an option list containing option data representing those options can be applied.

<Modification of the First Embodiment>
FIG. 2 is a block diagram showing part of the configuration of an electronic apparatus equipped with a speech recognition apparatus according to a modification of the first embodiment of the present invention. In this modification, the storage unit 92 of the control unit 200 stores a plurality of option lists A, B, C, ..., and option lists selected from among them one after another are transmitted to the speech recognition apparatus 100. Accordingly, the speech recognition apparatus 100 uses memories 50a and 50b in place of the memory 50 shown in FIG. 1. In other respects, the modification is the same as the first embodiment.

The host CPU 91 of the control unit 200 causes the speech recognition apparatus 100 to perform a voice reproduction operation by transmitting a voice reproduction start command and voice reproduction data to the speech recognition apparatus 100 in accordance with a scenario set in advance in the speech recognition control program. The host CPU 91 also selects one option list from among the plurality of option lists A, B, C, ... stored in the storage unit 92, and causes the speech recognition apparatus 100 to perform a speech recognition operation by transmitting a speech recognition start command and the selected option list to the speech recognition apparatus 100.

In the speech recognition apparatus 100, the memory 50a is, for example, a ROM (read-only memory) or a flash memory, and includes the acoustic model storage unit 51. The memory 50b is, for example, a RAM (random access memory), and includes the option list storage unit 52.

On receiving the speech recognition start command and the option list from the host CPU 91, the command analysis unit 10 stores the received option list in the option list storage unit 52 in accordance with the speech recognition start command. FIG. 2 shows a state in which the option list storage unit 52 stores option list A.

The acoustic model selection unit 42 reads the acoustic model specifying information contained in the option list stored in the option list storage unit 52, and selects, from among the plurality of sets of acoustic models stored in the acoustic model storage unit 51, the one set of acoustic models specified by that information.

The coincidence detection unit 43 sequentially reads the plurality of items of option data contained in the option list stored in the option list storage unit 52, and reads, from the acoustic model storage unit 51, those acoustic models in the set selected by the acoustic model selection unit 42 that correspond to at least part of the option data (for example, the acoustic models for one syllable).

Accordingly, while the voice detection signal is active, the coincidence detection unit 43 performs speech recognition processing (pattern matching) by comparing the feature pattern generated from at least part of the voice signal input to the signal processing unit 41 with those acoustic models, in the set selected by the acoustic model selection unit 42, that correspond to at least part of each item of option data contained in the option list stored in the option list storage unit 52, and outputs a speech recognition result.

<Speech Recognition Method According to the First Embodiment>
Next, a speech recognition method according to the first embodiment of the present invention will be described with reference to FIGS. 1 to 3. FIG. 3 is a flowchart showing the speech recognition method carried out by the speech recognition apparatus shown in FIG. 1 or FIG. 2.

In step S11 of FIG. 3, the coincidence detection unit 43 reads the plurality of items of option data, and the acoustic model selection unit 42 reads the acoustic model specifying information, from the option list storage unit 52, which stores an option list containing a plurality of items of option data that respectively represent a plurality of options and acoustic model specifying information that specifies a set of acoustic models.

In step S12, the acoustic model selection unit 42 selects the one set of acoustic models specified by the acoustic model specifying information from among a plurality of sets of acoustic models obtained by collecting, under a plurality of different conditions, the distributions of frequency components of a plurality of phonemes used in a predetermined language. Meanwhile, in step S13, the signal processing unit 41 extracts the frequency components of the input voice signal and generates a feature pattern representing the distribution of those frequency components.

In step S14, the coincidence detection unit 43 performs speech recognition processing by comparing the feature pattern generated from at least part of the voice signal input to the signal processing unit 41 with those acoustic models, in the set selected by the acoustic model selection unit 42, that correspond to at least part of each item of option data read in step S11, and outputs a speech recognition result. This completes one speech recognition operation.
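The order of steps S11 to S14 can be summarized in a short control-flow sketch. Everything here is schematic: the function `recognize`, the dictionary layout of the option list, and the caller-supplied `extract_features` and `compare` callables are hypothetical stand-ins for the units described above, not an implementation of them.

```python
def recognize(option_list, model_sets, extract_features, compare):
    """Run one recognition pass following steps S11 to S14.

    Returns the 1-based number of the matched option, or None if no option
    matches (corresponding to a "recognition not possible" result).
    """
    # S11: read the option data and the acoustic model specifying information.
    options = option_list["options"]
    model_id = option_list["acoustic_model_id"]

    # S12: select the one set of acoustic models specified by that information.
    models = model_sets[model_id]

    # S13: generate a feature pattern from the input voice signal.
    feature = extract_features()

    # S14: compare the feature pattern with the models for each option.
    for number, option in enumerate(options, start=1):
        if compare(feature, models, option):
            return number
    return None


# Toy stand-ins: the "feature" is just the first phoneme label.
option_list = {"options": ["reibou", "danbou"], "acoustic_model_id": 0}
model_sets = {0: {"r": "model for r", "d": "model for d"}}

result = recognize(
    option_list,
    model_sets,
    extract_features=lambda: "r",
    compare=lambda feature, models, option: feature in models and option[0] == feature,
)
print(result)  # 1
```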

<Specific Example of the First Embodiment>
Next, a specific example of the speech recognition operation performed by the speech recognition apparatus according to the first embodiment of the present invention will be described with reference to FIGS. 1 and 4. The following describes a case in which the electronic apparatus shown in FIG. 1 is an air conditioner and the speech recognition apparatus is used to operate the air conditioner.

FIG. 4 is a diagram showing examples of the option lists stored in the option list storage unit shown in FIG. 1. FIG. 4(A) shows the contents of option list A, and FIG. 4(B) shows the contents of option list B. The option data representing an option includes data in romaji or kana notation from which the phonemes contained in the option can be identified.

For example, when the air conditioner is powered on, the host CPU 91 of the control unit 200 transmits voice reproduction data representing the question "Do you want cooling or heating?" to the command analysis unit 10 together with a voice reproduction start command. The host CPU 91 also transmits option list designation information designating option list A to the command analysis unit 10 together with a speech recognition start command.

In accordance with the option list designation information, the command analysis unit 10 of the speech recognition apparatus 100 designates option list A from among the plurality of option lists stored in the option list storage unit 52. Option list A contains option data representing, for each option number, an option relating to the operation setting of the air conditioner, together with an acoustic model number serving as the acoustic model specifying information.

The acoustic model selection unit 42 selects, from among the plurality of sets of acoustic models stored in the acoustic model storage unit 51, the general-purpose set of acoustic models 0 specified by the acoustic model number "0" contained in option list A.

The coincidence detection unit 43 reads, from the general-purpose set of acoustic models 0 stored in the acoustic model storage unit 51, the acoustic models corresponding to the phonemes "r, e" and "d, a" contained in the first syllables "re" (れ) and "da" (だ) of option 1 "cooling" (reibou) and option 2 "heating" (danbou) in option list A.

Meanwhile, in accordance with the voice reproduction start command, the command analysis unit 10 supplies the voice reproduction data to the voice reproduction unit 60. The voice reproduction unit 60 generates an output voice signal based on the voice reproduction data and supplies it to the D/A converter 70. The D/A converter 70 converts the digital output voice signal into an analog output voice signal and supplies it to the voice output unit 80. As a result, the voice output unit 80 utters the question "Do you want cooling or heating?"

When the user answers the question from the voice output unit 80 by saying "Reibou ni shimasu" ("Set it to cooling"), the signal processing unit 41 generates a feature pattern representing the distribution of frequency components for each of the phonemes "r, e, i, b, o, u, ...".

The coincidence detection unit 43 detects a match for the phoneme "r" by comparing the feature pattern of the first phoneme "r" of the first syllable, generated by the signal processing unit 41, with the acoustic models of the first phonemes "r" and "d" of the first syllables of option 1 and option 2.

When the phoneme for which a match was detected is a consonant, the coincidence detection unit 43 goes on to compare the second phoneme of the first syllable. The coincidence detection unit 43 detects a match for the phoneme "e" by comparing the feature pattern of the second phoneme "e" of the first syllable, generated by the signal processing unit 41, with the acoustic models of the second phonemes "e" and "a" of the first syllables of option 1 and option 2.

A match for the first syllable "re" is thereby detected. In this case, because a match was detected for only one option, the speech recognition result is obtained at this point. If matches were detected for a plurality of options, it would not be possible to tell which option applies; the coincidence detection unit 43 would therefore read, from the general-purpose set of acoustic models 0 stored in the acoustic model storage unit 51, the acoustic models corresponding to the phonemes contained in the next syllable, and extend the range of syllables over which a match is to be detected.

The coincidence detection unit 43 outputs to the host CPU 91 a speech recognition result containing information that identifies option 1 "cooling", which has the first syllable "re" for which the match was detected. Examples of information identifying option 1 "cooling" include the option number "1", the romaji notation "reibou" of the phonemes contained in the option, and a part of that notation such as "re".

The host CPU 91 of the control unit 200 can thereby recognize option 1 "cooling", which corresponds to at least part of the input voice signal. When the first speech recognition operation is completed in this way, the host CPU 91 sets the operation of the air conditioner to "cooling".

次に、ホストＣＰＵ９１は、「設定温度は何度にしますか？」という質問を表す音声再生データを音声再生開始コマンドと共にコマンド解析部１０に送信する。また、ホストＣＰＵ９１は、選択肢リストＢを指定する選択肢リスト指定情報を音声認識開始コマンドと共にコマンド解析部１０に送信する。 Next, the host CPU 91 transmits voice reproduction data representing the question “What temperature would you like to set?” to the command analysis unit 10 together with a voice reproduction start command. The host CPU 91 also transmits option list designation information designating the option list B to the command analysis unit 10 together with a voice recognition start command.

音声認識装置１００のコマンド解析部１０は、選択肢リスト指定情報に従って、選択肢リスト格納部５２に格納されている複数の選択肢リストの内から選択肢リストＢを指定する。選択肢リストＢは、選択肢番号に対応してエアコンの温度設定に関する複数の選択肢を表す選択肢データと、音響モデル特定情報として音響モデル番号とを含んでいる。 The command analysis unit 10 of the speech recognition apparatus 100 designates the option list B from among a plurality of option lists stored in the option list storage unit 52 according to the option list designation information. The option list B includes option data representing a plurality of options related to the temperature setting of the air conditioner corresponding to the option number, and an acoustic model number as the acoustic model specifying information.

音響モデル選択部４２は、音響モデル格納部５１に格納されている複数組の音響モデルの内から、選択肢リストＢに含まれている音響モデル番号「１」によって特定される数字認識用の１組の音響モデル１を選択する。 The acoustic model selection unit 42 selects, from among the plurality of sets of acoustic models stored in the acoustic model storage unit 51, the set of acoustic models 1 for number recognition specified by the acoustic model number “1” included in the option list B.
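The selection described above can be pictured as a lookup keyed by the acoustic model number carried in the option list. The following is only a minimal sketch under stated assumptions: the class name `OptionList`, the function `select_acoustic_model_set`, the string placeholders standing in for the stored model sets, and the romanized readings other than “nijuunanado” are hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionList:
    """Hypothetical container for one option list."""
    name: str
    acoustic_model_number: int  # acoustic model specifying information
    options: dict               # option number -> romanized phoneme string


# Placeholders standing in for the sets held in the acoustic model storage unit.
ACOUSTIC_MODEL_SETS = {
    0: "general-purpose set 0",
    1: "number-recognition set 1",
}

# Option list B (temperature setting); only a few options are shown.
OPTION_LIST_B = OptionList(
    name="B",
    acoustic_model_number=1,
    options={1: "nijuudo", 2: "nijuuichido", 8: "nijuunanado"},
)


def select_acoustic_model_set(option_list, model_sets=ACOUSTIC_MODEL_SETS):
    """Return the model set named by the option list's acoustic model number."""
    return model_sets[option_list.acoustic_model_number]


print(select_acoustic_model_set(OPTION_LIST_B))  # number-recognition set 1
```

Keeping the model number inside the option list means the host only has to designate a list; the matching model set follows from the list itself.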

一致検出部４３は、選択肢リストＢに含まれている選択肢１「２０℃」、選択肢２「２１℃」、・・・の先頭の音節「に」、「に」、・・・に含まれている音素「ｎ・ｉ」、「ｎ・ｉ」、・・・のそれぞれに対応する音響モデルを、音響モデル格納部５１に格納されている数字認識用の１組の音響モデル１の内から読み出す。 The coincidence detection unit 43 reads out, from the set of acoustic models 1 for number recognition stored in the acoustic model storage unit 51, the acoustic models corresponding to each of the phonemes “n·i”, “n·i”, ... included in the first syllables “ni”, “ni”, ... of option 1 “20 °C”, option 2 “21 °C”, ... included in the option list B.

一方、コマンド解析部１０は、音声再生開始コマンドに従って、音声再生データを音声再生部６０に供給する。音声再生部６０は、音声再生データに基づいて出力音声信号を生成してＤ／Ａ変換器７０に供給する。また、Ｄ／Ａ変換器７０は、ディジタルの出力音声信号をアナログの出力音声信号に変換して、アナログの出力音声信号を音声出力部８０に供給する。これにより、音声出力部８０から、「設定温度は何度にしますか？」という質問が発せられる。 Meanwhile, the command analysis unit 10 supplies the voice reproduction data to the voice reproduction unit 60 in accordance with the voice reproduction start command. The voice reproduction unit 60 generates an output voice signal based on the voice reproduction data and supplies it to the D/A converter 70. The D/A converter 70 converts the digital output voice signal into an analog output voice signal and supplies the analog output voice signal to the voice output unit 80. As a result, the voice output unit 80 issues the question “What temperature would you like to set?”.

音声出力部８０から発せられた質問に対して、ユーザーが、「２７℃にします。」と言うと、信号処理部４１は、音素「ｎ・ｉ・ｊ・ｕ・ｕ・ｎ・ａ・ｎ・ａ・・・」のそれぞれについて、周波数成分の分布状態を表す特徴パターンを生成する。 When the user answers the question issued from the voice output unit 80 by saying “Set it to 27 °C.”, the signal processing unit 41 generates, for each of the phonemes “n·i·j·u·u·n·a·n·a ...”, a feature pattern representing the distribution state of frequency components.

一致検出部４３は、信号処理部４１によって生成された先頭の音節の第１番目の音素「ｎ」の特徴パターンを、選択肢１、選択肢２、・・・の先頭の音節の第１番目の音素「ｎ」、「ｎ」、・・・の音響モデルと比較することにより、音素「ｎ」の一致を検出する。 The coincidence detection unit 43 compares the feature pattern of the first phoneme “n” of the first syllable, generated by the signal processing unit 41, with the acoustic models of the first phonemes “n”, “n”, ... of the first syllables of option 1, option 2, ..., and thereby detects a match for the phoneme “n”.

一致が検出された音素が子音を表している場合には、さらに、一致検出部４３が、先頭の音節の第２番目の音素を比較する。一致検出部４３は、信号処理部４１によって生成された先頭の音節の第２番目の音素「ｉ」の特徴パターンを、選択肢１、選択肢２、・・・の先頭の音節の第２番目の音素「ｉ」、「ｉ」、・・・の音響モデルと比較することにより、音素「ｉ」の一致を検出する。 When the phoneme for which a match is detected represents a consonant, the coincidence detection unit 43 further compares the second phoneme of the first syllable. The coincidence detection unit 43 compares the feature pattern of the second phoneme “i” of the first syllable, generated by the signal processing unit 41, with the acoustic models of the second phonemes “i”, “i”, ... of the first syllables of option 1, option 2, ..., and thereby detects a match for the phoneme “i”.

この場合には、先頭の音節「に」の一致が検出された選択肢が複数存在しており、いずれが該当するかを認識することができないので、一致検出部４３は、次の音節に含まれている音素のそれぞれについて、対応する音響モデルを音響モデル格納部５１に格納されている数字認識用の１組の音響モデル１の内から読み出して、一致を検出すべき音節の範囲を拡大する。 In this case, since there are a plurality of options in which the match of the first syllable “ni” is detected and it cannot be recognized which matches, the match detection unit 43 is included in the next syllable. For each phoneme, the corresponding acoustic model is read out of the set of acoustic models 1 for number recognition stored in the acoustic model storage unit 51, and the range of syllables whose coincidence is to be detected is expanded.

一致検出部４３は、最終的に一致が唯一検出された複数の音節「にじゅうな」を有する選択肢８「２７℃」を特定する情報を含む音声認識結果をホストＣＰＵ９１に出力する。選択肢８「２７℃」を特定する情報としては、例えば、選択肢番号「８」、選択肢に含まれている音素のローマ字表記「ｎｉｊｕｕｎａｎａｄｏ」又はその一部「ｎｉｊｕｕｎａｎａ」等が該当する。 The coincidence detection unit 43 outputs to the host CPU 91 a speech recognition result including information specifying option 8 “27 °C”, the only option finally matched over the plurality of syllables “nijuuna”. The information specifying option 8 “27 °C” is, for example, the option number “8”, the romanized notation “nijuunanado” of the phonemes included in the option, or a part thereof, “nijuunana”.
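The narrowing described above, in which the range of syllables is widened until only one option remains, can be sketched as follows. This is an illustration under stated assumptions, not the matching procedure of the embodiment: acoustic comparison is replaced by exact comparison of romanized syllables, the function name `narrow_options` is hypothetical, and the readings of the intermediate options (22 °C to 26 °C) are assumed because the specification lists only options 1, 2, and 8.

```python
# Assumed readings of the units digit 0..7 (hypothetical romanization).
UNITS = {
    0: [], 1: ["i", "chi"], 2: ["ni"], 3: ["sa", "n"],
    4: ["yo", "n"], 5: ["go"], 6: ["ro", "ku"], 7: ["na", "na"],
}

# Option number 1..8 -> syllables of "20 degrees" .. "27 degrees".
OPTIONS_B = {n + 1: ["ni", "ju", "u"] + UNITS[n] + ["do"] for n in range(8)}


def narrow_options(heard_syllables, options):
    """Widen the compared syllable range until at most one option remains.

    Returns (option_number or None, number_of_syllables_compared).
    """
    candidates = dict(options)
    compared = 0
    for compared in range(1, len(heard_syllables) + 1):
        prefix = heard_syllables[:compared]
        candidates = {
            number: syllables
            for number, syllables in candidates.items()
            if syllables[:compared] == prefix
        }
        if len(candidates) <= 1:
            break
    if len(candidates) == 1:
        return next(iter(candidates)), compared
    return None, compared


print(narrow_options("ni ju u na na do".split(), OPTIONS_B))  # (8, 4)
```

With these assumed readings, the first three syllables are shared by every option, and the fourth syllable “na” leaves only option 8, which agrees with the syllables “nijuuna” named in the text.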

これにより、制御部２００のホストＣＰＵ９１は、入力された音声信号の少なくとも一部に対応する選択肢８「２７℃」を認識することができる。このようにして第２回目の音声認識動作が終了すると、ホストＣＰＵ９１は、エアコンの設定温度を「２７℃」に設定する。 Thereby, the host CPU 91 of the control unit 200 can recognize option 8 “27 °C” corresponding to at least a part of the input voice signal. When the second speech recognition operation is thus completed, the host CPU 91 sets the set temperature of the air conditioner to “27 °C”.

さらに、ホストＣＰＵ９１は、次の音声認識動作を継続しても良いし、一連の音声認識動作を終了しても良い。一連の音声認識動作を終了するときには、ホストＣＰＵ９１は、「承知しました。」というメッセージを表す音声再生データを音声再生開始コマンドと共にコマンド解析部１０に送信する。 Further, the host CPU 91 may continue with the next speech recognition operation or may end the series of speech recognition operations. When ending the series of speech recognition operations, the host CPU 91 transmits voice reproduction data representing the message “Understood.” to the command analysis unit 10 together with a voice reproduction start command.

コマンド解析部１０は、音声再生開始コマンドに従って、音声再生データを音声再生部６０に供給する。音声再生部６０は、音声再生データに基づいて出力音声信号を生成してＤ／Ａ変換器７０に供給する。また、Ｄ／Ａ変換器７０は、ディジタルの出力音声信号をアナログの出力音声信号に変換して、アナログの出力音声信号を音声出力部８０に供給する。これにより、音声出力部８０から、「承知しました。」というメッセージが発せられる。 The command analysis unit 10 supplies the voice reproduction data to the voice reproduction unit 60 in accordance with the voice reproduction start command. The voice reproduction unit 60 generates an output voice signal based on the voice reproduction data and supplies it to the D/A converter 70. The D/A converter 70 converts the digital output voice signal into an analog output voice signal and supplies the analog output voice signal to the voice output unit 80. As a result, the voice output unit 80 issues the message “Understood.”.

本発明の第１の実施形態によれば、所定の言語において用いられる複数の音素の周波数成分の分布状態を複数の異なる条件の下で収集して得られた複数組の音響モデルが用意され、複数組の音響モデルの内から、選択肢リストに含まれている音響モデル特定情報によって特定される１組の音響モデルが選択される。これにより、選択肢リストに含まれている複数の選択肢データの内容等の条件に応じて設定された音響モデルを用いて音声認識処理が行われるので、音声認識処理における認識率を向上させることができる。 According to the first embodiment of the present invention, a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language under a plurality of different conditions are prepared, One set of acoustic models specified by the acoustic model specifying information included in the option list is selected from the plurality of sets of acoustic models. Thereby, since the speech recognition process is performed using the acoustic model set according to the conditions such as the contents of the plurality of option data included in the option list, the recognition rate in the speech recognition process can be improved. .

＜第２の実施形態に係る音声認識装置＞ <Speech Recognition Device According to Second Embodiment>
図５は、本発明の第２の実施形態に係る音声認識装置を搭載した電子機器の構成の一部を示すブロック図である。第２の実施形態においては、入力される音声信号における声質の特徴に基づいて、複数組の音響モデルの内から１組の音響モデルが選択される。 FIG. 5 is a block diagram showing a part of the configuration of an electronic apparatus equipped with a speech recognition apparatus according to the second embodiment of the present invention. In the second embodiment, one set of acoustic models is selected from among a plurality of sets of acoustic models based on the characteristics of the voice quality in the input voice signal.

そのために、図１に示す第１の実施形態に対し、声質判定部４４と、声質判定情報格納部５３とが追加されている。一方、選択肢リスト格納部５２に格納されている各々の選択肢リストは、音響モデル特定情報を含んでいなくても良い。その他の点に関しては、第１の実施形態と同様である。 Therefore, a voice quality determination unit 44 and a voice quality determination information storage unit 53 are added to the first embodiment shown in FIG. On the other hand, each option list stored in the option list storage unit 52 may not include acoustic model specifying information. The other points are the same as in the first embodiment.

図５に示す音響モデル格納部５１は、所定の言語において用いられる複数の音素の周波数成分の分布状態を複数の異なる種類の声質について収集して得られた複数組の音響モデルを格納している。これにより、声質の種類に応じて設定された音響モデルを用いて音声認識処理を行うことができる。 The acoustic model storage unit 51 illustrated in FIG. 5 stores a plurality of sets of acoustic models obtained by collecting frequency component distribution states of a plurality of phonemes used in a predetermined language for a plurality of different types of voice qualities. . Thereby, speech recognition processing can be performed using the acoustic model set according to the type of voice quality.

例えば、音響モデル格納部５１は、年齢又は性別の異なる複数群の話者の音声を収録して得られた音声信号に基づいて生成された複数組の音響モデルを格納しても良い。図５においては、音響モデル格納部５１が、子供用の１組の音響モデル０と、女性用の１組の音響モデル１と、男性用の１組の音響モデル２と、老人用の１組の音響モデル３とを格納している。ここで、「女性」とは、ある範囲の年齢の女性のことであり、「男性」とは、ある範囲の年齢の男性のことである。 For example, the acoustic model storage unit 51 may store a plurality of sets of acoustic models generated from voice signals obtained by recording the voices of a plurality of groups of speakers of different ages or genders. In FIG. 5, the acoustic model storage unit 51 stores a set of acoustic models 0 for children, a set of acoustic models 1 for women, a set of acoustic models 2 for men, and a set of acoustic models 3 for the elderly. Here, “women” means women in a certain age range, and “men” means men in a certain age range.

声質判定情報格納部５３は、話者の声質の種類を判定するために用いられる声質の特徴を表す声質判定情報を格納している。声質判定情報は、例えば、話者の年齢又は性別に対応して、音声の基本周波数の範囲又はフォルマント周波数（声道の共振周波数）の範囲等を特定する情報である。 The voice quality determination information storage unit 53 stores voice quality determination information representing characteristics of voice quality used for determining the type of voice quality of the speaker. The voice quality determination information is, for example, information that specifies the range of the fundamental frequency of the voice or the range of the formant frequency (resonance frequency of the vocal tract) according to the age or sex of the speaker.

図５においては、声質判定情報格納部５３が、子供の声質の特徴を表す声質判定情報と、女性の声質の特徴を表す声質判定情報と、男性の声質の特徴を表す声質判定情報と、老人の声質の特徴を表す声質判定情報とを格納している。ここで、「女性」とは、ある範囲の年齢の女性のことであり、「男性」とは、ある範囲の年齢の男性のことである。 In FIG. 5, the voice quality determination information storage unit 53 stores voice quality determination information representing the characteristics of a child's voice quality, voice quality determination information representing the characteristics of a woman's voice quality, voice quality determination information representing the characteristics of a man's voice quality, and voice quality determination information representing the characteristics of an elderly person's voice quality. Here, “women” means women in a certain age range, and “men” means men in a certain age range.

声質判定部４４は、入力される音声信号によって表される音声における声質の特徴に基づいて、その音声信号における声質の種類を判定する。例えば、声質判定部４４は、入力される音声信号から基本周波数又はフォルマント周波数等を抽出して、声質判定情報格納部５３に格納されている声質判定情報によって表される周波数範囲と比較することにより、抽出された周波数に最も近い周波数範囲を有する声質の種類（子供、女性、男性、又は、老人）を判定し、判定結果を音響モデル選択部４２に出力する。 The voice quality determination unit 44 determines the type of voice quality in the voice signal based on the voice quality feature of the voice represented by the input voice signal. For example, the voice quality determination unit 44 extracts a fundamental frequency, a formant frequency, or the like from the input voice signal and compares it with the frequency range represented by the voice quality determination information stored in the voice quality determination information storage unit 53. The type of voice quality (child, female, male, or elderly person) having the frequency range closest to the extracted frequency is determined, and the determination result is output to the acoustic model selection unit 42.
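The comparison against stored frequency ranges can be sketched as choosing the voice quality type whose range lies nearest to the extracted frequency. The sketch below uses the fundamental frequency only, and every number in `F0_RANGES_HZ` is a placeholder invented so that the example runs; the specification gives no values. The function names are likewise hypothetical.

```python
# Placeholder fundamental-frequency ranges in Hz (NOT measured values).
F0_RANGES_HZ = {
    "child": (260.0, 400.0),
    "female": (170.0, 250.0),
    "male": (80.0, 120.0),
    "elderly": (125.0, 160.0),
}


def distance_to_range(value, low, high):
    """Distance from value to the closed interval [low, high]; 0 if inside."""
    if value < low:
        return low - value
    if value > high:
        return value - high
    return 0.0


def judge_voice_quality(f0_hz, ranges=F0_RANGES_HZ):
    """Return the type whose frequency range is closest to the extracted F0."""
    return min(ranges, key=lambda kind: distance_to_range(f0_hz, *ranges[kind]))


print(judge_voice_quality(300.0))  # child
print(judge_voice_quality(165.0))  # female
```

A frequency that falls between two ranges is assigned to the nearer one, which corresponds to the "closest frequency range" wording of the text; how ties or overlapping ranges are resolved is left open by the specification.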

音響モデル選択部４２は、声質判定部４４から出力される判定結果に従って、音響モデル格納部５１に格納されている複数組の音響モデルの内から、声質判定部４４によって判定された声質の種類に対応する１組の音響モデルを選択する。 According to the determination result output from the voice quality determination unit 44, the acoustic model selection unit 42 selects the voice quality type determined by the voice quality determination unit 44 from among a plurality of sets of acoustic models stored in the acoustic model storage unit 51. A corresponding set of acoustic models is selected.

ここで、選択肢リスト格納部５２は、特定の声質の種類について音声認識処理を許可又は禁止する制御情報をさらに含む選択肢リストを格納しても良い。制御情報は、声質の種類毎に音声認識処理の許可又は禁止を表す情報を含んでも良い。例えば、エアコンの温度設定に関する選択肢リストＢ（図４（Ｂ）参照）は、子供について音声認識処理を禁止し、それ以外について音声認識処理を許可する情報を含んでも良い。 Here, the option list storage unit 52 may store an option list further including control information for permitting or prohibiting the voice recognition processing for a specific voice quality type. The control information may include information indicating permission or prohibition of voice recognition processing for each type of voice quality. For example, the option list B (see FIG. 4B) regarding the temperature setting of the air conditioner may include information that prohibits the voice recognition process for the child and permits the voice recognition process for the rest.

その場合に、一致検出部４３は、声質判定部４４によって判定された声質の種類と選択肢リストに含まれている制御情報とに基づいて、その選択肢リストについて音声認識処理を開始するか否かを判定する。これにより、音声認識による電子機器の操作等におけるセキュリティレベルを制御することができる。例えば、子供がエアコンの温度設定を行うことを禁止することにより、エアコンの危険な操作を防止することが可能である。 In that case, the coincidence detection unit 43 determines whether or not to start the speech recognition processing for the option list based on the type of voice quality determined by the voice quality determination unit 44 and the control information included in the option list. judge. Thereby, it is possible to control the security level in the operation of the electronic device by voice recognition. For example, by prohibiting children from setting the temperature of the air conditioner, it is possible to prevent dangerous operation of the air conditioner.
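As a rough illustration of the control information, the option list can carry one permit/prohibit flag per voice quality type, which is consulted before recognition starts. The names `CONTROL_INFO_LIST_B` and `may_start_recognition` are hypothetical, and treating an unlisted type as prohibited is an assumption made here, not something the specification states.

```python
# Control information for option list B: children are prohibited,
# every other voice quality type is permitted (per the example in the text).
CONTROL_INFO_LIST_B = {
    "child": False,
    "female": True,
    "male": True,
    "elderly": True,
}


def may_start_recognition(voice_kind, control_info):
    """Return True if recognition may start for this voice quality type.

    Unlisted types are treated as prohibited (an assumption of this sketch).
    """
    return control_info.get(voice_kind, False)


print(may_start_recognition("child", CONTROL_INFO_LIST_B))  # False
print(may_start_recognition("male", CONTROL_INFO_LIST_B))   # True
```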

また、第２の実施形態においても、第１の実施形態の変形例と同様に、制御部２００の格納部９２が複数の選択肢リストＡ、Ｂ、Ｃ、・・・を格納し、音声認識装置１００において、図５に示すメモリー５０の替わりに、図２に示すメモリー５０ａ及び５０ｂを用いても良い。 Also in the second embodiment, as in the modification of the first embodiment, the storage unit 92 of the control unit 200 may store the plurality of option lists A, B, C, ..., and the speech recognition apparatus 100 may use the memories 50a and 50b shown in FIG. 2 instead of the memory 50 shown in FIG. 5.

＜第２の実施形態の変形例＞ <Modification of Second Embodiment>
図６は、本発明の第２の実施形態の変形例に係る音声認識装置の構成の一部を示すブロック図である。第２の実施形態の変形例においては、特定の話者に適合するようにトレーニング（話者最適化処理）された音響モデルが用いられる。そのために、第２の実施形態に対し、音響モデルトレーニング部４５と、声質判定情報抽出部４６とが追加されている。その他の点に関しては、第２の実施形態と同様である。 FIG. 6 is a block diagram showing a part of the configuration of a speech recognition apparatus according to a modification of the second embodiment of the present invention. In the modification of the second embodiment, an acoustic model that has been trained (speaker optimization processing) to suit a specific speaker is used. For this purpose, an acoustic model training unit 45 and a voice quality determination information extraction unit 46 are added to the second embodiment. The other points are the same as in the second embodiment.

図６に示す音響モデル格納部５１は、不特定話者用の１組の初期音響モデルを格納している。特定の話者によってトレーニングが行われる際に、音響モデルトレーニング部４５は、信号処理部４１から出力される特徴パターン及び外部から供給される音素列情報に基づいて、スピーカー・アダプテーション機能により、トレーニング話者の音声に適応するように１組の初期音響モデルをトレーニングする。 The acoustic model storage unit 51 shown in FIG. 6 stores a set of initial acoustic models for unspecified speakers. When training is performed by a specific speaker, the acoustic model training unit 45 uses a speaker adaptation function to train the set of initial acoustic models so that it adapts to the voice of the training speaker, based on the feature patterns output from the signal processing unit 41 and phoneme string information supplied from the outside.

音響モデルトレーニング部４５は、トレーニングされた１組の音響モデルを音響モデル格納部５１に格納する。これにより、音響モデル格納部５１は、所定の言語において用いられる複数の音素の周波数成分の分布状態を複数の異なる種類の声質について収集して得られた複数組の音響モデルを格納することになる。 The acoustic model training unit 45 stores a set of trained acoustic models in the acoustic model storage unit 51. Thereby, the acoustic model storage unit 51 stores a plurality of sets of acoustic models obtained by collecting frequency component distribution states of a plurality of phonemes used in a predetermined language for a plurality of different types of voice qualities. .

一方、声質判定情報抽出部４６は、トレーニング話者の声質を判定するために用いられる声質の特徴を表す声質判定情報を抽出して、その声質判定情報を声質判定情報格納部５３に格納する。声質判定情報は、例えば、トレーニング話者の音声の基本周波数の範囲又はフォルマント周波数の範囲等を特定する情報である。 On the other hand, the voice quality determination information extraction unit 46 extracts voice quality determination information representing the characteristics of the voice quality used for determining the voice quality of the training speaker, and stores the voice quality determination information in the voice quality determination information storage unit 53. The voice quality determination information is, for example, information that specifies the fundamental frequency range or formant frequency range of the training speaker's voice.

トレーニング後に音声認識が行われる際に、声質判定部４４は、入力される音声信号における声質の特徴に基づいて、音声信号における声質がトレーニング話者の声質であるか否かを判定する。例えば、声質判定部４４は、入力される音声信号から基本周波数又はフォルマント周波数等を抽出して、声質判定情報格納部５３に格納されている声質判定情報によって表される周波数範囲と比較することにより、音声信号における声質がトレーニング話者の声質であるか否かを判定し、判定結果を音響モデル選択部４２に出力する。 When speech recognition is performed after training, the voice quality determination unit 44 determines whether the voice quality of the voice signal is the voice quality of the training speaker based on the voice quality characteristics of the input voice signal. For example, the voice quality determination unit 44 extracts a fundamental frequency, a formant frequency, or the like from the input voice signal and compares it with the frequency range represented by the voice quality determination information stored in the voice quality determination information storage unit 53. Then, it is determined whether or not the voice quality in the voice signal is the voice quality of the training speaker, and the determination result is output to the acoustic model selection unit 42.

音響モデル選択部４２は、声質判定部４４から出力される判定結果に従って、音響モデル格納部５１に格納されている複数組の音響モデルの内から、声質判定部４４によって判定された声質の種類に対応する１組の音響モデルを選択する。即ち、音響モデル選択部４２は、音声信号における声質がトレーニング話者の声質であると判定された場合に、音響モデル格納部５１に格納されているトレーニングされた１組の音響モデルを選択する。一方、音響モデル選択部４２は、音声信号における声質がトレーニング話者の声質でないと判定された場合に、音響モデル格納部５１に格納されている１組の初期音響モデルを選択する。 According to the determination result output from the voice quality determination unit 44, the acoustic model selection unit 42 selects the voice quality type determined by the voice quality determination unit 44 from among a plurality of sets of acoustic models stored in the acoustic model storage unit 51. A corresponding set of acoustic models is selected. That is, the acoustic model selection unit 42 selects a set of trained acoustic models stored in the acoustic model storage unit 51 when it is determined that the voice quality in the speech signal is the voice quality of the training speaker. On the other hand, the acoustic model selection unit 42 selects a set of initial acoustic models stored in the acoustic model storage unit 51 when it is determined that the voice quality in the voice signal is not the voice quality of the training speaker.
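The two-way choice described above reduces to a single test: whether the extracted frequency falls inside the range stored for the training speaker. The sketch below assumes that only the fundamental frequency is used and that the stored voice quality determination information is a simple (low, high) pair; the function names, the string placeholders for the model sets, and the numeric range are all hypothetical.

```python
def is_training_speaker(f0_hz, trained_range_hz):
    """True if the extracted F0 lies inside the training speaker's stored range."""
    low, high = trained_range_hz
    return low <= f0_hz <= high


def select_model_set(f0_hz, trained_range_hz,
                     trained_set="trained set", initial_set="initial set"):
    """Pick the trained set for the training speaker, otherwise the initial set."""
    if is_training_speaker(f0_hz, trained_range_hz):
        return trained_set
    return initial_set


TRAINED_RANGE_HZ = (110.0, 140.0)  # placeholder range, not a measured value

print(select_model_set(125.0, TRAINED_RANGE_HZ))  # trained set
print(select_model_set(220.0, TRAINED_RANGE_HZ))  # initial set
```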

一致検出部４３は、コマンド解析部１０（図５）によって指定された選択肢リストに含まれている複数の選択肢データを選択肢リスト格納部５２から順次読み出すと共に、音響モデル選択部４２によって選択された１組の音響モデルの内で、選択肢データの少なくとも一部に対応する音響モデル（例えば、１音節分の音響モデル）を音響モデル格納部５１から読み出す。 The coincidence detection unit 43 sequentially reads out, from the option list storage unit 52, the plurality of option data included in the option list designated by the command analysis unit 10 (FIG. 5), and also reads out from the acoustic model storage unit 51, from within the set of acoustic models selected by the acoustic model selection unit 42, the acoustic models corresponding to at least a part of the option data (for example, the acoustic models for one syllable).

＜第２の実施形態に係る音声認識方法＞ <Speech Recognition Method According to Second Embodiment>
次に、本発明の第２の実施形態に係る音声認識方法について、図５〜図７を参照しながら説明する。図７は、図５又は図６に示す音声認識装置によって実施される音声認識方法を示すフローチャートである。 Next, a speech recognition method according to the second embodiment of the present invention will be described with reference to FIGS. 5 to 7. FIG. 7 is a flowchart showing the speech recognition method carried out by the speech recognition apparatus shown in FIG. 5 or FIG. 6.

図７のステップＳ２１において、一致検出部４３が、複数の選択肢をそれぞれ表す複数の選択肢データを含む選択肢リストを格納する選択肢リスト格納部５２から、複数の選択肢データを読み出す。 In step S21 of FIG. 7, the match detection unit 43 reads a plurality of option data from the option list storage unit 52 that stores an option list including a plurality of option data each representing a plurality of options.

ステップＳ２２において、信号処理部４１が、入力された音声信号の周波数成分を抽出し、その周波数成分の分布状態を表す特徴パターンを生成する。ステップＳ２３において、声質判定部４４が、入力される音声信号における声質の種類を判定する。 In step S22, the signal processing unit 41 extracts the frequency component of the input audio signal and generates a feature pattern representing the distribution state of the frequency component. In step S23, the voice quality determination unit 44 determines the type of voice quality in the input audio signal.

ステップＳ２４において、音響モデル選択部４２が、所定の言語において用いられる複数の音素の周波数成分の分布状態を複数の異なる種類の声質について収集して得られた複数組の音響モデルの内から、ステップＳ２３において判定された声質の種類に対応する１組の音響モデルを選択する。 In step S24, the acoustic model selection unit 42 selects the set of acoustic models corresponding to the type of voice quality determined in step S23 from among the plurality of sets of acoustic models obtained by collecting the distribution states of frequency components of a plurality of phonemes used in a predetermined language for a plurality of different types of voice quality.

ステップＳ２５において、一致検出部４３が、信号処理部４１に入力された音声信号の少なくとも一部から生成された特徴パターンを、音響モデル選択部４２によって選択された１組の音響モデルの内で、ステップＳ２１において読み出された各々の選択肢データの少なくとも一部に対応する音響モデルと比較して音声認識処理を行い、音声認識結果を出力する。これにより、１回の音声認識動作が終了する。 In step S25, the coincidence detection unit 43 performs speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal input to the signal processing unit 41 with those acoustic models, within the set of acoustic models selected by the acoustic model selection unit 42, that correspond to at least a part of each option data read out in step S21, and outputs a speech recognition result. This completes one speech recognition operation.

本発明の第２の実施形態によれば、所定の言語において用いられる複数の音素の周波数成分の分布状態を複数の異なる種類の声質について収集して得られた複数組の音響モデルが用意され、複数組の音響モデルの内から、入力された音声信号に基づいて判定された声質の種類に対応する１組の音響モデルが選択される。これにより、話者の声質の種類に応じて設定された音響モデルを用いて音声認識処理が行われるので、音声認識処理における認識率を向上させることができる。 According to the second embodiment of the present invention, a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language for a plurality of different types of voice qualities are prepared, One set of acoustic models corresponding to the type of voice quality determined based on the input voice signal is selected from the plurality of sets of acoustic models. Thereby, since the speech recognition process is performed using the acoustic model set according to the type of voice quality of the speaker, the recognition rate in the speech recognition process can be improved.

＜第３の実施形態に係る音声認識装置＞ <Speech Recognition Device According to Third Embodiment>
図８は、本発明の第３の実施形態に係る音声認識装置を搭載した電子機器の構成の一部を示すブロック図である。第３の実施形態においては、入力される音声信号から生成された特徴パターンが複数組の音響モデルと比較され、最も高い認識確度が得られた音声認識結果が最終的な音声認識結果として出力される。 FIG. 8 is a block diagram showing a part of the configuration of an electronic apparatus equipped with a speech recognition apparatus according to the third embodiment of the present invention. In the third embodiment, the feature pattern generated from the input voice signal is compared with a plurality of sets of acoustic models, and the speech recognition result with the highest recognition accuracy is output as the final speech recognition result.

そのために、図１に示す第１の実施形態における一致検出部４３が複数の部分（図８においては４つの部分４３ａ〜４３ｄを示す）を含み、認識確度判定部４７が追加される。一方、選択肢リスト格納部５２に格納されている各々の選択肢リストは、音響モデル特定情報を含んでいなくても良い。その他の点に関しては、第１の実施形態と同様である。 For this purpose, the coincidence detection unit 43 in the first embodiment shown in FIG. 1 includes a plurality of parts (in FIG. 8, four parts 43a to 43d are shown), and a recognition accuracy judgment unit 47 is added. On the other hand, each option list stored in the option list storage unit 52 may not include acoustic model specifying information. The other points are the same as in the first embodiment.

図８に示す音響モデル格納部５１は、所定の言語において用いられる複数の音素の周波数成分の分布状態を複数の異なる条件の下で収集して得られた複数組の音響モデルを格納している。 The acoustic model storage unit 51 illustrated in FIG. 8 stores a plurality of sets of acoustic models obtained by collecting frequency component distribution states of a plurality of phonemes used in a predetermined language under a plurality of different conditions. .

例えば、音響モデル格納部５１は、第１の実施形態と同様に、所定の言語において用いられる不特定の種類の用語に含まれている複数の音素の周波数成分の分布状態を収集して得られた１組の音響モデルと、所定の言語において用いられる特定の種類の用語に含まれている複数の音素の周波数成分の分布状態を収集して得られた少なくとも１組の音響モデルとを格納しても良い。その場合には、用語の種類に応じて設定された音響モデルを用いて音声認識処理を行うことができる。 For example, as in the first embodiment, the acoustic model storage unit 51 is obtained by collecting distribution states of frequency components of a plurality of phonemes included in an unspecified type term used in a predetermined language. A set of acoustic models and at least one set of acoustic models obtained by collecting the distribution states of frequency components of a plurality of phonemes included in a specific type of term used in a predetermined language. May be. In that case, speech recognition processing can be performed using an acoustic model set according to the type of term.

あるいは、音響モデル格納部５１は、第２の実施形態と同様に、所定の言語において用いられる複数の音素の周波数成分の分布状態を複数の異なる種類の声質について収集して得られた複数組の音響モデルを格納しても良い。その場合には、声質の種類に応じて設定された音響モデルを用いて音声認識処理を行うことができる。 Alternatively, as in the second embodiment, the acoustic model storage unit 51 collects a plurality of sets of frequency components obtained by collecting frequency component distribution states of a plurality of phonemes used in a predetermined language for a plurality of different types of voice qualities. An acoustic model may be stored. In that case, the speech recognition process can be performed using an acoustic model set according to the type of voice quality.

図８においては、音響モデル格納部５１が、子供用の１組の音響モデル０と、女性用の１組の音響モデル１と、男性用の１組の音響モデル２と、老人用の１組の音響モデル３とを格納している。ここで、「女性」とは、ある範囲の年齢の女性のことであり、「男性」とは、ある範囲の年齢の男性のことである。 In FIG. 8, the acoustic model storage unit 51 stores a set of acoustic models 0 for children, a set of acoustic models 1 for women, a set of acoustic models 2 for men, and a set of acoustic models 3 for the elderly. Here, “women” means women in a certain age range, and “men” means men in a certain age range.

一致検出部４３ａ〜４３ｄは、コマンド解析部１０によって指定された選択肢リストに含まれている複数の選択肢データを選択肢リスト格納部５２から順次読み出すと共に、音響モデル格納部５１に格納されている複数組の音響モデルの内で、選択肢データの少なくとも一部に対応する音響モデル（例えば、１音節分の音響モデル）をそれぞれ読み出す。 The coincidence detection units 43 a to 43 d sequentially read out a plurality of option data included in the option list specified by the command analysis unit 10 from the option list storage unit 52 and also store a plurality of sets stored in the acoustic model storage unit 51. Among the acoustic models, an acoustic model (for example, an acoustic model for one syllable) corresponding to at least a part of the option data is read out.

これにより、一致検出部４３ａ〜４３ｄは、音声検出信号が活性化されているときに、信号処理部４１に入力された音声信号の少なくとも一部から生成された特徴パターンを、音響モデル格納部５１に格納されている複数組の音響モデルの内で、コマンド解析部１０によって指定された選択肢リストに含まれている各々の選択肢データの少なくとも一部に対応する音響モデルと比較して音声認識処理（パターンマッチング）を行い、複数組の音響モデルに対応する複数の音声認識結果及び複数の認識確度をそれぞれ出力する。 Thereby, while the voice detection signal is activated, the coincidence detection units 43a to 43d perform speech recognition processing (pattern matching) by comparing the feature pattern generated from at least a part of the voice signal input to the signal processing unit 41 with those acoustic models, within the plurality of sets of acoustic models stored in the acoustic model storage unit 51, that correspond to at least a part of each option data included in the option list designated by the command analysis unit 10, and respectively output a plurality of speech recognition results and a plurality of recognition accuracies corresponding to the plurality of sets of acoustic models.

例えば、一致検出部４３ａ〜４３ｄは、入力された音声信号の先頭の音節から生成された特徴パターンを、選択肢リストに含まれている各々の選択肢データによって表される選択肢の先頭の音節に対応する音響モデルと比較する。選択肢リストにおいて、一致が検出された音節を先頭に有する選択肢が１つだけ存在する場合には、一致検出部４３ａ〜４３ｄは、その選択肢が、入力された音声信号に一致すると判定しても良い。一方、選択肢リストにおいて、一致が検出された音節を先頭に有する複数の選択肢が存在する場合には、一致検出部４３ａ〜４３ｄは、選択肢が１つに絞られるまで、一致を検出すべき音節の範囲を拡大しても良い。 For example, the coincidence detection units 43a to 43d compare the feature pattern generated from the first syllable of the input voice signal with the acoustic models corresponding to the first syllable of the option represented by each option data included in the option list. When the option list contains only one option that begins with the syllable for which a match is detected, the coincidence detection units 43a to 43d may determine that this option matches the input voice signal. On the other hand, when the option list contains a plurality of options that begin with the syllable for which a match is detected, the coincidence detection units 43a to 43d may expand the range of syllables in which a match is to be detected until the options are narrowed down to one.

ここで、一致検出部４３ａ〜４３ｄは、ＭＦＣＣを表す多次元空間における特徴パターンの位置と音響モデルの広がりの中心との間の距離Ｄを求め、所定の値Ｅと距離Ｄとの差（Ｅ−Ｄ）を認識確度としても良い。ここで、所定の値Ｅは、距離Ｄがゼロであるときの認識確度の最大値を表している。また、音声認識結果が得られるまでに複数（Ｎ個）の音素について特徴パターンと音響モデルとの比較が行われた場合には、一致検出部４３ａ〜４３ｄは、それぞれの音素について求められた距離Ｄ（ｉ）の平均値ΣＤ（ｉ）／Ｎを求め、所定の値Ｅと平均値ΣＤ（ｉ）／Ｎとの差（Ｅ−ΣＤ（ｉ）／Ｎ）を認識確度としても良い。 Here, the coincidence detection units 43a to 43d obtain a distance D between the position of the feature pattern in the multidimensional space representing the MFCC and the center of the spread of the acoustic model, and the difference between the predetermined value E and the distance D (E -D) may be used as the recognition accuracy. Here, the predetermined value E represents the maximum value of the recognition accuracy when the distance D is zero. When the feature pattern and the acoustic model are compared for a plurality (N) of phonemes before the speech recognition result is obtained, the coincidence detection units 43a to 43d determine the distance obtained for each phoneme. The average value ΣD (i) / N of D (i) may be obtained, and the difference (E−ΣD (i) / N) between the predetermined value E and the average value ΣD (i) / N may be used as the recognition accuracy.
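The recognition accuracy E − ΣD(i)/N described above can be worked through numerically as follows. This is a minimal sketch assuming that the distance D is the Euclidean distance between a feature vector and the center of the acoustic model in the MFCC space; the specification does not name the distance measure, and the function name `recognition_accuracy` and the two-dimensional example vectors are hypothetical.

```python
import math


def recognition_accuracy(feature_vectors, model_centers, e_max):
    """Return E - (sum of D(i)) / N over the N compared phonemes.

    D(i) is taken to be the Euclidean distance between the i-th feature
    vector and the center of the corresponding acoustic model (an assumption).
    """
    if not feature_vectors or len(feature_vectors) != len(model_centers):
        raise ValueError("need the same non-zero number of features and models")
    distances = [
        math.dist(feature, center)
        for feature, center in zip(feature_vectors, model_centers)
    ]
    return e_max - sum(distances) / len(distances)


# Two phonemes: D(1) = 0 and D(2) = 5, so the mean distance is 2.5.
features = [(0.0, 0.0), (3.0, 4.0)]
centers = [(0.0, 0.0), (0.0, 0.0)]
print(recognition_accuracy(features, centers, e_max=10.0))  # 7.5
```

With N = 1 the expression reduces to E − D, and the accuracy reaches its maximum E when every distance is zero, as stated in the text.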

図８に示す例において、一致検出部４３ａは、音声信号の少なくとも一部から生成された特徴パターンを、子供用の１組の音響モデル０に含まれている音響モデルと比較して音声認識処理を行い、第１の音声認識結果及び第１の認識確度を出力する。一致検出部４３ｂは、音声信号の少なくとも一部から生成された特徴パターンを、女性用の１組の音響モデル１に含まれている音響モデルと比較して音声認識処理を行い、第２の音声認識結果及び第２の認識確度を出力する。 In the example shown in FIG. 8, the coincidence detection unit 43a performs speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models included in the set of acoustic models 0 for children, and outputs a first speech recognition result and a first recognition accuracy. The coincidence detection unit 43b performs speech recognition processing by comparing the feature pattern generated from at least a part of the voice signal with the acoustic models included in the set of acoustic models 1 for women, and outputs a second speech recognition result and a second recognition accuracy.

Likewise, the coincidence detection unit 43c performs speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with the acoustic models included in the set of acoustic models 2 for men, and outputs a third speech recognition result and a third recognition accuracy. The coincidence detection unit 43d performs speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with the acoustic models included in the set of acoustic models 3 for the elderly, and outputs a fourth speech recognition result and a fourth recognition accuracy.

The recognition accuracy determination unit 47 outputs, as the final speech recognition result, the speech recognition result for which the highest recognition accuracy was obtained among the plurality of recognition accuracies output from the coincidence detection units 43a to 43d. For example, when the speaker is a child, the first recognition accuracy is generally higher than the second to fourth recognition accuracies, and in that case the recognition accuracy determination unit 47 outputs the first speech recognition result as the final speech recognition result.
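The selection performed by the recognition accuracy determination unit 47 can be sketched as follows. This is only an illustration: the `Candidate` type, the model set labels, the recognized words, and the accuracy values are hypothetical, and the handling of ties (the first candidate wins) is an assumption, since the specification does not address it.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    model_set: str   # label of the acoustic model set (hypothetical)
    result: str      # recognition result from one coincidence detection unit
    accuracy: float  # recognition accuracy reported for that result


def select_final_result(candidates):
    """Return the candidate with the highest recognition accuracy."""
    if not candidates:
        raise ValueError("no recognition results to compare")
    # max() keeps the first candidate when accuracies are equal (assumed tie rule).
    return max(candidates, key=lambda c: c.accuracy)


# Hypothetical outputs of the four coincidence detection units for a child speaker.
candidates = [
    Candidate("children", "cooling", 8.4),
    Candidate("women", "cooling", 6.9),
    Candidate("men", "heating", 3.1),
    Candidate("elderly", "cooling", 4.7),
]
final = select_final_result(candidates)
```

Here the candidate obtained with the model set for children has the highest accuracy, so its result becomes the final speech recognition result.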

Also in the third embodiment, as in the modification of the first embodiment, the storage unit 92 of the control unit 200 may store a plurality of option lists A, B, C, ..., and, in the speech recognition apparatus 100, the memories 50a and 50b shown in FIG. 2 may be used to hold the acoustic model storage unit 51 and the option list storage unit 52 shown in FIG. 8.

<Speech Recognition Method According to the Third Embodiment>
Next, a speech recognition method according to the third embodiment of the present invention will be described with reference to FIGS. 8 and 9. FIG. 9 is a flowchart showing the speech recognition method performed by the speech recognition apparatus shown in FIG. 8.

In step S31 of FIG. 9, the coincidence detection unit 43 reads out a plurality of pieces of option data from the option list storage unit 52, which stores an option list including a plurality of pieces of option data respectively representing a plurality of options. In step S32, the signal processing unit 41 extracts the frequency components of the input speech signal and generates a feature pattern representing the distribution state of those frequency components.

In step S33, the coincidence detection units 43a to 43d perform speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal input to the signal processing unit 41 with the acoustic models that correspond to at least a part of each piece of option data read out in step S31, among the plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language under a plurality of different conditions, and output a plurality of speech recognition results and a plurality of recognition accuracies corresponding to the plurality of sets of acoustic models.

In step S34, the recognition accuracy determination unit 47 outputs, as the final speech recognition result, the speech recognition result for which the highest recognition accuracy was obtained among the plurality of recognition accuracies output from the coincidence detection units 43a to 43d. This completes one speech recognition operation.
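The flow of steps S31 to S34 can be sketched end to end under strong simplifying assumptions. In the sketch below, each option is given as a phoneme sequence, each acoustic model is reduced to a single center point, the feature patterns of step S32 are supplied ready-made (one per phoneme), and the accuracy is E minus the mean Euclidean distance. All names and values are hypothetical, and the model sets are evaluated one after another rather than simultaneously.

```python
import math


def score_option(features, phonemes, models, e_max):
    """Accuracy of one option: E minus the mean distance to its phoneme models."""
    pairs = list(zip(features, phonemes))
    total = sum(math.dist(f, models[p]) for f, p in pairs)
    return e_max - total / len(pairs)


def recognize(option_list, features, model_sets, e_max=10.0):
    """Return (accuracy, option, model set name) for the best overall match."""
    options = dict(option_list)                      # S31: read the option data
    per_set = []
    for set_name, models in model_sets.items():      # S33: one pass per model set
        scored = [(score_option(features, ph, models, e_max), word)
                  for word, ph in options.items()]
        accuracy, word = max(scored, key=lambda s: s[0])
        per_set.append((accuracy, word, set_name))
    return max(per_set, key=lambda r: r[0])          # S34: highest accuracy wins


# Hypothetical option list and two hypothetical model sets.
option_list = {"on": ["o", "n"], "of": ["o", "f"]}
model_sets = {
    "children": {"o": (5.0, 5.0), "n": (6.0, 1.0), "f": (2.0, 7.0)},
    "adults": {"o": (3.0, 3.0), "n": (4.0, 0.0), "f": (0.0, 5.0)},
}
features = [(3.0, 3.0), (4.0, 1.0)]                  # S32 output, assumed given
best_accuracy, best_option, best_set = recognize(option_list, features, model_sets)
```

With these values the option "on" scored against the "adults" model set gives distances of 0 and 1, hence an accuracy of 9.5, which is higher than any score obtained with the "children" set.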

According to the third embodiment of the present invention, a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language under a plurality of different conditions are prepared, and speech recognition processing is performed simultaneously using the plurality of sets of acoustic models, so that a plurality of speech recognition results and a plurality of recognition accuracies corresponding to the plurality of sets of acoustic models are obtained. The speech recognition result for which the highest recognition accuracy was obtained among the plurality of recognition accuracies is then output as the final speech recognition result. Because the result with the highest recognition accuracy is selected from speech recognition processing performed simultaneously with a plurality of sets of acoustic models set according to a plurality of different conditions, the recognition rate of the speech recognition processing can be improved.

If, as speech recognition processing is repeated, the recognition accuracy obtained with a particular set of acoustic models is often higher than the recognition accuracy obtained with the other sets, speech recognition processing may be performed using only that particular set of acoustic models. If the recognition accuracy obtained with that set then falls to or below a preset threshold, processing may revert to speech recognition using the plurality of sets of acoustic models.
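One way to realize this switching is sketched below. The specification does not say how "often higher" is decided, so the sketch assumes a simple win counter with a hypothetical parameter `wins_needed`; the class name, its methods, and the numeric values are likewise assumptions made for illustration.

```python
class ModelSetSwitcher:
    """Narrows recognition to one model set while its accuracy stays above a threshold."""

    def __init__(self, all_sets, threshold, wins_needed=3):
        self.all_sets = list(all_sets)
        self.threshold = threshold      # preset accuracy threshold
        self.wins_needed = wins_needed  # assumed criterion for "often higher"
        self.wins = {name: 0 for name in self.all_sets}
        self.locked = None              # model set currently used alone, if any

    def active_sets(self):
        """Model sets to use for the next recognition operation."""
        return [self.locked] if self.locked is not None else list(self.all_sets)

    def update(self, winner, accuracy):
        """Record the winning model set and its accuracy after one recognition."""
        if self.locked is not None:
            # Revert to all model sets once accuracy is at or below the threshold.
            if accuracy <= self.threshold:
                self.locked = None
                self.wins = {name: 0 for name in self.all_sets}
            return
        self.wins[winner] += 1
        if self.wins[winner] >= self.wins_needed:
            self.locked = winner


switcher = ModelSetSwitcher(["children", "women", "men", "elderly"], threshold=5.0)
for _ in range(3):
    switcher.update("children", 8.0)
sets_after_lock = switcher.active_sets()    # only the "children" set is used
switcher.update("children", 4.0)            # accuracy fell to or below the threshold
sets_after_revert = switcher.active_sets()  # all four sets are used again
```

Using a single model set while one speaker dominates reduces the amount of comparison work, and the threshold check restores the full comparison when a different speaker is likely to have started talking.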

In the above embodiments, specific examples in which the present invention is applied to an air conditioner have been described. However, the present invention is not limited to these embodiments; it is applicable to electronic devices in general, and many modifications are possible within the technical idea of the present invention by those having ordinary knowledge in the technical field.

DESCRIPTION OF SYMBOLS
100 ... speech recognition apparatus, 10 ... command analysis unit, 20 ... speech input unit, 30 ... A/D converter, 40 ... speech recognition processing unit, 41 ... signal processing unit, 42 ... acoustic model selection unit, 43, 43a to 43d ... coincidence detection unit, 44 ... voice quality determination unit, 45 ... acoustic model training unit, 46 ... voice quality determination information extraction unit, 47 ... recognition accuracy determination unit, 50, 50a, 50b ... memory, 51 ... acoustic model storage unit, 52 ... option list storage unit, 53 ... voice quality determination information storage unit, 60 ... speech reproduction unit, 70 ... D/A converter, 80 ... speech output unit, 200 ... control unit, 91 ... host CPU, 92 ... storage unit

Claims

A speech recognition apparatus comprising:
an acoustic model storage unit that stores a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language under a plurality of different conditions;
an option list storage unit that stores an option list including option data representing options and acoustic model specifying information for specifying one set of acoustic models from among the plurality of sets of acoustic models;
an acoustic model selection unit that selects the set of acoustic models specified by the acoustic model specifying information;
a signal processing unit that extracts frequency components of an input speech signal and generates a feature pattern representing a distribution state of the frequency components of the speech signal; and
a coincidence detection unit that performs speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with acoustic models that correspond to at least a part of the option data, among the selected set of acoustic models, and outputs a speech recognition result.

The speech recognition apparatus according to claim 1, wherein the plurality of sets of acoustic models include a set of acoustic models obtained by collecting distribution states of frequency components of phonemes included in unspecified types of terms used in the predetermined language, and a set of acoustic models obtained by collecting distribution states of frequency components of phonemes included in a specific type of term used in the predetermined language.

A speech recognition apparatus comprising:
an acoustic model storage unit that stores a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language for a plurality of different types of voice quality;
an option list storage unit that stores an option list including option data representing options;
a signal processing unit that extracts frequency components of an input speech signal and generates a feature pattern representing a distribution state of the frequency components of the speech signal;
a voice quality determination unit that determines the type of voice quality of the speech signal;
an acoustic model selection unit that selects, from among the plurality of sets of acoustic models, one set of acoustic models corresponding to the type of voice quality determined by the voice quality determination unit; and
a coincidence detection unit that performs speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with acoustic models that correspond to at least a part of the option data, among the selected set of acoustic models, and outputs a speech recognition result.

The speech recognition apparatus according to claim 3, wherein the option list storage unit stores an option list further including control information for permitting or prohibiting speech recognition processing for a specific type of voice quality, and
the coincidence detection unit determines whether to start speech recognition processing for the option list based on the type of voice quality determined by the voice quality determination unit and the control information.

A speech recognition apparatus comprising:
an acoustic model storage unit that stores a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language under a plurality of different conditions;
an option list storage unit that stores an option list including option data representing options;
a signal processing unit that extracts frequency components of an input speech signal and generates a feature pattern representing a distribution state of the frequency components of the speech signal;
a coincidence detection unit that performs speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with acoustic models that correspond to at least a part of the option data, among the plurality of sets of acoustic models, and outputs a plurality of speech recognition results and a plurality of recognition accuracies corresponding to the plurality of sets of acoustic models; and
a recognition accuracy determination unit that outputs, as a final speech recognition result, the speech recognition result for which the highest recognition accuracy was obtained among the plurality of recognition accuracies.

The speech recognition apparatus according to claim 5, wherein the plurality of sets of acoustic models include a set of acoustic models obtained by collecting distribution states of frequency components of phonemes included in unspecified types of terms used in the predetermined language, and a set of acoustic models obtained by collecting distribution states of frequency components of phonemes included in a specific type of term used in the predetermined language.

The speech recognition apparatus according to claim 5, wherein the plurality of sets of acoustic models are composed of a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in the predetermined language for a plurality of different types of voice quality.

An electronic device comprising the speech recognition apparatus according to any one of claims 1 to 7.

A speech recognition method comprising:
a step (a) of reading out option data and acoustic model specifying information from an option list storage unit that stores an option list including the option data, which represents options, and the acoustic model specifying information, which specifies one set of acoustic models;
a step (b) of selecting the set of acoustic models specified by the acoustic model specifying information from among a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language under a plurality of different conditions;
a step (c) of extracting frequency components of an input speech signal and generating a feature pattern representing a distribution state of the frequency components of the speech signal; and
a step (d) of performing speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with acoustic models that correspond to at least a part of the option data read out in step (a), among the selected set of acoustic models, and outputting a speech recognition result.

A speech recognition method comprising:
a step (a) of reading out option data from an option list storage unit that stores an option list including the option data, which represents options;
a step (b) of extracting frequency components of an input speech signal and generating a feature pattern representing a distribution state of the frequency components of the speech signal;
a step (c) of determining the type of voice quality of the speech signal;
a step (d) of selecting one set of acoustic models corresponding to the type of voice quality determined in step (c) from among a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language for a plurality of different types of voice quality; and
a step (e) of performing speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with acoustic models that correspond to at least a part of the option data read out in step (a), among the selected set of acoustic models, and outputting a speech recognition result.

A speech recognition method comprising:
a step (a) of reading out option data from an option list storage unit that stores an option list including the option data, which represents options;
a step (b) of extracting frequency components of an input speech signal and generating a feature pattern representing a distribution state of the frequency components of the speech signal;
a step (c) of performing speech recognition processing by comparing the feature pattern generated from at least a part of the speech signal with acoustic models that correspond to at least a part of the option data read out in step (a), among a plurality of sets of acoustic models obtained by collecting distribution states of frequency components of a plurality of phonemes used in a predetermined language under a plurality of different conditions, and outputting a plurality of speech recognition results and a plurality of recognition accuracies corresponding to the plurality of sets of acoustic models; and
a step (d) of outputting, as a final speech recognition result, the speech recognition result for which the highest recognition accuracy was obtained among the plurality of recognition accuracies.