JP4919282B2

JP4919282B2 - Unclear voice command recognition device and unclear voice command recognition processing method

Info

Publication number: JP4919282B2
Application number: JP2007069773A
Authority: JP
Inventors: 健佐土原
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2007-03-19
Filing date: 2007-03-19
Publication date: 2012-04-18
Anticipated expiration: 2027-03-19
Also published as: JP2008233282A

Description

The present invention relates to an unclear voice command recognition device and an unclear voice command recognition processing method that allow a device to be operated by voice commands even when the speech is unclearly uttered or varies greatly from one utterance to the next, as with the speech of elderly or disabled persons or speech in a noisy environment.

Techniques for recognizing voice commands have long been researched and developed in order to operate devices by voice. For example, Patent Document 1 describes a speech recognition processing technique for operating a home video recorder using word-spotting speech recognition, Patent Document 2 describes a speech recognition processing technique for operating in-vehicle information equipment such as a car navigation system, and Patent Document 3 describes a speech recognition processing technique for operating an electric wheelchair by voice instead of a joystick. In these techniques, the recognition engine that recognizes speech from the speech signal relies on existing speech recognition processing.

At present, the technique most commonly used to recognize voice commands performs speech recognition on the speech signal using an acoustic model described by hidden Markov models (HMMs) together with a description grammar. Among the commands registered in the dictionary, it outputs as the recognition result the command with the highest likelihood under the acoustic model and the description grammar.

Normally, the dictionary of voice commands for operating a device by voice describes, for each command, a phoneme sequence representing its standard pronunciation. For a speaker who deviates greatly from the standard pronunciation, however, the HMMs of the recognition engine may be unable to absorb the deviation, and the utterance may be recognized as a phoneme sequence that differs considerably from the standard one.

To adapt to such a speaker, a technique is also known that improves recognition accuracy by registering multiple phoneme sequences specific to that speaker.

Moreover, for a speaker who for some reason has difficulty speaking stably, the variation between utterances becomes large, and registering a single phoneme sequence per command in the dictionary may not be sufficient.

For example, for a speaker who has difficulty speaking stably because of a disability such as cerebral palsy, recognition accuracy can be improved by registering in the dictionary phoneme sequences extracted from multiple samples of each voice command.

The phoneme sequences to be registered can be obtained by manual analysis, but recognition accuracy can also be improved by registering multiple phoneme sequences obtained automatically with a continuous phoneme recognition engine such as a phoneme typewriter. A continuous phoneme recognition engine is realized with the same technology as an ordinary continuous word recognition engine, with words replaced by phonemes.

The following documents can be referred to as prior art related to this type of voice command recognition.
Patent Document 1: JP 2002-269146 A
Patent Document 2: Japanese Translation of PCT Publication No. 2003-202897
Patent Document 3: JP 2000-84004 A
Patent Document 4: JP 2002-221984 A
Non-Patent Document 1: Shikano et al.: Speech Recognition System, Ohmsha, 2001.
Non-Patent Document 2: N. Cristianini and J. Shawe-Taylor: An Introduction to Support Vector Machines, Cambridge University Press, 2002.
Non-Patent Document 3: H. Lodhi et al.: Text classification using string kernels, Advances in Neural Information Processing Systems 13, 2001.
Non-Patent Document 4: C. Cortes et al.: Rational kernels: theory and algorithms, Journal of Machine Learning Research, 5, 2004.
Non-Patent Document 5: C. Saunders et al.: Syllables and other string kernel extensions, Proceedings of International Conference on Machine Learning, 2002.
Non-Patent Document 6: T. Jebara et al.: Bhattacharyya and expected likelihood kernels, Proceedings of Computational Learning Theory, 2003.
Non-Patent Document 7: K. Fukunaga: Statistical pattern recognition (2nd ed.), Academic Press, 1990.

In the case of the cerebral palsy patients mentioned above, however, stable speech is difficult. Recognition accuracy therefore falls when the number of commands to be recognized increases, or when the user is allowed to utter non-command speech while using voice commands and the commands must be distinguished from that other speech.

This is because a disabled speaker has difficulty speaking stably: unnecessary sounds are frequently inserted, and the utterance differs in part each time it is spoken. As a result, different utterances may by chance be recognized as partially similar phoneme sequences, which makes voice commands hard to discriminate.

The same problem is not limited to the speech of disabled persons. It can also arise when recognizing the speech of elderly persons or speech in an environment with frequent sudden noise.

The present invention has been made to overcome these problems. Its object is to provide an unclear voice command recognition device and an unclear voice command recognition processing method that allow a device to be operated by voice commands even when the speech is unclearly uttered or varies greatly from one utterance to the next, as with the speech of elderly or disabled persons or speech in a noisy environment.

To achieve the above object, according to a first aspect of the present invention, an unclear voice command recognition device for operating a device using unclearly uttered speech as voice commands comprises: voice input means (100) for converting the speech signal of a voice command into an electrical signal and inputting it; analog-to-digital conversion means (101) for converting the electrical signal of the speech signal input by the voice input means into digital data; feature extraction means (103) for extracting feature vectors from the digital data of the speech signal by cepstrum analysis; subword recognition means (106) for recognizing subword units of the speech from the time series of feature vectors and outputting a sequence or graph of subword units; database update means (110) for generating training data for command identification from the sequence or graph of subword units and registering it in a database; support vector machine learning means (111) for constructing a command classifier by a support vector machine on the basis of the training data; voice identification means (108) for identifying a voice command by data processing using the command classifier constructed by the support vector machine learning means; and control signal output means (109) for generating a control signal according to the voice command identified by the voice identification means and sending it to the device to be controlled.

In this device, the subword recognition means recognizes subword units in units of phonemes, syllables, or phoneme segments of the speech signal. The support vector machine learning means and the voice identification means use a string kernel for sequences of subword units and a Rational kernel for graphs of subword units. Furthermore, the voice input means is a microphone array in which multiple microphones are arranged, and the device further comprises sound source separation means (102) for separating multiple sound sources from the digital data of the multiple speech signals obtained from the microphone array and removing directional noise, feature correction means (104) for removing stationary noise from the speech signal, and non-speech identification means (105) for distinguishing speech from non-speech by fundamental frequency estimation.

According to a second aspect of the present invention, an unclear voice command recognition processing method for recognizing unclear voice commands, in which unclearly uttered speech is used as voice commands to operate a device, executes the following by computer processing: a voice input step of converting the speech signal of a voice command into an electrical signal and inputting it; an analog-to-digital conversion step of converting the electrical signal of the speech signal input by the voice input means into digital data; a feature extraction step of extracting feature vectors from the digital data of the speech signal by cepstrum analysis; a subword recognition step of recognizing subword units of the speech from the time series of feature vectors and outputting a sequence or graph of subword units; a database update step of generating training data for command identification from the sequence or graph of subword units and registering it in a database; a support vector learning step of constructing a command classifier by a support vector machine on the basis of the training data; a voice identification step of identifying a voice command by data processing using the command classifier constructed by the support vector machine learning means; and a control signal output step of generating a control signal according to the voice command identified by the voice identification means and sending it to the device to be controlled.

In this method, the subword recognition step recognizes subword units in units of phonemes, syllables, or phoneme segments of the speech signal. The support vector machine learning step and the voice identification step use a string kernel for sequences of subword units and a Rational kernel for graphs of subword units. Furthermore, the voice input step inputs multiple speech signals from a microphone array in which multiple microphones are arranged, and the method further executes a sound source separation step of separating multiple sound sources from the digital data of the multiple speech signals obtained from the microphone array and removing directional noise, a feature correction step of removing stationary noise from the speech signal, and a non-speech identification step of distinguishing speech from non-speech by fundamental frequency estimation.

According to a third aspect of the present invention, an unclear voice command recognition processing program, which executes unclear voice command recognition processing for operating a device using unclearly uttered speech as voice commands, causes a computer to function as: voice input means for converting the speech signal of a voice command into an electrical signal and inputting it; analog-to-digital conversion means for converting the electrical signal of the speech signal input by the voice input means into digital data; feature extraction means for extracting feature vectors from the digital data of the speech signal by cepstrum analysis; subword recognition means for recognizing subword units of the speech from the time series of feature vectors and outputting a sequence or graph of subword units; database update means for generating training data for command identification from the sequence or graph of subword units and registering it in a database; support vector machine learning means for constructing a command classifier by a support vector machine on the basis of the training data; voice identification means for identifying a voice command by data processing using the command classifier constructed by the support vector machine learning means; and control signal output means for generating a control signal according to the voice command identified by the voice identification means and sending it to the device to be controlled.

In this program, the subword recognition means recognizes subword units in units of phonemes, syllables, or phoneme segments of the speech signal. The support vector machine learning means and the voice identification means use a string kernel for sequences of subword units and a Rational kernel for graphs of subword units. Furthermore, the voice input means inputs multiple speech signals from a microphone array in which multiple microphones are arranged, and the program causes the computer to function as sound source separation means for separating multiple sound sources from the digital data of the multiple speech signals obtained from the microphone array and removing directional noise, feature correction means for removing stationary noise from the speech signal, and non-speech identification means for distinguishing speech from non-speech by fundamental frequency estimation.

With the unclear voice command recognition device, processing method, and processing program of the present invention, speech is first recognized as a sequence of subword units such as phonemes, syllables, or phoneme segments (Patent Document 4). The phoneme sequence is then analyzed in detail and exhaustively, for example by identifying subsequences characteristic of a particular command and removing subsequences that correspond to unnecessary sounds, in order to identify which command the input subword unit sequence represents. This makes it possible to recognize with high accuracy unclear speech that varies greatly between utterances and speech in an environment with frequent sudden noise.

According to the present invention, it also becomes possible to recognize with high accuracy unclear voice commands that contain many inserted unnecessary sounds and vary greatly between utterances, as well as voice commands in an environment with frequent sudden noise. Voice control of devices thus becomes available to people who could not previously use speech recognition technology and in situations where it could not previously be used. Moreover, even when the number of commands is increased or non-command speech is uttered while voice commands are in use, the degradation of recognition accuracy can be kept smaller than with the prior art, which also improves user convenience.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a flowchart showing an example of the processing flow of the speech recognition processing in the unclear voice command recognition device, processing method, and processing program of the present invention.

As shown in FIG. 1, this speech recognition processing is provided with multiple processing modules that perform the following processes: a voice input process 100 that inputs speech through a microphone as an analog electrical signal; an AD conversion process 101 that digitizes the analog electrical signal of the speech into digital data; a sound source separation process 102 that, when a microphone array is used, separates multiple sound sources from the resulting multiple speech signals and removes directional noise; a feature extraction process 103 that performs cepstrum analysis or the like to obtain a time series of feature vectors; a feature correction process 104 that removes stationary noise from the speech signal; a speech/non-speech identification process 105 that distinguishes speech from non-speech using fundamental frequency estimation or the like; a subword recognition process 106 that recognizes subword units such as phonemes or phoneme segments from the time series of feature vectors and outputs a sequence of subword units; a command sample database update process 110 that registers sequences of subword units in a database when the command classifier is being trained; an SVM learning process 111 that trains the command classifier using a support vector machine (SVM); a classifier database update process 112 that stores the trained classifier in memory or on a hard disk; an SVM identification process 108 that, at command identification time, identifies the command from the sequence of subword units using the classifier trained by the SVM; and a control signal output process 109 that outputs a control signal to the device to be controlled (an electric wheelchair) on the basis of the result. Through the data processing of these modules, unclear voice commands are recognized and a control signal for the device to be controlled is output.

To perform this recognition of unclear voice commands, the following databases are provided, as described later: an acoustic model database 121 that stores the acoustic models used for subword recognition; a command sample database 122 that stores the recognized subword unit sequences or subword unit graphs used for command identification and classifier training; and a classifier database 123 that stores the trained classifiers. The data in these databases are used in the data processing of each module and are referred to when unclear voice commands are recognized and a control signal is output to the device to be controlled. The processing modules also update the data in these databases.

The SVM learning process 111 is now described in more detail. To classify sequences of subword units, that is, character strings, with an SVM, a kernel function that efficiently computes the similarity between strings is required. The learning principle and algorithms of SVMs are well known, so a detailed description is omitted here; see Non-Patent Document 2.

As shown in Non-Patent Document 3, the string kernel is a known kernel function that gives the similarity between strings. It computes the similarity of any two strings from the extent to which they share arbitrary substrings (which need not be contiguous), at a computational cost on the order of p times the product of the string lengths (in Non-Patent Document 3, p is the length of the subsequences considered). Applying this kernel function to an SVM makes it possible to analyze, in detail and exhaustively, which subsequences of a voice command's subword sequence contribute to command identification and which do not.
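The string kernel described above can be sketched as follows. This is a minimal Python sketch of a gap-weighted subsequence kernel in the style of Non-Patent Document 3, under the assumption that a subword sequence is given as a sequence of phoneme labels. The function name `subsequence_kernel`, the decay parameter `lam`, and the example inputs are illustrative and do not come from the patent.

```python
from typing import Sequence


def subsequence_kernel(s: Sequence[str], t: Sequence[str], p: int, lam: float = 0.5) -> float:
    """Gap-weighted subsequence kernel of order p (hypothetical helper).

    Counts subsequences of length p shared by s and t, where each occurrence
    is weighted by lam raised to the length of the span it covers, so
    occurrences with gaps count less. Runs in O(p * len(s) * len(t)).
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    n, m = len(s), len(t)
    # kp[i][a][b] holds the auxiliary kernel K'_i for prefixes s[:a], t[:b].
    kp = [[[0.0] * (m + 1) for _ in range(n + 1)] for _ in range(p)]
    for a in range(n + 1):
        for b in range(m + 1):
            kp[0][a][b] = 1.0
    for i in range(1, p):
        for a in range(i, n + 1):
            kpp = 0.0  # running value of K''_i(s[:a], t[:b]) as b grows
            for b in range(i, m + 1):
                match = lam * lam * kp[i - 1][a - 1][b - 1] if s[a - 1] == t[b - 1] else 0.0
                kpp = lam * kpp + match
                kp[i][a][b] = lam * kp[i][a - 1][b] + kpp
    # Sum over all pairs of positions where the last symbols match.
    total = 0.0
    for a in range(p, n + 1):
        for b in range(p, m + 1):
            if s[a - 1] == t[b - 1]:
                total += lam * lam * kp[p - 1][a - 1][b - 1]
    return total


if __name__ == "__main__":
    # "cat" and "car" share only the length-2 subsequence "ca".
    print(subsequence_kernel("cat", "car", p=2, lam=0.5))
```

In practice the kernel value is often normalized by the self-similarities of the two inputs so that long sequences do not dominate; that step is omitted here.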

Furthermore, by using the method disclosed in Non-Patent Document 5, the speech recognition processing here can introduce a similarity based on soft matching of subword units rather than strict matching. For example, the phoneme "a" can be treated as more similar to the phoneme "a:" than to the phoneme "k".

Such a similarity between subword units is given, for example, by computing the similarity between the HMMs of the subword units. The HMM of a subword unit normally has multiple states s_1, …, s_n, and signal output probability distributions p_1, …, p_n are defined on the state transitions. The similarity of two HMMs is therefore defined as the average of the similarities s(p_i, q_i) of their output probability distributions. That is, the similarity A(a, b) between subword unit a and subword unit b is defined as

$$A(a, b) = \frac{1}{n} \sum_{i=1}^{n} s(p_i, q_i)$$

where p_i and q_i denote the i-th output probability distributions of the HMMs of subword unit a and subword unit b, respectively.
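The averaging over HMM states can be sketched as below. This is a minimal Python sketch, assuming both HMMs have the same number of states and that the distribution similarity s is supplied by the caller. The name `subword_similarity` and the toy similarity used in the example are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

D = TypeVar("D")  # type of one output probability distribution


def subword_similarity(
    states_a: Sequence[D],
    states_b: Sequence[D],
    s: Callable[[D, D], float],
) -> float:
    """Average of s(p_i, q_i) over the states of two subword-unit HMMs."""
    if len(states_a) != len(states_b) or len(states_a) == 0:
        raise ValueError("both HMMs must have the same non-zero number of states")
    return sum(s(p, q) for p, q in zip(states_a, states_b)) / len(states_a)


if __name__ == "__main__":
    # Toy stand-in: each "distribution" is a label and s is exact equality.
    toy_s = lambda p, q: 1.0 if p == q else 0.0
    print(subword_similarity(["x", "y", "z"], ["x", "y", "w"], toy_s))
```

Passing s as a parameter keeps the averaging independent of how the output distributions are represented.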

As the similarity between output probability distributions, the Bhattacharyya kernel (Non-Patent Document 6) can be used:

$$s(p, q) = \int \sqrt{p(x)}\,\sqrt{q(x)}\,dx$$

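When the probability distributions p and q are normal distributions, the integral can be evaluated analytically, and the logarithm of its reciprocal is known to be equivalent to the Bhattacharyya distance (Non-Patent Document 7), which in standard notation is

$$D_B(p, q) = \frac{1}{8}(\mu_p - \mu_q)^{\top} \Sigma^{-1} (\mu_p - \mu_q) + \frac{1}{2} \ln \frac{|\Sigma|}{\sqrt{|\Sigma_p|\,|\Sigma_q|}}, \qquad \Sigma = \frac{\Sigma_p + \Sigma_q}{2}$$

where \mu_p, \mu_q are the means and \Sigma_p, \Sigma_q the covariance matrices of p and q.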
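For normal distributions, the kernel value can be computed from the Bhattacharyya distance in closed form. The following minimal Python sketch assumes diagonal covariance matrices, so each distribution is given by per-dimension means and variances; that restriction and the function names are assumptions made for illustration, not something the patent specifies.

```python
import math
from typing import Sequence


def bhattacharyya_distance(
    mean_p: Sequence[float],
    var_p: Sequence[float],
    mean_q: Sequence[float],
    var_q: Sequence[float],
) -> float:
    """Bhattacharyya distance between two diagonal-covariance Gaussians."""
    if not (len(mean_p) == len(var_p) == len(mean_q) == len(var_q)):
        raise ValueError("all inputs must have the same dimensionality")
    distance = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        v = 0.5 * (vp + vq)  # averaged variance in this dimension
        distance += 0.125 * (mp - mq) ** 2 / v
        distance += 0.5 * math.log(v / math.sqrt(vp * vq))
    return distance


def bhattacharyya_kernel(
    mean_p: Sequence[float],
    var_p: Sequence[float],
    mean_q: Sequence[float],
    var_q: Sequence[float],
) -> float:
    """Kernel value exp(-D_B); equals 1 for identical distributions."""
    return math.exp(-bhattacharyya_distance(mean_p, var_p, mean_q, var_q))


if __name__ == "__main__":
    print(bhattacharyya_kernel([0.0], [1.0], [2.0], [1.0]))
```

With diagonal covariances the distance decomposes into a sum over dimensions, which avoids matrix inversion and determinants.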

When the output probability distributions are Gaussian mixtures, however, no analytical solution is known, so an approximate expression is used (the expression itself is missing from this text). In that expression, p_i and q_j denote the components of the respective Gaussian mixtures, and m_p(i) and m_q(j) denote their mixing ratios.
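Because the approximate expression for Gaussian mixtures is not recoverable from this text, the following Python sketch shows only one plausible reading and should not be taken as the patent's formula. It assumes the mixture similarity is a sum over all component pairs of the component similarity s(p_i, q_j), weighted by the product of the mixing ratios m_p(i) and m_q(j). The function name and the toy inputs are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

C = TypeVar("C")  # type of one mixture component


def mixture_similarity(
    weights_p: Sequence[float],
    comps_p: Sequence[C],
    weights_q: Sequence[float],
    comps_q: Sequence[C],
    s: Callable[[C, C], float],
) -> float:
    """ASSUMED form: sum_i sum_j m_p(i) * m_q(j) * s(p_i, q_j)."""
    if len(weights_p) != len(comps_p) or len(weights_q) != len(comps_q):
        raise ValueError("each mixture needs one weight per component")
    return sum(
        wp * wq * s(cp, cq)
        for wp, cp in zip(weights_p, comps_p)
        for wq, cq in zip(weights_q, comps_q)
    )


if __name__ == "__main__":
    # Toy stand-in: components are labels and s is exact equality.
    toy_s = lambda p, q: 1.0 if p == q else 0.0
    print(mixture_similarity([0.5, 0.5], ["x", "y"], [0.25, 0.75], ["x", "z"], toy_s))
```

Under this reading, a mixture with a single component of weight 1 reduces to the plain component similarity.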

Based on the similarity between subword units described above, the similarity between sequences of subword units is computed with the string kernel and applied to a support vector machine. A classifier for subword unit sequences, and hence a classifier for voice commands, can thus be constructed by learning. A classifier constructed in this way acquires high discriminative power through learning.

Furthermore, the voice command recognition processing described above can be extended to the case where the subword recognition process 106 outputs not only the single most likely subword unit sequence but multiple high-likelihood recognition candidates.

In that case, a graph whose nodes are phonemes can be used as a compact representation of the multiple recognition candidates. For example, the speech recognition engine Julius (Non-Patent Document 1) can output a graph whose edges carry recognition scores. To apply the SVM-based speech recognition processing that characterizes the present invention to such output, the kernel function of the Rational kernel (Non-Patent Document 4), which computes the similarity between weighted automata, is used. Using this kernel function in place of the string kernel described above yields a voice command classifier that takes recognition errors in the subword units into account.

Next, the SVM identification process 108 is described in detail. In the SVM identification process 108, the recognized subword unit sequence or subword unit graph is classified, using the classifiers loaded into memory from the classifier database, as one of the voice commands or as not being a voice command.

Basically, a classifier that can be trained with an SVM is a two-class classifier that determines whether or not an input belongs to a class A. To identify N voice commands, N two-class classifiers are therefore used, each determining whether or not the input is the corresponding command.

Thus, in the SVM identification process 108, the subword unit sequence or subword unit graph is input to the N classifiers, yielding N confidence values that indicate how plausible the input is as each command. The command with the highest confidence is adopted as the identification result. If the highest confidence is smaller than a predetermined threshold, however, the identification result is rejected and the input speech is judged to be speech other than a command. As described later, once a command is obtained as the identification result, the corresponding control signal for controlling the electric wheelchair is output.
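The decision rule of the SVM identification process can be sketched as follows. This minimal Python sketch assumes the N confidence values have already been computed by the N two-class classifiers and are passed in as a mapping from command name to confidence. The function name `identify_command` and the command names in the example are hypothetical.

```python
from typing import Mapping, Optional


def identify_command(confidences: Mapping[str, float], threshold: float) -> Optional[str]:
    """Pick the command with the highest confidence, or reject the input.

    Returns None when no command reaches the threshold, meaning the input
    is treated as speech other than a command.
    """
    if not confidences:
        return None
    best = max(confidences, key=lambda name: confidences[name])
    if confidences[best] < threshold:
        return None  # rejected: highest confidence is below the threshold
    return best


if __name__ == "__main__":
    scores = {"forward": 0.8, "stop": 0.3, "left": -0.4}
    print(identify_command(scores, threshold=0.5))
```

Returning None for rejected input lets the caller skip the control signal output entirely when the utterance is not a command.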

FIG. 2 is a block diagram showing the main configuration of the apparatus when the speech recognition processing according to the present invention is used to control an electric wheelchair. In FIG. 2, reference numeral 210 denotes the electric wheelchair, which is the device to be controlled. In the electric wheelchair 210, a drive motor 205 that drives the wheels is directly coupled to the axle, a control switch 206 for operating the electric wheelchair manually is placed within reach of the speaker's hand, and a microphone array 207 for operation by voice command is mounted at a suitable position, such as an armrest or the backrest of the electric wheelchair 210. The speech signal captured by the microphone array 207 is input to the speech recognition device 200, which recognizes the voice command in the signal and outputs the corresponding control signal to the controller 204. The controller 204 controls the drive motor 205 according to either the control signal output from the speech recognition device 200 or the manual control signal output from the control switch 206.

As shown in FIG. 2, in one embodiment of an electric wheelchair using the speech recognition processing according to the present invention, the electric wheelchair 210 is equipped with the speech recognition device 200. The speech recognition device 200 comprises the microphone array 207 serving as analog speech input means, AD conversion means 201 that digitizes the analog speech, a data processing device (CPU) 202 that performs subword recognition, trains the classifiers using the SVM, and identifies subword unit sequences or subword unit graphs, and a memory 203.

The speech recognition device 200 further comprises an acoustic model database 121 that stores the acoustic models used for subword recognition, a command sample database 122 that stores recognized subword unit sequences or subword unit graphs for use in command identification and classifier training, and a classifier database 123 that stores the trained classifiers.

As described above, in addition to operation by voice command, a joystick control switch 206 is provided for ordinary manual operation. The controller 204, which receives the control signal from the control switch 206 and the control signal from the speech recognition device 200, selects the appropriate control signal and ultimately controls the drive motor 205.
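The text does not say how the controller 204 chooses between the two control signals. Purely as an illustration, the sketch below assumes that a manual input, when present, takes priority over a voice command. This priority rule, the function name, and the string signal values are all assumptions for the example, not the arbitration logic of the invention.

```python
from typing import Optional


def select_control_signal(manual: Optional[str], voice: Optional[str]) -> Optional[str]:
    """Choose which control signal is forwarded to the drive motor.

    manual: signal from the control switch (joystick), or None if idle
    voice:  signal from the speech recognition device, or None if no
            command was recognized
    Assumed policy (not specified in the source): manual input wins.
    Returns None when neither source provides a signal.
    """
    if manual is not None:
        return manual
    return voice
```

Other policies, such as always honoring a "stop" signal from either source, would fit the same interface.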

Here, the acoustic model database 121, the command sample database 122, and the classifier database 123 are preferably implemented on nonvolatile memory or a hard disk drive that allows fast access and retains the database contents when the power is turned off.

Because these databases (121, 122, 123) are preferably prepared for each speaker, multiple sets of databases are installed when multiple speakers use the electric wheelchair. However, experiments have confirmed that sharing the acoustic model database 121 among multiple speakers does not degrade performance very much.

FIG. 3 shows the results of a voice command identification experiment with a cerebral palsy patient using the speech recognition processing according to the present invention. In the figure, "STD-FULL" (two-dot chain line) is the result obtained when phoneme sequences based on the standard pronunciation of able-bodied speakers were registered in the command dictionary; "ALL-FULL" (broken line) is the result obtained when phoneme sequences produced by a phoneme typewriter from the subject's actual voice commands were registered in the dictionary; and "SVM-FULL" (solid line) is the result obtained with the speech recognition processing according to the present invention.

In every case, the experiment discriminated among five commands, corresponding to forward, backward, right, left, and stop, and speech other than commands, and the recall and precision of the commands were plotted while varying the threshold. Taking into account factors such as how easily the subject could utter them, the five commands used were forward: /mae/, backward: /koutai/, right: /migi/, left: /hidari/ or /dari/, and stop: /a-/.
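A recall and precision curve of this kind can be produced by sweeping the rejection threshold over a set of scored test utterances. The sketch below assumes one common definition (precision: correct commands among accepted utterances; recall: correct commands among true command utterances); the exact definitions used in the experiment are not given here. The sample format, the function name, and the data are made up for illustration.

```python
def precision_recall(samples, threshold):
    """Precision and recall of command identification at one threshold.

    samples: list of (true_label, predicted_command, confidence), where
             true_label is None for speech that is not a command
             (hypothetical format for this sketch)
    """
    accepted = 0          # utterances whose confidence passes the threshold
    accepted_correct = 0  # accepted utterances identified as the right command
    commands = 0          # utterances that really are commands
    for true_label, predicted, confidence in samples:
        if true_label is not None:
            commands += 1
        if confidence >= threshold:
            accepted += 1
            if predicted == true_label:
                accepted_correct += 1
    # Convention assumed here: precision is 1.0 when nothing is accepted.
    precision = accepted_correct / accepted if accepted else 1.0
    recall = accepted_correct / commands if commands else 0.0
    return precision, recall


samples = [
    ("forward", "forward", 0.9),
    ("stop", "stop", 0.4),
    (None, "right", 0.5),     # non-command speech scored as "right"
    ("left", "right", 0.2),   # misidentified command
]
curve = [precision_recall(samples, t) for t in (0.3, 0.6)]
```

Each threshold yields one point on the curve; a higher threshold generally raises precision at the cost of recall.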

These results show that, over most of the recall range, the speech recognition method according to the present invention achieves higher precision than the existing methods.

According to the unclear voice command recognition device of the present invention, devices can be controlled by voice command by persons with disabilities and elderly people who could not use conventional speech recognition devices because their speech is unclear, and in situations where conventional speech recognition devices could not be used because of heavy noise.

FIG. 1 is a flowchart showing an example of the processing flow of the speech recognition processing in the unclear voice command recognition device, the unclear voice command recognition processing method, and the unclear voice command recognition processing program of the present invention.
FIG. 2 is a block diagram showing the main configuration of the apparatus when the speech recognition processing according to the present invention is used to control an electric wheelchair.
FIG. 3 is a diagram showing the results of a voice command identification experiment with a cerebral palsy patient using the speech recognition processing according to the present invention.

Explanation of symbols

121 Acoustic model database
122 Command sample database
123 Classifier database
200 Speech recognition device
201 AD converter
202 Data processing device (CPU)
203 Memory
204 Controller
205 Drive motor
206 Control switch
207 Microphone array
210 Electric wheelchair

Claims

An unclear voice command recognition device for operating a device using unclearly uttered speech as a voice command,
Voice input means for converting voice signals of voice commands into electric signals and inputting the voice signals;
Analog-digital conversion means for converting the electrical signal of the voice signal input by the voice input means into digital data;
Feature extraction means for extracting feature vectors from digital data of speech signals by cepstrum analysis;
Subword recognition means for recognizing a subword unit of speech from a time series of feature vectors and outputting a sequence or graph of the subword unit;
Database update means for generating training data for command identification from the sequence or graph of subword units and registering it in the database;
A support vector machine learning means for configuring a command classifier by a support vector machine based on the training data;
A voice identification means for identifying a voice command by data processing by a command identifier constituted by the support vector machine learning means;
A control signal output means for generating a control signal according to the voice command identified by the voice identification means and sending the control signal to the control target device;
An unclear voice command recognition device characterized by comprising the above means.

In the unclear voice command recognition device according to claim 1,
An unclear voice command recognition apparatus, wherein the subword recognition means recognizes a subword unit based on a unit of a phoneme, a syllable, or a phoneme of a voice signal.

In the unclear voice command recognition device according to claim 1,
wherein the support vector machine learning means and the voice identification means use a string kernel for a subword unit sequence and a rational kernel for a subword unit graph.

In the unclear voice command recognition device according to claim 1,
wherein the voice input means is a microphone array in which a plurality of microphones are arranged, the device further comprising:
sound source separation means for separating a plurality of sound sources from the digital data of the plurality of speech signals obtained from the microphone array and removing directional noise;
feature correction means for removing stationary noise from the speech signal; and
non-speech identification means for distinguishing speech from non-speech by fundamental frequency estimation.

An unclear voice command recognition processing method for recognizing and processing an unclear voice command in order to operate a device using unclearly uttered speech as a voice command,
A voice input step of converting a voice signal of a voice command into an electric signal and inputting the voice signal;
An analog-digital conversion step in which the electrical signal of the audio signal input by the audio input means is converted into digital data;
A feature extraction step of extracting a feature vector from the digital data of the speech signal by cepstrum analysis;
A subword recognition step of recognizing a subword unit of speech from a time series of feature vectors and outputting a sequence or graph of the subword unit;
A database update step of generating training data for command identification from the sequence or graph of subword units and registering it in the database;
A support vector machine learning step of configuring a command classifier by a support vector machine based on the training data;
A voice identification step of identifying a voice command by data processing by a command classifier configured by the support vector machine learning step;
A control signal output step of generating a control signal according to the voice command identified by the voice identification means and sending it to the device to be controlled;
An unclear voice command recognition processing method characterized by executing the above steps.

The unclear voice command recognition processing method according to claim 5,
In the subword recognition step, the subword unit is recognized by any one unit of a phoneme, a syllable, or a phoneme of a speech signal.

The unclear voice command recognition processing method according to claim 5,
wherein, in the support vector machine learning step and the voice identification step, a string kernel is used for a subword unit sequence and a rational kernel is used for a subword unit graph.

The unclear voice command recognition processing method according to claim 5,
wherein the voice input step inputs a plurality of speech signals from a microphone array in which a plurality of microphones are arranged, and the method further executes:
a sound source separation step of separating a plurality of sound sources from the digital data of the plurality of speech signals obtained from the microphone array and removing directional noise;
a feature correction step of removing stationary noise from the speech signal; and
a non-speech identification step of distinguishing speech from non-speech by fundamental frequency estimation.

A program for executing unclear voice command recognition processing for operating a device using unclearly uttered speech as a voice command, the program causing a computer to function as:
voice input means for converting the speech signal of a voice command into an electrical signal and inputting it;
analog-digital conversion means for converting the electrical signal of the speech signal input by the voice input means into digital data;
feature extraction means for extracting feature vectors from the digital data of the speech signal by cepstrum analysis;
subword recognition means for recognizing subword units of speech from the time series of feature vectors and outputting a sequence or graph of subword units;
database update means for generating training data for command identification from the sequence or graph of subword units and registering it in the database;
support vector machine learning means for constructing a command classifier with a support vector machine based on the training data;
voice identification means for identifying a voice command by data processing using the command classifier constructed by the support vector machine learning means; and
control signal output means for generating a control signal according to the voice command identified by the voice identification means and sending it to the device to be controlled.

The unclear voice command recognition program according to claim 9,
The subword recognition means recognizes a subword unit based on any one of a phoneme, a syllable, or a phoneme of a speech signal.

The unclear voice command recognition program according to claim 9,
wherein the support vector machine learning means and the voice identification means use a string kernel for a subword unit sequence and a rational kernel for a subword unit graph.

The unclear voice command recognition program according to claim 9,
wherein the voice input means functions to input a plurality of speech signals from a microphone array in which a plurality of microphones are arranged, and the program further causes the computer to function as:
sound source separation means for separating a plurality of sound sources from the digital data of the plurality of speech signals obtained from the microphone array and removing directional noise;
feature correction means for removing stationary noise from the speech signal; and
non-speech identification means for distinguishing speech from non-speech by fundamental frequency estimation.