JP2007206603A

JP2007206603A - Method of creating acoustic model

Info

Publication number: JP2007206603A
Application number: JP2006028213A
Authority: JP
Inventors: Mitsunobu Kaminuma; 充伸神沼; Masato Akagi; 正人赤木
Original assignee: Japan Advanced Institute of Science and Technology; Nissan Motor Co Ltd
Current assignee: Japan Advanced Institute of Science and Technology; Nissan Motor Co Ltd
Priority date: 2006-02-06
Filing date: 2006-02-06
Publication date: 2007-08-16

Abstract

PROBLEM TO BE SOLVED: To provide an acoustic model capable of recognizing speech with good performance even when utterance distortion arising from riding in a vehicle or the like exists.

SOLUTION: A method of creating an acoustic model includes: a step of recording prescribed speech in an environment having substantially no acoustic noise to create an undistorted speech corpus 1; a step of correcting the speech of the undistorted speech corpus 1 using an utterance conversion filter means 2, which transforms it into utterances corresponding to the moving speed of a mobile body, to create a mobile environment speech corpus 3; and a step of training on the mobile environment speech corpus 3 as learning data to create the acoustic model 4.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a method for creating an acoustic model that may be used in a moving body, typified by an automobile, and to a speech recognition device and speech recognition.

Automobile navigation devices that recognize the utterances of drivers and passengers (hereinafter also referred to simply as occupants) and use them as operation instructions have been put into practical use. For speech recognition devices used in automobiles in particular, it is known that the change of environment that comes with riding in an automobile affects the occupants, and that the resulting utterance distortion affects the recognition performance of the speech recognition system.

For example, the utterance distortion that arises when a speaker in an automobile hears running noise (called the Lombard effect) is known to be a very robust phenomenon. The inventors' research has also confirmed that occupants, the driver in particular, become tense when riding in an automobile, and that this tension causes utterance distortion (see FIG. 13). In most cases such utterance distortion arises naturally and unconsciously.

As shown in FIGS. 14 and 15, a conventional speech recognition device includes an element called an acoustic model, which holds the information used for speech analysis, feature extraction, pattern matching, and the like.

An acoustic model describes phoneme or phonological labels together with the signals obtained by converting the speech signals corresponding to those labels into acoustic feature information and the like. In recent years, speech recognition methods using hidden Markov models (HMMs) have been widely adopted. A hidden Markov model is a probabilistic model that assumes the system is a Markov process with unknown parameters and estimates those parameters from observable information. For an acoustic model based on hidden Markov models, the target acoustic model is created by training the hidden Markov model on a speech corpus, consisting of speech signals (speech waveforms converted to discrete signals) and the phoneme labels assigned to them, as learning data.
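As a rough illustration of how a hidden Markov model assigns a probability to an observed sequence, the following minimal sketch implements the standard forward algorithm for a toy discrete-output model. The two-state model, its probabilities, and the function name `forward_likelihood` are illustrative assumptions and do not come from the patent, which does not specify the model topology or feature representation.

```python
from typing import List, Sequence


def forward_likelihood(
    init: Sequence[float],
    trans: Sequence[Sequence[float]],
    emit: Sequence[Sequence[float]],
    observations: Sequence[int],
) -> float:
    """Return P(observations | model) for a discrete-output HMM.

    init[s]      : probability of starting in state s
    trans[p][s]  : probability of moving from state p to state s
    emit[s][o]   : probability of emitting symbol o while in state s
    """
    n_states = len(init)
    # alpha[s] = P(observations so far, current state = s)
    alpha: List[float] = [init[s] * emit[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][obs]
            for s in range(n_states)
        ]
    return sum(alpha)


# Toy two-state model with two output symbols (all values are made up).
INIT = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]
```

Training, for example with the Baum-Welch algorithm, adjusts `INIT`, `TRANS`, and `EMIT` so that this likelihood becomes large for the labelled training data; that step is not shown here.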

Because such an acoustic model depends on the content of the speech corpus used as learning data, the content of the speech corpus should match the actual environment in which speech recognition is used.

That is, when creating an acoustic model for a speech recognition device, the closer the features of the utterance data in the speech corpus are to the features of the speaker's utterances in the actual environment, the higher the recognition performance. To create an acoustic model for an in-vehicle speech recognition device, it is therefore desirable to use a speech corpus of utterances in which this utterance distortion has occurred.

However, recording utterance data that contains utterance distortion requires having the driver speak while driving and recording in such a way that ambient noise is not captured, so creating such a speech corpus is technically extremely difficult. Even if technological progress were to make this feasible, creating a speech corpus for every driving environment of the vehicle would take enormous man-hours and cost and would not be practical.

On the other hand, methods for correcting speech that contains utterance distortion and is input in the actual usage environment have also been proposed (for example, Non-Patent Document 1). However, because the behavior of utterance distortion during driving had not been clarified, the processing is limited to correcting only a partial region such as the cepstrum, and the recognition performance cannot be called sufficient.
Non-Patent Document 1: Tadashi Suzuki, Yoshiharu Abe, Kunio Nakajima (Mitsubishi Electric & Information Technology Laboratories), "Speech recognition under noisy environments using a speech distortion model", Proceedings of the Acoustical Society of Japan, March 1995

An object of the present invention is to provide an acoustic model, a speech recognition device, and a speech recognition method capable of recognizing speech with good performance even when utterance distortion caused by riding in a vehicle or the like is present.

To achieve the above object, the invention according to a first aspect is characterized by recording prescribed speech in an environment substantially free of acoustic noise to create an undistorted speech corpus, correcting the speech of this undistorted speech corpus using an utterance conversion filter means that transforms it into utterances corresponding to the moving speed of a moving body to create a mobile environment speech corpus, and training on this mobile environment speech corpus to create an acoustic model.

The invention according to a second aspect is characterized by correcting the speech signal to be recognized, according to the detected moving speed of the moving body, into a speech signal having the same acoustic and statistical characteristics as the speech corpus used when training the acoustic model, and converting the corrected speech signal into a label signal using the acoustic model.

In the invention according to the first aspect, utterance distortion due to movement is taken into account by transforming the speech into utterances corresponding to the moving speed of the moving body when creating the mobile environment speech corpus that serves as the learning data of the acoustic model. In the invention according to the second aspect, utterance distortion due to movement is taken into account by correcting the speech signal to be recognized, when decoding with the acoustic model, so that it has the same acoustic and statistical characteristics as the speech corpus used as the learning data of the acoustic model. As a result, speech can be recognized with good performance even when utterance distortion caused by riding in a vehicle or the like is present.

In the claims, description, and drawings, an "environment substantially free of acoustic noise" means an environment or state in which there is neither noise caused by the driving environment or driving state nor any change in physiological phenomena such as speaker tension.

BEST MODE FOR CARRYING OUT THE INVENTION

Embodiments of the present invention are described below with reference to the drawings. The description takes as examples a speech recognition device and a speech recognition method that recognize the utterances of a driver or passenger as operation instructions for various in-vehicle devices mounted in an automobile as a moving body, such as a navigation device, an air conditioner, and an audio device, together with a method for creating the acoustic model used by them.

However, the acoustic model creation method, speech recognition device, and speech recognition method of the present invention can also be applied to moving bodies other than automobiles. Applying them to operation instructions for devices that are only sometimes used in a moving body, such as mobile phones, and not just to devices that are always used in a moving body, such as in-vehicle devices, is also within the scope of the present invention.

<< First Embodiment >>
FIG. 1 is a block diagram showing a first embodiment of the method for creating an acoustic model of the present invention.

The acoustic model creation method of the present embodiment includes at least: a first step of recording prescribed speech in an environment substantially free of acoustic noise to create an undistorted speech corpus 1; a second step of correcting the speech of the undistorted speech corpus 1 created in the first step, using an utterance conversion filter means 2 that transforms it into utterances corresponding to the moving speed of the moving body, to create a mobile environment speech corpus 3; and a third step of training on the mobile environment speech corpus 3 created in the second step as learning data to create the target acoustic model 4.
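The data flow of the first and second steps can be sketched as follows. This is only an outline under stated assumptions: the `Utterance` type, the function names, and the use of a plain gain as a stand-in conversion filter are all hypothetical, and the third step (model training) is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Utterance:
    samples: List[float]  # waveform samples of one recorded utterance
    labels: List[str]     # phoneme labels (hypothetical representation)


# A conversion filter maps a clean utterance to a distorted one for one speed.
ConversionFilter = Callable[[Utterance], Utterance]


def gain_filter(gain_db: float) -> ConversionFilter:
    """Stand-in filter that only raises the level by gain_db decibels."""
    scale = 10.0 ** (gain_db / 20.0)

    def apply(utt: Utterance) -> Utterance:
        # Labels are copied unchanged: the filter alters the signal only.
        return Utterance([s * scale for s in utt.samples], list(utt.labels))

    return apply


def build_mobile_corpus(
    clean_corpus: List[Utterance],
    filters_by_speed: Dict[int, ConversionFilter],
) -> Dict[int, List[Utterance]]:
    """Step 2: apply each per-speed filter to every clean utterance."""
    return {
        speed: [convert(utt) for utt in clean_corpus]
        for speed, convert in filters_by_speed.items()
    }
```

A call such as `build_mobile_corpus(clean, {0: gain_filter(0.0), 100: gain_filter(6.0)})` yields one converted copy of the clean corpus per speed; the union of these copies would then serve as the learning data in the third step.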

The acoustic model 4 is a dictionary used when analyzing the features of an input speech signal and converting it into phoneme or phonological information. It is a signal model that probabilistically generates time series of various lengths in order to represent the acoustic features of the many speech signals contained in a speech corpus (speech signals with text labels).

In general, signal models based on hidden Markov models are widely used for the acoustic model 4, and it is known that such a hidden Markov model can be trained using, for example, the Baum-Welch learning algorithm (see, for example, Japanese Patent Laid-Open No. 9-152886 and "Speech Recognition System", Ohmsha). For such an acoustic model, the recognition performance of the speech recognition system is considered to improve as the acoustic features of the set of speech in the training corpus come closer to the acoustic features of the set of speech input in the real environment. Improving the performance of the speech recognition system therefore requires collecting utterances close to those of the real environment and using them as the speech corpus.

That is, to create an acoustic model that can cope with utterance distortion during driving, the speech corpus used as learning data must contain speech exhibiting that distortion.

In the present embodiment, therefore, the undistorted speech corpus 1 is created in advance by recording in a clean environment little affected by noise and reverberation, such as a soundproof room. The clean speech signals in this undistorted speech corpus 1 are transformed, using utterance conversion filter means 2 prepared for each vehicle speed, into speech signals with the utterance distortion of driving, and the result is taken as the mobile environment speech corpus 3. The acoustic model 4 is then trained on this mobile environment speech corpus 3 as learning data using the Baum-Welch learning algorithm mentioned above.

As a result, the mobile environment speech corpus 3 that serves as the learning data of the target acoustic model 4 also contains speech signals representative of driving, and speech recognition performed with the acoustic model 4 trained in this way achieves robust performance whose recognition accuracy does not degrade even for utterances made in a driving environment.

For the undistorted speech corpus 1, a very large number of utterances are recorded on a personal computer or the like in a stationary environment that is little affected by noise and reverberation and is free of the speaker tension of a driving environment, for example the soundproof room mentioned above. The undistorted speech corpus 1 is then obtained by attaching label information indicating which span of each recorded utterance signal (a sound file on the computer) corresponds to which phoneme.

Once the undistorted speech corpus 1 has been created, its speech is converted for each driving speed of the automobile using the utterance conversion filter means 2 to create the mobile environment speech corpus 3.

The following can be given as the utterance conversion filter means 2 (parameters) of the present embodiment. Their content is based on the results of the following experiment conducted by the inventors.

First, the utterances of six subjects, male and female, were recorded both in a driving environment on a test course and in a laboratory environment. In the driving environment, utterances were recorded while each of four speed conditions was maintained: idling, 30 km/h, 60 km/h, and 100 km/h. The subjects were seated in the driver's seat or the front passenger seat, and a close-talking microphone was used. Each subject read from sheets placed in front of them, with one word written per page, and recordings were made one utterance at a time under each of the environments.

In the laboratory environment, subjects in a semi-anechoic room listened to noise that had been recorded binaurally at the positions of a subject's two ears while idling and while driving at 30 km/h, 60 km/h, and 100 km/h. While listening, they read from sheets placed in front of them, with one word written per page, and recordings were made one utterance at a time under each of the environments. The noise was adjusted to the sound pressure level at the time of its recording.

The utterances of the six male and female speakers recorded under each of these environments were collected, and the variation of the parameters relating to power, fundamental frequency, spectral tilt, formant frequencies, and speaking rate was analyzed using the high-quality speech analysis, conversion, and synthesis method STRAIGHT (Hideki Kawahara, "Auditory scene analysis and the high-quality speech analysis, conversion, and synthesis method STRAIGHT", Proceedings of the Acoustical Society of Japan, 1-2-1, pp. 189-192, Sep. 1997).

(1) Speech power (energy)
The inventors' experiments showed that, in 90% or more of the analysis results, speech power increases as driving speed increases. Specifically, for a change in vehicle speed from 0 km/h to 100 km/h, it became clear that correction from the laboratory environment to the driving environment is possible by increasing the speech power by 3.5 dB on average and 7 dB at most. It was also confirmed that, when the undistorted speech corpus 1 has been created in an acoustically clean environment (a soundproof room or the like), increasing the speech power by 6.8 dB on average and 14 dB at most makes it possible to create, from the undistorted speech corpus 1 recorded in the clean environment, a mobile environment speech corpus containing the utterance distortion of driving at 100 km/h.

In the present embodiment, therefore, as one form of the utterance conversion filter means 2, the mean power of the speech signal after unit-frame extraction (the mean power of the speech signal observed for a predetermined time using a window function or the like), or the power sum of the extracted speech section, is increased as the driving speed increases. Overall, the speech power may be increased by up to 7 dB relative to idling, and by up to about 14 dB relative to speech recorded in a clean environment.
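A minimal sketch of such a power correction is given below. The text gives only the end points (up to 7 dB relative to idling at 100 km/h), so the linear interpolation between 0 and 100 km/h, the clamping of speeds outside that range, and all function names are assumptions made for illustration.

```python
import math
from typing import List


def gain_db_for_speed(
    speed_kmh: float, max_gain_db: float = 7.0, max_speed_kmh: float = 100.0
) -> float:
    """Gain in dB relative to idling; linear in speed (an assumption)."""
    clamped = min(max(speed_kmh, 0.0), max_speed_kmh)
    return max_gain_db * clamped / max_speed_kmh


def apply_gain(frame: List[float], gain_db: float) -> List[float]:
    """Scale the samples of one frame so that its power rises by gain_db."""
    scale = 10.0 ** (gain_db / 20.0)  # amplitude ratio for a power gain in dB
    return [s * scale for s in frame]


def mean_power_db(frame: List[float]) -> float:
    """Mean power of one frame in dB (reference power 1.0)."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(power)
```

For a corpus recorded in a clean environment rather than at idling, `max_gain_db` would be set to a value of up to about 14 dB according to the figures above.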

This parameter need not be changed when the method is applied to a speech recognition system that absorbs the effect of power variation by normalization or a similar technique, or to one that does not use the power variation of individual phonemes and phonological units as a recognition parameter.

(2) Fundamental frequency of speech
The lowest frequency among the harmonic components of speech is called the fundamental frequency of speech. It is known to coincide with the fundamental vibration frequency of the human vocal cords and is regarded as the physical correlate of pitch.

The inventors' experiments showed that, in 70% or more of the analysis results, the fundamental frequency increases as driving speed increases. Specifically, for a change in vehicle speed from 0 km/h to 100 km/h, the fundamental frequency increased by up to 21 Hz. In a comparison between the undistorted speech corpus recorded in the clean environment and the mobile environment speech corpus, it increased by up to 36 Hz. However, this parameter has a large variance, and for some speakers no change was observed.

In the present embodiment, therefore, as one form of the utterance conversion filter means 2, the fundamental frequency is increased as the driving speed increases. More specifically, since the maximum change is 36 Hz, the fundamental frequency is increased by, for example, 9 Hz for every 25 km/h over a vehicle speed range of 0 km/h to 100 km/h.
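The example schedule of 9 Hz per 25 km/h can be written down as a small lookup, sketched below. Treating the schedule as a step function, clamping speeds outside 0 to 100 km/h, and marking unvoiced frames with a fundamental frequency of 0 are assumptions; the function names are hypothetical.

```python
from typing import List


def f0_shift_hz(
    speed_kmh: float,
    step_kmh: float = 25.0,
    step_hz: float = 9.0,
    max_speed_kmh: float = 100.0,
) -> float:
    """Fundamental-frequency increase for a given speed (step function)."""
    clamped = min(max(speed_kmh, 0.0), max_speed_kmh)
    return step_hz * int(clamped // step_kmh)


def shift_f0_contour(f0_hz: List[float], speed_kmh: float) -> List[float]:
    """Raise every voiced frame of an F0 contour by the speed-dependent shift.

    Frames with F0 <= 0 are treated as unvoiced and left unchanged
    (a convention assumed here, not stated in the text).
    """
    shift = f0_shift_hz(speed_kmh)
    return [f + shift if f > 0.0 else f for f in f0_hz]
```

With these defaults the shift reaches the stated maximum of 36 Hz at 100 km/h.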

In principle, the utterance conversion filter means 2 is configured to increase the fundamental frequency as the driving speed increases. However, because this parameter has a large variance as noted above, the speech corpus corrected by the utterance conversion filter means 2 and the uncorrected speech corpus may be kept side by side. Furthermore, when creating the acoustic model, a search path built from the corrected speech corpus (mobile environment speech corpus 3) and a search path built from the uncorrected speech corpus (undistorted speech corpus 1) may be kept side by side.

The inventors' experiments also confirmed that the variation differs with the type of vowel, so the increase may be set for each detected vowel. For example, based on the per-vowel variation in the survey results, the settings may be an increase of about 25 Hz for /a/, 0 Hz for /i/ because its variation is inconsistent, about 29 Hz for /u/, about 36 Hz for /e/, and about 26 Hz for /o/. This is described later.
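The per-vowel settings listed above amount to a lookup table, sketched here. The table values are taken from the text; applying no shift to labels other than the five vowels and leaving unvoiced frames (F0 of 0) unchanged are assumptions, and the names are hypothetical.

```python
# Per-vowel increase of the fundamental frequency in Hz, as listed in the text.
VOWEL_F0_SHIFT_HZ = {"a": 25.0, "i": 0.0, "u": 29.0, "e": 36.0, "o": 26.0}


def shift_f0_by_vowel(f0_hz: float, phoneme: str) -> float:
    """Raise F0 by the amount set for the detected vowel.

    Phonemes without an entry get no shift, and frames with F0 <= 0 are
    treated as unvoiced and returned unchanged (both are assumptions).
    """
    if f0_hz <= 0.0:
        return f0_hz
    return f0_hz + VOWEL_F0_SHIFT_HZ.get(phoneme, 0.0)
```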

(3) Spectral regression line of speech
The spectral regression line of speech is obtained by approximating the frequency spectrum of the speech spectral envelope from 0 Hz to 4 kHz with a first-order straight line. An example is shown as straight line X in FIG. 10.
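One way to obtain such a line is an ordinary least-squares fit over the points of the envelope between 0 Hz and 4 kHz, sketched below. The patent does not state the fitting method or the units of the slope, so least squares, a level axis in dB, and the function name are assumptions.

```python
from typing import List, Tuple


def spectral_regression_line(
    freqs_hz: List[float], levels_db: List[float], f_max_hz: float = 4000.0
) -> Tuple[float, float]:
    """Fit level = slope * frequency + intercept over 0 Hz to f_max_hz.

    Returns (slope in dB per Hz, intercept in dB). Points above f_max_hz
    are ignored. Needs at least two distinct frequencies in range.
    """
    points = [(f, l) for f, l in zip(freqs_hz, levels_db) if 0.0 <= f <= f_max_hz]
    n = len(points)
    mean_f = sum(f for f, _ in points) / n
    mean_l = sum(l for _, l in points) / n
    sxx = sum((f - mean_f) ** 2 for f, _ in points)
    sxy = sum((f - mean_f) * (l - mean_l) for f, l in points)
    slope = sxy / sxx
    return slope, mean_l - slope * mean_f
```

Comparing the slope fitted to a clean utterance with the slope fitted to the same word spoken while driving gives the change in spectral tilt discussed in this section.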

The inventors' experiments showed that, in 80% or more of the analysis results, the slope of the spectral regression line increases as driving speed increases. For example, relative to undistorted speech signals recorded in the clean environment, the spectral slope of speech signals recorded in the mobile environment at 100 km/h increased by up to 0.0081. Among speech signals recorded in the mobile environment, on the other hand, for example between 0 km/h and 100 km/h, the slope of the spectral regression line increased, by up to 0.0067, in only about 56% of the analysis results.

In the present embodiment, therefore, as one form of the utterance conversion filter means 2, when the speech corpus used for training the acoustic model consists of undistorted speech signals from a clean environment, the correction increases the slope of the spectral regression line as the driving speed increases.

For example, the high-frequency range is boosted so that the level rises by 0.5 dB per 1 kHz up to 4 kHz, and the slope is increased as 0.5 → 1.0 → 1.5 → 2.0 dB each time the driving speed increases through idling (0 km/h) → 30 km/h → 60 km/h → 100 km/h (an increase of about 1.5 dB at a time for the waveform as a whole). The same parameter transformation may be applied to all phonemes, or, as with the fundamental frequency described above, separate increases may be set for individual phonemes such as vowels.
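The example schedule can be sketched as a tilt applied to a spectrum expressed in dB. Several details are assumptions because the text leaves them open: the tabulated value is selected as the one for the highest listed speed not exceeding the current speed, the boost is held constant above 4 kHz, and the correction is applied to a list of (frequency, level) points. The names are hypothetical.

```python
from typing import List

# (speed threshold in km/h, tilt in dB per kHz), following the example above.
TILT_DB_PER_KHZ = [(0.0, 0.5), (30.0, 1.0), (60.0, 1.5), (100.0, 2.0)]


def tilt_for_speed(speed_kmh: float) -> float:
    """Tilt for the highest tabulated speed not exceeding speed_kmh."""
    tilt = TILT_DB_PER_KHZ[0][1]
    for threshold, value in TILT_DB_PER_KHZ:
        if speed_kmh >= threshold:
            tilt = value
    return tilt


def apply_tilt(
    freqs_hz: List[float],
    levels_db: List[float],
    tilt_db_per_khz: float,
    f_max_hz: float = 4000.0,
) -> List[float]:
    """Add a linear high-frequency boost to a spectrum given in dB.

    The boost grows linearly with frequency up to f_max_hz and is held at
    its f_max_hz value above that (an assumption).
    """
    return [
        level + tilt_db_per_khz * min(freq, f_max_hz) / 1000.0
        for freq, level in zip(freqs_hz, levels_db)
    ]
```

Calling `apply_tilt(freqs, levels, tilt_for_speed(speed))` gives the corrected envelope, which would then be resynthesized into a waveform by an analysis and synthesis tool; that step is outside this sketch.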

(4) Formant frequencies of speech
As shown in FIG. 10, a formant frequency is the center frequency, or the frequency of maximum amplitude, of a peak produced where energy is concentrated in a particular frequency region of the speech spectral envelope. Formant frequencies are observed mainly in steady-state vowels and are called the first formant, second formant, and so on in order from the lowest frequency; the combination of formant frequencies differs from vowel to vowel. Formants are a characteristic produced by the resonance of the human vocal tract during speech production.

FIG. 11 (source: Tohru Ifukube, "Design of Voice Typewriter", CQ Publishing, 1984) plots the distribution of the five Japanese vowels (a, i, u, e, o) with the first formant frequency F1 on the horizontal axis and the second formant frequency F2 on the vertical axis. The ○ marks joined by solid lines show male speech, and the ● marks joined by dotted lines show female speech. In utterances made in a mobile environment, however, the distribution changes as shown in FIG. 12, and, as shown in that figure, the sound /u/ may in some cases be recognized as /e/.

The formant frequency parameters thus showed a correlation between the clean environment and the driving environment, and the results also differed with gender and other factors.

(4-1) First formant frequency
In the laboratory-environment survey, comparing clean speech signals with speech signals for 100 km/h driving, an increasing tendency of the first formant frequency was observed in 80% or more of the analysis results. Specifically, between the clean speech signal and the 100 km/h condition, the first formant frequency increased by about 300 Hz at most and about 50 Hz on average.

Among the driving speech signals themselves, for example between 0 km/h and 100 km/h, an increasing tendency of the first formant frequency was likewise observed in 75% or more of the analysis results. Specifically, between the 0 km/h and 100 km/h conditions, the first formant increased by about 300 Hz at most and about 10 Hz on average.

In the driving-environment survey, by contrast, the tendency changed with the gender of the subject. For female subjects, comparing clean speech signals with speech signals recorded at 100 km/h, the first formant decreased in 70% or more of the results. Specifically, between the clean speech signal and the 100 km/h condition, the first formant decreased by about 100 Hz at most and about 20 Hz on average.

Among the driving speech signals themselves, for example between 0 km/h and 100 km/h, a decreasing tendency of the first formant frequency was likewise observed in 75% or more of the analysis results. Specifically, between the 0 km/h and 100 km/h conditions, the first formant decreased by about 170 Hz at most and about 50 Hz on average.

(4-2) Second formant frequency
Neither the laboratory-environment survey nor the driving-environment survey showed a bias toward increase or decrease. For male subjects, however, a slight increasing tendency was seen in both environments. Specifically, in the driving environment, between the clean speech signal and the 100 km/h condition the second formant increased by about 150 Hz at most and about 7 Hz on average, and between the 0 km/h and 100 km/h conditions it increased by about 130 Hz at most and about 50 Hz on average.

The second formant frequency also behaves differently from phoneme to phoneme. For example, for the change from the clean speech signal in the driving environment to driving at 100 km/h, the phoneme /a/ (あ) tended to decrease overall for both men and women, by up to about 90 Hz.

(4-3) Third formant
In the laboratory environment no bias toward increase or decrease was seen. In the driving environment, however, the third formant increased in 70% or more of the results when the clean speech signal was compared with the speech signal recorded while driving at 100 km/h. Male subjects showed an increasing trend overall.

Specifically, in the driving environment the third formant increased by up to about 390 Hz and by about 100 Hz on average for the change from the clean speech signal to driving at 100 km/h, and by up to about 540 Hz and about 30 Hz on average for the change from 0 km/h to 100 km/h.

The third formant frequency also behaves differently from phoneme to phoneme. For example, for the change from the clean speech signal in the driving environment to driving at 100 km/h, the phoneme /o/ (お) tended to increase overall for both men and women, by up to about 20 Hz, whereas the phoneme /e/ (え) tended to decrease for both men and women, by up to about 70 Hz.

Therefore, in this embodiment, one form of the utterance conversion filter means 2 preferably modifies the formant-frequency parameters separately for each vowel, for each speech-corpus environment, and for each sex. Table 1 below gives an example, based on the survey results, of formant-frequency modifications that differ from vowel to vowel.
(Table 1) Increase in formant frequency for each 30 km/h rise in speed
Vowel   First formant   Second formant   Third formant
/a/     10 Hz           20 Hz            10 Hz
/i/     20 Hz           5 Hz             0 Hz
/u/     20 Hz           0 Hz             10 Hz
/e/     20 Hz           5 Hz             0 Hz
/o/     15 Hz           10 Hz            0 Hz
These amounts are only examples and may be adjusted as needed according to the recording environment of the undistorted speech corpus and the target driving environment.
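As a rough illustration of how the per-vowel values of Table 1 could be turned into a shift for a given driving speed, the following Python sketch may help. The function names and the rule of counting only whole 30 km/h steps are assumptions made for illustration; the description only states the increment per 30 km/h and does not say how intermediate speeds are handled.

```python
# Per-vowel increase in Hz of (F1, F2, F3) for each 30 km/h rise in speed (Table 1).
FORMANT_STEP_HZ = {
    "a": (10, 20, 10),
    "i": (20, 5, 0),
    "u": (20, 0, 10),
    "e": (20, 5, 0),
    "o": (15, 10, 0),
}
SPEED_STEP_KMH = 30


def formant_shift(vowel, speed_kmh):
    """Return the (F1, F2, F3) shift in Hz for a vowel at the given speed.

    Assumption: only whole 30 km/h steps are counted (floor division).
    """
    steps = int(speed_kmh // SPEED_STEP_KMH)
    return tuple(step_hz * steps for step_hz in FORMANT_STEP_HZ[vowel])


def shift_formants(formants_hz, vowel, speed_kmh):
    """Apply the Table 1 shift to a measured (F1, F2, F3) triple."""
    deltas = formant_shift(vowel, speed_kmh)
    return tuple(f + d for f, d in zip(formants_hz, deltas))
```

For instance, `formant_shift("a", 60)` covers two 30 km/h steps. A real filter would also have to resynthesize the waveform with the shifted formants, which this sketch does not attempt.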

(5) Lengthening of the beginning (first mora) of the uttered word
The inventors' studies have confirmed that the duration of the beginning of a word, that is, the first mora (a mora being a unit of relative sound length that underlies accent, stress, and intonation), becomes longer depending on the driving environment. Therefore, in this embodiment, one form of the utterance conversion filter means 2 lengthens the word-initial mora by up to about 40% relative to the non-driving environment.

(6) Lengthening of the ending (final mora) of the uttered word
The inventors' studies have confirmed that when the noise that accompanies driving speed impairs auditory feedback, speakers tend to speak carefully and clearly. Therefore, in this embodiment, one form of the utterance conversion filter means 2 lengthens the word-final mora by up to about 80% relative to the non-driving environment.
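The two lengthening rules in (5) and (6) can be sketched in Python as below. The description gives only the maxima (about 40% and about 80%); the linear ramp with speed, the 100 km/h point at which the maxima are reached, and the handling of single-mora words are all assumptions made for this sketch.

```python
MAX_HEAD_STRETCH = 0.40  # word-initial mora: up to about 40% longer
MAX_TAIL_STRETCH = 0.80  # word-final mora: up to about 80% longer
MAX_SPEED_KMH = 100.0    # assumption: the maxima are reached at 100 km/h


def stretch_moras(durations_ms, speed_kmh):
    """Return mora durations with the first and last mora lengthened.

    Assumption: the lengthening grows linearly with speed and is clamped
    to the maxima above. For a single-mora word only the word-initial
    rule is applied (also an assumption).
    """
    ratio = min(max(speed_kmh / MAX_SPEED_KMH, 0.0), 1.0)
    stretched = list(durations_ms)
    if not stretched:
        return stretched
    stretched[0] *= 1.0 + MAX_HEAD_STRETCH * ratio
    if len(stretched) > 1:
        stretched[-1] *= 1.0 + MAX_TAIL_STRETCH * ratio
    return stretched
```

The returned durations would then serve as targets for a time-scale modification step, which is outside the scope of this sketch.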

As described above, the parameters of the utterance conversion filter means 2 of this embodiment have been exemplified as speech power, the fundamental frequency of speech, the slope of the spectral regression line of speech, the formant frequencies of speech (first to third formant frequencies), the beginning of the uttered word, and the ending of the uttered word. These may be used singly or in combinations of two or more. As a concrete means of converting these parameters, a method generally proposed as a morphing technique can be used (Hideki Kawahara, "Auditory scene analysis and the high-quality speech analysis, conversion, and synthesis method STRAIGHT," Proceedings of the Acoustical Society of Japan, 1-2-1, pp. 189-192, Sep. 1997). The conversion is not limited to the method described in that reference, however; any algorithm that can modify the utterance parameters can be applied.

Returning to FIG. 1, once the utterance conversion filter means 2 described above has converted the undistorted speech corpus 1 into a mobile environment speech corpus 3 for each driving speed, each of these corpora is used as training data to create the acoustic model 4 for the corresponding driving speed. The Baum-Welch training algorithm mentioned earlier, among others, can be used for this training.

Because the acoustic model 4 of this embodiment is created with the utterance distortion that corresponds to the driving speed taken into account, it can achieve high recognition performance on utterances input during actual driving.

Next, a speech recognition method and a speech recognition device that use the acoustic model 4 described above are explained. FIG. 2 is a block diagram showing a first embodiment of the speech recognition device of the present invention, and FIG. 3 is a flowchart showing the control procedure of that first embodiment.

As shown in FIG. 2, the speech recognition device of this embodiment comprises a speech input device 5, such as a microphone, for inputting the speech to be recognized; a noise removal unit 6, such as a noise filter, for removing noise from the speech signal input to the speech input device 5; a decoder 7 for converting the noise-removed speech signal into a text signal; and an acoustic model 4 and a language model 8, which serve as the dictionaries the decoder 7 refers to during conversion.

The decoder 7 extracts characteristic parameters from the speech signal input to the speech input device 5, compares them with the parameters tied to text information recorded in the acoustic model 4 and the language model 8 stored in a memory or the like, and outputs the most appropriate text information. A technique based on statistical parameters, typified by the hidden Markov model, can be used for this.

In this embodiment in particular, as described above, an acoustic model 4 for each driving speed (four in the embodiment: 0, 30, 60, and 100 km/h) is stored in a memory or the like.

The speech recognition device of this embodiment also has a vehicle speed detection device 9 that detects the driving speed of the vehicle, and a selection device 10 that selects, from the four acoustic models 4 stored in the memory, the one corresponding to the vehicle speed detected by the vehicle speed detection device 9. The selection device 10 holds a map that fixes in advance which acoustic model 4 corresponds to which range of vehicle speed, and selects the most suitable acoustic model 4 from the actual vehicle speed detected by the vehicle speed detection device 9. For example, the map selects the 0 km/h acoustic model 4 when the vehicle speed is 0 to 15 km/h and the 30 km/h acoustic model 4 when it is 15 to 45 km/h.
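The map held by the selection device 10 can be pictured as a small lookup over speed ranges. In the Python sketch below, the bounds 15 and 45 km/h come from the description; the bound of 80 km/h between the 60 and 100 km/h models is an assumed midpoint, and the function name is hypothetical. A speed equal to a bound is assigned to the lower range, on the reading that a change is detected only once the bound is exceeded.

```python
import bisect

# Speeds for which acoustic models are stored (from the description).
MODEL_SPEEDS_KMH = [0, 30, 60, 100]
# Upper bounds of the first three ranges. 15 and 45 are given in the text;
# 80 is an assumed midpoint between the 60 and 100 km/h models.
RANGE_BOUNDS_KMH = [15, 45, 80]


def select_model_speed(vehicle_speed_kmh):
    """Return the model speed whose range contains the vehicle speed."""
    index = bisect.bisect_left(RANGE_BOUNDS_KMH, vehicle_speed_kmh)
    return MODEL_SPEEDS_KMH[index]
```

Using `bisect_left` keeps a speed of exactly 45 km/h on the 30 km/h model, so the switch to the 60 km/h model happens only above 45 km/h.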

Next, the operation of the speech recognition device of this embodiment is described with reference to FIG. 3.

First, initialization is performed in step S100, covering everything related to speech recognition processing. The speech recognition device may be placed in a state of waiting for an input signal at this point, or it may be started and placed in that state when the user indicates an intention to speak (for example, by turning on a PTT switch).

Next, in step S110, a change in vehicle speed is detected using CAN (Controller Area Network, an in-vehicle LAN standard) or the like. If the vehicle speed has changed, the process proceeds to step S115; otherwise the detection in step S110 is repeated. The amount of change that counts is determined by the speeds for which acoustic models 4 are prepared. For example, when acoustic models for 30 km/h and 60 km/h are prepared, a change in vehicle speed is detected once the speed exceeds 45 km/h.

Next, in step S115, the vehicle speed at the time the user inputs a sound signal is detected, and the corresponding acoustic model 4 is selected from among the acoustic models 4 shown in FIG. 2.

Next, if speech input is detected in step S120, the process proceeds to step S130; if not, the process returns to step S110 and the above processing is repeated.

In step S130, the input speech signal is recognized. If an acoustic model 4 was selected in step S115, recognition is performed with that model.

Finally, in step S140, the recognition result, that is, the text information, is sent to the target operating device.
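The control procedure of steps S100 to S140 can be sketched as a simple loop. The Python below is a simplified illustration and not the flowchart of FIG. 3 itself: it checks for speech input on every pass rather than only after a speed change, the `recognize` callable stands in for the decoder 7, and the 80 km/h bound is an assumed midpoint. All function names are hypothetical.

```python
def model_for_speed(speed_kmh):
    """Map a vehicle speed to a model speed (the 80 km/h bound is assumed)."""
    if speed_kmh <= 15:
        return 0
    if speed_kmh <= 45:
        return 30
    if speed_kmh <= 80:
        return 60
    return 100


def run_recognition_loop(samples, recognize):
    """Process (speed_kmh, audio_or_None) samples and return recognized texts.

    recognize is a caller-supplied callable (audio, model_speed) -> text.
    """
    current_model = None                        # S100: initialization
    results = []
    for speed_kmh, audio in samples:
        wanted = model_for_speed(speed_kmh)
        if wanted != current_model:             # S110: speed range changed?
            current_model = wanted              # S115: select matching model
        if audio is None:                       # S120: no speech, back to S110
            continue
        text = recognize(audio, current_model)  # S130: recognition
        results.append(text)                    # S140: hand result onward
    return results
```

Passing the recognizer in as a parameter keeps the sketch independent of any particular decoder implementation.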

In the speech recognition device of this embodiment, the on-board elements are those shown in FIG. 2, including the plurality of acoustic models 4. Although the acoustic models 4 take up more storage, text information can be output simply by having the decoder 7 convert the speech input to the speech input device 5 as it is, so recognition speed as well as recognition performance is improved.

<< Second Embodiment >>
FIG. 4 is a block diagram showing a second embodiment of the method of creating an acoustic model according to the present invention.

In the first embodiment described above, a plurality of acoustic models 4 that take utterance distortion in a mobile environment into account are created, one for each driving speed, and applied to an on-board speech recognition device to improve recognition of speech containing utterance distortion. The drawback is that storing the plurality of acoustic models 4 requires a large capacity.

This embodiment therefore reduces the storage capacity needed by the on-board speech recognition device. As shown in FIG. 4, the method includes at least: a first step of recording prescribed speech in an environment substantially free of acoustic noise to create the undistorted speech corpus 1; a second step of training on the undistorted speech corpus 1 created in the first step to create an undistorted acoustic model 11; a third step of correcting the speech of the undistorted speech corpus 1 with the utterance conversion filter means 2, which transforms it into utterances corresponding to the speed of the moving body, to create a mobile environment speech corpus 3; and a fourth step of adapting the undistorted acoustic model 11 with the mobile environment speech corpus 3 created in the third step to create the acoustic model 4.

The undistorted speech corpus 1, the utterance conversion filter means 2, and the acoustic model 4 are the same as in the first embodiment, so their detailed description is omitted. The undistorted acoustic model 11 of this embodiment is an acoustic model obtained by training on a large-scale undistorted speech corpus 1, for example with the Baum-Welch training algorithm. Although this undistorted acoustic model 11 is large and is stored in the on-board speech recognition device, it is a single acoustic model rather than one per driving speed.

Instead, the mobile environment speech corpus 12 of this embodiment is a small-scale corpus, and a plurality of these small mobile environment speech corpora 12, one for each driving speed, are created and stored in the on-board speech recognition device.

Using this small mobile environment speech corpus 12, the target acoustic model 4 is created by adaptation with an environment adaptation algorithm such as MLLR (Maximum Likelihood Linear Regression, an algorithm that converts a speaker-independent acoustic model into one for a specific speaker).
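For orientation, MLLR mean adaptation replaces each Gaussian mean μ of the model with an affine transform Aμ + b shared by a group of Gaussians, with A and b estimated from the adaptation data by maximum likelihood. The Python sketch below shows only how such a transform would be applied once A and b are known; estimating them from the mobile environment speech corpus 12 is omitted, and the function name is hypothetical.

```python
def apply_mllr_mean_transform(means, A, b):
    """Return adapted means, computing A @ mu + b for every mean vector.

    means: list of mean vectors (lists of floats)
    A:     square matrix as a list of rows
    b:     bias vector
    The transform (A, b) is assumed to have been estimated elsewhere.
    """
    adapted = []
    for mu in means:
        adapted.append([
            sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)
        ])
    return adapted
```

Because one transform is shared by many Gaussians, a small adaptation corpus can move a large model, which matches the motivation given here for keeping only small per-speed corpora on board.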

Because the acoustic model 4 of this embodiment is created with the utterance distortion that corresponds to the driving speed taken into account, it can achieve high recognition performance on utterances input during actual driving. In addition, because the mobile environment speech corpus 12 can be a small-scale corpus, the embodiment is well suited to on-board speech recognition devices.

Next, a speech recognition method and a speech recognition device that use the acoustic model 4 described above are explained. FIG. 5 is a block diagram showing a second embodiment of the speech recognition device of the present invention, and FIG. 6 is a flowchart showing the control procedure of that second embodiment.

As shown in FIG. 5, the speech recognition device of this embodiment comprises a speech input device 5, such as a microphone, for inputting the speech to be recognized; a noise removal unit 6, such as a noise filter, for removing noise from the speech signal input to the speech input device 5; a decoder 7 for converting the noise-removed speech signal into a text signal; and an acoustic model 4 and a language model 8, which serve as the dictionaries the decoder 7 refers to during conversion.

The speech input device 5, the noise removal unit 6, and the decoder 7 are the same as in the first embodiment, so their detailed description is omitted.

In this embodiment in particular, as described above, a mobile environment speech corpus 12 for each driving speed (four in the embodiment: 0, 30, 60, and 100 km/h) is stored in a memory or the like.

The speech recognition device of this embodiment also has a vehicle speed detection device 9 that detects the driving speed of the vehicle, and a selection device 10 that selects, from the four mobile environment speech corpora 12 stored in the memory, the one corresponding to the vehicle speed detected by the vehicle speed detection device 9. The selection device 10 holds a map that fixes in advance which mobile environment speech corpus 12 corresponds to which range of vehicle speed, and selects the most suitable mobile environment speech corpus 12 from the actual vehicle speed detected by the vehicle speed detection device 9. For example, the map selects the 0 km/h mobile environment speech corpus 12 when the vehicle speed is 0 to 15 km/h and the 30 km/h mobile environment speech corpus 12 when it is 15 to 45 km/h.

Next, the operation of the speech recognition device of this embodiment is described with reference to FIG. 6.

Next, in step S110, a change in vehicle speed is detected using CAN or the like. If the vehicle speed has changed, the process proceeds to step S116; otherwise the detection in step S110 is repeated. The amount of change that counts is determined by the speeds for which mobile environment speech corpora 12 are prepared. For example, when mobile environment speech corpora 12 for 30 km/h and 60 km/h are prepared, a change in vehicle speed is detected once the speed exceeds 45 km/h.

Next, in step S116, the vehicle speed at the time the user inputs a sound signal is detected, the corresponding mobile environment speech corpus 12 is selected from among the mobile environment speech corpora 12 shown in FIG. 5, and the adaptation device 13 performs adaptation using the selected mobile environment speech corpus 12.

In step S130, the input speech signal is recognized, using the acoustic model 4 adapted in step S116.

In the speech recognition device of this embodiment, the on-board mobile environment speech corpora 12 can be small-scale corpora. Although an adaptation step is added, the device can therefore be built with a small storage capacity while retaining its recognition performance.

<< Third Embodiment >>
FIG. 7 is a block diagram showing a third embodiment of the method of creating an acoustic model according to the present invention. In creating an acoustic model that takes utterance distortion in a mobile environment into account, the first embodiment treated the input speech signal as a single type (category) when applying the utterance conversion filter means 2. How the parameters vary with driving speed can, however, differ between categories such as individual speakers or sexes. In this embodiment, therefore, different utterance conversion filter means 2 are prepared according to the phoneme, such as a vowel, or according to whether the speaker is male or female. A phoneme/speaker selection device 14 selects the utterance conversion filter means 2, and the mobile environment speech corpus 3 is created with the selected utterance conversion filter means 2.

For example, the vowel phonemes /a/ to /o/ are converted with different utterance conversion filter means 2. More specifically, when the phoneme /a/ (あ) is input, the conversion of the utterance conversion filter means 2 is applied. When speech is input without regard to whether the speaker is male or female, both a speech signal with the first formant frequency lowered and a speech signal with the first formant frequency raised are sent to the corpus. Further, when female speech is input the first formant frequency is lowered, and when male speech is input it is raised, and the resulting speech signal is sent to the mobile environment speech corpus 3.
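The sex-dependent handling of the first formant can be sketched as follows in Python. The direction of the shift (lowered for female speech, raised for male speech, both variants when the sex is not distinguished) follows the description as read here; the function name, the `gender` argument, and passing the magnitude in as a parameter are assumptions, since the passage does not state a magnitude for this selection.

```python
def f1_variants(f1_hz, magnitude_hz, gender=None):
    """Return the first-formant values to add to the mobile environment corpus.

    gender: "female", "male", or None when the sex is not distinguished
            (in which case both the lowered and raised variants are returned).
    """
    shift = abs(magnitude_hz)
    if gender == "female":
        return [f1_hz - shift]   # female speech: first formant lowered
    if gender == "male":
        return [f1_hz + shift]   # male speech: first formant raised
    return [f1_hz - shift, f1_hz + shift]
```

The phoneme/speaker selection device 14 would play the role of supplying `gender` (and the phoneme) to such a routine.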

Because the acoustic model 4 of this embodiment is created with the utterance distortion that corresponds to the driving speed taken into account, it can achieve high recognition performance on utterances input during actual driving. In addition, because the acoustic model 4 is created with a conversion matched to the characteristics of the speaker, recognition performance is improved further.

<< Fourth Embodiment >>
FIG. 8 is a block diagram showing a fourth embodiment of the speech recognition device of the present invention, and FIG. 9 is a flowchart showing the control procedure of that fourth embodiment.

In the first to third embodiments described above, utterance distortion in a mobile environment is built into the acoustic model 4 used by the speech recognition device. In this embodiment, by contrast, a preprocessing correction that accounts for the mobile environment is applied to the input speech signal before it is converted with a general acoustic model and language model that do not account for the mobile environment.

As shown in FIG. 8, the speech recognition device of this embodiment comprises a speech input device 5, such as a microphone, for inputting the speech to be recognized; a noise removal unit 6, such as a noise filter, for removing noise from the speech signal input to the speech input device 5; an utterance correction filter means 15 that corrects the input speech signal into a speech signal from which the utterance distortion of the mobile environment has been removed; a decoder 7 for converting the corrected, noise-removed speech signal into a text signal; and an acoustic model 4 and a language model 8, which serve as the dictionaries the decoder 7 refers to during conversion.

The speech input device 5, the noise removal unit 6, the decoder 7, the acoustic model 4, and the language model 8 are the same as in the second embodiment, so their detailed description is omitted.

In this embodiment, the utterance correction filter means 15 preprocesses the input speech signal. This preprocessing corrects the input speech signal into one with the same acoustic and statistical characteristics as the speech corpus used to train the acoustic model 4. The speech signal input to the speech input device 5 contains the utterance distortion of the actual mobile environment, whereas the speech corpus used as training data for the acoustic model 4 is one, such as an undistorted speech corpus, that contains no such distortion.

The utterance correction filter means 15 of this embodiment therefore performs the preprocessing correction described below. It has the inverse characteristics of the utterance conversion filter means 2 detailed in the first embodiment.

The parameters of the utterance correction filter means 15 can be exemplified as speech power, the fundamental frequency of speech, the slope of the spectral regression line of speech, the formant frequencies of speech (first to third formant frequencies), the beginning of the uttered word, and the ending of the uttered word. As for the specific correction values, a parameter that the utterance conversion filter means 2 of the first embodiment increases is decreased by the utterance correction filter means 15 of this example, and a parameter that the utterance conversion filter means 2 lengthens is shortened by the utterance correction filter means 15. The absolute value of each decrease or shortening is the same as that used by the utterance conversion filter means 2 of the first embodiment.
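For the formant parameters, the inverse relationship between the two filters amounts to negating the conversion shift. The Python sketch below illustrates this with the per-vowel example values given for the first embodiment; the function names and the whole-step rule for speed are assumptions, and the duration parameters are left out because the description does not say whether "the same absolute value" means the same percentage or the same time span.

```python
# Example (F1, F2, F3) increases in Hz per 30 km/h, as listed for the
# utterance conversion filter means 2 in the first embodiment.
CONVERSION_STEP_HZ = {
    "a": (10, 20, 10),
    "i": (20, 5, 0),
    "u": (20, 0, 10),
    "e": (20, 5, 0),
    "o": (15, 10, 0),
}


def conversion_shift(vowel, speed_kmh):
    """Shift applied by the conversion filter (whole 30 km/h steps assumed)."""
    steps = int(speed_kmh // 30)
    return tuple(hz * steps for hz in CONVERSION_STEP_HZ[vowel])


def correction_shift(vowel, speed_kmh):
    """Shift applied by the correction filter: same magnitude, opposite sign."""
    return tuple(-d for d in conversion_shift(vowel, speed_kmh))


def correct_formants(formants_hz, vowel, speed_kmh):
    """Remove the speed-dependent formant shift from measured formants."""
    deltas = correction_shift(vowel, speed_kmh)
    return tuple(f + d for f, d in zip(formants_hz, deltas))
```

Applying `correct_formants` to formants that were previously shifted by `conversion_shift` at the same speed returns the original values, which is the sense in which the two filters are inverses.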

Next, the operation of the speech recognition device of this embodiment is described with reference to FIG. 9.

Next, in step S110, a change in vehicle speed is detected using CAN or the like. If the vehicle speed has changed, the process proceeds to step S117; otherwise the detection in step S110 is repeated. The amount of change that counts is determined by the speeds for which utterance correction filter means 15 are prepared. For example, when utterance correction filter means 15 for 30 km/h and 60 km/h are prepared, a change in vehicle speed is detected once the speed exceeds 45 km/h.

Next, in step S117, the vehicle speed at the time the user inputs a sound signal is detected, and the corresponding utterance correction filter means 15 is selected from among the prepared utterance correction filter means 15.

Next, if voice input is detected in step S120, the process proceeds to step S125, where the input speech signal is corrected using the utterance correction filter means 15 selected in step S117. If no voice input is detected, the process returns to step S110 and the above processing is repeated.

In step S130, recognition processing is performed on the input speech signal. Finally, in step S140, the speech recognition result, that is, text information, is sent to the target operating device.
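
Steps S110 to S140 can be read as one polling cycle, sketched below in Python. This is a simplified reading under stated assumptions: the cycle reselects the filter whenever the nearest prepared speed changes and then checks for voice input in the same pass, the filters and the recognizer are stand-in callables, and all names are hypothetical. It is not the patent's implementation.

```python
from typing import Callable, Dict, List, Optional, Tuple

Signal = List[float]


def select_filter_speed(speed_kmh: float, filters: Dict[float, Callable]) -> float:
    """S117: pick the prepared filter speed nearest to the measured speed."""
    return min(sorted(filters), key=lambda s: abs(s - speed_kmh))


def process_cycle(
    selected: float,
    speed_kmh: float,
    audio: Optional[Signal],
    filters: Dict[float, Callable[[Signal], Signal]],
    recognize: Callable[[Signal], str],
) -> Tuple[float, Optional[str]]:
    """Run one pass and return (selected filter speed, recognized text or None)."""
    # S110 / S117: reselect the filter only if the nearest prepared speed changed.
    candidate = select_filter_speed(speed_kmh, filters)
    if candidate != selected:
        selected = candidate
    # S120: with no voice input, return so the caller polls again from S110.
    if audio is None:
        return selected, None
    corrected = filters[selected](audio)  # S125: correct the input signal
    text = recognize(corrected)           # S130: recognition processing
    return selected, text                 # S140: caller forwards the text


# Stand-in filters and recognizer, for illustration only.
filters = {
    30.0: lambda sig: [0.9 * v for v in sig],
    60.0: lambda sig: [0.8 * v for v in sig],
}
recognize = lambda sig: f"{len(sig)} samples"

selected, text = process_cycle(30.0, 50.0, [1.0, 1.0], filters, recognize)
```

Passing the filters and recognizer as arguments keeps the control flow separate from the signal processing, so the same loop could be exercised with real components.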

In the speech recognition apparatus according to this embodiment, the speech signal to be converted is corrected in preprocessing into a speech signal that contains no utterance distortion and has the same characteristics as the speech corpus used as learning data for the acoustic model, so speech recognition performance can be improved. In particular, this embodiment can be applied to mobile phone calls.

The embodiments described above are presented to facilitate understanding of the present invention, not to limit it. Accordingly, each element disclosed in the above embodiments is intended to include all design changes and equivalents that fall within the technical scope of the present invention.

FIG. 1 is a block diagram showing a first embodiment of the method for creating an acoustic model of the present invention.
FIG. 2 is a block diagram showing a first embodiment of the speech recognition apparatus of the present invention.
FIG. 3 is a flowchart showing the control procedure of the first embodiment of the speech recognition apparatus of the present invention.
FIG. 4 is a block diagram showing a second embodiment of the method for creating an acoustic model of the present invention.
FIG. 5 is a block diagram showing a second embodiment of the speech recognition apparatus of the present invention.
FIG. 6 is a flowchart showing the control procedure of the second embodiment of the speech recognition apparatus of the present invention.
FIG. 7 is a block diagram showing a third embodiment of the method for creating an acoustic model of the present invention.
FIG. 8 is a block diagram showing a fourth embodiment of the speech recognition apparatus of the present invention.
FIG. 9 is a flowchart showing the control procedure of the fourth embodiment of the speech recognition apparatus of the present invention.
FIG. 10 is a graph showing the spectral envelope of speech, for explaining formant frequencies.
FIG. 11 is a graph showing the relationship between Japanese vowels and formant frequencies.
FIG. 12 is a diagram showing the relationship between utterance distortion in a driving environment and speech recognition performance.
FIG. 13 is a graph for explaining the influence of utterance distortion caused by tension and the like in a driving environment.
FIG. 14 is a block diagram showing an example of a conventional speech recognition apparatus.
FIG. 15 is a block diagram showing an example of a conventional method for creating an acoustic model.

Explanation of symbols

1 ... Undistorted speech corpus
2 ... Utterance conversion filter means
3 ... Mobile environment speech corpus
4 ... Acoustic model
5 ... Voice input device
7 ... Decoder
8 ... Language model
9 ... Vehicle speed detection device
10 ... Selection device

Claims

A method for creating an acoustic model that converts a speech signal into a label signal and is used in a speech recognition apparatus that may be used in a moving body, the method comprising:
a step of recording predetermined speech in an environment substantially free of acoustic noise to create an undistorted speech corpus;
a step of correcting the speech of the undistorted speech corpus created in the preceding step using utterance conversion filter means that transforms the speech into an utterance corresponding to the moving speed of the moving body, thereby creating a mobile environment speech corpus; and
a step of learning the mobile environment speech corpus created in the preceding step as learning data to create the acoustic model.

A method for creating an acoustic model that converts a speech signal into a label signal and is used in a speech recognition apparatus that may be used in a moving body, the method comprising:
a step of recording predetermined speech in an environment substantially free of acoustic noise to create an undistorted speech corpus;
a step of learning the undistorted speech corpus created in the preceding step as learning data to create an undistorted acoustic model;
a step of correcting the speech of the undistorted speech corpus using utterance conversion filter means that transforms the speech into an utterance corresponding to the moving speed of the moving body, thereby creating a mobile environment speech corpus; and
a step of adapting the undistorted acoustic model using the mobile environment speech corpus created in the preceding step, thereby creating the acoustic model.

The method for creating an acoustic model according to claim 2, wherein the mobile environment speech corpus is created by selectively correcting speech from the undistorted speech corpus so that the data volume of the mobile environment speech corpus is smaller than the data volume of the undistorted speech corpus.

The method for creating an acoustic model according to any one of claims 1 to 3, further comprising a step of sorting the speech of the undistorted speech corpus according to predetermined categories,
wherein the utterance conversion filter means corrects the speech belonging to the category selected in that step in correspondence with the moving speed of the moving body.

The method for creating an acoustic model according to claim 1, wherein the utterance conversion filter means corrects the speech so as to increase the power of the vowels of the speech as the moving speed of the moving body increases.

The method for creating an acoustic model according to any one of claims 1 to 4, wherein the utterance conversion filter means corrects the speech so as to increase the pitch frequency of the vowels of the speech as the moving speed of the moving body increases.

The method for creating an acoustic model according to any one of the above claims, wherein the utterance conversion filter means corrects the power of the speech band frequencies so as to increase the slope of the spectral regression line of the vowels of the speech as the moving speed of the moving body increases.

The method for creating an acoustic model described above, wherein the utterance conversion filter means corrects the speech so as to increase at least one of the first to third formants of the speech as the moving speed of the moving body increases.

The method for creating an acoustic model according to claim 1, wherein the utterance conversion filter means corrects the speech so as to increase the duration of the vowel at the beginning of the uttered vocabulary as the moving speed of the moving body increases.

The method for creating an acoustic model according to claim 1, wherein the utterance conversion filter means corrects the speech so as to increase the duration of the vowel at the end of the uttered vocabulary as the moving speed of the moving body increases.

A speech recognition method that may be used in a moving body, comprising:
a step of detecting a moving speed of the moving body;
a step of inputting speech to be recognized; and
a step of converting the speech signal input in the preceding step into a label signal, in accordance with the detected moving speed of the moving body, using the acoustic model created by the method according to any one of claims 1 to 10.

A speech recognition apparatus that may be used in a moving body, comprising:
speed detecting means for detecting a moving speed of the moving body;
voice input means for inputting speech to be recognized;
storage means for storing an acoustic model created by the method according to any one of claims 1 to 10; and
conversion means for converting the speech signal input to the voice input means into a label signal, in accordance with the moving speed of the moving body detected by the speed detecting means, using the acoustic model stored in the storage means.

A speech recognition apparatus that may be used in a moving body, comprising:
speed detecting means for detecting a moving speed of the moving body;
voice input means for inputting speech to be recognized;
storage means for storing an acoustic model in which speech signals and label signals are associated;
utterance correction filter means for correcting, in accordance with the moving speed of the moving body detected by the speed detecting means, the speech signal input to the voice input means into a speech signal having the same acoustic and statistical characteristics as the speech corpus used when learning the acoustic model stored in the storage means; and
conversion means for converting the speech signal corrected by the utterance correction filter means into a label signal using the acoustic model stored in the storage means.

The speech recognition apparatus according to claim 13, wherein the utterance correction filter means corrects the speech signal so as to decrease the power of the vowels of the speech as the moving speed of the moving body increases.

The speech recognition apparatus according to claim 13, wherein the utterance correction filter means corrects the speech signal so as to decrease the pitch frequency of the vowels of the speech as the moving speed of the moving body increases.

The speech recognition apparatus, wherein the utterance correction filter means corrects the power of the speech band frequencies so as to decrease the slope of the spectral regression line of the vowels of the speech as the moving speed of the moving body increases.

The speech recognition apparatus, wherein the utterance correction filter means corrects the speech signal so as to decrease at least one of the first to third formants of the speech as the moving speed of the moving body increases.

The speech recognition apparatus according to claim 13, wherein the utterance correction filter means corrects the speech signal so as to shorten the duration of the vowel at the beginning of the uttered vocabulary as the moving speed of the moving body increases.

The speech recognition apparatus according to claim 13, wherein the utterance correction filter means corrects the speech signal so as to shorten the duration of the vowel at the end of the uttered vocabulary as the moving speed of the moving body increases.

A speech recognition method that may be used in a moving body, comprising:
a step of detecting a moving speed of the moving body;
a step of inputting speech to be recognized;
a step of correcting, in accordance with the detected moving speed of the moving body, the speech signal input in the preceding step into a speech signal having the same acoustic and statistical characteristics as the speech corpus used when learning an acoustic model that is stored in storage means and in which speech signals and label signals are associated; and
a step of converting the speech signal corrected in the preceding step into a label signal using the acoustic model.

The speech recognition method according to claim 20, wherein in the correcting step the input speech signal is corrected so as to decrease the power of the vowels of the speech as the moving speed of the moving body increases.

The speech recognition method according to claim 20, wherein in the correcting step the input speech signal is corrected so as to decrease the pitch frequency of the vowels of the speech as the moving speed of the moving body increases.

The speech recognition method according to claim 20, wherein in the correcting step the power of the speech band frequencies of the input speech signal is corrected so as to decrease the slope of the spectral regression line of the vowels of the speech as the moving speed of the moving body increases.

The speech recognition method described above, wherein in the correcting step the input speech signal is corrected so as to decrease at least one of the first to third formants of the speech as the moving speed of the moving body increases.

The speech recognition method described above, wherein in the correcting step the input speech signal is corrected so as to shorten the duration of the vowel at the beginning of the uttered vocabulary as the moving speed of the moving body increases.

The speech recognition method described above, wherein in the correcting step the input speech signal is corrected so as to shorten the duration of the vowel at the end of the uttered vocabulary as the moving speed of the moving body increases.