JP2019035935A

JP2019035935A - Voice recognition apparatus

Info

Publication number: JP2019035935A
Application number: JP2018023256A
Authority: JP
Inventors: 滋樹青島; Shigeki Aoshima; 恭平増井; Kyohei Masui
Original assignee: Toyota Motor Corp
Current assignee: Toyota Motor Corp
Priority date: 2017-08-10
Filing date: 2018-02-13
Publication date: 2019-03-07
Anticipated expiration: 2038-02-13
Also published as: JP7077645B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech recognition apparatus with an improved recognition rate under noise. SOLUTION: A speech recognition apparatus includes: an acoustic processing unit that acquires a speech input and generates an input spectrum, which is the spectrum of the speech input; an acoustic model storage unit in which the spectrum of each phoneme is stored in advance as an acoustic model; and a phoneme matching unit that calculates the inter-spectral distance between the input spectrum and the spectrum of each phoneme in the acoustic model and identifies the phoneme with the minimum inter-spectral distance as the phoneme for the speech input. The phoneme matching unit weights values based on the difference between the input spectrum and each phoneme spectrum at each of a plurality of frequencies, and calculates the inter-spectral distance as the sum of the weighted values. SELECTED DRAWING: Figure 2

Description

The present invention relates to a speech recognition apparatus.

Conventionally, speech recognition apparatuses are known that compare the features of input speech with the features of each phoneme of an acoustic model created in advance, and identify the phoneme judged to have a high degree of similarity as the phoneme label for the input speech.

Such speech recognition apparatuses are required to improve the recognition rate under noise. As related art, Patent Document 1 discloses removing noise by subtracting the frequency components of the sound input during non-speech from the sound input during speech. Patent Document 2 discloses calculating the distance between input speech and an acoustic model as a weighted sum of the distance between their static feature parameters and the distance between their dynamic feature parameters, identifying the phoneme with a small calculated value, and changing the weighting based on the degree of dispersion of the temporal variation of the noise power.

Patent Document 1: Japanese Patent Laid-Open No. H02-179700
Patent Document 2: Japanese Patent Laid-Open No. 2004-184856 (JP 2004-184856 A)

To improve the recognition rate under noise, one can either use an acoustic model created in an environment with the same noise characteristics or subtract the noise from the speech input. When acoustic models are created in noisy environments, supporting multiple environments requires a separate model for each environment, which takes considerable effort. In addition, subtracting noise from the speech input in the spectral domain causes spectral distortion, and depending on the method of comparison with the acoustic model, the recognition rate may actually decrease. Thus, there has been room for improvement in raising the recognition rate under noise.

The present invention has been made in view of the above problems, and its object is to provide a speech recognition apparatus with an improved recognition rate under noise.

To solve the above problems, one aspect of the present invention is a speech recognition apparatus comprising: an acoustic processing unit that acquires a speech input and generates an input spectrum, which is the spectrum of the speech input; an acoustic model storage unit that stores the spectrum of each phoneme in advance as an acoustic model; and a phoneme matching unit that calculates the inter-spectral distance between the input spectrum and the spectrum of each phoneme of the acoustic model and identifies the phoneme with the smallest inter-spectral distance as the phoneme for the speech input. The phoneme matching unit weights values based on the difference between the input spectrum and the spectrum of each phoneme at each of a plurality of frequencies, and calculates the inter-spectral distance by summing the weighted values. This way of calculating the inter-spectral distance allows an evaluation that emphasizes the frequency bands characterizing the human voice, or one that reduces the influence of frequency bands containing much noise, and thereby improves the recognition rate.

The acoustic processing unit may further generate a noise spectrum, which is the spectrum of the noise included in the speech input. In that case, the speech recognition apparatus may further comprise a weighting function storage unit that stores a plurality of noise spectrum patterns in association with weighting functions, and a weighting function determination unit that refers to the weighting function storage unit and determines the weighting function associated with the stored noise spectrum most similar to the noise spectrum generated by the acoustic processing unit. The phoneme matching unit may then perform the weighting based on the weighting function determined by the weighting function determination unit. This allows an evaluation in which the influence of the noise is reduced according to the characteristics of the noise included in the speech input, further improving the recognition rate.

The speech recognition apparatus may also further comprise a language-specific weighting function storage unit that stores a plurality of language types in association with weighting functions, and a language-specific weighting function determination unit that acquires the language type of the speech included in the speech input, refers to the language-specific weighting function storage unit, and determines the weighting function associated with the acquired language type. The phoneme matching unit may then perform the weighting based on the weighting function determined by the language-specific weighting function determination unit. This allows an evaluation that emphasizes the frequency bands characterizing the language of the speech included in the speech input, further improving the recognition rate.

According to the present invention, a speech recognition apparatus with an improved recognition rate under noise can be provided.

FIG. 1: Functional block diagram of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 2: Flowchart showing the processing of the speech recognition apparatus according to the first embodiment of the present invention.
FIG. 3: Diagram explaining the inter-spectral distance according to the first embodiment of the present invention.
FIG. 4: Diagram showing an example of the weighting function according to the first embodiment of the present invention.
FIG. 5: Functional block diagram of the speech recognition apparatus according to the second embodiment of the present invention.
FIG. 6: Flowchart showing the processing of the speech recognition apparatus according to the second embodiment of the present invention.
FIG. 7: Diagram showing an example of the association between noise spectra and weighting functions according to the second embodiment of the present invention.
FIG. 8: Functional block diagram of the speech recognition apparatus according to the third embodiment of the present invention.
FIG. 9: Flowchart showing the processing of the speech recognition apparatus according to the third embodiment of the present invention.
FIG. 10: Diagram showing an example of the association between language types and weighting functions according to the third embodiment of the present invention.

(Overview)
In the speech recognition apparatus according to the present invention, when the inter-spectral distance between the spectrum of the speech input and a spectrum of the acoustic model is calculated, the differences at specific frequency components are weighted more heavily than the differences at other frequency components. This way of calculating the inter-spectral distance allows an evaluation that emphasizes the frequency bands characterizing the human voice, or one that reduces the influence of frequency bands containing much noise, and thereby improves the recognition rate.

Hereinafter, embodiments of the present invention are described in detail with reference to the drawings.

(First Embodiment)
<Configuration>
FIG. 1 shows a functional block diagram of the speech recognition apparatus 100 according to this embodiment. The speech recognition apparatus 100 includes an acoustic processing unit 110, an acoustic model storage unit 120, and a phoneme matching unit 130. The acoustic processing unit 110 includes a high-frequency emphasizing unit 111, an acoustic analysis unit 112, and a noise processing unit 113.

The acoustic processing unit 110 receives a speech input from outside and generates an input spectrum, which is the spectrum of the speech input. The acoustic model storage unit 120 stores the spectrum of each phoneme, created in advance, as a reference acoustic model. The phoneme matching unit 130 calculates the inter-spectral distance, that is, the distance between the input spectrum generated by the acoustic processing unit 110 and the spectrum of each phoneme of the acoustic model stored in the acoustic model storage unit 120. The inter-spectral distance used in this embodiment is described in detail later. The phoneme matching unit 130 then identifies the phoneme with the smallest inter-spectral distance as the phoneme for the speech input, labels it, and outputs the phoneme label.

<Operation>
The processing performed by the speech recognition apparatus 100 is described below. FIG. 2 shows a flowchart of this processing.

Step S101: The acoustic processing unit 110 receives a speech input and generates from it an input spectrum, which is the spectrum of the speech input. The speech input is divided into frames using, for example, a Hamming window, and the following processing is performed frame by frame. First, the high-frequency emphasizing unit 111 emphasizes the high-frequency region of the speech input, which generally tends to carry characteristic features. Next, the acoustic analysis unit 112 performs an FFT (fast Fourier transform) to generate the spectrum of the speech input. The noise processing unit 113 then applies noise reduction to the generated spectrum. As one example, noise reduction is performed by subtracting the noise spectrum from the spectrum generated by the acoustic analysis unit 112. The noise spectrum may be generated from, for example, low-power frames of the speech input or frames designated as a non-speech period, or it may be stored in advance as a noise model. The noise-reduced spectrum is output from the acoustic processing unit 110 as the input spectrum. In this embodiment, the acoustic processing unit 110 may omit the noise processing unit 113 and output the spectrum generated by the acoustic analysis unit 112 as the input spectrum.
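The per-frame processing of step S101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus itself: the function names, the pre-emphasis coefficient 0.97, and the flooring of subtracted values at zero are all hypothetical choices, and a naive DFT stands in for the FFT so that the sketch needs only the standard library.

```python
import cmath
import math


def pre_emphasize(samples, alpha=0.97):
    """High-frequency emphasis y[n] = x[n] - alpha * x[n-1] (alpha is an assumed value)."""
    return [samples[0]] + [samples[n] - alpha * samples[n - 1] for n in range(1, len(samples))]


def hamming(n_samples):
    """Hamming window coefficients for a frame of n_samples points (n_samples >= 2)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1)) for n in range(n_samples)]


def magnitude_spectrum(frame):
    """Magnitudes of DFT bins 0..N/2 (naive O(N^2) DFT; an FFT yields the same values)."""
    n_samples = len(frame)
    return [
        abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_samples) for n in range(n_samples)))
        for k in range(n_samples // 2 + 1)
    ]


def subtract_noise(spectrum, noise_spectrum):
    """Spectral subtraction, floored at zero (the floor is an assumption of this sketch)."""
    return [max(s - n, 0.0) for s, n in zip(spectrum, noise_spectrum)]


def input_spectrum(samples, noise_spectrum=None):
    """Pre-emphasis, windowing, DFT, then optional noise subtraction for one frame."""
    emphasized = pre_emphasize(samples)
    window = hamming(len(emphasized))
    spectrum = magnitude_spectrum([x * w for x, w in zip(emphasized, window)])
    if noise_spectrum is None:
        return spectrum  # corresponds to omitting the noise processing unit
    return subtract_noise(spectrum, noise_spectrum)
```

Passing `noise_spectrum=None` mirrors the variant in which the noise processing unit 113 is omitted and the analysis output is used directly as the input spectrum.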

Step S102: The phoneme matching unit 130 acquires the input spectrum generated by the acoustic processing unit 110, acquires the spectrum of each phoneme from the acoustic model storage unit 120, and sequentially calculates the inter-spectral distance between the input spectrum and the spectrum of each phoneme.

Here, the inter-spectral distance is described with reference to FIG. 3. In this embodiment, the inter-spectral distance is defined as the weighted sum of the differences d_i (i = 1, 2, ..., l) between two spectra at a plurality of frequencies (angular frequencies) ω_i (i = 1, 2, ..., l). The weighting coefficients are determined by defining a function F(ω) of the frequency ω as the weighting function and taking its values F(ω_i) (i = 1, 2, ..., l) at each ω_i. FIG. 4 shows a graph of an example weighting function F(ω). The inter-spectral distance D_W is expressed by the following formula.

$$D_W = \sum_{i=1}^{l} F(\omega_i)\, d_i$$

Here, the weighting coefficients F(ω_i) (i = 1, 2, ..., l) are determined in advance such that at least one value differs from the others. That is, in calculating the inter-spectral distance between the input spectrum and the spectrum of each phoneme, the influence of the differences at specific frequencies is made larger than the influence of the differences at other frequencies. This enables an evaluation that emphasizes specific frequency bands, similar to the auditory behavior humans are thought to exhibit when, for example, picking out a human voice in noise: consciously focusing on the specific frequency bands that characterize the human voice. For example, making the weighting coefficients in a band that includes 150 Hz and in a band that includes frequencies from 1 kHz to 2 kHz larger than the weighting coefficients in other bands can improve the recognition rate of Japanese speech under vehicle noise. The weighting coefficients are not limited to this and may be designed as appropriate.
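As a rough sketch of the weighted inter-spectral distance, the following uses a piecewise-constant weighting function that emphasizes a band around 150 Hz and the 1 kHz to 2 kHz band. The band edges (100 to 200 Hz), the weight values 2.0 and 1.0, the use of frequencies in Hz rather than angular frequencies, and the choice of the absolute difference for d_i are all assumptions made for illustration; the source specifies none of them.

```python
def band_weight(freq_hz, emphasized_bands=((100.0, 200.0), (1000.0, 2000.0)), high=2.0, low=1.0):
    """Hypothetical weighting function F: `high` inside the emphasized bands, `low` elsewhere."""
    for band_start, band_end in emphasized_bands:
        if band_start <= freq_hz <= band_end:
            return high
    return low


def weighted_distance(input_spec, phoneme_spec, freqs_hz, weight_fn=band_weight):
    """D_W = sum over i of F(f_i) * d_i, with d_i taken as the absolute difference at bin i."""
    return sum(
        weight_fn(f) * abs(x - p)
        for f, x, p in zip(freqs_hz, input_spec, phoneme_spec)
    )
```

With these weights, a unit difference at 150 Hz or 1.5 kHz contributes twice as much to the distance as the same difference at 500 Hz or 4 kHz.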

Using the above formula, the phoneme matching unit 130 calculates the inter-spectral distance D_W between the input spectrum and each of the spectra of the m phonemes included in the acoustic model. The inter-spectral distance between the input spectrum and the spectrum of the j-th phoneme is denoted D_Wj (j = 1, 2, ..., m).

Step S103: The phoneme matching unit 130 identifies the phoneme that gives the minimum value of the inter-spectral distances D_Wj (j = 1, 2, ..., m). The speech recognition apparatus 100 outputs the phoneme identified in this way as the phoneme label for the speech input frame. This completes the processing for one frame; if there is a next frame to be processed, steps S101 to S103 are executed for that frame.
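Steps S102 and S103 amount to a minimum-distance search over the stored phoneme spectra. The sketch below assumes the acoustic model is a dictionary from phoneme label to spectrum and that the weights are already sampled per bin; both representations, the function names, and the absolute-difference form of d_i are hypothetical.

```python
def weighted_distance(input_spec, phoneme_spec, weights):
    """D_Wj for one phoneme: weighted sum of per-bin absolute differences (assumed form of d_i)."""
    return sum(w * abs(x - p) for w, x, p in zip(weights, input_spec, phoneme_spec))


def match_phoneme(input_spec, acoustic_model, weights):
    """Return the label whose stored spectrum gives the minimum D_Wj."""
    return min(
        acoustic_model,
        key=lambda label: weighted_distance(input_spec, acoustic_model[label], weights),
    )


def label_frames(frame_spectra, acoustic_model, weights):
    """Apply the matching frame by frame, yielding one phoneme label per frame."""
    return [match_phoneme(spec, acoustic_model, weights) for spec in frame_spectra]
```

In a toy case where the third bin is dominated by noise, lowering that bin's weight can change which phoneme is selected, which is the effect the weighting is intended to have.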

<Effect>
In this embodiment, when an input spectrum containing noise is compared with the phoneme spectra, the recognition rate can be improved by performing an evaluation that emphasizes the frequency bands characterizing the human voice.

Although the inter-spectral distance D_W is defined above as a weighted linear sum of the differences d_i, it is not limited to this form as long as weighting can be applied per frequency; a weighted sum of squares may also be used.
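The two forms of the distance can be compared side by side. In this sketch the differences d_i are passed in directly, and the function names are hypothetical.

```python
def weighted_linear_sum(diffs, weights):
    """Weighted linear sum: sum of F_i * |d_i|."""
    return sum(w * abs(d) for w, d in zip(weights, diffs))


def weighted_square_sum(diffs, weights):
    """Weighted sum of squares: sum of F_i * d_i ** 2."""
    return sum(w * d * d for w, d in zip(weights, diffs))
```

For equal weights, the squared form penalizes a single large difference more strongly than several small ones, while both forms still allow a separate weight for each frequency.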

(Second Embodiment)
<Configuration>
FIG. 5 shows a functional block diagram of the speech recognition apparatus 200 according to this embodiment. The speech recognition apparatus 200 is the speech recognition apparatus 100 according to the first embodiment with a weighting function determination unit 140 and a weighting function storage unit 150 added. The speech recognition apparatus 200 differs from the speech recognition apparatus 100 in that the weighting function determination unit 140 determines the weighting function F(ω) by referring to the weighting function storage unit 150, based on the pattern of the spectrum of the noise included in the speech input. Components that are the same as or correspond to those of the speech recognition apparatus 100 according to the first embodiment are given the same reference numerals.

<Operation>
The processing performed by the speech recognition apparatus 200 is described below. FIG. 6 shows a flowchart of this processing.

Step S201: The acoustic processing unit 110 receives a speech input and generates from it an input spectrum, which is the spectrum of the speech input. The processing in this step is the same as step S101 in the first embodiment, but in this embodiment it is preferable to execute the noise reduction processing by the noise processing unit 113. The noise spectrum used for the noise reduction processing is generated from the speech input. For example, the spectrum can be generated from low-power frames of the speech input or from frames designated as a non-speech period. The noise-reduced spectrum is output from the acoustic processing unit 110 as the input spectrum. The noise spectrum is also output from the acoustic processing unit 110.

Step S202: The weighting function determination unit 140 determines the weighting function by referring to the weighting function storage unit 150, based on the noise spectrum. The weighting function storage unit 150 stores noise spectrum patterns in association with weighting functions. FIG. 7 shows an example of this association. The spectra of noises A, B, and C shown in FIG. 7 are noise spectra in a vehicle, in an office, and in a tunnel, respectively. Weighting functions F_A(ω), F_B(ω), and F_C(ω) are associated with noises A, B, and C, respectively. The weighting functions can be designed, for example, with the policy of increasing the weight of the bands that characterize human speech while decreasing the weight of the bands where the noise level is high. The weighting function determination unit 140 selects, from the noise spectra stored in the weighting function storage unit 150, the one most similar to the noise spectrum acquired from the acoustic processing unit 110, and adopts the weighting function associated with the selected noise spectrum. The similarity of noise spectra can be judged, for example, based on the unweighted inter-spectral distance.
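The lookup in step S202 can be sketched as a nearest-pattern search using the unweighted inter-spectral distance. The table layout (a list of pattern and weight pairs), the four-bin spectra, and every numeric value below are made up for illustration and are not taken from FIG. 7.

```python
def unweighted_distance(spec_a, spec_b):
    """Unweighted inter-spectral distance: sum of per-bin absolute differences."""
    return sum(abs(a - b) for a, b in zip(spec_a, spec_b))


def select_weights(noise_spec, weighting_table):
    """Return the weights associated with the stored noise pattern closest to noise_spec."""
    _, weights = min(
        weighting_table,
        key=lambda entry: unweighted_distance(noise_spec, entry[0]),
    )
    return weights


# Hypothetical table of (noise pattern, sampled weighting function) pairs.
WEIGHTING_TABLE = [
    ([0.9, 0.5, 0.1, 0.1], [0.2, 0.6, 1.0, 1.0]),  # noise A: in-vehicle (illustrative values)
    ([0.2, 0.3, 0.3, 0.2], [1.0, 1.0, 1.0, 1.0]),  # noise B: office (illustrative values)
    ([0.6, 0.8, 0.7, 0.3], [0.4, 0.2, 0.3, 1.0]),  # noise C: tunnel (illustrative values)
]
```

Each weight vector here follows the stated design policy in a crude way: bins where the associated noise pattern is strong receive smaller weights.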

Step S203: The phoneme matching unit 130 acquires the input spectrum generated by the acoustic processing unit 110, the weighting function determined by the weighting function determination unit 140, and the spectrum of each phoneme from the acoustic model storage unit 120, and sequentially calculates the inter-spectral distances D_Wj (j = 1, 2, ..., m) between the input spectrum and the spectrum of each phoneme.

Step S204: The phoneme matching unit 130 identifies the phoneme that gives the minimum value of the inter-spectral distances D_Wj (j = 1, 2, ..., m). The speech recognition apparatus 200 outputs the phoneme identified in this way as the phoneme label for the speech input frame. This completes the processing for one frame; if there is a next frame to be processed, steps S201 to S204 are executed for that frame.

<Effect>
In this embodiment, as in the first embodiment, when an input spectrum containing noise is compared with the phoneme spectra, the recognition rate can be improved by performing an evaluation that emphasizes the frequency bands characterizing the human voice. In addition, this embodiment allows an evaluation in which the influence of the noise is reduced according to its characteristics, further improving the recognition rate.

(Third Embodiment)
<Configuration>
FIG. 8 shows a functional block diagram of the speech recognition apparatus 300 according to this embodiment. The speech recognition apparatus 300 is the speech recognition apparatus 200 according to the second embodiment with a language-specific weighting function determination unit 141 in place of the weighting function determination unit 140 and a language-specific weighting function storage unit 151 in place of the weighting function storage unit 150. In the speech recognition apparatus 200 according to the second embodiment, the weighting function determination unit 140 determines the weighting function F(ω) by referring to the weighting function storage unit 150, based on the pattern of the spectrum of the noise included in the speech input. In the speech recognition apparatus 300 according to this embodiment, by contrast, the language-specific weighting function determination unit 141 determines the weighting function F(ω) by referring to the language-specific weighting function storage unit 151, based on the language type of the speech included in the speech input. As one example, the speech recognition apparatus 300 also includes a word matching unit 160, a Japanese word storage unit 171, and an English word storage unit 172. Components that are the same as or correspond to those of the speech recognition apparatuses 100 and 200 according to the first and second embodiments are given the same reference numerals.

<Operation>
The processing performed by the speech recognition apparatus 300 is described below. FIG. 9 shows a flowchart of this processing.

Step S301: The acoustic processing unit 110 receives a speech input and generates from it an input spectrum, which is the spectrum of the speech input. The processing in this step is the same as step S101 in the first embodiment.

Step S302: The language-specific weighting function determination unit 141 determines the weighting function by referring to the language-specific weighting function storage unit 151, based on the language type of the speech included in the speech input. The method of determining the language type is described later.

The language-specific weighting function storage unit 151 stores language types in association with weighting functions. FIG. 10 shows an example of this association. As shown in FIG. 10, weighting functions F_J(ω) and F_E(ω) are associated with Japanese and English, respectively. The weighting functions can be designed, for example, with the policy of increasing the weight of the bands that characterize human speech in each language.

In Japanese, phonemes are basically arranged in a "C+V" structure, in which a consonant (C) is followed by a vowel (V), so vowels appear relatively often. Vowels, in which formants at 500 Hz or below are dominant, are therefore characteristic. In this embodiment, as one example, the Japanese weighting function F_J(ω) is set to be roughly uniform from the low and middle ranges to the high range while decreasing gently, so that the bands characterizing Japanese speech receive larger weights.

English, by contrast, has more kinds of consonants than Japanese, and its phonemes are arranged not only in the "C+V" structure but also in structures such as "C+C+V", "C", and "C+C", so consonants appear often. Consonants, in which middle-to-high frequency components are dominant, are therefore characteristic. In this embodiment, as one example, the English weighting function F_E(ω) is set to increase from the low range toward the middle and high ranges, so that the bands characterizing English speech receive larger weights.
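The qualitative shapes described for F_J(ω) and F_E(ω) can be sketched with simple linear functions. The linear form, the 8 kHz upper limit, the endpoint values, and the language keys "ja" and "en" are all assumptions for illustration; the source describes only the general trends (gently decreasing for Japanese, increasing for English).

```python
def japanese_weight(freq_hz, max_hz=8000.0):
    """Hypothetical F_J: nearly uniform, decreasing gently from 1.0 to 0.8 over 0..max_hz."""
    return 1.0 - 0.2 * min(freq_hz, max_hz) / max_hz


def english_weight(freq_hz, max_hz=8000.0):
    """Hypothetical F_E: increasing from 0.5 to 1.5 over 0..max_hz."""
    return 0.5 + 1.0 * min(freq_hz, max_hz) / max_hz


# Sketch of the language-specific weighting function storage: language type -> weighting function.
LANGUAGE_WEIGHTS = {"ja": japanese_weight, "en": english_weight}


def weights_for(language, freqs_hz):
    """Sample the weighting function associated with `language` at the given frequencies."""
    weight_fn = LANGUAGE_WEIGHTS[language]
    return [weight_fn(f) for f in freqs_hz]
```

The sampled weights can then be used in the weighted inter-spectral distance in the same way as a fixed or noise-dependent weighting function.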

Step S303: The phoneme matching unit 130 acquires the input spectrum generated by the acoustic processing unit 110, the weighting function determined by the language-specific weighting function determination unit 141, and the spectrum of each phoneme from the acoustic model storage unit 120, and sequentially calculates the inter-spectral distances D_Wj (j = 1, 2, ..., m) between the input spectrum and the spectrum of each phoneme.

Step S304: The phoneme matching unit 130 identifies the phoneme that gives the minimum value of the inter-spectral distances D_Wj (j = 1, 2, ..., m) and outputs it as the phoneme label for the speech input frame.

Step S305: The word matching unit 160 sequentially receives the phoneme labels output from the phoneme matching unit 130 and performs word matching. As one example, word matching is performed by associating the sequence of received phoneme labels with the word that has the highest similarity, that is, the highest probability (likelihood). As candidate words, Japanese words are stored in the Japanese word storage unit 171 and English words are stored in the English word storage unit 172. The word matching unit 160 refers to both the Japanese word storage unit 171 and the English word storage unit 172 and identifies the word with the highest probability. The word matching unit 160 outputs this word as the output of the speech recognition apparatus 300. This completes the processing for one frame; if there is a next frame to be processed, steps S301 to S305 are executed for that frame.

Here, an example of a method for determining the language type of the speech included in the speech input is described. During the processing of step S305, the word matching unit 160 selects from each of the Japanese word storage unit 171 and the English word storage unit 172 the words with the highest probabilities, for example those ranked within the top 10, and compares the average probability of the selected Japanese words with the average probability of the selected English words. The word matching unit 160 judges the language with the higher average probability to be the language type of the speech included in the speech input, and notifies the language-specific weighting function determination unit 141 of the result. In step S302, the language-specific weighting function determination unit 141 determines the weighting function based on the notified result, and in step S303 the phoneme matching unit 130 applies this weighting function to the input spectrum currently being processed. The language type determination and notification by the word matching unit 160 may be executed every time in step S305 or at predetermined intervals. Alternatively, the result of a majority vote over multiple determinations may be notified to the language-specific weighting function determination unit 141.
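The language determination described above (compare the mean probability of the top-ranked candidates per language, optionally followed by a majority vote) can be sketched as follows. The data layout, a dictionary from language to a non-empty list of candidate word probabilities, and the function names are hypothetical.

```python
from collections import Counter


def top_mean(probabilities, top_n=10):
    """Mean of the top_n highest probabilities (fewer if the list is shorter; must be non-empty)."""
    top = sorted(probabilities, reverse=True)[:top_n]
    return sum(top) / len(top)


def judge_language(scores_by_language, top_n=10):
    """Return the language whose top-ranked candidate words have the higher mean probability."""
    return max(
        scores_by_language,
        key=lambda language: top_mean(scores_by_language[language], top_n),
    )


def majority_vote(judgments):
    """Return the most frequent language among several successive judgments."""
    return Counter(judgments).most_common(1)[0][0]
```

Taking a majority vote over several judgments is one way to keep the selected weighting function from switching on a single ambiguous utterance.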

In the above description, because the speech recognition apparatus 300 includes the word matching unit 160, it outputs the word identified by the word matching unit 160, unlike the speech recognition apparatuses 100 and 200 according to the first and second embodiments, which output phoneme labels. However, the speech recognition apparatus 300 may output phoneme labels instead of, or in addition to, the word. The speech recognition apparatus 300 may also omit the word matching unit 160, the Japanese word storage unit 171, and the English word storage unit 172, and output phoneme labels. In this case, instead of receiving the determination result from the word matching unit 160, the language-specific weighting function determination unit 141 may receive the speech input or its spectrum from the acoustic processing unit 110 and determine the language itself using any of various language identification methods, or it may receive the result of a language type determination performed by an external device based on the phoneme labels.

Although Japanese and English have been given as examples of languages, the types and number of languages are not limited in this embodiment, and the embodiment is applicable to other languages as well.

<Effect>
In this embodiment, as in the first embodiment, when an input spectrum containing noise is compared with the phoneme spectra, the evaluation emphasizes the frequency bands that characterize the human voice, which improves the recognition rate. In addition, according to the language type of the speech included in the speech input, this embodiment can emphasize the frequency bands that particularly characterize that language, which further improves the recognition rate.
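The effect of a language-dependent weighting function on phoneme matching can be illustrated with a small Python sketch. The claims state only that a value based on the difference at each frequency is weighted and summed; the squared difference used here, the three-bin spectra, and the weight values are assumptions chosen for illustration and are not taken from the specification.

```python
def weighted_spectral_distance(input_spectrum, phoneme_spectrum, weights):
    """Sum over frequency bins of weight * (difference ** 2).

    The squared difference is an assumed choice of "value based on the
    difference"; the three sequences must have the same length.
    """
    return sum(
        w * (x - y) ** 2
        for x, y, w in zip(input_spectrum, phoneme_spectrum, weights)
    )


def match_phoneme(input_spectrum, acoustic_model, weights):
    """Return the phoneme whose stored spectrum has the minimum distance."""
    return min(
        acoustic_model,
        key=lambda p: weighted_spectral_distance(
            input_spectrum, acoustic_model[p], weights
        ),
    )


# Hypothetical three-bin acoustic model and per-language weighting functions.
ACOUSTIC_MODEL = {"a": [1.0, 1.0, 2.0], "i": [3.0, 5.0, 2.0]}
WEIGHTS_BY_LANGUAGE = {
    "japanese": [1.0, 0.1, 1.0],  # de-emphasizes the second bin
    "english": [0.1, 1.0, 1.0],   # de-emphasizes the first bin
}

noisy_input = [1.0, 5.0, 2.0]
label_ja = match_phoneme(noisy_input, ACOUSTIC_MODEL, WEIGHTS_BY_LANGUAGE["japanese"])
label_en = match_phoneme(noisy_input, ACOUSTIC_MODEL, WEIGHTS_BY_LANGUAGE["english"])
```

With these made-up numbers the same input is labeled differently depending on which bins the weighting function emphasizes, which is the mechanism by which selecting the weighting function per language can change, and ideally improve, the match.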

The present invention can be understood not only as a configuration of functional blocks of a speech recognition apparatus, but also as a speech recognition method or program executed by a computer having a processor.

The present invention is useful for speech recognition apparatuses and the like.

100, 200, 300 Speech recognition apparatus
110 Acoustic processing unit
111 High frequency enhancement unit
112 Acoustic analysis unit
113 Noise processing unit
120 Acoustic model storage unit
130 Phoneme matching unit
140 Weighting function determination unit
141 Language-specific weighting function determination unit
150 Weighting function storage unit
151 Language-specific weighting function storage unit
160 Word matching unit
171 Japanese word storage unit
172 English word storage unit

Claims

A speech recognition apparatus comprising:
an acoustic processing unit that acquires a speech input and generates an input spectrum that is a spectrum of the speech input;
an acoustic model storage unit that stores in advance the spectrum of each phoneme as an acoustic model; and
a phoneme matching unit that calculates an inter-spectral distance between the input spectrum and the spectrum of each phoneme of the acoustic model, and identifies the phoneme having the minimum inter-spectral distance as the phoneme for the speech input,
wherein the phoneme matching unit weights, at each of a plurality of frequencies, a value based on the difference between the input spectrum and the spectrum of each phoneme, and calculates the inter-spectral distance as the sum of the weighted values.

The speech recognition apparatus according to claim 1, wherein the acoustic processing unit further generates a noise spectrum that is a spectrum of noise included in the speech input, the apparatus further comprising:
a weighting function storage unit that stores a plurality of noise spectrum patterns and weighting functions in association with each other; and
a weighting function determination unit that refers to the weighting function storage unit and determines the weighting function associated with the noise spectrum pattern most similar to the noise spectrum generated by the acoustic processing unit,
wherein the phoneme matching unit performs the weighting based on the weighting function determined by the weighting function determination unit.

The speech recognition apparatus according to claim 1, further comprising:
a language-specific weighting function storage unit that stores a plurality of language types and weighting functions in association with each other; and
a language-specific weighting function determination unit that acquires the language type of the speech included in the speech input, refers to the language-specific weighting function storage unit, and determines the weighting function associated with the acquired language type,
wherein the phoneme matching unit performs the weighting based on the weighting function determined by the language-specific weighting function determination unit.