JP5302092B2

JP5302092B2 - Speech recognition model parameter creation device, speech recognition model parameter creation method, and speech recognition device

Info

Publication number: JP5302092B2
Application number: JP2009115183A
Authority: JP
Inventors: 昇早坂
Original assignee: RayTron Inc
Current assignee: RayTron Inc
Priority date: 2009-05-12
Filing date: 2009-05-12
Publication date: 2013-10-02
Anticipated expiration: 2029-05-12
Also published as: JP2010266488A

Description

The present invention relates to a speech recognition model parameter creation device, a speech recognition model parameter creation method, and a speech recognition device. In particular, it relates to a speech recognition model parameter creation device and a speech recognition model parameter creation method that create speech recognition model parameters using multi-condition learning, and to a speech recognition device.

Multi-condition learning is a conventional speech recognition method. In multi-condition learning, speech recognition model parameters are learned using speech data containing noise from various environments. When the input speech data contains noise similar to the environmental noise used in learning, the speech recognition rate improves.

A technique for performing speech recognition using such multi-condition learning is disclosed in, for example, Japanese Patent Application Laid-Open No. 2008-122927 (Patent Document 1).

Patent Document 1: JP 2008-122927 A

In multi-condition learning, however, the speech recognition rate decreases when the noise in the input speech data differs greatly from the environmental noise used in learning. That is, multi-condition learning is effective only when the noise in the input resembles the environmental noise used in learning.

Moreover, even when the speech recognition model parameters are learned using speech data containing noise from various environments, unknown noise that differs greatly from all of the noises used in learning may be mixed in when speech recognition is actually performed. In that case, the speech recognition rate decreases.

An object of the present invention is to provide a speech recognition model parameter creation device capable of improving the speech recognition rate.

Another object of the present invention is to provide a speech recognition model parameter creation method capable of improving the speech recognition rate.

Still another object of the present invention is to provide a speech recognition device capable of improving the speech recognition rate.

The speech recognition model parameter creation device according to the present invention includes feature amount calculation means for calculating feature amounts of speech data on which a plurality of noises are superimposed, normalization means for normalizing the feature amounts calculated by the feature amount calculation means, and creation means for creating speech recognition model parameters under the plurality of noises using the feature amounts normalized by the normalization means.

Preferably, the normalization means includes filtering means for filtering the feature amounts calculated by the feature amount calculation means using a band-pass filter.

More preferably, the normalization means includes division means for dividing the feature amounts filtered by the filtering means by their maximum amplitude value.

More preferably, the creation means creates the speech recognition model parameters by learning.

Another aspect of the present invention relates to a speech recognition model parameter creation method, characterized by calculating feature amounts of speech data on which a plurality of noises are superimposed, normalizing the calculated feature amounts, and creating speech recognition model parameters under the plurality of noises using the normalized feature amounts.

Still another aspect of the present invention relates to a speech recognition device that includes recognition means for performing speech recognition using speech recognition model parameters created by any of the speech recognition model parameter creation devices described above.

The speech recognition model parameter creation device according to the present invention normalizes the feature amounts of speech data on which a plurality of noises are superimposed, and creates speech recognition model parameters using the normalized feature amounts. Because this generalizes the plurality of noises, the resulting parameters can be applied to a wide variety of noises when speech recognition is performed. As a result, the speech recognition rate can be improved.

Likewise, the speech recognition model parameter creation method according to the present invention normalizes the feature amounts of speech data on which a plurality of noises are superimposed, and creates speech recognition model parameters using the normalized feature amounts. Because this generalizes the plurality of noises, the resulting parameters can be applied to a wide variety of noises when speech recognition is performed. As a result, the speech recognition rate can be improved.

Furthermore, the speech recognition device according to the present invention performs speech recognition using the speech recognition model parameters created by such a speech recognition model parameter creation device, and can therefore improve the speech recognition rate.

FIG. 1 is a block diagram showing the configuration of the speech recognition device.
FIG. 2 is a block diagram showing the configuration of the speech recognition model parameter creation device according to one embodiment of the present invention.
FIG. 3 is a flowchart showing the case where speech recognition model parameters are created using the speech recognition model parameter creation device and stored in the storage unit of the speech recognition device.
FIG. 4 is a graph showing the calculated feature amounts.
FIG. 5 is a graph showing the feature amounts of FIG. 4 after filtering.
FIG. 6 is a graph showing the feature amounts of FIG. 5 after division by the maximum amplitude value.
FIG. 7 is a flowchart showing the case where speech recognition is performed using the speech recognition device.
FIG. 8 is a flowchart showing another embodiment of the case where speech recognition model parameters are created using the speech recognition model parameter creation device and stored in the storage unit of the speech recognition device.

A speech recognition model parameter creation device according to one embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of the speech recognition device 10. FIG. 2 is a block diagram showing the configuration of the speech recognition model parameter creation device 20 according to one embodiment of the present invention. First, the configuration of the speech recognition device 10 is described with reference to FIG. 1.

The speech recognition device 10 includes a speech recognition device feature amount calculation unit 11 that calculates feature amounts of speech data input via a microphone 14, a recognition unit 12 that performs speech recognition, and a storage unit 13 that stores the speech recognition model parameters used by the recognition unit 12 when performing speech recognition.

The recognition unit 12 performs speech recognition using the feature amounts calculated by the speech recognition device feature amount calculation unit 11 and the speech recognition model parameters stored in the storage unit 13. The storage unit 13 stores the speech recognition model parameters used by the recognition unit 12 when performing speech recognition. The speech recognition model parameters are created by the speech recognition model parameter creation device 20 shown in FIG. 2.

Next, the configuration of the speech recognition model parameter creation device 20 is described with reference to FIG. 2. The speech recognition model parameter creation device 20 includes a noise superimposing unit 22 that receives, via a microphone 24, noiseless speech data containing no noise and superimposes noise on the received noiseless speech data to create noise-superimposed data; a creation device feature amount calculation unit 21 that calculates feature amounts of the noise-superimposed data created by the noise superimposing unit 22; a learning unit 23 that creates speech recognition model parameters using the feature amounts calculated by the creation device feature amount calculation unit 21; and a holding unit 25 that holds data for the plurality of noises superimposed by the noise superimposing unit 22.

The noise superimposing unit 22 extracts predetermined noise data from the holding unit 25. It then superimposes the extracted noise on the noiseless speech data to create noise-superimposed data.

The creation device feature amount calculation unit 21 calculates feature amounts of the noise-superimposed data. For example, MFCCs (Mel-Frequency Cepstral Coefficients) can be adopted as the feature amounts.

The learning unit 23 creates the speech recognition model parameters. For example, the mean values, variance values, transition probabilities, and weighting coefficients of a hidden Markov model (HMM) can be adopted as the speech recognition model parameters.

The holding unit 25 holds data for a plurality of noises. The plurality of noise data includes various types of environmental noise, for example crowd noise, white noise, and factory noise.

The speech recognition model parameter creation device 20 creates the speech recognition model parameters used by the speech recognition device 10. The following describes the case where speech recognition model parameters are created using the speech recognition model parameter creation device 20 and stored in the storage unit 13 of the speech recognition device 10. FIG. 3 is a flowchart showing this case. The description refers to FIGS. 1 to 3.

First, the speech recognition model parameter creation device 20 receives noiseless speech data via the microphone 24 (step S11 in FIG. 3; the word "step" is omitted hereinafter). The noise superimposing unit 22 then extracts crowd noise data from the holding unit 25 and superimposes it on the noiseless speech data to create crowd-noise-superimposed data. It likewise extracts white noise data to create white-noise-superimposed data, and factory noise data to create factory-noise-superimposed data. In this way, a plurality of noise data are extracted and a plurality of noise-superimposed data are created (S12).
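Step S12 can be illustrated with a minimal sketch. It assumes that superimposition is a sample-by-sample addition and that a noise recording shorter than the speech is simply looped; the description states neither, and the names `superimpose`, `make_noisy_set`, and `gain` are hypothetical.

```python
from itertools import cycle, islice


def superimpose(clean, noise, gain=1.0):
    """Add a noise recording to clean speech, sample by sample.

    Assumption: the noise is looped when it is shorter than the speech.
    """
    if not noise:
        raise ValueError("noise must contain at least one sample")
    looped = islice(cycle(noise), len(clean))
    return [s + gain * n for s, n in zip(clean, looped)]


def make_noisy_set(clean, noise_bank, gain=1.0):
    """Create one noise-superimposed copy of `clean` per held noise."""
    return {name: superimpose(clean, noise, gain)
            for name, noise in noise_bank.items()}


# Toy data standing in for the noiseless speech and the holding unit 25.
clean = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5]
noise_bank = {
    "crowd": [0.1, -0.1],
    "white": [0.05, -0.02, 0.03],
    "factory": [0.2],
}
noisy = make_noisy_set(clean, noise_bank)
```

Each entry of `noisy` corresponds to one of the crowd-, white-, and factory-noise-superimposed data sets that are passed on to S13.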

Next, the creation device feature amount calculation unit 21 calculates feature amounts of the noise-superimposed data created in S12, that is, of the crowd-noise-superimposed data, the white-noise-superimposed data, and the factory-noise-superimposed data (S13). Specifically, the noise-superimposed data is divided into a plurality of frames of 20 to 30 ms each, and a feature amount is calculated for each frame. The division is performed so that each frame partially shares data with the frame that follows it. Here, the creation device feature amount calculation unit 21 operates as the feature amount calculation means.
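The overlapping frame division in S13 can be sketched as follows. The 25 ms frame length lies within the stated 20 to 30 ms range, but the 10 ms frame shift is an assumption, since the description only says that adjacent frames partially share data. The function name `split_into_frames` is hypothetical.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split a signal into fixed-length frames that overlap their successors.

    frame_ms: frame length (the description gives 20 to 30 ms).
    shift_ms: hop between frame starts (assumed; must be below frame_ms
              so that consecutive frames share samples).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    shift = int(round(sample_rate * shift_ms / 1000.0))
    if not 0 < shift < frame_len:
        raise ValueError("shift must be positive and shorter than the frame")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames


# One second of dummy samples at a hypothetical 1000 Hz sampling rate.
signal = list(range(1000))
frames = split_into_frames(signal, sample_rate=1000)
```

With these settings each frame holds 25 samples and shares its last 15 samples with the next frame; a feature amount such as an MFCC vector would then be computed per frame.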

FIG. 4 is a graph showing the calculated feature amounts. In FIG. 4, the dotted line shows the feature amount of the crowd-noise-superimposed data, the dash-dot line shows that of the white-noise-superimposed data, and the solid line shows that of the factory-noise-superimposed data. The horizontal axis indicates the frame, and the vertical axis indicates the amplitude value of the feature amount. In the non-speech section A, the amplitudes of the feature amounts of the white-noise-superimposed, crowd-noise-superimposed, and factory-noise-superimposed data differ greatly from one another. In the speech section B as well, the amplitude of the feature amount of the white-noise-superimposed data differs greatly from those of the crowd-noise-superimposed and factory-noise-superimposed data. Comparing the amplitude values at frame 10, the crowd-noise-superimposed data shows 0.8, the white-noise-superimposed data shows -4, and the factory-noise-superimposed data shows -0.3.

When the calculation of the feature amounts is complete, the calculated feature amounts are normalized to obtain normalized feature amounts. This normalization exploits the fact that speech changes slowly over time. Specifically, the feature amounts are first filtered using a band-pass filter (S14). That is, the feature amounts are filtered so that only frequency components within a predetermined range pass and all other frequency components are blocked. The band-pass filter is an FIR (Finite Impulse Response) filter, which allows the processing to be performed stably.
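The filtering in S14 operates on the frame-by-frame trajectory of a feature amount. The description gives neither filter coefficients nor a passband, so the following is only a sketch under assumptions: a crude band-pass FIR kernel is built as a short moving average minus a long moving average, and edges are handled by replicating the end values. The names `bandpass_kernel` and `fir_filter` and the tap counts are hypothetical.

```python
import math


def bandpass_kernel(short_len=5, long_len=9):
    """Crude band-pass FIR kernel: short moving average minus long one.

    The taps sum to zero, so a constant offset (0 Hz) is removed, while
    the short average attenuates fast frame-to-frame fluctuations.
    """
    if short_len % 2 == 0 or long_len % 2 == 0 or short_len >= long_len:
        raise ValueError("lengths must be odd with short_len < long_len")
    pad = (long_len - short_len) // 2
    short = [0.0] * pad + [1.0 / short_len] * short_len + [0.0] * pad
    return [s - 1.0 / long_len for s in short]


def fir_filter(track, kernel):
    """Apply a centered FIR kernel; the output has the input's length."""
    half = len(kernel) // 2
    padded = [track[0]] * half + list(track) + [track[-1]] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(track))
    ]


kernel = bandpass_kernel()
# Two feature trajectories that differ only by a constant offset, standing
# in for the same speech under two different stationary noises.
base = [math.sin(2.0 * math.pi * n / 20.0) for n in range(40)]
shifted = [v + 3.0 for v in base]
filtered_base = fir_filter(base, kernel)
filtered_shifted = fir_filter(shifted, kernel)
```

After filtering, the two trajectories coincide up to rounding error, which mirrors how the curves move closer together between FIG. 4 and FIG. 5. Because an FIR filter has no feedback path, it is always stable, in line with the stability remark above.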

FIG. 5 is a graph showing the feature amounts of FIG. 4 after filtering. As in FIG. 4, the dotted line shows the feature amount of the crowd-noise-superimposed data, the dash-dot line shows that of the white-noise-superimposed data, and the solid line shows that of the factory-noise-superimposed data. The horizontal axis indicates the frame, and the vertical axis indicates the amplitude value of the feature amount. In both the non-speech section A and the speech section B, the waveforms of the three feature amounts are more closely aligned than in FIG. 4, and the differences in amplitude are smaller. Comparing the amplitude values at frame 10, the crowd-noise-superimposed and factory-noise-superimposed data show -0.3, and the white-noise-superimposed data shows -0.7. That is, filtering aligns the waveforms of data that originally had different waveforms, and thereby generalizes the data.

When the filtering of the feature amounts is complete, each filtered feature amount is divided by its own maximum amplitude value (S15). For example, referring to FIG. 5, the crowd-noise-superimposed data is divided by its maximum amplitude value a, the white-noise-superimposed data by its maximum amplitude value b, and the factory-noise-superimposed data by its maximum amplitude value c.
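The division in S15 can be sketched as below. The sketch assumes that "maximum amplitude value" means the largest absolute value of the filtered trajectory, which the description does not spell out; the function name and the toy values are hypothetical.

```python
def divide_by_max_amplitude(track):
    """Scale a feature trajectory so that its largest magnitude is 1.

    Assumption: the maximum amplitude is the largest absolute value.
    An all-zero trajectory is returned unchanged to avoid dividing by 0.
    """
    peak = max(abs(v) for v in track)
    if peak == 0.0:
        return list(track)
    return [v / peak for v in track]


# Toy filtered trajectories, each divided by its own peak (a and b).
filtered_crowd = [-0.3, 0.6, 1.2, 0.4]
filtered_white = [-0.7, 1.3, 2.6, 0.9]
normalized_crowd = divide_by_max_amplitude(filtered_crowd)
normalized_white = divide_by_max_amplitude(filtered_white)
```

Because each trajectory is divided by its own peak, trajectories that have a similar shape but different scales end up on a common scale.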

FIG. 6 is a graph showing the feature amounts of FIG. 5 after division by the maximum amplitude values a, b, and c. As in FIG. 5, the dotted line shows the feature amount of the crowd-noise-superimposed data, the dash-dot line shows that of the white-noise-superimposed data, and the solid line shows that of the factory-noise-superimposed data. The horizontal axis indicates the frame, and the vertical axis indicates the amplitude value of the feature amount. In both the non-speech section A and the speech section B, the waveforms of the three feature amounts are even more closely aligned than in FIG. 5, and the differences in amplitude are smaller. Comparing the amplitude values at frame 10, the crowd-noise-superimposed and factory-noise-superimposed data show -0.23, and the white-noise-superimposed data shows -0.27. That is, dividing by the maximum amplitude value further aligns the waveforms of data that originally had different waveforms, and thereby generalizes the data.

In this way, the calculated feature amounts are normalized by filtering them with a band-pass filter and dividing them by the maximum amplitude value, yielding the normalized feature amounts. Here, the creation device feature amount calculation unit 21 operates as the normalization means, the filtering means, and the division means.

The learning unit 23 then creates the speech recognition model parameters using the respective normalized feature amounts (S16). Specifically, the speech recognition model parameters are created by performing multi-condition learning. Here, the learning unit 23 operates as the creation means. The speech recognition model parameters are then stored in the storage unit 13 of the speech recognition device 10 (S17).

As described above, the speech recognition model parameter creation device 20 normalizes the feature amounts of speech data on which a plurality of noises are superimposed, and creates speech recognition model parameters using the normalized feature amounts. Because this generalizes the plurality of noises, the resulting parameters can be applied to a wide variety of noises when speech recognition is performed. As a result, the speech recognition rate can be improved.

Likewise, such a speech recognition model parameter creation method normalizes the feature amounts of speech data on which a plurality of noises are superimposed, and creates speech recognition model parameters using the normalized feature amounts. Because this generalizes the plurality of noises, the resulting parameters can be applied to a wide variety of noises when speech recognition is performed. As a result, the speech recognition rate can be improved.

Note that the amplitude values of the feature amounts shown in FIGS. 4 to 6 vary depending on the input speech data.

Next, the case where speech recognition is performed using the speech recognition device 10 is described. FIG. 7 is a flowchart showing this case. The description refers to FIGS. 1 to 7.

First, the speech recognition device 10 receives speech data via the microphone 14 (S21). The speech recognition device feature amount calculation unit 11 then calculates the feature amounts of the speech data.

The feature amounts are calculated in the same manner as in S13 to S15 of FIG. 3 described above. That is, the speech data is divided into a plurality of frames, and a feature amount is calculated for each frame (S22). When the calculation of the feature amounts of the speech data is complete, the calculated feature amounts are normalized, that is, filtered using the band-pass filter and divided by the maximum amplitude value, to obtain the normalized feature amounts of the speech data (S23).

The recognition unit 12 then performs speech recognition using the normalized feature amounts of the speech data calculated in S23 and the speech recognition model parameters stored in the procedure of FIG. 3 described above (S24). Here, the recognition unit 12 operates as the recognition means. Speech recognition is performed, for example, by comparing the normalized feature amounts of the speech data calculated in S23 with the speech recognition model parameters to calculate likelihood values, and then deciding on the basis of the calculated likelihood values.
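The likelihood comparison in S24 can be illustrated with a heavily simplified sketch. A real HMM decoder would also use states and transition probabilities; here each candidate word is assumed to be a single diagonal-covariance Gaussian described only by a mean and a variance vector, and the word whose model gives the highest log-likelihood is chosen. The model values, word labels, and function names are hypothetical.

```python
import math


def log_likelihood(frames, mean, var):
    """Sum of diagonal-Gaussian log-densities over all frames."""
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
    return total


def recognize(frames, models):
    """Return the best-scoring label and the score of every model."""
    scores = {
        label: log_likelihood(frames, p["mean"], p["var"])
        for label, p in models.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical stored parameters (standing in for the storage unit 13).
models = {
    "yes": {"mean": [0.5, -0.2], "var": [0.1, 0.1]},
    "no": {"mean": [-0.4, 0.3], "var": [0.1, 0.1]},
}
# Normalized feature amounts of the input speech, one vector per frame.
frames = [[0.45, -0.25], [0.55, -0.10]]
best, scores = recognize(frames, models)
```

In this toy case the input frames lie close to the mean of the "yes" model, so that model receives the higher likelihood.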

As described above, the speech recognition device 10 performs speech recognition using the speech recognition model parameters created by the speech recognition model parameter creation device 20, and can therefore improve the speech recognition rate.

In the above embodiment, an example was described in which, when speech recognition model parameters are created using the speech recognition model parameter creation device 20 and stored in the storage unit 13 of the speech recognition device 10, the feature amounts are filtered using a band-pass filter and divided by the maximum amplitude value, as shown in S14 to S15. The invention is not limited to this; after filtering with the band-pass filter, the feature amounts may instead be divided by the variance value.

Also, while the above embodiment filters the feature amounts using a band-pass filter and divides them by the maximum amplitude value, as shown in S14 to S15, the invention is not limited to this, and the other embodiment described below may be adopted.

FIG. 8 is a flowchart showing another embodiment of the case where speech recognition model parameters are created using the speech recognition model parameter creation device 20 and stored in the storage unit 13 of the speech recognition device 10. S31 to S33 are the same as S11 to S13 shown in FIG. 3, so their description is omitted.

Referring to FIG. 8, when the calculation of the feature amounts is complete in S33, the average value of the calculated feature amounts is obtained (S34). Next, the obtained average value is subtracted from the feature amounts calculated in S33 (S35). The feature amounts after subtraction are then filtered using a low-pass filter (S36). Finally, the filtered feature amounts are divided by their maximum amplitude value (S37).

In this way, the calculated feature amounts may be normalized by subtracting the average value, filtering with a low-pass filter, and dividing by the maximum amplitude value, yielding the normalized feature amounts. The speech recognition model parameters are then created (S38) and stored in the storage unit 13 (S39).
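Steps S34 to S37 can be sketched as a single function. The description does not specify the low-pass filter, so a short moving average with replicated edges is assumed here, and the maximum amplitude is again assumed to be the largest absolute value. The function name and the `smooth_len` parameter are hypothetical.

```python
def normalize_mean_lowpass_max(track, smooth_len=3):
    """Mean subtraction (S34, S35), low-pass (S36), max division (S37)."""
    if smooth_len % 2 == 0 or smooth_len < 1:
        raise ValueError("smooth_len must be a positive odd number")
    # S34, S35: subtract the average value of the trajectory.
    mean = sum(track) / len(track)
    centered = [v - mean for v in track]
    # S36: moving-average low-pass filter (assumed filter type).
    half = smooth_len // 2
    padded = [centered[0]] * half + centered + [centered[-1]] * half
    smoothed = [
        sum(padded[i:i + smooth_len]) / smooth_len
        for i in range(len(centered))
    ]
    # S37: divide by the maximum amplitude (largest absolute value).
    peak = max(abs(v) for v in smoothed)
    if peak == 0.0:
        return smoothed
    return [v / peak for v in smoothed]


track = [0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0]
# The same shape with a different offset and scale.
shifted_scaled = [2.0 * v + 5.0 for v in track]
result_a = normalize_mean_lowpass_max(track)
result_b = normalize_mean_lowpass_max(shifted_scaled)
```

Mean subtraction removes the offset and the final division removes the scale, so the two inputs yield the same normalized trajectory up to rounding error.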

Also, while the above embodiment filters the feature amounts using a band-pass filter and divides them by the maximum amplitude value, as shown in S14 to S15, the invention is not limited to this, and the feature amounts may simply be filtered using the band-pass filter without the division.

Also, while the above embodiment adopts an FIR filter, the invention is not limited to this, and an IIR (Infinite Impulse Response) filter may be adopted instead. This reduces the amount of computation required for the processing.
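The computational saving of an IIR filter can be illustrated with a sketch. The description does not give an IIR design, so the following assumes a band-pass built as the difference of two one-pole low-pass filters; each one-pole section needs a single multiplication per sample regardless of how long its effective memory is, whereas an FIR filter needs one multiplication per tap. The function names and coefficients are hypothetical, and the input is assumed to be non-empty.

```python
def one_pole_lowpass(track, alpha):
    """First-order IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1]).

    The state starts at the first sample to avoid a start-up transient.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    out = []
    state = track[0]
    for x in track:
        state += alpha * (x - state)
        out.append(state)
    return out


def iir_bandpass(track, fast_alpha=0.6, slow_alpha=0.1):
    """Crude IIR band-pass: fast low-pass minus slow low-pass."""
    fast = one_pole_lowpass(track, fast_alpha)
    slow = one_pole_lowpass(track, slow_alpha)
    return [f - s for f, s in zip(fast, slow)]


constant = [2.5] * 10
step = [0.0] * 5 + [1.0] * 5
out_constant = iir_bandpass(constant)
out_step = iir_bandpass(step)
```

A constant trajectory is mapped to zero, while a step produces a transient response. The feedback term makes the impulse response infinitely long, so stability depends on the coefficient choice; for 0 < alpha <= 1 each one-pole section is stable.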

Also, while the above embodiment creates crowd-noise-superimposed data, white-noise-superimposed data, and factory-noise-superimposed data in S12 to S13 and calculates the feature amounts of each, the invention is not limited to this, and the factory-noise-superimposed data need not be created. That is, the feature amounts of at least one of the crowd-noise-superimposed data and the white-noise-superimposed data may be calculated.

Also, while the above embodiment creates crowd-noise-superimposed data, white-noise-superimposed data, and factory-noise-superimposed data as the noise-superimposed data in S12, the invention is not limited to this, and data in which the amount of superimposed noise is negligibly small may also be included. That is, the noiseless speech data itself may be included among the noise-superimposed data.

Also, while in the above embodiment the holding unit 25 holds data for a plurality of noises from various environments, the invention is not limited to this, and the holding unit 25 may hold a plurality of noises of a specific type among the various environments. That is, the plurality of noises may include a plurality of noises of a specific type. For example, a plurality of factory-related noises are held as one specific type, specifically the noise of a first factory and the noise of a second factory. By creating noise-superimposed data for the first factory and for the second factory, normalized feature amounts for factories are obtained. Similarly, a plurality of crowd-related noises are held as another specific type, specifically the noise of a first crowd and the noise of a second crowd, and by creating noise-superimposed data for the first crowd and for the second crowd, normalized feature amounts for crowds are obtained. The speech recognition model parameters may then be created using the normalized feature amounts for factories and the normalized feature amounts for crowds.

In the above embodiment, an example was described in which the holding unit 25 contains noise data such as crowd noise, white noise, and factory noise. The invention is not limited to this; the holding unit 25 may contain noise data such as car engine noise, the noise of a room in which a plurality of computers or similar equipment are installed, or audio sound, and the noise data can be set arbitrarily.

The speech recognition model parameter creation device 20 may be implemented in hardware or in software. Likewise, the speech recognition device 10 may be implemented in hardware or in software.

Embodiments of the invention have been described above with reference to the drawings, but the invention is not limited to the illustrated embodiments. Various modifications and variations can be made to the illustrated embodiments within the same scope as the invention or within an equivalent scope.

Description of symbols: 10 speech recognition device, 11 speech recognition device feature amount calculation unit, 12 recognition unit, 13 storage unit, 14 microphone, 20 speech recognition model parameter creation device, 21 creation device feature amount calculation unit, 22 noise superimposition unit, 23 learning unit, 24 microphone, 25 holding unit.

Claims

A speech recognition model parameter creation device comprising:
feature amount calculation means for calculating a feature amount of each of a plurality of speech data created by superimposing, on noise-free data, a plurality of pre-held noise data corresponding to a plurality of noises; and
normalization means for normalizing each feature amount calculated by the feature amount calculation means in order to generalize the plurality of noises,
wherein the normalization means includes filtering means for filtering each feature amount calculated by the feature amount calculation means using a band-pass filter so as to pass only frequency components in a predetermined range,
the device further comprising creation means for creating speech recognition model parameters under the plurality of noises using each feature amount normalized by the normalization means.

A speech recognition model parameter creation device comprising:
feature amount calculation means for calculating a feature amount of each of a plurality of speech data created by superimposing, on noise-free data, a plurality of pre-held noise data corresponding to a plurality of noises; and
normalization means for normalizing each feature amount calculated by the feature amount calculation means in order to generalize the plurality of noises,
wherein the normalization means includes:
subtraction means for subtracting an average value of the feature amounts from each feature amount calculated by the feature amount calculation means; and
means for filtering the feature amount after subtraction by the subtraction means using a low-pass filter,
the device further comprising creation means for creating speech recognition model parameters under the plurality of noises using each feature amount normalized by the normalization means.

The speech recognition model parameter creation device according to claim 1, wherein the normalization means further includes division means for dividing the feature amount filtered by the filtering means by the maximum amplitude value or the variance value of that feature amount.

The speech recognition model parameter creation device according to claim 1, wherein the creation means creates the speech recognition model parameters by learning.

A speech recognition model parameter creation method executed by a speech recognition model parameter creation device, the method comprising:
a step of calculating a feature amount of each of a plurality of speech data created by superimposing, on noise-free data, a plurality of pre-held noise data corresponding to a plurality of noises; and
a normalization step of normalizing each calculated feature amount in order to generalize the plurality of noises,
wherein the normalization step includes a step of filtering each calculated feature amount using a band-pass filter so as to pass only frequency components in a predetermined range,
the method further comprising a step of creating speech recognition model parameters under the plurality of noises using each normalized feature amount.

A speech recognition model parameter creation method executed by a speech recognition model parameter creation device, the method comprising:
a step of calculating a feature amount of each of a plurality of speech data created by superimposing, on noise-free data, a plurality of pre-held noise data corresponding to a plurality of noises; and
a normalization step of normalizing each calculated feature amount in order to generalize the plurality of noises,
wherein the normalization step includes:
a step of subtracting an average value of the feature amounts from each calculated feature amount; and
a step of filtering the feature amount after subtraction using a low-pass filter,
the method further comprising a step of creating speech recognition model parameters under the plurality of noises using each normalized feature amount.

A speech recognition device comprising recognition means for performing speech recognition using the speech recognition model parameters created by the speech recognition model parameter creation device according to claim 1.