JP6128287B1

JP6128287B1 - Speech recognition apparatus and speech recognition method

Info

Publication number: JP6128287B1
Application number: JP2016558225A
Authority: JP
Inventors: 利行花沢
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2016-05-20
Filing date: 2016-05-20
Publication date: 2017-05-17
Anticipated expiration: 2036-05-20
Also published as: TWI604436B; TW201742050A; JPWO2017199417A1; WO2017199417A1

Abstract

The speech recognition apparatus of the present invention corrects the feature vector of the input speech by the difference between the mean vector of the acoustic model and the mean vector of the input speech, and matches the corrected feature vector against an acoustic model whose variance values have been adjusted according to whether the input speech is the first utterance. As a result, even when microphone frequency characteristics differ between training of the acoustic model and recognition of the input speech, or when ambient noise is present, speech recognition accuracy can be ensured for the first utterance without delaying the recognition end time.

Description

The present invention relates to a speech recognition apparatus and a speech recognition method for recognizing input speech.

Microphones used for speech recognition come in many models, and their frequency characteristics are not standardized. Smartphones and mobile phones, for example, differ widely from model to model in microphone frequency characteristics. An acoustic model is the set of standard sound patterns used for speech recognition; when the frequency characteristics of the microphone used to record its training data differ from those of the microphone used at recognition time, recognition performance generally degrades. Most recent speech recognition systems treat the frequency pattern of the input speech as a feature vector and perform pattern matching based on statistical methods. In this approach, an acoustic model is built in advance by modeling the statistical characteristics of feature vectors of frequency patterns taken from speech data uttered by a large number of speakers, and recognition is achieved by pattern matching between this model and the feature vectors of the input speech. Recognition performance can therefore be improved by training the acoustic model on speech from many speakers recorded with a variety of microphones having different frequency characteristics.

However, when the microphone frequency characteristics at training time and at recognition time differ greatly, the improvement is limited. A technique called CMN (cepstral mean normalization) is known to improve recognition performance in this situation. CMN computes the mean vector of the speech feature vectors separately at training time and at recognition time, and subtracts the respective mean vector from each feature vector. Because the mean vector reflects the frequency characteristics of the microphone, subtracting it on each side largely absorbs the difference in microphone frequency characteristics.
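The mean subtraction at the heart of CMN can be illustrated with a minimal Python sketch. The function names and the toy two-dimensional vectors are assumptions made for illustration; they do not come from the patent text. The sketch only shows that a constant per-dimension offset, standing in for a microphone characteristic in the cepstral domain, vanishes after the mean is subtracted.

```python
from typing import List

Vector = List[float]


def mean_vector(frames: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(frames)
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / n for d in range(dim)]


def cmn(frames: List[Vector]) -> List[Vector]:
    """Subtract the mean vector from every frame."""
    mu = mean_vector(frames)
    return [[x - m for x, m in zip(f, mu)] for f in frames]


# Toy data (hypothetical): the same utterance seen through two microphones.
clean = [[1.0, 2.0], [3.0, 6.0]]
offset = [0.5, -1.0]  # assumed constant offset caused by a microphone
shifted = [[x + o for x, o in zip(f, offset)] for f in clean]

# After normalization both versions coincide.
normalized_clean = cmn(clean)
normalized_shifted = cmn(shifted)
```

Note that this whole-utterance form needs every frame before it can subtract anything, which is the source of the delay discussed in the following paragraphs.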

Patent Document 1 discloses a method for the case where an HMM (hidden Markov model) is used as the acoustic model: instead of performing CMN during training, a mean vector is obtained approximately from the parameters of the trained HMM, and CMN is performed with it. It also discloses a technique that combines this method with noise adaptation of the acoustic model to rapidly obtain an acoustic model that is robust to both multiplicative distortion, such as differences in microphone frequency characteristics, and additive distortion, such as ambient noise. As methods for calculating the mean vector of the input speech, Patent Document 1 describes computing it from the whole of each utterance, one utterance at a time, or computing it at recognition time from the feature vectors up to the previous utterance.

Patent Document 1: JP 2006-349723 A

However, the method of Patent Document 1 cannot calculate the mean vector of a whole utterance until that utterance has ended. Recognition processing therefore cannot be carried out until the utterance ends, and the recognition response is slow.
The present invention aims to solve this problem. That is, its object is to ensure speech recognition accuracy for the first utterance, without delaying the recognition end time, even when microphone frequency characteristics differ or ambient noise is present.

The speech recognition apparatus according to the present invention comprises: analysis means that analyzes input speech and outputs a first feature vector; an acoustic model that models second feature vectors of speech data uttered by a plurality of speakers; correction means that subtracts either the mean vector of the acoustic model or the mean vector of the first feature vectors of the input speech from the first feature vector and outputs the corrected first feature vector; adjustment means that adjusts the variance values of the parameters of the acoustic model according to whether the speech is the first utterance; and matching means that matches the corrected first feature vector against the variance-adjusted acoustic model and outputs a recognition result for the speech.

The speech recognition apparatus of the present invention subtracts either the mean vector of the acoustic model or the mean vector of the feature vectors of the input speech from the feature vector of the input speech to obtain a corrected feature vector. It then matches the corrected feature vector against an acoustic model whose variance values have been adjusted according to whether the input speech is the first utterance, and outputs a recognition result. As a result, even when microphone frequency characteristics differ between training of the acoustic model and recognition of the input speech, or when ambient noise is present, speech recognition accuracy can be ensured for the first utterance without delaying the recognition end time.

FIG. 1 is a block diagram of the speech recognition apparatus 1 according to Embodiment 1 of the present invention.
FIG. 2 is a hardware configuration diagram of the speech recognition apparatus 1 according to Embodiment 1 of the present invention.
FIG. 3 is a flowchart showing the operation of the correction means 4 in Embodiment 1 of the present invention.
FIG. 4 is a flowchart showing the operation of the correction means 4 in Embodiment 1 of the present invention.
FIG. 5 is a flowchart showing the operation of the adjustment means 6 in Embodiment 1 of the present invention.
FIG. 6 is a flowchart showing the operation of the adjustment means 6 in Embodiment 2 of the present invention.

Embodiments of the speech recognition apparatus and the speech recognition method according to the present invention are described in detail below with reference to the drawings. The present invention is not limited to these embodiments.

Embodiment 1.
FIG. 1 is a block diagram of the speech recognition apparatus 1 according to Embodiment 1 of the present invention.
In FIG. 1, the speech recognition apparatus 1 analyzes the input speech 2 with the acoustic analysis means 3 and outputs the resulting feature vector (first feature vector) to the correction means 4, which corrects the feature vector depending on whether the input speech 2 is the first utterance. If the input speech 2 is the first utterance, the correction means 4 acquires the initial vector 5 and corrects the feature vector with it. For the second and subsequent utterances, it corrects the feature vector with the mean vector of the temporarily stored feature vectors up to the previous utterance. The adjustment means 6 then adjusts the variance values of the parameters of the acoustic model 7 depending on whether the input speech 2 is the first utterance, and the matching means 8 matches the corrected feature vector against the variance-adjusted acoustic model 7 and outputs the recognition result 9 for the input speech 2.
The acoustic analysis means 3 corresponds to the analysis means.

FIG. 2 is a hardware configuration diagram of the speech recognition apparatus 1 according to Embodiment 1 of the present invention.
The speech recognition apparatus 1 consists of a processor 10 and a memory 11. Specifically, the acoustic analysis means 3, the correction means 4, the adjustment means 6, and the matching means 8 are realized by the processor 10 executing a program stored in the memory 11. The initial vector 5 and the acoustic model 7 are stored in the memory 11.

The acoustic analysis means 3 analyzes the input speech 2 and outputs the resulting feature vectors to the correction means 4. A feature vector represents spectral features, that is, the frequency pattern of the speech; for example, it consists of dimensions 1 through 12 of the MFCCs (mel-frequency cepstral coefficients). The input speech 2 is divided into 10-millisecond sections called frames, and acoustic analysis is performed on each frame, so that multiple feature vectors are obtained from the speech data of one utterance. For example, a 1-second utterance is 1000 milliseconds long, so 1000 ms / 10 ms = 100 feature vectors are obtained.
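The frame arithmetic above can be checked with a short Python sketch. The 10 ms frame length is from the text; the 16 kHz sampling rate, the non-overlapping split, and the function names are assumptions chosen for illustration. Computing the MFCCs themselves is outside the scope of this sketch.

```python
FRAME_MS = 10  # frame length stated in the text


def num_frames(duration_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Number of whole frames that fit into an utterance of the given length."""
    return duration_ms // frame_ms


def split_into_frames(samples, sample_rate_hz, frame_ms=FRAME_MS):
    """Cut a sample sequence into consecutive, non-overlapping frames.

    A trailing partial frame is dropped. One feature vector would be
    computed from each returned frame.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]


# One second of silence at an assumed 16 kHz sampling rate.
one_second = [0.0] * 16000
frames = split_into_frames(one_second, 16000)
```

With these assumptions each frame holds 160 samples and the one-second signal yields 100 frames, matching the count given in the text.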

If the input speech 2 is the first utterance, the correction means 4 acquires the initial vector 5 stored in the memory 11 of FIG. 2, corrects the feature vector received from the acoustic analysis means 3 with the initial vector, and outputs the corrected feature vector to the matching means 8. The initial vector 5 is the mean vector of the feature vectors (second feature vectors) of the training data used to train the acoustic model 7. If the input speech 2 is the second or a later utterance, the mean vector up to the previous utterance can be calculated. The correction means 4 therefore takes the mean vector of the input speech 2, obtained by a known method such as that of Patent Document 1, as the correction vector, corrects the feature vector received from the acoustic analysis means 3 with the correction vector, and outputs the corrected feature vector to the matching means 8.

If the input speech 2 is the first utterance, the adjustment means 6 adjusts the variance values of the parameters of the acoustic model 7 stored in the memory 11 of FIG. 2 so as to widen them. For the second and subsequent utterances, it adjusts the variance values of the parameters of the acoustic model 7 so that they are smaller than for the first utterance.

The acoustic model 7 models the statistical characteristics of feature vectors (second feature vectors) of the frequency patterns of speech data uttered by a plurality of speakers. For example, if the speech recognition apparatus 1 performs word recognition, it is assumed to store word-level acoustic models formed by concatenating phoneme-level acoustic models. Suppose the speech recognition apparatus 1 recognizes prefecture names as its vocabulary. The acoustic model for the word Tokyo (pronounced "tookyoo") is then built by concatenating, in order, the acoustic models of the phonemes /t/, /o/, /o/, /k/, /j/, /o/, /o/, and HMMs are used for these models. The training data for the acoustic model 7 is assumed to have been recorded from multiple speakers with a variety of microphones having different frequency characteristics. The acoustic model 7 is assumed to have been trained in advance as follows: the mean vector of the feature vectors of the training data is calculated, and the phoneme-level acoustic models are trained on the data obtained by subtracting that mean vector from each feature vector of the training data.

The matching means 8 matches the corrected feature vector received from the correction means 4 against the acoustic model 7 adjusted by the adjustment means 6, and outputs the recognition result 9 from the speech recognition apparatus 1.

Next, the operation of the speech recognition apparatus 1 is described.
First, the acoustic analysis means 3 performs acoustic analysis on the input speech 2 and outputs feature vectors to the correction means 4. The correction means 4 then corrects the feature vectors according to whether the input speech 2 is the first utterance, and outputs the corrected feature vectors to the matching means 8.

The operation of the correction means 4 is described in detail below.
FIG. 3 is a flowchart showing the operation of the correction means 4 in Embodiment 1 of the present invention.
The correction means 4 acquires the initial vector 5 stored in the memory 11 of FIG. 2 (step 1; steps are hereinafter abbreviated ST). Next, it determines whether the current input speech 2 is the first utterance (ST2). The correction means 4 is assumed to count the utterances of the input speech 2 and to use that count to determine whether the input speech 2 is the first utterance.

If the input speech 2 is the first utterance, the acquired initial vector 5, that is, the mean vector of the feature vectors of the training data used to train the acoustic model 7, is used as the correction vector (ST3). If the input speech 2 is the second or a later utterance, the mean vector of the temporarily stored feature vectors up to the previous utterance is calculated and used as the correction vector (ST4).
The correction vector is then subtracted from the feature vector of the input speech 2, and the corrected feature vector is output to the matching means 8 (ST5). In addition, the uncorrected feature vector of the input speech 2 is temporarily stored (ST6).
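Steps ST1 through ST6 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the class name, method names, and the choice to pass one utterance as a list of frames are hypothetical. The point it shows is that the correction vector is already known before the utterance starts, so each frame could be corrected as soon as it arrives.

```python
def mean_vector(frames):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(frames)
    return [sum(f[d] for f in frames) / n for d in range(len(frames[0]))]


class CorrectionMeans:
    """Hypothetical sketch of the correction flow ST1 to ST6."""

    def __init__(self, initial_vector):
        # ST1: the initial vector is the training-data mean vector.
        self.initial_vector = list(initial_vector)
        self.utterance_count = 0
        self.stored_frames = []  # uncorrected frames of earlier utterances

    def correct(self, frames):
        self.utterance_count += 1
        if self.utterance_count == 1:  # ST2: first utterance?
            correction = self.initial_vector  # ST3
        else:
            correction = mean_vector(self.stored_frames)  # ST4
        # ST5: subtract the correction vector from every frame.
        corrected = [[x - c for x, c in zip(f, correction)] for f in frames]
        # ST6: keep the uncorrected frames for later utterances.
        self.stored_frames.extend(list(f) for f in frames)
        return corrected


means = CorrectionMeans(initial_vector=[1.0, 1.0])
first = means.correct([[2.0, 3.0], [4.0, 5.0]])   # uses the initial vector
second = means.correct([[3.0, 4.0]])              # uses the stored-frame mean
```

In this toy run the stored frames of the first utterance average to [3.0, 4.0], so the single frame of the second utterance is corrected to the zero vector.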

As described above, the initial vector used as the correction vector for the first utterance is the vector that was subtracted from the feature vectors when the acoustic model 7 was trained. Subtracting this initial vector, as the correction vector, from the feature vector of the input speech 2 at recognition time as well makes the way the feature vector of the input speech 2 is calculated identical to the way the training-time feature vectors were calculated. It does not, however, absorb the difference in microphone frequency characteristics between training and recognition. To obtain that effect, as described later, the mean vector of the feature vectors of the input speech 2 must be obtained by some method and subtracted from the feature vector of the input speech 2. In the first utterance, though, and especially at its beginning, few feature vectors of the input speech 2 are available for calculating a mean vector, so it is difficult to calculate a statistically reasonable one. The mean vector of the training-time feature vectors is therefore used instead.

From the second utterance onward, the mean vector of the feature vectors of the input speech 2 can be calculated, so the difference in microphone frequency characteristics between training and recognition can be absorbed.

For the second and subsequent utterances, the correction vector may instead be a weighted average of the mean vector of the feature vectors up to the previous utterance, which are temporarily stored in the correction means 4, and the correction vector used one utterance earlier.
FIG. 4 is a flowchart showing the operation of the correction means 4 in Embodiment 1 of the present invention.
In FIG. 4, steps that perform the same operation as in FIG. 3 carry the same numbers as in FIG. 3. The only differences between FIG. 4 and FIG. 3 are that the processing of ST3 in FIG. 3 is replaced by ST3a in FIG. 4, and the processing of ST6 in FIG. 3 is replaced by ST6a in FIG. 4.

The operation of ST3a is as described above: the correction means 4 takes the weighted average of the mean vector of the temporarily stored feature vectors up to the previous utterance and the correction vector used one utterance earlier, and uses the result as the correction vector.
In ST6a, the correction vector is temporarily stored in addition to the uncorrected feature vector of the input speech 2.

When the correction vector is obtained in this way, as a weighted average of the mean vector of the temporarily stored feature vectors up to the previous utterance and the correction vector used one utterance earlier, the immediately preceding utterance is given more weight. The correction vector can therefore be updated quickly even when the speaker changes partway through, and the recognition rate can be raised.
This concludes the detailed description of the operation of the correction means 4.
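The weighted-average variant of FIG. 4 can be sketched as a single Python function. The text does not give the weights, so the `weight` parameter and its default of 0.5 are assumptions made only for illustration, as is the function name.

```python
def weighted_correction(stored_mean, previous_correction, weight=0.5):
    """Blend the stored-frame mean with the previous correction vector.

    `weight` is the share given to the stored-frame mean; the remaining
    share goes to the correction vector used one utterance earlier.
    The value 0.5 is a placeholder, not a value from the source.
    """
    return [weight * m + (1.0 - weight) * p
            for m, p in zip(stored_mean, previous_correction)]


# Toy values (hypothetical).
blended = weighted_correction([2.0, 4.0], [0.0, 0.0])
mostly_previous = weighted_correction([2.0, 4.0], [0.0, 0.0], weight=0.25)
```

A caller would store the returned vector after each utterance, corresponding to ST6a, and pass it back in as `previous_correction` on the next utterance.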

Next, the operation of the adjustment means 6 is described.
FIG. 5 is a flowchart showing the operation of the adjustment means 6 in Embodiment 1 of the present invention.
The adjustment means 6 acquires the acoustic model 7 stored in the memory 11 of FIG. 2 (ST21). Next, it determines whether the current input speech 2 is the first utterance (ST22). If so, it multiplies the variance values, which are parameters of the acoustic model 7, by α and outputs the result to the matching means 8 as the adjusted acoustic model (ST23). Here α is a constant greater than 1 (first constant), whose specific value is determined in advance, for example by recognition experiments. Multiplying by a value greater than 1 enlarges the variance of the acoustic model, so that even when the feature vector of the input speech 2 shifts because of differences in microphone frequency characteristics or the like, the drop in acoustic likelihood is suppressed, and an acoustic model robust to such differences is obtained. On the other hand, the ability to discriminate between acoustic models may decrease, so the value of α should be set after verifying recognition performance in prior recognition experiments.
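The trade-off described above can be made concrete with a small Python sketch. It assumes that each HMM state emits through a diagonal-covariance Gaussian, which the text does not state explicitly, and it uses α = 2.0 and toy vectors purely as placeholders. It compares the log-likelihood of a vector shifted away from the model mean under the original and the widened variance.

```python
import math


def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def scale_variance(var, factor):
    """Multiply every variance value by the same factor."""
    return [factor * v for v in var]


ALPHA = 2.0  # hypothetical first constant, greater than 1

mean = [0.0, 0.0]
var = [1.0, 1.0]
wide_var = scale_variance(var, ALPHA)

shifted = [3.0, 3.0]  # feature vector displaced by a microphone mismatch
on_mean = [0.0, 0.0]  # feature vector that matches the model exactly

shifted_original = diag_gauss_loglik(shifted, mean, var)
shifted_widened = diag_gauss_loglik(shifted, mean, wide_var)
on_mean_original = diag_gauss_loglik(on_mean, mean, var)
on_mean_widened = diag_gauss_loglik(on_mean, mean, wide_var)
```

Under these assumptions the widened model scores the shifted vector higher than the original model does, while it scores the perfectly matching vector lower, which mirrors the robustness gain and the loss of sharpness described in the text.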

If, on the other hand, the input speech 2 is the second or a later utterance, the adjustment means 6 multiplies the variance values, which are parameters of the acoustic model 7, by β and outputs the result to the matching means 8 as the adjusted acoustic model (ST24). Here β is either a constant (second constant) that is at least 1 and at most α, or a variable that is at least 1 and at most α and decreases monotonically with the number of utterances of the input speech 2. If β is a constant, it is set to 1, for example. If β is a variable, it is reduced according to the number of utterances; for example, β is calculated by Formula (1).

β = MAX(1, α − c * n)   (Formula 1)

In Formula (1), MAX(1, α − c * n) is an operator that selects the larger of 1 and α − c * n. Here c is an experimentally determined constant, and n is the number of input utterances.
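Formula (1), together with the first-utterance rule of ST23, can be written as a short Python function. The function name and the values α = 2.0 and c = 0.25 are placeholders chosen for illustration; the text leaves both constants to be determined experimentally.

```python
def variance_scale(n, alpha, c):
    """Factor applied to the acoustic-model variance for utterance n.

    n is the 1-based utterance count. The first utterance uses alpha
    (ST23); later utterances use beta = max(1, alpha - c * n) (ST24).
    """
    if n == 1:
        return alpha
    return max(1.0, alpha - c * n)


ALPHA = 2.0  # hypothetical first constant
C = 0.25     # hypothetical decay constant

# Scale factors for utterances 1 through 5.
schedule = [variance_scale(n, ALPHA, C) for n in range(1, 6)]
```

With these placeholder values the factor starts at 2.0 and falls to 1.5 and 1.25 before settling at 1.0, at which point the original variance of the stored acoustic model is used unchanged.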

For the second and subsequent utterances, the variance of the acoustic model is either made identical to that of the original acoustic model stored in the memory 11, or brought progressively closer to it as the number of utterances increases. The discrimination performance of the acoustic model thereby either equals that of the original acoustic model stored in the memory 11, or approaches it as the number of utterances increases.
Reducing the variance of the acoustic model 7 in this way lowers robustness to differences in microphone frequency characteristics. This causes no problem, however, because from the second utterance onward, as described above, the correction means 4 performs processing that absorbs the difference between training-time and recognition-time microphone frequency characteristics.
This concludes the description of the operation of the adjustment means 6.

Next, the operation of the matching means 8 is described.
When the corrected feature vector is input, the matching means 8 performs pattern matching on it using the adjusted acoustic model received from the adjustment means 6, and outputs the vocabulary item of the acoustic model with the highest likelihood from the speech recognition apparatus 1 as the recognition result 9. The Viterbi algorithm, for example, is used as the pattern matching method.
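The matching step can be illustrated with a compact log-domain Viterbi sketch in Python. Several things here are assumptions rather than details from the text: each word model is a left-to-right HMM with one diagonal-covariance Gaussian per state, the self-loop and forward transition probabilities are both 0.5, the path must start in the first state and end in the last, and the input has at least one frame. The word names and parameters are toy values.

```python
import math

NEG_INF = float("-inf")


def diag_gauss_loglik(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def viterbi_score(frames, states, log_self=math.log(0.5), log_next=math.log(0.5)):
    """Best-path log-likelihood of frames through a left-to-right HMM.

    `states` is a list of (mean, var) pairs, one Gaussian per state.
    """
    n = len(states)
    score = [NEG_INF] * n
    score[0] = diag_gauss_loglik(frames[0], *states[0])
    for x in frames[1:]:
        new = [NEG_INF] * n
        for s in range(n):
            stay = score[s] + log_self
            move = score[s - 1] + log_next if s > 0 else NEG_INF
            best = max(stay, move)
            if best > NEG_INF:
                new[s] = best + diag_gauss_loglik(x, *states[s])
        score = new
    return score[-1]


def recognize(frames, word_models):
    """Return the word whose model gives the highest Viterbi score."""
    return max(word_models, key=lambda w: viterbi_score(frames, word_models[w]))


# Two toy one-dimensional word models (hypothetical).
word_models = {
    "low_high": [([0.0], [1.0]), ([5.0], [1.0])],
    "high_low": [([5.0], [1.0]), ([0.0], [1.0])],
}
frames = [[0.1], [0.2], [4.9], [5.1]]
result = recognize(frames, word_models)
```

In the apparatus described here, the variances passed in `states` would be the adjusted ones supplied by the adjustment means 6, and `frames` would be the corrected feature vectors from the correction means 4.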

As described above, for the first utterance the speech recognition apparatus 1 of this embodiment corrects the first feature vector for the difference between the mean vector of the first feature vectors obtained by analyzing the input speech 2 and the training-time mean vector of the acoustic model, and performs pattern matching on the corrected first feature vector using an acoustic model whose variance values have been enlarged by the adjustment means 6. Robust speech recognition is therefore possible for the first utterance, without delaying the recognition end time, even when microphone frequency characteristics differ between training of the acoustic model and recognition. Furthermore, for the second and subsequent utterances, the processing that corrects the difference between the training-time and recognition-time mean vectors maintains recognition performance that makes full use of what the acoustic model has learned.

In Patent Document 1, when the input utterance is short, the accuracy of the mean vector drops and recognition performance degrades. In the speech recognition apparatus according to the present invention, by contrast, the feature vector of the input speech is corrected in accordance with how the acoustic model was trained, and for the first utterance the variance values of the acoustic model parameters are enlarged. Robust speech recognition is therefore possible even when microphone frequency characteristics differ between training and recognition and the input utterance is short. Furthermore, for the second and subsequent utterances, the processing that corrects the difference between the training-time and recognition-time mean vectors maintains recognition performance that makes full use of what the acoustic model has learned.

Also, in Patent Document 1 the mean vector is obtained over the entire training data, so CMN does not take effect for the first utterance, and recognition performance degrades when microphone frequency characteristics differ between training and recognition. In the speech recognition apparatus according to the present invention, the variance values of the acoustic model parameters are enlarged for the first utterance, so robust speech recognition is possible even for the first utterance of the input speech when microphone frequency characteristics differ between training and recognition. Furthermore, for the second and subsequent utterances, the processing that corrects the difference between the training-time and recognition-time mean vectors maintains recognition performance that makes full use of what the acoustic model has learned.

In Embodiment 1, the acoustic model 7 was described as one for which CMN was performed, that is, the average vector of the feature vectors of the training data was calculated and the phoneme-unit acoustic model was trained in advance on data obtained by subtracting that average vector from each feature vector of the training data. The apparatus also supports the case where CMN is not performed.
For an acoustic model 7 trained without CMN, the average vector of the feature vectors of the training data has not been subtracted from the training data. For the first utterance, therefore, the correction means 4 does not subtract the average vector of the input speech 2 from the feature vector of the input speech 2. For the second and subsequent utterances, the correction means 4 subtracts, from the feature vector of the input speech 2, the average vector of the feature vectors up to the previous utterance, and adds the average vector of the training data of the acoustic model 7.
In this way, in accordance with how the acoustic model 7 was trained, the correction means 4 corrects the feature vector of the input speech 2 by the difference between the average vector at training time of the acoustic model 7 and the average vector of the feature vectors of the input speech 2.
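The two correction modes described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `Corrector`, the flag `model_trained_with_cmn`, and the use of plain Python lists for feature vectors are assumptions made for the example. The CMN branch follows the behaviour stated in the claims (training-data average for the first utterance, average of the stored earlier utterances afterwards).

```python
from typing import List

Vector = List[float]


def mean_vector(frames: List[Vector]) -> Vector:
    """Per-dimension mean of a non-empty list of equal-length frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


class Corrector:
    """Illustrative stand-in for the correction means 4 (hypothetical names)."""

    def __init__(self, training_mean: Vector, model_trained_with_cmn: bool):
        self.training_mean = training_mean
        self.model_trained_with_cmn = model_trained_with_cmn
        self.history: List[Vector] = []  # uncorrected frames of earlier utterances

    def correct(self, utterance: List[Vector]) -> List[Vector]:
        if not self.history:
            # First utterance: there is no earlier speech to average over.
            if self.model_trained_with_cmn:
                offset = list(self.training_mean)  # subtract the training-data mean
            else:
                offset = [0.0] * len(self.training_mean)  # leave features unchanged
        else:
            prev_mean = mean_vector(self.history)
            if self.model_trained_with_cmn:
                offset = prev_mean  # subtract the mean of the earlier utterances
            else:
                # Subtract the earlier-utterance mean, then add the training mean back.
                offset = [p - t for p, t in zip(prev_mean, self.training_mean)]
        corrected = [[x - o for x, o in zip(frame, offset)] for frame in utterance]
        self.history.extend(utterance)  # keep the uncorrected frames for later use
        return corrected
```

Because the correction vector for an utterance depends only on data available before that utterance starts, the correction can be applied frame by frame without waiting for the utterance to end.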

The acoustic model 7 may also be trained using both of the following as training data: per-speaker training data created by subtracting, from the second feature vectors of each speaker, the average vector of all the second feature vectors of that speaker; and all-speaker training data created by subtracting, from the second feature vectors of all speakers, the average of all the second feature vectors of all speakers.
The per-speaker training data suppresses the variation in feature vectors caused by differences between microphones and speakers, so the acoustic model can be trained with high accuracy, which improves recognition performance. The all-speaker training data, on the other hand, is obtained merely by subtracting the same vector uniformly from the feature vectors of the training data, so it retains the same characteristics as the original training data. Since the original training data contains feature vectors of speakers recorded with microphones of various frequency characteristics, it has the effect of building an acoustic model that is robust to differences between microphones and speakers. In addition, subtracting the all-speaker average vector from the feature vectors of all speakers brings their values roughly into line with the feature vectors of the per-speaker training data, which makes it possible to train an acoustic model that combines the characteristics of both sets of training data.
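The construction of the two training sets can be sketched as below. This is an illustrative sketch only: the function names, the dictionary keyed by speaker, and the list-of-lists representation of feature vectors are assumptions, and the actual model training step is not shown.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def mean_vector(frames: List[Vector]) -> Vector:
    """Per-dimension mean of a non-empty list of equal-length frames."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def subtract_mean(frames: List[Vector], mean: Vector) -> List[Vector]:
    """Subtract the same mean vector from every frame."""
    return [[x - m for x, m in zip(frame, mean)] for frame in frames]


def build_training_sets(
    frames_by_speaker: Dict[str, List[Vector]]
) -> Tuple[List[Vector], List[Vector], Vector]:
    """Return (per-speaker data, all-speaker data, all-speaker mean)."""
    # Per-speaker data: each speaker's frames minus that speaker's own mean.
    per_speaker: List[Vector] = []
    for frames in frames_by_speaker.values():
        per_speaker.extend(subtract_mean(frames, mean_vector(frames)))

    # All-speaker data: every frame minus the single mean over all speakers.
    all_frames = [f for frames in frames_by_speaker.values() for f in frames]
    global_mean = mean_vector(all_frames)
    all_speaker = subtract_mean(all_frames, global_mean)

    # The global mean is returned so it can serve as the first-utterance
    # correction vector at recognition time.
    return per_speaker, all_speaker, global_mean
```

Both returned sets would then be pooled as training data for the acoustic model.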

When speech recognition is performed with an acoustic model trained as described above on both the per-speaker training data and the all-speaker training data, the correction means 4, for the first utterance, uses the average vector of all the feature vectors of all speakers as the correction vector. The acoustic model whose variance value has been adjusted is then collated with the corrected feature vector obtained by subtracting the correction vector from the feature vector of the input speech. As a result, without any delay in the recognition end time, the apparatus is robust to differences between microphones and speakers, owing to the acoustic model having been trained on the all-speaker training data created by subtracting the average of all the second feature vectors of all speakers, and the accuracy of speech recognition is secured. Furthermore, for the second and subsequent utterances, the correction means 4 uses the average vector of the feature vectors up to the utterance preceding the input speech 2 as the correction vector. The variance value is adjusted so that the acoustic model becomes identical or close to the acoustic model that, owing to training on the per-speaker training data created by subtracting the average vector of all the second feature vectors of each speaker, was trained so that CMN is fully effective against feature vector variation caused by differences between microphones and speakers. Recognition performance can therefore be improved.

Embodiment 2.
The speech recognition apparatus 1 according to Embodiment 2 of the present invention has the same functional configuration as the speech recognition apparatus 1 of FIG. 1 and the same hardware configuration as FIG. 2 according to Embodiment 1. The difference from Embodiment 1 lies in the operation of the adjusting means 6. The operation of the other constituent means is the same as in Embodiment 1.

The operation of the adjusting means 6 in this embodiment will be described.
FIG. 6 is a flowchart showing the operation of the adjusting means 6 in Embodiment 1 of the present invention.
The adjusting means 6 acquires the acoustic model 7 (ST21). Next, it determines whether or not the current input speech 2 is the first utterance (ST22).
When the input speech 2 is the first utterance, the adjusting means 6 multiplies by α the values in dimensions n1 to n2, which are a subset of the variance values that are parameters of the acoustic model 7, and outputs the result to the collation means 8 as the adjusted acoustic model (ST23a).

Here, n1 and n2 are constants satisfying 0 <= n1 <= n2 < N, where N is the number of dimensions of the feature vector. α is a constant larger than 1, and its specific value is determined in advance by recognition experiments or the like. Multiplying the low-dimensional variance values by a value larger than 1 in this way enlarges the variance of the acoustic model 7, so that the drop in acoustic likelihood is suppressed even when the feature vector of the input speech 2 shifts because of a difference in microphone frequency characteristics or the like, and an acoustic model robust to such differences is obtained. On the other hand, the discrimination performance between acoustic models may decrease, so the value of the constant α should be set after verifying recognition performance in a prior recognition experiment.
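The trade-off described above can be illustrated with a one-dimensional Gaussian. The following is a sketch under assumptions: the text does not state the form of the output distribution in this section, so a Gaussian is assumed here, and the values α = 4 and a shift of 3 standard deviations are chosen only for illustration.

```python
import math


def log_gauss(x: float, mean: float, var: float) -> float:
    """Log density of a one-dimensional Gaussian (assumed model form)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


ALPHA = 4.0  # illustrative value only; the text says to tune it by experiment

# Feature that matches the model mean exactly.
matched_original = log_gauss(0.0, 0.0, 1.0)
matched_widened = log_gauss(0.0, 0.0, ALPHA)

# Feature shifted by 3 units, e.g. by a microphone mismatch.
shifted_original = log_gauss(3.0, 0.0, 1.0)
shifted_widened = log_gauss(3.0, 0.0, ALPHA)

# Drop in log-likelihood caused by the shift, with and without widening.
drop_original = matched_original - shifted_original
drop_widened = matched_widened - shifted_widened
```

With these numbers the shift costs 4.5 in log-likelihood under the original variance but only 1.125 under the widened one, which is the suppression effect. The matched score is lower under the widened variance, which corresponds to the loss of discrimination that the text warns about.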

The effect of limiting the dimensions whose variance values are adjusted to n1 through n2 is as follows. When the frequency characteristics of the microphone differ between training and recognition of the acoustic model, the difference usually affects the low-order values of the MFCC feature vector, while the high-order MFCC values often change little. In that case, leaving the high-order variance values unadjusted and using the variance values of the original acoustic model as they are prevents the discrimination performance of the acoustic model from being lowered more than necessary.
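Scaling only a band of dimensions can be sketched as follows. This is a minimal example under assumptions: the function name `scale_variances` is hypothetical, the variance values of one distribution are held in a plain list, dimensions are indexed from 0, and the range n1 to n2 is treated as inclusive at both ends, consistent with the constraint 0 <= n1 <= n2 < N.

```python
from typing import List


def scale_variances(variances: List[float], n1: int, n2: int, factor: float) -> List[float]:
    """Multiply the variance values in dimensions n1..n2 (inclusive) by factor.

    Dimensions outside the range keep the original model's variance values.
    """
    n = len(variances)
    if not (0 <= n1 <= n2 < n):
        raise ValueError("n1 and n2 must satisfy 0 <= n1 <= n2 < N")
    return [v * factor if n1 <= i <= n2 else v for i, v in enumerate(variances)]
```

For example, `scale_variances([1.0, 2.0, 3.0, 4.0], 0, 1, 2.0)` widens only the first two dimensions. In a full acoustic model the same operation would be applied to the variance values of every distribution in the model.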

Next, the operation of the adjusting means 6 when the input speech 2 is the second or a later utterance will be described.
From the second utterance onward, the adjusting means 6 multiplies by β the values in dimensions n1 to n2, which are a subset of the variance values that are parameters of the acoustic model 7, and outputs the result to the collation means 8 as the adjusted acoustic model 7 (ST24a).
Here, β is either a constant that is not less than 1 and not more than α, or a variable that is not less than 1 and not more than α and that decreases monotonically with the number of input utterances. When β is a constant, it is set to 1, for example. When β is a variable, it takes a value that decreases with the number of utterances; for example, the value of β is calculated by equation (1), as in Embodiment 1.
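The choice between α and β can be sketched as below. This is a hedged illustration: equation (1) is not reproduced in this section, so the decaying schedule 1 + (α − 1)/k used here is a hypothetical stand-in that merely satisfies the stated constraints (not less than 1, not more than α, monotonically decreasing with the utterance count k). The function name and the `decay` flag are likewise assumptions.

```python
def variance_factor(utterance_index: int, alpha: float, decay: bool = True) -> float:
    """Return the multiplier for the variance values of the current utterance.

    utterance_index is 1-based. The first utterance gets alpha. Later
    utterances get beta: either the constant 1.0 (decay=False) or a
    hypothetical schedule 1 + (alpha - 1) / k, which is NOT equation (1)
    of the source and only satisfies 1 <= beta <= alpha with monotonic decay.
    """
    if alpha <= 1.0:
        raise ValueError("alpha must be larger than 1")
    if utterance_index < 1:
        raise ValueError("utterance_index is 1-based")
    if utterance_index == 1:
        return alpha
    if not decay:
        return 1.0
    return 1.0 + (alpha - 1.0) / utterance_index
```

With `alpha = 3.0`, the factor is 3.0 for the first utterance, 2.0 for the second, and 1.5 for the fourth, approaching 1.0 (the original trained variance) as more utterances arrive.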

As described above, from the second utterance onward, the variance values of the acoustic model 7 are either made identical to those of the original acoustic model 7 obtained by training or brought closer to them as the number of utterances increases. The discrimination performance of the acoustic model 7 thus either matches that of the original trained acoustic model 7 or approaches it as the number of utterances increases.

Reducing the variance values of the acoustic model in this way lowers the robustness to differences in microphone frequency characteristics. From the second utterance onward, however, the correction means 4 absorbs the difference in microphone frequency characteristics between training and recognition as described above, so no problem arises.

As described above, the speech recognition apparatus 1 of this embodiment adjusts the low-dimensional variance values among the variance values that are parameters of the acoustic model 7 according to whether or not the input speech 2 is the first utterance, collates the adjusted acoustic model 7 with the feature vector of the input speech 2 corrected in accordance with how the acoustic model 7 was trained, and outputs the recognition result. The variance values are enlarged in the low dimensions, where the frequency characteristic difference between microphones is pronounced, and are left unadjusted in the high dimensions, where that difference is small. The difference in microphone frequency characteristics between training and recognition can therefore be absorbed without lowering the discrimination performance of the acoustic model more than necessary, which improves recognition performance.

As described above, the speech recognition apparatus according to the present invention corrects the feature vector of the first utterance of the input speech by the difference between the average vector at training time of the acoustic model and the average vector of the input speech, and collates the corrected feature vector of the input speech with the acoustic model whose variance value has been adjusted according to whether or not the input speech is the first utterance. It thereby improves speech recognition performance for the first utterance across various microphones with different frequency characteristics, and, from the second utterance onward, the processing that corrects the difference between the average vectors at training time and at recognition time maintains speech recognition performance that makes use of the training effect of the acoustic model.

Description of symbols: 1 speech recognition apparatus, 2 input speech, 3 acoustic analysis means, 4 correction means, 5 initial vector, 6 adjusting means, 7 acoustic model, 8 collation means, 9 recognition result, 10 processor, 11 memory.

Claims

A speech recognition apparatus comprising:
analyzing means for analyzing input speech and outputting a first feature vector;
an acoustic model that models second feature vectors of speech data uttered by a plurality of speakers;
correction means for subtracting an average vector of the acoustic model, or an average vector of the first feature vector of the input speech, from the first feature vector and outputting the corrected first feature vector;
adjusting means for adjusting a variance value, which is a parameter of the acoustic model, according to whether or not the speech is the first utterance; and
collation means for collating the acoustic model whose variance value has been adjusted with the corrected first feature vector and outputting a speech recognition result.

The speech recognition apparatus according to claim 1, wherein the adjusting means multiplies the variance value by a first constant larger than 1 if the speech is the first utterance.

The speech recognition apparatus according to claim 2, wherein, if the speech is the second or a later utterance, the adjusting means multiplies the variance value by a second constant that is greater than or equal to 1 and less than or equal to the first constant.

3. The adjustment means multiplies the variance value by a variable that is not less than 1 and not more than the first constant and monotonously decreases according to the number of input speech if the speech is after the second utterance. The speech recognition apparatus described in 1.

The acoustic model is multi-dimensional data,
The speech recognition apparatus according to claim 1, wherein the adjustment unit adjusts the variance value of a part of the dimensions of the acoustic model.

The speech recognition apparatus according to any one of claims 1 to 5, wherein the acoustic model is trained using data obtained by subtracting the average of all the second feature vectors of all speakers from the second feature vectors, and
wherein, if the speech is the first utterance, the correction means uses the average vector of all the second feature vectors as a correction vector, subtracts the correction vector from the first feature vector, outputs the corrected first feature vector, and temporarily stores the first feature vector before correction, and, if the speech is the second or a later utterance, the correction means uses as a correction vector the average vector of the temporarily stored first feature vectors before correction up to the previous utterance, subtracts the correction vector from the first feature vector, outputs the corrected first feature vector, and temporarily stores the first feature vector before correction.

The speech recognition apparatus according to claim 1, wherein the acoustic model is trained using both per-speaker training data obtained by subtracting, from the second feature vectors of each speaker, the average vector of all the second feature vectors of that speaker, and all-speaker training data obtained by subtracting the average of all the second feature vectors of all speakers from the second feature vectors of all speakers, and
wherein, if the speech is the first utterance, the correction means uses the average vector of all the second feature vectors of all speakers as a correction vector, subtracts the correction vector from the first feature vector, outputs the corrected first feature vector, and temporarily stores the first feature vector before correction, and, if the speech is the second or a later utterance, the correction means uses as a correction vector the average vector of the temporarily stored first feature vectors before correction up to the previous utterance, subtracts the correction vector from the first feature vector, outputs the corrected first feature vector, and temporarily stores the first feature vector before correction.

A speech recognition method of a speech recognition apparatus that performs speech recognition of input speech, the method comprising:
an analysis step of analyzing the input speech and outputting a first feature vector;
a correction step of subtracting, from the first feature vector, an average vector of an acoustic model that models second feature vectors of speech data uttered by a plurality of speakers, or an average vector of the first feature vector of the input speech, and outputting the corrected first feature vector;
an adjustment step of adjusting a variance value, which is a parameter of the acoustic model, according to whether or not the speech is the first utterance; and
a collation step of collating the acoustic model whose variance value has been adjusted with the corrected first feature vector.