JPH10214097A

JPH10214097A - Adapting method for speech feature quantity, speech recognition device, and recording medium

Info

Publication number: JPH10214097A
Application number: JP30905197A
Authority: JP
Inventors: Masatoshi Morishima; 昌俊森島
Original assignee: N T T DATA TSUSHIN KK; NTT Data Communications Systems Corp
Current assignee: N T T DATA TSUSHIN KK; NTT Data Corp
Priority date: 1996-11-29
Filing date: 1997-11-11
Publication date: 1998-08-11

Abstract

PROBLEM TO BE SOLVED: To adapt the feature quantity of a speech to be recognized in real time. SOLUTION: Feature vectors of the respective frames are extracted from an inputted object speech and stored as adaptation data in a feature quantity storage part 21. A mean feature quantity calculation part 22 calculates the mean of the adaptation data from the stored feature vectors. A distribution mean calculation part 25, on the other hand, calculates the mean of the distribution mean values of the recognition models corresponding to the phonemes in the adaptation data. A difference value calculation part 23 calculates the difference value between the mean of the distribution mean values and the mean of the feature quantities of the adaptation data. The difference value is updated each time the adaptation data change. A feature quantity normalization part 13 calculates a normalized feature quantity by subtracting the difference value from the feature vector of the object speech, and a recognition processing part 14 performs recognition processing by using the normalized feature quantity.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a technique for adapting speech recognition, for example a technique for quickly correcting the degradation of recognition performance that occurs when distorted speech is input.

[0002]

[Prior art] A speech recognition device usually prepares subword models learned in advance from a large number of speech models and, when a word is registered, generates a word model by concatenating a plurality of subword models. The generated word model is trained as needed and used in recognition processing. At recognition time, for example, analog uttered speech input from a microphone is converted into a digital signal, the feature amount (feature vector) of the digital signal is calculated, and the feature amount is collated with the word models. The most similar word model found by the collation is output as the recognition result.

[0003] The feature amount of speech differs from speaker to speaker, and even for the same speaker it may differ depending on the circumstances of the utterance, the speaker's physical condition, and so on. Furthermore, when speech is input over a telephone line, for example, multiplicative noise is mixed into the speech and distorts it. As a result, speech that should have the same feature amount may be recognized as different speech; that is, speech recognition performance degrades. For this reason, adapting the feature amount of speech by using a probability distribution model has become common in recent years.

[0004] A conventional adaptation method of this kind is described below with reference to FIGS. 7 to 9. First, the basic configuration of a conventional speech recognition device is described with reference to FIG. 7. This speech recognition device 2 comprises, in this order, a voice input unit 71 that inputs the target speech captured by a microphone or the like (that is, the speech to be recognized), a feature amount extraction unit 72 that extracts the feature amount (feature vector) of the input speech, a feature amount normalization unit 73 that normalizes the extracted feature amount, a recognition processing unit 74 that performs recognition processing based on the normalized feature amount, and a recognition result output unit 75 that outputs the recognition result. To realize the adaptation processing, the device also has a feature amount storage unit 81 that accumulates the feature amounts extracted by the feature amount extraction unit 72 over the entire utterance, and an average calculation unit 82 that calculates the average of the feature amounts of the entire utterance. Further, to prepare the word models collated by the recognition processing unit 74, the device has a recognition word registration unit 83 for inputting words, a word model creation unit 84 that creates a word model corresponding to an input word or a recognition target word, a recognition word dictionary unit 85 that stores the created word models after training them as needed, and a word model storage unit 86 that holds the word models used directly by the recognition processing unit 74.

[0005] The speech recognition procedure in the speech recognition device 2 is shown in FIG. 9. First, the feature amount extraction unit 72 extracts, frame by frame, the feature amount of the target speech input from the voice input unit 71 (step S201). The extracted feature amounts are accumulated in the feature amount storage unit 81 from the beginning to the end of the utterance (step S202), and the average calculation unit 82 calculates the average of the feature amounts of the entire utterance (step S203). The feature amount normalization unit 73 normalizes the feature amount by subtracting the average of the feature amounts of the entire utterance from the feature amount of each frame (step S204). This subtracts the component due to the frequency characteristics (the transfer characteristics of the analog equipment) of the recording systems with which the word models and the target speech were recorded, and removes the distortion of the feature amount of the target speech. The recognition processing unit 74 performs speech recognition based on the normalized feature amount, that is, collation with the word models held in the word model storage unit 86 (step S205). The above processing is performed both at training time and at recognition time.
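Steps S203 and S204 of the conventional procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mean_vector` and `cms_normalize` and the toy two-dimensional feature vectors are hypothetical and do not appear in the publication.

```python
from typing import List

Vector = List[float]


def mean_vector(frames: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def cms_normalize(frames: List[Vector]) -> List[Vector]:
    """Subtract the whole-utterance mean from every frame (steps S203-S204)."""
    mu = mean_vector(frames)  # requires the complete utterance
    return [[x - m for x, m in zip(frame, mu)] for frame in frames]


# Toy utterance of three frames with two (hypothetical) cepstral coefficients.
utterance = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]
normalized = cms_normalize(utterance)
```

Because `mean_vector` needs every frame of the utterance, normalization cannot begin until the utterance has ended.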

[0006] FIG. 8 shows how the feature amount of frame i is normalized for an utterance consisting of N frames. The recognition result of the recognition processing unit 74 is sent to subsequent processing means via the recognition result output unit 75.

[0007] The adaptation method described above is called cepstrum mean normalization (CMS: Cepstrum Mean Subtraction, or CMN: Cepstrum Mean Normalization) and is often used in conventional speech recognition devices.

[0008]

[Problems to be solved by the invention] As described above, the conventional method adapts the feature amount of the target speech by subtracting the average of the feature amounts of the entire utterance from the feature amount extracted for each frame. Although this removes the distortion of the target speech and improves recognition performance, it requires calculating the average of the feature amounts over the entire utterance. The adaptation processing therefore cannot start until the utterance has ended, so the method cannot support rapid recognition processing.

[0009] One approach to this problem is the Q&A format. In the Q&A format, when an address is to be input, for example, questions from the device and answers from the speaker (inputs of target speech) are repeated: "Which prefecture?", "Tokyo.", "Which ward or municipality?", "Setagaya Ward.", and so on. In this format, the average (vector) of the feature amounts of the immediately preceding target speech is obtained and subtracted from the feature amount of each frame of the current target speech, so rapid adaptation is possible. However, because the adaptation uses speech uttered beforehand, the phonemes contained in the adaptation data (the earlier speech) differ from those contained in the target speech. Consequently, although some improvement is seen over applying no adaptation at all, recognition performance is significantly worse than with CMS and similar methods.
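The Q&A-style adaptation in paragraph [0009] can be sketched as follows. The names `qa_normalize`, `answer_1`, and `answer_2` and the toy vectors are hypothetical, chosen only to illustrate that the mean comes from the previous answer rather than from the speech being recognized.

```python
from typing import List

Vector = List[float]


def mean_vector(frames: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def qa_normalize(previous: List[Vector], current: List[Vector]) -> List[Vector]:
    """Normalize the current answer with the mean of the previous answer."""
    mu_prev = mean_vector(previous)  # available before the current answer starts
    return [[x - m for x, m in zip(frame, mu_prev)] for frame in current]


answer_1 = [[2.0, 2.0], [4.0, 6.0]]  # frames of the earlier answer
answer_2 = [[5.0, 5.0], [3.0, 4.0]]  # frames of the answer being recognized
normalized_2 = qa_normalize(answer_1, answer_2)
```

The mean is ready before the current answer begins, which is why this approach is fast, but it reflects the phonemes of a different utterance.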

[0010] An object of the present invention is therefore to provide a technique that, in a speech recognition device, prevents the degradation of recognition performance when the target speech is distorted and enables rapid adaptation of the feature amount used for speech recognition.

[0011]

[Means for solving the problems] The present invention first provides an improved method for adapting the feature amount of target speech input to a speech recognition device. The adaptation method includes the steps of calculating the average of the feature amount of the target speech over a predetermined speech section, calculating the average of the distribution mean values of the recognition models corresponding to the phonemes included in the target speech in that speech section, and normalizing the feature amount of the speech section according to the difference value between the two calculated averages.
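The three steps of the adaptation method can be sketched as follows. This is a minimal sketch under stated assumptions, not the publication's implementation: the sign convention (difference = feature average minus model average, subtracted from each frame) is inferred from the abstract, each recognition model is reduced to a single mean vector, and all names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def adapt_section(frames: List[Vector],
                  model_means: List[Vector]) -> Tuple[Vector, List[Vector]]:
    """Return (difference value, normalized frames) for one speech section."""
    feature_avg = mean_vector(frames)     # average of the section's features
    model_avg = mean_vector(model_means)  # average of the models' mean values
    dif = [f - m for f, m in zip(feature_avg, model_avg)]  # assumed sign
    normalized = [[x - d for x, d in zip(frame, dif)] for frame in frames]
    return dif, normalized


frames = [[3.0, 5.0], [5.0, 7.0]]       # toy features of the speech section
model_means = [[1.0, 2.0], [3.0, 4.0]]  # toy mean vectors of two phoneme models
dif, normalized = adapt_section(frames, model_means)
```

With this sign convention, the average of the normalized frames coincides with the average of the model means, which is the intended effect of removing a constant offset.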

[0012] In the method of the present invention, preferably, a continuous mixture hidden Markov model (hereinafter, HMM), for example, is used as the recognition model, and the speech section corresponding to the latest n phonemes obtained from the target speech by the Viterbi algorithm is used as the speech section. This speech section is updated each time the latest phoneme is obtained by the Viterbi algorithm, and the difference value is likewise updated for each speech section.
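A speech section made of the latest n phonemes behaves like a fixed-length sliding window. The sketch below uses `collections.deque` with `maxlen` to model it; the class name `PhonemeWindow` and its methods are hypothetical, and the phoneme boundaries are assumed to be supplied by a Viterbi decoder that is not shown.

```python
from collections import deque
from typing import Deque, List, Tuple

Vector = List[float]


class PhonemeWindow:
    """Keeps the frames of the latest n phonemes output by the decoder."""

    def __init__(self, n: int) -> None:
        self.segments: Deque[Tuple[str, List[Vector]]] = deque(maxlen=n)

    def push(self, phoneme: str, frames: List[Vector]) -> None:
        """Add a newly decoded phoneme; the oldest one drops out when full."""
        self.segments.append((phoneme, frames))

    def phonemes(self) -> List[str]:
        return [p for p, _ in self.segments]

    def frames(self) -> List[Vector]:
        """All frames of the current speech section, in time order."""
        return [f for _, fs in self.segments for f in fs]


window = PhonemeWindow(n=2)
window.push("a", [[1.0], [2.0]])
window.push("k", [[3.0]])
window.push("a", [[4.0], [5.0]])  # the first /a/ leaves the window
```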

[0013] The present invention also provides a speech recognition device having speech feature amount extraction means for extracting a speech feature amount from input target speech, and recognition processing means for adapting the extracted speech feature amount and performing recognition processing of the target speech. In this speech recognition device, the recognition processing means includes first means for calculating the average of the feature amount of the target speech for each predetermined speech section, second means for acquiring the recognition models corresponding to the phoneme string included in the target speech in the speech section, third means for calculating the average of the distribution mean values of the acquired recognition models, and fourth means for normalizing the speech feature amount of the speech section according to the difference value between the calculated average of the feature amount and the average of the distribution mean values. The recognition processing is performed using the normalized speech feature amount, which is updated for each speech section.

[0014] The present invention further provides a recording medium for realizing the above adaptation method on a computer device. The recording medium is a computer-readable recording medium on which is recorded a program for causing a computer device to function as a speech recognition device and to adapt the feature amount of input target speech. When read by the computer device, the program causes it to execute at least the following processes:
(1) calculating the average of the feature amount of the input target speech over a predetermined speech section;
(2) calculating the average of the distribution mean values of the recognition models corresponding to the phonemes included in the target speech in the speech section;
(3) normalizing the feature amount of the speech section according to the difference value between the average of the feature amount and the average of the distribution mean values.

[0015]

[Embodiments of the invention] An embodiment of the present invention is described next with reference to the drawings. FIG. 1 is a configuration diagram of a speech recognition device 1 according to an embodiment of the present invention. The speech recognition device 1 is realized by a personal computer or a workstation. When the personal computer or the like reads and executes a predetermined program, the following functional blocks are formed: a voice input unit 11, a feature amount extraction unit 12, a feature amount normalization unit 13, a recognition processing unit 14, a recognition result output unit 15, a feature amount storage unit 21, a feature amount average calculation unit 22, a difference value calculation unit 23, a phoneme string storage unit 24, a distribution mean average calculation unit 25, a recognition word registration unit 26, a word model creation unit 27, a recognition word dictionary unit 28, and a word model storage unit 29.

[0016] The program is normally recorded in a storage device installed inside or outside the personal computer or the like and distributed together with it. However, the present invention can be practiced with any program that, in cooperation with the operating system, realizes the required functions (specifically, the functional blocks above) on a personal computer or the like, so the form of recording and distribution is not limited to this case. For example, the program may be recorded in computer-readable form on a portable recording medium such as a flexible disk, a compact disc ROM, or a digital video disc, or on a recording medium of a program server on a local network accessible by the personal computer or the like, and read into the internal or external storage device as appropriate at the time of use.

[0017] The functional blocks of the speech recognition device 1 operate as follows. The feature amount extraction unit 12 extracts the feature amount of the target speech input from the voice input unit 11, the feature amount normalization unit 13 normalizes it, the recognition processing unit 14 performs speech recognition, and the recognition result output unit 15 outputs the result; in these respects the device is basically the same as the conventional device 2 shown in FIG. 7. The accumulation of the feature amounts extracted by the feature amount extraction unit 12 in the feature amount storage unit 21, and the functions of the recognition word registration unit 26, the word model creation unit 27, the recognition word dictionary unit 28, and the word model storage unit 29, are also basically the same as in the conventional device 2.

[0018] The speech recognition device 1 of this embodiment is characterized in that, in addition to functional blocks like those of the conventional device, it has a feature amount average calculation unit 22 that calculates the average of the feature amount of the target speech for each predetermined speech section, a phoneme string storage unit 24 that accumulates the phonemes recognized by the recognition processing unit 14, a distribution mean average calculation unit 25 that retrieves the recognition models corresponding to the recognized phonemes from the recognition word dictionary unit 28 and calculates the average of their distribution mean values, and a difference value calculation unit 23 that calculates the difference value between the two averages, and in that the feature amount normalization unit 13 performs normalization based on the result calculated by the difference value calculation unit 23.

[0019] The operation of the speech recognition device 1 of this embodiment is described concretely below. Here, the Viterbi algorithm is used to recognize the phoneme string, and HMMs are used as the recognition models. First, the general theory of recognition processing using HMMs and the Viterbi algorithm is described.

[0020] Let M3 be a single model (HMM) obtained by concatenating the HMMs (M1, M2) corresponding to two phonemes w1 and w2, respectively. If the output symbol sequence corresponding to the target speech is the code y1, y2, y3, ..., yT, the likelihood P(y1, y2, y3, ..., yT | M3) of this code with respect to the model M3 can be approximated by Equation 1 below.

[0021]

[Equation 1]
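The image of Equation 1 is not reproduced in this text. As a hedged sketch only, a commonly used Viterbi-style approximation for two concatenated models, consistent with the boundary t that paragraph [0022] says the equation yields, would take the following form. The exact expression in the publication may differ.

```latex
P(y_1, y_2, \ldots, y_T \mid M_3) \approx
  \max_{1 \le t < T}
  P(y_1, \ldots, y_t \mid M_1)\,
  P(y_{t+1}, \ldots, y_T \mid M_2)
```

In this form, the maximizing value of t marks the frame at which the alignment passes from the model of phoneme w1 to the model of phoneme w2.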

[0022] From the above equation, the boundary t of the Viterbi alignment, that is, the boundary between phonemes, can be determined. It has been reported in the literature that, in the likelihood calculation for HMM-based speech recognition, outputting as the recognition result the category that gives the maximum value of the above approximation, instead of using the well-known Baum-Welch score, yields performance equivalent to that obtained with the Baum-Welch score. One advantage of using the Viterbi algorithm in this case is improved computational efficiency.

[0023] Next, the method of adapting the speech feature amount in the speech recognition device 1 is described concretely with reference to FIGS. 2 to 4. The example assumes that the word "akai" (red) is input as the target speech. The well-known cepstrum, Δ cepstrum, and Δ power are used as the feature amounts of the target speech.

[0024] In FIG. 2, the input target speech "akai" is first segmented into frames of a predetermined length, and the feature amount (feature vector) of each frame is extracted (step S101). Specifically, the cepstrum cep(j) of each frame is extracted, where j is the frame number. The cepstra cep(j) in a certain speech section are used as adaptation data, and the recognition models corresponding to the phonemes included in the adaptation data are extracted from the recognition word dictionary unit 28 (step S102). The average of the distribution mean values of the extracted recognition models is then calculated (step S103). An example of how this average is obtained is shown in FIG. 3: when the relevant recognition models are recognition models A, B, and C (A vector, B vector, C vector), their average vector (shown as the A+B+C vector) is obtained. Meanwhile, the average of the feature amounts is calculated from the adaptation data (step S104). This is the average of the cepstra cep(j) in the portion drawn with a thick line in FIG. 4. Here, the speech section covering the two latest phonemes output by the Viterbi algorithm is used. The number of phonemes defining the speech section need not be two, however, and can be set arbitrarily.
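Step S103 can be sketched as follows, assuming that each recognition model is an HMM whose states each hold several Gaussian mean vectors and that all mean vectors are averaged without weighting. The publication's Equations 4 and 5 are not reproduced in this text, so the unweighted average, the data layout, and the names below are assumptions.

```python
from typing import List

Vector = List[float]
Model = List[List[Vector]]  # model[state][mixture] -> Gaussian mean vector


def mean_vector(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def model_mean_average(model: Model) -> Vector:
    """Average the mean vectors over all states and mixtures of one HMM."""
    return mean_vector([mu for state in model for mu in state])


def average_over_models(models: List[Model]) -> Vector:
    """Average of the per-model averages (models A, B, C in FIG. 3)."""
    return mean_vector([model_mean_average(m) for m in models])


model_a: Model = [[[1.0, 2.0], [3.0, 4.0]]]    # 1 state, 2 mixtures
model_b: Model = [[[5.0, 6.0]], [[7.0, 8.0]]]  # 2 states, 1 mixture each
model_c: Model = [[[1.0, 2.0]]]                # 1 state, 1 mixture
overall = average_over_models([model_a, model_b, model_c])
```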

[0025] Next, the difference value dif(j) between the averages calculated in steps S103 and S104 is calculated (step S105). The difference value dif(j) is then subtracted from the cepstrum cep(j) to obtain the normalized cepstrum ncep(j), and the Δ cepstrum and Δ power are calculated from the normalized cepstrum ncep(j) and sent to the recognition processing unit 14 (step S106). The recognition processing unit 14 performs recognition processing based on the calculated normalized cepstrum ncep(j), Δ cepstrum, and Δ power, and extracts the word model with the best likelihood from the word model storage unit 29 (step S107).
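Step S106 can be sketched as follows. The normalization ncep(j) = cep(j) − dif(j) follows the text; the Δ cepstrum is computed here as a simple first-order difference between consecutive frames, which is an assumption, since the publication does not state how the Δ features are derived. The Δ power is omitted, and all names are hypothetical.

```python
from typing import List

Vector = List[float]


def normalize(cep: List[Vector], dif: List[Vector]) -> List[Vector]:
    """ncep(j) = cep(j) - dif(j) for every frame j."""
    return [[c - d for c, d in zip(cep_j, dif_j)]
            for cep_j, dif_j in zip(cep, dif)]


def delta(seq: List[Vector]) -> List[Vector]:
    """First-order difference between consecutive frames (assumed definition).

    The first frame has no predecessor and receives a zero vector.
    """
    out = [[0.0] * len(seq[0])]
    for prev, cur in zip(seq, seq[1:]):
        out.append([c - p for c, p in zip(cur, prev)])
    return out


cep = [[5.0, 1.0], [6.0, 3.0], [8.0, 2.0]]  # toy cepstra for three frames
dif = [[1.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # difference value in force per frame
ncep = normalize(cep, dif)
dcep = delta(ncep)
```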

[0026] Next, the method of calculating and updating the difference value dif(j) is described. It is assumed that an initial value dif0 of the difference value is set in advance. While fewer than two phonemes have been output by the Viterbi algorithm, for example while only /a/ of "akai" has been output in the above example, the initial value dif0 is used as the difference value dif(j). When two phonemes have been output, that is, when /a/ and /k/ have been output, the difference value dif(j) (= dif0) is updated. The difference value dif(j) is calculated by Equation 2 below.

[0027]

[Equation 2]

[0028] Here, dif(j-1) denotes the difference value of the immediately preceding speech section. t1 is the number of frames in the speech section corresponding to the earlier of the two phonemes, and t2 is the number of frames in the speech section corresponding to the later of the two phonemes. t is the number of the current frame, N is the number of Gaussian mixture components in the HMM, I is the number of states in the HMM, μ1 is the distribution mean value of the HMM corresponding to the earlier of the two phonemes, and μ2 is the distribution mean value of the HMM corresponding to the later of the two phonemes. Thereafter, the difference value dif(j) is updated each time a new phoneme is output by the Viterbi algorithm.
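The update schedule of the difference value can be sketched as follows. Because Equation 2 is not reproduced in this text, the update rule here is an assumption: the model average is weighted by the frame counts t1 and t2, each phoneme's HMM is reduced to a single mean vector, and any dependence on dif(j-1) is ignored. Only the schedule (the initial value until two phonemes exist, then an update per new phoneme) follows the text. The class `DifferenceTracker` is hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


class DifferenceTracker:
    """Holds dif and refreshes it each time the decoder outputs a phoneme."""

    def __init__(self, dif0: Vector, n: int = 2) -> None:
        self.dif = list(dif0)  # preset initial value dif0
        self.n = n             # number of phonemes in the speech section
        self.segments: List[Tuple[List[Vector], Vector]] = []

    def on_phoneme(self, frames: List[Vector], model_mean: Vector) -> Vector:
        """Register a phoneme's frames and its (assumed single) model mean."""
        self.segments = (self.segments + [(frames, model_mean)])[-self.n:]
        if len(self.segments) < self.n:
            return self.dif  # too few phonemes: keep the initial value
        all_frames = [f for fs, _ in self.segments for f in fs]
        total = len(all_frames)  # t1 + t2 when n == 2
        feature_avg = [sum(col) / total for col in zip(*all_frames)]
        dim = len(feature_avg)
        # Frame-count weighted average of the model means (assumption).
        model_avg = [sum(len(fs) * mu[d] for fs, mu in self.segments) / total
                     for d in range(dim)]
        self.dif = [f - m for f, m in zip(feature_avg, model_avg)]
        return self.dif


tracker = DifferenceTracker(dif0=[0.0])
first = list(tracker.on_phoneme([[2.0], [4.0]], [1.0]))     # only /a/ so far
second = list(tracker.on_phoneme([[6.0], [8.0]], [3.0]))    # /a/ and /k/
third = list(tracker.on_phoneme([[10.0], [12.0]], [5.0]))   # /k/ and /a/
```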

[0029] FIG. 6 illustrates how the elements of the difference value dif(j) are obtained when, in the adaptation data, the number of frames corresponding to /a/ and /k/ (= speech section ta + tk) is 13. In the illustrated case, the speech section ta corresponding to the phoneme /a/ output first is 7 frames, and the speech section tk corresponding to the phoneme /k/ output afterwards is 6 frames. D1 in this case, that is, the average of the feature amounts of the adaptation data, is given by Equation 3 below.
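The image of Equation 3 is not reproduced in this text. Since D1 is described as the average of the feature amounts over the 13 frames of the adaptation data (ta = 7, tk = 6), a form consistent with that description is sketched below. The frame indexing (frames of the section numbered from 1) is an assumption.

```latex
D_1 = \frac{1}{t_a + t_k} \sum_{j=1}^{t_a + t_k} \mathrm{cep}(j)
    = \frac{1}{7 + 6} \sum_{j=1}^{13} \mathrm{cep}(j)
    = \frac{1}{13} \sum_{j=1}^{13} \mathrm{cep}(j)
```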

[0030]

[Equation 3]

[0031] The average D2 of the distribution mean values for the phoneme /a/ output first is given by Equation 4, and the average D3 of the distribution mean values for the phoneme /k/ output afterwards is given by Equation 5.

[0032]

[Equation 4]

[0033]

[Equation 5]

[0034] As described above, according to this embodiment, the feature amount of the target speech is normalized according to the difference value between the average of the feature amounts of the adaptation data, which is based on the target speech, and the average of the distribution mean values of the recognition models corresponding to the phonemes. Moreover, the difference value is updated for each speech section, which changes continually, so adaptation in real time becomes possible.

[0035] Therefore, even when the target speech is distorted, the degradation of recognition performance is corrected quickly. In applications that recognize target speech input over a telephone line, for example, this mitigates the effects of differences between telephone networks (ordinary wired telephones, cordless telephones, analog mobile phones, digital mobile phones, PHS, and so on) and of differences between telephone sets, and improves recognition performance.

[0036]

[Effects of the invention] As is clear from the above description, the present invention provides a technique that prevents the degradation of recognition performance when the target speech is distorted and enables rapid adaptation of the feature amount used for speech recognition.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition device according to an embodiment of the present invention.

FIG. 2 is a diagram illustrating the procedure of recognition processing according to the embodiment.

FIG. 3 is an explanatory diagram showing an example of how the average of the distribution mean values is obtained in the embodiment.

FIG. 4 is an explanatory diagram of the adaptation data and the speech section used in the embodiment.

FIG. 5 is an explanatory diagram showing the update timing of the difference value in the embodiment.

FIG. 6 is an explanatory diagram showing a specific example of calculating the difference value in the embodiment.

FIG. 7 is a block diagram of a conventional speech recognition device.

FIG. 8 is an explanatory diagram of a conventional adaptation method.

FIG. 9 is a diagram illustrating the procedure of conventional speech recognition processing.

[Explanation of symbols]

1, 2: speech recognition device
11, 71: voice input unit
12, 72: feature amount extraction unit
13, 73: feature amount normalization unit
14, 74: recognition processing unit
15, 75: recognition result output unit
21, 81: feature amount storage unit
22: feature amount average calculation unit
23: difference value calculation unit
24: phoneme string storage unit
25: distribution mean average calculation unit
26, 83: recognition word registration unit
27, 84: word model creation unit
28, 85: recognition word dictionary unit
29, 86: word model storage unit

Claims

1. A method for adapting a feature amount of target speech input to a speech recognition device, the method comprising: calculating an average of the feature amount of the target speech over a predetermined speech section; calculating an average of the distribution mean values of the recognition models corresponding to the phonemes included in the target speech in the speech section; and normalizing the feature amount of the speech section according to a difference value between the calculated average of the feature amount and the calculated average of the distribution mean values.

2. The adaptation method according to claim 1, wherein the recognition model is a continuous mixture hidden Markov model, and the speech section is a speech section corresponding to the latest n phonemes obtained from the target speech by a Viterbi algorithm.

3. The adaptation method according to claim 2, wherein the speech section is updated each time a latest phoneme is obtained by the Viterbi algorithm.

4. The adaptation method according to claim 1, wherein the difference value is updated for each speech section.

5. A speech recognition device comprising: speech feature amount extraction means for extracting a speech feature amount from input target speech; and recognition processing means for adapting the extracted speech feature amount and performing recognition processing of the target speech, wherein the recognition processing means includes first means for calculating an average of the feature amount of the target speech for each predetermined speech section, second means for acquiring the recognition models corresponding to a phoneme string included in the target speech in the speech section, third means for calculating an average of the distribution mean values of the acquired recognition models, and fourth means for normalizing the speech feature amount of the speech section according to a difference value between the calculated average of the feature amount and the average of the distribution mean values, and wherein the recognition processing is performed using the normalized speech feature amount updated for each speech section.

6. A computer-readable recording medium on which is recorded a program for causing a computer device to function as a speech recognition device and to adapt a feature amount of input target speech, wherein the program causes the computer device to execute: a process of calculating an average of the feature amount of the input target speech over a predetermined speech section; a process of calculating an average of the distribution mean values of the recognition models corresponding to the phonemes included in the target speech in the speech section; and a process of normalizing the feature amount of the speech section according to a difference value between the average of the feature amount and the average of the distribution mean values.