JP2545960B2

JP2545960B2 - Learning method for adaptive speech recognition

Info

Publication number: JP2545960B2
Application number: JP1001847A
Authority: JP
Inventors: 隆夫渡辺
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1989-01-06
Filing date: 1989-01-06
Publication date: 1996-10-23
Anticipated expiration: 2011-10-23
Also published as: JPH02181798A

Description

[Detailed Description of the Invention] (Field of Industrial Application) The present invention relates to the training of an adaptive speech recognition apparatus, that is, a speech recognition apparatus that can adapt to utterances by different speakers or in different noise environments.

(Prior Art) Conventionally, input speech has been recognized by using standard patterns that are stored in advance (see Kyoritsu Shuppan Co., Ltd., "Speech Recognition" (Reference 1), pp. 101-113). With this method, when the speech of a speaker other than the one used to create the standard patterns is recognized, a sufficient recognition rate cannot be obtained because speech patterns differ from speaker to speaker. Even for the same speaker, the recognition rate drops if the utterance environment, such as ambient noise, differs greatly from the one in which the standard patterns were created.

Usually, to recognize the speech of a specific speaker, that speaker must utter and register every word in the recognition vocabulary. This has the drawback of requiring a great deal of effort when the vocabulary is large. In response, methods have been proposed that adapt the standard patterns to a specific speaker using a small amount of speech. For example, IEEE ICASSP-86, 49.5, p. 2643, "Speaker Adaptation through Vector Quantization" (Reference 2) describes a speaker adaptation method based on vector quantization. Japanese Patent Application No. Sho 63-122559 describes a speaker adaptation method based on a neural network, which addresses the performance degradation caused by the quantization error of vector quantization in that method. In both methods, patterns of the same word (a phrase or a sentence may also be used) uttered by speaker 1 and speaker 2 are first given an optimal temporal correspondence by DP matching, and the adaptation, that is, the converter, is then derived from the resulting set of corresponding feature vector pairs.

(Problems to be Solved by the Invention) The above methods match the speech patterns of different speakers directly, but the time-axis correspondence produced by DP matching is not necessarily accurate. For example, if a certain phoneme of speaker 1 resembles a different phoneme of speaker 2, an incorrect correspondence may result. Such errors degrade the adaptation and, in turn, the recognition performance. An object of the present invention is to eliminate these time-axis correspondence errors caused by speaker differences and to realize highly accurate adaptation to speakers and to utterance noise.

(Means for Solving the Problem) The learning method for adaptive speech recognition according to the present invention applies to a scheme in which recognition is performed using patterns obtained by converting the standard patterns of environment 1 for environment 2 with an environment-adaptation neural network trained from patterns of the same utterance in environment 1 and environment 2. The method is characterized by having means for training the neural network by iterating a process of correcting the weighting coefficients of the neural network using the inter-pattern error pattern obtained by optimal time-axis alignment between a pattern A, obtained by converting a learning pattern of environment 1 with the neural network, and a learning pattern B of environment 2.

(Operation) The operation of the present invention is described using speaker adaptation as an example. Suppose that pattern A of speaker 1 and pattern B of speaker 2 have the same utterance content, and that a neural network converting pattern A into pattern B is to be trained. Patterns A and B are represented as vector time series

A = {a(i), i = 1, ..., I}
B = {b(j), j = 1, ..., J}

and the iteration step of training is denoted k. The neural network maps vectors of pattern A to vectors of pattern B; both its input and its output are vectors. The vector at each time of pattern A is converted by the neural network of step k to obtain pattern B^*_k.

B^*_k = {b^*_k(i), i = 1, ..., I}

DP matching is then performed between the converted input pattern B^*_k and the teacher pattern B. DP matching finds, among the admissible time-axis correspondences j = J(i), the one that minimizes the accumulated error between b^*_k(i) and b(J(i)). Details of DP matching are given in Reference 1.

The optimal correspondence J(i) is also recorded at this point. Pattern B aligned to the time axis of pattern B^*_k is denoted B_k.
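The alignment step can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not the DP matching of Reference 1: the local distance is taken to be squared Euclidean, the path is constrained so that J(i) advances by 0, 1, or 2 frames per step with both end points fixed, and the name dp_match is hypothetical.

```python
import numpy as np

def dp_match(b_star, b):
    """Align teacher pattern b (J x D) to the time axis of b_star (I x D).

    Returns (J_of_i, b_aligned, total_error), where J_of_i[i] is the index of
    the teacher frame matched to frame i of b_star.
    """
    I, J = len(b_star), len(b)
    # Local distance between every pair of frames (assumed squared Euclidean).
    dist = ((b_star[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    cost = np.full((I, J), np.inf)
    back = np.zeros((I, J), dtype=int)
    cost[0, 0] = dist[0, 0]  # both patterns start together
    for i in range(1, I):
        for j in range(J):
            # Assumed path constraint: J(i) - J(i-1) is 0, 1, or 2.
            prev = [cost[i - 1, j - s] if j >= s else np.inf for s in (0, 1, 2)]
            s = int(np.argmin(prev))
            cost[i, j] = dist[i, j] + prev[s]
            back[i, j] = j - s
    if not np.isfinite(cost[I - 1, J - 1]):
        raise ValueError("no admissible path under the assumed constraint")
    # Trace back from the fixed end point to recover J(i).
    J_of_i = np.empty(I, dtype=int)
    J_of_i[-1] = J - 1
    for i in range(I - 1, 0, -1):
        J_of_i[i - 1] = back[i, J_of_i[i]]
    return J_of_i, b[J_of_i], cost[I - 1, J - 1]
```

The second return value plays the role of B_k: it has the same length I as B^*_k, so the two patterns can be compared frame by frame.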

B_k = {b_k(i), i = 1, ..., I}

Let d_k denote the time-series pattern of error vectors between pattern B^*_k and pattern B_k, and let D_k denote the error function.
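The expressions for d_k and D_k are not reproduced in this text. One plausible form, assuming that the error vector is the frame-wise difference and that the error function is the sum of squared Euclidean norms (the sign convention and the choice of norm are both assumptions), is:

```latex
b_k(i) = b\bigl(J(i)\bigr), \qquad
d_k(i) = b^*_k(i) - b_k(i), \qquad
D_k = \sum_{i=1}^{I} \bigl\lVert d_k(i) \bigr\rVert^2
```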

Backpropagation learning is performed using the error vector time-series pattern d_k, and the weights of the neural network are corrected. Details of backpropagation learning are given in Institute of Electronics, Information and Communication Engineers, "Speech by Probabilistic Models", pp. 164-167.

The correction uses I error vectors. Either of two methods may be used: repeatedly applying the weight correction obtained for each single error vector, or, as described in Reference 2, obtaining the weight corrections for all I error vectors, averaging them, and then correcting the weights. The neural network of step k+1 is obtained in this way. By the convergence property of backpropagation learning, with the teacher pattern fixed to B_k, the error function of the neural network of step k+1 is smaller than the error function of the neural network of step k.
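The two correction schemes can be contrasted in a short NumPy sketch. Everything specific here is an assumption made for illustration: a network with one tanh hidden layer and no bias terms, a squared-error criterion, plain gradient descent with learning rate lr, and the hypothetical names grads, update_sequential, and update_averaged.

```python
import numpy as np

def grads(W1, W2, a, target):
    """Backpropagation gradients for one input vector a and one target vector."""
    h = np.tanh(W1 @ a)                 # hidden layer output
    err = W2 @ h - target               # error vector for this frame
    gW2 = np.outer(err, h)
    gW1 = np.outer((W2.T @ err) * (1.0 - h ** 2), a)
    return gW1, gW2

def update_sequential(W1, W2, A, B_k, lr=0.01):
    """Apply the correction obtained from each error vector in turn."""
    for a, t in zip(A, B_k):
        gW1, gW2 = grads(W1, W2, a, t)
        W1, W2 = W1 - lr * gW1, W2 - lr * gW2
    return W1, W2

def update_averaged(W1, W2, A, B_k, lr=0.01):
    """Average the I corrections, then apply them once."""
    g = [grads(W1, W2, a, t) for a, t in zip(A, B_k)]
    gW1 = np.mean([x[0] for x in g], axis=0)
    gW2 = np.mean([x[1] for x in g], axis=0)
    return W1 - lr * gW1, W2 - lr * gW2
```

Both functions return new weight matrices and leave their inputs unchanged. The sequential form makes I small corrections per pass, while the averaged form makes one.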

That is, inequality (1) holds: the error between B^*_{k+1} and the fixed teacher pattern B_k does not exceed D_k. DP matching, on the other hand, selects the correspondence that minimizes the error function among all possible time-axis correspondences, so inequality (2) holds for the error function D_{k+1} obtained from DP matching at step k+1: D_{k+1} does not exceed the error between B^*_{k+1} and B_k. From (1) and (2), D_{k+1} ≤ D_k. The error function is therefore non-increasing and bounded below by zero, so the iteration converges, and the neural network can be trained by the iteration described above.
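Inequalities (1) and (2) are not reproduced in this text. A sketch of the argument, assuming the error function is the sum of squared Euclidean norms over the aligned frames, with b_k(i) = b(J_k(i)) for the correspondence J_k found at step k, could read:

```latex
\begin{align}
\sum_{i=1}^{I} \bigl\lVert b^*_{k+1}(i) - b_k(i) \bigr\rVert^2 &\le D_k \tag{1} \\
D_{k+1} = \min_{J(\cdot)} \sum_{i=1}^{I} \bigl\lVert b^*_{k+1}(i) - b\bigl(J(i)\bigr) \bigr\rVert^2
  &\le \sum_{i=1}^{I} \bigl\lVert b^*_{k+1}(i) - b_k(i) \bigr\rVert^2 \tag{2}
\end{align}
% Chaining (2) and (1):
\[
D_{k+1} \le D_k
\]
```

Inequality (2) follows because the step-k correspondence J_k is itself one of the candidates over which the minimum is taken.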

Random values may be used as the initial values of the weighting coefficients of the neural network. Alternatively, the neural network may first be trained by the method described in Japanese Patent Application No. Sho 63-122559, using the time-axis correspondence obtained by applying DP matching directly to patterns A and B, and the resulting weights may be used as the initial values.

Patterns A and B may be anything with the same utterance content, such as a word, a sentence, a set of several words, or a set of sentences.

Adaptation to utterance environment noise is also possible by using, as patterns A and B, utterance patterns of the same speaker recorded under different environmental noise.

(Embodiment) FIG. 1 shows an embodiment of the present invention. In the figure, storage units 1 and 2 hold patterns A and B, respectively, and learning control unit 3 generates a control signal k representing the learning step. First, pattern A held in storage unit 1 is input to neural network unit 4 and converted into pattern B^*_k. Next, pattern B held in storage unit 2 and pattern B^*_k are input to DP matching unit 5. DP matching unit 5 performs DP matching between the two input patterns B and B^*_k and outputs pattern B_k, which is pattern B aligned to pattern B^*_k. Error pattern calculation unit 6 calculates the error pattern d_k between pattern B_k output from DP matching unit 5 and pattern B^*_k output from neural network unit 4. The error pattern d_k is sent both to neural network correction unit 7 and to error function calculation unit 8. Neural network correction unit 7 corrects the contents (weighting coefficients) of neural network unit 4 by backpropagation learning. Learning control unit 3 repeats this series of operations until the error function D_k calculated by error function calculation unit 8 falls to or below a given threshold, or until step k reaches a predetermined value.
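The loop formed by units 3 to 8 can be sketched end to end in NumPy. This is an illustration under stated assumptions rather than the apparatus itself: one tanh hidden layer without biases, squared Euclidean error, averaged gradient corrections, a DP path in which J(i) advances by 0, 1, or 2 frames (assumed feasible, that is, J - 1 ≤ 2(I - 1)), and hypothetical names and default parameters throughout.

```python
import numpy as np

def dp_match(b_star, b):
    """Return pattern b aligned to the time axis of b_star (DP matching unit 5)."""
    I, J = len(b_star), len(b)
    dist = ((b_star[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    cost = np.full((I, J), np.inf)
    back = np.zeros((I, J), dtype=int)
    cost[0, 0] = dist[0, 0]
    for i in range(1, I):
        for j in range(J):
            prev = [cost[i - 1, j - s] if j >= s else np.inf for s in (0, 1, 2)]
            s = int(np.argmin(prev))
            cost[i, j], back[i, j] = dist[i, j] + prev[s], j - s
    path = [J - 1]                       # trace back from the fixed end point
    for i in range(I - 1, 0, -1):
        path.append(back[i, path[-1]])
    return b[path[::-1]]

def train(A, B, hidden=8, lr=0.05, max_steps=200, tol=1e-3, seed=0):
    """Alternate conversion, alignment, and weight correction."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.1, size=(hidden, A.shape[1]))
    W2 = rng.normal(scale=0.1, size=(B.shape[1], hidden))
    history = []
    for k in range(max_steps):           # learning control unit 3
        H = np.tanh(A @ W1.T)            # neural network unit 4
        B_star = H @ W2.T
        B_k = dp_match(B_star, B)        # DP matching unit 5
        d_k = B_star - B_k               # error pattern calculation unit 6
        D_k = float((d_k ** 2).sum())    # error function calculation unit 8
        history.append(D_k)
        if D_k <= tol:
            break
        # Neural network correction unit 7: averaged backpropagation step.
        gW2 = d_k.T @ H / len(A)
        gW1 = ((d_k @ W2) * (1.0 - H ** 2)).T @ A / len(A)
        W1 -= lr * gW1
        W2 -= lr * gW2
    return W1, W2, history
```

The returned history holds D_k for each step, which makes it easy to check how the error function evolves for a given learning rate.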

(Effects of the Invention) According to the present invention, a neural network that adapts effectively to a new speaker or a new utterance noise environment can be trained, and a high-performance adaptive speech recognition apparatus can be realized.

[Brief description of drawings]

FIG. 1 shows an embodiment of the present invention. In the figure, 1 and 2 are pattern storage units, 3 is a learning control unit, 4 is a neural network unit, 5 is a DP matching unit, 6 is an error pattern calculation unit, 7 is a neural network correction unit, and 8 is an error function calculation unit.

Claims

(57) [Claims]

1. A learning method for adaptive speech recognition, for use in a scheme in which recognition is performed using patterns obtained by converting standard patterns of environment 1 for environment 2 with an environment-adaptation neural network trained from patterns of the same utterance in environment 1 and environment 2, the learning method having means for training the neural network by iterating a process of correcting the weighting coefficients of the neural network using the inter-pattern error pattern obtained by optimal time-axis alignment between a pattern A, obtained by converting a learning pattern of environment 1 with the neural network, and a learning pattern B of environment 2.