JPH0612090A - Speech learning system

Info

Publication number: JPH0612090A
Application number: JP4168057A
Authority: JP
Inventors: Shinji Koga; 真二古賀
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1992-06-26
Filing date: 1992-06-26
Publication date: 1994-01-21

Abstract

PURPOSE: To enable high-accuracy learning by checking, phoneme by phoneme, whether learning data is suitable for learning a standard model. CONSTITUTION: A feature analysis part 11 converts a speech signal S into a feature vector time series V using the mel-cepstrum method. A recognition part 13 receives the feature vector time series V and the standard model P from a standard model storage part 12. The feature vector time series V is divided into phoneme units by visual inspection, and for the feature vector time series Vn (n=1, 2, ..., N; N is the number of phonemes) in each phoneme segment, a similarity gn is obtained by the forward-backward algorithm described in Reference 4, using the standard model for that phoneme. The results are output as a similarity sequence G={g1, g2, ..., gN}. A learning part 14 receives the feature vector time series V and the similarity sequence G and, when each similarity gn is larger than a predetermined fixed value K, performs learning using the feature vector time series V.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech learning system that learns the standard models used in speech recognition and similar applications.

[0002]

[Prior Art] Large-vocabulary speech recognition methods, which recognize many vocabulary items using standard models created from learning data uttered in advance, generally use units smaller than a word, such as phonemes, as the recognition unit. Hereinafter, "phoneme" means not only the smallest basic unit of speech in the phonological sense but also a broader range of speech units, including syllables and concatenations of multiple phonemes. One method that uses phonemes as the recognition unit is described in the paper by Watanabe, Yoshida, Koga, et al., "Large-Vocabulary Speech Recognition Using HMMs with Semi-Syllable Units," Transactions of the Institute of Electronics, Information and Communication Engineers D-II, Vol. J72-D-II, No. 8, August 1989, pp. 1264-1269 (hereinafter, Reference 1). In this method, standard models for semi-syllables, which are a kind of phoneme (and are hereinafter called phonemes), are created from multiple items of learning data uttered word by word. At recognition time, a word dictionary written in phoneme notation is used to concatenate the standard models into word-level models, and these word models are used to recognize unknown word utterances. To create highly accurate standard models with this method, correctly uttered learning data is required.

[0003]

One method for deciding whether uttered learning data is suitable for learning a standard model is described in the paper by Fujimoto, Sato, et al., "Simplified Word Speech Recognition Method for a Speech Recognition Board," Proceedings of the 1986 Spring Meeting of the Acoustical Society of Japan, Vol. I, March 1986, pp. 85-86 (hereinafter, Reference 2). In this method, when a standard model has already been created for the word to be learned, newly uttered learning data is recognized using the standard models. If the similarity between the standard model of the same word and the learning data (for example, the distance with its sign inverted) is at or above a fixed threshold, the standard model is averaged with the new data; that is, the standard model is learned.

[0004]

[Problems to Be Solved by the Invention] In the method of Reference 2 described above, whether learning data is used for learning is decided from the overall similarity between the standard model and the learning data. Therefore, even if the data for a particular phoneme in the learning data is inappropriate because of a mispronunciation, contamination by noise, or the like, the data may still be used as learning data if the overall similarity is large. In a method that uses phonemes as the recognition unit, such as the one described in Reference 1, this causes the problem that the standard model for that particular phoneme is learned from inappropriate data.
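By way of illustration only, the problem can be shown numerically. The following Python sketch assumes that the whole-utterance similarity is the mean of the per-phoneme similarities; that assumption, the function name, the scores, and the threshold are all invented for this example and do not come from Reference 2.

```python
def utterance_similarity(phoneme_scores):
    """Whole-utterance similarity, assumed here to be the mean per-phoneme score."""
    return sum(phoneme_scores) / len(phoneme_scores)


# Hypothetical per-phoneme similarities for a five-phoneme word.
# The fourth phoneme was mispronounced or corrupted by noise.
scores = [-1.0, -1.2, -0.9, -9.0, -1.1]
threshold = -3.0  # hypothetical acceptance threshold

whole = utterance_similarity(scores)

# A whole-utterance test accepts the data even though one phoneme is bad.
accepted_by_whole_utterance_test = whole >= threshold

# A per-phoneme test exposes the bad phoneme (0-based index).
bad_phonemes = [n for n, g in enumerate(scores) if g < threshold]
```

With these values the utterance passes the whole-utterance test, while the per-phoneme check singles out the fourth phoneme.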

[0005]

An object of the present invention is to provide a speech learning system that performs highly accurate learning by checking, phoneme by phoneme, whether uttered learning data is suitable for learning the standard models.

[0006]

[Means for Solving the Problems] The first invention is characterized by comprising: a feature analysis unit that analyzes an input speech signal and outputs a feature vector time series as learning data; a standard model storage unit that stores standard models created in advance in units of phonemes; a recognition unit that divides the learning data into the phoneme units and obtains the similarity of the divided learning data to the standard models; and a learning unit that performs learning using the learning data when the similarities, output from the recognition unit, for the phonemes constituting the learning data are larger than a predetermined value.

[0007]

The second invention is characterized by comprising: a feature analysis unit that analyzes an input speech signal and outputs a feature vector time series as learning data; a standard model storage unit that stores standard models created in advance in units of phonemes; a recognition unit that divides the learning data into the phoneme units and obtains the similarity of the divided learning data to the standard models; and a learning unit that performs learning using the data that remains after removing the portions corresponding to phonemes whose similarity, output from the recognition unit, is smaller than a predetermined value.

[0008]

[Operation] First, the first invention is described. The phoneme-unit standard models used to recognize the uttered learning data are assumed to be the HMMs (hidden Markov models) described in Reference 1, created in advance by the learning method described in Reference 1. An HMM is a kind of state transition network in which a state transition probability and a vector output probability are defined for each state.
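By way of illustration only, the following Python sketch shows how a similarity (log-likelihood) can be computed for an HMM of this kind with the forward recursion, which is the forward half of the forward-backward algorithm. It assumes discrete output symbols in place of feature vectors; the function and parameter names are hypothetical and are not taken from Reference 1.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Log-likelihood of a discrete observation sequence under an HMM.

    obs         : list of observed symbol indices (assumed discrete)
    start[i]    : probability of starting in state i
    trans[i][j] : state transition probability from state i to state j
    emit[i][o]  : probability that state i outputs symbol o
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Propagate the forward variables one frame ahead.
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][obs[t]]
                for j in range(n)
            ]
        # Rescale at every frame to avoid numerical underflow,
        # accumulating the log of the scale factors.
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_like += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_like
```

The returned value can serve as a similarity: larger values mean that the data matches the model better.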

[0009]

The learning data is converted into a feature vector time series by the mel-cepstrum method described on pages 154-160 of Furui, "Digital Speech Processing," Tokai University Press, 1985 (hereinafter, Reference 3), and is then divided into phoneme units. This division can be performed by visual inspection or by the Viterbi algorithm described in S. E. Levinson, L. R. Rabiner, and M. M. Sondhi, "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074 (hereinafter, Reference 4). When the Viterbi algorithm is used, the standard models for the phonemes that make up the learning data are concatenated into a single model, and the optimal state transition path within that model is found by the Viterbi algorithm. The times in the learning data that correspond to the start and end of each phoneme's standard model on that path are then determined, and these times divide the data into phoneme units. After the division, the similarity to the standard model is obtained for each phoneme from the data in the learning data that corresponds to that phoneme. When the learning data has been divided with the Viterbi algorithm, the likelihood at the final state of each phoneme's standard model on the optimal state transition path, which is already available from the division step, can be used as the similarity. Alternatively, the similarity can be obtained by applying the forward-backward algorithm described in Reference 4 to the learning data of each phoneme segment. Learning is performed with the uttered learning data only when the similarities for all of the standard models, or for specific ones, are larger than a predetermined fixed value.
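By way of illustration only, the Viterbi-based division into phoneme units can be sketched in Python as follows. The sketch assumes a simple left-to-right chain of states in which each state either repeats or advances to the next state, per-frame log output probabilities that have been computed beforehand, and two constant transition log-probabilities. All names are hypothetical and are not taken from Reference 4.

```python
NEG_INF = float("-inf")


def viterbi_segment(log_emit, state_phoneme, log_stay=0.0, log_move=0.0):
    """Divide frames into phoneme units along the best left-to-right path.

    log_emit[t][s]   : log output probability of frame t in state s of the
                       concatenated model (assumed precomputed)
    state_phoneme[s] : position (0-based) in the utterance of the phoneme
                       that owns state s
    Returns (segments, best_score); each segment is (phoneme, start, end)
    with `end` exclusive.
    """
    T, S = len(log_emit), len(state_phoneme)
    score = [[NEG_INF] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]
    score[0][0] = log_emit[0][0]  # the path must start in the first state
    for t in range(1, T):
        for s in range(S):
            stay = score[t - 1][s] + log_stay
            move = score[t - 1][s - 1] + log_move if s > 0 else NEG_INF
            if move > stay:
                best, back[t][s] = move, s - 1
            else:
                best, back[t][s] = stay, s
            score[t][s] = best + log_emit[t][s]
    if score[T - 1][S - 1] == NEG_INF:
        raise ValueError("too few frames to pass through every state")
    # Trace back from the final state, where the path must end.
    path = [S - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    # Turn the state path into phoneme start and end times.
    segments, start = [], 0
    for t in range(1, T + 1):
        if t == T or state_phoneme[path[t]] != state_phoneme[path[t - 1]]:
            segments.append((state_phoneme[path[t - 1]], start, t))
            start = t
    return segments, score[T - 1][S - 1]
```

Each returned segment gives the frame range assigned to one phoneme, and a per-phoneme similarity can then be computed on that range.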

[0010]

Next, the second invention is described. In this case, when the similarity for a phoneme is smaller than the predetermined fixed value, learning is performed using the learning data with the speech segment corresponding to that phoneme removed. As a result, only correctly uttered learning data is used, and learning is performed with high accuracy.

[0011]

[Embodiments] Embodiments of the present invention will now be described with reference to the drawings.

[0012]

FIG. 1 is a block diagram showing an embodiment of the first invention. This embodiment comprises a feature analysis unit 11, a standard model storage unit 12, a recognition unit 13, and a learning unit 14.

[0013]

The standard model storage unit 12 holds the standard models P in advance. The speech signal S for learning is input to the feature analysis unit 11, which converts it into a feature vector time series V using the mel-cepstrum method described in Reference 3. The recognition unit 13 receives the feature vector time series V and the standard models P from the standard model storage unit 12. The feature vector time series V is divided into phoneme units by visual inspection, and for the feature vector time series V_n (n = 1, 2, ..., N, where N is the number of phonemes) of each phoneme segment, the similarity g_n (n = 1, 2, ..., N) is obtained by the forward-backward algorithm described in Reference 4, using the standard model for that phoneme. The similarities are output as the similarity sequence G = {g_1, g_2, ..., g_N}. The learning unit 14 receives the feature vector time series V and the similarity sequence G, and performs learning using the feature vector time series V when each similarity g_n is larger than a predetermined fixed value K.
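By way of illustration only, the decision made by the learning unit 14 can be sketched in Python as follows. The function name `should_learn` and the optional `only` argument, which restricts the check to specific phonemes, are hypothetical; the sample values are invented.

```python
def should_learn(similarities, k, only=None):
    """Return True when the utterance may be used for learning.

    similarities : similarity sequence G = [g_1, ..., g_N]
    k            : predetermined fixed value K
    only         : optional 0-based indices of the phonemes to check;
                   by default every phoneme is checked
    """
    indices = range(len(similarities)) if only is None else only
    # Every checked similarity g_n must be strictly larger than K.
    return all(similarities[n] > k for n in indices)


# Hypothetical similarity sequences for two utterances of a three-phoneme word.
good = [-1.0, -2.0, -1.5]
bad = [-1.0, -7.5, -1.5]
```

With K = -3.0, the first utterance is used for learning and the second is rejected as a whole because its second phoneme falls below K.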

[0014]

FIG. 2 is a block diagram showing an embodiment of the second invention. This embodiment comprises a feature analysis unit 11, a standard model storage unit 12, a recognition unit 13, and a learning unit 14a.

[0015]

The standard model storage unit 12 holds the standard models P in advance. The speech signal S for learning is input to the feature analysis unit 11, which converts it into a feature vector time series V using the mel-cepstrum method described in Reference 3. The recognition unit 13 receives the feature vector time series V and the standard models P from the standard model storage unit 12. The feature vector time series V is divided into phoneme units by visual inspection, and for the feature vector time series V_n (n = 1, 2, ..., N, where N is the number of phonemes) of each phoneme segment, the similarity g_n (n = 1, 2, ..., N) is obtained by the forward-backward algorithm described in Reference 4, using the standard model for that phoneme. The similarities are output as the similarity sequence G = {g_1, g_2, ..., g_N}. The learning unit 14a receives the feature vector time series V and the similarity sequence G, and performs learning using the feature vector time series that remains after the feature vector time series V_n of every phoneme whose similarity g_n is smaller than a predetermined fixed value K has been removed from V.
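By way of illustration only, the selection performed by the learning unit 14a can be sketched in Python as follows. The function name `select_segments` and the sample data are hypothetical; the per-phoneme segments V_n are represented simply as lists of frames.

```python
def select_segments(segments, similarities, k):
    """Keep the phoneme segments whose similarity is not smaller than K.

    segments     : list [V_1, ..., V_N] of per-phoneme frame lists
    similarities : similarity sequence G = [g_1, ..., g_N]
    k            : predetermined fixed value K
    Returns a list of (n, V_n) pairs with 0-based phoneme positions n.
    """
    if len(segments) != len(similarities):
        raise ValueError("one similarity is required per phoneme segment")
    # Segments with g_n < K are removed; all others are kept for learning.
    return [
        (n, v_n)
        for n, (v_n, g_n) in enumerate(zip(segments, similarities))
        if g_n >= k
    ]


# Hypothetical utterance with three phoneme segments; the second is poor.
kept = select_segments([["a1", "a2"], ["b1"], ["c1", "c2"]], [-1.0, -7.5, -2.0], -3.0)
```

Unlike the first invention, the utterance is not discarded as a whole: the first and third segments remain available for learning their phoneme models.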

[0016]

Although the above description concerns speech recognition, the present invention can also be used, for example, to select the speech data that serves as the reference when learning the synthesis models used in speech synthesis.

[0017]

[Effects of the Invention] As described above, by checking phoneme by phoneme whether uttered learning data is suitable for learning the standard models, the present invention achieves highly accurate learning of phoneme-unit standard models.

[Brief description of drawings]

FIG. 1 is a block diagram showing an embodiment of the first invention.

FIG. 2 is a block diagram showing an embodiment of the second invention.

[Explanation of symbols]

11: feature analysis unit
12: standard model storage unit
13: recognition unit
14, 14a: learning unit
G: similarity sequence
P: standard model
S: speech signal for learning
V: feature vector time series

Claims

[Claims]

1. A speech learning system comprising: a feature analysis unit that analyzes an input speech signal and outputs a feature vector time series as learning data; a standard model storage unit that stores standard models created in advance in units of phonemes; a recognition unit that divides the learning data into the phoneme units and obtains the similarity of the divided learning data to the standard models; and a learning unit that performs learning using the learning data when the similarities, output from the recognition unit, for the phonemes constituting the learning data are larger than a predetermined value.

2. A speech learning system comprising: a feature analysis unit that analyzes an input speech signal and outputs a feature vector time series as learning data; a standard model storage unit that stores standard models created in advance in units of phonemes; a recognition unit that divides the learning data into the phoneme units and obtains the similarity of the divided learning data to the standard models; and a learning unit that performs learning using the data that remains after removing the portions corresponding to phonemes whose similarity, output from the recognition unit, is smaller than a predetermined value.