JP2834471B2

JP2834471B2 - Pronunciation evaluation method

Info

Publication number: JP2834471B2
Application number: JP1097735A
Authority: JP
Inventors: 聡三樹; 洋浜田; 良平中津
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1989-04-17
Filing date: 1989-04-17
Publication date: 1998-12-09
Anticipated expiration: 2013-12-09
Also published as: JPH02275499A

Description

[Industrial Field of Application] The present invention relates to a pronunciation evaluation method that evaluates how good a speaker's pronunciation is when the speaker under evaluation utters a foreign language or the like.

[Related Art] Quantitative evaluation of the prosodic quality of foreign-language pronunciation is generally performed by taking as a reference speech in which the target foreign language is pronounced well, for example the speech of a person whose native language is that foreign language (hereinafter, the standard speaker), and asking how close to it the speech of the speaker being evaluated (hereinafter, the evaluated speaker) is for the same utterance content. In this case, in order to disregard differences in the nonlinear variation of the utterance along the time axis and to allow the prosody of portions uttering the same phoneme to be compared, the speech is divided into fixed time intervals over which the spectrum can be regarded as unchanging (called frames; the same applies hereinafter). The speech of the standard speaker and the speech of the evaluated speaker are time-aligned frame by frame on the basis of spectral similarity. Then, between frames whose spectral characteristics correspond, that is, frames whose phonemes are considered to match, the prosodic features of the evaluated speaker and the standard speaker are compared to see how much they differ, and the evaluation is made on that basis.

[Problems to Be Solved by the Invention] In this evaluation method, the conventional technique determined the above frame correspondence by matching the speech of the standard speaker and the speech of the evaluated speaker as they are, using the DP matching method, an HMM, or the like, and taking the resulting alignment. However, when the DP matching method or a similar method is simply applied to speech from different speakers, individual differences in the spectral characteristics of each phoneme often cause the frame correspondence based on physical spectral similarity between the evaluated speaker and the standard speaker to disagree with the phoneme correspondence. As a result, the prosodic features of the evaluated speaker and the standard speaker were compared while the phoneme correspondence was wrong, which adversely affected the evaluation.

Furthermore, in comparing prosodic features, the conventional method of simply comparing the parameters with each other cannot separate speaker individuality, such as individual differences in average fundamental frequency, average speech power, and dynamic range, from differences in the basic shape of the prosodic parameters. This also adversely affected the evaluation.

An object of the present invention is to provide a pronunciation evaluation method that improves the accuracy of the phoneme correspondence obtained from the spectral-similarity-based frame correspondence between the evaluated speaker and the standard speaker, which was inaccurate in the conventional technique, so that prosodic features can be evaluated more accurately, and that can evaluate differences in the basic shape of the prosodic parameters separately from speaker individuality such as individual differences in average fundamental frequency, average speech power, and dynamic range.

[Means for Solving the Problems] The main features of the present invention are the following two points. First, the speech of the evaluated speaker is speaker-adapted to the speech of the standard speaker by a technique such as codebook mapping using histograms, and the frame correspondence is computed from the spectral similarity between the speaker-adapted speech of the evaluated speaker and the speech of the standard speaker. This raises the accuracy of the phoneme correspondence, and comparing the prosodic features of the evaluated speaker and the standard speaker for each pair of corresponding frames gives a more accurate evaluation of the prosodic quality of the pronunciation. Second, the prosodic features of the evaluated speaker and the standard speaker are normalized using their mean and variance, and their differences are then compared for each of the above frame correspondences, which gives a more accurate evaluation of differences in the basic shape of the prosodic parameters.

[Operation] Speaker adaptation makes the relationship between the speakers' individual differences in spectral characteristics clearer, so that those differences can be absorbed and the accuracy of the phoneme correspondence obtained by the spectral-similarity-based frame correspondence method improves.

In addition, normalizing the mean and variance absorbs speaker individuality such as individual differences in average fundamental frequency, average speech power, and dynamic range, so that only the difference in the basic shape of the prosodic parameters is isolated and evaluated.

[Embodiment] FIG. 1 illustrates an embodiment of the present invention, which evaluates the prosodic quality of speech uttered by the evaluated speaker.

First, the standard speaker and the evaluated speaker utter the same speech set (words, simple sentences, and so on). Next, using this utterance data, the speech of the evaluated speaker is speaker-adapted to the speech of the standard speaker. The method used here is described below.

First, the codebook generation unit 1 creates a codebook for each speaker. The uttered speech 2 of the standard speaker is analyzed frame by frame in the speech analysis unit 3. Band-pass filter analysis, linear prediction analysis, FFT analysis, and other analysis methods have been proposed, and any of them may be used. Here, p-th order LPC cepstrum coefficients are used as the feature parameters describing the spectrum of each frame. Next, the analyzed speech is clustered in the clustering calculation unit 4 to create the standard speaker's codebook 5, which consists of a predetermined number n of code vectors representing typical spectral patterns of the standard speaker's speech. This clustering method is described in detail in Linde, Buzo, and Gray: "An Algorithm for Vector Quantizer Design", IEEE Trans. Comm., Vol. COM-28, 1980.
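The codebook creation step above can be sketched as follows. This is a minimal illustration, assuming the binary-splitting form of the Linde-Buzo-Gray procedure with squared Euclidean distance between feature vectors. The function names, the additive split offset `eps`, the fixed iteration count, and the restriction of `n` to a power of two are assumptions made for the example and are not specified in the patent.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def nearest(codebook, v):
    """Index of the code vector closest to v."""
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], v))


def lbg_codebook(frames, n, iterations=20, eps=0.01):
    """Build an n-entry codebook (n assumed to be a power of two)."""
    codebook = [centroid(frames)]
    while len(codebook) < n:
        # Split every code vector into two slightly perturbed copies.
        codebook = [[c + s for c in cv] for cv in codebook for s in (eps, -eps)]
        for _ in range(iterations):
            # Assign each frame to its nearest code vector.
            cells = [[] for _ in codebook]
            for v in frames:
                cells[nearest(codebook, v)].append(v)
            # Move each code vector to the centroid of its cell;
            # an empty cell keeps its previous code vector.
            codebook = [centroid(cell) if cell else cv
                        for cell, cv in zip(cells, codebook)]
    return codebook
```

In practice `frames` would hold one p-dimensional LPC cepstrum vector per analysis frame, and the same routine would be run once per speaker.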

The speech 6 of the evaluated speaker is analyzed by the speech analysis unit 7 in the same way and clustered by the clustering calculation unit 8 to create the codebook 9.

Next, in the code sequence generation unit 10, the speech 2 of the standard speaker is vector-quantized frame by frame in the vector quantization unit 11, using the standard speaker's codebook 5 created above, to produce the vector code sequence 12. The speech 6 of the evaluated speaker is likewise vector-quantized in the vector quantization unit 13 to produce the vector code sequence 14.
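As a rough sketch of this step, frame-wise vector quantization replaces each frame's feature vector by the index of its nearest code vector. The squared Euclidean distance and the names below are assumptions for illustration; the patent does not fix the distance measure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def vector_quantize(frames, codebook):
    """Map each frame to the index of its nearest code vector."""
    return [
        min(range(len(codebook)), key=lambda i: sq_dist(codebook[i], frame))
        for frame in frames
    ]
```

Each speaker's frames are quantized against that speaker's own codebook, giving one code sequence per utterance.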

The speaker adaptation unit 15 then adapts the speech of the evaluated speaker to the speech of the standard speaker. Several speaker adaptation methods have been proposed; the example given here uses the histogram-based method proposed in Shikano, Lee, and Reddy: "Speaker Adaptation through Vector Quantization", Proc. ICASSP-86, 49.5, 1986.

First, the matching calculation unit 16 performs a matching operation between the standard speaker's vector code sequence 12 and the evaluated speaker's vector code sequence 14 for the same target utterance, and computes the frame correspondence between the two vector code sequences. Using this correspondence, the histogram generation unit 17 expresses, in the form of a histogram, how the code vectors in the standard speaker's codebook correspond to each individual code vector in the evaluated speaker's codebook. The adapted codebook generation unit 18 then creates, for each code vector of the evaluated speaker before adaptation, the corresponding speaker-adapted code vector as a weighted average of the standard speaker's code vectors, using the histogram for that pre-adaptation code vector as the weights. These adapted code vectors are collected to create the speaker-adapted codebook 19 of the evaluated speaker.
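The histogram-weighted mapping described above might look like the following sketch. It assumes the frame correspondence is already available as a list of `(evaluated_code, standard_code)` index pairs, one per aligned frame pair. Keeping the original code vector when an evaluated-speaker code never occurs in the alignment is an assumption of this example, since the patent does not say how that case is handled.

```python
def adapt_codebook(aligned_codes, eval_codebook, std_codebook):
    """Map each evaluated-speaker code vector onto the standard speaker's space.

    aligned_codes: list of (evaluated_code, standard_code) index pairs.
    Returns the adapted codebook, indexed like eval_codebook.
    """
    # hist[i][j] counts how often evaluated code i was aligned with standard code j.
    hist = [[0] * len(std_codebook) for _ in eval_codebook]
    for e, s in aligned_codes:
        hist[e][s] += 1

    dim = len(std_codebook[0])
    adapted = []
    for i, counts in enumerate(hist):
        total = sum(counts)
        if total == 0:
            # Assumption: an unobserved code keeps its original vector.
            adapted.append(list(eval_codebook[i]))
            continue
        # Histogram-weighted average of the standard speaker's code vectors.
        adapted.append([
            sum(counts[j] * std_codebook[j][d] for j in range(len(std_codebook))) / total
            for d in range(dim)
        ])
    return adapted
```

Because the adapted codebook keeps the evaluated speaker's code indices, the existing code sequence can be reused unchanged and simply looked up in the new codebook.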

Next, the frame correspondence calculation unit 20 uses the code vectors in the adapted codebook 19 of the evaluated speaker, created as described above, and the code vectors in the standard speaker's codebook 5 as feature parameters. It computes the spectral similarity between frames from the respective vector code sequences 12 and 14 by referring to codebooks 5 and 19, and determines the frame correspondence 21 by a matching method. Any of the proposed matching methods, such as the DP matching method or an HMM, can be used. A matching method of this kind using DP matching is reported in Shikano: "Word Speech Recognition by Vector Quantization of Input Speech", Acoustical Society of Japan, Speech Research Group material, S82-60, 1982.
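One way to picture this frame correspondence step is a basic DP matching (dynamic time warping) over the two code sequences, with the local distance looked up through the two codebooks. The sketch below assumes squared Euclidean distance and the simple symmetric step pattern (diagonal, vertical, horizontal) with both endpoints fixed. These choices and all names are illustrative assumptions; the cited report may use different local constraints.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_match(std_codes, eval_codes, std_codebook, adapted_codebook):
    """Return the frame correspondence as a list of (std_frame, eval_frame) pairs."""
    n, m = len(std_codes), len(eval_codes)
    inf = float("inf")

    def local(i, j):
        # Spectral distance between the code vectors of the two frames.
        return sq_dist(std_codebook[std_codes[i]], adapted_codebook[eval_codes[j]])

    cost = [[inf] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]
    cost[0][0] = local(0, 0)
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, arg = inf, None
            # Predecessors: diagonal, previous standard frame, previous evaluated frame.
            for pi, pj in ((i - 1, j - 1), (i - 1, j), (i, j - 1)):
                if pi >= 0 and pj >= 0 and cost[pi][pj] < best:
                    best, arg = cost[pi][pj], (pi, pj)
            cost[i][j] = best + local(i, j)
            back[i][j] = arg

    # Trace back from the final frame pair to the first.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        i, j = path[-1]
        path.append(back[i][j])
    path.reverse()
    return path
```

Passing the adapted codebook rather than the evaluated speaker's original codebook is what distinguishes this alignment from the conventional direct matching.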

Finally, the prosody evaluation unit 22 calculates evaluation values for the quality of the prosody.

Here, fundamental frequency and speech power are used as the prosodic features to be evaluated, and they are normalized as follows. First, the fundamental frequency of the target speech is extracted in the pitch extraction and smoothing unit 23 by a pitch extraction method such as the modified correlation method, and is then smoothed to remove extraction errors such as pitch doubling and halving. Next, in the pitch normalization unit 24, the logarithm is taken; the mean over the voiced portions of the extracted fundamental frequency is subtracted from the whole contour to normalize individual differences in average fundamental frequency; and the result is divided by the standard deviation over the same voiced portions to normalize individual differences in dynamic range. Speech power is likewise calculated by the power calculation unit 25. In the power normalization unit 26, the logarithm is taken, the mean of the speech power over the speech interval is subtracted from the whole to normalize average speech power, and the result is divided by the standard deviation over the speech interval to normalize individual differences in dynamic range. Besides simply matching the mean and variance in this way, normalization may also warp the parameters nonlinearly to suit human perception.
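The normalization just described amounts to a z-score of the log contour, with the statistics taken only over the relevant frames (voiced frames for fundamental frequency, the speech interval for power). A minimal sketch follows. Representing excluded frames as `None` in the output and leaving a flat contour unscaled are assumptions of this example.

```python
import math


def normalize_log(values, active):
    """Log-transform a contour, then remove its mean and scale by its std.

    values: per-frame fundamental frequency or power (positive where active).
    active: per-frame flags, True for frames included in the statistics
            (voiced frames for pitch, speech frames for power).
    Frames that are not active are returned as None.
    """
    logs = [math.log(v) if a else None for v, a in zip(values, active)]
    ref = [x for x in logs if x is not None]
    mean = sum(ref) / len(ref)
    std = math.sqrt(sum((x - mean) ** 2 for x in ref) / len(ref))
    if std == 0.0:
        std = 1.0  # Assumption: a flat contour is left unscaled.
    return [None if x is None else (x - mean) / std for x in logs]
```

Because the mean is removed in the log domain, multiplying a speaker's whole contour by a constant (for example a uniformly higher voice) leaves the normalized contour unchanged.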

Next, the prosodic features are compared on the basis of the frame correspondence 21 obtained after adaptation. Here, the absolute value of the difference is used as the comparison method. Various other comparison methods based on the frame correspondence 21 between the speaker-adapted speech of the evaluated speaker and the speech of the standard speaker are also possible, such as weighting the absolute values or taking the correlation of the corresponding prosodic feature parameters. The normalized fundamental frequency and speech power are passed to their respective prosody comparison units, namely the pitch comparison unit 27 and the power comparison unit 28, where the absolute value of the difference is calculated for each pair of corresponding frames obtained by the above method. For example, the fundamental frequency differences are averaged over the voiced portions only and the speech power differences are averaged over the entire speech interval, giving the evaluation values 29 and 30. These evaluation values 29 and 30 can be combined with a phonemic evaluation computed by another method to give an overall evaluation value for the quality of the pronunciation.
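The comparison step can be sketched as a mean absolute difference over the aligned frame pairs. In this illustration the contours are assumed to be already normalized, excluded frames (for example unvoiced frames in the pitch contour) are marked `None`, and a pair is skipped when either side is `None`. The function name and the `None` convention are assumptions, not details from the patent.

```python
def prosody_score(path, std_contour, eval_contour):
    """Mean absolute difference of normalized prosodic values over aligned frames.

    path: list of (std_frame, eval_frame) index pairs from the frame correspondence.
    Returns None when no aligned pair has a value on both sides.
    """
    diffs = [
        abs(std_contour[i] - eval_contour[j])
        for i, j in path
        if std_contour[i] is not None and eval_contour[j] is not None
    ]
    return sum(diffs) / len(diffs) if diffs else None
```

A smaller score means the evaluated speaker's contour follows the standard speaker's shape more closely; the same function can be applied separately to the pitch and power contours.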

As described above, this method computes the frame correspondence between the speech of the standard speaker and the speech of the evaluated speaker after the latter has been speaker-adapted to the standard speaker and its individuality absorbed, so the phonemes of the evaluated speaker and the standard speaker are matched more accurately. In addition, normalizing the fundamental frequency by its mean and variance over the voiced portions, and the speech power by its mean and variance over the speech interval, absorbs speaker individuality such as individual differences in average fundamental frequency, average speech power, and dynamic range.

As a result, prosodic features can be compared on the basis of a more accurate phoneme correspondence, and only the basic shape of the prosodic parameters is isolated and evaluated, so a more accurate evaluation is possible.

[Effects of the Invention] As described above, the present invention takes the frame correspondence between the speech of the standard speaker and the speech of the evaluated speaker after speaker adaptation, that is, after spectral individuality has been absorbed. This improves the accuracy of the phoneme correspondence and makes the evaluation of prosodic features more accurate. In addition, normalizing the prosodic parameters using their mean and variance has the advantage that only the basic shape of the prosodic features is isolated and evaluated.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the pronunciation evaluation method according to the present invention.

Continuation of the front page: (58) Fields surveyed (Int. Cl.⁶, DB name): G10L 3/00 551; JICST file (JOIS)

Claims

(57) [Claims]

A pronunciation evaluation method characterized in that the speech of an evaluated speaker is speaker-adapted to the speech of a standard speaker; frames of the speaker-adapted speech of the evaluated speaker and frames of the speech of the standard speaker are associated with each other on the basis of spectral similarity; and, for each pair of associated frames, the prosodic features of the evaluated speaker and the standard speaker, normalized using the mean and variance of their parameters, are compared.