JPS6073598A - Voice recognition system - Google Patents

Voice recognition system

Info

Publication number
JPS6073598A
JPS6073598A JP58180247A JP18024783A
Authority
JP
Japan
Prior art keywords
syllable
length
syllables
input
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP58180247A
Other languages
Japanese (ja)
Inventor
市川 熹
畑岡 信夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Priority to JP58180247A priority Critical patent/JPS6073598A/en
Publication of JPS6073598A publication Critical patent/JPS6073598A/en
Pending legal-status Critical Current

Links

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Application of the Invention] The present invention relates to a speech recognition method, and in particular to a method for recognizing continuous utterances in phoneme units.

[Background of the Invention]

In a recognition device that must handle many kinds of words, word-unit recognition of the kind put into practical use so far poses many practical problems, both in the effort of registering standard patterns and in recognition capability. Techniques that recognize speech in phoneme or syllable units have therefore attracted attention. Phonemes consist of vowels and consonants, and it is known that phonemes in continuous speech are greatly deformed by the preceding and following phonemes (coarticulation).

In general, the influence of a vowel on a consonant is greater than that of a consonant on a vowel. To recognize speech with these influences taken into account, one could adopt standard patterns whose units reflect the preceding and following phonemes, but the number of combinations becomes so large that this is impractical. One method therefore first recognizes the vowels, which deform relatively little, and then recognizes each consonant sandwiched between recognized vowels using standard patterns for that vowel environment. Certain vowels, however (mainly /i/ and /u/), may be devoiced or dropped when sandwiched between certain consonants (mainly unvoiced consonants); this is the devoicing phenomenon.

When a vowel is devoiced, its spectral structure and other characteristics differ from those of the voiced vowel, so ordinary vowel-recognition methods have difficulty detecting it and often miss it. When a vowel within a word is missed, the consonants before and after it also become hard to recognize, and the three consecutive phonemes (consonant, vowel, consonant), or the two syllables they belong to, are misrecognized.

[Object of the Invention]

An object of the present invention is to provide a method for estimating, in continuous speech, the positions of devoiced or dropped vowels and the positions of geminate consonants (sokuon) and syllabic nasals (hatsuon).

[Summary of the Invention]

To achieve the above object, the present invention relies on the following facts: (a) in Japanese, as a rule a consonant and a vowel pair to form a syllable, and the rhythm at which syllables occur is nearly constant; (b) the speed of this rhythm varies from speaker to speaker and from occasion to occasion, but in cooperative speech its changes and fluctuations are gradual; (c) speech recognizers are basically used online, so recognition results are confirmed on the spot and errors are corrected on the spot; and (d) syllable length depends on the number of syllables in the utterance spoken at one time, and this tendency shows a certain regularity. Based on these facts, the position of a devoiced or dropped vowel is estimated by the following procedure.

(1) Set an average syllable length.

(2) Assuming an allowable syllable-length variation of about ±30% of (1) (or of (8) below), estimate the number of syllables in the first utterance. For the syllable-count estimation one can use, for example, Japanese Patent Application No. Sho 57-71230, already filed by the same inventors.

(3) Recognize the input speech. For the recognition method, too, one can use, for example, Japanese Utility Model Application No. Sho 54-91283, already filed by the same inventors.

(4) Display and confirm the recognition result. If there is an error, correct the erroneous portion. Various correction methods are possible, such as input from a keyboard.

(5) Based on the confirmation result, fix the number of syllables in the input.

(6) From the length of the input speech and the result of (5), obtain the syllable duration of the speech actually input (simply divide the total length by the number of syllables).

(7) Based on fact (d), correct the estimated syllable length. FIG. 1 shows an example of measured distributions of syllable count versus syllable length when inventor A carefully uttered about 100 words. Words of about four syllables are the most common in Japanese, so if the syllable length at that count is used as the average value, then in the example of this figure the measured value should be multiplied by about 0.85 when the input has 3 syllables and by about 1.1 when it has 5. As the figure also shows, the lengths of utterances with the same number of syllables by the same speaker vary, so a correction of roughly this precision is sufficient.

(8) Take the weighted average of the average syllable length used so far and the actually measured syllable length as the new average syllable length.

(9) Return to (2) and determine the number of syllables while correcting the estimated syllable length on the basis of the recognition results.

In the above procedure, after the number of syllables is estimated in (2), the estimated average syllable length of the utterance is obtained (simply divide the speech interval by the estimated syllable count), and the lengths of the non-voiced intervals within the speech interval are examined in turn. When an interval at least 1.5 times the estimated average syllable length exists, a devoiced or dropped vowel is assumed to lie at its center. If, however, the same acoustic characteristic continues over more than 70% of the interval from its front (for example, the spectral envelope stays the same, or a soundless interval continues), a geminate consonant (sokuon) rather than a devoiced or dropped vowel is assumed to be present (at the end of a word, devoicing/dropping is assumed).

By the above procedure, even when the speaking rate changes during use, the number of vowels, and in particular the positions of devoiced or dropped vowels, can be estimated accurately while tracking that change.

[Embodiment of the Invention]

An embodiment of the present invention will now be described with reference to the drawings.

FIG. 2 is a block diagram illustrating an embodiment of the present invention.

In FIG. 2, speech entering input terminal 1 is sent to a short-time power analysis section 2 and a spectrum analysis section 3.

Once the short-time power is obtained, its values are clipped in intervals where they exceed a fixed level, and the result is sent to the short-time power-pattern buffer register 5 and also to the control section 6. The control section regards the first vowel interval of the speech as beginning at the moment the power value exceeds a predetermined first threshold θ1, and regards the last voiced vowel interval of the input speech as having ended when the power thereafter stays below a second threshold θ2 for at least a fixed time (for example, 500 ms).

Let T0 be the length from this start point to the end point. An average vowel length t0 is recorded in memory 7 in advance; the control section determines how many syllables in the range t0 ± 30% fit within T0, thereby fixing the estimated range of syllable counts, and at the same time obtains each candidate syllable length τi within that range. The control section 6 has the rectangular-pattern generator 9 produce a rectangular pattern for each count in this estimated range, has the correlation section 8 compute the cross-correlation coefficient between each pattern and the pattern in buffer memory 5, and takes the results in. FIG. 3 shows an example: (a) is the short-time power pattern p(t) of the utterance /teuchisoba/, and (b) is the rectangular pattern t_i(t) that correlated most highly with it. With T0 as the interval length, the correlation value r_i is obtained as

r_i = (1/T0) Σ_t p(t) · t_i(t),

and FIG. 4 shows the result for this example: the correlation is maximal at 5, the correct syllable count. In this example the estimated range of syllable counts is taken wider than the average ±30% (3 to 8) so that the situation is easy to see. FIG. 5 shows a second example, the utterance /aokusa/. The vowel /u/ is devoiced and the power-pattern values there are small, yet, as FIG. 6 shows, the correlation is highest at 4, the correct syllable count. Interval A in FIG. 5(a) is long, which suggests that a vowel has been devoiced or dropped within it; the position marked ↑ on the rectangular pulses in (b) points to the same conclusion. In fact, /u/ is devoiced at this position in this example.
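The correlation r_i = (1/T0) Σ p(t)·t_i(t) between the power pattern and each candidate rectangular pattern can be sketched like this. The pattern shape assumed here, n equal pulses with a 50% duty cycle, is one plausible construction; the patent does not spell out the exact shape.

```python
def rectangular_pattern(n_syllables, length, duty=0.5):
    """n equal-width rectangular pulses over `length` samples (assumed shape)."""
    pattern = [0.0] * length
    period = length / n_syllables
    for i in range(length):
        if (i % period) / period < duty:  # front `duty` fraction of each period
            pattern[i] = 1.0
    return pattern


def correlation(power, pattern):
    """r_i = (1/T0) * sum_t p(t) * t_i(t)."""
    assert len(power) == len(pattern)
    return sum(p * q for p, q in zip(power, pattern)) / len(power)


def estimate_syllable_count(power, candidate_counts):
    """Pick the candidate count whose rectangular pattern correlates most
    strongly with the short-time power pattern."""
    return max(candidate_counts,
               key=lambda n: correlation(power,
                                         rectangular_pattern(n, len(power))))


# A synthetic power pattern with 5 energy bursts, searched over counts 3..8
# as in the patent's example:
print(estimate_syllable_count(rectangular_pattern(5, 200), range(3, 9)))  # 5
```

A gap between the chosen pattern's pulses and the actual energy bursts is exactly the kind of long low-power interval that the devoicing rule above then inspects.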

In addition to the thresholds θ1 and θ2, the control section 6 holds lower thresholds θ3 and θ4; when there is an unvoiced sound at the beginning of a word, it presumes that a vowel deformed by devoicing, dropping, or the like exists at that position as well.

Meanwhile, the input signal fed to the spectrum analysis section 3 is converted to spectral information and stored in buffer memory 10. At each analysis frame, the matching section 11 matches the spectral information in buffer memory 10 against the vowels, silence, the unvoiced portions of unvoiced fricatives, and so on held in standard-pattern memory 12, and the matching results are sent in turn to the control section 6.

From these results and the syllable-position estimates described above, the control section 6 decides whether each segment is a vowel, a geminate consonant, or a devoiced/dropped vowel. A devoiced vowel is in principle assumed to be /i/ or /u/, but in combinations that could yield words such as /kokoro/ or /haha/, /o/ or /a/ is assumed.
The control unit 6 determines a vowel, a consonant, or a devoiced/dropped vowel based on this result and the syllable position estimation result (described above). In principle, devoiced vowels are assumed to be 111 and lul, but in combinations that may result in l kokoro l, 1 haha1, etc., they are assumed to be 1.1 or 1.1.

Once the vowel candidates are determined in this way, the control section 6 sends the spectral information sequence in buffer memory 10 to the second matching section 14 while retrieving from standard-pattern memory 13 the standard patterns of consonants sandwiched between the estimated vowels; it performs the matching, takes the results in, merges them with the vowel estimates, and displays the combination on the confirmation section 15 as the syllable recognition result. The user examines the displayed result and presses the OK key if it is correct; if it is wrong, the user enters the correction from the keyboard. From the confirmation result the control section 6 obtains the correct number of syllables, obtains the average syllable length of that input speech from the interval-length information T0, further corrects that length as a function of the syllable count as explained for FIG. 1, and takes it as the estimated syllable length t0′. From this value t0′ and the average syllable length t0 used so far, a new average syllable length is obtained as the weighted average

t0 ← α·t0 + β·t0′, where α + β = 1. As a result, even if the speaking rate changes gradually during use, the average syllable length can be kept up to date while tracking that change.
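The update t0 ← α·t0 + β·t0′ with α + β = 1 is a simple exponential moving average; a one-line sketch follows (α = 0.7 is only an illustrative weight, not a value from the patent).

```python
def new_average_syllable_length(avg_so_far, measured, alpha=0.7):
    """Weighted average t0 <- alpha*t0 + (1 - alpha)*t0' (so alpha + beta = 1);
    the running average drifts toward each newly measured syllable length,
    letting it track gradual changes in the user's speaking rate."""
    return alpha * avg_so_far + (1.0 - alpha) * measured


# Running average 150 ms, newly confirmed utterance measures 180 ms/syllable:
print(new_average_syllable_length(150.0, 180.0))  # 159.0...
```

A larger α makes the estimate more stable against outlier utterances; a smaller α makes it follow rate changes faster.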

According to measured results, except for extreme outliers, the syllable duration of successively uttered speech does not vary by more than ±30% of the weighted average duration up to that point (mostly it stays within 20%), so the rectangular waves correlated with the short-time power pattern of the input waveform need only cover the range within ±30% of the weighted average duration. This not only shortens processing time but also greatly reduces syllable-duration estimation errors; as a result, recognition performance improves substantially and an easy-to-use speech recognition device is obtained.

The corrected average syllable length t0 is stored in memory 7 and used for the next input. The confirmed recognition result is output from terminal 17 via the output section 16.

Note that diphthongization, the influence of syllabic nasals, and the like may put the estimated syllable count off by about ±1, but most of the resulting syllable-length estimation errors are within 25% and do not hinder the estimation of devoiced or dropped vowels or of geminate consonants.

[Effects of the Invention]

As explained above, the present invention makes it possible to estimate the positions of devoiced or dropped vowels while tracking a speaking rate that changes gradually during use, and thus provides a speech recognition method that has high recognition capability and is easy to use.

[Brief Description of the Drawings]

FIG. 1 illustrates the relationship between the number of syllables in an utterance and syllable duration; FIG. 2 is a block diagram of an embodiment of the present invention; and FIGS. 3 to 6 show examples explaining the estimation of syllable count and of devoicing position. 6 ... control section.

Claims (1)

[Claims]

1. A speech recognition method comprising means for estimating, on the basis of average syllable-length information, the syllable lengths constituting input speech, and means for confirming the recognition result of the input speech, wherein the number of syllables of the input speech obtained from said confirmation is used to obtain the input syllable length of the input speech, said average syllable length is corrected according to the input syllable length, and the corrected value is used as the average syllable-length information for the next input speech.

2. The speech recognition method according to claim 1, wherein the positions of devoiced vowels, dropped vowels, geminate consonants (sokuon), and syllabic nasals (hatsuon) are estimated using the estimated syllable-length information.
JP58180247A 1983-09-30 1983-09-30 Voice recognition system Pending JPS6073598A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP58180247A JPS6073598A (en) 1983-09-30 1983-09-30 Voice recognition system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP58180247A JPS6073598A (en) 1983-09-30 1983-09-30 Voice recognition system

Publications (1)

Publication Number Publication Date
JPS6073598A true JPS6073598A (en) 1985-04-25

Family

ID=16079935

Family Applications (1)

Application Number Title Priority Date Filing Date
JP58180247A Pending JPS6073598A (en) 1983-09-30 1983-09-30 Voice recognition system

Country Status (1)

Country Link
JP (1) JPS6073598A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS6352200A (en) * 1986-08-22 1988-03-05 株式会社日立製作所 Voice recognition equipment


Similar Documents

Publication Publication Date Title
US6304844B1 (en) Spelling speech recognition apparatus and method for communications
JP4085130B2 (en) Emotion recognition device
US7783484B2 (en) Apparatus for reducing spurious insertions in speech recognition
US20090313016A1 (en) System and Method for Detecting Repeated Patterns in Dialog Systems
JPS58102299A (en) Partial unit voice pattern generator
CN107610691B (en) English vowel sounding error correction method and device
JP3311460B2 (en) Voice recognition device
JP4953767B2 (en) Speech generator
WO1997040491A1 (en) Method and recognizer for recognizing tonal acoustic sound signals
JP5754141B2 (en) Speech synthesis apparatus and speech synthesis program
Digalakis et al. Large vocabulary continuous speech recognition in greek: corpus and an automatic dictation system.
JPH11184491A (en) Voice recognition device
JP4239479B2 (en) Speech recognition apparatus, speech recognition method, and speech recognition program
JP2008026721A (en) Speech recognizer, speech recognition method, and program for speech recognition
JP2010060846A (en) Synthesized speech evaluation system and synthesized speech evaluation method
JPS6073598A (en) Voice recognition system
Blomberg Synthetic phoneme prototypes in a connected-word speech recognition system
JPS60129796A (en) Sillable boundary detection system
JP2006010739A (en) Speech recognition device
JP4313724B2 (en) Audio reproduction speed adjustment method, audio reproduction speed adjustment program, and recording medium storing the same
JP2001331191A (en) Device and method for voice synthesis, portable terminal and program recording medium
CN111383620B (en) Audio correction method, device, equipment and storage medium
Peng et al. An innovative prosody modeling method for Chinese speech recognition
Gibson et al. Speech signal processing
Sinha et al. Exploring the role of pitch-adaptive cepstral features in context of children's mismatched ASR