JPS58129498A - Voice recognition - Google Patents

Voice recognition

Info

Publication number
JPS58129498A
JPS58129498A JP1087482A
Authority
JP
Japan
Prior art keywords
voiceless
consonant
phoneme
unvoiced
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP1087482A
Other languages
Japanese (ja)
Other versions
JPS637399B2 (en)
Inventor
入間野 孝雄
金指 久則
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Computer Basic Technology Research Association Corp
Original Assignee
Computer Basic Technology Research Association Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Computer Basic Technology Research Association Corp filed Critical Computer Basic Technology Research Association Corp
Priority to JP1087482A priority Critical patent/JPS58129498A/en
Publication of JPS58129498A publication Critical patent/JPS58129498A/en
Publication of JPS637399B2 publication Critical patent/JPS637399B2/ja
Granted legal-status Critical Current

Abstract

(57) [Abstract] This publication contains application data filed before the electronic filing system, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

The present invention relates to a speech recognition method; its object is to extract, with high accuracy, devoiced vowels that cannot be recognized by ordinary vowel recognition means, and thereby to improve the speech recognition rate.

First, the conventional devoiced-vowel extraction method is described. The basic idea is as follows: rough phoneme recognition is performed first, and when an unvoiced interval is long and, moreover, a pronounced dip in speech power exists within it, this dip is regarded as the boundary between two syllables, and the unvoiced interval is recognized as a sequence of three phonemes, namely an unvoiced consonant, a devoiced vowel, and an unvoiced consonant. This is explained concretely with reference to Fig. 1. The input speech is divided into frames of 10 ms each, acoustic-level analysis is performed, and the parameters needed for phoneme recognition are extracted. Next, segmentation (finding phoneme boundaries) and preliminary frame-by-frame phoneme recognition are carried out. At this point voiced vowels are recognized as such; a devoiced vowel, however, merges with the consonants on both sides and is recognized as a single unvoiced consonant. Therefore, when the preliminary phoneme recognition yields an unvoiced-consonant interval, it is necessary to check for the presence of a devoiced vowel. The conventional check routine is the part represented by the diamond-shaped branches in Fig. 1. First, the presence of an unvoiced consonant is checked; if there is none, no further check is made. If there is an unvoiced-consonant interval, and the number of frames in it exceeds a threshold T11 determined in advance from experimental results, and a silent frame (denoted Q) exists within the interval, and the number of frames in this Q interval exceeds a predetermined threshold T12, then the Q interval is regarded as forming the boundary between two syllables: the unvoiced-consonant interval is divided into a front half and a rear half, and the vowel UI is inserted immediately after the front unvoiced consonant. UI here means the phoneme U or I. After this devoicing processing, the recognized phoneme sequence is constructed and matched against a word dictionary to perform word recognition.
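The conventional check just described can be sketched as follows (a minimal illustration, not the patent's implementation; the function and variable names are ours, and frames are assumed to be pre-labeled 'C' for unvoiced consonant and 'Q' for silence; only the thresholds T11 and T12 come from the description):

```python
def max_q_run(labels):
    """Length of the longest run of silent ('Q') frames."""
    run = best = 0
    for lab in labels:
        run = run + 1 if lab == 'Q' else 0
        best = max(best, run)
    return best

def conventional_check(labels, t11, t12):
    """True if the unvoiced interval should be split as C, UI, C:
    the interval is longer than T11 frames AND it contains a
    silent (Q) run longer than T12 frames."""
    return len(labels) > t11 and max_q_run(labels) > t12
```

For example, a seven-frame interval containing a two-frame silence is split when T11 = 5 and T12 = 1, but not when T12 = 3.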

In the conventional devoiced-vowel extraction method described above, setting T12 small increases the number of cases in which a devoiced vowel is extracted in error from an interval that is in fact a single unvoiced consonant; conversely, setting T12 large lowers the extraction rate of devoiced vowels.

The present invention eliminates the drawbacks of the conventional example. An embodiment is described below, concentrating on the points of difference from the conventional example. Fig. 2 shows the recognition flow of one embodiment of the invention; the difference from the conventional flow is only the portion inside the circle (201) shown in Fig. 2. That is, even when the Q interval is short, the maximum and minimum of the power difference values within the Q interval (the frame-to-frame power differences) are found, and when their difference exceeds a predetermined threshold T23, a clear break in the speech is judged to exist there, and the interval is handled in the same way as a long Q interval is handled conventionally.
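The added test can be sketched as follows (an illustrative reading of the description, with names of our choosing; only the threshold T23 comes from the text, and the "difference values" are taken to be successive frame-to-frame power differences):

```python
def has_clear_break(powers, t23):
    """True if the spread between the largest and smallest
    frame-to-frame power difference across the Q interval
    exceeds the threshold T23 -- i.e. the power both falls
    and rises steeply, marking a clear break in the speech."""
    diffs = [b - a for a, b in zip(powers, powers[1:])]
    return max(diffs) - min(diffs) > t23
```

A short Q interval whose power drops by 8 and then recovers by 8 gives a spread of 16 and passes a threshold of 10; a flat interval does not.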

The operation will now be explained using the word kikai ("machine").

The ki of kikai is devoiced in standard Japanese.

Consequently, the preliminary recognition result for kikai is not K I K A I: the I drops out, and the two K's run together.

Fig. 3 shows the preliminary recognition result for the first half of kikai together with the power variation, drawn on the same time axis.

The silent Q is recognized as overlapping the consonant K. This is because a silent portion within a word (more precisely, a portion where the power is below a predetermined threshold) can also arise from instability during consonant articulation, so the power level alone cannot determine whether it is part of a consonant or a syllable boundary. The consonant K and the vowel A are recognized with an overlap of one frame.

This is simply because phoneme boundaries are at present recognized with overlap; there is no special reason for it. Turning now to the Q interval in Fig. 3: if this Q interval is longer than the predetermined threshold, or if the power change within it is large, it is regarded as a syllable boundary; the K of the first half then becomes K followed by UI, the K of the second half becomes the K connected to A, and the devoiced vowel has thus been extracted.
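The overall decision and the resulting split described above can be put schematically (names ours, with T12 and T23 as in the earlier description; this is a sketch of the logic, not the patent's code):

```python
def is_syllable_boundary(q_frames, diff_spread, t12, t23):
    """Boundary if the silence is long (conventional test) OR the
    power change across it is steep (the added test)."""
    return q_frames > t12 or diff_spread > t23

def split_merged_consonant(label, boundary):
    """A merged unvoiced segment such as 'K' becomes K, UI, K when a
    boundary is found inside it (UI standing for the phoneme U or I)."""
    return [label, 'UI', label] if boundary else [label]
```

For kikai, the single merged K is thus replaced by K, UI, K, the final K being the one that leads into A.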

Note that, when computing power, the true power is the integral of the squared instantaneous value over some period of time; however, using the absolute value instead of the square gives almost the same result.
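The two power definitions just mentioned can be written out as follows (a sketch; averaging over the frame is our assumption — the text specifies only squaring versus absolute value):

```python
def frame_power_sq(samples):
    """'True' power: mean of squared instantaneous values over a frame."""
    return sum(s * s for s in samples) / len(samples)

def frame_power_abs(samples):
    """Cheaper variant: mean absolute value, said to give nearly the
    same result for this purpose."""
    return sum(abs(s) for s in samples) / len(samples)
```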

The following table shows the results of recognition experiments carried out with the above embodiment and with the conventional method.

In the table, the input CVC covers only the cases of an unvoiced consonant, a devoiced vowel, and an unvoiced consonant in succession; shown are the number of such inputs correctly recognized as CVC and the number recognized as a single consonant.

The lower row shows the number of cases in which the input was a single unvoiced consonant but the devoiced-vowel extraction method was applied in error, so that the recognition result came out as CVC. The various thresholds were set to their optimum values by preliminary experiments, for the conventional method and the embodiment alike. The results show that the extraction rate of devoiced vowels, conventionally 71.5%, rose to 98.7% with the present invention. Meanwhile, erroneous application of the devoiced-vowel extraction routine, which formerly amounted to 12% of the correctly applied cases, was eliminated entirely. Although reducing erroneous application is not a direct object of the present invention, tightening the threshold on the length of the Q interval is what made this reduction possible.

Thus, according to the present invention, the extraction rate of devoiced vowels is greatly improved while the erroneous-extraction rate is greatly reduced. Such an improvement in the recognition of devoiced vowels translates directly into an improvement in the word recognition rate, so the effect of the invention is substantial.

[Brief Description of the Drawings]

Fig. 1 is a flow diagram of the conventional speech recognition method; Fig. 2 is a flow diagram of the speech recognition method of one embodiment of the present invention; and Fig. 3 is a diagram showing an example of the preliminary phoneme recognition result and the power variation of input speech.

Claims (1)

[Claims]

1. A speech recognition method characterized in that, when an unvoiced sound continues for longer than the usual length of one phoneme, a dip in speech power exists within this unvoiced interval, and the rate of change of the power with respect to time at the dip is greater than a predetermined value, phoneme recognition is performed by treating the unvoiced interval not as a single unvoiced consonant but as a sequence of three phonemes, namely an unvoiced consonant, a devoiced vowel, and an unvoiced consonant.
JP1087482A 1982-01-28 1982-01-28 Voice recognition Granted JPS58129498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP1087482A JPS58129498A (en) 1982-01-28 1982-01-28 Voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP1087482A JPS58129498A (en) 1982-01-28 1982-01-28 Voice recognition

Publications (2)

Publication Number Publication Date
JPS58129498A true JPS58129498A (en) 1983-08-02
JPS637399B2 JPS637399B2 (en) 1988-02-16

Family

ID=11762475

Family Applications (1)

Application Number Title Priority Date Filing Date
JP1087482A Granted JPS58129498A (en) 1982-01-28 1982-01-28 Voice recognition

Country Status (1)

Country Link
JP (1) JPS58129498A (en)

Also Published As

Publication number Publication date
JPS637399B2 (en) 1988-02-16

Similar Documents

Publication Publication Date Title
CN107045870B (en) Speech signal endpoint detection method based on characteristic value coding
Sarmah et al. Contextual variation of tones in Mizo
JPS58129498A (en) Voice recognition
JPS58129499A (en) Voice recognition
JPS5925240B2 (en) Word beginning detection method for speech sections
JPS6039691A (en) Voice recognition
Elghonemy et al. Speaker independent isolated Arabic word recognition system
JPS59143200A (en) Continuous voice recognition system
JPS5969798A (en) Extraction of pitch
JPH0229229B2 (en)
KR100263297B1 (en) A method to set up units for speech recognition using pseudo morpheme
JP3253753B2 (en) Formatting method and apparatus for text to be read aloud
JPS5872995A (en) Word voice recognition
Salam Recognition of Holy Quran Recitation Rules Using Phoneme Duration
JPS5978399A (en) Recognition of word voice
JPS6033599A (en) Voice recognition equipment
JPS59204099A (en) Voice recognition system
JPS63247798A (en) Voice section detecting system
JPS62223798A (en) Voice recognition equipment
JPS617894A (en) Voice recognition
JPS617896A (en) Word voice recognition method
JPS6147992A (en) Voice recognition system
JPH0792675B2 (en) Voice recognizer
JPS60172098A (en) Monosyllabic voice recognition equipment
JPS6075890A (en) Recognition of vowel