JPH0573035B2


Info

Publication number: JPH0573035B2
Application number: JP59238765A
Authority: JP
Inventors: Koichi Yamaguchi
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1984-11-12
Filing date: 1984-11-12
Publication date: 1993-10-13
Also published as: JPS61116400A

Description

DETAILED DESCRIPTION OF THE INVENTION <Technical Field> The present invention relates to a speech information processing device that uses zero-crossing information.

<Prior Art> The zero-crossing count of a waveform (hereinafter ZCC) can be obtained with a simple circuit using a comparator, and it also conveys a certain amount of information about the speech spectrum. Its properties have therefore long been studied, and it has been used as a means of speech analysis. In practice, however, simply taking the zero crossings of the input signal causes zero crossings to be detected even in portions that contain no speech, because of noise and the like in the input signal. Pure ZCC alone is therefore not effective for speech recognition, in particular for separating speech (voiced) sections from silent sections; ZCC provides reliable information only in portions of a speech section where the amplitude is at least moderately large.

To avoid counting noise in silent sections, conventional methods give the comparator hysteresis or shift the comparator's reference voltage away from the mean value of the input signal (see, for example, Japanese Unexamined Patent Publication No. Sho 58-116595). With this, the comparator can be set so that it does not respond in noise portions whose amplitude is small compared with speech, that is, in silent sections. The information obtained by this method is called the level-crossing count (LCC).
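The difference between ZCC and LCC can be illustrated in software. The following is a minimal sketch, not the comparator circuit of the patent: it assumes a symmetric hysteresis band of width `cross_level` around zero, and the function name `count_crossings` and the sample values are hypothetical.

```python
def count_crossings(samples, cross_level=0.0):
    """Count output transitions of a software comparator.

    ASSUMPTION: a symmetric hysteresis band [-cross_level, +cross_level].
    With cross_level == 0 this reduces to a plain zero-crossing count (ZCC);
    with cross_level > 0 it behaves like a level-crossing count (LCC).
    """
    state = None  # comparator output is unknown until the band is left once
    count = 0
    for x in samples:
        if x > cross_level:
            new_state = True
        elif x < -cross_level:
            new_state = False
        else:
            new_state = state  # inside the band: hold the previous output
        if state is not None and new_state != state:
            count += 1
        state = new_state
    return count


# Hypothetical sample values: low-amplitude noise and a louder speech-like signal.
noise = [0.02, -0.03, 0.01, -0.02, 0.03, -0.01]
speech = [0.5, -0.6, 0.7, -0.4]
print(count_crossings(noise), count_crossings(noise, 0.1))
print(count_crossings(speech), count_crossings(speech, 0.1))
```

In this toy input, the noise produces crossings only when the band is zero, while the louder signal gives the same count either way, which is the behavior the paragraph above describes.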

Strictly speaking, LCC differs from ZCC, but because it nearly matches ZCC where the amplitude is reasonably large, such as in speech sections, it may be treated as information equivalent to ZCC in speech recognition and the like. However, if the comparator's reference voltage (or hysteresis; this value is hereinafter called the cross level) is made too large, the comparator fails to respond in small-amplitude consonant portions and the like, and zero-crossing information can be lost. In particular, fricatives have their energy at 3 kHz to 6 kHz, so their ZCC is very high and differs markedly from that of other phonemes; but because their amplitude is small, their LCC may be small when the cross level is large. Conversely, if the cross level is too small, the comparator responds to noise in silent sections, making them hard to separate from speech sections. These are the problems of LCC: when LCC is used as a feature for speech recognition and the like, word-beginning and word-end detection errors and consonant misjudgments occur, resulting in errors and rejects.

<Object of the Invention> An object of the present invention is to provide a speech information processing device that automatically corrects the cross level of a comparator according to the magnitude of the signal in ambient-noise sections that contain no speech, thereby eliminating the conventional drawbacks described above.

<Embodiment> An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram showing an example configuration of a speech recognition device using LCC. In the figure, a speech signal input from a microphone 1 is amplified by a preamplifier 2, passed through several bandpass filters 31 and 32 (two in this example), and then input to comparators 41 and 42, respectively. Each of the comparators 41 and 42 compares its input with a reference voltage that is the output of cross-level conversion units 51 and 52, and the comparison output is input to a microcomputer 6. The result is displayed on an output unit 7 or the like. Alternatively, the reference voltage may be fixed at 0 and the conversion units 51 and 52 connected so as to vary the hysteresis of the comparators 41 and 42. The inputs to the conversion units 51 and 52 are sent from output ports of the microcomputer 6. In this example, each input is 3 bits wide, so the cross level can be varied in 8 steps.

FIG. 2 shows a specific example circuit for the cross-level conversion units 51 and 52. The integrated circuit IC in the figure is configured as an 8-channel multiplexer; a 3-bit digital signal (0 to 7) applied to its control terminals turns on the switch of the corresponding channel. This switches the tap of an externally connected resistor R and thereby adjusts the cross level of the comparators 41 and 42.
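As a rough software analogy of the tap selection, the sketch below maps a 3-bit code to a reference voltage. The source does not give resistor values or supply voltages, so the equal-step divider, the 1.0 V top voltage, and the names `tap_voltages` and `select_cross_level` are all assumptions made only for illustration.

```python
def tap_voltages(v_top=1.0, n_taps=8):
    """Voltages available at the taps of a resistor chain.

    ASSUMPTION: equal-valued resistors, so taps are evenly spaced from 0 V
    upward. The patent does not specify the actual resistor values.
    """
    return [v_top * k / n_taps for k in range(n_taps)]


def select_cross_level(code, taps):
    """Return the tap voltage selected by a 3-bit control code (0 to 7)."""
    if not 0 <= code < len(taps):
        raise ValueError("control code out of range")
    return taps[code]


taps = tap_voltages()
print(select_cross_level(4, taps))
```

A non-uniform tap spacing would fit the same interface; only the list returned by `tap_voltages` would change.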

FIG. 3 is a block diagram functionally showing the processing flow of this recognition device. In the figure, the speech signal input from the microphone 1 is analyzed by an acoustic analysis unit 11. The acoustic analysis unit 11 corresponds to the bandpass filters 31 and 32, the comparators 41 and 42, and the cross-level conversion units 51 and 52 of FIG. 1. The analysis outputs, for example, a time series of feature vectors, one per unit time (frame). The resulting vector time series then enters a pattern conversion unit 12, which, referring to standard patterns 13 of phonemes and the like, uses techniques such as segmentation to express it as a sequence of phonemes or equivalent labels. A word detection unit 14 detects the speech section based on that sequence.

In the case of word speech recognition, the word pattern of the input speech is sent to a matching unit 15, which performs recognition by referring to word standard patterns 16, and a judgment result output unit 17 displays or transmits the result. A noise adaptation unit 18, referring to the result of the word detection unit 14, corrects the cross level of the acoustic analysis unit 11 based mainly on the time series of feature vectors in portions with no speech section, that is, noise sections. When it judges that the environmental noise is too loud for recognition, it shows an input-not-accepted indication 19.

The noise adaptation unit 18 is now described in more detail. An adaptation algorithm must meet the following requirements.

1) For practical use, the faster the cross level converges to its optimum, the better.

2) The processing load of adaptation must be small enough to be handled by a microcomputer.

3) Weak fricatives must not be judged to be noise and adapted to.

4) Sudden noises must be ignored.

5) When a long-lasting utterance is input, the recognition device issues a reject because of the duration limit; this must be distinguished from loud noise.

6) Physiological noises such as breath sounds and tongue clicks are sometimes present before and after an utterance. Adapting in these noise sections makes the cross level larger than intended, so adaptation is best avoided in these sections.

Some of the above requirements conflict, so not all of them can be fully satisfied, but the algorithm must satisfy each to some degree. The key point is that when the cross level is appropriate or too large, silent portions (hereinafter S) can be extracted by pattern conversion, and even when the cross level is too small, S can still be extracted if the amplitude of the ambient noise is small. However, when the cross level is too small and the ambient noise exceeds a certain level, S is not necessarily extracted, so this case must be handled specially.

The former case, in which S is extracted and the word beginning and word end are detected, is called case A; the latter, in which S is not extracted and a reject occurs because the duration is too long, is called case B. The adaptation unit is thus divided into two parts, A and B. FIG. 4 shows a flowchart of the noise adaptation unit 18. In the figure, word-beginning and word-end information is sent from the word detection unit 14. Case A adapts in the silent sections between utterances. Case B changes the cross level once the duration after word-beginning detection reaches or exceeds a reference value (in this example, whether the signal is speech or noise is not determined).

In the figure, if no word beginning is present, nothing has been uttered yet, so the case A routine is entered. If one is present, an utterance has been made, and the result of the word-end judgment is examined. If no word end has been detected, the noise adaptation unit 18 is exited. If one has been detected, the device is considered to be operating normally, so the case A routine is entered. An ordinary word speech recognition device sets a minimum and a maximum for the duration of an uttered word and rejects input signals outside that range; when the duration is below the minimum, case A is entered, and when it exceeds the maximum, case B is entered.
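The branching just described can be sketched as a small dispatch function. This is an illustrative reading of the text, not a transcription of FIG. 4: the order in which the conditions are tested, the names `Route` and `route`, and the default duration limits of 0.1 s and 2.0 s are assumptions.

```python
from enum import Enum


class Route(Enum):
    CASE_A = "case A"
    CASE_B = "case B"
    EXIT = "exit"


def route(word_start, word_end, duration, min_dur=0.1, max_dur=2.0):
    """Pick the branch of the noise adaptation unit; returns (route, rejected).

    ASSUMPTIONS: min_dur and max_dur are hypothetical limits in seconds, and
    the over-length check is made before the word-end check.
    """
    if not word_start:
        return Route.CASE_A, False   # nothing uttered yet
    if duration > max_dur:
        return Route.CASE_B, True    # reject: duration too long
    if not word_end:
        return Route.EXIT, False     # utterance still in progress
    # Word end found: normal operation, or a too-short input that is rejected.
    return Route.CASE_A, duration < min_dur


print(route(False, False, 0.0))
print(route(True, False, 2.5))
```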

In the case A routine, adaptation A is performed after S has continued for a reference time T_B (about 0.5 seconds) from the word end or the reject. T_B is provided, rather than adapting immediately after the word end, to avoid the physiological noise of requirement 6) above. Adaptation A takes the sum (denoted X) of the feature vectors of a fixed number of frames (for example, 16 frames) belonging to S after T_B has elapsed, maps that value through a predetermined function or table T, and adds the result to the current cross level LV to obtain the new cross level LV′. That is,

LV′ = LV + T(X) ……(1)

Once adaptation A has been applied, no feature vectors are collected for a fixed number of frames (for example, 16 frames) from that point. This is to satisfy requirements 4) and 3). Thereafter, collection, adaptation A, and pause are repeated in the same way until a word beginning is detected. The maximum cross level is set in advance to the largest level that does not impair recognition performance; if LV′ exceeds that maximum, an indication that input cannot be accepted is shown. At the same time, the input pattern may be withheld from the matching unit 15 of FIG. 3. In equation (1), if LV is appropriate, T(X) takes a value close to 0; if LV is too large, T(X) is negative and LV′ decreases; if LV is too small, T(X) is positive and LV′ increases.
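Equation (1) can be sketched as follows. The source does not disclose the contents of T, so `example_table` and its thresholds are hypothetical; the sketch also assumes a scalar feature per frame (for example, the crossing count of one band), an 8-step cross level from 0 to 7, and clamping of LV′ to that range.

```python
def example_table(x, low=2, high=10):
    """Hypothetical stand-in for T: +1 when the silent-section sum is large
    (cross level too small), -1 when it is very small (cross level too
    large), 0 otherwise. The thresholds are illustrative only."""
    if x > high:
        return 1
    if x < low:
        return -1
    return 0


def adapt_a(lv, frame_counts, table=example_table, lv_max=7):
    """Apply LV' = LV + T(X); returns (new_lv, input_ok).

    ASSUMPTIONS: frame_counts holds one scalar feature per frame of S, and
    the result is clamped to 0..lv_max. input_ok is False when LV' would
    exceed the maximum, i.e. when input cannot be accepted.
    """
    x = sum(frame_counts)               # X: sum over the collected frames
    new_lv = lv + table(x)
    input_ok = new_lv <= lv_max
    new_lv = max(0, min(new_lv, lv_max))
    return new_lv, input_ok


print(adapt_a(3, [1] * 16))   # noisy silence: the level is raised
print(adapt_a(7, [1] * 16))   # already at maximum: input not accepted
```

Using a table rather than a formula keeps the per-update cost to one sum and one lookup, in line with requirement 2).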

In the case B routine, adaptation B is applied when, even after a reject due to excessive duration has occurred, the utterance still continues for a certain time T_N (about 1 to 1.5 seconds), that is, when no word end is detected. In adaptation B, feature vectors cannot be collected from an S section, so a fixed amount K is added to LV.

LV′ = LV + K ……(2)

After adaptation B has been applied, if a word end is detected, processing returns to the case A routine. The input pattern at this time has, of course, been rejected. If instead T_N elapses again without a word end being detected, adaptation B is performed again, and this operation is repeated thereafter. When LV′ exceeds the maximum cross level, the handling is the same as in case A. The above description of case B applies when the recognition device has no function for determining whether the overlong input signal is due to an utterance or is purely ambient noise; if it has such a function, adaptation B may be activated immediately after the reject occurs.
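The repeated application of equation (2) can be sketched as a loop over elapsed T_N intervals. The function name `adapt_b_steps`, the step K = 1, the 0 to 7 level range, and the clamping at the maximum are assumptions; only the T_N range of roughly 1 to 1.5 seconds comes from the text.

```python
def adapt_b_steps(lv, overrun_seconds, t_n=1.0, k=1, lv_max=7):
    """Apply LV' = LV + K once per T_N of continued signal after the reject.

    overrun_seconds: time the signal keeps going after the too-long reject
    without a word end being detected.
    ASSUMPTIONS: k and lv_max are hypothetical; LV is clamped at lv_max and
    input_ok turns False once LV' would exceed it.
    Returns (new_lv, input_ok).
    """
    steps = int(overrun_seconds // t_n)  # number of full T_N intervals elapsed
    input_ok = True
    for _ in range(steps):
        lv += k
        if lv > lv_max:
            lv = lv_max
            input_ok = False             # same handling as in case A
            break
    return lv, input_ok


print(adapt_b_steps(2, 3.5))
print(adapt_b_steps(6, 2.0))
```

Once a word end finally appears, control would return to the case A routine, which can then fine-tune the level with equation (1).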

FIGS. 5 and 6 show how adaptation A and adaptation B take place. 100 denotes the collection of feature vectors, 101 the processing of adaptation A, and 102 the processing of adaptation B. In FIG. 5, collection 100 and adaptation A 101 occur once T_B has elapsed after the end of the utterance. A sudden noise occurs partway through, but because its duration is short it is rejected, and case A resumes once T_B has elapsed from its end. In FIG. 6, large-amplitude noise continues for a long time, a reject occurs, and the noise then continues for a further T_N, so adaptation B 102 operates. S then appears, is judged to be a word end, and case A follows.

The above describes a speech recognition device, but the present invention can also be used in a speech synthesis device that uses LCC to determine voiced, unvoiced, silent, and similar states, where it helps improve the quality of the synthesized sound and reduce the bit rate.

<Effects of the Invention> As detailed above, the present invention is characterized by automatically changing the cross level of the comparator according to the ambient noise. Setting the cross level manually requires considerable skill and burdens the user, so the benefit of this automation is very large. Furthermore, because the invention requires little processing and can be realized with a simple circuit, it can be built into speech recognition and speech synthesis devices that use a microcomputer, providing a useful speech information processing device.

[Brief Description of the Drawings]

FIG. 1 is a block diagram showing an embodiment of the present invention, FIG. 2 is a block diagram showing a specific example of the main part of FIG. 1, FIG. 3 is a diagram functionally showing the processing flow, FIG. 4 is a flowchart explaining the main part of FIG. 3 in more detail, and FIGS. 5 and 6 are diagrams showing adaptation examples together with waveforms.

41, 42 ... comparators, 51, 52 ... cross-level conversion units, 6 ... microcomputer, 11 ... acoustic analysis unit, 12 ... pattern conversion unit, 13 ... standard patterns of phonemes and the like, 14 ... word detection unit, 15 ... matching unit, 16 ... word standard patterns, 18 ... noise adaptation unit, 100 ... feature vector collection, 101 ... adaptation A, 102 ... adaptation B.

Claims

[Scope of Claims] 1. A speech information processing device using zero-crossing information, comprising: input means for inputting a speech signal; an acoustic analysis unit that outputs the speech input signal from the input means as a time series of feature vectors, one per unit time; a pattern conversion unit that expresses the time series as a sequence of phonemes or equivalent labels while referring to standard patterns of phonemes and the like; a word detection unit that detects the speech section of the input signal based on the label sequence; a matching unit that performs word speech recognition by referring to word standard patterns based on the detection result; a judgment result output unit that outputs the matching result of the matching unit; means for comparing the input signal with a reference value biased by a fixed amount; means for correcting the reference value according to the magnitude of the signal in ambient-noise sections that contain no speech; and a noise adaptation unit that has a function of detecting the speech section and that, when a word end of the speech input signal is detected, analyzes the ambient-noise section after a fixed time has elapsed from the word end and starts correcting the reference value according to the result, and, when the duration of the speech input signal is too long and a predetermined time or more elapses without a word end being detected, increases the reference value at that time by a predetermined amount.