JP2891259B2

JP2891259B2 - Voice section detection device

Info

Publication number: JP2891259B2
Application number: JP62079673A
Authority: JP
Inventors: 教幸藤本
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1987-04-02
Filing date: 1987-04-02
Publication date: 1999-05-17
Anticipated expiration: 2014-05-17
Also published as: JPS63247798A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Summary]
In a voice section detection device for a speech recognition device, the voice-section threshold at the beginning of a word in the speech signal is set small to lower the probability of speech dropout, and the threshold at the end of the word is set larger than that threshold to lower the probability of noise addition.

[Field of Industrial Application]
The present invention relates to a voice section detection device, and more particularly to a detection device that makes it easier to distinguish voice sections from silent sections and noise in computer-based speech recognition.

[Prior Art]
Methods that use a computer to extract features from human speech and detect it automatically are already widely applied.
One typical technique performs segmentation, dividing a continuously uttered speech signal into monosyllables or phonemes, and then recognizes each monosyllable. Monosyllable recognition can in turn be extended to more advanced word recognition and conversational speech recognition. No method yet achieves perfect segmentation, but one known approach regards a monosyllable as speech when its power value exceeds a predetermined threshold. That is, when the power value stays above the power threshold (P_L) for at least a fixed duration (L_v), that section is regarded as speech.

FIGS. 5(a) to 5(c) are example patterns showing the relationship between the power value (P) of a speech signal and the utterance time (T). Here T₀ is the voice section. (a) is the case where, for example, “a”, “o”, “mo”, “ri” are uttered: the power of every monosyllable stays at or above threshold P_L for the voice-section time threshold L_v, so recognition poses no problem. In (b), where for example “a”, “i”, “chi” are uttered, a time threshold L_S is set for silent sections, and when a silent section is no longer than L_S, “a”, “i”, “chi” are regarded as belonging to a single utterance.
The silent section L_S in this case can be regarded as low-power speech (at or below threshold P_L). In (c), where for example “sa”, “っ” (small tsu), “po”, “ro” are uttered, the “っ” section is at or below threshold P_L and lasts at least the time threshold L_S, so it is hard to judge whether it is speech or noise.

FIGS. 6(a) to 6(d) are pattern diagrams explaining the conventional detection method. In (a), the whole of voice section T₀ is at or above threshold P_L, so recognition poses no problem. In (b), section T₁ is no longer than the voice-section time threshold L_v, so it is regarded as noise and not as a voice section. In (c), sections T₂ and T₃ are longer than threshold L_v and are regarded as voice sections, while section T₄ is no longer than the silent-section threshold L_S and is therefore not regarded as noise. Consequently, in this case the section (T₂ + T₄ + T₃) is regarded as the voice section.
In (d), sections T₅ and T₇ are no longer than threshold L_v and are regarded as noise, while section T₆ is at least as long as L_v and is regarded as a voice section.

[Problems to Be Solved by the Invention]
Detection by the above method, however, has the following problem: speech tends to be dropped at the beginning (start end) of the speech signal, and noise tends to be added at its end (terminal end). Two factors cause this difference between the start end (word beginning) and the terminal end (word ending). First, in Japanese the first syllable of a word tends to be uttered short and the final syllable long. Second, utterance becomes unstable at the end of a word, so that many small peaks appear after the power value has once dropped. In the latter case the sound is produced by the speaker and could be regarded as speech, but including this part in the voice section causes misrecognition, so including it is undesirable.

[Means for Solving the Problems and Operation]
The present invention provides a voice section detection device that solves the above problems. Its principle is to use different voice-section time thresholds at the word beginning and the word ending when detecting speech (especially word speech). Specifically, a small first threshold L_v is set for the word beginning, and a second threshold larger than the first is set for the word ending. This reduces the speech dropout at word beginnings and the noise addition at word endings that have been problems in the past, and markedly improves the accuracy of voice section detection.

FIGS. 1(a) and 1(b) are characteristic diagrams explaining the principle of the present invention. (a) is the case of the word beginning and (b) the case of the word ending. In both, the vertical axis PRO is the probability of noise addition and the probability of speech dropout, and the horizontal axis L_v is the voice-section time threshold. I_a and I_b are the noise-addition probability curves, II_a and II_b are the speech-dropout probability curves, and III_a and III_b are the error probability curves used to obtain the optimal value of L_v.

In (a), for the word beginning, the larger the threshold L_v, the lower the probability of noise addition, as curve I_a shows; conversely, the probability of speech dropout rises sharply, as curve II_a shows. As the threshold is made smaller, the probability of noise addition rises sharply and the probability of speech dropout falls.
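The opposing trends described for the word beginning can be summarized in one expression. As a sketch only: the source presents the curves graphically, so writing them as functions of the time threshold and using an argmin is notation assumed here, not taken from the patent.

```latex
\mathrm{III}_a(L_v) = \mathrm{I}_a(L_v) + \mathrm{II}_a(L_v),
\qquad
L_{va} = \arg\min_{L_v} \mathrm{III}_a(L_v)
```

Because I_a falls and II_a rises as L_v grows, their sum has an interior minimum, and that minimum is the threshold chosen for the word beginning. The same form with subscript b would apply to the word ending.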
From these curves, curve III_a, which is the sum of curves I_a and II_a, has a minimum as shown. Letting L_va be the threshold at this minimum, L_va is the optimal threshold for the word beginning, an effective value at which the probability of noise addition and the probability of speech dropout are balanced. L_va varies with the noise environment and other conditions, but is around 70 ms.

(b) shows the case of the word ending. Here L_v is shifted toward larger values than for the word beginning. The pattern is the same as in (a), so a detailed explanation is omitted; L_vb is the optimal threshold for the word ending and is around 125 ms. That is, for the word-ending threshold L_vb, 125 ms is the effective value at which the probability of noise addition and the probability of speech dropout are balanced.

It was thus found that using different thresholds for the word beginning and the word ending in voice section detection makes it possible to perform detection in which both the probability of noise addition and the probability of speech dropout are low.

[Embodiment]
FIG. 2 is a schematic configuration diagram of an apparatus implementing the voice section detection device of the present invention. The speech signal input from microphone 21 is high-frequency emphasized in pre-emphasis unit 22. On one path, power value extraction unit 23 then extracts the energy distribution, one of the feature parameters of speech, as a time series by sampling, while band-pass filter unit 24, which consists of a plurality of filters, performs feature extraction. Section detection unit 26 detects the voice section based on the time series PW(i) of power values, as shown in FIG. 3 described later. Speech recognition output unit 27 has a speech dictionary, performs pattern matching with reference to it, and outputs the recognition result from speaker 28. Control unit 25 controls section detection unit 26, speech recognition output unit 27, and so on.

FIG. 3 is a block diagram showing section detection unit 26 of FIG. 2 in detail. In FIG. 3, 261 is a start-end detection unit that detects the word beginning (start end), 262 is a terminal-end detection unit that detects the word ending (terminal end), and 263 is a threshold storage unit that stores the various threshold data P_L, L_va, L_vb, L_S, and so on. Power value samples PW(i), taken for example every 10 ms, are input serially to start-end detection unit 261 and terminal-end detection unit 262 from power value extraction unit 23 in the preceding stage. For each frame, start-end detection unit 261 compares the power threshold P_L read from threshold storage unit 263 with the power time series PW(i), and further compares the first threshold L_va for the word beginning and the silent-section threshold L_S with the position of the sampling frame. Terminal-end detection unit 262 likewise compares the power threshold P_L with the time series PW(i) for each frame, and further compares the second threshold L_vb for the word ending and the silent-section threshold L_S with the position of the sampling frame. Terminal-end detection unit 262 combines these data with those of start-end detection unit 261 and outputs start and end position information S to speech recognition output unit 27.

FIG. 4 is a flowchart of the processing in the section detection unit of FIG. 3. Steps 1 to 9 in the first half of the flowchart are the processing in start-end detection unit 261, and steps 10 to 21 in the second half are the processing in terminal-end detection unit 262. In FIG. 4, i is the sampled frame number, i_s is the frame number at which the threshold crossing starts, j is the number of consecutive frames exceeding the threshold on the start-end side, i_e is the frame number at which the threshold crossing ends, and k is the number of consecutive frames falling below the threshold on the terminal-end side.

As the flowchart shows, the time series PW(i) of power values from power value extraction unit 23 is compared with the power threshold P_L for each frame (step 3), and step 2 is repeated while PW(i) < P_L. When PW(i) ≥ P_L, that frame number is stored as i_s, and steps 6 and 7 are repeated while PW(i) ≥ P_L continues. When PW(i) < P_L in step 8, the frame count is checked against the word-beginning threshold L_va (step 9): if the frame count j exceeds L_va, processing moves on to terminal-end processing; if not, it is assumed that speech has not yet been input and processing returns to step 2.

Similar steps are taken at the terminal end. When PW(i) < P_L in step 14, step 21 judges whether the frames form a silent section L_S. If they do not, that is, if the frame count k is larger than L_S, voice section detection ends in the start-end detection unit; if k is smaller and the frames form a silent section, processing returns to step 12. If PW(i) ≥ P_L in step 19, step 20 checks against the word-ending threshold L_vb, and if L_vb is larger than the frame count j, step 21 checks the silent-section threshold L_S and voice section detection ends. As a result, the start frame of the speech is obtained as i_s and the end frame as i_e.

[Effects of the Invention]
As described above, according to the present invention, different thresholds are used for the word beginning and the word ending in voice section detection, so that speech dropout at the word beginning and noise addition at the word ending can be markedly reduced and the accuracy of voice section detection can be markedly improved.
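The two-threshold idea can be illustrated with a short Python sketch. This is one reading of the principle under stated assumptions, not a transcription of FIG. 4: the function names `find_runs` and `detect_section` are hypothetical, it works on whole runs of frames rather than the counters j and k, run lengths are compared with "greater than or equal", and the frame counts used (7 and 13 frames for roughly 70 ms and 125 ms at 10 ms per frame, and 10 frames for L_S) are illustrative values; the source gives no value for L_S.

```python
from typing import List, Optional, Sequence, Tuple


def find_runs(power: Sequence[float], p_l: float) -> List[Tuple[int, int]]:
    """Return (first, last) frame indices of each run with power >= p_l."""
    runs = []
    start = None
    for i, pw in enumerate(power):
        if pw >= p_l:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - 1))
            start = None
    if start is not None:
        runs.append((start, len(power) - 1))
    return runs


def detect_section(
    power: Sequence[float], p_l: float, l_va: int, l_vb: int, l_s: int
) -> Optional[Tuple[int, int]]:
    """Return (i_s, i_e) for the detected voice section, or None.

    l_va: minimum run length (frames) accepted as the word beginning.
    l_vb: minimum run length (frames) accepted after the beginning.
    l_s:  longest silent gap (frames) still treated as the same utterance.
    """
    i_s = i_e = None
    for first, last in find_runs(power, p_l):
        length = last - first + 1
        if i_s is None:
            # Start-end side: small threshold, so short first syllables survive.
            if length >= l_va:
                i_s, i_e = first, last
            continue
        if first - i_e - 1 > l_s:
            break  # silence too long: the utterance has ended
        # Terminal-end side: larger threshold, so small trailing peaks are skipped.
        if length >= l_vb:
            i_e = last
    return None if i_s is None else (i_s, i_e)


# Hypothetical 10 ms frames: short first syllable, long syllable, trailing peak.
power = [0] * 3 + [5] * 8 + [0] * 4 + [5] * 14 + [0] * 3 + [2] * 8 + [0] * 5

print(detect_section(power, 1.0, l_va=7, l_vb=13, l_s=10))   # (3, 28)
print(detect_section(power, 1.0, l_va=7, l_vb=7, l_s=10))    # (3, 39)
print(detect_section(power, 1.0, l_va=13, l_vb=13, l_s=10))  # (15, 28)
```

With a single small threshold the trailing peak is added to the section, and with a single large threshold the short first syllable is dropped; only the asymmetric setting returns the intended frames 3 to 28 in this made-up example.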

BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a characteristic diagram explaining the principle of the present invention; FIG. 2 is a configuration diagram of an apparatus according to one embodiment of the present invention; FIG. 3 is a detailed diagram of the section detection unit of FIG. 2; FIG. 4 is a processing flowchart of the present invention; FIG. 5 is a pattern diagram showing the relationship between the power value of speech and the utterance time; and FIG. 6 is a pattern diagram explaining the conventional detection method.

(Explanation of reference numerals) 21: microphone, 22: pre-emphasis unit, 23: power value extraction unit, 24: band-pass filter unit, 25: control unit, 26: section detection unit, 27: speech recognition output unit, 28: speaker, 261: start-end detection unit, 262: terminal-end detection unit, 263: threshold storage unit.

Claims

(57) [Claims] A voice section detection device for a speech recognition device, comprising: a start-end detection unit that compares the speech power value at the start end of an input speech signal with a predetermined power threshold and with a first time threshold for the voice section; a terminal-end detection unit that compares the speech power value at the terminal end of the speech signal with the predetermined power threshold and with a second time threshold for the voice section, the second time threshold being larger than the first time threshold; and a threshold storage unit that stores the first and second time thresholds; wherein, when detecting a voice section, the device compares against the first time threshold at the start end of the speech signal and against the second time threshold at the terminal end of the speech signal to detect the voice section.