JPH0222960B2

Info

Publication number: JPH0222960B2
Application number: JP59056622A
Authority: JP
Inventors: Hideji Morii; Satoshi Fujii; Masakatsu Hoshimi
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1984-03-23
Filing date: 1984-03-23
Publication date: 1990-05-22
Also published as: JPS60200300A

Description

Detailed Description of the Invention

Field of Industrial Application

The present invention relates to a device that detects the start and end points of speech, for use in a speech recognition device.

Configuration of the Conventional Example and Its Problems

A known conventional method for detecting the start and end points of speech uses the signal energy together with the zero-crossing count. It is described in Yasunaga Niimi, "Speech Recognition", Kyoritsu Shuppan (1979), and in L. R. Rabiner and M. R. Sambur, "An algorithm for determining the endpoints of isolated utterances", Bell Syst. Tech. J. (1975).

The zero-crossing count is the average number of zero crossings, over an interval of fixed duration, of the zero-crossing wave obtained by keeping only the sign of the signal, that is, by quantizing its amplitude to 1 bit. For a signal with spectral structure, such as speech, the zero-crossing count corresponds well to the dominant frequency component of the spectrum. FIGS. 1a to 1c show distributions of the zero-crossing count of speech signals: a for silence, b for unvoiced sounds, and c for voiced sounds. As the figures show, the count is small for sounds dominated by low-frequency components, such as voiced sounds (FIG. 1c), and large for sounds dominated by high-frequency components, such as unvoiced sounds (FIG. 1b). The conventional start/end detection method exploits the zero-crossing count to improve the detection of unvoiced consonants, whose signal energy is small but whose zero-crossing count is large.
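The two per-frame features discussed above can be sketched as follows. This is a minimal illustration, not the circuit of the cited works: the function names `zero_crossings` and `frame_energy` are hypothetical, a sample equal to zero is treated as positive by assumption, and the count is returned per frame rather than averaged.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples.

    Only the sign of each sample is kept (1-bit quantization);
    a sample equal to zero is treated as positive (an assumption).
    """
    signs = [1 if x >= 0 else -1 for x in frame]
    return sum(1 for a, b in zip(signs, signs[1:]) if a != b)


def frame_energy(frame):
    """Sum of squared samples over one frame."""
    return sum(x * x for x in frame)
```

A frame dominated by a high-frequency component yields a larger count than one dominated by a low-frequency component, which is the property the conventional method relies on.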

The conventional method for detecting the start and end points of speech is described below with reference to the drawings.

FIG. 2 shows the configuration of the conventional example, and FIG. 3 shows an example illustrating the operation of its start/end detection method. A signal containing speech is converted, frame by frame (for example, 10 msec per frame), into two feature parameters by the energy calculation unit 1 and the zero-crossing count calculation unit 2 shown in FIG. 2: the signal energy E(n), where n is the frame number, and the zero-crossing count Nz(n). Reference numeral 3 denotes a start/end candidate determination unit, which uses the energy level of the signal to find the portion that is certainly speech. It applies two thresholds E₁ and E₂ (E₁ > E₂) to the signal energy E(n) to obtain a start candidate n₁ and an end candidate n₂. As shown in the example of FIG. 3a, when the energy exceeds E₂ and then exceeds E₁ without first falling to E₂ or below, the signal is considered to have entered a speech interval, and the point at which it exceeded E₂ is taken as the start candidate n₁. The end candidate n₂ is determined in the same way with the time axis reversed.

Reference numeral 4 in FIG. 2 denotes a start/end determination unit. Using the zero-crossing count Nz(n) computed by the zero-crossing count calculation unit 2 and a threshold No, it checks whether an unvoiced sound, which has small energy E(n) but a large zero-crossing count Nz(n), lies outside the start/end candidates (n₁, n₂) set by the start/end candidate determination unit 3. As shown in the example of FIG. 3b, it counts the frames, in an interval of several frames preceding the start candidate n₁, in which Nz(n) exceeds No. If that count is at least a fixed value (for example, 3), an unvoiced sound is assumed to precede n₁, and the start point is moved to n′₁, the first frame that exceeded No. The end point is handled in the same way; FIG. 3b, however, shows a case in which the end point n₂ is left unchanged. The final start and end points of the speech (n′₁, n₂) are determined in this manner.
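The conventional start-point procedure described above can be sketched roughly as follows. The function names are hypothetical, and the `lookback` length is an assumption, since the text only says "several frames". The end point would be found by applying the same functions to the time-reversed sequences.

```python
def start_candidate(energy, e1, e2):
    """Return the frame where energy rose above e2 on its way to e1, or None.

    The candidate is dropped whenever the energy falls back to e2 or below
    before e1 has been exceeded.
    """
    crossing = None
    for n, e in enumerate(energy):
        if e <= e2:
            crossing = None          # fell back: discard the pending crossing
        elif crossing is None:
            crossing = n             # first frame above e2
        if crossing is not None and e > e1:
            return crossing
    return None


def extend_start(n1, zc, n0, lookback=10, min_frames=3):
    """Move the start earlier if an unvoiced sound appears to precede n1.

    Counts the frames in the `lookback` frames before n1 whose zero-crossing
    count exceeds n0; if there are at least `min_frames`, the first such
    frame becomes the start point.
    """
    lo = max(0, n1 - lookback)
    hits = [n for n in range(lo, n1) if zc[n] > n0]
    return hits[0] if len(hits) >= min_frames else n1
```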

However, a method based on the zero-crossing count, as described above, cannot reduce the loss of voiced consonants (for example, /b/ and /d/), whose energy and zero-crossing count are both small. In addition, noise from the lips opening or from breathing tends to be attached to the start and end of speech. FIGS. 4a and 4b show the energy contours of speech with such noise attached: a shows the power contour of the word 異様 (/ijoo/) as an example in which noise from lip movement is attached to the start, and b shows the power contour of the word 出場 (/ideju/) as an example in which breathing noise is attached to the start. In cases like those shown in the figures, the conventional method places the start point in the noise portion. The conventional method thus has the drawback that the start and end points may be misplaced, so that the loss of phonemes and the addition of spurious phonemes due to noise cannot always be avoided.

Object of the Invention

In view of the above drawbacks, the present invention provides a speech start/end detection device with high positional accuracy, little loss of speech, and little added noise.

Structure of the Invention

To achieve the above object, the invention comprises: a sound/silence determination unit that decides, frame by frame (for example, every 10 msec), whether the frame is sound or silence from the energy and spectral shape of the signal; a section that detects start and end candidates of the speech from the persistence of the per-frame sound/silence decisions; and a section that fixes the start and end positions from dynamic features, namely the magnitude of the change in signal energy and of the change in spectrum where the signal changes from silence to sound or from sound to silence. The positions of the start and end of speech are thereby detected from an input signal containing speech.

Description of the Embodiments

An embodiment of the present invention is described below with reference to the drawings.

FIG. 5 is a block diagram of a speech start/end detection device incorporated in a speech recognition device according to one embodiment of the present invention. In the figure, 5 is an energy extraction unit, consisting of a rectifying and smoothing circuit, which extracts the power of the signal for each frame. 6 is a spectral shape extraction unit, consisting of, for example, a bank of three band-pass filters, low band (250 to 600 Hz), mid band (600 to 1500 Hz), and high band (1500 to 4000 Hz), and a rectifying and smoothing circuit; the per-frame power in each band is used as the spectral information. The energy extraction unit 5 and the spectral shape extraction unit 6 together form the feature extraction unit 13. 7 is a multiplexer that feeds the signal power from the energy extraction unit 5 and the band-filter powers from the spectral shape extraction unit 6 to the sound/silence determination unit 8 by time division. 8 is the sound/silence determination unit, which distinguishes silence, unvoiced sound, and voiced sound. 9 and 10 are a threshold memory and a standard pattern memory, which hold the constants used by the sound/silence determination unit 8. The threshold memory 9 holds two power thresholds E₁ and E₂ (E₁ > E₂). The standard pattern memory 10 holds the coefficients of two linear discriminant functions: one for discriminating silence from unvoiced sound and one for discriminating silence from voiced sound. The two thresholds and the coefficients of the two linear discriminant functions are obtained beforehand by statistical processing of speech data uttered in the environment of use, and then stored. 11 is a start/end candidate detection unit, which detects start and end candidates of the speech from the duration of the per-frame sound/silence decisions sent from the sound/silence determination unit 8. 12 is a start/end determination unit, which fixes the final start and end points. Elements 8 to 12 in FIG. 5 are implemented on a single microprocessor.
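As a rough software stand-in for the feature extraction unit 13, the sketch below computes a log frame power and three log band powers for one frame. It is an assumption-laden illustration: the embodiment uses band-pass filters with rectifying and smoothing circuits, whereas this code swaps in a naive discrete Fourier transform; the function name `log_features`, the `floor` constant, and the use of base-10 logarithms are all hypothetical choices.

```python
import cmath
import math

# Band edges in Hz as given in the description: low, mid, high.
BANDS_HZ = [(250.0, 600.0), (600.0, 1500.0), (1500.0, 4000.0)]


def log_features(frame, sample_rate, floor=1e-10):
    """Return (log power, [log band power for each band]) for one frame.

    Uses a naive DFT instead of the analog filter bank of the embodiment.
    `floor` avoids taking the logarithm of zero.
    """
    n = len(frame)
    band_power = [0.0] * len(BANDS_HZ)
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        for b, (lo, hi) in enumerate(BANDS_HZ):
            if lo <= freq < hi:
                band_power[b] += abs(coeff) ** 2 / n
    total_power = sum(x * x for x in frame)
    lpw = math.log10(total_power + floor)
    lp = [math.log10(p + floor) for p in band_power]
    return lpw, lp
```

With a 10 kHz sample rate (an assumed value) a 10 msec frame holds 100 samples, so the naive transform is cheap enough for illustration.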

The operation of the speech start/end detection device configured as above is described next.

A signal containing speech, input from a microphone or the like, is converted frame by frame into the power PW and three band powers P_i (i = 1 to 3) by the energy extraction unit 5 and the spectral shape extraction unit 6 of FIG. 5. PW and P_i are passed through the multiplexer 7 to the sound/silence determination unit 8, which takes the logarithm of these four parameters to obtain the log power LPW and the log band powers LP_i (i = 1 to 3). Using the four parameters LPW and LP_i together with the thresholds E₁ and E₂ and the coefficients of the two linear discriminant functions held in the threshold memory 9 and the standard pattern memory 10, it decides whether the input frame is sound or silence. The decision begins by comparing the log power LPW with the two energy thresholds E₁ and E₂ (E₁ > E₂). The thresholds are set so that a frame with LPW > E₁ is certainly sound and a frame with LPW < E₂ is certainly silence, so the outcome is as shown in equation (1).

\[
\text{decision} =
\begin{cases}
\text{sound} & \text{if } LPW > E_1 \\
\text{silence} & \text{if } LPW < E_2 \\
\text{undetermined} & \text{if } E_2 \le LPW \le E_1
\end{cases}
\qquad (1)
\]

When the energy-based decision on LPW is undetermined, a further sound/silence decision is made from the spectral shape. The log powers LP_i (i = 1 to 3) of the low, mid, and high bands serve as the parameters describing the spectral shape, and sound or silence is decided by evaluating discriminant functions with the coefficients of the two linear discriminant functions held in the standard pattern memory 10. One of the two functions discriminates silence from unvoiced sound, and the other discriminates silence from voiced sound. The linear discriminant function FX is given by equation (2); the standard pattern memory 10 holds the coefficients A_i (i = 1 to 3) and the mean values $\overline{LP}_i$ (i = 1 to 3) of equation (2) for each of the two functions, silence/unvoiced and silence/voiced.

\[
FX = \sum_{i=1}^{3} A_i \left( LP_i - \overline{LP}_i \right) \qquad (2)
\]

(where A_i is a coefficient and $\overline{LP}_i$ is a mean value)

The coefficients A_i in equation (2) are set to give the best discrimination between the two classes, and are obtained from the condition that maximizes the Fisher ratio, the ratio of the between-class variance to the within-class variance of the two classes. In this embodiment, A_i and the mean values $\overline{LP}_i$ in equation (2) are obtained beforehand by statistical processing of the silence, unvoiced, and voiced portions of speech data uttered in the environment of use. FX is set up so that it is negative when the input is silence and positive when the input is an unvoiced or voiced sound. The spectral-shape decision therefore evaluates both linear discriminant functions, silence/unvoiced and silence/voiced: the frame is judged sound if either takes a positive value, and silence if both are negative. The per-frame sound/silence decisions obtained in this way are sent to the start/end candidate detection unit 11 of FIG. 5.

The start/end candidate detection unit 11 detects the start candidate and the end candidate of the speech from the duration of the per-frame sound/silence decisions. It is built from two microprocessor registers used as counters, together with the compare operations of the processor. Start candidate detection uses only one counter; end candidate detection uses both. FIG. 6 shows the processing flow for start candidate detection: when frames judged to be sound continue for five or more frames, the first frame of the run becomes the start candidate. Step イ of FIG. 6 is a reset that initializes the sound-frame counter (COUNT in FIG. 6), the start candidate frame number storage area (FRAMES in FIG. 6), and the processing frame position (I in FIG. 6). Step ロ advances the processing frame position. Step ハ branches on whether the frame being processed is sound or silence. If it is sound, 1 is added to the sound-frame counter COUNT (step ニ). In addition, if the start candidate storage area FRAMES is still reset to 0, the number I of the frame being processed is stored in it (steps ホ and ヘ). Step ト tests whether the sound-frame counter has reached 5: if the counter is below 5 the flow returns to step ロ, and if it is 5 or more a start candidate has been found and start candidate detection ends. If step ハ encounters a silent frame before the procedure ends, step チ resets the sound-frame counter and the start candidate storage area, and the flow returns to step ロ. Because step チ resets it at every silent frame, the sound-frame counter counts consecutive sound frames, and the test in step ト is therefore a test for five or more consecutive sound frames. Consequently, even if two or three frames before the start of the speech are judged to be sound because of lip noise or the like, they are discarded as long as even one frame after them is judged to be silence. Once the start candidate has been detected in this way, end candidate detection is carried out. FIG. 7 shows the processing flow for end candidate detection.
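The per-frame decision of equations (1) and (2) and the start candidate search of FIG. 6 can be sketched as below. The function names and the representation of each discriminant as a `(coefficients, means)` pair are hypothetical; frame numbers are 0-based and `None` stands in for the "reset" state of the FRAMES storage area, both of which are assumptions of this sketch.

```python
def discriminant(lp, coeffs, means):
    """Equation (2): FX = sum of A_i * (LP_i - mean_i) over the bands."""
    return sum(a * (x - m) for a, x, m in zip(coeffs, lp, means))


def is_sound(lpw, lp, e1, e2, silence_unvoiced, silence_voiced):
    """Decide sound (True) or silence (False) for one frame.

    Energy decides when it can (equation (1)); otherwise the frame is
    sound if either discriminant function is positive.
    """
    if lpw > e1:
        return True
    if lpw < e2:
        return False
    return any(discriminant(lp, coeffs, means) > 0
               for coeffs, means in (silence_unvoiced, silence_voiced))


def find_start_candidate(sound_flags, min_run=5):
    """Return the first frame of the first run of `min_run` sound frames."""
    count = 0          # COUNT: consecutive sound frames so far
    start = None       # FRAMES: first frame of the current run
    for i, sound in enumerate(sound_flags):
        if sound:
            count += 1
            if start is None:
                start = i
            if count >= min_run:
                return start
        else:
            count = 0  # a silent frame discards the run
            start = None
    return None
```

A short burst of two or three sound frames followed by a silent frame is dropped, which matches the lip-noise behaviour described above.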

Step イ of FIG. 7 is a reset that initializes the silent-frame counter (COUNT1 in FIG. 7), the sound-frame counter (COUNT2 in FIG. 7), and the end candidate frame number storage area (FRAMEE in FIG. 7). Step ロ advances the processing frame position (I in FIG. 7). Step ハ branches on whether the frame being processed is sound or silence. If it is silence, the silent-frame counter is updated and the sound-frame counter is reset (steps ニ and ホ). Further, if the silent-frame counter is 2 or more and the end candidate storage area is still reset, the number of the frame at which the silent-frame counter became 1 is stored there as the end candidate frame (steps ヘ and ト). Step チ tests whether the silent-frame counter has reached 30: if it is below 30 the flow returns to step ロ, and if it is 30 or more the speech is considered to have ended and the procedure terminates. Steps リ, ヌ, and ル, taken when step ハ finds a sound frame, count how many consecutive sound frames have occurred since the end candidate was stored. If five or more sound frames occur in succession, the speech is considered not to have ended, and the flow returns to step イ to restart end candidate detection. If the run is shorter than five frames, it is treated as noise, that is, as part of the silent interval, and its length is added to the silent-frame counter in step ニ.
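The end candidate flow of FIG. 7 might be sketched as follows. The function name and parameters are hypothetical, frame numbers are 0-based, and two behaviours are assumptions not stated in the text: a single silent frame followed by sound (before any candidate is stored) is simply forgotten, and `None` is returned if the input ends before the silent-frame counter reaches its limit.

```python
def find_end_candidate(sound_flags, begin=0, min_silence=2,
                       hangover=30, min_run=5):
    """Return the end candidate frame, or None if no end was confirmed."""
    silent = 0         # COUNT1: silent frames (short sound bursts included)
    sound = 0          # COUNT2: consecutive sound frames after the candidate
    candidate = None   # FRAMEE: first silent frame of the candidate run
    for i in range(begin, len(sound_flags)):
        if sound_flags[i]:
            if candidate is None:
                silent = 0           # assumption: lone silent frame forgotten
                continue
            sound += 1
            if sound >= min_run:     # speech resumed: start over
                silent, sound, candidate = 0, 0, None
        else:
            silent += 1 + sound      # a short burst counts as silence
            sound = 0
            if candidate is None and silent >= min_silence:
                candidate = i - silent + 1
            if silent >= hangover:
                return candidate
    return None
```

In use, `begin` would typically be the start candidate found earlier, so that the search covers only frames after the speech has begun.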

In other words, when two silent frames occur in succession the speech may have ended, and the first of the silent frames becomes a provisional end candidate. If no run of five or more sound frames occurs within the 29 frames following that provisional candidate, it is adopted as the end candidate. If a run of five or more sound frames does occur within those 29 frames, the speech is considered not to have ended yet; the counters and the provisional end candidate are all reset, and the end detection procedure of FIG. 7 restarts from the next frame. This processing removes noise of four frames or fewer attached to the end of the speech.

The start/end determination unit 12 fixes the final start and end points from the magnitude of the changes in the power LPW and in the spectrum LP_i near the start and end candidate frames found by the start/end candidate detection unit 11. The magnitude of the power change is represented by LPWD, the frame-to-frame difference of the log power LPW, given by equation (3).

\[
LPWD_j = LPW_j - LPW_{j-1} \qquad (3)
\]

(where j is the frame number)

The magnitude of the spectral change is represented by SPD, the Euclidean distance of the band log powers LP_i between successive frames, given by equation (4).

\[
SPD_j = \sum_{i=1}^{3} \left( LP_{i,j} - LP_{i,j-1} \right)^2 \qquad (4)
\]

(where i is the band and j is the frame number)

LPWD is positive while the power is increasing and negative while it is decreasing. SPD is large where the shape of the spectrum changes greatly, as in the transition from silence to sound. To fix the start point, frames are first searched from the start candidate toward the end of the utterance for one in which LPWD is positive. Then, among the three frames made up of the first frame with positive LPWD and the two frames after it, the frame with positive LPWD and the largest SPD is found and taken as the start frame.

To fix the end point, frames are first searched from the end candidate frame toward the start for one in which LPWD is negative. Then, among the three frames made up of the first frame with negative LPWD and the two frames before it, the frame with negative LPWD and the largest SPD is found, and the frame immediately before it is taken as the end frame. The start and end points obtained in this way are used by the speech recognition device.
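The refinement of the start and end candidates using equations (3) and (4) can be sketched as below. The function names are hypothetical; `lpw` is assumed to be a list of per-frame log powers and `lp` a list of per-frame lists of band log powers, both 0-indexed. Returning the candidate unchanged when no frame with the required sign of LPWD is found is an assumption, as the text does not cover that case.

```python
def lpwd(lpw, j):
    """Equation (3): difference of log power between frames j and j-1."""
    return lpw[j] - lpw[j - 1]


def spd(lp, j):
    """Equation (4): summed squared difference of band log powers."""
    return sum((a - b) ** 2 for a, b in zip(lp[j], lp[j - 1]))


def refine_start(lpw, lp, candidate):
    """Search forward for rising power, then pick the largest SPD."""
    j = max(candidate, 1)
    while j < len(lpw) and lpwd(lpw, j) <= 0:
        j += 1
    if j >= len(lpw):
        return candidate             # assumption: no rising frame found
    window = [k for k in range(j, min(j + 3, len(lpw)))
              if lpwd(lpw, k) > 0]
    return max(window, key=lambda k: spd(lp, k))


def refine_end(lpw, lp, candidate):
    """Search backward for falling power; end is one frame before max SPD."""
    j = min(candidate, len(lpw) - 1)
    while j >= 1 and lpwd(lpw, j) >= 0:
        j -= 1
    if j < 1:
        return candidate             # assumption: no falling frame found
    window = [k for k in range(j, max(j - 3, 0), -1)
              if lpwd(lpw, k) < 0]
    return max(window, key=lambda k: spd(lp, k)) - 1
```

Restricting the choice to a three-frame window keeps the refined point close to the candidate while letting the largest spectral change pin down the exact frame.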

According to this embodiment, for input signals with a low energy level the sound/silence determination unit 8 uses linear discriminant functions to decide whether the frame is sound from the difference between its spectral shape and that of silence, so the loss of low-energy unvoiced and voiced consonants can be reduced. The start/end candidate detection unit 11 takes the persistence of speech into account, so short noises attached before and after the start and end of the speech can be removed. Furthermore, the start/end determination unit 12 fixes the start and end positions from the magnitude of the changes in energy and in spectral shape at the transition from silence to sound or from sound to silence, so start and end points with high positional accuracy are obtained.

FIG. 8 shows an example in which the start/end detection of this embodiment is applied to the word 土台 (/dodai/). FIG. 8a shows the log power LPW, b the spectral change SPD, and c the power change LPWD; in d, the solid line is the value of the linear discriminant function for silence/unvoiced sound and the broken line is the value of the linear discriminant function for silence/voiced sound. In the example of FIG. 8, noise is visible at both the start and the end. The per-frame sound/silence determination unit 8 judges the intervals イ to ロ and ハ to ニ shown in a to be sound, either because LPW is at or above E₁ or, where LPW lies between E₁ and E₂, from the signs of the two linear discriminant functions shown in d. This removes the noise at the start. The start/end candidate detection unit 11 then uses the persistence of the sound and silent frames to set the start candidate frame to イ and the end candidate frame to ロ; the sound interval from ハ to ニ is shorter than five frames and is therefore judged to be noise. Finally, the start/end determination unit 12 fixes the start point イ′ and the end point ロ′ from the log power change c and the spectral change b, giving correct start and end positions with the noise removed.

An evaluation experiment on this embodiment used 212 words uttered by one male speaker, whose start and end points had been labeled beforehand by visual inspection. The detected point was within 2 frames of the label for 93.4% of start points and 92.9% of end points, and within 3 frames for 97.6% of start points and 97.2% of end points. Serious errors were few: a phoneme was lost at the start in 2 words and at the end in 2 words, and there were no errors due to added noise. These good results confirm that the speech start/end detection device of the present invention operates effectively.

The above description covered the case in which band log powers are used as the parameters describing the spectral shape and linear discriminant functions are used for the sound/silence decision. Alternatively, the spectral shape may be described by a power spectrum obtained by Fourier transform or linear predictive analysis of the signal, or by LPC cepstrum coefficients obtained by linear predictive analysis, and the sound/silence decision may use a Bayes decision or a statistical distance measure such as the Mahalanobis distance.

Effects of the Invention

As described above, the present invention provides a speech start/end detection device composed of a per-frame sound/silence determination unit that uses not only the energy information of the signal but also its spectral shape, a start/end candidate detection unit that takes the persistence of speech into account, and a determination unit that fixes the start and end positions from the amount of change in energy and in spectral shape. Because the sound/silence decision exploits differences in spectral shape, measured by a statistical distance from standard spectral patterns of silence, unvoiced sound, and voiced sound, the loss of low-energy unvoiced and voiced consonants is reduced. Because start and end candidates are detected from the persistence of sound, little noise is added. And because the start and end positions are fixed from the magnitude of the changes in energy and spectrum, the positional accuracy is high.

[Brief Description of the Drawings]

Fig. 1 is a distribution diagram of the zero-crossing counts conventionally used. Fig. 2 is a block diagram of a conventional start/end detection device. Fig. 3 is a diagram illustrating an example of the operation of the conventional start/end detection device. Fig. 4 is a diagram showing the energy variation of speech with added noise in the conventional method. Fig. 5 is a block diagram of a speech start/end detection device according to an embodiment of the present invention. Fig. 6 is a flowchart showing the start candidate detection process in an embodiment of the present invention. Fig. 7 is a flowchart showing the end candidate detection process in an embodiment of the present invention. Fig. 8 is a diagram illustrating an example of operation in an embodiment of the present invention.

5: energy extraction unit; 6: spectral shape extraction unit; 7: multiplexer; 8: sound/silence determination unit; 9: threshold memory; 10: standard pattern memory; 11: start/end candidate detection unit; 12: start/end determination unit; 13: feature extraction unit.

Claims

1. A speech start/end detection device comprising: a feature extraction unit that extracts, from a signal containing speech, feature quantities representing the energy and the spectral shape of the signal for each interval of a fixed time length; a sound/silence determination unit that uses the feature quantities to determine whether the input signal is sound or silence for each interval of the fixed time length; a start/end candidate detection unit that uses the time series of the sound/silence determination results to detect candidates for the start and end of speech from the duration of the determination results; and a start/end determination unit that determines the start and end positions using the magnitude of the change in signal energy and the change in spectrum before and after the start/end candidates.

2. The speech start/end detection device according to claim 1, wherein the feature representing the spectral shape of the signal is any of a power spectrum obtained by a bank of band-pass filters, by Fourier transform, or by linear predictive analysis, or LPC cepstrum coefficients obtained by linear predictive analysis.

3. The speech start/end detection device according to claim 1, wherein the sound/silence determination unit comprises a first determination unit that compares the energy of the signal with two threshold values, and a second determination unit that makes the determination from spectral similarity using a statistical distance measure between the spectrum of the input signal and three standard patterns of silence, unvoiced sound, and voiced sound, and wherein any of a linear discriminant function, Mahalanobis distance, or Bayes decision is used as the statistical distance measure.

4. The speech start/end detection device according to claim 1, wherein the Euclidean distance between the feature representing the spectrum in an interval of the fixed time length and the feature representing the spectrum in the preceding interval is used as the feature representing the magnitude of spectral change in the start/end determination unit.