JPH03114100A

JPH03114100A - Voice section detecting device

Info

Publication number: JPH03114100A
Application number: JP1253313A
Authority: JP
Inventors: Tokuyoshi Tou (▲とう▼ 徳淑); Kinhiyou So (蘇 錦標); Toshihiro Hayashi (林 俊宏)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1989-09-28
Filing date: 1989-09-28
Publication date: 1991-05-15

Abstract

PURPOSE: To detect even fricative speech by detecting the start point and end point of the voiced section of input speech from the dynamic features of the speech spectral variance and energy. CONSTITUTION: The device comprises an energy extraction part 20 that treats the digital speech data within a prescribed interval as one frame and extracts its energy; an energy threshold calculation part 30 that takes the average background-noise energy of frames as the energy threshold and adjusts it; a spectral variance extraction part 61 that derives the mean of the frame's spectrum over its frequencies and calculates the spectral variance; and a spectral variance threshold calculation part 71 that takes the average spectral variance of the background noise as the spectral variance threshold and adjusts it. The energy and spectral variance of each frame are compared with the energy threshold and the spectral variance threshold to check whether the frame is the start point or the end point of a speech section. In this way, a speech start point with weak energy, such as a fricative, can be detected.

Description

DETAILED DESCRIPTION OF THE INVENTION

Field of Industrial Application

The present invention is a speech section detection device for use in speech recognition systems. It detects the start point and end point of the voiced section of input speech from the dynamic features of the speech spectral variance and energy.

Prior Art

FIG. 2 is a block diagram of conventional speech section detection. In the figure, 10 is a speech data input means that collects an analog speech signal using devices such as a microphone and an analog/digital (A/D) converter, converts it into digital speech data, and stores the data in a buffer. 20 is an energy extraction means that squares the value of each sample in a frame and takes the logarithm to give the energy of that frame. 30 is an energy threshold acquisition means that averages the noise energy of several frames of background noise and uses the average as the energy threshold for speech detection.

40 is a speech start point detection means. It compares the frame energy from the energy extraction means 20 with the energy threshold from the energy threshold acquisition means 30; if the frame energy is greater than the threshold, the frame is regarded as part of a voiced section. If the start point of the voiced section has not yet been found, the frame is judged to be the start point of the speech section; otherwise it is regarded as continuing speech within the speech section.

50 is a speech end point detection means. After the speech start point has been detected, it compares the frame energy from the energy extraction means 20 with the energy threshold from the energy threshold acquisition means 30, and when the energy of several consecutive frames is below the threshold, it regards the frame as the end point of the speech.
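The energy extraction and threshold acquisition described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the frame length of 256 samples and hop of 128 samples are borrowed from the embodiment described later, and the energy is assumed to be the base-10 logarithm of the summed squared samples (the text says only that the samples are squared and a logarithm is taken).

```python
import math

def split_frames(samples, frame_len=256, hop=128):
    """Split a sample sequence into overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

def log_energy(frame, floor=1e-10):
    """Log of the summed squared sample values; `floor` (an assumption) avoids log(0)."""
    return math.log10(sum(x * x for x in frame) + floor)

def noise_energy_threshold(noise_frames):
    """Average log energy over frames assumed to contain only background noise."""
    energies = [log_energy(f) for f in noise_frames]
    return sum(energies) / len(energies)
```

A frame whose `log_energy` exceeds the value returned by `noise_energy_threshold` would then be treated as speech by the conventional method.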

General speech section detection of the kind shown in FIG. 2 is described, for example, in Japanese Examined Patent Publication No. Sho 81-47437 ("Device for detecting the end of a speech section"), which adopts a scheme similar to the one above. The "speech section signal detection circuit" disclosed in Japanese Examined Patent Publication No. Sho 61-3440 is likewise a device built on a similar method, in that case based on the zero-crossing rate.

Problems to Be Solved by the Invention

The conventional speech section detection described above detects speech sections from the energy feature of the speech alone. However, in languages such as Chinese, the energy at the start of some sounds is very weak and closely resembles the energy contained in noise. For fricatives such as (f), (s), (q), and (c), for example, the conventional method misjudges the sound as noise and cannot correctly detect the speech section of the fricative. The detected speech section is corrupted as a result, and the recognition rate of the recognition device drops sharply.

In view of this, an object of the present invention is to provide a speech section detection device that can also detect sounds such as fricatives, which cannot be detected from the conventional energy feature alone.

Means for Solving the Problems

To solve the above problems, the present invention is a speech section detection device comprising: an energy extraction unit that takes the digital speech data within a fixed interval of an input speech signal as one frame and extracts the energy of that frame; an energy threshold calculation unit that averages the background noise energy of several frames, uses the average as an energy threshold, and adjusts the energy threshold as the environment changes; a spectral variance extraction unit that first obtains the mean of the frame's spectrum over its frequencies and then calculates the spectral variance as a parameter; a spectral variance threshold calculation unit that uses the average spectral variance of the background noise of several frames as a spectral variance threshold and adjusts it appropriately as the environment changes; and a speech section detection unit that checks whether a frame is the start point of a speech section by comparing the energy and spectral variance of each frame of input speech with the energy threshold and the spectral variance threshold, and checks whether a frame is the end point of the speech section by comparing the energy of each frame of input speech with the energy threshold.

Operation

With the configuration described above, the spectral variance technique allows the present invention to detect even speech start points with weak energy, such as fricatives, so speech sections can be detected reliably.

Embodiment

FIG. 1 is a block diagram showing one embodiment of the present invention.

In FIG. 1, parts that perform the same operation as those in FIG. 2 carry the same reference numerals. 10 is a speech data input unit, 20 is an energy extraction unit, and 30 is an energy threshold calculation unit; these operate as in FIG. 2. 51 is a spectrum calculation unit that computes the spectrum of each frame of the speech signal supplied by the speech data input unit 10, using a linear predictive analysis (LPC: Linear Prediction Coding) function. 61 is a spectral variance extraction unit that calculates the spectral mean and then the spectral variance. 90 is a buffer memory that stores the energy of each frame extracted by the energy extraction unit 20 and the spectral variance of each frame extracted by the spectral variance extraction unit 61.

In this embodiment the sampling rate is, for example, 10 kHz, each frame has 256 points, adjacent frames overlap by 128 points, and the spectrum of each frame is computed with the linear predictive analysis function. For the spectrum calculation, the sampling space of 0 to 5 kHz is divided into 16 parts, each part being one frequency channel. S(n, m) is the spectral magnitude of the mth frequency channel of the nth frame. The spectral mean of a frame is obtained from the following formula.
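The formula itself is not reproduced in this text. A reconstruction consistent with the definitions above, assuming the mean is taken over the 16 frequency channels, would be:

```latex
\bar{S}(n) = \frac{1}{16} \sum_{m=1}^{16} S(n, m)
```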

Here, S̄(n) is the spectral mean of the nth frame. The spectral variance of the frame can be calculated by subtracting the spectral mean from the spectrum of each frequency channel in the frame and summing the squares.
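Written out, and assuming no normalization beyond the sum of squares that the text describes, the spectral variance would take the form:

```latex
V(n) = \sum_{m=1}^{16} \bigl( S(n, m) - \bar{S}(n) \bigr)^{2}
```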

V(n) is the spectral variance of the nth frame.
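As a minimal sketch of this calculation, the two steps can be written as below. The function names are hypothetical, and the input is assumed to be the list of 16 channel magnitudes S(n, 1) to S(n, 16) already produced by the spectrum calculation unit; the LPC analysis itself is not shown.

```python
def spectral_mean(channels):
    """Mean spectral magnitude over the frequency channels of one frame."""
    return sum(channels) / len(channels)

def spectral_variance(channels):
    """Sum of squared deviations of each channel from the frame's spectral mean."""
    mean = spectral_mean(channels)
    return sum((s - mean) ** 2 for s in channels)
```

A perfectly flat spectrum gives a variance of zero, which matches the intuition that a smooth background spectrum yields a small value.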

FIG. 3 shows the spectra of four input speech signals: /d/, /s/, /a/, and /silence/. Among these four, the spectrum of silence is smoother than the others, so its spectral variance is smaller than that of the other three sounds. This property makes it possible to separate speech from the background signal.

71 is a spectral variance threshold calculation unit that computes the spectral variance threshold from the spectral variance of the first several frames of the incoming audio, according to the following formula.

$$V_{TH} = \left( \frac{1}{10} \sum_{n=1}^{10} V(n) \right) \times 1.5$$

Here V_TH is the spectral variance threshold based on 10 frames, and V(n) is the spectral variance of the nth frame within the silence period.
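A small sketch of the threshold calculation follows. It assumes the threshold is the average spectral variance of the initial silence frames multiplied by 1.5; the function name and the `scale` parameter are hypothetical, and the averaging is inferred from the description of the threshold as an average over background-noise frames.

```python
def spectral_variance_threshold(silence_variances, scale=1.5):
    """Scaled average of the spectral variances measured during the initial silence.

    `silence_variances` would hold V(n) for the first frames (10 in the embodiment).
    """
    return scale * sum(silence_variances) / len(silence_variances)
```

Exposing the factor as a parameter reflects the remark later in the description that the coefficients are experimental and may be tuned to the equipment.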

81 is a speech section detection unit that detects the speech start point and the speech end point. The energy extracted by the energy extraction unit 20 and the spectral variance extracted by the spectral variance extraction unit 61 are compared with the energy threshold from the energy threshold calculation unit 30 and the spectral variance threshold from the spectral variance threshold calculation unit 71, respectively, to check whether the frame is the speech start point. That is, as shown in the flowchart of FIG. 4, the energy of one frame is first input and tested against the energy threshold. If it is smaller than the energy threshold, the frame is regarded as background noise and the energy threshold is adjusted.

Otherwise, the five frames following this frame are checked against the energy threshold, and if they also exceed it, this frame is provisionally taken as the speech start point. The frames preceding this frame are then examined one by one, going backward, to see whether their spectral variance exceeds the spectral variance threshold, until a frame is reached whose spectral variance is smaller than the threshold. The frame immediately after that frame is taken as the start point of the speech section. In other words, the true speech start point is the frame that follows the frame whose spectral variance is below the spectral variance threshold.
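The start point search just described can be sketched as follows. This is an illustrative reading, not the flowchart of FIG. 4 itself: `find_start` and its parameters are hypothetical names, per-frame energies and spectral variances are assumed to be available as lists, the thresholds are held fixed (threshold adaptation on noise frames is omitted), and the energy test is taken to mean that a frame and the `run` frames after it all exceed the energy threshold.

```python
def find_start(energies, variances, e_th, v_th, run=5):
    """Return the index of the speech start frame, or None if none is found."""
    n_frames = len(energies)
    for i in range(n_frames - run):
        window = energies[i:i + run + 1]  # frame i plus the next `run` frames
        if all(e > e_th for e in window):
            start = i  # provisional start point from energy alone
            # Step backward while earlier frames still exceed the variance threshold.
            while start > 0 and variances[start - 1] > v_th:
                start -= 1
            return start
    return None

# Energy first rises at frame 4, but frames 2 and 3 already have high spectral variance.
energies = [0, 0, 0, 0, 5, 5, 5, 5, 5, 5, 5]
variances = [1, 1, 9, 9, 9, 9, 9, 9, 9, 9, 9]
start = find_start(energies, variances, e_th=1, v_th=2)
```

In this toy data the provisional start is frame 4 and the backward search moves it to frame 2, so a weak onset two frames ahead of the energy rise is kept inside the speech section.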

FIG. 5 shows an example of speech data. In this case, the nth frame, from which five consecutive frames exceed the energy threshold, is first provisionally taken as the speech start point. Next, the spectral variance is obtained for the frames before the nth frame and compared with the spectral variance threshold. The spectral variance exceeds the threshold from the nth frame back to the (n-2)th frame and falls below it in the (n-3)th frame, so the (n-2)th frame is taken as the start point of the speech section.

For speech end point detection, the frame energy from the energy extraction unit 20 is compared with the energy threshold, and when the energy of five consecutive frames is below the threshold, the frame is regarded as the end point of the speech. In Chinese, for example, all speech ends in a vowel, so the end point of a speech section can be detected accurately from energy alone.

In FIG. 5, the energy of the five consecutive frames from the mth frame to the (m+5)th frame is below the energy threshold, so the mth frame is taken as the end point of the speech section.
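The end point rule can be sketched in the same style. The function name and arguments are hypothetical; the sketch assumes the search begins after an already detected start frame and returns the first frame from which `run` consecutive frames all fall below the energy threshold.

```python
def find_end(energies, e_th, start, run=5):
    """Return the first frame after `start` that begins `run` consecutive low-energy frames."""
    for m in range(start + 1, len(energies) - run + 1):
        if all(e < e_th for e in energies[m:m + run]):
            return m
    return None

# A single low-energy frame (index 5) inside the utterance does not end the section.
energies = [0, 0, 5, 5, 5, 0, 5, 0, 0, 0, 0, 0]
end = find_end(energies, e_th=1, start=2)
```

Requiring a run of low-energy frames rather than a single one keeps short dips within an utterance from being mistaken for its end.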

With the units of the embodiment described above, the start and end points of speech can be detected accurately from the spectral variance characteristics and the energy of the speech. This reduces wasted computation during speech recognition and also improves the recognition rate. For example, when this speech section detection device was applied to a speaker-dependent speech recognition device and used to recognize 100 Chinese words (city names), the recognition rate rose from 92% to 98%, and the recognition time was also shortened considerably.

The invention is not limited to the above embodiment and can be practiced with suitable modifications as long as its gist is unchanged. For example, the method of obtaining the frequency feature is not limited to the linear predictive analysis spectrum; a linear analysis spectrum or any other feature representing the frequency content of the speech can also be used.

The data input method is also not limited to a microphone; speech data can equally be stored in memory from a recording device.

In this embodiment the energy comparison is made over five consecutive frames, but this number of frames may be changed to suit the peripheral equipment. Furthermore, the formula for the spectral variance threshold V_TH given in this embodiment is an experimental value, and better results can be obtained by adjusting each coefficient according to the characteristics of the equipment.

Effects of the Invention

The present invention can detect speech sections effectively from the energy and spectral variance characteristics of the speech. Because it raises the recognition rate and also shortens the recognition time considerably, its practical effect is significant.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing the configuration of a speech section detection device according to one embodiment of the present invention. FIG. 2 is a block diagram showing the configuration of a conventional speech section detection device. FIG. 3 is an explanatory diagram showing the spectra of four input speech signals. FIG. 4 is a flowchart explaining speech start point detection. FIG. 5 is an explanatory diagram showing an example of speech section detection.

10: speech data input unit; 20: energy extraction unit; 30: energy threshold calculation unit; 41: buffer memory; 51: spectrum calculation unit; 61: spectral variance calculation unit; 71: spectral variance threshold calculation unit; 81: speech section detection unit.

Claims

A speech section detection device comprising: an energy extraction unit that takes the digital speech data within a fixed interval of an input speech signal as one frame and extracts the energy of that frame; an energy threshold calculation unit that averages the background noise energy of several frames, uses the average as an energy threshold, and adjusts the energy threshold as the environment changes; a spectral variance extraction unit that first obtains the mean of the frame's spectrum over its frequencies and then calculates the spectral variance as a parameter; a spectral variance threshold calculation unit that uses the average spectral variance of the background noise of several frames as a spectral variance threshold and adjusts it appropriately as the environment changes; and a speech section detection unit that checks whether a frame is the start point of a speech section by comparing the energy and spectral variance of each frame of input speech with the energy threshold and the spectral variance threshold, and checks whether a frame is the end point of the speech section by comparing the energy of each frame of input speech with the energy threshold.