JPH0588840B2


Info

Publication number: JPH0588840B2
Application number: JP60282481A
Authority: JP
Inventors: Juichiro Fujihashi
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1985-12-16
Filing date: 1985-12-16
Publication date: 1993-12-24
Also published as: JPS62141595A

Description

DETAILED DESCRIPTION OF THE INVENTION

(Field of Industrial Application) The present invention relates to a speech detection method used in speech recognition devices and the like to determine the time interval during which speech is present.

(Prior Art) Conventionally, speech detection methods of this type have commonly taken the start of speech to be the point at which the speech power level has stayed above a threshold for at least a certain duration, and the end of speech to be the point at which the level has stayed below the threshold for at least a certain duration.
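The duration-based rule described above can be sketched as follows. This is only an illustration of the conventional approach, not code from the patent: the function name `detect_by_duration`, the frame-based representation of power, and the parameter names are all assumptions.

```python
def detect_by_duration(power, threshold, min_frames):
    """Conventional endpointing (illustrative sketch, names are hypothetical).

    A start is declared once the power has stayed above `threshold` for
    `min_frames` consecutive frames; an end is declared once it has stayed
    at or below `threshold` for `min_frames` consecutive frames.
    Returns a list of (start_frame, end_frame) pairs, both inclusive.
    """
    segments = []
    in_speech = False
    run = 0        # length of the current run that contradicts the current state
    start = None
    for i, p in enumerate(power):
        above = p > threshold
        if not in_speech:
            run = run + 1 if above else 0
            if run >= min_frames:
                in_speech = True
                start = i - run + 1      # first frame of the qualifying run
                run = 0
        else:
            run = run + 1 if not above else 0
            if run >= min_frames:
                segments.append((start, i - run))  # last frame above threshold
                in_speech = False
                run = 0
    if in_speech:
        # Input ended while still in speech; drop any trailing low frames.
        segments.append((start, len(power) - 1 - run))
    return segments


# A short word-initial burst (frames 1-2) followed by a deep dip is lost,
# which is the kind of failure the next section attributes to this rule.
print(detect_by_duration([0, 5, 5, 0, 0, 0, 5, 5, 5, 5, 0, 0, 0], 1, 3))
```

With `min_frames = 3`, the two-frame burst never satisfies the duration condition, so only the later part of the utterance is reported.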

(Problems to be Solved by the Invention) Because the conventional speech detection method described above detects speech segments from how long a level persists, the beginning of a word may be dropped for speech with a deep power dip, and even a momentary noise, if it occurs close to the end of a word, extends the end point so that the noise is included in the speech segment. The conventional speech detection method thus has the drawback of detecting speech segments incorrectly.

(Means for Solving the Problems) To solve the above problems, the present invention provides: a power calculation unit that calculates the power of a speech signal; a power smoothing unit that smooths the power calculated by the power calculation unit to obtain smoothed power; a peak detection unit that detects, as peak candidates of the smoothed power, the inflection points at which the rate of change of the smoothed power turns from positive to negative (that is, its local maxima); a threshold calculation unit that selects the peak candidate with the highest level as the maximum peak, calculates a peak selection threshold from the level of the maximum peak and a predetermined peak selection coefficient, and calculates a peak width calculation threshold from the level of the maximum peak and a predetermined peak width calculation coefficient; a peak selection unit that compares the level of each peak candidate detected by the peak detection unit with the peak selection threshold and selects as peaks only those peak candidates whose level is equal to or higher than the peak selection threshold; a peak width calculation unit that calculates, as a peak width, each time interval during which the smoothed power is equal to or higher than the peak width calculation threshold and which contains a peak selected by the peak selection unit; a peak width comparison unit that outputs, as speech segment candidates, those peak widths that are wider than a predetermined peak width threshold; a speech segment candidate time difference calculation unit that, when the peak width comparison unit yields a plurality of speech segment candidates, calculates as the speech segment candidate time difference the time from the end of the earlier candidate to the start of the later candidate of each pair of adjacent speech segment candidates; and a speech segment determination unit that determines the speech segment from the outputs of the peak width comparison unit and the speech segment candidate time difference calculation unit.

The speech segment determination unit is characterized in that, when there is one speech segment candidate, it determines that candidate as it stands to be the speech segment; when there are a plurality of speech segment candidates and the speech segment candidate time difference between adjacent candidates is shorter than a predetermined speech segment candidate time difference threshold, it performs merge processing that combines those candidates into a single new speech segment candidate running from the start of the earlier candidate to the end of the later candidate; it performs this merge processing repeatedly; and it takes one or more of the speech segment candidates that finally remain as the speech segment.

(Embodiment) The present invention is described next with reference to the drawings. FIG. 1 is a block diagram of one embodiment of the present invention.

This embodiment comprises a power calculation unit 1, a power smoothing unit 2, a peak detection unit 3, a threshold calculation unit 4, a peak selection unit 5, a peak width calculation unit 6, a peak width comparison unit 7, a speech segment determination unit 8, and a speech segment candidate time difference calculation unit 22. Input speech 10 is fed to the power calculation unit 1, the calculated power 11 is fed to the power smoothing unit 2, and the smoothed power 12 is fed to both the peak detection unit 3 and the peak width calculation unit 6. The peak detection unit 3 detects, as peak candidates 13, the inflection points at which the rate of change of the smoothed power 12 turns from positive to negative, and outputs the detected peak candidates 13 to the threshold calculation unit 4 and the peak selection unit 5. The threshold calculation unit 4 finds the maximum peak level among the peak candidates 13, combines it with the peak selection coefficient 19 to calculate the peak selection threshold 14, which it outputs to the peak selection unit 5, and combines the maximum peak level with the peak width calculation coefficient 20 to calculate the peak width calculation threshold 15, which it outputs to the peak width calculation unit 6. The peak selection unit 5 compares the peak level of each peak candidate 13 with the peak selection threshold 14 and outputs only those peak candidates whose peak level is equal to or higher than the threshold to the peak width calculation unit 6 as peaks 25.
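The front end of this pipeline (units 1 to 5) might be sketched as below. The patent does not specify how power is computed, which smoother is used, or how the coefficients are combined with the maximum peak level, so the mean-square frame power, the moving average, and the multiplication by the coefficients are all assumptions, as are the function and parameter names.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude per non-overlapping frame (assumed definition)."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def smooth(power, half_window):
    """Centered moving average; the smoother itself is an assumption."""
    out = []
    for i in range(len(power)):
        lo = max(0, i - half_window)
        hi = min(len(power), i + half_window + 1)
        out.append(sum(power[lo:hi]) / (hi - lo))
    return out


def peak_candidates(sm):
    """Frames where the slope turns from positive to negative.

    Only strict local maxima are reported; flat-topped peaks are not
    handled in this sketch.
    """
    return [i for i in range(1, len(sm) - 1)
            if sm[i] > sm[i - 1] and sm[i] > sm[i + 1]]


def select_peaks(sm, candidates, k_select, k_width):
    """Derive both thresholds from the maximum peak and keep strong peaks.

    Assumes threshold = coefficient * maximum peak level; the patent only
    says the threshold is calculated from the two values.
    """
    if not candidates:
        return [], 0.0, 0.0
    max_level = max(sm[i] for i in candidates)
    select_thr = k_select * max_level   # peak selection threshold (14)
    width_thr = k_width * max_level     # peak width calculation threshold (15)
    peaks = [i for i in candidates if sm[i] >= select_thr]
    return peaks, select_thr, width_thr


# Hypothetical smoothed power contour with four local maxima.
sm = [0, 1, 0, 4, 6, 4, 0, 10, 0, 5, 0]
cands = peak_candidates(sm)
print(cands, select_peaks(sm, cands, 0.5, 0.25))
```

Because both thresholds scale with the maximum peak, the selection adapts to the overall loudness of the utterance rather than depending on a fixed absolute level.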

The peak width calculation unit 6 outputs, as a peak width 16, each time interval during which the smoothed power 12 is equal to or higher than the peak width calculation threshold 15 and which contains a peak 25. In other words, the peak width 16 represents the time interval, around the peak of the smoothed power 12 designated by the peak 25, over which the smoothed power 12 is at or above the peak width calculation threshold 15. The peak width 16 is output to the peak width comparison unit 7. The peak width comparison unit 7 compares the peak width threshold 21 with the peak width 16 of each peak and outputs the start and end of each peak whose width is equal to or greater than the threshold 21, as speech segment candidates 17, to the speech segment determination unit 8 and the speech segment candidate time difference calculation unit 22.
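A minimal sketch of units 6 and 7 follows, assuming the smoothed power is a list of per-frame values and widths are measured in frames. The names `peak_width` and `segment_candidates` are hypothetical. The summary of the invention says "wider than" the peak width threshold while the embodiment says "equal to or greater than"; this sketch follows the embodiment. Collapsing two peaks that share one interval into a single candidate is also an assumption.

```python
def peak_width(sm, peak, width_thr):
    """Interval (start, end), inclusive, around `peak` where sm >= width_thr.

    Returns None if the peak itself lies below the threshold.
    """
    if sm[peak] < width_thr:
        return None
    start = peak
    while start > 0 and sm[start - 1] >= width_thr:
        start -= 1
    end = peak
    while end < len(sm) - 1 and sm[end + 1] >= width_thr:
        end += 1
    return start, end


def segment_candidates(sm, peaks, width_thr, min_width):
    """Keep intervals at least `min_width` frames wide.

    Each candidate is (start, end, peak_level), sorted by start frame.
    """
    seen = set()
    out = []
    for p in peaks:
        span = peak_width(sm, p, width_thr)
        if span is None or span in seen:
            continue            # two peaks may share one interval
        seen.add(span)
        start, end = span
        if end - start + 1 >= min_width:
            out.append((start, end, max(sm[start:end + 1])))
    return sorted(out)


# Hypothetical contour: one broad peak and two one-frame spikes.
sm = [0, 1, 0, 4, 6, 4, 0, 10, 0, 5, 0]
print(segment_candidates(sm, [4, 7, 9], 2.5, 2))
```

In this example the tallest peak is a one-frame spike, so it sets the thresholds but is itself rejected by the width test, which illustrates how momentary noise can be excluded on width alone.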

When there is only one speech segment candidate 17, the speech segment determination unit 8 outputs that candidate as it stands as the speech segment 18. When there are a plurality of speech segment candidates 17, the speech segment candidate time difference calculation unit 22 calculates, as the speech segment candidate time difference 23, the time from the end of the earlier candidate to the start of the later candidate of each pair of adjacent speech segment candidates. The speech segment determination unit 8 then repeatedly merges adjacent candidates into a single candidate whenever their speech segment candidate time difference 23 is smaller than the speech segment candidate time difference threshold 24. If the candidates finally reduce to one, that merged candidate is output as the speech segment 18; if they do not, the candidate that has the highest peak level after merging is output as the speech segment 18.
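The merge and decision steps of units 22 and 8 can be sketched as follows. Candidates are assumed to be `(start, end, peak_level)` tuples in frames; the function names and the `keep_all` flag are hypothetical. For sorted, non-overlapping candidates the gap between neighbors does not change when earlier candidates are merged, so a single left-to-right pass should give the same result as repeating the pairwise merge.

```python
def merge_candidates(cands, max_gap):
    """Merge neighbors whose gap (next start - previous end) is < max_gap.

    `cands` is a list of (start, end, peak_level) tuples. The merged
    candidate keeps the higher of the two peak levels.
    """
    merged = []
    for start, end, level in sorted(cands):
        if merged and start - merged[-1][1] < max_gap:
            p_start, p_end, p_level = merged[-1]
            merged[-1] = (p_start, max(p_end, end), max(p_level, level))
        else:
            merged.append((start, end, level))
    return merged


def decide(cands, max_gap, keep_all=False):
    """Return the final speech segment(s) as (start, end) pairs.

    By default only the candidate with the highest peak level survives.
    `keep_all=True` is a hypothetical switch that reports every remaining
    candidate as its own segment.
    """
    merged = merge_candidates(cands, max_gap)
    if len(merged) <= 1 or keep_all:
        return [(s, e) for s, e, _ in merged]
    best = max(merged, key=lambda c: c[2])
    return [(best[0], best[1])]


# Hypothetical frame numbers: two close candidates and one distant one.
cands = [(10, 20, 5.0), (24, 40, 9.0), (60, 70, 4.0)]
print(decide(cands, max_gap=8))                  # [(10, 40)]
print(decide(cands, max_gap=8, keep_all=True))   # [(10, 40), (60, 70)]
```

The first two candidates are 4 frames apart and merge, while the third is 20 frames away and stays separate; the default decision then keeps only the merged candidate because it contains the highest peak.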

When a plurality of speech segment candidates finally remain, discarding all but the candidate with the maximum peak level in this way is effective for removing noise segments. Clearly, however, if the speech segment determination unit 8 instead determines each candidate to be a separate speech segment when a plurality of candidates remain, the method is effective for, among other things, separating speech segments when speech is uttered in succession.

FIG. 2 shows the relationship among the waveform of the smoothed speech power 12, the thresholds used for speech detection, and the detected speech segment in the embodiment of FIG. 1. The following explains in detail, with reference to FIG. 2 and in relation to FIG. 1, how the embodiment of FIG. 1 removes noise with a low peak level and noise close to the speech while preventing loss of the word beginning even for speech with a deep power dip. In FIG. 2, the horizontal axis 30 represents time and the vertical axis 31 represents smoothed power; the waveform shown is that of the smoothed power 12 output by the power smoothing unit 2 of FIG. 1.

The peak detection unit 3 of FIG. 1 detects the four peak candidates 32, 33, 34 and 35 of FIG. 2, and the threshold calculation unit 4 of FIG. 1 calculates the peak selection threshold 14 and the peak width calculation threshold 15 from peak candidate 34, which is the maximum peak. The peak selection unit 5 uses the peak selection threshold 14 to remove peak candidate 32, whose peak level is low, and outputs peak candidates 33, 34 and 35 as peaks. The peak width calculation unit 6 uses the peak width calculation threshold 15 to calculate the peak widths 38, 39 and 40 of peaks 33, 34 and 35, and the peak width comparison unit 7 compares the peak width threshold 21 with each of the peak widths 38, 39 and 40. In the example of FIG. 2 all the peak widths are wider than the threshold 21, so the interval from the start to the end of each of peaks 33, 34 and 35 is output as a speech segment candidate 17.

The speech segment candidate time difference calculation unit 22 calculates the speech segment candidate time difference 43 between peaks 33 and 34 and the speech segment candidate time difference 44 between peaks 34 and 35. The speech segment determination unit 8 compares the speech segment candidate time difference threshold 24 with each of the time differences 43 and 44. Because the time difference 43 is shorter than the threshold 24, the speech segment candidates of peaks 33 and 34 are merged into one, and the interval from the start of peak 33 to the end of peak 34 becomes a new speech segment candidate. Because the time difference 44 is longer than the threshold 24, peak 35 cannot be merged, and two speech segment candidates remain. The speech segment determination unit 8 then compares the peak levels of the two candidates, determines the candidate running from start 41 to end 42, which contains the maximum peak 34, to be the speech segment 18, and outputs it.

Therefore, according to the embodiment of FIG. 1, as in the example shown in FIG. 2, peaks 32 and 35, which are noise, are removed, and the speech segment is detected correctly even for speech whose deep power dip splits it into peaks 33 and 34.

(Effects of the Invention) As described above, the present invention detects the peaks of the smoothed power waveform; calculates a peak selection threshold and a peak width calculation threshold from the level of the highest peak; uses the peak width calculation threshold to calculate the peak width of each peak whose level is equal to or higher than the peak selection threshold; determines peaks whose width is equal to or greater than a predetermined width to be speech segment candidates; when a plurality of peaks are determined to be speech segment candidates, calculates the time differences between the candidates and repeatedly merges candidates separated by less than a predetermined time into a single speech segment; and, if the candidates do not finally reduce to one, determines one of them (for example, the candidate having the highest peak level) or several of them to be speech segments. Speech segments can thus be determined from the height and width of the peaks and the time difference between adjacent peaks. Noise with a momentary peak can be removed even when it is close to the speech, and loss of the word-initial peak can be prevented even for speech with a deep power dip. The drawbacks of the conventional method described above are thereby eliminated, and the recognition rate can be improved when the method is used in a speech recognition device.

[Brief Description of the Drawings]

FIG. 1 is a block diagram of one embodiment of the present invention, and FIG. 2 is a diagram showing the waveform of the smoothed speech power in this embodiment.

1: power calculation unit; 2: power smoothing unit; 3: peak detection unit; 4: threshold calculation unit; 5: peak selection unit; 6: peak width calculation unit; 7: peak width comparison unit; 8: speech segment determination unit; 10: input speech; 11: power; 12: smoothed power; 13: peak candidate; 14: peak selection threshold; 15: peak width calculation threshold; 16: peak width; 17: speech segment candidate; 18: speech segment; 19: peak selection coefficient; 20: peak width calculation coefficient; 21: peak width threshold; 22: speech segment candidate time difference calculation unit; 23: speech segment candidate time difference; 24: speech segment candidate time difference threshold; 30: horizontal axis (time); 31: vertical axis (smoothed power); 32 to 35: peak candidates; 38 to 40: peak widths; 41: start; 42: end; 43, 44: speech segment candidate time differences.

Claims

[Claims]

1. A speech detection method comprising: a power calculation unit that calculates the power of a speech signal; a power smoothing unit that smooths the power calculated by the power calculation unit to obtain smoothed power; a peak detection unit that detects, as peak candidates of the smoothed power, the inflection points at which the rate of change of the smoothed power turns from positive to negative; a threshold calculation unit that selects the peak candidate with the highest level among the peak candidates as the maximum peak, calculates a peak selection threshold from the level of the maximum peak and a predetermined peak selection coefficient, and calculates a peak width calculation threshold from the level of the maximum peak and a predetermined peak width calculation coefficient; a peak selection unit that compares the level of each peak candidate detected by the peak detection unit with the peak selection threshold and selects as peaks only those peak candidates whose level is equal to or higher than the peak selection threshold; a peak width calculation unit that calculates, as a peak width, each time during which the smoothed power is equal to or higher than the peak width calculation threshold and which contains a peak selected by the peak selection unit; a peak width comparison unit that outputs, as speech segment candidates, those peak widths that are wider than a predetermined peak width threshold; a speech segment candidate time difference calculation unit that, when the peak width comparison unit yields a plurality of speech segment candidates, calculates as a speech segment candidate time difference the time from the end of the earlier speech segment candidate to the start of the later speech segment candidate of adjacent speech segment candidates; and a speech segment determination unit that determines a speech segment from the outputs of the peak width comparison unit and the speech segment candidate time difference calculation unit, wherein the speech segment determination unit, when there is one speech segment candidate, determines that speech segment candidate as it stands to be the speech segment, and, when there are a plurality of speech segment candidates and the speech segment candidate time difference between adjacent speech segment candidates is shorter than a predetermined speech segment candidate time difference threshold, performs merge processing that combines the plurality of speech segment candidates into one speech segment candidate running from the start of the earlier speech segment candidate to the end of the later speech segment candidate, performs this merge processing repeatedly, and takes one or more of the speech segment candidates that finally remain as the speech segment.

2. The speech detection method according to claim 1, wherein, when a plurality of speech segment candidates finally remain, the speech segment determination unit takes as the speech segment only the speech segment candidate having the highest peak level among those speech segment candidates.

3. The speech detection method according to claim 1, wherein, when a plurality of speech segment candidates finally remain, the speech segment determination unit determines each of the speech segment candidates to be a separate speech segment.