JPS62244100A

JPS62244100A - Voice section detecting system

Info

Publication number: JPS62244100A
Application number: JP61089138A
Authority: JP
Inventors: 安田　晴剛; 河本　俊毅
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1986-04-17
Filing date: 1986-04-17
Publication date: 1987-10-24

Abstract

(57) [Abstract] This publication contains application data filed before electronic filing was introduced, so no abstract data is recorded.

Description

[Detailed Description of the Invention]

Technical Field

The present invention relates to a speech section detection method for a speech recognition device.

Prior Art

In speech section detection for a speech recognition device, detecting the start point of a word is important, and accurate detection matters most when the word begins with a consonant. Start-point detection seldom fails in a quiet environment, but in the field the noise level is high and the word-initial consonant is easily missed. In practice the start point is often detected from the magnitude of the differential power spectrum, but this method alone frequently fails to extract a consonant whose power rises gradually. Conventional systems have either treated everything at or above a fixed speech-power threshold as the speech section, or used the differential power spectrum for start-point detection only. These methods work well when the surroundings are very quiet, but when stationary noise or the like is present the threshold generally rises along with it, so that consonant portions are progressively more likely to be dropped.

FIG. 5 is an electrical block diagram for explaining an example of start-point detection using the differential power spectrum. In the figure, 1 is a microphone, 2 is a preprocessing section, 3 is a bandpass filter group, 4 is a speech power generation section, 5 is a section comparison section, 6 is a differential power generation section, 7 is a start-point comparison section, and 8 is a speech section generation section. Speech input from the microphone 1 is frequency-analyzed by the bandpass filter group 3, sampled at a fixed period, and the channel power of each channel is output. The differential power generation section 6 totals, over all channels, the difference between each channel power and its value one sample earlier, and uses the total as the differential power spectrum. When this differential power spectrum exceeds a certain threshold, that point is taken as the speech start point. Once the start point has been determined, the speech power is monitored, and the portion at or above a certain threshold is determined to be the speech section.
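The conventional scheme of FIG. 5 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names, the use of absolute differences when totaling the channel differences, and the use of the channel sum as the speech power are all hypothetical choices, since the publication does not specify them.

```python
def differential_power(prev_frame, frame):
    """Total change in channel power since the previous sample.

    Assumption: absolute differences are summed; the publication only
    says that the per-channel differences are totaled.
    """
    return sum(abs(cur - prev) for prev, cur in zip(prev_frame, frame))


def detect_start_by_difference(frames, diff_threshold):
    """Index of the first frame whose differential power exceeds the threshold."""
    for i in range(1, len(frames)):
        if differential_power(frames[i - 1], frames[i]) > diff_threshold:
            return i
    return None  # no start point found


def detect_section(frames, diff_threshold, power_threshold):
    """Return (start, end) with end exclusive, or None if no start is found.

    Assumption: the speech power of a frame is the sum of its channel powers.
    """
    start = detect_start_by_difference(frames, diff_threshold)
    if start is None:
        return None
    end = start
    # After the start point, keep frames while speech power stays at or above threshold.
    while end < len(frames) and sum(frames[end]) >= power_threshold:
        end += 1
    return (start, end)
```

With an abrupt onset the start is found at the jump, whereas a sequence whose channel powers rise by a small step each sample never exceeds the differential threshold, which is the failure case the description goes on to discuss.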

FIGS. 6 and 7 are waveform diagrams for explaining examples of speech section detection determined as described above. In both figures, (a) shows the speech power and (b) shows the differential power. As shown in FIG. 6, a word that begins with a vowel is extracted with relatively small error. When the speech power rises gradually as in FIG. 7, however, and the start point is not caught at point (A), it cannot be detected until point (B), and the word-initial consonant block is lost entirely. This happens often with words beginning with /s/, /p/, or /l/, and with /M/, /N/, and the like.

However, lowering the differential power spectrum threshold to capture these start points, or extracting the start point from the speech power, makes the system vulnerable to stationary and environmental noise and therefore difficult to use in practice.

Object

The present invention has been made in view of the circumstances described above, and in particular aims to detect the speech section in a speech recognition device more accurately.

Configuration

To achieve the above object, the present invention provides a speech section detection method having means for detecting the speech power of speech input from a microphone, means having an n-channel (n = integer) bandpass filter for analyzing frequency and obtaining the power of each channel, and means for detecting the frequency differential power spectrum at each sampling period, characterized in that, in detecting the speech start point: the point at which the speech power of the high-frequency portion is greater than that of the low-frequency portion is taken as the start of the speech section; or the point at which the speech power of the low-frequency portion is greater than that of the high-frequency portion is taken as the start of the speech section; or the speech start point is declared when any one of the following conditions is met: the speech power of the high-frequency portion is greater than that of the low-frequency portion, the speech power of the low-frequency portion is greater than that of the high-frequency portion, or the differential power spectrum is greater than a certain threshold. The invention is described below on the basis of embodiments.

The present invention aims to remedy the drawbacks described above and to detect consonant blocks stably.

When separating /s/, /p/, /l/, /M/, /N/, and the like from noise, it is desirable to look at their frequency components. In general, the power of /s/, /p/, /l/, and similar consonants lies in the high-frequency range, while the power of /M/, /N/, and the like is concentrated in the low-frequency range. The output of the bandpass filter is therefore divided into three powers (high-range, mid-range, and low-range), and these powers are compared with one another for start-point detection only.
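As a rough illustration of dividing the bandpass filter output into three band powers, the sketch below sums the channel powers of one sample into low, mid, and high groups. The function name and the split into roughly equal thirds are assumptions made for illustration; the publication does not say which channels belong to which band.

```python
def band_powers(channel_powers):
    """Split n channel powers (ordered low to high frequency) into band sums.

    Assumption: the lowest third of the channels forms the low band, the
    highest third forms the high band, and the rest forms the mid band.
    Returns (low, mid, high).
    """
    n = len(channel_powers)
    low_end = n // 3          # first index of the mid band
    high_start = n - n // 3   # first index of the high band
    low = sum(channel_powers[:low_end])
    mid = sum(channel_powers[low_end:high_start])
    high = sum(channel_powers[high_start:])
    return low, mid, high
```

For example, six channel powers `[1, 2, 3, 4, 5, 6]` would be grouped as low = 1 + 2, mid = 3 + 4, and high = 5 + 6 under this assumed split.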

FIG. 1 is an electrical block diagram for explaining one embodiment of the present invention. In the figure, 11 is a microphone, 12 is a preprocessing section, 13 is a bandpass filter group, 14 is a speech power generation section, 15 is a high-range power generation section, 16 is a mid-range power generation section, 17 is a low-range power generation section, 18 is a differential power generation section, 19 is a comparison section, 20 is a power comparison section, and 21 is a section generation section. First, to detect /s/, /p/, /l/, and the like, a consonant such as /s/, /p/, or /l/ is judged to have started when the high-range power is greater than a certain threshold (Thigh) and is much greater than the low-range power.

In most cases this condition means that there is almost no power in the low range and the energy is concentrated in the high range alone, which makes the consonant easy to distinguish from noise and the like. Conversely, for /M/, /N/, and the like, the start point is declared when the low-range power is at or above a threshold (Tlow) and is much greater than the high-range power; in this case the energy is concentrated in the low range and absent from the high range. FIG. 2 illustrates this. If either of the two conditions above holds, the start point is declared immediately, and if neither holds, the differential power spectrum is checked. Combining the tests as an OR condition in this way allows good detection for all words.
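The OR condition described for the first embodiment might be sketched as below. The function and parameter names are hypothetical, and the `ratio` factor is an assumption standing in for "much greater than", which the publication does not quantify; strict comparisons are used throughout for simplicity.

```python
def is_start(low, high, diff_power, t_high, t_low, t_diff, ratio=2.0):
    """Decide whether the current sample is a speech start point.

    low, high   : low-range and high-range band powers of the sample
    diff_power  : differential power spectrum value of the sample
    t_high/t_low: thresholds corresponding to Thigh and Tlow
    t_diff      : threshold for the differential power spectrum
    ratio       : assumed factor expressing "much greater than"
    """
    # High-range consonants such as /s/, /p/, /l/
    high_dominant = high > t_high and high > ratio * low
    # Low-range consonants such as /M/, /N/
    low_dominant = low > t_low and low > ratio * high
    # Otherwise fall back to the differential power spectrum check
    return high_dominant or low_dominant or diff_power > t_diff
```

A sample with strong high-range power and little low-range power is accepted without waiting for the differential power to rise, which is how the gradually rising consonants are meant to be caught.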

FIG. 3 is an electrical block diagram for explaining another embodiment of the present invention. In the figure, 21 is a flip-flop circuit, and parts that act in the same way as in the embodiment shown in FIG. 1 are given the same reference numerals as in FIG. 1.

In the embodiment shown in FIG. 1, when sections are generated under the above conditions, the test for /M/, /N/, and the like also makes it easier to pick up sounds such as the buzz that immediately precedes an utterance, so unwanted information may be added to the speech section. As shown in FIG. 4, this buzz is characterized by a stronger concentration in the low range and a lower power level than /M/, /N/, and the like.

To remove this buzz, the present embodiment compares the low-range power with the mid-range and high-range powers. While the low-range power is much greater than the mid-range and high-range powers and is also smaller than a certain threshold (Tbazz), no start point is declared, even if the start-point conditions described above are satisfied.
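The buzz exclusion of the second embodiment can be illustrated with a small sketch. The names are hypothetical, `ratio` is again an assumed stand-in for "much greater than", and comparing the low-range power against the mid-range and high-range powers separately (rather than against their sum) is also an assumption. The flip-flop circuit of FIG. 3 is not modeled.

```python
def is_buzz(low, mid, high, t_bazz, ratio=2.0):
    """True when the sample looks like pre-utterance buzz.

    Buzz is assumed to be strongly concentrated in the low range
    (low much greater than both mid and high) while staying below
    the threshold corresponding to Tbazz.
    """
    return low > ratio * mid and low > ratio * high and low < t_bazz


def accept_start(start_condition_met, low, mid, high, t_bazz, ratio=2.0):
    """Suppress a start point for as long as the buzz condition holds."""
    return start_condition_met and not is_buzz(low, mid, high, t_bazz, ratio)
```

Under these assumptions a weak, low-range-only sample is rejected even when the start-point conditions hold, while a louder nasal such as /M/ or /N/ whose low-range power is not below the threshold is still accepted.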

Effects

As is clear from the above description, the present invention makes it possible to detect the start of a speech section more stably.

[Brief explanation of drawings]

FIG. 1 is an electrical block diagram for explaining one embodiment of the present invention. FIG. 2 is a diagram for explaining the operation of the embodiment shown in FIG. 1. FIG. 3 is an electrical block diagram for explaining another embodiment of the present invention. FIG. 4 is a diagram for explaining the operation of the embodiment shown in FIG. 3. FIG. 5 is an electrical block diagram for explaining an example of start-point detection using the differential power spectrum. FIGS. 6 and 7 are diagrams for explaining the operation of the electrical block diagram shown in FIG. 5.

11: microphone, 12: preprocessing section, 13: bandpass filter group, 14: speech power generation section, 15: high-range power generation section, 16: mid-range power generation section, 17: low-range power generation section, 18: differential power generation section, 19: comparison section, 20: power comparison section, 21: section generation section.

Claims

[Claims]

(1) A speech section detection method having means for detecting the speech power of speech input from a microphone, means having an n-channel (n = integer) bandpass filter for analyzing frequency and obtaining the power of each channel, and means for detecting the frequency differential power spectrum at each sampling period, characterized in that, in detecting the speech start point, the point at which the speech power of the high-frequency portion is greater than that of the low-frequency portion is taken as the start of the speech section.

(2) A speech section detection method having means for detecting the speech power of speech input from a microphone, means having an n-channel (n = integer) bandpass filter for analyzing frequency and obtaining the power of each channel, and means for detecting the frequency differential power spectrum at each sampling period, characterized in that, in detecting the speech start point, the point at which the speech power of the low-frequency portion is greater than that of the high-frequency portion is taken as the start of the speech section.

(3) The speech section detection method according to claim (2), characterized in that, in detecting the speech start point, when the speech power of the low-frequency portion is greater than the speech power of the other portions and is smaller than a certain threshold, the start point detected under claim (2) is canceled.

(4) A speech section detection method having means for detecting the speech power of speech input from a microphone, means having an n-channel (n = integer) bandpass filter for analyzing frequency and obtaining the power of each channel, and means for detecting the frequency differential power spectrum at each sampling period, characterized in that the speech start point is declared when any one of the following conditions is met: the speech power of the high-frequency portion is greater than that of the low-frequency portion, the speech power of the low-frequency portion is greater than that of the high-frequency portion, or the differential power spectrum is greater than a certain threshold.