JPS61233791A - Voice section detection system for voice recognition equipment

Info

Publication number: JPS61233791A
Application number: JP60075120A
Authority: JP
Inventors: 安田　晴剛
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1985-04-09
Filing date: 1985-04-09
Publication date: 1986-10-18

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field

The present invention relates to a speech section detection method in a speech recognition device.

Prior Art

In conventional speech recognition devices, a speech section is detected by applying a threshold to speech power information or the like. With a fixed threshold, ambient noise makes section detection difficult. With a variable threshold, the threshold rises with the noise level, so speech sections buried in the noise cannot be detected.

FIG. 3 is a diagram for explaining an example of the conventional speech section detection method. FIG. 3(A) shows the case of a fixed threshold and FIG. 3(B) the case of a variable threshold. In the figure, (a) is the speech power, (b) is the threshold, and (c) is the speech section signal. In the variable case, as shown at (b) in FIG. 3(B), the threshold can vary within the range b'.

When a speech section is detected from speech power information or the like in this conventional way, a fixed threshold as in FIG. 3(A) makes the section impossible to detect, and a variable threshold as in FIG. 3(B) rises with the noise level, so useful information in the section may be lost. The portions near the start and the end of the speech nevertheless still contain useful information, and it has been desirable to extract it.
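The two conventional schemes can be illustrated with a minimal sketch. The function names, the power values, and the way the variable threshold is derived (mean power of the first few frames plus a margin) are assumptions made for illustration; the publication only states that the variable threshold rises with the noise level.

```python
def detect_fixed(power, threshold):
    """Mark each frame as speech when its power exceeds a fixed threshold."""
    return [p > threshold for p in power]


def detect_variable(power, base, margin, noise_frames=3):
    """Mark frames as speech against a threshold that tracks the noise level.

    Assumption: the noise level is the mean power of the first
    `noise_frames` frames, and the threshold is that level plus `margin`
    (never lower than `base`).
    """
    noise = sum(power[:noise_frames]) / noise_frames
    threshold = max(base, noise + margin)
    return [p > threshold for p in power], threshold


# Hypothetical per-frame power: noise floor 6, weak onset (8),
# strong vowel (12, 20, 22), weak tail (11, 8), then noise again.
power = [6, 6, 6, 8, 12, 20, 22, 11, 8, 6, 6]

fixed = detect_fixed(power, threshold=5)
variable, used = detect_variable(power, base=5, margin=5)
voiced = [i for i, flag in enumerate(variable) if flag]
```

With these values the fixed threshold lies below the noise floor, so every frame is flagged and no section boundary can be found, while the raised variable threshold keeps only frames 4 to 6 and drops the weak onset and tail.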

Purpose

The present invention was made in view of the circumstances described above. In particular, its purpose is to enable a speech recognition device to detect speech sections stably even in noisy conditions.

Configuration

To achieve the above purpose, the present invention concerns a speech recognition device having: a bandpass filter that analyzes the input speech; means for obtaining a speech power signal that is the sum of the channels of the bandpass filter; means for obtaining frame feature data from the input spectrum analysis data of each channel; means for obtaining a speech section by comparing the speech power signal with a certain threshold; and means for generating a one-word signal that, when the obtained speech section signal contains a silent section of a certain predetermined length, treats that silent section as a word boundary.

The invention is characterized in that, when the start of speech is detected, if the data one frame before the first frame is voiced information and its inter-frame distance is smaller than a certain value, that frame is taken as the first frame, and this is repeated going back over a certain frame section. Alternatively, it is characterized in that, after the start of speech is detected and the speech section is cut out, the feature data following the end is examined, and if the inter-frame distance from the end frame is at or below a certain value, that frame is taken as the true end, and this is repeated until there is no more feature data. The invention is described below on the basis of an embodiment.
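The baseline components listed above (power signal, threshold comparison, one-word signal) can be sketched roughly as follows. This is a minimal illustration, not the circuit of the publication: the function names, the list-based frame representation, and the parameter `min_silence` are all assumptions.

```python
def power_signal(frames):
    """Speech power per frame: the sum of the band-pass channel outputs."""
    return [sum(channels) for channels in frames]


def speech_flags(power, threshold):
    """Speech section signal: True where the power exceeds the threshold."""
    return [p > threshold for p in power]


def split_words(flags, min_silence):
    """Group speech frames into words.

    A silent run of at least `min_silence` frames is treated as a word
    boundary; shorter gaps stay inside the current word.
    Returns (first_frame, last_frame) index pairs.
    """
    words = []
    start = None
    last_voiced = None
    for i, flag in enumerate(flags):
        if not flag:
            continue
        if start is None:
            start = i
        elif i - last_voiced - 1 >= min_silence:
            words.append((start, last_voiced))
            start = i
        last_voiced = i
    if start is not None:
        words.append((start, last_voiced))
    return words


# Hypothetical 2-channel frames.
frames = [[0, 1], [3, 4], [4, 4], [1, 0], [3, 3],
          [0, 0], [1, 0], [0, 1], [5, 2], [4, 4]]
flags = speech_flags(power_signal(frames), threshold=2)
words = split_words(flags, min_silence=3)
```

Here the one-frame gap at frame 3 stays inside the first word, while the three silent frames 5 to 7 separate it from the second word.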

FIG. 1 is an electric circuit diagram for explaining an embodiment of the present invention using the BTSP method. In the figure, 1 is a microphone, 2 is a preprocessing section, 3 is a frequency analysis section, 4 is a feature data extraction section, 5 is a power information section, 6 is a section detection section, and 7 is a section correction section. The BTSP method binarizes the formant information of the speech in each frame period using a least-squares curve.
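One plausible reading of the binarization step is sketched below: a least-squares straight line is fitted to the channel outputs of a frame, and each channel is set to 1 when it lies above that line. The publication does not spell out the procedure, so the choice of a straight line, the strict "above the line" rule, and the name `binarize_frame` are assumptions.

```python
def binarize_frame(spectrum):
    """Binarize one frame of channel outputs against a least-squares line.

    Fits y = slope * x + intercept over channel index x, then returns
    1 for channels strictly above the fitted line and 0 otherwise.
    """
    n = len(spectrum)
    mean_x = (n - 1) / 2
    mean_y = sum(spectrum) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(spectrum))
    slope = sxy / sxx if sxx else 0.0
    intercept = mean_y - slope * mean_x
    return [1 if y > slope * x + intercept else 0
            for x, y in enumerate(spectrum)]


# Hypothetical 6-channel frame with two spectral peaks.
bits = binarize_frame([1, 5, 1, 1, 5, 1])
```

Under this reading the two peaks survive as 1 bits, which is what makes a bit-wise distance between neighbouring frames a usable similarity measure.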

The present invention is described in detail below with reference to FIG. 2. In FIG. 2, (a) is the speech power, (b) is the threshold, (c) is the speech section signal, D is the start point, E is the end point, (d) is the speech data (BTSP), (e) is the inter-frame distance, and (f) is the true speech section.

Suppose that the threshold varies with the ambient noise level and the like, and that the section is detected as illustrated with the threshold in effect at that time. The inter-frame distance between the detected start frame D(0) and the preceding frame D(-1) is computed, and if it is at or below a certain value, frame D(-1) is taken as the true start. The start of an utterance contains sounds with relatively weak speech power, such as consonants, which are likely to be buried in noise. Such frames are strongly correlated with frame 0, so their inter-frame distance is small. Exploiting this, the search proceeds backward until the inter-frame distance becomes large; everything up to that point is treated as part of the speech section, and that point is taken as the true start. Because actual noise has a low correlation with speech, the break point is comparatively clear. The backward search is limited to a certain number n of frames.

The end is handled in the same way: the point where the inter-frame distance becomes large is taken as the true end. This search is likewise limited to n frames from the detected end, and if the feature data runs out within that range, that point is taken as the end. In this way, word-initial and word-final information buried by noise and the like can be detected.
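The start and end correction described above can be sketched as follows. Several details are assumptions made for illustration: frames are binary feature vectors, the inter-frame distance is a Hamming distance, each candidate frame is compared with the current boundary frame, and a frame with no bits set is treated as carrying no feature data. The names `correct_start`, `correct_end`, `max_dist`, and `n` are hypothetical.

```python
def hamming(a, b):
    """Inter-frame distance: number of differing bits (an assumed metric)."""
    return sum(x != y for x, y in zip(a, b))


def correct_start(frames, start, max_dist, n):
    """Move the detected start back by at most n frames.

    A preceding frame is absorbed while it carries feature data and its
    distance to the current first frame is at most max_dist.
    """
    s = start
    for _ in range(n):
        prev = s - 1
        if prev < 0 or not any(frames[prev]):
            break
        if hamming(frames[prev], frames[s]) > max_dist:
            break
        s = prev
    return s


def correct_end(frames, end, max_dist, n):
    """Move the detected end forward by at most n frames, by the same rule."""
    e = end
    for _ in range(n):
        nxt = e + 1
        if nxt >= len(frames) or not any(frames[nxt]):
            break
        if hamming(frames[nxt], frames[e]) > max_dist:
            break
        e = nxt
    return e


# Hypothetical 4-bit frames; the power threshold detected frames 3..5.
frames = [
    [0, 0, 0, 0],  # 0: no feature data
    [1, 0, 1, 0],  # 1: noise, unlike the speech frames
    [0, 1, 1, 0],  # 2: weak onset buried in noise
    [0, 1, 1, 1],  # 3: detected start
    [0, 1, 1, 1],  # 4
    [0, 1, 0, 1],  # 5: detected end
    [0, 1, 0, 0],  # 6: weak tail
    [0, 0, 0, 0],  # 7: no feature data
]
true_start = correct_start(frames, start=3, max_dist=1, n=3)
true_end = correct_end(frames, end=5, max_dist=1, n=3)
```

In this example the start moves back to frame 2 and stops at the dissimilar noise frame 1, and the end moves forward to frame 6 and stops where the feature data runs out.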

Effect

As is clear from the above description, the present invention makes it possible to detect speech sections stably even in noise, without losing word-initial or word-final information.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing an example of an electric circuit used in carrying out the present invention, FIG. 2 is a time chart for explaining an embodiment of the present invention, and FIG. 3 is a signal waveform diagram for explaining an example of the conventional speech section detection method.

1... microphone, 2... preprocessing section, 3... frequency analysis section, 4... feature data extraction section, 5... power information section, 6... section detection section, 7... section correction section.

Claims

[Claims]

(1) A speech section detection method in a speech recognition device having: a bandpass filter that analyzes input speech; means for obtaining a speech power signal that is the sum of the channels of the bandpass filter; means for obtaining frame feature data from the input spectrum analysis data of each channel; means for obtaining a speech section by comparing the speech power signal with a certain threshold; and means for generating a one-word signal that, when the obtained speech section signal contains a silent section of a certain predetermined length, treats that silent section as a word boundary; the method being characterized in that, when the start of speech is detected, if the data one frame before the first frame is voiced information and its inter-frame distance is smaller than a certain value, that frame is taken as the first frame, and this is carried out going back over a certain frame section.

(2) A speech section detection method in a speech recognition device having: a bandpass filter that analyzes input speech; means for obtaining a speech power signal that is the sum of the channels of the bandpass filter; means for obtaining frame feature data from the input spectrum analysis data of each channel; means for obtaining a speech section by comparing the speech power signal with a certain threshold; and means for generating a one-word signal that, when the obtained speech section signal contains a silent section of a certain predetermined length, treats that silent section as a word boundary; the method being characterized in that, after the start of speech is detected, the speech section is cut out, the feature data following its end is examined, and if the inter-frame distance from the end frame is at or below a certain value, that frame is taken as the true end, and this is carried out until there is no more feature data.