JPH0247698A

JPH0247698A - Speech section detection system

Info

Publication number: JPH0247698A
Application number: JP63198162A
Authority: JP
Inventors: Takashi Miki; 三木　敬
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1988-08-09
Filing date: 1988-08-09
Publication date: 1990-02-16
Anticipated expiration: 2011-03-06
Also published as: JPH0823756B2

Abstract

PURPOSE: To reduce errors in speech section detection even in the presence of loud noise by computing the mean noise level and the mean noise variance from the acoustic power values sampled during background noise level measurement, excluding a specified number of values counted from the maximum and a specified number of values counted from the minimum. CONSTITUTION: A microprocessor 30, serving as the threshold calculation part, calculates the mean noise level NL' and the mean noise variance ND' of all acoustic power values PI except a first specified number Nmax of values counted down from the maximum and a second specified number Nmin of values counted up from the minimum, and then calculates a speech segmentation level VL from both. The level is then sent to a speech section detection part 16 through a system bus 36 under the control of the microprocessor 30. Consequently, errors in speech section detection can be reduced in noise environments containing click noise and the like.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a method for detecting speech sections in a speech recognition device.

(Prior Art) The processing in a typical speech recognition device can be broadly divided into two stages: detecting the section of the input acoustic signal in which speech is present (hereinafter, speech section detection processing), and recognizing the content of the detected speech (hereinafter, recognition processing).

To perform these operations, a speech recognition device typically analyzes the input acoustic signal at short intervals called acoustic frames and calculates feature parameters for each frame. Typical feature parameters include the acoustic power and the power spectrum.

Speech section detection exploits the property that the acoustic power is higher in speech sections than in other sections.

One such conventional speech section detection method is disclosed in Japanese Patent Application Laid-Open No. 60-114900. An example configuration of this conventional method is described with reference to FIG. 2.

An acoustic signal input from an external input unit 10, such as a microphone or a telephone, is sampled by the A/D conversion unit 12 and converted into a digital signal sequence. The power calculation unit 14 then calculates the acoustic power P_I (where I is the frame number) for each frame from this digital signal sequence (hereinafter simply the input signal) and sends it to both the speech section detection unit 16 and the threshold setting unit 18. As described later, the threshold setting unit 18 calculates the average noise level from the acoustic power P_I and sends it to the speech section detection unit 16, which detects the speech section from the acoustic power P_I and the average noise level. The recognition unit 20 then performs recognition processing on the speech pattern formed by the acoustic power sequence of the speech section, and the recognition result is sent to an external device 22, such as a computer or another display device as required.

A conventional speech recognition device configured in this way measures the background noise level before the recognition operation in order to set the average noise level used for speech section detection, as noted above. The purpose is to measure the characteristics of the acoustic power in the no-input state and to determine an appropriate threshold for speech section detection.

This process is described below. Based on the acoustic power P_I obtained by the power calculation unit 14 from the acoustic signal input from the external input unit 10, the threshold setting unit 18 calculates the average noise level N_L and the average noise variance N_D. With N as the number of measurement frames, the average noise level N_L and the average noise variance N_D are given by equations (1) and (2), respectively.

Furthermore, the speech segmentation level V_L is determined from the average noise level N_L and the average noise variance N_D according to equation (3) below.

$$V_L = N_L + N_1 \times N_D \tag{3}$$

Here, N_1 is a coefficient determined in advance by the system, typically with a value of about 2 to 4. The speech segmentation level V_L calculated in this way is subsequently used by the speech section detection unit 16.
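Equation (3) can be illustrated with a short Python sketch. Equations (1) and (2) are not reproduced in this text, so the sketch assumes that N_L is the arithmetic mean of the no-input frame powers and that N_D is their population standard deviation. These two choices, the function name `conventional_threshold`, and the default N_1 = 3 are illustrative assumptions, not the patent's own definitions.

```python
import statistics


def conventional_threshold(powers, n1=3.0):
    """Return (N_L, N_D, V_L) computed from no-input frame powers.

    Assumptions (not taken from the source text):
      - N_L is the arithmetic mean of the frame powers.
      - N_D is the population standard deviation of the frame powers.
    """
    n_l = statistics.fmean(powers)           # average noise level N_L
    n_d = statistics.pstdev(powers, mu=n_l)  # dispersion N_D (assumed form)
    v_l = n_l + n1 * n_d                     # equation (3)
    return n_l, n_d, v_l


if __name__ == "__main__":
    # Hypothetical frame powers measured with no speech input.
    noise = [10.0, 11.0, 9.0, 10.0, 10.5, 9.5]
    print(conventional_threshold(noise))
```

A larger N_1 raises the threshold further above the noise floor, which trades missed quiet speech against false detections.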

Next, the conventional speech section detection operation is briefly described.

First, as usual, the acoustic signal input from the external input unit 10 is converted into the input signal by the A/D conversion unit 12, and the power calculation unit 14 then calculates the acoustic power P_I. An example of this acoustic power P_I is shown in FIG. 3, where the vertical axis is the acoustic power P_I and the horizontal axis is the frame number I. The broken line in the figure represents the speech segmentation level V_L. The speech start point and speech end point of the speech section are marked on the frame axis, and V_S and V_E denote the speech start frame and the speech end frame. The frame period is typically about 8 milliseconds.

The speech section detection unit 16 performs the processing of cutting out the speech section described above. Conventionally, the start frame and end frame of the speech section are determined from the acoustic power P_I using conditions ① to ③ below.

① Start condition: When frames satisfying P_I ≥ V_L continue for at least N_2 frames from some frame I onward, where N_2 is a number of frames determined in advance by experience, that frame I is taken as the start frame V_S.

② End condition: After the start frame V_S has been detected, the frame immediately before the first frame at which the following condition holds is taken as the end frame V_E of the speech section.

Frames satisfying P_I < V_L continue for at least N_3 frames from frame I onward, where N_3 is a number of frames determined in advance by experience.

③ Exclusion condition: Furthermore, if the speech section length V_LEN satisfies the following condition, the section is not regarded as a speech section.

$$V_{LEN} < N_4 \quad \text{or} \quad V_{LEN} > N_5, \qquad \text{where } V_{LEN} = V_E - V_S + 1$$

Here, N_4 and N_5 are numbers of frames determined in advance by experience.
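The start, end, and exclusion conditions above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the cited device: the function name `detect_speech_section`, the 0-based frame indices, and the choice to let the section run to the last frame when no end condition is met are all assumptions made for the example.

```python
def detect_speech_section(powers, v_l, n2, n3, n4, n5):
    """Return (V_S, V_E) as 0-based frame indices, or None if no section is accepted.

    powers : per-frame acoustic powers P_I
    v_l    : speech segmentation level V_L
    n2, n3 : run lengths for the start and end conditions
    n4, n5 : minimum and maximum accepted section length in frames
    """
    n = len(powers)

    # Start condition: first frame from which P >= V_L holds for n2 frames in a row.
    start = None
    for i in range(n - n2 + 1):
        if all(p >= v_l for p in powers[i:i + n2]):
            start = i
            break
    if start is None:
        return None

    # End condition: frame just before the first run of n3 frames with P < V_L.
    end = n - 1  # assumption: section extends to the last frame if no end is found
    for i in range(start + n2, n - n3 + 1):
        if all(p < v_l for p in powers[i:i + n3]):
            end = i - 1
            break

    # Exclusion condition: reject sections that are too short or too long.
    length = end - start + 1
    if length < n4 or length > n5:
        return None
    return start, end


if __name__ == "__main__":
    frames = [0, 0, 5, 5, 5, 5, 0, 5, 5, 0, 0, 0, 0]
    print(detect_speech_section(frames, v_l=3, n2=3, n3=3, n4=2, n5=20))
```

Requiring runs of N_2 and N_3 frames means that a single-frame dip below V_L inside an utterance does not end the section.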

(Problems to be Solved by the Invention) The conventional calculation of the speech segmentation level V_L described above assumes that the distribution of the acoustic power of background noise is close to a normal distribution. In a quiet environment this approximation holds well. However, in an environment with a high noise level, or under input conditions such as a signal arriving over a telephone line, noise such as clicks is present that is short in duration but has an extremely high peak acoustic power, so the distribution often departs from this approximation. As a result, as shown in FIG. 4, the distribution gains weight at considerably high acoustic power levels.

Consequently, if such noise happens to occur while the background noise level is being measured, both the average noise level N_L and the average noise variance N_D are overestimated, which causes speech section detection errors. One way to mitigate this phenomenon is to lengthen the measurement time N for the average noise level, but this lengthens the preparation time before recognition can begin and degrades the responsiveness of the speech recognition device itself, so a sufficiently long measurement time N could not be adopted.

An object of the present invention is to provide a speech section detection method that can set a speech segmentation level V_L capable of significantly reducing speech section detection errors even in noise environments containing clicks and similar noise, as described above.

(Means for Solving the Problems) To achieve this object, in the speech section detection method of the present invention, the threshold calculation unit excludes from the acoustic powers P_I a first predetermined number N_max of values, taken in order from the largest, and a second predetermined number N_min of values, taken in order from the smallest. It calculates the average noise level N_L' and the average noise variance N_D' over all the remaining acoustic powers P_I, and then calculates the speech segmentation level V_L from the average noise level N_L' and the average noise variance N_D'.
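The exclusion of the largest and smallest values can be sketched in Python as a trimmed statistic. As before, the exact forms of the mean and dispersion equations are not reproduced in this text, so the sketch assumes an arithmetic mean and a population standard deviation over the retained values. The function name `trimmed_threshold` and the sample data are hypothetical.

```python
import statistics


def trimmed_threshold(powers, n_max, n_min, n1=3.0):
    """Return (N_L', N_D', V_L) after dropping the n_max largest and n_min smallest powers.

    Assumptions (not taken from the source text): N_L' is the arithmetic mean and
    N_D' the population standard deviation of the retained values.
    """
    ordered = sorted(powers)                        # ascending order
    if n_max + n_min >= len(ordered):
        raise ValueError("n_max + n_min must be smaller than the number of frames")
    kept = ordered[n_min:len(ordered) - n_max]      # middle of the distribution
    n_l = statistics.fmean(kept)
    n_d = statistics.pstdev(kept, mu=n_l)
    return n_l, n_d, n_l + n1 * n_d


if __name__ == "__main__":
    # Hypothetical no-input measurement containing one click (500).
    noise = [9, 10, 10, 11, 10, 500, 10, 9, 11, 10]
    print("untrimmed:", trimmed_threshold(noise, n_max=0, n_min=0))
    print("trimmed:  ", trimmed_threshold(noise, n_max=1, n_min=1))
```

With the single click left in, the threshold lands far above the ordinary noise floor; dropping one value from each end keeps it close to the bulk of the distribution, which is the effect the method relies on.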

(Operation) With this configuration, the speech segmentation level V_L is determined using only the acoustic powers in the middle region of the no-speech acoustic power distribution, where the acoustic power is normally concentrated, excluding the high-power side caused by noise such as clicks and the low-power side caused by other noise. An appropriate speech segmentation level V_L can therefore be determined very simply, almost unaffected by noise components with high peak power. As a result, speech section detection errors are reduced, and a speech recognition device with excellent overall recognition performance can be provided.

(Embodiment) An embodiment of the speech section detection method of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram used to explain an embodiment of the speech section detection method of the present invention, and FIG. 5 is a flowchart of the processing in the threshold setting unit.

In FIG. 1, components identical to those shown in FIG. 2 are denoted by the same reference numerals, and their detailed description is omitted.

In FIG. 1, reference numeral 24 denotes a threshold setting unit corresponding to the conventional threshold setting unit 18 shown in FIG. 2, but it differs from the conventional threshold setting unit 18 in function and therefore in internal configuration.

First, the threshold setting unit 24 of this embodiment is described with reference to FIG. 5.

In this embodiment, with no speech input, the power calculation unit 14 first calculates the acoustic power P(I) for each frame I (I = 1, ..., N) and sends it to the threshold setting unit 24 and the speech section detection unit 16.

In the threshold setting unit 24, under the control of the microprocessor 30, these acoustic powers P(I) are transferred from the power calculation unit 14 over the system bus 36 and temporarily stored in the memory areas RMEM(1), RMEM(2), ..., RMEM(N) of the memory 32. Processing starts from the first frame, I = 1 (step S1). Next, it is determined whether I > N (step S2); if I ≤ N, the acoustic power P_1 of the first frame is temporarily stored in memory area RMEM(1) (step S3). The frame number I is then advanced to I = 2 (step S4), processing returns to step S2, and steps S2 and S3 are performed to store the acoustic power P_2 of the second frame (I = 2) in memory area RMEM(2). In this way, each acoustic power P_I up to I = N is stored in turn in its corresponding memory area, up to RMEM(N).

When step S2 determines that I > N, the acoustic powers P_1 to P_N stored in the memory areas RMEM(1) to RMEM(N) of the memory 32 are sorted in ascending order under the control of the microprocessor 30. The result is sent over the system bus 36 to the work memory 34 and stored again, in order of magnitude, in its memory areas SMEM(1), SMEM(2), ..., SMEM(N) (step S5). Thus, for example, memory area SMEM(1) holds the smallest of the acoustic powers P_I, and memory area SMEM(N) holds the largest. That is, in this embodiment, the acoustic powers stored in the memory areas SMEM(J) (J = 1, ..., N) satisfy the following relation.

$$\mathrm{SMEM}(1) \le \mathrm{SMEM}(2) \le \cdots \le \mathrm{SMEM}(N) \tag{4}$$

Next, the microprocessor 30 calculates the average noise level N_L'. For this purpose, a memory (not shown) of the microprocessor 30 stores two counts, each determined in advance by experience: the number N_max of acoustic powers, counted downward from the largest, that are not used in calculating the average noise level, and the number N_min of acoustic powers, counted upward from the smallest, that are likewise not used. The microprocessor 30 itself reads out the stored N_max and N_min and then reads from the work memory 34 all the acoustic powers P_I that remain after excluding those corresponding to these counts (step S6).

Next, the microprocessor 30 calculates the average noise level N_L' according to equation (5) and temporarily stores the result in its memory (step S7).

Next, the microprocessor 30 reads N_max, N_min, and the average noise level N_L' from the memory, calculates the average noise variance N_D' given by equation (6), and temporarily stores the result N_D' in the memory (step S8).

Next, the average noise level N_L', the average noise variance N_D', and the coefficient N_1, which is determined in advance by experience and stored in the memory of the microprocessor 30, are read out, and the speech segmentation level V_L' is obtained according to equation (7) (step S9).

$$V_L' = N_L' + N_1 \times N_D' \tag{7}$$

When the threshold setting unit 24 has completed steps S1 to S9 described above, the resulting speech segmentation level V_L' is sent to the speech section detection unit 16 over the system bus 36 under the control of the microprocessor 30. The measurement time is typically about 0.16 to 0.32 seconds; with a frame period of 8 milliseconds, this gives N = 20 to 40.

N_max and N_min must be set to appropriate values according to the probability of occurrence and the duration of peak noise. Typically, N_max is about 1/10 to 1/50 of the number of measurement frames N, and N_min is preferably about 1/10 to 1/50 of N, or 0.

The speech section detection processing and the recognition processing are the same as in the conventional example, so their description is omitted.

The embodiment described above is merely a preferred example of the present invention, and the invention is clearly not limited to this embodiment.

(Effects of the Invention) As is clear from the above description, according to the speech section detection method of the present invention, the average noise level N_L' and the average noise variance N_D' are obtained from all the acoustic powers P_I sampled during background noise level measurement, excluding the N_max values taken in order from the largest and the N_min values taken in order from the smallest. An appropriate speech segmentation level can therefore be set without being affected by noise, even in environments containing many noise components with high peak power. Speech section detection errors thus become very rare even under high noise, and a recognition device with excellent overall recognition performance can be realized.
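The parameter ranges given in the embodiment (a measurement time of about 0.16 to 0.32 seconds at an 8 millisecond frame period, and exclusion counts of about 1/10 to 1/50 of N, or 0) can be turned into frame counts with a small Python helper. The function names, the use of rounding for the frame count, and the use of integer division for the exclusion counts are assumptions made for this sketch; the source does not say how fractional counts are handled.

```python
def measurement_frames(duration_s, frame_period_s=0.008):
    """Number of measurement frames N for a given measurement time (rounded)."""
    return round(duration_s / frame_period_s)


def exclusion_counts(n, max_divisor=10, min_divisor=10):
    """Return (N_max, N_min) as N / divisor, rounded down (assumed rounding rule).

    Pass None as a divisor to exclude nothing on that side (count 0).
    """
    n_max = n // max_divisor if max_divisor else 0
    n_min = n // min_divisor if min_divisor else 0
    return n_max, n_min


if __name__ == "__main__":
    for seconds in (0.16, 0.32):
        n = measurement_frames(seconds)
        print(seconds, "s ->", n, "frames, (N_max, N_min) =", exclusion_counts(n))
```

With N in the range of 20 to 40, a divisor of 50 yields a count of 0 under this rounding rule, so in practice only divisors near 10 remove any values at these measurement lengths.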

20: recognition unit; 22: external device; 24: threshold setting unit; 30: microprocessor; 32: memory; 34: work memory; 36: system bus.

[Brief Description of the Drawings]

FIG. 1 is a block diagram used to explain the speech section detection method of the present invention, FIG. 2 is a block diagram used to explain the conventional speech section detection method, FIG. 3 shows an example of the speech power used to explain both the present invention and the conventional method, FIG. 4 shows an acoustic power distribution, and FIG. 5 is a flowchart of the processing for calculating the speech segmentation level.

Claims

[Claims]

(1) A speech section detection method for a speech recognition device in which a power calculation unit calculates an acoustic power P_I, at short intervals called frames, from an acoustic signal input from an external input unit, a threshold setting unit calculates an average noise level based on the acoustic power P_I, a speech section is detected from the acoustic power P_I and the average noise level, and a recognition unit performs recognition processing on the speech pattern determined by the speech section and outputs the result to an external device, characterized in that, in detecting the speech section, the power calculation unit measures the acoustic power P_I for a predetermined period with no speech input, and the threshold calculation unit calculates an average noise level N_L' and an average noise variance N_D' over all the acoustic powers P_I remaining after excluding a first predetermined number N_max of acoustic powers, taken in order from the largest value, and a second predetermined number N_min of acoustic powers, taken in order from the smallest value, and then calculates a speech segmentation level V_L from the average noise level N_L' and the average noise variance N_D'.