JPS6223096A

JPS6223096A - Detection of voice section

Info

Publication number: JPS6223096A
Application number: JP60161781A
Authority: JP
Inventors: 金指　久則; 秋場　国夫; 入間野　孝雄; 猛宮川
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1985-07-24
Filing date: 1985-07-24
Publication date: 1987-01-31
Anticipated expiration: 2011-08-07
Also published as: JP2521425B2

Abstract

(57) [Abstract] This publication contains application data from before electronic filing, so no abstract data is recorded.

Description

DETAILED DESCRIPTION OF THE INVENTION (Field of Industrial Application) The present invention relates to a method for detecting speech sections in a speech recognition device.

(Prior Art) FIG. 2 is a functional block diagram for carrying out an example of a noise learning method in a conventional speech recognition device.

The conventional example is described with reference to FIGS. 2 and 3.

In FIG. 2, reference numeral 10 denotes a microphone through which speech or noise is input; a preprocessing section 11 preprocesses the input, and a power calculation section 12 calculates its power. Numeral 13 is a changeover switch that selects between speech recognition mode a and noise learning mode b. Numeral 14 is a noise learning section and 15 is a speech section detection section: the threshold for speech section detection is set in noise learning mode, and the speech section detection section then detects the speech section. A speech recognition section 16 recognizes the speech within the detected section, and 17 is a recognition result output section.

Next, the operation of the conventional example is described. First, before entering speech recognition mode, noise learning mode is selected with the changeover switch 13 and noise learning is performed. In FIG. 2, noise input from the microphone 10 is A/D converted in the preprocessing section 11 and passed through a low-pass filter (LPF) to remove aliasing, giving sample values x. The power calculation section 12 then calculates the power P(J) for each unit of time (hereinafter called a frame) according to equation (1).

P(J): power value of the J-th frame
x(i): i-th sample value within one frame
N: number of samples in one frame

The noise learning section 14 sets the threshold TP for the speech section from P(J) according to equation (2). Here L is the time required for noise learning, a parameter set freely according to the specifications of the recognition device.
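Equation (1) itself is not reproduced in this text, only its symbols. As a minimal sketch, assuming the frame power is the mean of the squared samples in the frame (a common definition, not confirmed by the source), the per-frame calculation could look like the following. The function name `frame_powers` and the parameter `frame_len` (standing in for N) are hypothetical.

```python
from typing import List, Sequence


def frame_powers(samples: Sequence[float], frame_len: int) -> List[float]:
    """Return P(J) for each complete frame of `frame_len` samples.

    Assumption: P(J) is the mean of the squared samples x(i) in frame J.
    The source defines only the symbols of equation (1), not its form.
    """
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    powers = []
    # Step through the signal one whole frame at a time;
    # a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers
```

For instance, `frame_powers([1, -1, 2, 2, 3], 2)` yields two frame powers and ignores the final unpaired sample.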

Note that TP' is the average power of the environmental noise over the learning time.

$$TP = TP' + A = \frac{1}{L}\sum_{J=1}^{L} P(J) + A \qquad (2)$$

TP: threshold for speech section detection
P(J): power value of the J-th frame
L: noise learning time, counted in frames
A: constant

Next, recognition mode is selected with the changeover switch 13 and speech recognition is performed.
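Equation (2) can be written out directly: the threshold is the average noise power over the L learning frames plus the constant A. The sketch below assumes the learning-time frame powers are already available as a list; the names `conventional_threshold` and `margin_a` are illustrative.

```python
from typing import Sequence


def conventional_threshold(noise_powers: Sequence[float], margin_a: float) -> float:
    """TP = (1/L) * sum of P(J) for J = 1..L, plus A, with L = len(noise_powers)."""
    if not noise_powers:
        raise ValueError("need at least one noise frame")
    tp_prime = sum(noise_powers) / len(noise_powers)  # TP': average noise power
    return tp_prime + margin_a
```

Because every learning frame contributes equally to the average, a single very loud frame raises TP for the whole recognition session.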

In FIG. 2, the input speech is preprocessed under the same conditions as in noise learning mode, and its power is calculated. The speech section is then detected from the resulting power time series using the threshold TP obtained in noise learning mode. FIG. 3 shows the time series of P(J) when /akita/ (Akita) is uttered.

In FIG. 2, the threshold TP is used to detect the high-power peak portions S1, S2, S3 and the valley portions P1, P2, P3 lying between the peaks. The corresponding durations s1, s2, s3 and p1, p2, p3 are then tested against the conditions in equation (3) to detect the speech section, its start point S, and its end point E.
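The peak and valley segmentation described above can be sketched as a run-length split of the frame powers around TP. The conditions of equation (3) are not given in this text, so the rule in `detect_section` is a hypothetical stand-in: peaks shorter than `min_peak` frames are ignored, and the section ends when the gap between valid peaks exceeds `max_valley` frames. Both parameters and both function names are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

Run = Tuple[bool, int, int]  # (is_peak, start_frame, length_in_frames)


def split_runs(powers: Sequence[float], tp: float) -> List[Run]:
    """Split the power sequence into alternating peak (P > TP) and valley runs."""
    runs: List[Run] = []
    for j, p in enumerate(powers):
        above = p > tp
        if runs and runs[-1][0] == above:
            is_peak, start, length = runs[-1]
            runs[-1] = (is_peak, start, length + 1)  # extend the current run
        else:
            runs.append((above, j, 1))  # start a new run
    return runs


def detect_section(powers: Sequence[float], tp: float,
                   min_peak: int = 2, max_valley: int = 3) -> Optional[Tuple[int, int]]:
    """Return (S, E) frame indices of the speech section, or None.

    Hypothetical rule standing in for equation (3), which the source omits.
    """
    start = end = None
    for is_peak, run_start, length in split_runs(powers, tp):
        if not is_peak or length < min_peak:
            continue  # valleys and too-short peaks never open or extend a section
        if start is None:
            start = run_start
        elif run_start - end - 1 > max_valley:
            break  # gap too long: the section ended at the previous peak
        end = run_start + length - 1
    return None if start is None else (start, end)
```

With three peaks separated by short valleys, as in an utterance like /akita/, the three peaks are merged into one section running from the first peak's start to the last peak's end.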

FIG. 4 shows the process from setting the threshold TP to detecting the speech section when learning takes place under noise different from that in FIG. 3 and /akita/ (Akita) is uttered. An impulsive noise occurs within the time L required for threshold setting, and its level is higher than the level while the speech is being uttered. As a result the threshold TP is set higher than in the example of FIG. 3, so speech section detection fails: the word-initial /a/ of the actual /akita/ is dropped, leaving /kita/. The conventional method therefore has the drawback that, as in FIG. 4, the speech section is detected incorrectly when the noise level during noise learning differs markedly from the noise level during utterance.
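The failure described for FIG. 4 can be reproduced with a few made-up numbers. In the sketch below all frame powers, the constant A, and the helper names are illustrative assumptions, not values from the patent; the point is only that one impulsive learning frame lifts the averaged threshold above a weak word-initial segment.

```python
from typing import Optional, Sequence


def threshold(noise_powers: Sequence[float], margin_a: float) -> float:
    """Conventional threshold: average learning-time power plus the constant A."""
    return sum(noise_powers) / len(noise_powers) + margin_a


def first_frame_above(powers: Sequence[float], tp: float) -> Optional[int]:
    """Index of the first frame whose power exceeds TP, or None."""
    for j, p in enumerate(powers):
        if p > tp:
            return j
    return None


# Illustrative frame powers (made-up values, arbitrary units).
quiet_learning = [1.0] * 10
impulse_learning = [1.0] * 9 + [91.0]  # one impulsive frame during learning
# Weak word-initial segment, then two stronger segments.
utterance = [1.0, 8.0, 9.0, 1.0, 20.0, 25.0, 1.0, 30.0, 28.0, 1.0]

tp_quiet = threshold(quiet_learning, margin_a=2.0)      # 1.0 + 2.0
tp_impulse = threshold(impulse_learning, margin_a=2.0)  # 10.0 + 2.0

start_quiet = first_frame_above(utterance, tp_quiet)      # weak segment is kept
start_impulse = first_frame_above(utterance, tp_impulse)  # weak segment is lost
```

With the quiet learning data the detected start falls on the weak first segment; with the impulsive frame included it moves to the second segment, which is the analogue of /akita/ being detected as /kita/.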

(Problems to be Solved by the Invention) In the conventional speech section detection method described above, an impulsive noise or the like occurring within the time required for learning the threshold causes the noise level during learning to be overestimated relative to the noise level during utterance, so the threshold is set incorrectly and speech recognition errors follow.

The present invention solves this problem of the prior art, and its object is to provide a speech section detection method that detects speech sections accurately.

(Means for Solving the Problems) To achieve the above object, the present invention, when performing noise learning, sets the threshold not from the noise data of every frame captured during the learning time but only from the noise data that falls within a preset range.

(Operation) According to the present invention, handling the learning noise data selectively reduces threshold setting errors caused by impulsive noise during noise learning, so speech sections are detected accurately and, as a result, speech recognition errors are reduced.

(Embodiment) The configuration of one embodiment of the present invention is described below with reference to FIG. 1.

In FIG. 1, the microphone 1, preprocessing section 2, power calculation section 3, noise learning section 6, speech section detection section 7, speech recognition section 8, and recognition result output section 9 are the same as in the conventional example. Numeral 5 is a noise data selection section.

Next, the operation of the embodiment is described.

First, the mode changeover switch 4 is set to noise learning mode. Noise input from the microphone 1 is A/D converted in the preprocessing section 2, passes through the LPF, and enters the power calculation section 3, which calculates the power of each frame according to equation (1) as in the conventional example. The noise data selection section 5 selects the noise data to be used for setting the speech section detection threshold: only noise data whose level, as input during noise learning, lies within a preset range is used as data for threshold setting.

This range is determined as follows.

In FIG. 3, the portion p3 after the end point E of the speech section is analyzed until the speech section, that is E, has been determined, so the power of each frame in the p3 section has already been calculated. The conventional method discards the p3 data once the speech section is determined. In the present invention, the mean N_E and variance σ_E² of the per-frame power in this section are obtained according to equation (4) and sent to the noise data selection section 5. The noise data selection section 5 then calculates new values of N and σ² according to equation (5) from N_E and σ_E² together with the mean N_P and variance σ_P² of the noise level used when the speech section detection threshold was set.
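The statistics of the trailing section are an ordinary mean and variance of frame powers, which `mean_and_variance` below computes (using the population variance, an assumption). Equation (5) is not reproduced in this text, so `update_range` is only a hypothetical stand-in: it blends the previous values (N_P, σ_P²) with the new ones (N_E, σ_E²) using an assumed `weight`, which does not appear in the source.

```python
from typing import Sequence, Tuple


def mean_and_variance(powers: Sequence[float]) -> Tuple[float, float]:
    """Mean and population variance of the frame powers in the trailing section."""
    if not powers:
        raise ValueError("need at least one frame")
    n = len(powers)
    mean = sum(powers) / n
    variance = sum((p - mean) ** 2 for p in powers) / n
    return mean, variance


def update_range(previous: Tuple[float, float],
                 trailing: Tuple[float, float],
                 weight: float = 0.5) -> Tuple[float, float]:
    """Blend (N_P, sigma_P^2) with (N_E, sigma_E^2) into new (N, sigma^2).

    Assumption: a convex combination with a hypothetical `weight`;
    the actual form of equation (5) is not given in the source.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    n_p, var_p = previous
    n_e, var_e = trailing
    new_mean = (1.0 - weight) * n_p + weight * n_e
    new_variance = (1.0 - weight) * var_p + weight * var_e
    return new_mean, new_variance
```

Whatever the exact update rule, the design point taken from the text is that the frames after E, which the conventional method throws away, are reused to keep the accepted noise range in step with the current environment.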

Using these N and σ², only noise data whose input level lies within the range N±σ is used as noise data for threshold setting.

The initial values of N_P and σ_P² are obtained according to equation (6).

Using the noise data within the range N±σ, a threshold TPx for speech section detection is set according to equation (7), following the same approach as in the conventional example, and the speech section is detected with this threshold TPx.

P(I): value of the I-th noise power within the range N±σ during the learning time L
M: number of noise data samples within the range N±σ during the learning time L
B: constant

FIG. 4 shows the case where the speech section is detected using TPx. In this figure the start point is Sx and the end point is E; unlike the conventional example, the word-initial /a/ of /akita/ is not dropped, and the speech section is detected correctly.
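Equation (7) is not reproduced in this text. Since it is said to follow the same approach as the conventional example, a plausible reading, labeled here as an assumption, is the average of the M selected frame powers plus the constant B. The sketch below combines the N±σ selection with that assumed form; `selective_threshold` and its parameter names are hypothetical.

```python
import math
from typing import Sequence


def selective_threshold(noise_powers: Sequence[float], mean_n: float,
                        variance: float, margin_b: float) -> float:
    """Assumed form of equation (7): TPx = (1/M) * sum of P(I) for I = 1..M, plus B.

    Only learning frames whose power lies within N +/- sigma are averaged.
    """
    sigma = math.sqrt(variance)
    selected = [p for p in noise_powers if abs(p - mean_n) <= sigma]  # the M kept frames
    if not selected:
        raise ValueError("no learning frame lies within N +/- sigma")
    return sum(selected) / len(selected) + margin_b
```

With nine quiet frames of power 1.0 and one impulsive frame of 91.0, and an accepted range centered on 1.0, the impulsive frame is excluded and the threshold stays at the quiet-room value instead of being pulled upward.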

As described above, according to this embodiment, even if an impulsive noise occurs during noise learning, it is not used as learning data unless its level lies within the preset range, so the speech section detection threshold is not set incorrectly. The embodiment therefore has the advantage that speech sections are detected accurately.

(Effects of the Invention) As is clear from the above description, when performing noise learning the present invention sets the threshold not from the noise data of every frame captured during the learning time but only from the noise data within a preset range. This reduces threshold setting errors for speech section detection and allows speech sections to be detected accurately. Because speech sections are detected accurately, the speech recognition rate also improves.

[Brief Description of the Drawings]

FIG. 1 is a schematic block diagram of a speech recognition device according to one embodiment of the present invention. FIG. 2 is a schematic block diagram of a conventional speech recognition device. FIG. 3 shows the change over time of noise power and speech power when /akita/ is uttered at a certain noise level. FIG. 4 shows the change over time of noise power and speech power when /akita/ is uttered in an environment different from that of FIG. 3.

1: microphone, 2: preprocessing section, 3: power detection section, 4: changeover switch, 5: noise data selection section, 6: noise learning section, 7: speech section detection section, 8: speech recognition section, 9: recognition result output section.

Patent applicant: Matsushita Electric Industrial Co., Ltd.

Claims

[Claims]

(1) A speech section detection method in which the noise level is learned and the threshold for speech section detection is set adaptively, wherein a noise level range is limited in advance on the basis of the mean value and the magnitude of fluctuation of the noise level, and during noise learning only noise level data within that range is used as data for setting the speech section detection threshold.

(2) The speech section detection method according to claim (1), wherein the preset noise level range is determined using the portion after the end point found by speech section detection as input data for determining the range.