JPH071433B2

JPH071433B2 - Voice start time detection method

Info

Publication number: JPH071433B2
Application number: JP61101806A
Authority: JP
Inventors: 博行関根; 潤一瀧口
Original assignee: Nippon Mektron KK
Current assignee: Nippon Mektron KK
Priority date: 1986-05-01
Filing date: 1986-05-01
Publication date: 1995-01-11
Anticipated expiration: 2010-01-11
Also published as: JPS62258499A

Description

DETAILED DESCRIPTION OF THE INVENTION

[Object of the Invention]

(Industrial Field of Application) The present invention relates to a voice start time detection method for detecting the time at which a voice starts.

(Prior Art) A voice recognition device performs spectral analysis on an input voice waveform and stores the result in a pattern memory as a voice input pattern. It compares this voice input pattern with standard patterns registered in a dictionary in advance, computes the similarity to each, and outputs the standard pattern with the highest similarity as the recognition result.

Such voice recognition devices include word voice recognition devices, which recognize the input voice in units of words, and monosyllable voice recognition devices, which recognize it in units of single syllables. In both cases the start time of the voice must be determined. Conventionally, the device simply checked whether the level of the voice waveform exceeded a predetermined value and took the time at which it did as the voice start time.

(Problems to be Solved by the Invention) The conventional method, however, cannot reliably distinguish noise from voice. Sharp noise in particular is hard to tell apart from a consonant, so the voice start time is detected incorrectly and, as a result, the voice is misrecognized.
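The failure mode described above can be illustrated with a minimal sketch of a conventional level check. The function name `level_start_index`, the sample data, and the level of 50 are assumptions made for this illustration and do not appear in the specification.

```python
from typing import Optional, Sequence


def level_start_index(samples: Sequence[float], level: float) -> Optional[int]:
    """Return the index of the first sample whose magnitude exceeds `level`.

    This mimics the conventional method: one sample above the level is
    enough to declare the start of voice.
    """
    for i, s in enumerate(samples):
        if abs(s) > level:
            return i
    return None


# Hypothetical input: silence with a single sharp click and no voice at all.
quiet_with_click = [0, 1, -1, 90, 0, 1, 0, -1]
print(level_start_index(quiet_with_click, level=50))  # 3
```

The single click at index 3 is reported as a voice start, which is the kind of false detection the invention aims to avoid.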

The present invention has been made in view of the above circumstances, and its object is to provide a voice start time detection method capable of detecting the voice start time accurately.

[Structure of Invention]

(Means for Solving the Problems) To achieve the above object, the voice start time detection method according to the present invention extracts sampling values of the input voice waveform, a plurality at a time, at each of a plurality of time points spaced at fixed time intervals backward along the time axis from the current time point; calculates, for each time point, a sampling value sum, which is the sum of the absolute values of the extracted sampling values; and, when all of these sampling value sums exceed a predetermined threshold value, determines the voice start time with reference to the current time point.

(Embodiment) A method of cutting out the consonant of a monosyllabic voice using a voice start time detection method according to one embodiment of the present invention will be described with reference to the flowchart of FIG. 1. In this embodiment, sampling is performed every 100 μsec, 128 sampling values Sᵢ constitute one frame, and voice recognition is performed in units of one frame.

First, an interrupt occurs every 100 μsec, the voice waveform is sampled, and the sampling value Sᵢ is stored in memory (step 10).
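Step 10 can be sketched as a fixed-size history buffer filled once per sampling interrupt. This is only an illustration: the class name `SampleBuffer`, its methods, and the choice of keeping exactly three frames plus the current sample are assumptions, not details given in the specification.

```python
from collections import deque

FRAME = 128              # samples per frame (from the embodiment)
HISTORY = 3 * FRAME + 1  # assumed depth: S_-384 through S_0


class SampleBuffer:
    """Keeps the most recent samples; the newest one plays the role of S_0."""

    def __init__(self, size: int = HISTORY) -> None:
        self._buf = deque(maxlen=size)

    def on_sample(self, value: float) -> None:
        """Called once per 100 microsecond sampling interrupt (step 10)."""
        self._buf.append(value)

    def is_full(self) -> bool:
        return len(self._buf) == self._buf.maxlen

    def s(self, n: int) -> float:
        """Return S_-n, the value sampled n intervals before the newest one."""
        return self._buf[-1 - n]


buf = SampleBuffer()
for v in range(400):     # stand-in for 400 successive interrupts
    buf.on_sample(v)
print(buf.s(0), buf.s(384))  # 399 15
```

A bounded `deque` discards the oldest sample automatically, so memory use stays constant however long the device listens.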

Next, the calculation for determining the voice start time is performed (step 11). Let S₀ be the sampling value at the time of the interrupt and S₋ₙ the sampling value taken 100 × n μsec earlier. The sampling value sums R₁, R₂, and R₃ given by the following equation are then obtained.
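The equation itself is not reproduced in this text. One form consistent with the m = 8 example given in the description (128 samples per frame, with each sum running from the frame boundary toward the present) is offered here as a reconstruction, not as the published formula:

```latex
R_k = \sum_{i=0}^{m} \left| S_{-128k + i} \right|, \qquad k = 1, 2, 3
```

With m = 8 and k = 1 this covers S₋₁₂₈ through S₋₁₂₀. The inclusive upper limit m is inferred from that example.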

For example, with m = 8, as shown in FIG. 2, the sampling value sum R₁ is the sum of the absolute values of the sampling values S₋₁₂₈ to S₋₁₂₀, located one frame before the current time; R₂ is the sum of the absolute values of S₋₂₅₆ to S₋₂₄₈, two frames before; and R₃ is the sum of the absolute values of S₋₃₈₄ to S₋₃₇₆, three frames before.
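The m = 8 example can be sketched in code as follows. The function name `sampling_value_sums` and the `history` layout (last element is S₀) are assumptions for illustration, and the window of m + 1 samples per frame is inferred from the ranges quoted in the example rather than from the equation, which is not available in this text.

```python
from typing import List, Sequence

FRAME = 128  # samples per frame (from the embodiment)


def sampling_value_sums(history: Sequence[float], m: int = 8,
                        frames: int = 3) -> List[float]:
    """Return [R_1, ..., R_frames] for a history whose last element is S_0."""
    newest = len(history) - 1
    if newest < frames * FRAME:
        raise ValueError("history is shorter than the required look-back")
    sums = []
    for k in range(1, frames + 1):
        first = newest - k * FRAME               # index of S_{-128k}
        window = history[first:first + m + 1]    # S_{-128k} .. S_{-128k+m}
        sums.append(sum(abs(s) for s in window))
    return sums


# Hypothetical history of 385 samples (S_-384 .. S_0), silent except
# for the three windows examined by the method.
history = [0.0] * 385
history[256:265] = [3.0] * 9    # S_-128 .. S_-120
history[128:137] = [-2.0] * 9   # S_-256 .. S_-248
history[0:9] = [1.0] * 9        # S_-384 .. S_-376
print(sampling_value_sums(history))  # [27.0, 18.0, 9.0]
```

Only a few samples per frame are summed, so the cost of each check stays small.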

Next, it is determined whether all of the sampling value sums R₁, R₂, and R₃ are larger than a predetermined threshold value R_TH (step 12). If even one of the sampling value sums R₁, R₂, R₃ is smaller than the threshold value R_TH, the input is judged to be noise and the process returns to step 10. If all of the sampling value sums R₁, R₂, R₃ are larger than the threshold value R_TH, the input is judged to be actual voice, and the voice start time is determined in step 13. In this embodiment, the time point three frames before the current time, that is, the time of the sampling value S₋₃₈₄, is taken as the voice start time.
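Steps 12 and 13 can be sketched as a single decision function. The name `detect_start_offset`, the parameter `r_th`, and the numeric values below are assumptions for illustration. The specification does not say how a sum exactly equal to R_TH is handled; this sketch treats it as noise.

```python
from typing import Optional, Sequence

FRAME = 128  # samples per frame (from the embodiment)


def detect_start_offset(sums: Sequence[float], r_th: float,
                        lookback_frames: int = 3) -> Optional[int]:
    """Return the voice start offset relative to S_0, or None for noise.

    The offset is -384 in the embodiment (three frames back).
    """
    if all(r > r_th for r in sums):        # step 12: every sum must exceed R_TH
        return -lookback_frames * FRAME    # step 13: start time is S_-384
    return None                            # judged as noise, back to step 10


print(detect_start_offset([27.0, 18.0, 9.0], r_th=5.0))  # -384
print(detect_start_offset([27.0, 0.0, 0.0], r_th=5.0))   # None
```

The second call models a short burst that raises only the most recent sum. Because the older two windows are still quiet, the burst is rejected as noise.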

Next, a predetermined number of frames counted from this voice start time, for example 8 frames, is cut out as the consonant portion (step 14).
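Step 14 can be sketched as a simple slice. The function name `cut_consonant` and the use of an absolute index into a recorded sample list are assumptions for illustration. With the embodiment's figures, 8 frames correspond to 1024 samples, or 102.4 msec at 100 μsec per sample.

```python
from typing import List, Sequence

FRAME = 128  # samples per frame (from the embodiment)


def cut_consonant(samples: Sequence[float], start_index: int,
                  n_frames: int = 8) -> List[float]:
    """Return `n_frames` frames of samples beginning at the voice start."""
    end = start_index + n_frames * FRAME
    if start_index < 0 or end > len(samples):
        raise ValueError("recording does not cover the requested span")
    return list(samples[start_index:end])


recording = list(range(2000))   # hypothetical recorded samples
segment = cut_consonant(recording, start_index=100)
print(len(segment), segment[0], segment[-1])  # 1024 100 1123
```

Because the start time lies three frames in the past at the moment of detection, five of the eight frames are still to be sampled, so a caller would have to keep recording before taking the slice.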

As described above, this embodiment judges whether the input is voice over a relatively long period of 38.4 msec (three frames of 128 samples at 100 μsec per sample), so sharp noise of large amplitude can be distinguished from voice.

The present invention is not limited to the above embodiment, and various modifications are possible. For example, any number of sampling values may be used to calculate each sampling value sum. The number of frames examined to judge whether the voice has started is not limited to three and may be any number. Likewise, the number of frames before the current time at which the voice start time is placed is not limited to three.

Furthermore, although the above embodiment applies the method to cutting out the consonant of a monosyllabic voice, the present invention can of course also be applied to detecting the start time of word voice or continuous voice.

[Effect of the Invention]

As described above, according to the present invention, the voice start time can be detected accurately.

[Brief description of drawings]

FIG. 1 is a flowchart of a voice start time detection method according to one embodiment of the present invention, and FIG. 2 is a waveform diagram showing an input voice waveform.

Claims

1. A voice start time detection method characterized in that sampling values of an input voice waveform are extracted, a plurality at a time, at each of a plurality of time points spaced at fixed time intervals backward along the time axis from the current time point; a sampling value sum, which is the sum of the absolute values of the extracted sampling values, is calculated for each time point; and, when all of these sampling value sums exceed a predetermined threshold value, the voice start time is determined with reference to the current time point.