JP2625682B2 - Voice section start detection device

Info

Publication number: JP2625682B2
Application number: JP61223147A
Authority: JP
Inventors: 正明北野; 正宏浜田; 博之直野
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1986-09-19
Filing date: 1986-09-19
Publication date: 1997-07-02
Anticipated expiration: 2012-07-02
Also published as: JPS6377095A

Description

[Detailed description of the invention]

Field of industrial application

The present invention relates to a device for detecting the start of a voice section, used to input speech to a speech recognition device.

Prior art

In recent years, with advances in speech information processing such as speech recognition and in LSI technology, speech recognition devices have begun to be used in consumer and industrial equipment, and various voice section detection devices used for input to such speech recognition devices have been studied (for example, Japanese Examined Patent Publication No. Sho 58-120297).

An example of the conventional voice section detection device mentioned above is described below with reference to the drawings.

FIG. 3 is a block diagram of an example of a conventional voice section detection device.

In FIG. 3, reference numeral 2 denotes a start determination unit that detects the start of the input voice, and 3 denotes an end determination unit that detects the end of the input voice.

The operation of the voice section detection device configured as described above is explained below.

First, the start determination unit 2 compares the energy of the input voice with a predetermined threshold to determine the start. Next, the end determination unit 3 compares the energy of the input voice with a predetermined threshold to determine the end.

Problems to be solved by the invention

However, with the above configuration, the detected start of the voice section varies as the level of the input voice varies, which is an obstacle to realizing a high-quality speech recognition device.
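The conventional scheme and its weakness can be illustrated with a minimal sketch. The function name detect_fixed, the per-frame energy values, and the threshold of 10 are all hypothetical; the source only states that the energy is compared with a predetermined threshold.

```python
def detect_fixed(energies, threshold):
    """Return (start, end) frame indices using one fixed energy threshold.

    start: first frame whose energy reaches the threshold.
    end:   first later frame whose energy falls below it,
           or len(energies) if the energy never falls below it.
    Returns None if no frame reaches the threshold.
    """
    start = None
    for i, e in enumerate(energies):
        if start is None:
            if e >= threshold:
                start = i
        elif e < threshold:
            return start, i
    if start is None:
        return None
    return start, len(energies)


# Hypothetical per-frame energies of the same word, spoken loudly and quietly.
loud = [1, 4, 12, 40, 90, 60, 20, 5, 1]
quiet = [e / 4 for e in loud]  # same shape, one quarter of the level

print(detect_fixed(loud, 10))   # (2, 7)
print(detect_fixed(quiet, 10))  # (3, 6)
```

With the same fixed threshold, the quieter utterance is detected one frame later and ends one frame earlier, which is the kind of level-dependent variation described above.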

In view of the above problem, the present invention provides a high-quality voice section start detection device that corrects the variation in start detection caused by variation in the level of the input voice.

Means for solving the problems

To achieve the above object, the present invention comprises: a memory buffer that constantly stores the input voice; a maximum voice detection unit that detects the maximum energy of the input voice; a threshold setting unit that sets a voice detection threshold based on the maximum energy of the input voice; and a start determination unit that determines the start of the incoming voice using the set threshold and, at the point when the maximum voice detection unit detects, in the voice input so far, a value that is both the maximum so far and a local maximum, redoes the start determination using the threshold newly set by the threshold setting unit.

Operation

With this configuration, the start of the incoming voice is determined using the set threshold and, at the point when the maximum voice detection unit detects a value that is both the maximum so far and a local maximum in the voice input so far, the start determination is redone using the threshold newly set by the threshold setting unit. This makes it possible to reduce, in real time, erroneous start detections caused by fluctuations in the average input level, and to detect the start more accurately.
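The source does not state how the threshold is derived from the maximum energy. As a minimal sketch, assume the threshold is a fixed fraction of the peak energy with a lower bound; the names set_threshold and find_start, the parameters ratio and floor, and the energy values are hypothetical and chosen only for illustration.

```python
def set_threshold(max_energy, ratio=0.1, floor=0.5):
    # Assumed rule (not from the source): a fixed fraction of the peak
    # energy, never lower than a small floor value.
    return max(floor, ratio * max_energy)


def find_start(energies, threshold):
    # First frame whose energy reaches the threshold, or None.
    for i, e in enumerate(energies):
        if e >= threshold:
            return i
    return None


# Hypothetical per-frame energies of the same word at two input levels.
loud = [1, 4, 12, 40, 90, 60, 20, 5, 1]
quiet = [e / 4 for e in loud]

print(find_start(loud, set_threshold(max(loud))))    # 2
print(find_start(quiet, set_threshold(max(quiet))))  # 2
```

Because the threshold scales with the peak, both utterances yield the same start frame under this assumed rule, whereas a single fixed threshold would place the start at a level-dependent frame.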

Embodiment

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of a voice section start detection device according to one embodiment of the present invention, and FIG. 2 is a flowchart showing its operation.

In FIG. 1, reference numeral 1 denotes a memory buffer, which constantly stores the input voice. The memory buffer 1 is loop-shaped, and its size is large enough to hold the longest word among the words to be recognized by the recognition device. The input voice continues to be stored in the memory buffer 1 until the end is detected. Reference numeral 2 denotes a start determination unit, which determines the start of the input voice by comparing the energy of the input voice, or the energy of the voice stored in the memory buffer 1, with a threshold. Reference numeral 3 denotes an end determination unit, which determines the end by comparing the energy of the input voice, or the energy of the voice stored in the memory buffer 1, with a threshold, and causes the memory buffer 1 to output the voice data from the start to the end. Reference numeral 4 denotes a maximum voice detection unit, which detects the energy of the input voice that is both the maximum up to that point and a local maximum, and sends this maximum energy to the threshold setting unit 5. Using the newly set threshold, the start determination unit 2 and the end determination unit 3 then read from the memory buffer 1 the voice energy from the start previously determined by the start determination unit 2 up to the present, and perform start detection or end detection.
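A loop-shaped memory buffer of the kind described for memory buffer 1 might look like the following minimal sketch. The class name RingBuffer, its methods, and the use of absolute frame indices are assumptions for illustration; the source specifies only that the buffer is loop-shaped and sized for the longest word to be recognized.

```python
class RingBuffer:
    """Fixed-capacity buffer that overwrites its oldest frame when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [None] * capacity
        self.count = 0  # total number of frames written so far

    def write(self, frame):
        # The slot is chosen modulo the capacity, so writes wrap around.
        self.data[self.count % self.capacity] = frame
        self.count += 1

    def read(self, start, end):
        """Return frames with absolute indices start .. end-1.

        Raises IndexError if any requested frame has been overwritten
        or has not been written yet.
        """
        oldest = max(0, self.count - self.capacity)
        if start < oldest or end > self.count or start > end:
            raise IndexError("requested frames are not in the buffer")
        return [self.data[i % self.capacity] for i in range(start, end)]


buf = RingBuffer(4)
for i in range(6):
    buf.write(i * 10)

print(buf.read(2, 6))  # [20, 30, 40, 50]
```

Addressing frames by absolute index lets a previously determined start frame be re-read later, as long as it has not yet been overwritten.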

The operation of the voice section start detection device configured as described above is explained below with reference to FIGS. 1 and 2.

In FIG. 2, steps are referred to by step number (hereinafter abbreviated S).

When one frame of voice is input (S11), the input voice is written into the memory buffer 1 (S12). At the same time, the maximum voice detection unit 4 evaluates the voice energy of this input (S13); if the input voice energy is the maximum, the threshold setting unit 5 sets the threshold (S15). The start determination unit 2 and the end determination unit 3 then determine the start and the end (S16). If the end has been reached, voice section detection finishes; otherwise the process returns to S11 and waits for the next frame (S17).

On the other hand, if the maximum voice detection unit 4 evaluates the voice energy of the input (S13) and the input voice energy is not the maximum, the maximum voice detection unit 4 further determines whether the frame is the one immediately following the maximum (S14). If it is, the voice energy from the previously determined start point up to the present (S19) is read from the memory buffer 1 (S18), and the start determination unit 2 and the end determination unit 3 redo the start determination and set the end (S20). If the maximum voice detection unit 4 does not determine that the frame is the one following the maximum-volume frame, the start determination unit 2 and the end determination unit 3 determine the start and the end (S16).
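The S11 to S20 flow can be sketched as a single loop over per-frame energies. This is an illustrative reading of the flowchart, not the patented implementation: the function name detect_section, the threshold rule (a fraction ratio of the running maximum with a lower bound floor), the plain list standing in for memory buffer 1, and the energy values are all assumptions.

```python
def detect_section(energies, ratio=0.1, floor=0.5):
    """Return (start, end) frame indices, or (start, None) if the input
    runs out before an end is found. ratio and floor are assumed values."""
    history = []       # stand-in for the loop-shaped memory buffer 1
    max_energy = None  # running maximum tracked by unit 4
    max_index = None
    threshold = floor  # value produced by unit 5
    start = None

    for i, e in enumerate(energies):              # S11: one frame arrives
        history.append(e)                         # S12: store it
        if max_energy is None or e > max_energy:  # S13: new maximum?
            max_energy, max_index = e, i
            threshold = max(floor, ratio * e)     # S15: assumed rule
        elif i == max_index + 1:                  # S14: frame after the maximum
            lo = start if start is not None else 0   # S19: previous start
            start = None
            for j in range(lo, i + 1):            # S18, S20: rescan the buffer
                if start is None:
                    if history[j] >= threshold:
                        start = j
                elif history[j] < threshold:
                    return start, j
            continue
        # S16: ordinary determination on the current frame
        if start is None:
            if e >= threshold:
                start = i
        elif e < threshold:
            return start, i                       # S17: end reached
    return start, None


# Hypothetical energies of the same word at two input levels.
loud = [1, 4, 12, 40, 90, 60, 20, 5, 1]
quiet = [e / 4 for e in loud]

print(detect_section(loud))   # (2, 7)
print(detect_section(quiet))  # (2, 7)
```

In the loud case the start is first placed at frame 0 under the initial low threshold and is then moved to frame 2 when the frame following the peak triggers the rescan. A practical detector would probably need further conditions, such as a minimum duration, which the source does not describe.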

As described above, according to this embodiment, the input voice is stored in the memory buffer 1, the threshold setting unit 5 sets the voice detection threshold from the maximum voice energy detected by the maximum voice detection unit 4, and the start of the voice section is detected with that threshold, so that high-quality start detection can be performed. Furthermore, as soon as the maximum voice detection unit 4 detects a voice energy that is both the maximum and a local maximum, the start determination unit 2 and the end determination unit 3 evaluate the voice energy stored in the memory buffer 1, so voice section detection can be performed in real time. In addition, because the memory buffer 1 stores the input voice in a loop, its capacity need only be large enough to hold the longest word among those to be recognized by the speech recognition device, so a small memory suffices.

Effects of the invention

As described above, the present invention makes it possible to reduce, in real time, erroneous start detections caused by fluctuations in the average input level, and to detect the start more accurately.

[Brief description of the drawings]

FIG. 1 is a block diagram of a voice section start detection device according to one embodiment of the present invention; FIG. 2 is a flowchart showing the operation of the voice section start detection device according to one embodiment of the present invention; and FIG. 3 is a block diagram of a conventional voice section detection device.

1: memory buffer, 2: start determination unit, 3: end determination unit, 4: maximum voice detection unit, 5: threshold setting unit.

Continuation of the front page

(56) References: JP-A-57-97599 (JP, A); JP-A-60-39691 (JP, A); JP-A-61-223796 (JP, A); JP-A-59-111697 (JP, A); JP-A-60-499 (JP, A)

Claims

(57) [Claims]

1. A voice section start detection device comprising: a memory buffer that constantly stores input voice; a maximum voice detection unit that detects the maximum energy of the input voice; a threshold setting unit that sets a voice detection threshold based on the maximum energy of the input voice; and a start determination unit that determines the start of the incoming voice using the set threshold and, at the point when the maximum voice detection unit detects, in the voice input so far, a value that is both the maximum so far and a local maximum, redoes the start determination using the threshold newly set by the threshold setting unit.

2. The apparatus according to claim 1, wherein the memory buffer stores the input voice in a loop.

3. The voice section start detection device according to claim 1, wherein the capacity of the memory buffer is large enough to store the longest word among the words to be recognized by the speech recognition device.