JPS59121099A

JPS59121099A - Voice section detector

Info

Publication number: JPS59121099A
Application number: JP57227708A
Authority: JP
Inventors: 浮田　輝彦; 恒雄新田
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1982-12-28
Filing date: 1982-12-28
Publication date: 1984-07-12
Also published as: JPH045198B2

Abstract

(57) [Summary] This publication contains application data filed before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] [Technical Field of the Invention] The present invention relates to a voice section detection device that can reliably detect the voice sections of a continuously uttered input speech signal without being affected by so-called tongue clicks or breath leaks.

[Technical background of the invention and its problems]

When recognizing continuously uttered input speech, it is very important to detect the voice sections accurately. Conventionally, voice section detection has been performed in roughly two ways. The first method focuses on the power of the input signal and uses two thresholds: one threshold detects the sections in which speech is certainly present, and the other threshold is then used to find the accurate voice section.
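The two-threshold method described above can be sketched as follows. This is a minimal illustration, not the prior-art implementation itself: the function name `two_threshold_sections`, the per-frame `power` list, and the threshold values are all assumptions chosen for the example, and `low <= high` is assumed.

```python
def two_threshold_sections(power, high, low):
    """Find sections as (first_frame, last_frame) pairs.

    A frame with power >= high marks speech that is certainly present;
    the section is then widened in both directions while power >= low.
    """
    sections = []
    n = len(power)
    i = 0
    while i < n:
        if power[i] >= high:
            start = i
            while start > 0 and power[start - 1] >= low:
                start -= 1          # widen toward the past
            end = i
            while end + 1 < n and power[end + 1] >= low:
                end += 1            # widen toward the future
            sections.append((start, end))
            i = end + 1
        else:
            i += 1
    return sections


# Hypothetical frame powers: one utterance followed by a single loud frame.
power = [0, 1, 3, 6, 7, 4, 2, 0, 0, 9, 0]
print(two_threshold_sections(power, high=5, low=2))  # -> [(2, 6), (9, 9)]
```

In this toy input the isolated loud frame at index 9 is also reported as a section, because power alone carries no information about what produced it.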

However, this method has the problem that even high-power impulsive noise is erroneously judged to be speech.

The second method exploits the fact that the zero-crossing count of the input signal and its rough spectral shape differ from those of background noise, and determines voice sections using feature parameters that can be distinguished from noise. However, no parameter has been found that decisively characterizes speech, which varies widely and includes both voiced and unvoiced sounds, so this method offers only a slight improvement over the first method described above.
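As an illustration of one of the parameters mentioned above, the zero-crossing count of a frame can be computed as below. The function name `zero_crossings` and the convention of treating a sample of exactly zero as non-negative are assumptions made for this sketch.

```python
def zero_crossings(frame):
    """Count sign changes between consecutive samples of one frame."""
    count = 0
    for prev, cur in zip(frame, frame[1:]):
        if (prev >= 0) != (cur >= 0):
            count += 1
    return count


# Hypothetical samples: two sign changes (0.2 -> -0.1 and -0.3 -> 0.4).
print(zero_crossings([0.5, 0.2, -0.1, -0.3, 0.4]))  # -> 2
```

A noise-like frame tends to give a high count and a voiced frame a low one, but as the text notes, no single parameter of this kind separates all speech from all noise.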

Recently, therefore, it has been studied to classify phonemes using the frequency feature parameters of the input speech and to detect voice sections by using this classification together with the power information of the input signal. That is, the frequency structure of phonemes is registered in the device in advance as knowledge, and the fact that the frequency structure of noise does not resemble that of phonemes is exploited. For example, phoneme classification is performed on speech candidate sections obtained from the power information of the input signal, and whether a candidate is a voice section is judged from the resulting time series of phoneme labels.

However, this phoneme classification only examines the frequency structure at a given instant of the acoustic analysis. As a result, so-called tongue clicks, whose frequency structure resembles that of phonemes such as /l f/ and /B/, and breath-leak noise, whose frequency structure is similar to that of the phoneme /h/, are misjudged and given incorrect phoneme labels. Once such errors enter the phoneme labeling, the noise can no longer be removed even if the temporal structure of the phoneme labels is examined in later processing. Thus it has conventionally been very difficult to determine and detect correctly the voice sections of a continuously input speech signal.

[Purpose of the invention]

The present invention has been made in view of these circumstances. Its object is to provide a voice section detection device that can detect the voice sections in an input signal correctly and stably without being affected by noise such as tongue clicks and breath leaks.

[Summary of the invention]

In the present invention, the temporal structures of the feature parameters in the partial sections that respectively contain the start point and the end point of a voice section are registered in advance as standard patterns. The similarity between the time series of feature parameters obtained by analyzing the input signal and each of these standard patterns is calculated, and the start point and end point of the voice section in the input signal are each determined and detected according to the similarity values.

[Effect of the invention]

Thus, according to the present invention, not only the frequency feature structure at the start and end points of a voice section but also its temporal structure is given as a standard pattern at the feature parameter level. Noise such as tongue clicks and breath leaks, which resembles speech in frequency structure but differs from it in temporal structure, can therefore be excluded with high accuracy, and the start and end points of the voice section can each be determined accurately and stably so that the voice section in the input signal is detected.

Consequently, speech recognition can be performed reliably on the basis of the correctly detected voice sections, which is of great practical benefit.

[Embodiments of the invention]

An embodiment of the present invention will now be described with reference to the drawings.

FIG. 1 is a schematic configuration diagram of the apparatus of the embodiment, in which 1 is an acoustic analysis unit. The acoustic analysis unit 1 consists of a well-known filter bank made up of a plurality of band-pass filters that divide, for example, the speech band into about 2 to 8 bands. It performs spectrum analysis of the input signal, obtains feature parameters corresponding to frequency at regular time intervals, and holds them. The time series of feature parameters of the input signal obtained through the acoustic analysis unit 1 is supplied to a similarity calculation unit 2, which calculates its similarity to the standard patterns registered in advance in a standard pattern storage unit 3. The standard patterns registered in advance in the storage unit 3 are given as time series of feature parameters of the partial sections that respectively contain the start time and the end time of a voice section of reference speech. The feature parameters forming a standard pattern may be sampled continuously over a predetermined analysis time or at discontinuous instants; the essential point is that the endpoint is positioned at the center of the partial section in the time direction.
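The idea of a standard pattern whose endpoint sits at the temporal center of a partial section can be sketched as follows. The text does not say how the patterns are built from reference speech; averaging several endpoint-centered windows is an assumption made here purely for illustration, as are the names `centered_window` and `average_pattern` and the layout of `features` as a list of per-frame channel values.

```python
def centered_window(features, t, half_width):
    """Flatten frames t-half_width .. t+half_width into one vector.

    Returns None when the window would extend past either end.
    """
    lo, hi = t - half_width, t + half_width + 1
    if lo < 0 or hi > len(features):
        return None
    return [value for frame in features[lo:hi] for value in frame]


def average_pattern(features, endpoint_times, half_width):
    """Average the windows centered on hand-labeled endpoint frames."""
    windows = []
    for t in endpoint_times:
        window = centered_window(features, t, half_width)
        if window is not None:
            windows.append(window)
    if not windows:
        raise ValueError("no endpoint has a complete window")
    return [sum(column) / len(windows) for column in zip(*windows)]


# Hypothetical one-channel reference data with start points at frames 2 and 5.
reference = [[0.0], [0.0], [2.0], [2.0], [0.0], [4.0], [4.0]]
print(average_pattern(reference, [2, 5], half_width=1))  # -> [0.0, 3.0, 3.0]
```

Because each window spans several frames, the resulting pattern encodes how the channel outputs change over time around the endpoint, not just their values at one instant.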

Turning to the start point, Japanese speech has about 100 syllables with different speech patterns, but when frequency features are extracted by the band division described above, there is no particular need to distinguish among the front vowels or among the back vowels. In this case, therefore, preparing on the order of a few dozen standard patterns makes it possible to detect the start point of Japanese input speech with sufficient reliability. When the vocabulary handled by the device is fixed in advance, it suffices to prepare standard patterns corresponding to the individual words.

As for the standard patterns of the partial section containing the end point, the end of an utterance is structurally a vowel, and in the case of devoicing some vowels need not be distinguished, so only a few standard patterns need to be prepared.

For each standard pattern defined in this way, the similarity calculation unit 2 calculates the similarity to the feature parameter time series of the input signal at a given time. This similarity calculation is performed in a pattern-matching manner, for example by using the multiple similarity (composite similarity) method. The similarity calculation unit 2 performs this calculation for each of the plurality of standard patterns at every analysis time of the input signal, so that a time series of similarity values is obtained for each standard pattern. A determination unit 4 receives these similarity time series and detects, as an endpoint of the voice section (that is, a start point or an end point), the time at which the similarity value exceeds the threshold predetermined for each standard pattern and takes a local maximum. Needless to say, the start point and the end point are obtained independently of each other, from the standard patterns of the partial section containing the start point and from those of the partial section containing the end point, respectively. The start and end points of the speech are detected in this way and together indicate the voice section in the input signal. When a plurality of start points and end points exist, the voice sections are indicated by their combinations.
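The text names a multiple (composite) similarity calculation only as an example and gives no formula. One common formulation, assumed here rather than taken from the patent, represents each standard pattern by a set of orthonormal basis vectors with weights and scores an input vector by its weighted squared projections, normalized by the input energy. The function name, the `basis` and `weights` arguments, and the sample values are hypothetical.

```python
def multiple_similarity(x, basis, weights):
    """Weighted sum of squared projections of x onto the basis vectors,
    divided by the squared norm of x.

    `basis` is assumed to hold orthonormal vectors of the same length as x.
    """
    energy = sum(v * v for v in x)
    if energy == 0:
        return 0.0  # an all-zero input matches nothing
    total = 0.0
    for phi, w in zip(basis, weights):
        projection = sum(a * b for a, b in zip(x, phi))
        total += w * projection * projection
    return total / energy


# Hypothetical two-dimensional example: (1.0 * 9 + 0.5 * 16) / 25.
score = multiple_similarity([3.0, 4.0], [[1.0, 0.0], [0.0, 1.0]], [1.0, 0.5])
print(round(score, 2))  # -> 0.68
```

With orthonormal basis vectors and weights no greater than 1, the score stays between 0 and 1 and does not depend on the overall level of the input, which suits a comparison against a fixed threshold.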

FIG. 2 shows the control flow of the series of section detection processes described above. While the speech signal is being input, the feature parameters of the input signal are extracted successively at every predetermined analysis time (step 11). At each given time, the time series of these feature parameters is extracted (step 12), and the similarity between this time series and the standard patterns is calculated (step 13). The start point and the end point of the voice section are then each determined according to the similarity values (step 14).
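Steps 12 to 14 can be sketched end to end as below. This is a simplified illustration under stated assumptions: step 11 is taken as already done (`features` holds one list of channel values per frame), plain cosine similarity is swapped in for the similarity measure, a single start pattern and a single end pattern are used, and all names and values are hypothetical.

```python
import math


def cosine(x, y):
    """Cosine similarity, used here as a stand-in similarity measure."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0


def similarity_series(features, pattern, half_width):
    """Steps 12-13: similarity of each frame-centered window to the pattern."""
    sims = []
    for t in range(len(features)):
        lo, hi = t - half_width, t + half_width + 1
        if lo < 0 or hi > len(features):
            sims.append(0.0)  # incomplete window at the edges
            continue
        window = [v for frame in features[lo:hi] for v in frame]
        sims.append(cosine(window, pattern))
    return sims


def peaks_above(sims, threshold):
    """Step 14: times where similarity exceeds the threshold at a local maximum."""
    return [t for t in range(1, len(sims) - 1)
            if sims[t] > threshold
            and sims[t] > sims[t - 1] and sims[t] >= sims[t + 1]]


def detect_endpoints(features, start_pattern, end_pattern, half_width, threshold):
    """Start and end points are found independently from their own patterns."""
    starts = peaks_above(similarity_series(features, start_pattern, half_width), threshold)
    ends = peaks_above(similarity_series(features, end_pattern, half_width), threshold)
    return starts, ends


# Hypothetical one-channel input: silence, three speech frames, silence.
features = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0], [0.0], [0.0], [0.0]]
print(detect_endpoints(features, [0.0, 1.0, 1.0], [1.0, 1.0, 0.0], 1, 0.9))
# -> ([3], [5])
```

Requiring a local maximum as well as a threshold crossing keeps neighboring frames, whose windows partly overlap the pattern, from being reported as additional endpoints.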

In other words, as shown in FIG. 3, for each partial section of the time series of the input signal, whether that section is a start point, an end point, or neither is determined and detected by matching against the standard patterns.

The present invention is not limited to the embodiment described above. For example, when a standard pattern is used in which the endpoint is displaced from the temporal center of the partial section, the endpoint position must of course be obtained by correcting the partial section in which the endpoint was detected by the amount of that displacement. In addition, if portions that are clearly silent are detected from the power of the input signal, the remaining portions are extracted as speech candidate sections, and the similarity calculation described above is performed only on those sections, the amount of computation can be reduced. The invention can also be modified in various other ways without departing from its gist.
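The two modifications described above can be illustrated with a short sketch. The names `candidate_sections` and `corrected_endpoint`, the `silence_level` parameter, and the sign convention for the offset (positive when the endpoint lies after the window center) are assumptions for this example.

```python
def candidate_sections(power, silence_level):
    """Return (first_frame, last_frame) runs whose power exceeds silence_level.

    Frames at or below silence_level are treated as clearly silent, so the
    similarity calculation can be restricted to the returned runs.
    """
    sections = []
    start = None
    for t, p in enumerate(power):
        if p > silence_level and start is None:
            start = t
        elif p <= silence_level and start is not None:
            sections.append((start, t - 1))
            start = None
    if start is not None:
        sections.append((start, len(power) - 1))
    return sections


def corrected_endpoint(window_center, pattern_offset):
    """Shift a detected window center by the pattern's known endpoint offset."""
    return window_center + pattern_offset


print(candidate_sections([0, 0, 3, 4, 0, 5], silence_level=1))  # -> [(2, 3), (5, 5)]
print(corrected_endpoint(10, -2))                                # -> 8
```

In practice the candidate runs would probably be padded by the window half-width on each side so that windows centered on the true endpoints are still evaluated; that padding is omitted here for brevity.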

As explained above, according to the present invention, the temporal structures of the feature parameters in the partial sections containing the endpoints are provided as standard patterns, pattern matching is performed against the time series of feature parameters of the input signal, and the start point and end point of the voice section are detected from the resulting similarities. The voice section can therefore be detected correctly from its temporal structure without being affected by noise such as tongue clicks and breath leaks, and the detection is both accurate and stable. Consequently, the sections in which continuously uttered input speech is present can be determined correctly and the recognition processing can be carried out effectively, which is of great practical benefit.

[Brief explanation of drawings]

FIG. 1 is a schematic configuration diagram of an apparatus according to an embodiment of the present invention, FIG. 2 is a diagram showing the flow of the voice section detection processing according to the embodiment, and FIG. 3 is a diagram showing the relationship between the input signal and the partial sections. 1: acoustic analysis unit; 2: similarity calculation unit; 3: feature parameter storage unit; 4: determination unit. Agent for the applicant: patent attorney Takehiko Suzue.

Claims

[Claims]

(1) A voice section detection device characterized by comprising: an acoustic analysis unit that analyzes an input signal at regular time intervals to obtain its feature parameters; a storage unit that stores, as standard patterns, the temporal structures of the feature parameters in partial sections respectively containing the start point and the end point of a voice section; means for calculating the similarity between the standard patterns and the feature parameter time series of the input signal; and means for determining and detecting the start point and end point of the voice section of the input signal from the calculated similarity values.

(2) The speech section detection device according to claim 1, wherein the similarity calculation is performed on a time series of feature parameters of the speech candidate section determined from the power of the input signal.

(3) The voice section detection device according to claim 1, wherein the temporal structure of the feature parameters represents the frequency-time structure of the speech in the partial sections respectively containing the start point and the end point of the speech.