JP3061292B2

JP3061292B2 - Accent phrase boundary detection device

Info

Publication number: JP3061292B2
Application number: JP3052478A
Authority: JP
Inventors: Satoshi Takahashi (高橋 敏); Shoichi Matsunaga (松永 昭一)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1991-03-18
Filing date: 1991-03-18
Publication date: 2000-07-10
Anticipated expiration: 2015-07-10
Also published as: JPH04288597A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Industrial Field of Application] The present invention relates to an accent phrase boundary detection device that is used in a continuous speech recognition apparatus for recognizing continuously uttered speech, and that detects accent phrase boundaries from the time series obtained by converting the input speech into a representation based on feature parameters.

[0002]

[Prior Art] For conventional accent phrase boundary detection in continuous speech, methods have been proposed that rely on a finding from analyses of prosodic feature parameter patterns in speech (for example, pitch frequency and power): many accent phrase boundaries lie in the valleys of the envelope of the prosodic feature parameter time-series pattern. These methods detect those valleys (for example, Suzuki et al., "Phrase Boundary Detection Using Prosodic Information for Japanese Continuous Speech Recognition," IEICE Transactions, Vol. J72-D-2, No. 10 (1989-10); Komatsu et al., "Conversational Speech Understanding Based on Syntax Estimation Using Prosodic Information and Word Spotting," IEICE Transactions, Vol. J71-D, No. 7 (1988-7)). FIG. 5 shows the pitch frequency pattern obtained when one sentence is read aloud. The vertical axis is pitch frequency and the horizontal axis is time. The dashed lines are the accent phrase boundaries. The places where the envelope is interrupted are portions where no pitch frequency could be obtained, such as unvoiced consonants and pauses between utterances. Because the pitch frequency pattern is quite ambiguous, has many variations, and contains intervals where the pitch frequency cannot be extracted, detecting the valleys of the envelope is very difficult. Moreover, the methods proposed so far scan the prosodic feature parameter pattern only locally, or detect accent phrase boundaries based on heuristic or empirical rules, so their detection performance is insufficient and they lack generality.

[0003]

[Means for Solving the Problems] According to the present invention, statistical standard patterns that statistically represent the feature parameter time-series patterns of accent phrases are created using hidden Markov models (see, for example, Seiichi Nakagawa, "Speech Recognition by Stochastic Models" (1988)) for each accent type of a phrase and for each duration of the whole phrase, and are stored in advance in a statistical accent phrase model memory. A phrase boundary detection unit matches the feature parameter time series of the input speech against the statistical standard patterns while controlling the dwell time of the internal states of the hidden Markov models, recognizes each accent phrase parameter time-series pattern in the continuous speech, finds the optimal model (standard pattern) sequence, and takes the boundaries between the models as the accent phrase boundaries.
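The search for an optimal model sequence can be illustrated with a minimal sketch. It assumes that some scoring function already returns a log-likelihood for matching one model to one span of frames; the patent performs this matching with hidden Markov models, which the sketch treats as a black box. The names `best_model_sequence` and `segment_score` are hypothetical and do not come from the patent.

```python
def best_model_sequence(n_frames, models, segment_score, min_len=1, max_len=None):
    """Segment frames [0, n_frames) into consecutive spans, one model per span,
    so that the sum of segment_score(model, start, end) is maximal.

    Assumption: segment_score is a stand-in for the HMM matching score.
    Returns (total_score, [(model, start, end), ...]).
    """
    max_len = max_len or n_frames
    neg_inf = float("-inf")
    best = [neg_inf] * (n_frames + 1)   # best[t]: best score covering frames [0, t)
    back = [None] * (n_frames + 1)      # back[t]: (model, start) of the last span
    best[0] = 0.0
    for end in range(1, n_frames + 1):
        for start in range(max(0, end - max_len), end - min_len + 1):
            if best[start] == neg_inf:
                continue
            for model in models:
                score = best[start] + segment_score(model, start, end)
                if score > best[end]:
                    best[end] = score
                    back[end] = (model, start)
    if back[n_frames] is None:
        raise ValueError("no segmentation satisfies the length constraints")
    # Trace back from the end of the utterance to recover the spans.
    segments = []
    end = n_frames
    while end > 0:
        model, start = back[end]
        segments.append((model, start, end))
        end = start
    segments.reverse()
    return best[n_frames], segments
```

The span ends in the returned list, other than the last one, play the role of the accent phrase boundaries. The exhaustive search over start points is quadratic in the number of frames, so `max_len` is there to bound the work per frame.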

[0004] That is, the present invention focuses on the fact that the time-series pattern of the prosodic feature parameters within each accent phrase can be classified into several typical patterns according to accent type (see, for example, "Japanese Pronunciation Accent Dictionary," edited by Japan Broadcasting Corporation (1966)). FIGS. 1A and 1B show typical examples of each accent type and its pitch frequency pattern. The accent type is defined by the number of morae from the beginning of the word to the syllable carrying the accent nucleus (hatched in the figure), and the pitch frequency pattern falls sharply immediately after that syllable. In FIG. 1A, the hook at the left end of a horizontal line indicates that the pitch frequency rises, and the hook at the right end indicates that it falls. The time-series patterns of the prosodic feature parameters in accent phrases are classified by accent type, and a statistical accent phrase model is created for each accent type by a hidden Markov model, using the feature parameter time-series patterns of a large number of accent phrases. This model statistically represents the time-series pattern of the feature parameters of an accent phrase. Detecting accent phrase boundaries by matching the prosodic feature parameter time-series pattern of the input continuous speech against the statistical accent phrase models, as described above, means that the standard patterns are represented by statistical models so that ambiguous prosodic parameters can be handled, and that the boundaries are detected while the parameter time series of the input speech is considered globally. Detection performance therefore improves over the conventional methods.

[0005]

[Embodiment] FIG. 2 shows an embodiment of the present invention. Speech input from input terminal 1 is converted into a digital signal by A/D conversion unit 2. Pitch frequency extraction unit 3 extracts the pitch from this digital signal, and the result is then converted into feature parameters once per frame (for example, every 10 milliseconds). The feature parameters are, for example, the first-order and second-order regression coefficients of the pitch frequency pattern over several adjacent frames (for example, 7 frames).
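A minimal sketch of such regression features is given below. It assumes a least-squares fit of a quadratic to a centered, odd-length window of pitch values, with the linear and quadratic coefficients taken as the first-order and second-order regression coefficients. The patent does not give the exact formula, so this reading, the function names, and the requirement that every frame in the window has a pitch value are all assumptions.

```python
def regression_coeffs(window):
    """Fit x[k] ~ a + b*k + c*k^2 over k = -K..K and return (b, c).

    Assumption: the window has odd length (for example 7 frames) and contains
    only voiced frames with a valid pitch value.
    """
    n = len(window)
    if n < 3 or n % 2 == 0:
        raise ValueError("window length must be odd and at least 3")
    half = n // 2
    ks = range(-half, half + 1)
    s2 = sum(k * k for k in ks)
    s4 = sum(k ** 4 for k in ks)
    sx = sum(window)
    skx = sum(k * x for k, x in zip(ks, window))
    sk2x = sum(k * k * x for k, x in zip(ks, window))
    slope = skx / s2                                     # first-order coefficient
    curvature = (n * sk2x - s2 * sx) / (n * s4 - s2 * s2)  # second-order coefficient
    return slope, curvature


def pitch_features(f0, width=7):
    """Return one (slope, curvature) pair per frame that has a full window."""
    half = width // 2
    return [regression_coeffs(f0[t - half:t + half + 1])
            for t in range(half, len(f0) - half)]
```

For a pitch contour that rises by 2 Hz per frame, `regression_coeffs` returns a slope of 2 and a curvature of 0. Frames closer than half a window to either end of the contour get no feature in this sketch.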

[0006] In advance, a large database of training sentences is converted into the above feature parameters, the feature parameter time series is cut out for each accent phrase, and the resulting segments are grouped by accent type. They are further classified by the duration of the accent phrase (for example, 400 ms or less, 400 ms to 600 ms, and 600 ms or more), and statistical accent phrase models are created using hidden Markov models with a small number of states (for example, three states). Classification by duration is effective because phrases with extremely short or extremely long durations have different characteristics. These models are stored in statistical accent phrase model memory 4.
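The grouping of training segments by accent type and duration class could be sketched as follows. The class labels, the function names, and the treatment of the exact 400 ms and 600 ms edges are assumptions, since the text states the ranges only as examples.

```python
from collections import defaultdict


def duration_class(duration_ms):
    """Map a phrase duration to one of the three example ranges.

    Assumption: exactly 400 ms counts as short and exactly 600 ms as long.
    """
    if duration_ms <= 400:
        return "short"
    if duration_ms < 600:
        return "medium"
    return "long"


def group_training_segments(segments):
    """Group (accent_type, duration_ms, features) tuples by model key.

    Each key (accent_type, duration_class) corresponds to one statistical
    accent phrase model to be trained from the grouped feature sequences.
    """
    groups = defaultdict(list)
    for accent_type, duration_ms, features in segments:
        groups[(accent_type, duration_class(duration_ms))].append(features)
    return dict(groups)
```

Each group would then serve as the training set for one hidden Markov model. The training step itself is outside the scope of this sketch.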

[0007] Phrase boundary detection unit 5 reads the statistical accent phrase models from statistical accent phrase model memory 4 and, while matching them against the feature parameter time series of the input speech, determines which model sequence matches best. During matching, duration control is applied to each state of the hidden Markov models used for the statistical accent phrase models. That is, the mean and variance of the time spent in each state are obtained in advance from the training data; assuming a normal distribution for the dwell time in a state, a likelihood is computed from the duration actually observed during matching and is added to the matching likelihood. For example, when the statistical accent phrase models are created with three-state hidden Markov models, the states can be expected to represent the rising, flat, and falling portions of the pitch pattern of the phrase. FIG. 3 shows, for each accent type, the average duration spent in each state when the statistical accent phrase models are created with three-state hidden Markov models. The duration of the second state, that is, the state that maps to the flat portion of the pitch pattern, becomes longer from type 2 to type 4. As a comparison with FIG. 1B shows, this agrees with the tendency of actual pitch patterns, so adding duration control allows the characteristics of the pitch pattern of each accent type to be reflected in the model.
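The duration control described above can be sketched as a Gaussian log-likelihood per state that is added to the matching score. Working in the log domain and the optional `weight` factor are assumptions made for illustration; the text says only that the duration likelihood is added to the matching likelihood.

```python
import math


def duration_log_likelihood(frames, mean, variance):
    """Log density of a normal distribution evaluated at the observed dwell time."""
    return -0.5 * (math.log(2.0 * math.pi * variance)
                   + (frames - mean) ** 2 / variance)


def score_with_duration(match_log_likelihood, state_durations, duration_stats,
                        weight=1.0):
    """Add the per-state duration log-likelihoods to the matching score.

    state_durations: frames spent in each state during matching.
    duration_stats: (mean, variance) per state, estimated from training data.
    weight: hypothetical scaling factor, not mentioned in the patent.
    """
    bonus = sum(duration_log_likelihood(d, m, v)
                for d, (m, v) in zip(state_durations, duration_stats))
    return match_log_likelihood + weight * bonus
```

A state held for about its average duration contributes a larger term than one held for much longer or shorter, which penalizes alignments whose rising, flat, or falling portions have implausible lengths.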

[0008] In the optimal model sequence obtained in this way for the input speech, the boundaries between the models correspond to the accent phrase boundaries. These boundaries are determined and output from phrase boundary detection result output unit 6.
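As a small illustration of this last step, the sketch below turns the number of frames assigned to each model in the optimal sequence into boundary times. The 10 ms frame shift follows the example given for the embodiment; the function name and the choice to report only the internal boundaries are assumptions.

```python
from itertools import accumulate


def boundaries_ms(segment_frames, frame_ms=10):
    """Return the times (ms) at which one model ends and the next begins.

    segment_frames: frames assigned to each model, in utterance order.
    The end of the last segment is the end of the utterance, so it is not
    reported as a boundary between models.
    """
    ends = list(accumulate(segment_frames))
    return [end * frame_ms for end in ends[:-1]]
```

For three models aligned to 45, 30, and 60 frames, this gives boundaries at 450 ms and 750 ms.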

[0009]

[Effects of the Invention] As described above, because the present invention models the prosodic feature parameters by a statistical method, the many variations of prosodic feature parameter time-series patterns can be represented probabilistically as standard patterns. The device can therefore cope with patterns that change subtly within continuous speech, and detection performance improves.

[0010] The following describes the results of an accent phrase boundary estimation experiment performed with the configuration shown in FIG. 2, using 500 sentences uttered by one male announcer (speaking rate 10 morae/second). First, statistical accent phrase models for accent types 0 through 4 were created from 400 sentences (containing 2,786 accent phrases). All accent phrases of type 4 or higher were used to create the type 4 model. The phrases were also classified into three groups by the duration of the whole phrase: 400 ms or less, 400 ms to 600 ms, and 600 ms or more. In total, therefore, 15 statistical accent phrase models were created. The first-order and second-order regression coefficients of the pitch frequency pattern were used as feature parameters, and three-state hidden Markov models were used. Next, a phrase boundary detection evaluation experiment was performed on 100 different sentences (containing 463 accent phrase boundaries). Under the condition that the number of accent phrases in each sentence and the accent type of each accent phrase are known, and with the same number of boundary candidates detected as there are correct phrase boundaries, 76.4% of all boundaries were detected within a boundary error of ±100 ms (roughly one mora length). Adding duration control to the statistical accent phrase models raised this figure to 82.9%. FIG. 4 shows an example of accent phrase boundary detection results according to the present invention. The vertical axis is pitch frequency and the horizontal axis is time. The dashed lines indicated by the upper arrows are phrase boundaries based on phoneme labels assigned by visual inspection, and the dashed lines indicated by the lower arrows are the estimated phrase boundaries. The boundaries are estimated accurately, except for those of phrases whose duration is short and which therefore provide insufficient prosodic information.
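The evaluation criterion can be sketched as the fraction of reference boundaries that have a detected boundary within the tolerance. The greedy one-to-one matching of sorted boundary lists used here is an assumption; the text specifies only the ±100 ms tolerance and that as many candidates are detected as there are correct boundaries.

```python
def detection_rate(reference_ms, detected_ms, tolerance_ms=100):
    """Fraction of reference boundaries matched by a detected boundary.

    Each detected boundary may match at most one reference boundary
    (greedy left-to-right matching, an assumption of this sketch).
    """
    ref = sorted(reference_ms)
    det = sorted(detected_ms)
    hits = 0
    j = 0
    for r in ref:
        # Skip detections that lie too far to the left of this reference.
        while j < len(det) and det[j] < r - tolerance_ms:
            j += 1
        if j < len(det) and abs(det[j] - r) <= tolerance_ms:
            hits += 1
            j += 1  # consume the matched detection
    return hits / len(ref) if ref else 0.0
```

With reference boundaries at 450, 750, and 1200 ms and detections at 430, 900, and 1250 ms, two of the three reference boundaries are matched.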

[0011] Performance improves further when a power pattern is also used as a feature parameter.

[Brief description of the drawings]

FIG. 1A shows examples of accent types 0 to 5, and FIG. 1B shows their pitch frequency patterns.

FIG. 2 is a block diagram showing an embodiment of the present invention.

FIG. 3 is a diagram showing, for each accent type, the average duration of each state when the statistical accent phrase models are created with three-state hidden Markov models.

FIG. 4 is a diagram showing an example of phrase boundary detection results for continuous speech according to the present invention.

FIG. 5 is an example of a pitch frequency pattern of continuous speech. The utterance is 「言論の／自由は／一歩／譲れば／百歩も／千歩も／攻めこまれる」 (roughly, "if freedom of speech yields one step, it will be pushed back a hundred or even a thousand steps").

Continuation of the front page

(56) References: JP-A-64-49098 (JP, A); JP-A-64-20599 (JP, A); JP-A-62-34200 (JP, A); JP-A-62-102295 (JP, A); JP-A-4-66999 (JP, A); JP-A-62-136699 (JP, A)

(58) Fields searched (Int. Cl.7, DB name): G10L 15/04; G10L 15/14; JICST file (JOIS)

Claims

(57) [Claims]

1. An accent phrase boundary detection device that converts continuously uttered input speech into a feature parameter time series and detects, from the feature parameter time series, the boundaries of the accent phrases contained in the input speech, the device comprising: a statistical accent phrase model memory that stores statistical standard patterns which are created using hidden Markov models for each accent type and for each duration of the whole phrase, and which statistically represent the feature parameter time-series patterns of accent phrases; and a phrase boundary detection unit that detects the accent phrase boundaries by matching the feature parameter time series of the input speech against the statistical standard patterns while controlling the dwell time of the internal states of the hidden Markov models.

2. The accent phrase boundary detection device according to claim 1, wherein the feature parameter of the speech is the first-order regression coefficient or the second-order regression coefficient of the pattern of change of the pitch frequency of the speech over several frames, or either of these combined with the waveform power of each frame.