JPH0120440B2

Info

Publication number: JPH0120440B2
Application number: JP57171632A
Authority: JP
Inventors: Katsuyuki Futayada; Masakatsu Hoshimi; Satoshi Fujii; Hideji Morii; Ikuo Inoe
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1982-09-29
Filing date: 1982-09-29
Publication date: 1989-04-17
Also published as: JPS5958496A

Description

[Detailed description of the invention]

Industrial field of application

The present invention relates to a speech segmentation method for speech recognition, that is, a method for extracting phoneme intervals from continuously uttered speech.

Conventional configuration and its problems

Automatic speech recognition devices, which automatically recognize speech uttered by humans, are considered a very effective means for people to give data and commands to computers and various machines.

Most automatic speech recognition systems studied or published so far operate on the pattern matching method. In this method, standard patterns are stored in advance for every word to be recognized, each is compared with the unknown input pattern to compute a degree of match (hereinafter, similarity), and the input is judged to be the word whose standard pattern gives the maximum match. Because a standard pattern must be prepared for every word to be recognized, new standard patterns must be input and stored whenever the speaker changes. When several hundred or more words are to be recognized, as with the names of cities throughout Japan, uttering and registering every word takes enormous time and effort, and the memory capacity needed for registration is also expected to become enormous. A further drawback is that the time required to match the input pattern against the standard patterns grows with the number of words.

In contrast, the input speech can be divided into phoneme units and recognized as a combination of phonemes (hereinafter, phoneme recognition).
A method that then computes the similarity with a word dictionary written in phoneme units needs far less memory for the dictionary, shortens the time required for pattern matching, and makes the dictionary contents easy to change. An example of this method is described in Miwa et al., "Word speech recognition system using the outline form of the speech spectrum and its dynamic characteristics," Journal of the Acoustical Society of Japan 34 (1978).

FIG. 1 shows a block diagram of a word recognition system using this method. First, speech from many speakers is analyzed in advance, for each 10 ms analysis interval, by the acoustic analysis unit 1 using a filter bank, and the feature extraction unit 2 derives feature parameters from the resulting spectral information. From these feature parameters, a standard pattern is created for each phoneme or phoneme group, represented by vowels such as |a| and |o| and consonants such as |n| and |b|, and is registered in the standard pattern registration unit 5. Next, input speech from an unspecified speaker is analyzed in the same way, interval by interval, by the acoustic analysis unit 1, and the feature extraction unit 2 derives its feature parameters. Using these feature parameters and the standard patterns of the standard pattern registration unit 5, the segmentation unit 3 separates vowels from consonants (hereinafter, segmentation). Based on this result, the phoneme discrimination unit 4 matches each interval against the standard patterns and assigns it the phoneme of the most similar standard pattern. Finally, the resulting time series of phonemes (hereinafter, phoneme sequence) is sent to the word recognition unit 6, which outputs as the recognition result the word whose entry in the word dictionary 7, likewise expressed as a phoneme sequence, has the highest similarity.

As this overall operation shows, a segmentation error in the segmentation unit 3 either misses a phoneme that should be present (phoneme omission) or inserts a phoneme where none actually exists (phoneme addition). When such errors occur, the phoneme sequence of a word comes to resemble that of a completely unrelated word, which raises the risk of misrecognition. Segmentation is thus the most important task in word recognition based on phoneme recognition, and the accuracy of segmentation largely determines the performance of the word recognition system.

Conventional segmentation methods use the power of the full speech band as the segmentation parameter, find power dips from its movement over time, and treat each dip interval as a consonant. Other methods use the spectral slope instead of the full-band power, or use both together (see, for example, the paper by Miwa et al. in Journal of the Acoustical Society of Japan 34, 1978). These methods all rely on parameter dips and have the following problems.
(a) Some consonants cannot be detected from the full-band power or the spectral slope, and consonants are often added to vowel and other intervals.
(b) Some consonants cannot be detected from the temporal movement of a parameter alone, and these are omitted.

Object of the invention

The object of the present invention is to eliminate the above drawbacks of the conventional methods and to provide a speech segmentation method that can segment a wide range of consonants, from voiced to unvoiced, with high accuracy.

Structure of the invention

In Japanese, words are essentially built from alternating vowels and consonants. Apart from the moraic nasal (hatsuon), a consonant is never immediately followed by another consonant. Therefore, when recognizing Japanese, separating vowels from consonants accurately contributes greatly to a higher recognition rate.

The present invention efficiently combines information (1) below with information (2) and (3) so that consonant intervals within a word are separated accurately from vowel and other intervals.

(1) Power-dip information produced by the temporal movement of the low-band power and the high-band power of the speech signal.
(2) The result of the voiced/unvoiced determination for each frame (one frame is 10 msec).
(3) The result of phoneme recognition for each frame.

The invention addresses drawback (a) of the conventional methods by using low-band and high-band power as parameters and exploiting them effectively. It addresses drawback (b) by using, in addition to the temporal movement of the parameters, the voiced/unvoiced determination results and the frame-by-frame phoneme recognition results.

Description of the embodiment

An embodiment of the present invention is described below. In this embodiment, a highly accurate segmentation method is realized by combining low-band and high-band power information, frame-by-frame phoneme recognition results, and voiced/unvoiced determination results.

Voiced/unvoiced determination is effective for phonemes that are strongly unvoiced and relatively long, such as unvoiced fricatives. Low-band power dips are effective for phonemes that are short and strongly unvoiced (unvoiced plosives, for example). High-band power dips are effective for phonemes that are short and strongly voiced (voiced plosives, liquids, and so on).
The frame-by-frame recognition results are effective for nasals, in which power dips rarely appear, and for long voiced consonants (the moraic nasal).

These sources of information are thus complementary, and used in combination they allow most Japanese consonants to be detected accurately. The method of this embodiment is described in detail below.

First, the first consonant-interval detection method, which uses low-band and high-band power information, is described.

In this embodiment, the low-band power and the high-band power of the speech spectrum are used together as segmentation parameters. The former is effective for distinguishing vowels from unvoiced consonants, the latter for distinguishing vowels from voiced consonants. The low-band power is obtained by passing the speech signal through a 250 to 600 Hz band-pass filter and rectifying the output frame by frame. The high-band power is obtained in the same way with a 1500 to 4000 Hz band-pass filter.

FIG. 2 shows how dips are extracted from the low-band or high-band power information. Signal (a) is the rectified filter output plotted as a time series; besides the large dips of consonant intervals it contains many small dips. The small dips are unwanted and are removed by smoothing (FIG. 2(b)). Signal (c) is then obtained by differentiating signal (b).
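The processing chain of FIG. 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the single second-order band-pass section, the sampling rate passed as `fs`, the three-frame moving average, and the threshold arguments `p_min` and `l_max` are all hypothetical choices. The text itself fixes only the band edges, the 10 ms frame, and the order of smoothing and differentiation. The last function pairs each falling stretch of the derivative with the rise that follows it and keeps the pair when the peak-to-peak size p is large enough and the minimum-to-maximum length L, in frames, is short enough.

```python
import math


def band_power(samples, fs, lo_hz, hi_hz, frame_ms=10.0):
    """Band-pass filter the signal, rectify it, and average per frame."""
    # Assumption: one biquad band-pass section centred on the band.
    f0 = math.sqrt(lo_hz * hi_hz)
    q = f0 / (hi_hz - lo_hz)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    a0, a1, a2 = 1.0 + alpha, -2.0 * math.cos(w0), 1.0 - alpha
    x1 = x2 = y1 = y2 = 0.0
    frame_len = int(fs * frame_ms / 1000.0)
    powers, acc, count = [], 0.0, 0
    for x in samples:
        y = (alpha * (x - x2) - a1 * y1 - a2 * y2) / a0
        x2, x1, y2, y1 = x1, x, y1, y
        acc += abs(y)  # full-wave rectification
        count += 1
        if count == frame_len:
            powers.append(acc / frame_len)
            acc, count = 0.0, 0
    return powers


def smooth(values, half_width=1):
    """Moving average that removes small dips (signal (a) to signal (b))."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half_width): i + half_width + 1]
        out.append(sum(window) / len(window))
    return out


def differentiate(values):
    """First difference between frames (signal (b) to signal (c))."""
    return [0.0] + [values[i] - values[i - 1] for i in range(1, len(values))]


def find_dips(c, p_min, l_max):
    """Return (start, end, p) for each fall-then-rise of c with p > p_min and L < l_max."""
    dips, i, n = [], 0, len(c)
    while i < n:
        if c[i] >= 0.0:
            i += 1
            continue
        j = i
        while j < n and c[j] <= 0.0:  # falling (or flat) stretch
            j += 1
        e = j
        while e < n and c[e] > 0.0:  # rising stretch that follows
            e += 1
        if e > j:
            lo = min(range(i, j), key=c.__getitem__)  # frame of the minimum
            hi = max(range(j, e), key=c.__getitem__)  # frame of the maximum
            p, length = c[hi] - c[lo], hi - lo
            if p > p_min and length < l_max:
                dips.append((lo, hi, p))
        i = e
    return dips
```

A typical call would be `find_dips(differentiate(smooth(band_power(x, fs, 250, 600))), p_min, l_max)` for the low band, and the same with 1500 and 4000 for the high band. Real speech would probably need stronger smoothing than the three-frame average used here.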
From signal (c), the magnitude p between the maximum and the minimum and the time length L (in frames) from the minimum to the maximum are obtained. The conditions p > p_min and L < L_max are applied, and for each dip that satisfies them, the interval from the minimum to the maximum of (c) is taken as a dip interval (consonant candidate).

This method replaces the computation of the dip magnitude with detection of the rate of change of the power and computes the maximum and minimum of that rate, so dip intervals can be detected simply and accurately.

Next, the way consonant intervals are identified among the consonant candidates detected by the low-band power dip, the high-band power dip, or both is described. Let p_l be the dip magnitude obtained from the low-band power information by the method above, and p_h the dip magnitude obtained from the high-band power information. When a consonant candidate interval from the low band and one from the high band overlap, the two-dimensional coordinate (p_l, p_h) is applied to the discriminant diagram shown in FIG. 3. If (p_l, p_h) falls in the addition region of the diagram (inside the hatched area), the consonant candidate is rejected. If (p_l, p_h) falls in the consonant region, the portion corresponding to the logical OR of the low-band dip interval and the high-band dip interval is identified as a consonant. When the low-band and high-band candidate intervals do not overlap, the missing value is set to 0 (for example, (p_l, 0)) and the diagram is applied in the same way.

This method, which takes the complementary low-band and high-band power information as parameters, searches for consonant candidate intervals with each, and then applies them to the discriminant diagram to decide the consonant intervals, is effective for a wider range of consonants, from voiced to unvoiced, than the conventional methods, and detects consonant intervals with high accuracy. It is particularly effective for the voiced consonants |b|, |d|, |η| and |r|, the unvoiced consonant |h|, and |z|, which shows both voiced and unvoiced properties.

Next, the second consonant-interval detection method, which uses frame-by-frame phoneme recognition results, is described. The segmentation method based on dip information described above detects only about 73% of nasal intervals, which is insufficient compared with the other voiced consonants. It also has the weakness that dip information cannot be used for the moraic nasal, whose duration is too long. This embodiment covers these weaknesses by using the frame-by-frame phoneme recognition results.

In this embodiment, phoneme recognition is first performed frame by frame; consecutive frames recognized as the same phoneme are then joined, and the result is taken as the phoneme recognition result for that interval.

Various methods are conceivable for frame-by-frame phoneme recognition. This embodiment uses as parameters the LPC cepstrum coefficients C_i (i = 1 to d) obtained by LPC analysis (autocorrelation method), as follows. With the mean μ_k and the covariance matrix Σ_k as the standard pattern for phoneme k, the probability P_k that a given frame is phoneme k is obtained by the following equation (superscript T denotes the transpose and superscript -1 the inverse matrix). The log likelihood L_k is

L_k = -(1/2) (C - μ_k)^T Σ_k^(-1) (C - μ_k) - A_k    (Equation 2)

where

[Formula]
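The formula images for P_k and for the term A_k are not reproduced in this text. Assuming the model is the usual d-dimensional normal density with mean μ_k and covariance Σ_k, which is what the description of the standard pattern suggests, the omitted expressions would take the following form. This is a reconstruction under that assumption, not a quotation from the patent.

```latex
% Assumed form of the omitted formulas: a d-dimensional normal density per phoneme k.
\begin{align}
P_k &= \frac{1}{(2\pi)^{d/2}\,|\Sigma_k|^{1/2}}
       \exp\!\left\{-\tfrac{1}{2}(C-\mu_k)^{T}\,\Sigma_k^{-1}\,(C-\mu_k)\right\} \\
L_k &= \ln P_k = -\tfrac{1}{2}(C-\mu_k)^{T}\,\Sigma_k^{-1}\,(C-\mu_k) - A_k \\
A_k &= \tfrac{1}{2}\ln|\Sigma_k| + \tfrac{d}{2}\ln 2\pi
\end{align}
```

Under this reading, A_k depends only on the phoneme and can be computed once when the standard pattern is registered, so that only the quadratic term has to be evaluated for each frame.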

The standard patterns are created in advance from a large amount of data in which the phonemes have been labeled by visual inspection.

Standard patterns are prepared for the five vowels |a|, |i|, |u|, |e|, |o| and for |N|. |N| denotes the nasal group, which combines |m|, |n| and the moraic nasal. Equation 2 is applied to every frame of the speech interval, and for each frame the phoneme with the highest likelihood and the phoneme with the second-highest likelihood are found. This is the result of frame-by-frame phoneme recognition over vowels and nasals.

When the five vowel patterns and the nasal pattern are applied to all frames in this way, each frame in an interval corresponding to the nasals |m|, |n| or the moraic nasal is recognized as the nasal |N|, and other phonemes whose spectral pattern resembles a nasal (|b|, |d|, |η|, |r|) are also likely to be recognized as |N|. By referring to the intervals recognized as |N|, voiced consonants can therefore be detected even where no dip exists. In this embodiment, an interval in which frames recognized as |N|, including frames where |N| has the second-highest likelihood, continue for five or more frames is taken as a consonant interval.

FIG. 4 illustrates the frame-by-frame recognition results for the phonemes with the highest and second-highest likelihood. In this example, frames 1 to 5 are determined to be |a|, frames 6 to 10 |o|, and frames 17 onward |u|. In frames 11 to 16, N appears in six consecutive frames when the second-highest likelihood is included, so these frames are segmented as a consonant interval. For consonant intervals, the consonant standard patterns are applied afterwards to determine the phoneme.

The segmentation method described above, which looks at the continuity of frames recognized as nasal, is effective for |m|, |n|, the moraic nasal, |b|, |d| and |η|.

Next, the third consonant-interval detection method, which uses the voiced/unvoiced determination results, is described. The long unvoiced consonants |s|, |c|, |h| and |z| may last longer than L_max, in which case no dip is detected. Segmentation can then be performed from the temporal continuity of the frame-by-frame voiced/unvoiced determination results.

Voiced/unvoiced determination can use the zero-crossing count, the slope of the spectrum, the value of the first-order autocorrelation coefficient, and so on; any of these methods will do. In this embodiment, standard patterns for voiced and unvoiced speech are each prepared and Equation 2 is applied, which gives an accurate determination.

In this embodiment, an interval in which unvoiced frames continue for seven or more frames is segmented as a consonant interval.

Next, an example of applying the first to third consonant-interval detection methods is described.

The first to third detection methods can be combined in various ways, but it is desirable to combine the first method, which uses low-band and high-band power information, with one or both of the second method, which uses the frame-by-frame phoneme recognition results, and the third method, which uses the voiced/unvoiced determination results.

Here an example is shown in which the third, first and second consonant-interval detection methods are applied in this order. The procedure is as follows.

(i) The third rule is applied first to the speech interval, and any interval in which unvoiced frames continue for seven or more frames is taken as a consonant interval.
(ii) The first rule is applied to what remains after removing the intervals of (i), and consonant intervals are obtained from dips.
(iii) The second rule is applied to the voiced intervals, and any interval in which frames recognized as |N| continue for five or more frames is taken as a consonant interval.
(iv) All intervals obtained in (i) to (iii) are taken as consonant intervals. However, when intervals obtained by (i) and (ii), or by (ii) and (iii), overlap, the interval obtained from the dip takes priority in principle.

This embodiment was evaluated on 212 words uttered by each of 10 male and 10 female speakers. This word set is evaluation data whose consonant intervals were labeled in advance by visual inspection. Table 1 shows the evaluation results (addition rate 4.7%) for each phoneme.
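Steps (i) to (iv) can be sketched as follows. This is a hedged illustration, not the patent's implementation: the function names, the per-frame inputs (`voiced`, `top1`, `top2`) and the list of dip intervals are hypothetical, and dip detection is assumed to have been run separately. The code writes dip intervals last so that they win any overlap, which is one possible reading of the priority given to dips in (iv).

```python
def runs(flags, min_len):
    """Inclusive (start, end) frame pairs for runs of True lasting at least min_len frames."""
    out, start = [], None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a trailing run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_len:
                out.append((start, i - 1))
            start = None
    return out


def segment(voiced, top1, top2, dips, min_unvoiced=7, min_nasal=5):
    """Label consonant intervals from three per-frame cues.

    voiced: list of bool, one voiced/unvoiced decision per frame
    top1, top2: best and second-best phoneme label per frame ("N" = nasal group)
    dips: inclusive (start, end) intervals from the power-dip method
    """
    n = len(voiced)
    label = [None] * n
    # (i) third rule: long unvoiced runs
    unvoiced_runs = runs([not v for v in voiced], min_unvoiced)
    for s, e in unvoiced_runs:
        for i in range(s, e + 1):
            label[i] = "unvoiced"
    # (iii) second rule: runs of voiced frames with N ranked first or second
    nasal = [voiced[i] and "N" in (top1[i], top2[i]) for i in range(n)]
    for s, e in runs(nasal, min_nasal):
        for i in range(s, e + 1):
            label[i] = "nasal"
    # (ii) first rule: dips outside the intervals of (i); written last so dips take priority
    for s, e in dips:
        if any(rs <= s and e <= re for rs, re in unvoiced_runs):
            continue
        for i in range(s, e + 1):
            label[i] = "dip"
    # (iv) collect contiguous labeled frames as consonant intervals
    out, start = [], None
    for i in range(n + 1):
        cur = label[i] if i < n else None
        prev = label[i - 1] if i > 0 else None
        if cur != prev:
            if prev is not None:
                out.append((start, i - 1, prev))
            start = i
    return out
```

Because dips overwrite the other labels frame by frame, an overlapped nasal or unvoiced interval is trimmed rather than dropped. A stricter variant could discard the remainder when it falls below the minimum run length.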

[Table]

The table lists the number of occurrences of each phoneme and the recognition rate at each stage when the rules are applied in the order third, first, second. The rightmost column is the final recognition rate, and a segmentation rate of 90% or higher is obtained for every phoneme. Looking at individual phonemes, the third rule is effective for |s|, |h| and part of |z|, the second rule is effective for the moraic nasal, |m| and |n|, and the first rule is effective for the other phonemes. For most phonemes, using the three rules together raises the recognition rate at each stage. This result demonstrates the effectiveness of combining the three rules.

Compared with the conventional example, this embodiment is characterized by segmenting a wide range of consonants, from voiced to unvoiced, with high accuracy. For example, whereas the segmentation rate for nasals was very low in the conventional example, this embodiment achieves 90% or more, and |r|, |η|, |h| and |z| also improve by several percent or more. All other phonemes improve by around 1%.

Although this embodiment showed an example that uses all of the information and applies the rules in the order third, first, second, fairly good results can also be obtained using only the first and second rules, or only the first and third rules. The order of application is not fixed either: the first rule can be combined with one or both of the second and third rules in any order.

Effects of the invention

As described above, the present invention detects consonant intervals by combining a first detection method with at least one of a second and a third detection method. The first detection method detects the power dips produced by the temporal variation of each of the low-band power and the high-band power of the speech spectrum, and detects consonant intervals by using the magnitudes of the two power dips together. The second detection method recognizes every frame of the whole speech interval (one frame being, for example, 10 msec of data) as a vowel or a nasal and, when frames recognized as nasal continue for a fixed number or more, detects that interval as a consonant interval. The third detection method makes a voiced/unvoiced determination for every frame of the whole speech interval and, when unvoiced frames continue for a fixed number or more, detects that interval as a consonant. Segmentation of a wide range of consonants, from voiced to unvoiced, can thereby be performed with high accuracy.
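The combination step of the first detection method, in which the low-band and high-band dip magnitudes are used together, can be sketched as below. The discriminant diagram of FIG. 3 is not reproduced in this text, so the straight-line boundary in `in_addition_region` and its thresholds `t_l` and `t_h` are purely hypothetical stand-ins, as are the function names. Only the overall logic follows the description: overlapping candidates are judged on the pair (p_l, p_h) and merged by logical OR, and a candidate found in one band only is judged with the other magnitude set to 0.

```python
def in_addition_region(p_l, p_h, t_l=1.0, t_h=1.0):
    """Hypothetical stand-in for FIG. 3: small dips in both bands count as additions."""
    return p_l / t_l + p_h / t_h < 1.0


def combine_candidates(low, high):
    """low, high: lists of (start, end, p) dip candidates, frames inclusive."""
    consonants, used = [], set()
    for s, e, p_l in low:
        # First unused high-band candidate that overlaps this low-band one, if any.
        match = next((j for j, (hs, he, _) in enumerate(high)
                      if j not in used and s <= he and hs <= e), None)
        if match is None:
            if not in_addition_region(p_l, 0.0):
                consonants.append((s, e))
            continue
        used.add(match)
        hs, he, p_h = high[match]
        if not in_addition_region(p_l, p_h):
            consonants.append((min(s, hs), max(e, he)))  # logical OR of the two intervals
    for j, (hs, he, p_h) in enumerate(high):
        if j not in used and not in_addition_region(0.0, p_h):
            consonants.append((hs, he))
    return sorted(consonants)
```

In practice the boundary would have to be fitted to labeled data, as the patent's own diagram presumably was. The linear rule here only shows where such a decision plugs in.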

[Brief explanation of drawings]

FIG. 1 is a block diagram of a conventional speech recognition system. FIG. 2 illustrates the method of the present invention for detecting power dips from low-band or high-band power information. FIG. 3 is a discriminant diagram for distinguishing consonant intervals from consonant additions according to the magnitudes of the low-band and high-band power dips. FIG. 4 illustrates the method of recognizing every frame as a vowel or a nasal and detecting consonant intervals from the result.

Claims

[Claims]

1. A speech segmentation method characterized in that a consonant-interval detection method, which detects the power dips caused by the temporal variation of each of the low-band power and the high-band power of the speech spectrum and detects consonant intervals by using the magnitudes of the respective power dips together, is combined with another detection method to detect consonant intervals.

2. The speech segmentation method according to claim 1, wherein the other detection method recognizes every frame of the whole speech interval as a vowel or a nasal and, when frames recognized as nasal continue for a fixed number or more, detects that interval as a consonant interval.

3. The speech segmentation method according to claim 1, wherein the other detection method makes a voiced/unvoiced determination for every frame of the whole speech interval and, when unvoiced frames continue for a fixed number or more, detects that interval as a consonant.

4. The speech segmentation method according to claim 1, wherein the other detection method detects consonant intervals by combining a consonant-interval detection method that recognizes every frame of the whole speech interval as a vowel or a nasal and, when frames recognized as nasal continue for a fixed number or more, detects that interval as a consonant interval, with a consonant-interval detection method that makes a voiced/unvoiced determination for every frame of the whole speech interval and, when unvoiced frames continue for a fixed number or more, detects that interval as a consonant.