JPS6370899A

JPS6370899A - Voice recognition equipment

Info

Publication number: JPS6370899A
Application number: JP61216180A
Authority: JP
Inventors: 徹上田; 岩橋　弘幸
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 1986-09-13
Filing date: 1986-09-13
Publication date: 1988-03-31
Also published as: JPH0564800B2

Abstract

(57) [Abstract] This publication consists of application data from before electronic filing, so no abstract data is recorded.

Description

[Detailed Description of the Invention] <Field of Industrial Application> The present invention relates to a speech recognition device that recognizes input speech, such as Japanese, in syllable units and outputs the result to an external device.

<Prior Art> In a conventional speech recognition device, the syllable interval of each syllable, which is the unit of recognition, is extracted from the input speech by detecting the syllable boundaries from changes in the speech spectrum (hereinafter simply "spectrum") over a fixed interval.

<Problems to Be Solved by the Invention> In the conventional speech recognition device described above, however, the spectrum of the input speech changes rapidly in some portions and gently in others. It is difficult to track both kinds of change and detect syllable boundaries accurately, so speech is often misrecognized.

Accordingly, an object of the present invention is to provide a syllable recognition device that uses information from the spectrum to detect syllable boundaries accurately, regardless of whether the spectrum changes gently or abruptly.

<Means for Solving the Problems> To achieve the above object, the present invention provides a speech recognition device that extracts syllable intervals from input speech, calculates the similarity between the feature pattern of each extracted syllable and standard feature patterns stored in advance in a memory, and thereby recognizes the input speech in syllable units. The device is characterized by comprising a stable point extraction unit that extracts a stable point of the spectrum from changes in the spectral information over a fixed interval of the input speech, and a syllable boundary detection unit that calculates the similarity between the spectral information at the stable point and the spectral information after the stable point and compares that similarity with a predetermined value to detect the syllable boundary of the syllable interval to be extracted.

<Operation> When speech is input, the stable point extraction unit extracts a stable point on the spectrum from changes in the spectral information over a fixed interval of the input speech. The syllable boundary detection unit then calculates the similarity between the spectral information at the stable point and the spectral information after the stable point, and compares the resulting similarity with a predetermined value. This detects the syllable boundary of the syllable interval to be extracted, and the syllable interval is extracted according to the detected boundary. Meaningful syllable intervals can therefore be extracted accurately regardless of whether the spectrum changes gently or abruptly.

<Embodiment> The present invention is described in detail below with reference to the illustrated embodiment.

In FIG. 1, reference numeral 1 denotes a microphone for inputting speech, 2 denotes an amplifier that amplifies only the speech band of the speech input from the microphone 1, and 3 denotes a feature extraction unit that receives the output of the amplifier 2.

The feature extraction unit 3 extracts, at 8 ms intervals, the feature parameters of a 16 ms section (hereinafter called a frame) of the speech amplified by the amplifier 2. The feature parameters consist of the feature pattern of one syllable (for example, the outputs of a 16-channel band-pass filter bank), which is used in the similarity calculation for final speech recognition performed by the matching unit 7, and the parameters (for example, power and the first-order autocorrelation coefficient) that the phoneme classification unit 4 uses for phoneme classification.
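The framing described above can be sketched as follows. This is a minimal illustration, not the circuit of the feature extraction unit 3: the sampling rate of 8000 Hz, the function names, and the normalization of the autocorrelation coefficient are assumptions made for the example, since the text only gives the 16 ms frame length, the 8 ms interval, and the names of the parameters.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=16, hop_ms=8):
    """Cut a sample sequence into 16 ms frames taken every 8 ms."""
    frame_len = sample_rate_hz * frame_ms // 1000   # samples per frame
    hop_len = sample_rate_hz * hop_ms // 1000       # samples between frame starts
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop_len
    return frames


def frame_power(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def first_order_autocorrelation(frame):
    """Lag-1 autocorrelation normalized by the lag-0 value (assumed form).

    Returns 0.0 for an all-zero (silent) frame to avoid division by zero.
    """
    r0 = sum(x * x for x in frame)
    if r0 == 0:
        return 0.0
    r1 = sum(a * b for a, b in zip(frame, frame[1:]))
    return r1 / r0
```

Because the interval is half the frame length, consecutive frames overlap by 8 ms, so every sample away from the ends of the signal is covered by two frames.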

The phoneme classification unit 4 uses the feature parameters to label each frame of the speech with the nature of the speech in that frame. The boundary detection unit 5 extracts stable points on the spectrum by the method described in detail later and detects syllable boundaries on the basis of those stable points.

The syllable interval extraction unit 6 extracts syllable intervals from the input speech using the time series of labels obtained by the phoneme classification unit 4 and the syllable boundary information obtained by the boundary detection unit 5. The matching unit 7 then recognizes the speech by calculating the Euclidean distance, one example of a similarity calculation, between the feature pattern extracted by the feature extraction unit 3 for one syllable interval extracted by the syllable interval extraction unit 6 and the standard feature patterns stored in advance in the standard feature pattern memory 8. The CPU 9 controls the feature extraction unit 3, the boundary detection unit 5, the syllable interval extraction unit 6, and the matching unit 7, and also controls the interface 10, which outputs the recognition result obtained by the matching unit 7 to an external device (not shown).

The speech recognition device configured as described above operates as follows.

When the speaker utters speech into the microphone 1, the speech enters through the microphone 1, only its speech band is amplified by the amplifier 2, and the result is sent to the feature extraction unit 3. The feature extraction unit 3 divides the speech into 16 ms frames at 8 ms intervals and extracts the feature parameters of each frame.

The feature parameters are the feature pattern of one syllable (for example, the outputs of a 16-channel band-pass filter bank), which is used in the similarity calculation for final speech recognition performed by the matching unit 7, and the parameters (for example, power and the first-order autocorrelation coefficient) used by the phoneme classification unit 4 for phoneme classification. The phoneme classification unit 4 labels each frame with the nature of its speech according to the feature parameters obtained by the feature extraction unit 3. The labels used in this embodiment are of four types: vowel (symbol 'V'), fricative (symbol 'F'), buzz-bar (symbol 'B'), and silent (symbol ',').

The boundary detection unit 5 extracts a stable point from the change in the spectrum of the frames obtained. It then detects the syllable boundary of the syllable to be extracted by finding the Euclidean distance, which represents the similarity, between the spectral pattern of the frame at the stable point and the spectral pattern of each speech frame input after the stable point. FIG. 2 is a flowchart covering the process from extraction of the stable point to detection of the syllable boundary; the right side of the figure is the stable point extraction flow and the left side is the syllable boundary detection flow. The procedure for extracting the stable point and detecting the syllable boundary is described in detail below with reference to FIG. 2.

Here, the variables are defined as follows:
i, j: temporary variables
N: constant representing the order of the pattern
t: frame number
ta: frame number of the stable point
PAT(i): i-th order feature value of the feature pattern of the stable point
D(t): spectral change distance at frame t
SP(t)(i): i-th order feature value of the input pattern at frame t
L: constant representing the length of the window used to calculate the spectral change; the window length is 2L+1
M: constant representing the length of the window used to find a stable point; the window length is 2M+1
DIS: distance between the feature pattern of the stable point and the feature pattern of the input frame
ANTFLG: boundary detection flag based on the above distance from the stable point

Suppose that the input pattern SP(t)(i) of a frame t on the spectrum (the current frame) is input. In step S1, it is determined whether a stable point pattern exists, that is, whether a stable point has already been extracted and its spectral pattern stored. This test relies on the fact that, once the spectral pattern of a stable point has been stored, its data cannot be 0 in every order. If PAT(i) = 0 holds for all PAT(i) with i = 1, ..., N, it is judged that no stable point pattern has been extracted, and the process proceeds to step S2 to search for a stable point. Otherwise it is judged that a stable point has already been extracted, and the process proceeds to step S5.

In step S2, the stability of the current frame is checked.

That is, let the spectral change D(t) at the current frame t be

$$D(t) = \sum_{i=1}^{N} \bigl(SP(t-L)(i) - SP(t+L)(i)\bigr)^{2}$$

and consider the condition

$$D(t) = \min_{j} D(t+j) \qquad (1)$$

where $j = -M, -M+1, \ldots, 0, \ldots, M$. If D(t) satisfies equation (1), the current frame t is judged to be stable and the process proceeds to step S3. If D(t) does not satisfy equation (1), the current frame t is judged not to be stable, and the process returns to step S1 to handle the next frame.
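The stability check of step S2 can be sketched in a few lines. This is an illustration under stated assumptions: the spectral patterns are held in a list of equal-length numeric lists, the distance is the squared form of D(t), and the function names and the handling of frames near the ends of the list are choices made for the example rather than details given in the text.

```python
def spectral_change(patterns, t, L):
    """D(t): squared distance between the patterns L frames before and after t.

    The caller must ensure that t - L and t + L are valid indices.
    """
    before, after = patterns[t - L], patterns[t + L]
    return sum((b - a) ** 2 for b, a in zip(before, after))


def is_stable(patterns, t, L, M):
    """True when D(t) is the minimum of D over the window t-M .. t+M."""
    # Frames too close to either end lack a full window and are treated as unstable.
    if t - M - L < 0 or t + M + L >= len(patterns):
        return False
    d_t = spectral_change(patterns, t, L)
    return all(d_t <= spectral_change(patterns, t + j, L)
               for j in range(-M, M + 1))
```

Note that evaluating frame t needs the patterns up to frame t+M+L, so a device working on live input would decide on a frame only after that many later frames have arrived.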

In step S3, to avoid adopting a point where the spectral change is very large as a stable point, the spectral change D(t) at the stable frame t found in step S2 is compared with the set value THDIS2. If D(t) is smaller than THDIS2, the process proceeds to step S4. If it is equal to or greater than THDIS2, the current frame t cannot be adopted as a stable point, and the process returns to step S1.

In step S4, the spectral feature pattern at the frame ta adopted as the stable point is set in the stable point pattern PAT(i),

$$PAT(i) = SP(ta)(i), \qquad i = 1, \ldots, N$$

which completes the extraction of the stable point, and the process returns to step S1.

In step S5, the distance DIS between the stable point pattern of the extracted stable point and the spectral feature pattern of the current frame t is calculated (a Euclidean distance, as stated above for the boundary detection unit 5), and the process proceeds to step S6.

In step S6, it is determined whether the distance DIS found in step S5 is greater than the set value THDIS1, that is, whether the similarity is low or high. If DIS is equal to or less than the set value, the stable point pattern and the feature pattern of the current frame are similar, so the current frame cannot be adopted as a syllable boundary point, and the process returns to step S1. If DIS is greater than the set value, the current frame is taken to be a syllable boundary point, and the process proceeds to step S7.
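Steps S5 and S6 amount to one distance and one threshold test. The form below is a sketch that assumes a squared Euclidean distance without a square root; the text states only that the distance is Euclidean, and taking the square root would merely rescale the threshold THDIS1.

```latex
\mathrm{DIS}(t) = \sum_{i=1}^{N} \bigl(PAT(i) - SP(t)(i)\bigr)^{2},
\qquad
\text{frame } t \text{ is a syllable boundary point} \iff \mathrm{DIS}(t) > \mathrm{THDIS1}
```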

In step S7, since DIS > THDIS1 was found in step S6 and a syllable boundary has been detected, the syllable boundary detection flag ANTFLG is set (ANTFLG = 1), and the process proceeds to step S8.

In step S8, since detection of the syllable boundary of the syllable to be extracted is complete, the stable point pattern PAT(i) used for boundary detection is cleared (PAT(i) = 0 for i = 1, ..., N), and the process returns to step S1 to extract the stable point of the next syllable and detect its syllable boundary.
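Taken together, steps S1 to S8 form a small two-state loop: search for a stable point, then watch the distance from it. The sketch below illustrates that loop under several assumptions that are not in the text: the patterns are available as a complete list rather than arriving frame by frame, both D(t) and DIS use squared distances, `None` stands in for the all-zero "no stable point pattern" condition, and the class and method names are hypothetical.

```python
class BoundaryDetector:
    """Sketch of steps S1 to S8 over a list of per-frame spectral patterns."""

    def __init__(self, L, M, thdis1, thdis2):
        self.L, self.M = L, M
        self.thdis1, self.thdis2 = thdis1, thdis2
        self.pat = None  # stable point pattern PAT; None stands in for "all zeros"

    def _change(self, patterns, t):
        # D(t): squared distance between the patterns L frames before and after t.
        return sum((a - b) ** 2
                   for a, b in zip(patterns[t - self.L], patterns[t + self.L]))

    def _is_stable(self, patterns, t):
        # Frames too close to either end lack a full window and are skipped.
        if t - self.M - self.L < 0 or t + self.M + self.L >= len(patterns):
            return False
        d_t = self._change(patterns, t)
        return all(d_t <= self._change(patterns, t + j)
                   for j in range(-self.M, self.M + 1))

    def process(self, patterns, t):
        """Handle frame t; return True when a syllable boundary is detected."""
        if self.pat is None:                                  # S1: no pattern yet
            if not self._is_stable(patterns, t):              # S2: stability check
                return False
            if self._change(patterns, t) >= self.thdis2:      # S3: change too large
                return False
            self.pat = list(patterns[t])                      # S4: store PAT
            return False
        dis = sum((p - s) ** 2
                  for p, s in zip(self.pat, patterns[t]))     # S5: distance DIS
        if dis <= self.thdis1:                                # S6: still similar
            return False
        self.pat = None                                       # S8: clear PAT
        return True                                           # S7: ANTFLG = 1
```

Calling `process` once per frame index in order yields the frames at which ANTFLG would be set. Because the reference pattern is fixed at the stable point, a slow drift accumulates in DIS and eventually crosses THDIS1 even when no single frame-to-frame change is large.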

Once the stable point of one syllable has been extracted as described above and the syllable boundary of the syllable to be extracted has been detected on the basis of that stable point, the syllable interval extraction unit 6 of FIG. 1 extracts the syllable from the time series of phoneme labels obtained by the phoneme classification unit 4 and the syllable boundary information obtained by the boundary detection unit 5, following the syllable extraction flowchart shown in FIG. 3.

Here, the variables are defined as follows:
SEG: label output by the phoneme classification unit
FRAME: number of frames in the extracted syllable
CUTFLG: extraction completion flag
ANTFLG: syllable boundary detection flag (set by the syllable boundary detection unit)
FRMCNT: frame counter
VCNT: counter of frames carrying the vowel label /V/
THCUT: constant (10)
'V': vowel phoneme label
'F': fricative phoneme label

In step S11, it is determined whether CUTFLG (the syllable extraction completion flag) is set. If it is set, the process proceeds to step S12, where CUTFLG is cleared, and then to step S13. If it is already clear, the process proceeds directly to step S13.

In step S13, it is determined whether SEG (the phoneme label) of the current frame is /V/. If it is /V/, the process proceeds to step S14; if not, the process proceeds to step S17.

In step S14, 1 is added to FRMCNT (the frame counter) and 1 is added to VCNT (the number of frames with the vowel phoneme label 'V'), and the process proceeds to step S15.

In step S15, it is determined whether ANTFLG (the syllable boundary detection flag) is set. (ANTFLG is set in step S7 of the stable point extraction and syllable boundary detection flowchart of FIG. 2 when detection of one syllable boundary is complete.) If it is set to 1, the process proceeds to step S16 to extract one syllable. If it is not set, detection of the boundary of one syllable is judged not yet complete, and the process returns to step S11 to handle the next frame.

In step S16, since ANTFLG was found to be set to 1 in step S15 and the boundary of one syllable has therefore been detected, the frames up to the current frame are regarded as one syllable. FRMCNT, which counts the frames of the syllable up to the current frame, is transferred to FRAME (the number of frames in the extracted syllable), FRMCNT and VCNT are cleared, and the one-syllable extraction completion flag CUTFLG is set to 1. The process then returns to step S11 to carry out the next syllable extraction.

In step S17, reached when the phoneme label of the current frame is not 'V', VCNT is compared with THCUT (a constant, 10 in this embodiment). If the number of vowel phoneme labels is greater than THCUT, the frames before the current frame are judged to form a meaningful syllable and the current frame is judged to be a syllable boundary, so the process proceeds to step S18 to extract one syllable. If the number is equal to or less than THCUT, the process proceeds to step S19.

In step S18, the frames up to the current frame are regarded as one syllable. FRMCNT, which counts the frames of the syllable up to the current frame, is transferred to FRAME, FRMCNT and VCNT are cleared, and the one-syllable extraction completion flag CUTFLG is set to 1. The process then returns to step S11.

In step S19, it is determined whether SEG of the current frame is 'F'. If it is 'F', the process proceeds to step S21; if not, the process proceeds to step S20.

In step S20, since SEG of the current frame is neither /V/ nor /F/, the frames up to the current frame are judged not to form a meaningful syllable. FRMCNT and VCNT are cleared, and the process returns to step S11 to carry out the next syllable extraction.

In step S21, since a phoneme label of 'F' means that the syllable is still continuing, 1 is added to FRMCNT, and the process returns to step S11 to handle the next frame.
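The flow of steps S11 to S21 can be sketched as a per-frame update of the counters. The class below is an illustration, not the syllable interval extraction unit 6 itself: the class and method names are hypothetical, the flags are modeled as booleans, and ANTFLG is passed in per frame on the assumption that the caller clears it once it has been consumed, which the text does not spell out.

```python
THCUT = 10  # minimum vowel-frame count, as in the embodiment


class SyllableExtractor:
    """Sketch of the per-frame syllable extraction flow (steps S11 to S21)."""

    def __init__(self, thcut=THCUT):
        self.thcut = thcut
        self.frmcnt = 0      # FRMCNT: frames counted in the current syllable
        self.vcnt = 0        # VCNT: frames labeled 'V' in the current syllable
        self.cutflg = False  # CUTFLG: one-syllable extraction completed
        self.frame = 0       # FRAME: frame count of the last extracted syllable

    def _cut(self):
        # Steps S16 and S18: hand over the frame count and reset the counters.
        self.frame = self.frmcnt
        self.frmcnt = 0
        self.vcnt = 0
        self.cutflg = True

    def step(self, seg, antflg):
        """Process one frame; return True when a syllable has been extracted."""
        self.cutflg = False              # S11, S12: clear a leftover flag
        if seg == 'V':                   # S13
            self.frmcnt += 1             # S14
            self.vcnt += 1
            if antflg:                   # S15: boundary found by FIG. 2 flow
                self._cut()              # S16
        elif self.vcnt > self.thcut:     # S17: enough vowel frames so far
            self._cut()                  # S18
        elif seg == 'F':                 # S19
            self.frmcnt += 1             # S21: syllable still continuing
        else:
            self.frmcnt = 0              # S20: not a meaningful syllable
            self.vcnt = 0
        return self.cutflg
```

The two extraction paths serve different cases: step S16 splits a run of vowel frames at a boundary found from the spectrum, while step S18 closes a syllable when the label itself changes after more than THCUT vowel frames.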

When the one-syllable extraction completion flag CUTFLG is set to 1 in step S16 or step S18 of the syllable extraction flowchart of FIG. 3, the matching unit 7, on a command from the CPU 9 of FIG. 1, calculates the similarity between the feature pattern of the one syllable interval of the input speech extracted by the syllable interval extraction unit 6 and the standard feature patterns stored in advance in the standard feature pattern memory 8. The input and extracted syllable is recognized as the same syllable as the standard syllable with the highest similarity, and the recognition result is output to the external device via the interface 10.
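The matching step can be sketched as a nearest-pattern search. This is a minimal illustration that assumes every syllable's feature pattern has already been reduced to a vector of the same length as the standard patterns; the text does not say how syllable intervals of different lengths are aligned, and the function names and the dictionary layout of the standard patterns are hypothetical.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(feature_pattern, standard_patterns):
    """Return the syllable whose standard pattern is nearest to the input.

    standard_patterns maps a syllable name to its stored feature vector;
    the smallest distance corresponds to the highest similarity.
    """
    return min(
        standard_patterns,
        key=lambda name: euclidean_distance(feature_pattern,
                                            standard_patterns[name]),
    )
```

For example, with standard patterns `{"e": [1.0, 0.0], "i": [0.0, 1.0]}`, an input of `[0.9, 0.1]` is closer to the stored pattern for "e" and is recognized as that syllable.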

FIG. 4 shows an example of the stable points and syllable boundary points extracted in this embodiment. From the top, it shows the time series of phoneme classification labels, a vowel sequence obtained by a method different from this embodiment (for reference), and the spectral change. C is the syllable boundary point found from the conventional spectral change, and A and B are the syllable boundary points found in this embodiment. As FIG. 4 shows, the phoneme classification labels are all the vowel label /V/, so FIG. 4 is an example in which the syllable boundaries are detected at step S15 of the syllable extraction flowchart of FIG. 3.

That is, a stable point P1 is set on the spectrum curve by the method described above. On the basis of this stable point P1, the distance DIS between the feature pattern of each frame and the stable point pattern is found by the method described above, as shown by the thick curve P1Q1 in the figure. At point Q1, DIS > THDIS1 holds and the syllable boundary point A is detected. In the same way, when the next stable point P2 is set, point Q2 is found on the basis of P2 and the next syllable boundary point B is detected, so that the three syllables "e", "i", and "o" are extracted separately. With the conventional method of detecting syllable boundaries from changes in the spectrum, the syllable boundary point C is detected from the extreme point P3 of the spectral change, so the syllables "ei" and "o" are extracted as distinct. However, because the spectral change between the syllables "e" and "i" is gentle, no syllable boundary point is detected between them, and they cannot be extracted as separate syllables.

In this embodiment, therefore, syllable boundaries can be detected accurately even when the spectral change is so small that the conventional method based on spectral change cannot extract them.

<Effects of the Invention> As is clear from the above, the syllable recognition device of the present invention is provided with a stable point extraction unit that extracts a stable point of the spectrum from changes in the spectral information over a fixed interval of the input speech, and a syllable boundary detection unit that calculates the similarity between the spectral information at the stable point and the spectral information after the stable point and compares that similarity with a predetermined value to detect the syllable boundary of the syllable interval to be extracted. Syllable boundaries can therefore be detected accurately and easily whether the spectral information changes gently or abruptly.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of the speech recognition device of the present invention, FIG. 2 is a flowchart of stable point extraction and syllable boundary detection, FIG. 3 is a flowchart of syllable extraction, and FIG. 4 is a diagram showing an example of the stable points and syllable boundary points extracted in the embodiment.

Claims

[Claims]

(1) A speech recognition device that extracts syllable intervals from input speech, calculates the similarity between the feature pattern of each extracted syllable and standard feature patterns stored in advance in a memory, and recognizes the input speech in syllable units, the device being characterized by comprising: a stable point extraction unit that extracts a stable point of the speech spectrum from changes in the speech spectrum information over a fixed interval of the input speech; and a syllable boundary detection unit that calculates the similarity between the speech spectrum information at the stable point and the speech spectrum information after the stable point, and compares that similarity with a predetermined value to detect the syllable boundary of the syllable interval to be extracted.