JP2886879B2

JP2886879B2 - Voice recognition method

Info

Publication number: JP2886879B2
Application number: JP5956589A
Authority: JP
Inventors: 茂実大津
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 1989-03-14
Filing date: 1989-03-14
Publication date: 1999-04-26
Anticipated expiration: 2014-04-26
Also published as: JPH02239297A

Description

[Detailed Description of the Invention] [Industrial Field of Application] The present invention relates to a speech recognition method, and in particular to a speech recognition method that is effective for recognizing speech composed of consonant-vowel syllables (hereinafter abbreviated as CV syllables) spoken by unspecified speakers.

[Prior Art] A conventional speech recognition method of this type is already known in which speech composed of the CV syllables to be extracted, for example voiced plosives, is frequency-analyzed, the first to third formant frequencies are extracted from the spectral peaks obtained by linear predictive analysis (hereinafter, LPC analysis), the temporal change of the second formant frequency is tracked, and the voiced plosive is identified from the slope of that change (IEICE Technical Report SP88-14).

[Problem to Be Solved by the Invention] When the above technique is used to recognize speech composed of CV syllables from unspecified speakers, the temporal change of the second formant frequency is tracked directly. As a result, not only do the formant frequencies themselves differ greatly among men, women, children, and so on, but the method is also directly exposed to frequency fluctuations such as intonation, and speech is correspondingly prone to misrecognition.

The present invention has been made in view of the above problems, and provides a novel speech recognition method that reduces misrecognition when recognizing speech composed of CV syllables from unspecified speakers.

[Means for Solving the Problems] The present inventor analyzed the above problem as follows and thereby arrived at the present invention.

Specifically, misrecognition in the conventional speech recognition method is considered to arise because the second formant frequency itself is used as the feature space for recognizing speech, and because speech composed of CV syllables is not normalized in a way that includes the transition portion accompanying the temporal variation of the formant frequencies, which is an important feature of consonants.

Therefore, to normalize speech composed of CV syllables, it is important to select speech feature parameters that do not depend on a specific speaker. The present inventor considered making the feature space for recognizing speech correspond to a frequency-axis space matched to human auditory characteristics (hereinafter, the auditory correspondence scale), found speech feature parameters obtained by converting the formant frequencies into relative values that do not depend directly on a specific speaker, and thereby arrived at the present invention.

That is, in recognizing speech composed of CV syllables, the present invention extracts, as feature parameters, difference information on the auditory correspondence scale between the adjacent first and second formant frequencies and between the second and third formant frequencies, among the formant frequencies obtained by frequency analysis of the speech signal. It identifies the vowel component from the convergence point of this difference information, and identifies the consonant component on the basis of the rising position and temporal behavior of the difference information.

In such technical means, the CV syllables to be recognized may be voiced or unvoiced, provided the first to third formants occur. The auditory correspondence scale includes the mel scale and a logarithmic scale that approximates it.
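As a rough illustration of why a logarithmic scale can stand in for the mel scale, the sketch below compares the width of one octave on the two scales. The conversion used, m = 2595 log10(1 + f/700), is a commonly used mel approximation and is an assumption here; the source does not say which mel formula is meant. The function names are hypothetical.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert Hz to mel with the common 2595*log10(1 + f/700) formula.

    The formula is an assumption; the source does not specify one.
    """
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def octave_step_mel(f_hz: float) -> float:
    """Mel distance covered by one octave starting at f_hz."""
    return hz_to_mel(2.0 * f_hz) - hz_to_mel(f_hz)


if __name__ == "__main__":
    # On a purely logarithmic scale every octave has the same width.
    # On the mel scale the octave width changes slowly at high
    # frequencies and much faster at low frequencies.
    for f in (100.0, 200.0, 2000.0, 4000.0):
        print(f"{f:7.0f} Hz -> octave width {octave_step_mel(f):7.1f} mel")
```

Under this formula the logarithmic approximation is closest in the upper formant range and coarser at low frequencies.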

The vowel component and the consonant component may be identified in either order. However, because the temporal behavior of the feature parameters for the consonant component varies with the vowel component, it is preferable, from the standpoint of ease of identification, to identify the vowel component first and the consonant component afterward.

Furthermore, the means for implementing the feature parameter extraction step and the vowel and consonant component identification steps may be redesigned as appropriate, provided they can perform the functions required in each step.

[Operation] According to the technical means described above, the speech feature parameters are difference information, on the auditory correspondence scale, between the first and second formant frequencies and between the second and third formant frequencies. The feature parameters are therefore normalized in correspondence with the auditory space without depending directly on the formant frequencies, and are distributed, with temporal behavior, within the feature space of the auditory correspondence scale. This temporal behavior appears as a trajectory that rises at a certain location in the feature space, moves as time passes, and converges at a certain location.
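The normalization described above can be illustrated with a short derivation. It takes the logarithmic scale as the auditory correspondence scale and assumes, as an idealization the source does not state, that a second speaker's formants are those of the first multiplied by a common factor α > 0. The symbol α and the primed frequencies are introduced here only for illustration.

```latex
\begin{aligned}
f_k' &= \alpha f_k \quad (k = 1, 2, 3),\\
\log f_1' - \log f_2' &= \log\frac{\alpha f_1}{\alpha f_2} = \log\frac{f_1}{f_2},\\
\log f_3' - \log f_2' &= \log\frac{\alpha f_3}{\alpha f_2} = \log\frac{f_3}{f_2}.
\end{aligned}
```

Under that assumption the differences are unchanged even though every individual formant frequency shifts. Real speakers differ in less uniform ways, so the cancellation would only be approximate in practice.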

The convergence position of the feature parameters corresponds to the vowel component of the CV syllable, and its scatter within the feature space stays within a relatively narrow range.

The rising position of the feature parameters and the positions on the way to the convergence position, on the other hand, correspond to the consonant component of the CV syllable, and their scatter within the feature space likewise stays within a relatively narrow range.

[Embodiment] The present invention is described in detail below on the basis of the embodiment shown in the accompanying drawings.

FIG. 1 is a block diagram showing an embodiment of an apparatus that implements the speech recognition method according to the present invention.

In the figure, reference numeral 1 denotes a microphone that converts speech into an electrical signal. Reference numeral 2 denotes a signal adjustment circuit consisting of an amplifier and a low-pass filter, which amplify the speech signal input from the microphone 1 and then cut unnecessary high-frequency components, and an automatic gain control circuit, which adjusts the amplitude level of the speech signal. Reference numeral 3 denotes an AD converter that converts the analog signal from the signal adjustment circuit 2 into a digital signal; 4 denotes a frequency analysis unit that analyzes frequency by linear predictive analysis; 5 denotes a parameter extraction unit that extracts feature parameters from the formant frequencies obtained by the frequency analysis unit 4 and records their change over time; and 6 denotes a speech identification unit that identifies speech on the basis of the information from the parameter extraction unit 5.

In this embodiment, the parameter extraction unit 5 performs a control operation according to the flowchart shown in FIG. 2.

Specifically, the parameter extraction unit 5 first extracts the first to third formant frequencies f1 to f3 from the result obtained by the frequency analysis unit 4 (step 1). It then calculates the mutual distances of the formant frequencies f1 to f3 on a logarithmic scale, which approximates the mel scale, namely log(f1/f2) (hereinafter Lf12) and log(f3/f2) (hereinafter Lf23), as the first and second feature parameters, and stores them in memory (steps 2 and 3).
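The calculation in steps 2 and 3 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and base-10 logarithms are an assumption because the source writes only "log".

```python
import math
from typing import Tuple


def feature_parameters(f1: float, f2: float, f3: float) -> Tuple[float, float]:
    """Return (Lf12, Lf23) = (log(f1/f2), log(f3/f2)) for formants in Hz.

    Base 10 is an assumption; the source writes only "log".
    """
    if not (0.0 < f1 and 0.0 < f2 and 0.0 < f3):
        raise ValueError("formant frequencies must be positive")
    return math.log10(f1 / f2), math.log10(f3 / f2)


if __name__ == "__main__":
    # Hypothetical formant values in Hz, chosen only for illustration.
    lf12, lf23 = feature_parameters(500.0, 1500.0, 2500.0)
    print(f"Lf12 = {lf12:.4f}, Lf23 = {lf23:.4f}")
```

Because f1 < f2 < f3 for ordered formants, Lf12 as defined is negative and Lf23 is positive.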

The parameter extraction unit 5 then determines whether the feature parameters Lf12 and Lf23 are the first ones obtained (step 4). If they are, it returns to step 1; if not, it determines whether the feature parameters Lf12 and Lf23 have changed (step 5).

If the feature parameters Lf12 and Lf23 have changed, the parameter extraction unit 5 clears the count described below (step 8) and then repeats the operations from step 1 onward in a predetermined cycle, storing the change over time of the first and second feature parameters Lf12 and Lf23 sequentially in a fixed memory area.

If, on the other hand, the feature parameters Lf12 and Lf23 have not changed, the parameter extraction unit 5 increments the count by 1 (step 6) and determines whether the count has reached a predetermined number m (step 7). If it has, the unit judges that the vowel component has been input and ends the parameter extraction process; otherwise it returns to step 1.
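The loop of FIG. 2 (steps 1 to 8) might be sketched as below. Several details are assumptions made only for illustration: formants arrive as ready-made (f1, f2, f3) tuples, "changed" is judged with a hypothetical tolerance `eps` (the source does not say how change is detected), logarithms are base 10, and the default values of `m` and `eps` are arbitrary.

```python
import math
from typing import Iterable, List, Tuple

Frame = Tuple[float, float, float]   # (f1, f2, f3) in Hz
Point = Tuple[float, float]          # (Lf12, Lf23)


def extract_trajectory(frames: Iterable[Frame], m: int = 3,
                       eps: float = 0.01) -> List[Point]:
    """Store (Lf12, Lf23) per frame until they stay unchanged m times in a row."""
    memory: List[Point] = []
    count = 0
    for f1, f2, f3 in frames:                                   # step 1
        current = (math.log10(f1 / f2), math.log10(f3 / f2))    # step 2
        memory.append(current)                                  # step 3
        if len(memory) == 1:                                    # step 4: first sample
            continue
        previous = memory[-2]
        changed = (abs(current[0] - previous[0]) > eps
                   or abs(current[1] - previous[1]) > eps)      # step 5
        if changed:
            count = 0                                           # step 8
            continue
        count += 1                                              # step 6
        if count >= m:                                          # step 7
            break                                               # vowel reached
    # If the input ends before the count reaches m, whatever was stored is returned.
    return memory
```

The first stored point then plays the role of the rising position and the last one the role of the convergence point used by the identification stage.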

The speech identification unit 6 performs a control operation according to the flowchart shown in FIG. 3.

Specifically, the speech identification unit 6 first searches the data near the end of the memory written by the parameter extraction unit 5 and extracts the feature parameters Lf12(n) and Lf23(n) located at the convergence point (step 1). It determines which of the predefined vowel regions contains these feature parameters and thereby decides the vowel component (step 2). Next, it searches the initial data in the memory written by the parameter extraction unit 5, determines which of the predefined consonant candidate regions contains the feature parameters Lf12(1) and Lf23(1) at the rising position, and selects one or more consonant candidates (step 3). If there is a single consonant candidate, that candidate is taken as the consonant component to be recognized (steps 4 and 5). If there are several candidates, the unit searches the intermediate data in the memory written by the parameter extraction unit 5, obtains the change pattern of the feature parameters Lf12(i) and Lf23(i) (steps 4 and 6), and decides the consonant component to be recognized by comparing this change pattern with predetermined change pattern data (step 7).
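The decision flow of FIG. 3 might be sketched as follows. Everything numeric or structural here is hypothetical: the source stores partition-line data and change-pattern data but gives no values, so this sketch represents regions as axis-aligned boxes and a change pattern as a single steepness angle between the rising position and the midpoint of the trajectory. It is one possible reading, not the patented implementation.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float]               # (Lf12, Lf23)
Box = Tuple[float, float, float, float]   # (Lf12 min, Lf12 max, Lf23 min, Lf23 max)


def in_box(p: Point, box: Box) -> bool:
    return box[0] <= p[0] <= box[1] and box[2] <= p[1] <= box[3]


def steepness_deg(a: Point, b: Point) -> float:
    """Angle of the move a -> b measured from the Lf12 axis, in degrees (0 to 90)."""
    return math.degrees(math.atan2(abs(b[1] - a[1]), abs(b[0] - a[0])))


def identify_cv(trajectory: Sequence[Point],
                vowel_regions: Dict[str, Box],
                candidate_regions: Dict[Tuple[str, ...], Box],
                pattern_deg: Dict[str, float]) -> str:
    """Return the recognized CV syllable, vowel first and consonant second."""
    convergence = trajectory[-1]                                   # step 1
    vowels = [v for v, box in vowel_regions.items() if in_box(convergence, box)]
    onset = trajectory[0]
    groups = [c for c, box in candidate_regions.items() if in_box(onset, box)]
    if not vowels or not groups:
        raise ValueError("trajectory falls outside the stored regions")
    vowel = vowels[0]                                              # step 2
    candidates = groups[0]                                         # step 3
    if len(candidates) == 1:                                       # step 4
        consonant = candidates[0]                                  # step 5
    else:
        middle = trajectory[len(trajectory) // 2]
        observed = steepness_deg(onset, middle)                    # step 6
        consonant = min(candidates,
                        key=lambda c: abs(pattern_deg[c] - observed))  # step 7
    return consonant + vowel
```

Deciding the vowel before the consonant mirrors the order the description prefers; the stored pattern data could equally be made vowel-specific, which this sketch does not attempt.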

The vowel regions used in this embodiment are defined as shown in FIG. 4. For example, the values of the first and second feature parameters Lf12 and Lf23 are obtained for each of the vowels "a", "i", "u", "e", and "o" spoken by unspecified speakers, 20 male and 20 female (MALE and FEMALE in the figure), and are plotted in the speech feature space whose coordinate axes are Lf12 and Lf23. On the basis of the plotted vowel distributions, the vowel regions Sa, Si, Su, Se, and So are separated by predetermined partition lines l1 to l4, and the data for whichever of the partition lines l1 to l4 are needed to specify each vowel region are stored in memory in advance. In FIG. 4, MALE is shown as ● (filled) and FEMALE as ○ (open); "a" is ○, "i" is △, "u" is □, "e" is ▽, and "o" is ◇.

The consonant candidate regions used in this embodiment are defined as shown in FIG. 5. Likewise, for speech composed of CV syllables spoken by 20 male and female unspecified speakers (FIG. 5 takes "ba", "da", and "ga" as examples), the values of the first and second feature parameters Lf12 and Lf23 at the rising position are obtained and plotted in the speech feature space whose coordinate axes are the first and second feature parameters. On the basis of the plotted distributions, one or more consonant candidates are separated, to the extent that they can be separated, by a predetermined partition line l, and the partition line data needed to specify each consonant candidate region are stored in memory in advance. In FIG. 5, "ba" is shown by ●, "da" by ○, and "ga" by ■, and the partition line l separates "ba" from "da" and "ga".

The change pattern data for the consonant candidates used in this embodiment are prepared as follows. Likewise, for 10 male and female unspecified speakers, the values of the first and second feature parameters Lf12 and Lf23 from the rise of each target consonant onward are obtained and plotted in the speech feature space whose coordinate axes are the first and second feature parameters. On the basis of the group of plotted change patterns for each consonant, a common change pattern for that consonant is set as change data. The change pattern data corresponding to each consonant are stored in memory in advance.

Next, a concrete speech recognition operation using the speech recognition method of this embodiment is described.

In this embodiment, taking as an example the identification of the voiced plosives "ba", "da", and "ga" among speech composed of CV syllables, the speech signal was AD-converted at a sampling frequency of 16.6 kHz with a 12-bit word length, and the temporal behavior of the first and second feature parameters Lf12 and Lf23 was obtained from the first to third formant frequencies given by LPC analysis.

FIGS. 6 and 7 are graphs showing the temporal behavior of the first and second feature parameters Lf12 and Lf23, respectively, for the input speech signals "ba", "da", and "ga". FIG. 8 is a graph showing the temporal behavior of Lf12 and Lf23 in the feature space that has the two parameters as coordinate axes. In FIGS. 6 to 8, "ba" is shown by ●, "da" by ○, and "ga" by ■.

Suppose now that the speech signal is "ba". In steps 1 and 2, the speech identification unit 6 first determines from the feature parameters Lf12(n) and Lf23(n) at the convergence point that the vowel falls within the region Sa of FIG. 4, and on that basis decides that the vowel component of the speech to be recognized is "a". Then, in steps 3 to 5, it selects "b" as the consonant candidate, determines that there is only one candidate, and decides that the consonant component to be recognized is "b". The speech identification unit 6 thereby identifies the speech to be recognized as "ba".

Suppose next that the speech signal is "da". In steps 1 and 2, the speech identification unit 6 first determines from the feature parameters Lf12(n) and Lf23(n) at the convergence point that the vowel falls within the region Sa of FIG. 4, and on that basis decides that the vowel component of the speech to be recognized is "a". In step 3 it then selects "d" and "g" as consonant candidates. On determining in step 4 that there are several candidates, it finds in step 6 that the change pattern of the consonant component of the speech to be recognized has a relatively steep slope, as shown by arrow A in FIG. 8, and in the comparison of step 7 it decides that the consonant component is "d". The speech identification unit 6 thereby identifies the speech to be recognized as "da".

Finally, suppose that the speech signal is "ga". The speech identification unit 6 follows essentially the same process as for the speech signal "da". In this case, however, it finds in step 6 that the change pattern of the consonant component of the speech to be recognized has an extremely gentle slope, as shown by arrow B in FIG. 8, and in the comparison of step 7 it decides that the consonant component is "g".

In this embodiment, logarithmic difference information of the formant frequencies is used as the first and second feature parameters at the rising position of speech composed of CV syllables ("ba", "da", and "ga" in this embodiment). As a comparative example, the first and second formant frequencies f1 and f2, or the second and third formant frequencies f2 and f3, were used directly as the first and second feature parameters, giving the results shown in FIG. 9 and FIG. 10, respectively.

The distribution in FIG. 9 shows that it is difficult to separate the target consonant candidate regions clearly. In the distribution in FIG. 10, the target consonant candidate regions can be separated by a partition line l', but because l' is not parallel to either coordinate axis, drawing it is more cumbersome than discrimination in the Lf12-Lf23 space. Either way, the advantage of this embodiment is confirmed.

The parameter extraction unit 5 of this embodiment stores all of the temporal behavior data of the feature parameters Lf12 and Lf23 in memory. From the standpoint of limiting memory capacity, however, it is preferable not to store every unchanging data item corresponding to the vowel component individually, but to store only one set of data corresponding to the vowel component.
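One way to realize this memory-saving variant is sketched below. It is an assumption-laden illustration: the function name and the tolerance `eps` used to judge "unchanging" are hypothetical, and the source does not prescribe any particular storage scheme.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (Lf12, Lf23)


def store_compact(points: Iterable[Point], eps: float = 0.01) -> List[Point]:
    """Keep a point only when it differs from the last stored point by more than eps."""
    stored: List[Point] = []
    for p in points:
        if (stored
                and abs(p[0] - stored[-1][0]) <= eps
                and abs(p[1] - stored[-1][1]) <= eps):
            continue  # unchanged sample: already represented by the last stored point
        stored.append(p)
    return stored
```

With this scheme the steady vowel portion collapses to a single stored point, while the count of unchanged samples (steps 6 and 7) would have to be kept separately.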

[Effects of the Invention] As described above, the speech recognition method according to the present invention uses the mutual distances of the formant frequencies on the auditory correspondence scale as feature parameters and tracks their temporal behavior within this parameter space. Speech composed of CV syllables can therefore be normalized effectively in a form that corresponds to hearing. This effectively prevents the method from being directly affected by differences in the formant frequencies themselves among unspecified speakers and by frequency fluctuations such as intonation, and reduces misrecognition of speech composed of CV syllables.

Furthermore, according to the speech recognition method of claim 2, the vowel component is identified first and the consonant component afterward. Compared with an approach that first identifies the consonant component, whose temporal behavior differs for each vowel component, the processing for specifying speech composed of CV syllables can be simplified, and such speech can be recognized correspondingly more efficiently.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of an apparatus that implements the speech recognition method according to the present invention. FIG. 2 is a flowchart showing the control operation of the formant extraction unit (the parameter extraction unit 5) used in the embodiment. FIG. 3 is a flowchart showing the control operation of the speech identification unit used in the embodiment. FIG. 4 is a graph plotting, on a logarithmic scale, the logarithmic differences of the formant frequencies of each vowel spoken by unspecified speakers. FIG. 5 is a graph plotting, on a logarithmic scale, the logarithmic differences of the formant frequencies at the rising position of some speech composed of CV syllables spoken by unspecified speakers. FIG. 6 is a graph showing the temporal behavior of the first feature parameter of particular speech composed of CV syllables. FIG. 7 is a graph showing the temporal behavior of the second feature parameter of particular speech composed of CV syllables. FIG. 8 is a graph showing the temporal behavior of particular speech composed of CV syllables in the feature space formed by the logarithmic scale. FIG. 9 is a graph plotting the formant frequencies f1 and f2 at the rising position of some speech composed of CV syllables spoken by unspecified speakers. FIG. 10 is a graph plotting the formant frequencies f2 and f3 at the rising position of some speech composed of CV syllables spoken by unspecified speakers.

[Description of Reference Signs]
1: microphone
2: signal adjustment circuit
3: AD converter
4: frequency analysis unit
5: parameter extraction unit
6: speech identification unit

Continuation of the front page (58) Fields searched (Int.Cl.6, DB name): G10L 9/02 301, G10L 9/06, JICST file (JOIS)

Claims

(57) [Claims]

1. A speech recognition method for recognizing speech composed of consonant-vowel syllables, characterized in that, among the formant frequencies obtained by frequency analysis of a speech signal, difference information on an auditory correspondence scale between the adjacent first and second formant frequencies and between the second and third formant frequencies is extracted as feature parameters, a vowel component is identified from the convergence point of the difference information on the auditory correspondence scale of the adjacent formant frequencies, and a consonant component is identified on the basis of the rising position and temporal behavior of the difference information.

2. The speech recognition method according to claim 1, wherein a vowel component is first identified, and then a consonant component is identified.