JP2958922B2 - Voice feature extraction device

Info

Publication number: JP2958922B2
Application number: JP29885788A
Authority: JP
Inventors: 憲治坂本; 耕市山口
Original assignee: Consejo Superior de Investigaciones Cientificas CSIC
Current assignee: Consejo Superior de Investigaciones Cientificas CSIC
Priority date: 1988-11-24
Filing date: 1988-11-24
Publication date: 1999-10-06
Anticipated expiration: 2014-10-06
Also published as: JPH02143300A

Description

[Detailed Description of the Invention] <Field of Industrial Application> The present invention relates to a speech feature extraction device that extracts a feature representing the state of constriction in the vocal tract.

<Prior Art> Humans produce a variety of speech sounds by forming constrictions in the vocal tract. For example, a fricative is produced by forming a constriction between the tongue and the alveolar ridge or between the tongue and the palate, and forcing sufficiently pressurized breath through the narrow gap. A plosive is produced by forming a closure at the lips, the alveolar ridge, or the palate; when the intraoral pressure has risen sufficiently, the muscles forming the closure relax, the closure is released, and the dammed-up air bursts out of the mouth. A nasal is produced by forming a closure at the lips, the alveolar ridge, or the palate while the velum lowers to open the passage to the nasal cavity, so that resonance arises between the nasal cavity and the palate. Vowels are produced with a constriction formed in the vocal tract by the tongue, but this constriction is far wider than that of fricatives or plosives, so the tract can be regarded as effectively open.

Even under this rough classification, the state of constriction in the vocal tract can be a closed state, a constricted state, or an open state. Humans, too, are thought to identify speech sounds by perceiving this constriction state, which is a physiological feature of speech production.

Therefore, when speech is recognized in a speech recognition device or a speech training machine, it is very important to use, as a speech feature, a feature representing the state of constriction in the vocal tract (hereinafter referred to as the constriction value).

The inventors previously proposed the following method of generating the constriction value in the vocal tract (Japanese Patent Application No. 63-186352). The input speech is frequency-analyzed to obtain a feature vector for each frame, the distance between this feature vector and each of the standard patterns prepared in advance for each label is calculated, and the label of the closest standard pattern is adopted as the label of the feature vector. This yields a label sequence corresponding to the input speech. Examples of the labels include silence, nasal consonant, buzz bar, vowel, fricative, aspiration, and plosive.
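The conventional labeling step described above can be sketched as a nearest-standard-pattern search. This is only an illustrative sketch: the four-channel spectra, the label set, the Euclidean distance, and the function names are assumptions made for the example and are not taken from the cited application.

```python
import math

# Hypothetical standard patterns: one reference spectrum per label.
# The numbers are invented for illustration only.
STANDARD_PATTERNS = {
    "silence":   [0.0, 0.0, 0.0, 0.0],
    "vowel":     [0.9, 0.7, 0.3, 0.1],
    "fricative": [0.1, 0.2, 0.6, 0.9],
}


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def label_frame(feature_vector, patterns=STANDARD_PATTERNS):
    """Return the label whose standard pattern is closest to the frame."""
    return min(patterns, key=lambda label: euclidean(feature_vector, patterns[label]))


def label_sequence(frames, patterns=STANDARD_PATTERNS):
    """Label every frame of an utterance, giving the label sequence."""
    return [label_frame(frame, patterns) for frame in frames]
```

Because the decision depends directly on spectral distance, a shift in the spectrum caused by a different vocal tract length can move a frame closer to the wrong pattern, which is the weakness the invention addresses.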

From the labels adopted in this way, the constriction value in the vocal tract is obtained as follows. If the label is silence, nasal consonant, or buzz bar, a closure is assumed to exist somewhere in the vocal tract during that interval. If the label is fricative, plosive, or aspiration, a constriction is assumed to be formed somewhere in the vocal tract. If the label is vowel or nasalized vowel, the vocal tract is assumed to be open, with no constriction. The constriction value is thus determined at one of three levels.
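The three-level decision of the conventional method amounts to a lookup table from label to constriction state. In the sketch below, the label spellings and the function name are assumptions; the numeric coding (0 = closed, 1 = constricted, 2 = open) follows the coding that the description of the embodiments introduces later.

```python
CLOSED, CONSTRICTED, OPEN = 0, 1, 2

# Label names are hypothetical spellings of the labels listed in the text.
LABEL_TO_CONSTRICTION = {
    "silence": CLOSED,
    "nasal_consonant": CLOSED,
    "buzz_bar": CLOSED,
    "fricative": CONSTRICTED,
    "plosive": CONSTRICTED,
    "aspiration": CONSTRICTED,
    "vowel": OPEN,
    "nasalized_vowel": OPEN,
}


def constriction_values(labels):
    """Map a label sequence to a sequence of three-level constriction values."""
    return [LABEL_TO_CONSTRICTION[label] for label in labels]
```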

<Problem to be Solved by the Invention> However, the conventional method described above obtains the label sequence for the input speech by calculating the distance between the feature vector obtained by frequency analysis of the input speech and the standard patterns prepared in advance for each label, and adopting the label of the closest standard pattern as the label of the feature vector. This gives rise to the following problems.

Even when the same word is uttered, a different speaker has a different vocal tract length, so the frequency components of the resulting speech waveform also differ. Consequently, when a feature vector is matched against the standard patterns prepared in advance for each label, the standard pattern of a different label may turn out to be the closest, so that the wrong label is adopted and the constriction value is determined incorrectly.

To cope with this problem, multiple standard patterns are sometimes prepared for each label so that a variety of speakers can be handled. This, however, requires a large storage capacity and slows the computation. Moreover, the larger number of candidates at matching time can itself lead to misrecognition.

Accordingly, an object of the present invention is to provide a speech feature extraction device capable of generating a constriction value in the vocal tract that does not depend on speaker-dependent variation of the frequency components.

<Means for Solving the Problems> To achieve the above object, the invention according to claim 1 is a speech feature extraction device that frequency-analyzes input speech and extracts speech features from the resulting frequency components, the device comprising: a distinctive feature extraction unit that obtains, based on the frequency components obtained by the frequency analysis, distinctive feature values representing the degree of distinctive features related to human articulation; and a constriction value generation unit that generates the constriction value, which is one of the speech features and represents the degree of constriction of the vocal tract, from the distinctive feature values obtained by the distinctive feature extraction unit, in accordance with rules for generating the constriction value from distinctive feature values. A constriction value in the vocal tract that does not depend on speaker-dependent variation of the frequency components is thereby generated from the distinctive feature values.

The invention according to claim 2 is the speech feature extraction device according to claim 1, wherein the rules for generating the constriction value generate the constriction value of the current frame based on the distinctive feature values of at least the previous frame and the current frame.

The invention according to claim 3 is a speech feature extraction device that frequency-analyzes input speech and extracts speech features from the resulting frequency components, the device comprising: a distinctive feature extraction unit that obtains, based on the frequency components obtained by the frequency analysis, distinctive feature values representing the degree of distinctive features related to human articulation; and a constriction value generation unit that creates by itself, through learning with the error backpropagation algorithm, rules for generating the constriction value from the distinctive feature values, and that generates the constriction value from the distinctive feature values obtained by the distinctive feature extraction unit in accordance with the rules so created. A constriction value in the vocal tract that does not depend on speaker-dependent variation of the frequency components is thereby generated from the distinctive feature values.

The invention according to claim 4 is a speech feature extraction device in which input speech is frequency-analyzed by a feature extraction unit and speech features are extracted from the resulting frequency components, the device comprising a constriction value generation unit that creates by itself, through learning with the error backpropagation algorithm, rules for generating the constriction value from the acoustic parameters extracted when the feature extraction unit obtains the frequency components, and that generates the constriction value from the acoustic parameters in accordance with the rules so created. A constriction value in the vocal tract that does not depend on speaker-dependent variation of the frequency components is thereby generated from the acoustic parameters.

<Operation> In the invention according to claim 1, when speech is input, the input speech is frequency-analyzed to obtain frequency components. Based on these frequency components, the distinctive feature extraction unit obtains distinctive feature values representing the degree of distinctive features related to human articulation. The constriction value generation unit then generates the constriction value from the distinctive feature values so obtained, in accordance with rules for generating, from distinctive feature values, the constriction value, which is one of the speech features and represents the degree of constriction of the vocal tract.

Therefore, a constriction value in the vocal tract that does not depend on speaker-dependent variation of the frequency components is generated from the distinctive feature values.

In the invention according to claim 2, the constriction value generation unit reliably generates the constriction value of the current frame in accordance with rules that generate it based on the distinctive feature values of the previous frame and the current frame.

In the invention according to claim 3, when speech is input, the input speech is frequency-analyzed to obtain frequency components. Based on these frequency components, the distinctive feature extraction unit obtains distinctive feature values representing the degree of distinctive features related to human articulation. The constriction value generation unit then creates by itself, through learning with the error backpropagation algorithm, rules for generating the constriction value from the distinctive feature values, and generates the constriction value from the distinctive feature values obtained by the distinctive feature extraction unit in accordance with the rules it has created.

In the invention according to claim 4, when speech is input to the feature extraction unit, the input speech is frequency-analyzed by the feature extraction unit to obtain frequency components. The constriction value generation unit then creates by itself, through learning with the error backpropagation algorithm, rules for generating the constriction value from the acoustic parameters extracted when the feature extraction unit obtains the frequency components, and generates the constriction value from the acoustic parameters in accordance with the rules it has created.

Therefore, a constriction value in the vocal tract that does not depend on speaker-dependent variation of the frequency components is generated from the acoustic parameters.

<Embodiments> The present invention will now be described in detail with reference to the illustrated embodiments.

Distinctive features, which represent discriminative characteristics related to human articulation, are features that express well the differences in the state and position of the constriction in the vocal tract without being affected by speaker-dependent frequency variation. The present invention generates the constriction value in the vocal tract from values representing the degree of these distinctive features (hereinafter referred to as distinctive feature values).

FIG. 1 is a block diagram of a speech recognition device according to the present invention. The speech signal input from a microphone 1 is amplified by an amplifier 2 and input to a feature extraction unit 3 consisting of, for example, a bank of 16-channel band-pass filters (BPFs). Features such as the frequency components obtained by the frequency analysis in the feature extraction unit 3 are input to a distinctive feature extraction unit 4 according to the present invention. The distinctive feature extraction unit 4 extracts the distinctive feature values from the input frequency components for each frame by a predetermined procedure.

The time series of distinctive feature values of the input speech obtained for each frame by the distinctive feature extraction unit 4 is input to a constriction value generation unit 5. The constriction value generation unit 5 generates a constriction value for each frame of the input speech as described in detail later, and the time series of features of the input speech, including the generated constriction values, is input to a word recognition unit 5. There, the similarity to the time series of features of the standard patterns stored in a standard pattern storage unit 6 is calculated, using DP matching or a similar method. The input word recognized from the result of the similarity calculation is displayed on a result display unit 7.
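The word-level comparison by DP matching can be illustrated with a basic dynamic-programming alignment. The text names only "DP matching or the like", so the symmetric step pattern, the absence of path normalization, the use of a distance (smaller means more similar), and all names below are assumptions of this sketch.

```python
def dp_matching_distance(seq_a, seq_b, frame_dist):
    """Accumulated distance of the best time alignment between two sequences."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    # cost[i][j]: best accumulated distance aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_dist(seq_a[i - 1], seq_b[j - 1])
            # Allowed moves: stretch either sequence, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(input_seq, templates, frame_dist):
    """Return the word whose stored template aligns most closely with the input."""
    return min(
        templates,
        key=lambda word: dp_matching_distance(input_seq, templates[word], frame_dist),
    )
```

For a sequence of constriction values, `frame_dist` could be as simple as `lambda a, b: abs(a - b)`; for full feature vectors it would be a vector distance.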

The constriction value generation unit 5 generates the constriction value by one of the following three methods. The first method generates the constriction value from the distinctive feature values according to rules created from knowledge. The second method automatically creates the rules for constriction value generation using a neural network, described in detail later, and generates the constriction value from the distinctive feature values. The third method likewise creates the rules automatically using the neural network, but generates the constriction value directly from the acoustic parameters extracted when the feature extraction unit 3 obtains the frequency components.

The distinctive features used in this embodiment include "strident", representing frication; "nasal", representing nasality; "murmur/buzz", representing murmur or buzz quality; "voiced", representing voicing; "compact", representing concentration of the spectrum in the mid-frequency band; and "diffuse", representing dispersion of the spectrum over the mid- and high-frequency bands. In addition, the power of the input waveform, which is one of the acoustic parameters, is used.

The methods of generating the constriction value from the distinctive feature values are outlined below.

Let the constriction value be represented by the numbers 0, 1, and 2, where 0 denotes the closed state (hereinafter C = 0), 1 denotes a state in which a constriction is formed somewhere in the vocal tract (hereinafter C = 1), and 2 denotes the open state (hereinafter C = 2). The relationship between each constriction value and the behavior of the distinctive feature values is then as follows.

When C = 0: if the sound is voiced, the values of the distinctive features "murmur/buzz" and "nasal" are large. If it is unvoiced, all distinctive feature values are small.

When C = 1: the values of the distinctive features "strident", "diffuse", and "compact" are large.

When C = 2: the value of the distinctive feature "voiced" is large, and the power is also large.

Therefore, by formulating these relationships as rules and applying the rules to the distinctive feature values obtained for each frame of the input speech, the constriction value can be generated (the first method). Alternatively, a neural network, described in detail later, can be used: distinctive feature values whose relationship to the constriction value is known are input to the network, and the network is trained so that the constriction value it outputs approaches the target constriction value. The trained network is then used to generate the constriction value (the second method). Furthermore, if the acoustic parameters extracted by the feature extraction unit 3 are used as the input when training the neural network, the constriction value can be generated directly from the acoustic parameters (the third method).
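The learning-based methods can be illustrated with a very small network trained by error backpropagation. Everything concrete here is an assumption made for the sketch and is not specified in this passage: the single hidden layer of five sigmoid units, the one-hot output coding of C = 0, 1, 2, the squared-error loss, the learning rate, and the six invented training frames.

```python
import math
import random

FEATURES = ["strident", "nasal", "murmur_buzz", "voiced", "compact", "diffuse", "power"]

# Invented training frames (values in [0, 1]) paired with a target C of 0, 1, or 2.
TRAINING_DATA = [
    ([0.0, 0.8, 0.9, 0.6, 0.1, 0.1, 0.2], 0),  # voiced closure
    ([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0], 0),  # unvoiced closure
    ([0.9, 0.0, 0.1, 0.1, 0.6, 0.7, 0.4], 1),  # constriction
    ([0.7, 0.1, 0.0, 0.0, 0.8, 0.3, 0.3], 1),  # constriction
    ([0.1, 0.1, 0.1, 0.9, 0.2, 0.2, 0.9], 2),  # open
    ([0.0, 0.0, 0.2, 1.0, 0.3, 0.1, 0.8], 2),  # open
]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def one_hot(c, n=3):
    return [1.0 if k == c else 0.0 for k in range(n)]


class TinyMLP:
    """One-hidden-layer network; the last weight in each row is the bias."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hb = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1] + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return xb, hb, y

    def train_step(self, x, target, lr=0.5):
        xb, hb, y = self.forward(x)
        # Output-layer error terms (the factor 2 of the squared error is absorbed into lr).
        d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
        # Hidden-layer error terms: output errors propagated back through w2.
        d_hid = [hb[j] * (1.0 - hb[j]) * sum(d * row[j] for d, row in zip(d_out, self.w2))
                 for j in range(len(hb) - 1)]
        for d, row in zip(d_out, self.w2):
            for j, v in enumerate(hb):
                row[j] -= lr * d * v
        for d, row in zip(d_hid, self.w1):
            for i, v in enumerate(xb):
                row[i] -= lr * d * v
        return sum((yk - tk) ** 2 for yk, tk in zip(y, target))

    def predict(self, x):
        y = self.forward(x)[2]
        return max(range(len(y)), key=lambda k: y[k])


def train(net, data, epochs=500, lr=0.5):
    """Run per-sample gradient descent and return the summed loss of each epoch."""
    losses = []
    for _ in range(epochs):
        losses.append(sum(net.train_step(x, one_hot(c), lr) for x, c in data))
    return losses


net = TinyMLP(len(FEATURES), 5, 3)
losses = train(net, TRAINING_DATA)
```

Feeding acoustic parameters instead of distinctive feature values as the input vector would correspond to the third method; only the input dimension and the training data change.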

The three methods are described concretely below, in order.

First Embodiment

This embodiment relates to the method of obtaining the constriction value from the distinctive feature values according to rules created from knowledge.

In this method, the relationships between the constriction value and the distinctive feature values described above are formulated as rules, each consisting of a condition part and an execution part, and stored in a rule file held by the constriction value generation unit 5. The distinctive feature extraction unit 4 obtains the distinctive feature values for each frame of the input speech, and the magnitudes of those values are tested against the condition parts of the rules to select a rule that is satisfied. Executing the execution part of the selected rule generates the constriction value of the current frame.

Because the distinctive feature values of a single frame are sometimes insufficient to determine the constriction value, the conditions in the condition parts of the rules are determined using knowledge that takes into account the constriction value of the immediately preceding frame, the distinctive feature values of the current frame, the distinctive feature values of several frames before and after it, the power of the input speech waveform, and the increases and decreases of these values.

The knowledge used to determine the conditions in the condition part of each rule is outlined next.

Classifying the knowledge by the combination of the constriction value C(i-1) of the preceding frame and the constriction value C(i) of the current frame gives the following nine cases.

(a) C(i-1) = 0 and C(i) = 0: The closure in the vocal tract continues. In general the distinctive features change little, and when the sound is voiced the values of "murmur/buzz" and "nasal" are large. At a change from voiced to unvoiced, the values of "murmur/buzz", "voiced", and "power" decrease. Conversely, at a change from unvoiced to voiced, the values of "murmur/buzz", "voiced", and "power" increase.

(b) C(i-1) = 0 and C(i) = 1: A slight gap opens from a closure in the vocal tract. When voiced, at least one of the values of "nasal", "strident", "diffuse", and "compact" becomes large, and the value of "murmur/buzz" becomes small. When unvoiced, at least one of the values of "diffuse", "strident", and "compact" becomes large, accompanied by an increase in power.

(c) C(i-1) = 0 and C(i) = 2: The closure in the vocal tract is abruptly released, accompanied by an increase in power. When this occurs at the beginning of a word, the sound changes from unvoiced to voiced and the value of "voiced" increases. When voiced, the value of "murmur/buzz" decreases. Near the boundary, the value of "nasal" becomes large.

(d) C(i-1) = 1 and C(i) = 0: The gap in the vocal tract is closed, accompanied by a decrease in power. At least one of the values of "strident" and "compact" decreases. The value of "diffuse" may increase near the boundary, but it decreases again immediately. When voiced, the value of "murmur/buzz" increases.

(e) C(i-1) = 1 and C(i) = 1: The gap in the vocal tract persists. At least one of the values of "strident", "diffuse", and "compact" is large.

(f) C(i-1) = 1 and C(i) = 2: The gap in the vocal tract opens up, accompanied by an increase in power. At least one of the values of "strident", "diffuse", and "compact" decreases.

(g) C(i-1) = 2 and C(i) = 0: A closure occurs somewhere in the vocal tract, accompanied by a decrease in power. When unvoiced, the values of all distinctive features decrease. When voiced, the value of "murmur/buzz" increases and the values of "strident", "diffuse", and "compact" decrease.

(h) C(i-1) = 2 and C(i) = 1: A gap is formed somewhere in the vocal tract, accompanied by a decrease in power. At least one of the values of "strident", "diffuse", and "compact" increases.

(i) C(i-1) = 2 and C(i) = 2: The open state of the vocal tract continues. The values of "strident" and "murmur/buzz" do not increase sharply.

When the rules are created from this knowledge, they are made, as shown in FIG. 4, so as to satisfy constriction values 23 determined in advance by visual inspection from the spectrum and waveform of a speech waveform 21 and from labels 22 obtained from that spectrum and waveform. The rules are then arranged in an optimal order and stored in the rule file. The speech waveforms 21 used as data for creating the rules are taken from a database consisting of a large number of words uttered by a large number of speakers.

An example of rules created from this knowledge and used to calculate the constriction value of the current frame is shown next.

Rule (1): if ((C(i-1) = 0) · (Δpower > 0) · (strident > Const1)) then (C(i) ← 1)
Rule (2): if ((C(i-1) = 1) · (Δstrident > 0)) then (C(i) ← 1)
Rule (3): if ((C(i-1) = 1) · (strident > Const2)) then (C(i) ← 1)
Rule (4): if ((C(i-1) = 1) · (Δpower > 0) · (strident < Const3)) then (C(i) ← 2)

where
Const1, Const2, Const3: constants
i: frame number of the current frame
i-1: frame number of the preceding frame
C(i): constriction value of the current frame
C(i-1): constriction value of the preceding frame

The dot "·" joins conditions that must all hold. Δ denotes the increase or decrease of a distinctive feature and is given by the following equation.

ΔF = (F(i) + F(i+1) + F(i+2)) - (F(i-1) + F(i-2) + F(i-3))

where
i: frame number of the current frame
F: distinctive feature value (for example, the value of "strident" or "power")
F(i): distinctive feature value at frame i

The rules above have the following meanings.

Rule (1): If the constriction value of the preceding frame is 0, the power is increasing, and the value of the distinctive feature "strident" is greater than a constant Const1 (for example, at the onset of a fricative), the constriction value of the current frame is set to 1. Const1 is set to about 1/4 of the maximum value of "strident".

Rule (2): If the constriction value of the preceding frame is 1 and the value of the distinctive feature "strident" is increasing (for example, a fricative is continuing), the constriction value of the current frame is set to 1.

Rule (3): If the constriction value of the preceding frame is 1 and the value of the distinctive feature "strident" is greater than a constant Const2 (for example, a fricative is continuing), the constriction value of the current frame is set to 1. Const2 is set to about 1/2 of Const1.

Rule (4): If the narrowing value of the preceding frame is 1, the power is increasing, and the value of the distinctive feature "strident" is less than a constant Const3 (for example, at the onset of a vowel), the narrowing value of the current frame is set to 2. Const3 is set to about 1/2 of Const2.
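The ΔF equation and the guidance on the three constants can be sketched in a few lines of Python. This is only an illustration: the function names `delta` and `thresholds`, the list-based representation of a feature track, and the choice to raise an error near the ends of the utterance are assumptions, not part of the patent text.

```python
def delta(values, i):
    """Delta-F at frame i: sum of F over frames i..i+2 minus sum over frames i-3..i-1."""
    # Assumption: frames too close to either end of the utterance are rejected.
    if i - 3 < 0 or i + 2 >= len(values):
        raise IndexError("delta needs 3 frames before and 2 frames after frame i")
    ahead = values[i] + values[i + 1] + values[i + 2]
    behind = values[i - 1] + values[i - 2] + values[i - 3]
    return ahead - behind


def thresholds(strident_max):
    """Const1 is about 1/4 of the 'strident' maximum; Const2 and Const3 halve it in turn."""
    const1 = strident_max / 4.0
    const2 = const1 / 2.0
    const3 = const2 / 2.0
    return const1, const2, const3


power = [0.0, 0.0, 0.0, 1.0, 2.0, 3.0, 3.0]   # hypothetical power track
print(delta(power, 3))      # positive, because the power is rising around frame 3
print(thresholds(8.0))      # constants for a hypothetical 'strident' maximum of 8
```

Because ΔF looks two frames ahead, a real-time implementation would have to delay its decision by two frames.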

FIG. 2 is a flowchart of narrowing value generation. The narrowing value generation process is described in detail below with reference to FIG. 2.

In step S1, the distinctive feature extraction unit 4 obtains the distinctive feature values for each frame from the feature quantities extracted by the feature extraction unit 3.

In step S2, one rule is taken out from the plurality of rules stored in the rule file.

In step S3, it is determined whether the condition part (if part) of the rule taken out is satisfied. The determination uses the distinctive feature values F extracted in step S1, the values Ci-1, Const1, Const2, and Const3 stored in the storage unit of the narrowing value generation unit 5, and ΔF calculated from the distinctive feature values of the frames. If the condition part is satisfied, the process proceeds to step S4; otherwise, it returns to step S2.

In step S4, the execution part (then part) of the rule whose condition part was satisfied is executed, and the narrowing value Ci of the current frame is generated.

In step S5, it is determined whether the current frame is the last frame. If it is not the last frame, the process returns to step S1 to generate the narrowing value of the next frame; otherwise, the narrowing value generation process ends.
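Steps S1 to S5 amount to a per-frame loop over a rule table. The following Python sketch shows one possible reading, with rules (1) to (4) as the table. Several points are assumptions rather than statements from the text: the function name `generate_narrowing`, the initial state C = 0, keeping the previous value when no rule fires, and receiving Δpower and Δstrident as precomputed per-frame lists.

```python
def generate_narrowing(strident, d_power, d_strident, const1, const2, const3):
    """Return one narrowing value per frame from per-frame feature sequences."""
    # Each rule is (condition on (previous C, frame index), value assigned to C).
    rules = [
        (lambda c, i: c == 0 and d_power[i] > 0 and strident[i] > const1, 1),  # rule (1)
        (lambda c, i: c == 1 and d_strident[i] > 0, 1),                        # rule (2)
        (lambda c, i: c == 1 and strident[i] > const2, 1),                     # rule (3)
        (lambda c, i: c == 1 and d_power[i] > 0 and strident[i] < const3, 2),  # rule (4)
    ]
    prev = 0                                  # assumption: start in state C = 0
    result = []
    for i in range(len(strident)):            # S1 and S5: one pass per frame
        current = prev                        # assumption: hold C if no rule fires
        for condition, value in rules:        # S2: take out the rules one at a time
            if condition(prev, i):            # S3: test the if part
                current = value               # S4: execute the then part
                break
        result.append(current)
        prev = current
    return result


# Hypothetical five-frame example: silence, fricative onset, fricative, vowel onset.
strident = [0.0, 3.0, 4.0, 2.0, 0.2]
d_power = [0.0, 1.0, 0.0, 0.0, 1.0]
d_strident = [0.0, 1.0, 1.0, -1.0, -1.0]
print(generate_narrowing(strident, d_power, d_strident, 2.0, 1.0, 0.5))
```

With only these four rules the state can never leave C = 2, so a complete rule file would presumably contain further rules that this sketch does not model.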

When the above rules are applied in a given narrowing state, the state transitions are as shown in FIG. 3. That is, when the narrowing value in some frame of the input speech is C = 0 and rule (1) is applied in generating the narrowing value of the next frame, the execution part of rule (1) is executed, the narrowing value C = 1 is generated for the next frame, and the state transitions as shown by arrow (A). Similarly, when rule (2) or rule (3) is applied in the state C = 1, the narrowing value C = 1 is generated for the next frame, and the state transitions as shown by arrow (B). When rule (4) is applied in the state C = 1, the narrowing value C = 2 is generated for the next frame, and the state transitions as shown by arrow (C).

Thus, according to this embodiment, the narrowing value is generated from distinctive feature values, which are feature quantities related to human articulation, so a narrowing value that is unaffected by speaker-dependent frequency variation can be generated.

The per-frame narrowing information of the input speech generated as described above can be grouped into runs of frames having the same narrowing value and used to restrict the matching path during matching in speech recognition. Alternatively, by matching the per-frame narrowing values of the input speech against the narrowing values of a standard pattern, the narrowing value can also be used in calculating the distance between the feature pattern of the input speech and the standard pattern.
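Grouping frames that share a narrowing value is a run-length segmentation. A minimal Python sketch follows; the function name `segments` and the (value, first frame, last frame) output format are illustrative assumptions, and the text does not say how the resulting segments constrain the matching path.

```python
from itertools import groupby


def segments(narrowing_values):
    """Return (value, first_frame, last_frame) for each run of equal narrowing values."""
    runs = []
    start = 0
    for value, group in groupby(narrowing_values):
        length = len(list(group))
        runs.append((value, start, start + length - 1))
        start += length
    return runs


print(segments([0, 0, 1, 1, 1, 2, 2]))
```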

Second Embodiment

This embodiment relates to a method in which the narrowing value generation unit 5 of FIG. 1 uses the neural network mentioned above to obtain the rules for narrowing value generation automatically.

Here, an outline of the neural network is given. As introduced in, for example, "An Introduction to Computing with Neural Nets" (R. P. Lippmann, IEEE ASSP Magazine) and Nikkei Electronics, No. 427, August 10, 1987, a neural network is a network that imitates the structure of the human brain. It is formed of a large number of units, corresponding to the neurons of the brain, connected to one another in a complex manner.

Each unit consists of a part that receives inputs from other units, a part that transforms the inputs according to a fixed rule, and a part that outputs the transformed result. As described in detail later, the units form a hierarchical network consisting of an input layer, an intermediate layer, and an output layer, and each connection to another unit is assigned a coupling coefficient representing the strength of the connection.

The coupling coefficient represents the strength of the connection between units, and changing the values of the coupling coefficients changes the structure of the network. Learning in the neural network described above therefore means the following: data belonging to one of two events having a known relationship are input one after another to the input layer of the hierarchical network, and the coupling coefficients are changed so as to reduce the difference between the output data appearing at the output layer and the data belonging to the other event that correspond to the input data (the target values). In other words, the structure of the network is changed so that, when data belonging to one of two events having a predetermined relationship are input, the corresponding data belonging to the other event are output.

The neural network used in this embodiment has the structure shown in FIG. 5. It has a three-layer structure consisting, in order from the bottom of the figure, of an input layer 31, an intermediate layer 32, and an output layer 33. The input layer 31 has 35 units 34, 34, ..., the intermediate layer 32 has 7 units 35, 35, ..., and the output layer 33 has 3 units 36, 37, 38. Unit 36 of the output layer 33 outputs the narrowing value "C = 0", unit 37 outputs the narrowing value "C = 1", and unit 38 outputs the narrowing value "C = 2". Each unit 34 of the input layer 31 is connected to all units 35 of the intermediate layer 32, and each unit 35 of the intermediate layer 32 is connected to all units 36, 37, 38 of the output layer 33. Units within the same layer are not connected to one another.

The neural network having the above structure is stored, together with its coupling coefficients, in the storage unit of the narrowing value generation unit 5.

Next, the neural network configured as above is trained with the error backpropagation algorithm, in accordance with the relationship between distinctive feature values and narrowing values described above, as follows.

First, input data are input to the units 34, 34, ... of the input layer 31. Each unit 34 transforms its input using a predetermined transformation function (generally a threshold function or a sigmoid function) and passes the result to the units 35, 35, ... of the intermediate layer 32. Each unit 35 of the intermediate layer 32 receives the sum of the output values of the units 34 of the input layer 31, each multiplied by the corresponding coupling coefficient. Likewise, each unit 35 of the intermediate layer 32 transforms the value received from the input layer 31 using the predetermined transformation function, and the result, multiplied by the coupling coefficients, is output to the units 36, 37, 38 of the output layer 33. Each of the units 36, 37, 38 of the output layer 33 in turn transforms the value received from the intermediate layer 32 using the predetermined transformation function to obtain the final output value.
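The forward computation described above can be sketched in plain Python for the 35-7-3 layout. This is an illustration under stated assumptions, not the patent's implementation: the sigmoid is chosen as the transformation function, the input units pass their values through unchanged, bias terms are omitted, and the coupling coefficients are random placeholders. The names `sigmoid`, `layer`, and `forward` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def layer(inputs, weights):
    """For each unit: sum of inputs times coupling coefficients, then the sigmoid."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]


def forward(x, w_hidden, w_output):
    """Propagate x through the intermediate layer and the output layer."""
    hidden = layer(x, w_hidden)
    return hidden, layer(hidden, w_output)


rng = random.Random(0)                      # fixed seed so the sketch is repeatable
N_IN, N_HID, N_OUT = 35, 7, 3               # unit counts of layers 31, 32, 33
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(N_IN)] for _ in range(N_HID)]
w_output = [[rng.uniform(-0.5, 0.5) for _ in range(N_HID)] for _ in range(N_OUT)]
x = [rng.random() for _ in range(N_IN)]     # placeholder input values
hidden, output = forward(x, w_hidden, w_output)
print(len(hidden), len(output))
```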

Next, this output value is compared with the desired output value (target value) for the input, and the values of the connection strengths (coupling coefficients) between units are changed so as to reduce the difference. That is, the amount of change in each coupling coefficient is calculated from the difference between the output value a unit produces for a given input from other units and the target value for that input.
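One update of the coupling coefficients by error backpropagation might look like the following Python sketch. It assumes sigmoid units, a squared-error measure, no bias terms, a learning rate of 0.5, and random placeholder data; none of these details is given in the text, and the names `train_step`, `w1`, and `w2` are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w1, w2):
    h = [sigmoid(sum(w * v for w, v in zip(row, x))) for row in w1]
    y = [sigmoid(sum(w * v for w, v in zip(row, h))) for row in w2]
    return h, y


def train_step(x, target, w1, w2, rate=0.5):
    """Update w1 and w2 in place for one example; return the squared error before the update."""
    h, y = forward(x, w1, w2)
    # Output-layer error terms: (target - output) times the sigmoid derivative.
    d_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, y)]
    # Intermediate-layer error terms: output errors propagated back through w2.
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * w2[k][j] for k in range(len(w2)))
             for j, hj in enumerate(h)]
    for k, row in enumerate(w2):
        for j in range(len(row)):
            row[j] += rate * d_out[k] * h[j]
    for j, row in enumerate(w1):
        for i in range(len(row)):
            row[i] += rate * d_hid[j] * x[i]
    return sum((t - o) ** 2 for t, o in zip(target, y))


rng = random.Random(0)
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(35)] for _ in range(7)]
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(7)] for _ in range(3)]
x = [rng.random() for _ in range(35)]       # placeholder input
errors = [train_step(x, [0.0, 1.0, 0.0], w1, w2) for _ in range(50)]
print(errors[0], errors[-1])                # the error should shrink over the steps
```

Repeating the step on a single example only shows that the error shrinks; real training would cycle through many labelled frames.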

In this embodiment, the distinctive feature values of five frames in total, namely the current frame and the two frames before and after it, are input to the units 34, 34, ... of the input layer 31. There are seven distinctive features per frame ("strident", "nasal", "murmur/buzz", "voiced", "diffuse", "compact", and "power"), so 35 distinctive feature values in all are input, each to its assigned unit 34 of the input layer 31.
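Assembling the 35 input values from a five-frame window can be sketched as follows. The per-frame dictionary representation, the function name `input_vector`, and the repetition of the edge frame at utterance boundaries are assumptions; the text does not say how the first and last two frames are handled.

```python
FEATURES = ["strident", "nasal", "murmur/buzz", "voiced", "diffuse", "compact", "power"]


def input_vector(frames, i, context=2):
    """Concatenate the feature values of frames i-context .. i+context."""
    last = len(frames) - 1
    vector = []
    for j in range(i - context, i + context + 1):
        j = min(max(j, 0), last)   # assumption: repeat the edge frame at the boundaries
        vector.extend(frames[j][name] for name in FEATURES)
    return vector


# Hypothetical utterance of 10 frames; every feature of frame t has the value t.
frames = [{name: float(t) for name in FEATURES} for t in range(10)]
print(len(input_vector(frames, 4)))   # 5 frames x 7 features
```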

During learning, the units 34, 34, ... of the input layer 31 are given a set of 35 distinctive feature values whose target narrowing value is C = 0, C = 1, or C = 2. For such a set, only the unit corresponding to the target narrowing value Ci should output "1" (for example, unit 36 if C = 0), and the other units should output "0" (for example, units 37 and 38 if C = 0). At the same time, the units 36, 37, 38 of the output layer 33 are given the target values: "1" for the unit corresponding to the target narrowing value Ci (for example, unit 36 if C = 0) and "0" for the other units (for example, units 37 and 38 if C = 0). The error backpropagation algorithm then determines the amount of change in the coupling coefficients with respect to the target values, and new coupling coefficients between the units are set.
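The target values for the three output units form what is now usually called a one-hot vector. A minimal sketch, in which the function name `target_vector` and the index convention (position 0 for unit 36, 1 for unit 37, 2 for unit 38) are assumptions:

```python
def target_vector(c, n_classes=3):
    """Targets for units 36, 37, 38: 1 for the unit matching narrowing value c, else 0."""
    if not 0 <= c < n_classes:
        raise ValueError("narrowing value out of range")
    return [1.0 if k == c else 0.0 for k in range(n_classes)]


print(target_vector(0))   # only the first position (unit 36) is 1
```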

In this case, the target values given to the units 36, 37, 38 of the output layer 33 are narrowing values determined in advance by visual inspection, as shown in FIG. 4. They are obtained by first deriving labels 22 from the speech waveform 21 and then converting the labels 22 into narrowing values using a table of correspondence between narrowing values and labels such as Table 1.

When the neural network is used to obtain the narrowing value for actual input speech, the 35 distinctive feature values obtained from the input speech by the distinctive feature extraction unit 4 are input to the units 34, 34, ... of the input layer 31. The output values of the units 36, 37, 38 of the output layer 33 are then computed in sequence according to the learned coupling coefficients between the units, and the narrowing value Ci corresponding to the unit (one of units 36, 37, 38) that outputs the largest value is taken as the narrowing value of the current frame.
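Selecting the narrowing value from the three outputs is an argmax. In the sketch below, the function name `narrowing_from_outputs` is hypothetical, and resolving ties in favour of the lowest index is an assumption that the text does not address.

```python
def narrowing_from_outputs(outputs):
    """Index of the largest output: 0 for unit 36, 1 for unit 37, 2 for unit 38."""
    # max() returns the first maximal index, so ties go to the lower narrowing value.
    return max(range(len(outputs)), key=lambda k: outputs[k])


print(narrowing_from_outputs([0.1, 0.7, 0.2]))   # unit 37 is largest, so C = 1
```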

In this embodiment, in a neural network that has been sufficiently trained as described above, the structure of the network has been changed so that, when 35 distinctive feature values are input, the correct narrowing value Ci for that set of values is output. In other words, the rules for narrowing value generation have been generated automatically. The trained neural network can therefore output the correct narrowing value Ci for all input values.

Third Embodiment

This embodiment relates to a method in which the narrowing value generation unit 5 of FIG. 1 uses the neural network mentioned above to create the rules for narrowing value generation automatically and generates the narrowing value directly from acoustic parameters.

In this embodiment, in order to generate the narrowing value directly from acoustic parameters as described above, the distinctive feature extraction unit 4 of FIG. 1 is removed, and the acoustic parameters extracted by the feature extraction unit 3 are input directly to the narrowing value generation unit 5.

The neural network used in this embodiment has the same structure as in FIG. 5. In this embodiment, however, the input layer 31 has 80 units 34, 34, ..., the intermediate layer 32 has 10 units 35, 35, ..., and the output layer 33 has 3 units 36, 37, 38. As in the second embodiment, unit 36 of the output layer 33 outputs the narrowing value "C = 0", unit 37 outputs the narrowing value "C = 1", and unit 38 outputs the narrowing value "C = 2".

The neural network having the above structure is stored, together with its coupling coefficients, in the storage unit of the narrowing value generation unit 5.

In this embodiment, the acoustic parameters input to the units 34, 34, ... of the input layer 31 are the output power values of the BPFs (16 BPF channels per frame) for five frames in total, namely the current frame and the two frames before and after it. That is, 80 BPF output values in all are input, each to its assigned unit 34 of the input layer 31.

During learning, as in the second embodiment, the units 34, 34, ... of the input layer 31 are given a set of 80 BPF output power values whose target narrowing value is C = 0, C = 1, or C = 2. For such a set, only the unit corresponding to the target narrowing value Ci should output "1" (for example, unit 36 if C = 0), and the other units should output "0" (for example, units 37 and 38 if C = 0). At the same time, the units 36, 37, 38 of the output layer 33 are given the target values: "1" for the unit corresponding to the target narrowing value Ci (for example, unit 36 if C = 0) and "0" for the other units (for example, units 37 and 38 if C = 0). The error backpropagation algorithm then determines the amount of change in the coupling coefficients with respect to the target values, and new coupling coefficients between the units are set. As in the case described above, the target values given to the units 36, 37, 38 of the output layer 33 are obtained from Table 1 and from the labels 22 determined in advance by visual inspection of the speech waveform 21, as shown in FIG. 4.

When the neural network is used to obtain the narrowing value for actual input speech, the 80 BPF output values extracted from the input speech by the feature extraction unit 3 are input to the units 34, 34, ... of the input layer 31. The output values of the units 36, 37, 38 of the output layer 33 are then computed in sequence according to the learned coupling coefficients between the units, and the narrowing value Ci corresponding to the unit that outputs the largest value is taken as the narrowing value of the current frame.

In this embodiment, in a neural network that has been sufficiently trained as described above, rules that generate the narrowing value directly from the BPF output values (acoustic parameters) have been generated automatically. The trained neural network can therefore output the correct narrowing value Ci for all input values. It is considered that, through this learning, the internal structure of the network has been changed into one that first generates, from the input acoustic parameters (BPF output values), feature quantities corresponding to distinctive feature values and then converts those feature quantities into the narrowing value.

Thus, according to this embodiment, the narrowing value is generated from the output values of the BPF group constituting the feature extraction unit 3 (that is, the acoustic parameters) by way of distinctive feature values, which are feature quantities related to human articulation. A narrowing value that is unaffected by speaker-dependent frequency variation can therefore be generated directly from the acoustic parameters.

The types of distinctive features used in the first and second embodiments are not limited to the seven types described above.

In the second and third embodiments, the numbers of units in the input layer 31 and the intermediate layer 32 of the neural network may, needless to say, be changed as appropriate according to the number of types of distinctive features and the number of frames that are input.

<Effects of the Invention> As is clear from the above, the speech feature extraction device of the invention according to claim 1 is provided with a distinctive feature extraction unit and a narrowing value generation unit. It obtains, from the frequency components of the input speech, distinctive feature values representing the degree of distinctive features related to human articulation, and generates the narrowing value from those distinctive feature values according to predetermined rules. A narrowing value in the vocal tract that does not depend on speaker-dependent variation in frequency characteristics can therefore be generated from the distinctive feature values.

Further, in the speech feature extraction device of the invention according to claim 2, the rule generates the narrowing value of the current frame from the distinctive feature values of the preceding frame and the current frame, so the narrowing value of the current frame can be generated reliably.

Further, the speech feature extraction device of the invention according to claim 3 is provided with a distinctive feature extraction unit and a narrowing value generation unit having a learning function. It obtains, from the frequency components of the input speech, distinctive feature values representing the degree of distinctive features related to human articulation, creates by itself, through learning with the error backpropagation algorithm, the rules for generating the narrowing value from the distinctive feature values, and generates the narrowing value from the distinctive feature values according to those rules. A narrowing value in the vocal tract that does not depend on speaker-dependent variation in frequency characteristics can therefore be generated from the distinctive feature values.

Further, the speech feature extraction device of the invention according to claim 4 is provided with a narrowing value generation unit having a learning function. It creates by itself, through learning with the error backpropagation algorithm, the rules for generating the narrowing value from the acoustic parameters that the feature extraction unit extracts when obtaining the frequency components of the input speech, and generates the narrowing value from the acoustic parameters according to those rules. A narrowing value in the vocal tract that does not depend on speaker-dependent variation in frequency characteristics can therefore be generated directly from the acoustic parameters.

[Brief description of the drawings]

FIG. 1 is a block diagram of a speech recognition device according to the present invention. FIG. 2 is a flowchart of narrowing value generation in the first embodiment. FIG. 3 is an explanatory diagram of the narrowing state transitions when the rules of the first embodiment are applied. FIG. 4 is an explanatory diagram of rule creation in the first embodiment and of target value creation for learning in the second and third embodiments. FIG. 5 is an explanatory diagram of the structure of the neural network in the second and third embodiments.

1: microphone; 2: amplifier; 3: feature extraction unit; 4: distinctive feature extraction unit; 5: narrowing value generation unit; 6: word recognition unit; 7: feature pattern storage unit; 8: result display unit; 21: speech waveform; 22: label; 23: narrowing value; 31: input layer; 32: intermediate layer; 33: output layer; 34: input layer unit; 35: intermediate layer unit; 36: output layer unit that outputs C = 0; 37: output layer unit that outputs C = 1; 38: output layer unit that outputs C = 2.

Continuation of front page

(56) References: JP-A-59-139100 (JP, A); JP-A-63-201698 (JP, A); JP-A-2-32400 (JP, A); JP-A-2-35500 (JP, A); JP-A-64-33598 (JP, A); IEICE Technical Report SP87-137, pp. 53-60, March 1988; Shimizu, "Speech Articulation and Perception" (August 15, 1983), Shinozaki Shorin, pp. 58-77

(58) Fields searched (Int. Cl.6, DB name): G10L 7/08; G10L 9/10 301; G10L 3/00 539; JICST scientific and technical literature file

Claims

(57) [Claims]

1. A speech feature extraction device that performs frequency analysis of input speech and extracts feature quantities of the speech from the resulting frequency components, comprising: a distinctive feature extraction unit that obtains, based on the frequency components obtained by the frequency analysis, distinctive feature values representing the degree of distinctive features related to human articulation; and a narrowing value generation unit that generates a narrowing value, which is one of the feature quantities and represents the degree of narrowing in the vocal tract, from the distinctive feature values obtained by the distinctive feature extraction unit, according to a rule for generating the narrowing value from the distinctive feature values; whereby a narrowing value in the vocal tract that does not depend on speaker-dependent variation in the frequency components is generated from the distinctive feature values.

2. The speech feature extraction device according to claim 1, wherein the rule for generating the narrowing value is a rule that generates the narrowing value of the current frame based on at least the distinctive feature values of the preceding frame and the current frame.

3. A speech feature extraction device that performs frequency analysis of input speech and extracts feature quantities of the speech from the resulting frequency components, comprising: a distinctive feature extraction unit that obtains, based on the frequency components obtained by the frequency analysis, distinctive feature values representing the degree of distinctive features related to human articulation; and a narrowing value generation unit that creates, by learning using an error backpropagation algorithm, a rule for generating a narrowing value from the distinctive feature values and, according to the created rule, generates the narrowing value from the distinctive feature values obtained by the distinctive feature extraction unit; whereby a narrowing value in the vocal tract that does not depend on speaker-dependent variation in the frequency components is generated from the distinctive feature values.

4. A speech feature extraction device in which a feature extraction unit performs frequency analysis of input speech and feature quantities of the speech are extracted from the resulting frequency components, comprising: a narrowing value generation unit that creates, by learning using an error backpropagation algorithm, a rule for generating a narrowing value from the acoustic parameters extracted by the feature extraction unit when obtaining the frequency components and, according to the created rule, generates the narrowing value from the acoustic parameters; whereby a narrowing value in the vocal tract that does not depend on speaker-dependent variation in the frequency components is generated from the acoustic parameters.