JP2980382B2

JP2980382B2 - Speaker adaptive speech recognition method and apparatus

Info

Publication number: JP2980382B2
Application number: JP2412080A
Authority: JP
Inventors: 徹真田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1990-12-19
Filing date: 1990-12-19
Publication date: 1999-11-22
Anticipated expiration: 2014-11-22
Also published as: JPH04219798A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speaker-adaptive speech recognition method and apparatus that generate and use conversion functions for converting the features of a new speaker's speech into the features of a standard speaker's speech. A speech recognition apparatus recognizes an uttered input by matching speech features registered in advance against speech features extracted from the input speech. If the registered features were created from utterances by the same person as the input speech, recognition accuracy is high. If they were created as a standard set for unspecified speakers, recognition accuracy drops, particularly for utterances with strong individual characteristics.

[0002] However, creating a dictionary of speech features for each specific (new) speaker requires a great deal of labor. A common approach is therefore to prepare a single dictionary of standard speech features in advance, generate by learning a conversion function that converts the new speaker's speech features into those of the standard speaker, use that conversion function to bring the features of the input speech close to those of the standard speaker's speech, and then perform matching. A technique is needed for generating the conversion function used here simply and quickly.

[0003]

[Prior Art] FIG. 7 is an explanatory diagram of the prior art. To generate the conversion function 60, the new speaker speech feature extraction means 1 extracts from the new speaker's speech the feature time series used for speech recognition, the feature time series of the standard speaker's speech corresponding to that speech is read from the standard speaker speech feature storage means 4, and the conversion function generation means 30 generates a single conversion function 60 that converts the feature time series of the new speaker's speech into that of the standard speaker's speech.

[0004] To recognize the new speaker's speech, the new speaker speech feature extraction means 1 extracts the feature time series used for speech recognition from the new speaker's speech, and the conversion function 60 converts it into a feature time series of the standard speaker's speech. The standard speaker speech recognition means 7 then matches the converted feature time series against the feature time series stored in the standard speaker speech feature storage means 4 to obtain the recognition result.

[0005]

[Problems to Be Solved by the Invention] In the prior art described above, the features of all of the new speaker's speech are converted into the features of the standard speaker's speech by a single conversion function. The conversion function therefore becomes extremely complicated and takes a long time to generate, which places a heavy burden on the new speaker. It is also difficult to obtain a conversion function with good conversion accuracy.

[0006] An object of the present invention is to solve the above problems by providing a means for generating accurate conversion functions in a short time, thereby reducing the burden on the new speaker.

[0007]

[Means for Solving the Problems] FIG. 1 is a block diagram showing the principle of the present invention. In the present invention, when the features of the new speaker's speech are converted into the features of the standard speaker's speech by a conversion function, the conversion function corresponding to the acoustic characteristic of the new speaker's speech at each time is selected from among a plurality of conversion functions, and the features are converted while switching the conversion function in use.

[0008] The following processing generates a conversion function group consisting of N conversion functions (1), (2), ..., (N) corresponding to N kinds of acoustic characteristics. The new speaker speech feature extraction means 1 extracts the feature time series of the new speaker's speech from the input speech. The acoustic characteristic extraction means 2 extracts from that feature time series an acoustic characteristic time series corresponding to the N conversion functions.

[0009] The conversion function generation means 3 reads the feature time series of the standard speaker's speech corresponding to the input new speaker's speech from the standard speaker speech feature storage means 4 and, according to the acoustic characteristic at each time, generates a conversion function group consisting of N conversion functions that convert the features of the new speaker's speech into the features of the standard speaker's speech.

[0010] The following processing is performed to recognize the new speaker's speech. The new speaker speech feature extraction means 1 extracts the feature time series of the new speaker's speech. The acoustic characteristic extraction means 2 extracts from that feature time series an acoustic characteristic time series corresponding to the N conversion functions.

[0011] The conversion function switching means 5 selects, from the group of N conversion functions, the conversion function corresponding to the acoustic characteristic at each time. The speech feature conversion means 6 uses the selected conversion function to convert the features of the input new speaker's speech into features of the standard speaker's speech. In this way, the conversion function group converts the feature time series of the new speaker's speech into a feature time series of the standard speaker's speech.

[0012] The standard speaker speech recognition means 7 matches the feature time series of the standard speaker's speech read from the standard speaker speech feature storage means 4 against the feature time series obtained by the speech feature conversion means 6 from the new speaker's speech, and outputs the recognition result.

[0013] Distinctive features can be used as the acoustic characteristics extracted by the acoustic characteristic extraction means 2. Distinctive features indicate acoustic properties such as vocalic, consonantal, single-formant, nasal, and energy-related properties.

[0014] More specifically, properties such as voiced sound, unvoiced sound, and silence may be chosen as the acoustic characteristics, with a conversion function prepared for each.

[0015] The conversion function generation means 3 can obtain the conversion function for each acoustic characteristic by regression analysis. Alternatively, the conversion function generation means 3 can obtain the conversion functions with neural networks, in which case the speech feature conversion means 6 converts the features of the new speaker's speech using the conversion function realized by the neural network for the corresponding acoustic characteristic.

[0016]

[Operation] In the present invention, a plurality of conversion functions corresponding to the acoustic characteristics of the input speech are generated, and during recognition they are switched according to the acoustic characteristic of the input speech at each time to convert the features of the new speaker's speech into features of the standard speaker's speech. Each conversion function is therefore simple. For example, where learning from 100 words of speech input was previously needed to generate the conversion function, accurate conversion functions can now be obtained by learning from a far smaller number of input words.

[0017]

[Embodiments] FIG. 2 shows an embodiment of the present invention. In FIG. 2, the band spectrum time series calculation unit 11 corresponds to the new speaker speech feature extraction means 1 shown in FIG. 1. The voiced/unvoiced/silence determination unit 21 corresponds to the acoustic characteristic extraction means 2. The DP matching unit 31, the band spectrum pair storage selection unit 32, the band spectrum pair storage unit group 33 for the individual acoustic characteristics, and the linear regression analysis unit 34 together correspond to the conversion function generation means 3. The standard speaker word speech band spectrum time series template storage unit 41 corresponds to the standard speaker speech feature storage means 4. The conversion function selection unit 51 corresponds to the conversion function switching means 5. The speech feature conversion unit 61 corresponds to the speech feature conversion means 6. The DP matching speech recognition unit 71 corresponds to the standard speaker speech recognition means 7.

[0018] The speech feature conversion unit 61 holds conversion functions 61a, 61b, and 61c for voiced sound, unvoiced sound, and silence, respectively. This conversion function group is generated by the following processing.

[0019] The band spectrum time series calculation unit 11 converts the new speaker's speech into a band spectrum time series. From this time series, the voiced/unvoiced/silence determination unit 21 determines whether the acoustic characteristic at each time is voiced sound, unvoiced sound, or silence. Silence is determined from the magnitude of the total power of the band spectrum. Voiced versus unvoiced is determined from the relative magnitudes of the low-band power and the high-band power of the band spectrum.
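The determination in [0019] can be illustrated with a minimal Python sketch. The specification gives no thresholds, no definition of where the low band ends, and no explicit direction for the comparison, so `silence_threshold`, `split_index`, and the rule "more low-band power means voiced" are assumptions made only for this illustration, as is the function name `classify_frame`.

```python
from typing import List, Optional


def classify_frame(band_powers: List[float],
                   silence_threshold: float = 1.0,
                   split_index: Optional[int] = None) -> str:
    """Label one band-spectrum frame as 'silent', 'voiced', or 'unvoiced'."""
    total_power = sum(band_powers)
    # Silence: the total power over all bands is small (threshold is assumed).
    if total_power < silence_threshold:
        return "silent"
    # Assumed boundary between "low" and "high" bands: the midpoint by default.
    if split_index is None:
        split_index = len(band_powers) // 2
    low_power = sum(band_powers[:split_index])
    high_power = sum(band_powers[split_index:])
    # Assumed direction: low-band power dominating indicates a voiced frame.
    return "voiced" if low_power >= high_power else "unvoiced"
```

Applying this function to every frame of a band spectrum time series would give the acoustic characteristic time series that the later steps consume.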

[0020] The DP matching unit 31 time-aligns the input band spectrum time series of the new speaker with the band spectrum time series, held in the standard speaker word speech band spectrum time series template storage unit 41, that corresponds to the input speech, and generates band spectrum pairs. A band spectrum pair consists of a band spectrum of the new speaker and the band spectrum of the standard speaker associated with it by the time alignment.
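As a rough illustration of the time alignment in [0020], the sketch below uses a basic dynamic time warping recursion with backtracking. The specification does not state the local distance, the path constraints, or any slope weights, so the squared Euclidean distance and the three-way step pattern here are assumptions, and the names `dp_align` and `make_pairs` are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Squared Euclidean distance between two band spectra (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dp_align(new_seq: List[Frame], std_seq: List[Frame]) -> List[Tuple[int, int]]:
    """Return the aligned (new_index, std_index) path found by DP matching."""
    n, m = len(new_seq), len(std_seq)
    inf = float("inf")
    # cost[i][j]: best accumulated distance aligning the first i and j frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(new_seq[i - 1], std_seq[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def make_pairs(new_seq: List[Frame], std_seq: List[Frame]) -> List[Tuple[Frame, Frame]]:
    """Build band spectrum pairs from the aligned frames."""
    return [(new_seq[i], std_seq[j]) for i, j in dp_align(new_seq, std_seq)]
```

Each returned pair holds one frame of the new speaker and the standard speaker frame it was aligned to, which is the form the later regression step needs.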

[0021] According to the determination by the voiced/unvoiced/silence determination unit 21, the band spectrum pair storage selection unit 32 stores and accumulates each band spectrum pair in one of the voiced band spectrum pair storage unit, the unvoiced band spectrum pair storage unit, or the silence band spectrum pair storage unit within the band spectrum pair storage unit group 33. For example, if the determination is voiced sound, the band spectrum pair is accumulated in the voiced band spectrum pair storage unit.

[0022] The linear regression analysis unit 34 performs linear regression analysis on the band spectrum pairs accumulated in each storage unit of the band spectrum pair storage unit group 33 to obtain each conversion function, and passes it to the speech feature conversion unit 61 for storage. For example, linear regression analysis on the pairs accumulated in the voiced band spectrum pair storage unit yields the conversion function for voiced input speech, which is stored as the voiced conversion function 61a in the conversion function group managed by the speech feature conversion unit 61.
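The per-class regression in [0022] might look like the following sketch. The specification says only that linear regression analysis is used, so the form of the regression is an assumption: for simplicity each band is fitted independently with its own slope and offset by ordinary least squares, rather than with a full matrix across bands. The names `fit_band`, `fit_conversion`, and `fit_all` are hypothetical.

```python
from typing import Dict, List, Tuple

Frame = List[float]
Pair = Tuple[Frame, Frame]              # (new speaker frame, standard speaker frame)
Conversion = List[Tuple[float, float]]  # one (slope, offset) per band


def fit_band(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = slope * x + offset for a single band."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0.0:
        # Degenerate data: fall back to a pure offset (assumed behaviour).
        return 1.0, mean_y - mean_x
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def fit_conversion(pairs: List[Pair]) -> Conversion:
    """Fit one conversion function from the pairs of a single sound class."""
    n_bands = len(pairs[0][0])
    return [fit_band([new[k] for new, _ in pairs], [std[k] for _, std in pairs])
            for k in range(n_bands)]


def fit_all(stores: Dict[str, List[Pair]]) -> Dict[str, Conversion]:
    """Fit a conversion function for every class that has accumulated pairs."""
    return {label: fit_conversion(pairs) for label, pairs in stores.items() if pairs}
```

Because each class is fitted separately, every conversion function only has to model one kind of sound, which is the simplification the specification relies on.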

[0023] The following processing is performed to recognize the new speaker's speech. The band spectrum time series calculation unit 11 converts the new speaker's speech into a band spectrum time series. From this time series, the voiced/unvoiced/silence determination unit 21 determines whether the acoustic characteristic at each time is voiced sound, unvoiced sound, or silence. According to this determination, the conversion function selection unit 51 selects one of the conversion functions in the conversion function group used by the speech feature conversion unit 61.

[0024] For example, if the voiced/unvoiced/silence determination unit 21 determines voiced sound at a certain time, the conversion function selection unit 51 selects the voiced conversion function 61a as the conversion function for converting the new speaker's band spectrum at that time into the band spectrum of the standard speaker's speech.

[0025] The speech feature conversion unit 61 converts the band spectrum time series of the new speaker's speech into a band spectrum time series of the standard speaker's speech according to the selected conversion functions. The DP matching speech recognition unit 71 matches the converted band spectrum time series against the band spectrum time series for each word in the standard speaker word speech band spectrum time series template storage unit 41 and outputs the recognition result.

[0026] FIG. 3 shows the processing flow for conversion function generation, that is, for learning, in the embodiment shown in FIG. 2. The description below follows steps (a) to (l) shown in FIG. 3.

[0027] (a) The words that the new speaker utters for learning are determined in advance. When the new speaker utters a word, the band spectrum time series calculation unit 11 converts the new speaker's speech waveform into a band spectrum time series.
(b) The DP matching unit 31 performs DP matching between the new speaker's band spectrum time series and the standard speaker's band spectrum time series for the same word, and generates band spectrum pairs.

[0028] (c) Steps (d) to (i) are repeated until the time series ends. When it ends, processing moves to step (j).
(d) The voiced/unvoiced/silence determination unit 21 determines whether the band spectrum is voiced sound, unvoiced sound, or silence.
(e) to (f) If it is voiced sound, the band spectrum pair is stored in the voiced band spectrum pair storage unit of the band spectrum pair storage unit group 33.
(g) to (h) If it is unvoiced sound, the band spectrum pair is stored in the unvoiced band spectrum pair storage unit of the band spectrum pair storage unit group 33.
(i) If it is silence, the band spectrum pair is stored in the silence band spectrum pair storage unit of the band spectrum pair storage unit group 33. Processing then returns to step (c) and repeats.
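The loop in steps (c) to (i) amounts to sorting the aligned pairs into three stores. A minimal sketch follows, with the classifier passed in as a parameter. It assumes that the determination is made on the new speaker's frame of each pair, which [0019] suggests but the flow does not spell out. The dictionary keys and the name `sort_pairs_by_class` are illustrative only.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[float]
Pair = Tuple[Frame, Frame]  # (new speaker frame, standard speaker frame)


def sort_pairs_by_class(pairs: List[Pair],
                        classify: Callable[[Frame], str]) -> Dict[str, List[Pair]]:
    """Accumulate each band spectrum pair in the store for its sound class."""
    stores: Dict[str, List[Pair]] = {"voiced": [], "unvoiced": [], "silent": []}
    for new_frame, std_frame in pairs:
        # Assumption: the class is decided from the new speaker's frame.
        label = classify(new_frame)
        stores[label].append((new_frame, std_frame))
    return stores
```

The three lists play the role of the voiced, unvoiced, and silence storage units in the band spectrum pair storage unit group 33.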

[0029] (j) When the time series has ended, the silence conversion function 61c is generated by linear regression analysis and stored. Linear regression analysis is a well-known technique, so a detailed description is omitted here.
(k) Similarly, the unvoiced conversion function 61b is generated by linear regression analysis and stored.
(l) Similarly, the voiced conversion function 61a is generated by linear regression analysis and stored. This completes the conversion function generation processing.

[0030] FIG. 4 shows the processing flow for recognition in the embodiment shown in FIG. 2. The description below follows steps (a) to (k) shown in FIG. 4.

[0031] (a) The band spectrum time series calculation unit 11 converts the speech uttered by the new speaker into a band spectrum time series.

[0032] (b) Steps (c) to (i) are repeated until the time series ends. When it ends, processing moves to step (j).
(c) The voiced/unvoiced/silence determination unit 21 determines whether the band spectrum is voiced sound, unvoiced sound, or silence.
(d) to (e) If it is voiced sound, the voiced conversion function is selected as the conversion function.
(f) to (g) If it is unvoiced sound, the unvoiced conversion function is selected as the conversion function.
(h) If it is silence, the silence conversion function is selected as the conversion function.
(i) The speech feature conversion unit 61 converts the band spectrum into that of the standard speaker using the selected conversion function. Processing then returns to step (b) and repeats.
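Steps (b) to (i) can be sketched as a per-frame loop that picks a conversion function by class and applies it. The representation of a conversion function as one (slope, offset) pair per band is an assumption carried over for illustration, not something the specification fixes, and the names `convert_frame` and `convert_sequence` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Frame = List[float]
Conversion = List[Tuple[float, float]]  # assumed form: (slope, offset) per band


def convert_frame(frame: Frame, conversion: Conversion) -> Frame:
    """Apply a per-band linear conversion to one band spectrum."""
    return [slope * x + offset for x, (slope, offset) in zip(frame, conversion)]


def convert_sequence(frames: List[Frame],
                     classify: Callable[[Frame], str],
                     conversions: Dict[str, Conversion]) -> List[Frame]:
    """Convert a time series, switching the conversion function frame by frame."""
    converted = []
    for frame in frames:
        label = classify(frame)          # step (c): voiced, unvoiced, or silent
        conversion = conversions[label]  # steps (d) to (h): select the function
        converted.append(convert_frame(frame, conversion))  # step (i)
    return converted
```

The output sequence has the same length as the input, so it can be handed directly to the DP matching step that follows.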

[0033] (j) When the time series to be recognized has ended, the DP matching speech recognition unit 71 performs DP (dynamic programming) matching between the converted band spectrum time series and the standard speaker's band spectrum time series templates.
(k) The word or phrase with the best DP matching score is taken as the recognition result, and processing ends.
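Steps (j) and (k) reduce to scoring the converted sequence against every template and keeping the best one. The sketch below assumes that a smaller accumulated squared distance is a better score and applies no length normalization, since the specification does not define the score. The template words in the usage note are placeholders, and `dtw_distance` and `recognize` are hypothetical names.

```python
from typing import Dict, List

Frame = List[float]


def dtw_distance(a: List[Frame], b: List[Frame]) -> float:
    """Accumulated DP matching distance between two band spectrum sequences."""
    inf = float("inf")
    # prev[j]: best cost aligning the frames of `a` seen so far with b[:j].
    prev = [0.0] + [inf] * len(b)
    for frame_a in a:
        curr = [inf] * (len(b) + 1)
        for j, frame_b in enumerate(b, start=1):
            d = sum((x - y) ** 2 for x, y in zip(frame_a, frame_b))
            curr[j] = d + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[-1]


def recognize(converted: List[Frame], templates: Dict[str, List[Frame]]) -> str:
    """Return the template word with the smallest DP matching distance."""
    return min(templates, key=lambda word: dtw_distance(converted, templates[word]))
```

For example, with templates for two placeholder words, `recognize(converted, {"ichi": [...], "ni": [...]})` returns whichever word's template lies closest to the converted input.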

[0034] The same function can also be realized by constructing the conversion function group from a plurality of neural networks and replacing the linear regression analysis unit 34 with a neural network learning unit that uses backpropagation. FIG. 5 shows an embodiment of the present invention that uses such neural networks.

[0035] In FIG. 5, elements with the same reference numerals as in FIG. 2 correspond to those shown in FIG. 2. Reference numeral 35 denotes a backpropagation unit for training the neural networks; 52 denotes a neural network selection unit that selects the neural network used for conversion according to the voiced/unvoiced/silence determination; 62 denotes a conversion unit that converts the band spectrum with a neural network; and 62a, 62b, and 62c denote the voiced, unvoiced, and silence neural networks, respectively.

[0036] In the embodiment shown in FIG. 5, when the conversion function group, that is, the voiced neural network 62a, the unvoiced neural network 62b, and the silence neural network 62c, is generated, the processing up to storing the band spectrum pairs separately in the voiced, unvoiced, and silence storage units of the band spectrum pair storage unit group 33 is the same as in the embodiment of FIG. 2.

[0037] In this embodiment, the conversion functions are realized by neural networks, so the backpropagation unit 35 performs the learning. The band spectrum pairs are read from the band spectrum pair storage unit group 33 for each storage type, and learning that determines the internal state of each neuron is performed using, for example, the band spectrum of the new speaker's speech as the input signal to the neural network and the band spectrum of the standard speaker's speech as the teacher signal.

[0038] A neural network is generally composed of neurons arranged in an input layer, an intermediate layer, and an output layer. In this embodiment, however, the conversion functions are simplified, so that even an essentially linear conversion maintains considerable conversion accuracy. The intermediate layer may therefore be omitted, as shown in FIG. 6, leaving a neural network consisting only of an input layer and an output layer. The silence neural network 62c may simply pass the input signal of the input layer to the output layer unchanged.
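A network with no intermediate layer, as described in [0038], can be sketched as a single linear layer trained by gradient descent on the squared error. With no hidden layer, backpropagation reduces to this simple per-sample update. The linear output units, the identity initialization, and the values of `epochs` and `learning_rate` are assumptions for illustration and do not come from the specification.

```python
from typing import List, Tuple

Frame = List[float]
Pair = Tuple[Frame, Frame]  # (input: new speaker frame, teacher: standard frame)


def forward(weights: List[List[float]], biases: List[float], x: Frame) -> Frame:
    """Output layer activations for input x (linear units assumed)."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def train_linear_net(pairs: List[Pair], epochs: int = 200,
                     learning_rate: float = 0.01) -> Tuple[List[List[float]], List[float]]:
    """Train a two-layer (input/output only) network on band spectrum pairs."""
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])
    # Start from the identity mapping, so an untrained net passes input through.
    weights = [[1.0 if i == j else 0.0 for j in range(n_in)] for i in range(n_out)]
    biases = [0.0] * n_out
    for _ in range(epochs):
        for x, target in pairs:
            output = forward(weights, biases, x)
            for i in range(n_out):
                error = output[i] - target[i]
                # Gradient of 0.5 * error**2 with respect to each parameter.
                for j in range(n_in):
                    weights[i][j] -= learning_rate * error * x[j]
                biases[i] -= learning_rate * error
    return weights, biases
```

Calling `train_linear_net(pairs, epochs=0)` leaves the identity weights in place, which matches the pass-through behaviour the text allows for the silence network 62c.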

[0039] Once the neural networks 62a, 62b, and 62c have been created by learning, speech recognition using them proceeds as follows.

[0040] In FIG. 5, the processing up to the point where the voiced/unvoiced/silence determination unit 21 determines whether the acoustic characteristic at each time of the new speaker's band spectrum time series is voiced sound, unvoiced sound, or silence is the same as in the embodiment shown in FIG. 2.

[0041] The neural network selection unit 52 selects the neural network used by the conversion unit 62 according to the determination result of the voiced/unvoiced/silence determination unit 21. That is, as shown in FIG. 6, one of the three neural networks (the voiced neural network 62a, the unvoiced neural network 62b, and the silence neural network 62c) is selected, and the new speaker's band spectrum time series is given to the selected neural network as its input signal. The signal output from the output layer in response corresponds to the band spectrum time series of the standard speaker's speech.

[0042] The band spectrum time series converted by the neural networks is passed to the DP matching speech recognition unit 71. The subsequent speech recognition processing is the same as in the embodiment shown in FIG. 2.

[0043] The above embodiments use voiced sound, unvoiced sound, and silence as the acoustic characteristics, but the present invention is not limited to these and can be implemented in the same way using various distinctive features.

[0044]

[Effects of the Invention] As described above, the present invention uses a plurality of conversion functions corresponding to acoustic characteristics, so each conversion function is simple and the conversion functions can be generated in a short time. The burden on the new speaker is therefore reduced. Conversion accuracy also improves, so good recognition results can be obtained.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the principle of the present invention.

FIG. 2 is an explanatory diagram of an embodiment of the present invention.

FIG. 3 is a diagram showing the processing flow for conversion function generation according to an embodiment of the present invention.

FIG. 4 is a diagram showing the processing flow for recognition according to an embodiment of the present invention.

FIG. 5 is an explanatory diagram of an embodiment of the present invention that uses neural networks.

FIG. 6 is a diagram showing an example of a neural network according to an embodiment of the present invention.

FIG. 7 is an explanatory diagram of the prior art.

[Explanation of symbols]

1: new speaker speech feature extraction means
2: acoustic characteristic extraction means
3: conversion function generation means
4: standard speaker speech feature storage means
5: conversion function switching means
6: speech feature conversion means
7: standard speaker speech recognition means

Continued from the front page. (58) Fields searched (Int. Cl.⁶, DB name): G10L 3/02 301, G10L 3/00 521, G10L 3/00 531, G10L 3/00 539, G10L 9/10 301

Claims

(57) [Claims]

1. A speaker-adaptive speech recognition method that generates and uses conversion functions for converting features of a new speaker's speech into features of a standard speaker's speech, the method comprising: a step of generating the conversion functions by learning; and a step of, during recognition, switching the conversion function in use among a plurality of conversion functions based on the acoustic characteristic of the input speech at each time, and converting the features of the new speaker's speech into features of the standard speaker's speech.

2. The speaker-adaptive speech recognition method according to claim 1, wherein distinctive features are used as the acoustic characteristics corresponding to the respective conversion functions.

3. The speaker-adaptive speech recognition method according to claim 1, wherein voiced sound, unvoiced sound, and silence are used as the acoustic characteristics corresponding to the respective conversion functions.

4. The speaker adaptive speech recognition method according to claim 1, wherein the conversion function is obtained by regression analysis.

5. The speaker-adaptive speech recognition method according to claim 1, wherein the conversion functions are obtained by neural networks, and the features of the input speech are converted into features of the standard speaker's speech by the plurality of conversion functions realized by the neural networks corresponding to the acoustic characteristics.

6. A speaker-adaptive speech recognition apparatus that performs speech recognition by generating and using conversion functions for converting features of a new speaker's speech into features of a standard speaker's speech, the apparatus comprising: new speaker speech feature extraction means (1) for extracting features from the new speaker's speech; acoustic characteristic extraction means (2) for extracting, from the extracted features, an acoustic characteristic time series relating to a plurality of predetermined acoustic characteristics; standard speaker speech feature storage means (4) for storing the feature time series of the standard speaker's speech referenced during recognition; conversion function generation means (3) for reading the feature time series of the standard speaker's speech corresponding to the input new speaker's speech from the standard speaker speech feature storage means (4) and generating, according to the acoustic characteristics extracted by the acoustic characteristic extraction means (2), a plurality of conversion functions that convert the features of the new speaker's speech into features of the standard speaker's speech in correspondence with the acoustic characteristic at each time; conversion function switching means (5) for switching the conversion function in use according to the acoustic characteristic at each time of the input speech extracted by the acoustic characteristic extraction means (2); speech feature conversion means (6) for converting the features of the new speaker's speech into features of the standard speaker's speech with the conversion function selected by the conversion function switching means (5); and standard speaker speech recognition means (7) for performing speech recognition by matching the converted speech features against the features of the standard speaker's speech read from the standard speaker speech feature storage means (4).