JPH06214596A

JPH06214596A - Voice recognition device and speaker adaptive method

Info

Publication number: JPH06214596A
Application number: JP5021801A
Authority: JP
Inventors: Takashi Ariyoshi; 敬有吉
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1993-01-14
Filing date: 1993-01-14
Publication date: 1994-08-05

Abstract

PURPOSE: To obtain good recognition results by properly compensating not only for individual differences in vocal tract characteristics but also for individual differences in vocal cord sound source characteristics, thereby properly adapting the utterances of an unknown speaker to those of a standard speaker. CONSTITUTION: Voice recognition is performed as follows. A frequency characteristic compensating section 10, a frequency axis transformation section 30 and a feature amount extracting section 20 process the input voice signal of an unknown speaker uttering known content, once for each of a plurality of different frequency characteristic compensation coefficients and a plurality of different frequency axis transformation coefficients, so that an input voice feature amount is obtained for each coefficient. The input voice feature amount for each coefficient is collated with the standard voice feature amount having the same content as the known utterance, the one frequency characteristic compensation coefficient and the one frequency axis transformation coefficient that give the minimum distance are selected from among the coefficients, and voice recognition is then performed with the selected coefficients.

Description

Detailed Description of the Invention

[0001]

Field of the Invention: The present invention relates to a voice recognition device and a speaker adaptation method used in fields such as the recognition of speech from unspecified speakers.

[0002]

Description of the Related Art: Speech varies between individuals owing to differences in sex, age, physique, manner of speaking, and so on, and this variation is a major factor degrading the performance of voice recognition for unspecified speakers. Two kinds of phoneme-independent individuality can be cited: variation in the tilt of the speech spectrum, which relates to the vocal cord sound source characteristic, and expansion or contraction of the speech spectrum along the frequency axis, which relates to the vocal tract characteristic (for example, vocal tract length). To normalize this individuality, speaker adaptation methods that cope with pattern variation of the voice signal due to individual differences have been proposed, for example in Miwa and Kido, "A study of speaker normalization for speech recognition" (Acoustical Society of Japan, Speech Research Group Material S79-24, July 1979) and in Nakagawa, Kamiya, and Sakai, "Recognition of word speech of unspecified speakers based on simultaneous nonlinear expansion and contraction of the time, frequency, and intensity axes of the speech spectrum" (Transactions of the Institute of Electronics and Communication Engineers, Vol. J64-D, No. 2, February 1981).

[0003]

Problems to Be Solved by the Invention: However, the conventional speaker adaptation methods described above have the problem that normalizing individual differences also normalizes some phonemic differences, so that phonological properties are lost, or the problem that the amount of computation is large.

[0004] To solve this problem, a speaker adaptation device such as that disclosed in Japanese Patent Laid-Open No. H02-259698 has been proposed. In this device, the spectrum of the input speaker's voice signal is shifted along the frequency axis to convert it into the spectrum of the standard speaker's voice signal, so that speaker adaptation is performed only with respect to shifts on the frequency axis.

[0005] Those skilled in the art, however, want individual differences to be corrected without loss of phonological properties not only in the part of the speech spectrum relating to vocal tract characteristics but also in the part relating to vocal cord sound source characteristics. The speaker adaptation device described above merely normalizes the spectrum relating to vocal tract characteristics along the frequency axis, and therefore has the drawback that it cannot perform speaker adaptation for the speech spectrum as a whole with a simple configuration and simple learning.

[0006] An object of the present invention is to provide a voice recognition device and a speaker adaptation method that, without losing phonological properties, without enlarging the device, and with simple and efficient operation, satisfactorily correct individual differences in vocal cord sound source characteristics as well as in vocal tract characteristics, adapt the utterances of an unknown speaker well to those of a standard speaker, and thereby obtain good recognition results.

[0007]

Means for Solving the Problems and Operation: To achieve the above object, the present invention is characterized by having a speaker adaptation phase and a voice recognition phase. In the speaker adaptation phase, for the input voice signal of an unknown speaker uttering known content, the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means are made to perform their processing for each of a plurality of different frequency characteristic correction coefficients and a plurality of different frequency axis conversion coefficients, so that an input voice feature amount is obtained for each coefficient. The input voice feature amount for each coefficient is collated with the standard voice feature amount having the same content as the known utterance, and the one frequency characteristic correction coefficient and the one frequency axis conversion coefficient that give the minimum distance are selected from among the coefficients. In the voice recognition phase, for the input voice signal of unknown content uttered by the speaker who provided input in the speaker adaptation phase, the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means are made to perform their processing based on the one frequency characteristic correction coefficient and the one frequency axis conversion coefficient selected in the speaker adaptation phase, so that an input voice feature amount is obtained. This input voice feature amount is collated with the standard voice feature amounts held in the standard voice storage means, and a recognition result is output. This makes it possible to correct individual differences in vocal cord sound source characteristics and vocal tract characteristics without losing phonological properties, and to adapt the utterances of an unknown speaker well to those of a standard speaker. As a result, voice recognition approaching unspecified-speaker voice recognition can easily be realized using only the standard voice of one standard speaker or of a small number of standard speakers.

[0008]

Embodiment: An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram of one embodiment of a voice recognition device according to the present invention. Referring to FIG. 1, the voice recognition device has a frequency characteristic correction unit 10 that corrects the frequency characteristic of the input voice signal; a feature amount extraction unit 20 that extracts cepstrum coefficients of the input voice signal as the input voice feature amount; a frequency axis conversion unit 30 that applies a frequency axis conversion to the input voice signal; a voice section detection unit 40 that detects the section of the input voice signal that contains speech; a standard voice storage unit 50 in which feature amounts of standard voice signals are stored in advance as standard voice feature amounts; and a matching unit 60 that collates (matches) the input voice feature amount obtained from the input voice signal by the frequency characteristic correction unit 10, the feature amount extraction unit 20, and the frequency axis conversion unit 30 with the standard voice feature amounts stored in the standard voice storage unit 50.

[0009] In this voice recognition device, so that the speech of unspecified speakers is also recognized well, speaker adaptive learning is performed before actual voice recognition processing begins. To allow a single device to perform both kinds of processing, the device of FIG. 1 is further provided with a phase selection unit 90 for switching the operation and function of the device between a speaker adaptation phase and a voice recognition phase.

[0010] In connection with this, the standard voice storage unit 50 stores standard voice feature amounts for speaker adaptation processing and standard voice feature amounts for voice recognition. In addition, a plurality of mutually different frequency characteristic correction coefficients are prepared in advance in the frequency characteristic correction unit 10 for speaker adaptive learning, and a plurality of mutually different frequency axis conversion coefficients are prepared in the frequency axis conversion unit 30 for the same purpose.

[0011] In the speaker adaptation phase, an unknown speaker is made to utter known content. The frequency characteristic correction unit 10 and the frequency axis conversion unit 30 process this voice signal while stepping through the plurality of frequency characteristic correction coefficients and the plurality of frequency axis conversion coefficients, respectively. For each case, the matching unit 60 matches the input voice feature amount obtained by the frequency characteristic correction unit 10, the feature amount extraction unit 20, and the frequency axis conversion unit 30 against the standard voice feature amount for speaker adaptation processing stored in the standard voice storage unit 50, obtains the distance between each input voice feature amount and the standard voice feature amount, and selects the frequency characteristic correction coefficient and the frequency axis conversion coefficient that give the minimum distance.

[0012] In the voice recognition phase, for a voice signal of unknown content uttered by an unknown speaker (in practice, the speaker who provided input in the speaker adaptation phase), the frequency characteristic correction unit 10 and the frequency axis conversion unit 30 perform their processing based on the frequency characteristic correction coefficient and the frequency axis conversion coefficient selected in the speaker adaptation phase. The matching unit 60 matches the input voice feature amount thus obtained by the frequency characteristic correction unit 10, the feature amount extraction unit 20, and the frequency axis conversion unit 30 against the standard voice feature amounts for voice recognition stored in the standard voice storage unit 50, and outputs, as the recognition result, the word corresponding to the standard voice feature amount that gives the minimum distance.

[0013] Next, the processing operation of the voice recognition device configured in this way is described in detail with reference to the flowcharts of FIGS. 2 and 3, which show the processing in the speaker adaptation phase and in the voice recognition phase, respectively. When the device of FIG. 1 is used as a voice recognition device for unspecified speakers, the user (an unknown speaker), before carrying out actual voice recognition, uses the phase selection unit 90 to switch the device to the speaker adaptation phase in order to perform adaptive learning that adapts his or her voice to the standard voice.

[0014] In this state, the user (unknown speaker) utters known content, for example the five words "Hachinohe", "Kesennuma", "Yukuhashi", "Sapporo", and "Kitami" in order, and the speech is input to the device through, for example, an A/D converter (not shown; sampling frequency 16 kHz, for example) (step S1). Because speaker-specific spectral tilt and frequency shift depend little on the phoneme, a word set of about 20 syllables such as this one is sufficient for adaptive learning. The device of FIG. 1 performs the following processing on these input voice signals.

[0015] First, for the voice signal x from the A/D converter, the voice section detection unit 40 detects the voice section of each word (step S2). The voice section detection unit 40 detects the section of the input voice signal using, for example, the known two-threshold method, which uses the signal power of each frame and two thresholds. The voice section information is saved for the processing described later.
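The two-threshold rule is only named in the text, so the following is a minimal sketch of one common variant, not necessarily the one used in the embodiment. It assumes that a frame whose power exceeds a high threshold is certain speech and that the section is then widened while the power stays above a low threshold. The function names, the frame layout, and the threshold values are all hypothetical.

```python
def frame_powers(samples, frame_len, hop):
    """Mean squared amplitude of each frame (hypothetical framing)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        powers.append(sum(s * s for s in frame) / frame_len)
    return powers


def detect_section(powers, low, high):
    """Return (first_frame, last_frame) of the speech section, or None.

    A frame above `high` is taken as certain speech; the section is then
    widened in both directions while the power stays above `low`.
    """
    strong = [i for i, p in enumerate(powers) if p > high]
    if not strong:
        return None
    start, end = strong[0], strong[-1]
    while start > 0 and powers[start - 1] > low:
        start -= 1
    while end < len(powers) - 1 and powers[end + 1] > low:
        end += 1
    return start, end
```

The returned frame indices play the role of the saved voice section information: later stages can restrict matching to the frames between them.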

[0016] The voice signal x from the A/D converter is also input to the frequency characteristic correction unit 10, which corrects the spectral tilt of the input voice signal by applying, for example, the filter H_β(z) given by the following equation.

[0017]

[Equation 1] $H_\beta(z) = 1 - \beta z^{-1} \quad (-1 \le \beta \le 1)$

[0018] This way of correcting spectral tilt is the same as that shown in Shikano and Sugiyama, "Normalization of spectral tilt in LPC spectral matching measures" (Proceedings of the Acoustical Society of Japan, May 1981, 2-7-15). In this embodiment, however, β is treated as the frequency characteristic correction coefficient and takes a plurality of values, for example the seven values β₁ to β₇ = -0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3. The frequency characteristic correction unit 10 therefore outputs seven output signals y₁ to y₇, corrected with the seven frequency characteristic correction coefficients β₁ to β₇, which are supplied to the feature amount extraction unit 20. The seven output signals y₁ to y₇ from the frequency characteristic correction unit 10 are also saved, for example in a buffer (not shown), for the processing described later.
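In the time domain, the filter 1 - βz^-1 is the first-order difference y[n] = x[n] - βx[n-1]. The sketch below applies it once for each of the seven β values listed in the text. The function names and the assumption that the sample before the first one is zero are illustrative and are not taken from the patent.

```python
BETAS = [-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3]  # beta_1 .. beta_7 from the text


def tilt_filter(x, beta):
    """Apply H_beta(z) = 1 - beta * z^-1, i.e. y[n] = x[n] - beta * x[n-1]."""
    y = []
    prev = 0.0  # assumption: the sample before the first one is zero
    for sample in x:
        y.append(sample - beta * prev)
        prev = sample
    return y


def corrected_signals(x):
    """Return the seven corrected signals, one per beta, in the order of BETAS."""
    return [tilt_filter(x, beta) for beta in BETAS]
```

A positive β attenuates low frequencies relative to high ones and a negative β does the opposite, while β = 0.0 leaves the signal unchanged. That is why 0.0 sits in the middle of the candidate list.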

[0019] When the output signals y₁ to y₇ from the frequency characteristic correction unit 10 are supplied, the feature amount extraction unit 20 performs LPC cepstrum analysis on each of the seven signals and outputs cepstrum coefficients at a fixed frame period. This method is already known and is described, for example, in Furui, "Digital Speech Processing" (Tokai University Press, September 25, 1985). Specifically, the feature amount extraction unit 20 performs LPC cepstrum analysis with pre-emphasis 1 - z^-1, a window length of 16 ms, a frame period of 10 ms, an LPC analysis order of 14, and a cepstrum analysis order of 14, and outputs the cepstrum coefficients.
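The last step of LPC cepstrum analysis, converting predictor coefficients into cepstrum coefficients, can be sketched with the standard recursion for an all-pole model. The sketch assumes the sign convention A(z) = 1 + a₁z^-1 + ... + a_p z^-p with model 1/A(z); other texts use the opposite sign. Estimating the LPC coefficients themselves (windowing, autocorrelation, Levinson-Durbin) is left out, and the function name is hypothetical.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1 / A(z).

    `a` holds a[1..p] of A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p
    (a[0] = 1 is implied and not stored).
    """
    p = len(a)
    coef = [1.0] + list(a)    # coef[k] == a[k], 1-based like the formula
    c = [0.0] * (n_ceps + 1)  # c[0] (the gain term) is left at zero here
    for n in range(1, n_ceps + 1):
        acc = -coef[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * coef[n - k]
        c[n] = acc
    return c[1:]
```

With the orders given in the text, this would be called once per 10 ms frame with 14 predictor coefficients and `n_ceps = 14`.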

[0020] The cepstrum coefficients obtained by cepstrum analysis of each of the output signals y₁ to y₇ from the frequency characteristic correction unit 10 are supplied in turn to the frequency axis conversion unit 30. The frequency axis conversion unit 30 converts the frequency axis of these cepstrum coefficients by applying, for example, the first-order all-pass filter H_α(z) given by the following equation, and passes the conversion result to the matching unit 60.

[0021]

[Equation 2] $H_\alpha(z) = \dfrac{z^{-1} - \alpha}{1 - \alpha z^{-1}} \quad (-1 \le \alpha \le 1)$

[0022] This conversion, which uses the mapping between H_α(z) and z, is the same as that shown in Kobayashi and Matsumoto, "A study of frequency axis normalization in LPC distance measures" (Proceedings of the Acoustical Society of Japan, October 1983, 1-1-5). In this embodiment, however, mel frequency conversion, with a mel cepstrum order of 10 for example, is performed at the same time. That is, the value of the frequency axis conversion coefficient α that best approximates the mel scale is taken to be 0.5, and with this as the reference α takes the seven values α₁ to α₇ = 0.40, 0.43, 0.47, 0.50, 0.53, 0.56, 0.59. These correspond to frequency axis expansion or contraction by factors of 0.77, 0.84, 0.92, 1.00, 1.09, 1.09, and 1.30, respectively.
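The phase response of the first-order all-pass filter defines the warped frequency axis: a frequency ω is mapped to ω + 2·atan(α·sin ω / (1 - α·cos ω)), and near ω = 0 this curve has slope (1 + α)/(1 - α). The sketch below evaluates both. Reading the listed scale factors as this low-frequency slope relative to the reference α = 0.50 is an assumption, not something the text states. Under that reading the values come out as 0.78, 0.84, 0.92, 1.00, 1.09, 1.18, and 1.29, which is close to the list in the text, except that α = 0.56 gives about 1.18 where the text gives 1.09 a second time.

```python
import math

ALPHAS = [0.40, 0.43, 0.47, 0.50, 0.53, 0.56, 0.59]  # alpha_1 .. alpha_7


def warped_frequency(omega, alpha):
    """Frequency (rad/sample) that omega maps to under H_alpha(z)."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))


def low_frequency_slope(alpha):
    """Slope of the warping curve at omega = 0."""
    return (1.0 + alpha) / (1.0 - alpha)


def relative_scale(alpha, reference=0.50):
    """Low-frequency stretch relative to the reference coefficient."""
    return low_frequency_slope(alpha) / low_frequency_slope(reference)
```

The mapping keeps 0 and π fixed and, for positive α, stretches the low-frequency region, which is what lets a single coefficient approximate a mel-like axis.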

[0023] At this point, the frequency axis conversion unit 30 and the matching unit 60 first select the one frequency characteristic correction coefficient, out of the seven coefficients β₁ to β₇, that best suits the input voice of the unknown speaker. To do so, the frequency axis conversion unit 30 initially sets the frequency axis conversion coefficient α to the reference value α₄ (= 0.50), converts the frequency axis of the seven signals from the feature amount extraction unit 20 using α₄, and passes the conversion results, namely the mel cepstrum coefficients (here c₁₄, c₂₄, …, c₇₄, where the first index identifies β and the second identifies α), to the matching unit 60.

[0024] The matching unit 60 collates (pattern-matches) the conversion results c₁₄, c₂₄, …, c₇₄ from the frequency axis conversion unit 30, for the input voice signal lying in the voice sections detected by the voice section detection unit 40, with the standard voice feature amounts of a plurality of utterances stored in advance in the standard voice storage unit 50 (specifically, time series of mel cepstrum coefficients of the standard voice signals). This pattern matching is performed, for example, by the known endpoint-fixed DP (dynamic programming) method, with the slope of the matching window constrained to be at least 0.5 and at most 2. The endpoint-fixed DP method is generally somewhat sensitive to voice section detection errors, that is, to misplaced start and end points. Here, however, patterns of input voice that share the same start and end points are compared relative to one another, so misplaced end points cause no problem.
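Endpoint-fixed DP matching with a slope limit between 0.5 and 2 can be sketched with a simple set of local steps. The version below allows the steps (1, 1), (2, 1), and (1, 2), which is one common way to enforce such a slope limit and is not necessarily the path set used in the embodiment. The Euclidean frame distance and the function names are also assumptions.

```python
import math


def frame_distance(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dp_distance(x, y):
    """Cumulative distance of the best endpoint-fixed alignment of x and y.

    Allowed steps are (1, 1), (2, 1) and (1, 2) in (x index, y index),
    so every path has a local slope between 1/2 and 2. Returns math.inf
    when no such path joins the two fixed end points.
    """
    n, m = len(x), len(y)
    inf = math.inf
    D = [[inf] * m for _ in range(n)]
    D[0][0] = frame_distance(x[0], y[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best = inf
            for di, dj in ((1, 1), (2, 1), (1, 2)):
                pi, pj = i - di, j - dj
                if pi >= 0 and pj >= 0:
                    best = min(best, D[pi][pj])
            if best < inf:
                D[i][j] = best + frame_distance(x[i], y[j])
    return D[n - 1][m - 1]
```

Because both end points are pinned, two sequences whose lengths differ by more than a factor of about two cannot be aligned at all, and the function reports an infinite distance in that case.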

[0025] In this DP matching, the matching unit 60 selects and saves the coefficient β that minimizes the cumulative distance. Here the cumulative distance is the sum of the cumulative distances of all five words uttered by the speaker. The section length of each word need not be normalized.

[0026] In this way, the seven feature amounts c₁₄ to c₇₄ of the input voice signal, processed with the seven frequency characteristic correction coefficients β₁ to β₇ and the reference frequency axis conversion coefficient α₄, are collated with the standard voice feature amounts, and the best one of the seven input voice feature amounts is selected. If, for example, c₂₄ is selected as the best feature amount, the frequency characteristic correction coefficient β₂ corresponding to c₂₄ is selected as the optimum frequency characteristic correction coefficient (step S3).

[0027] Next, once the optimum frequency characteristic correction coefficient β has been selected as described above, β₂ for example, the output signal y₂, which the frequency characteristic correction unit 10 has already produced with β₂ and saved in the buffer, is read from the buffer and supplied to the feature amount extraction unit 20. As before, the feature amount extraction unit 20 performs cepstrum analysis on the output signal y₂ and passes the cepstrum coefficients to the frequency axis conversion unit 30.

[0028] This time the frequency axis conversion unit 30 selects the one frequency axis conversion coefficient, out of the seven coefficients α₁ to α₇, that best suits the input voice of the unknown speaker. When the cepstrum coefficients for the output signal y₂ are supplied from the feature amount extraction unit 20, the frequency axis conversion unit 30 applies each of the seven frequency axis conversion coefficients α₁ to α₇ to them, performing seven frequency axis conversions, and passes the conversion results, namely the mel cepstrum coefficients (here m₂₁, m₂₂, …, m₂₇), to the matching unit 60.

[0029] As before, the matching unit 60 collates (pattern-matches) the conversion results m₂₁, m₂₂, …, m₂₇ from the frequency axis conversion unit 30 with the standard voice feature amounts of a plurality of utterances stored in advance in the standard voice storage unit 50 (specifically, time series of cepstrum coefficients of the standard voice signals). That is, for the voice section of each word, the matching unit 60 performs DP matching between each set of input voice feature amounts and the corresponding standard voice feature amount (the one with the same utterance content), and selects and saves the coefficient α that minimizes the cumulative distance. As before, the cumulative distance is the sum of the cumulative distances of all five words uttered by the speaker, and the section length of each word need not be normalized.

[0030] In this way, the seven input voice feature amounts m₂₁ to m₂₇, processed with the single frequency characteristic correction coefficient β₂ and the seven frequency axis conversion coefficients α₁ to α₇, are collated with the standard voice feature amounts, and the best one of the seven is selected. If, for example, m₂₆ is selected as the best feature amount, the frequency axis conversion coefficient α₆ corresponding to m₂₆ is selected as the optimum frequency axis conversion coefficient. Finally, β₂ and α₆ are determined as the frequency characteristic correction coefficient β and the frequency axis conversion coefficient α adapted to the input voice of the unknown speaker who has just spoken (step S4).
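Steps S3 and S4 amount to two one-dimensional searches instead of a search over every (β, α) pair: β is chosen with α fixed at the reference α₄, and α is then chosen with β fixed at the winner. The sketch below shows that control flow only. `features` and `distance` are hypothetical callables standing in for units 10, 20, 30 and for DP matching in unit 60, and the function name is an assumption.

```python
def select_coefficients(words, betas, alphas, features, distance, ref_index=3):
    """Pick (beta, alpha) by two one-dimensional searches.

    words    : list of (signal, template) pairs with known content
    features : hypothetical callable (signal, beta, alpha) -> feature sequence
    distance : hypothetical callable (feature sequence, template) -> float
    ref_index: position of the reference alpha (alpha_4 in the text)
    """
    def total(beta, alpha):
        # Sum over all adaptation words, without length normalisation.
        return sum(distance(features(sig, beta, alpha), tmpl)
                   for sig, tmpl in words)

    alpha_ref = alphas[ref_index]
    best_beta = min(betas, key=lambda b: total(b, alpha_ref))    # step S3
    best_alpha = min(alphas, key=lambda a: total(best_beta, a))  # step S4
    return best_beta, best_alpha
```

With seven candidates for each coefficient, this evaluates 7 + 7 = 14 coefficient settings over the adaptation words rather than the 49 that an exhaustive grid would need. The price is that the pair found is not guaranteed to be the joint minimum.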

[0031] When, as described above, the optimum frequency characteristic correction coefficient β and frequency axis conversion coefficient α have been selected in the speaker adaptation phase for the voice of the speaker who is about to have his or her speech recognized, and speaker adaptation is thus complete, the speaker uses the phase selection unit 90 to switch the device to the voice recognition phase so that actual voice recognition can be performed.

[0032] In the voice recognition phase, the speaker utters unknown content, and the speech is input to the device (step S11). The speaker's input voice is supplied to the voice section detection unit 40, which detects the voice section (step S12), and also to the frequency characteristic correction unit 10. The frequency characteristic correction unit 10 corrects the speaker's voice signal with the frequency characteristic correction coefficient β and passes its output signal to the feature amount extraction unit 20. The feature amount extraction unit 20 performs cepstrum analysis on the output signal from the frequency characteristic correction unit 10 to obtain cepstrum coefficients and passes them to the frequency axis conversion unit 30. The frequency axis conversion unit 30 converts the frequency axis of the signal it receives using the frequency axis conversion coefficient α and passes the result to the matching unit 60.

[0033] In this series of processing in the voice recognition phase, the frequency characteristic correction coefficient β and the frequency axis conversion coefficient α selected in the speaker adaptation phase are used; in the example above, these are β₂ and α₆. If, however, the speaker adaptation phase has never been carried out before the voice recognition phase, the standard coefficients 0.0 and 0.5 are used for β and α, respectively. In the voice recognition phase, therefore, the speaker's input voice signal is corrected in the frequency characteristic correction unit 10 with a single frequency characteristic correction coefficient β and output as a single output signal y, the frequency axis is converted in the frequency axis conversion unit 30 with a single frequency axis conversion coefficient α, and only one input voice feature amount m is sent to the matching unit 60 for this input voice signal.

[0034] The matching unit 60 collates (matches) the single conversion result m from the frequency axis conversion unit 30, for the input voice signal within the voice section detected by the voice section detection unit 40, against the standard voice feature amounts stored in advance in the standard voice storage unit 50, selects the standard voice feature amount that minimizes the cumulative distance, and outputs the corresponding word as the recognition result (step S13).
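The matching of step S13 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each feature amount is a sequence of frame vectors and uses a plain dynamic time warping recurrence with fixed endpoints as the cumulative distance, because this section does not spell out the local path constraints of the fixed-endpoint DP method. The names `dtw_distance`, `recognize`, and `templates` are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Cumulative distance between two frame sequences with fixed endpoints.

    Assumption: symmetric step pattern (insert, delete, match) and
    Euclidean frame distance; the patent's exact constraints may differ.
    """
    n, m = len(a), len(b)
    d = [[math.inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def recognize(features, templates):
    """Return (word, distance) for the template with the smallest cumulative distance.

    `templates` maps each word to its standard feature sequence.
    """
    scored = ((word, dtw_distance(features, ref)) for word, ref in templates.items())
    return min(scored, key=lambda pair: pair[1])
```

For example, `recognize(m, templates)` would be called once per utterance with the single feature sequence `m` produced using the selected α and β.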

[0035] As described above, the device of FIG. 1 keeps the amount of learning samples for speaker adaptive learning as small as possible and satisfactorily corrects individual differences in both vocal cord sound source characteristics and vocal tract characteristics without losing phonological characteristics, so that the utterance of an unknown speaker can be adapted to the utterance of a standard speaker. When applied to unspecified-speaker voice recognition, it allows the voices of unspecified speakers to be recognized well.

[0036] In the device of FIG. 1, only one matching unit 60 is provided, and it is used in common by the speaker adaptation phase and the voice recognition phase. However, as shown in FIG. 4, a dedicated matching unit can also be provided separately for each of the speaker adaptation phase and the voice recognition phase.

[0037] That is, the configuration of FIG. 4 includes a first matching unit 61 that functions only in the speaker adaptation phase and a second matching unit 62 that functions only in the voice recognition phase. A first standard voice storage unit 51 is provided for the first matching unit 61, and a second standard voice storage unit 52 is provided for the second matching unit 62.

[0038] Here, the first matching unit 61 can use, for example, the same fixed-endpoint DP method as described above. The second matching unit 62 can use, for example, the voice recognition method based on a duration-controlled state transition model described in Muroi and Yoneyama, "Word Speech Recognition Using a Duration-Controlled State Transition Model" (Transactions of the Institute of Electronics, Information and Communication Engineers, Vol. J72-D-II, p. 1769, November 1989). Compared with the fixed-endpoint DP method, this voice recognition method is more robust to voice section detection errors, has higher recognition performance, and can perform voice recognition in phoneme units.

[0039] By separately providing the first matching unit 61, which collates using a matching method suited to speaker adaptation processing, and the second matching unit 62, which collates using a matching method suited to voice recognition processing, the processing in each phase can be performed efficiently and accurately, and a highly accurate recognition result can be obtained.

[0040] The device configuration of FIG. 1 can also be modified into the configuration shown in FIG. 5. Referring to FIG. 5, unlike the device of FIG. 1, this device has no phase selection unit 90, and the matching unit 60 performs processing control so that the speaker can input voice for voice recognition without being conscious of the phase. That is, the matching unit 60 of the device of FIG. 5 collates (matches) the input voice feature amount against a plurality of standard voice feature amounts over the voice section and obtains the minimum of the cumulative distances to the standard voice feature amounts. Only when this minimum cumulative distance is less than or equal to a predetermined threshold does it regard the speaker as uttering known content (a word) for speaker adaptation processing and automatically execute speaker adaptation processing. Otherwise, it regards the speaker as having uttered unknown content for voice recognition, outputs the signal corresponding to the standard voice that gave the minimum cumulative distance as the recognition result, and ends one voice recognition process.

[0041] FIG. 6 is a flowchart for explaining the processing operation of the device of FIG. 5. Referring to FIG. 6, when an unknown speaker inputs a voice (step S21), the voice signal is applied to the voice section detection unit 40, which detects the voice section from the voice signal (step S22). The input voice signal also undergoes predetermined processing in the frequency characteristic correction unit 10, the feature amount extraction unit 20, and the frequency axis conversion unit 30. That is, the frequency characteristic correction unit 10 corrects the frequency characteristic of the input voice signal using the one frequency characteristic correction coefficient it currently holds, for example β₄ (= 0.0), and the frequency axis conversion unit 30 converts the frequency axis of the feature amount output from the feature amount extraction unit 20 using the one frequency axis conversion coefficient it currently holds, for example α₄ (= 0.5), and supplies the resulting feature amount to the matching unit 60. The matching unit 60 collates (matches) the feature amount from the frequency axis conversion unit 30, for the input voice signal within the voice section detected by the voice section detection unit 40, against the various standard voice feature amounts stored in advance in the standard voice storage unit 50, and extracts the minimum of the cumulative distances to the standard voice feature amounts (step S23).

[0042] Next, the matching unit 60 determines whether the minimum cumulative distance thus extracted is smaller than a predetermined threshold. If the minimum cumulative distance is not smaller than the predetermined threshold, it determines that the speaker has uttered unknown content for voice recognition, outputs the word corresponding to the standard voice feature amount that gave the minimum cumulative distance as the recognition result, and ends one voice recognition process (step S25). If, on the other hand, the minimum cumulative distance is smaller than the predetermined threshold, it determines that the speaker has uttered known content (a word) for speaker adaptation processing, and the matching unit 60 causes the frequency characteristic correction unit 10, the feature amount extraction unit 20, and the frequency axis conversion unit 30 to perform speaker adaptation processing similar to that of FIG. 2 (steps S26 and S27).

[0043] In this way, the device of FIG. 5 decides whether to execute speaker adaptation processing according to whether the minimum cumulative distance obtained from one voice recognition process is smaller than a predetermined threshold. This saves the operator the trouble of selecting the phase, and speaker adaptation processing can be executed automatically when it is needed.
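The threshold decision of the FIG. 5 device can be sketched as below. This is an illustrative sketch only: the function name `choose_phase`, the string labels, and the dictionary of per-word cumulative distances are assumptions made for the example, and the strict "smaller than" comparison follows the FIG. 6 description.

```python
def choose_phase(distances, threshold):
    """Decide between speaker adaptation and plain recognition.

    `distances` maps each standard word to its cumulative distance from
    the input utterance (hypothetical data layout). Returns a tuple
    (phase, best_word, best_distance).
    """
    best_word = min(distances, key=distances.get)
    best_distance = distances[best_word]
    if best_distance < threshold:
        # Close match: treat the utterance as a known adaptation word.
        return ("adapt", best_word, best_distance)
    # Otherwise output the best candidate as the recognition result.
    return ("recognize", best_word, best_distance)
```

A caller would run speaker adaptation when the first element is `"adapt"` and otherwise report `best_word` as the recognition result.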

[0044] In the device of FIG. 5, whether the minimum cumulative distance is smaller than a predetermined threshold is used as the criterion for deciding whether to execute speaker adaptation processing. Instead, for example, the ratio of the distances of the first and second recognition candidates can be used. Also, since the coefficients α and β can take continuous values, they can be smoothed as in the following equation when they are updated in the speaker adaptation phase.

[0045]

[Equation 3] $\alpha_t = a\alpha + (1-a)\alpha_{t-1}, \quad \beta_t = a\beta + (1-a)\beta_{t-1} \quad (0 \le a \le 1)$

[0046] Here, α_t and β_t are the frequency axis conversion coefficient and the frequency characteristic correction coefficient newly set after the current speaker adaptation phase is executed; α_{t-1} and β_{t-1} are the frequency axis conversion coefficient and the frequency characteristic correction coefficient that were set before the current speaker adaptation phase was executed; α and β are the frequency axis conversion coefficient and the frequency characteristic correction coefficient selected in the current speaker adaptation phase; and a is a smoothing constant or a smoothing variable that depends on the recognition distance. When a is a variable, it takes a value close to 1 when the cumulative distance is small and close to 0 when it is large, so that the coefficients α and β for speaker adaptation change more quickly the more reliable the recognition result is. For example, with the cumulative distance denoted D (> 0), a can be given by the following equation.

[0047]

[Equation 4] $a = \exp(-D)$
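The smoothing update of Equations 3 and 4 can be written out as a short sketch. The function names `smoothing_weight` and `smooth` are hypothetical, and the same update is assumed to apply independently to α and β.

```python
import math


def smoothing_weight(cumulative_distance):
    """Equation 4: a = exp(-D); near 1 for small D, near 0 for large D."""
    return math.exp(-cumulative_distance)


def smooth(previous, selected, a):
    """Equation 3: new = a * selected + (1 - a) * previous, with 0 <= a <= 1."""
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return a * selected + (1.0 - a) * previous
```

With a = 1 the newly selected coefficient replaces the old one outright, and with a = 0 the old coefficient is kept, so an unreliable match (large D) leaves α and β almost unchanged.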

[0048] In each of the embodiments described above, processing is performed in the order of the frequency characteristic correction unit 10, the feature amount extraction unit 20, and the frequency axis conversion unit 30. This order is not essential, however, and may change when methods other than those described above are used for the frequency characteristic correction processing and the frequency axis conversion processing. The correction formula and conversion formula used in the frequency characteristic correction unit 10 and the frequency axis conversion unit 30 may also differ from Equations 1 and 2, and the number and values of the coefficients are not limited to this example; any desired number and values can be used as appropriate.

[0049] As for the order of coefficient selection in the speaker adaptation phase, the coefficient α may be selected first and the coefficient β next, depending for example on the signal storage capacity, or, depending on the processing capability, all combinations of the coefficients α and β may be evaluated simultaneously. The set of words used as the utterance content for speaker adaptation may also differ from the example described above, as long as it has no extreme phonological bias. Methods other than those described above can also be used for the voice section detection in the voice section detection unit 40 and the voice recognition in the matching unit 60.
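The two selection orders mentioned above can be contrasted in a small sketch. It assumes a hypothetical callable `distance(alpha, beta)` that returns the cumulative distance between the adapted input and the standard voice for the known word; `select_sequential`, `select_joint`, and `beta_init` are illustrative names, not terms from the patent.

```python
from itertools import product


def select_sequential(alphas, betas, distance, beta_init):
    """Pick alpha first (beta held at beta_init), then pick beta for that alpha.

    Needs len(alphas) + len(betas) distance evaluations.
    """
    alpha = min(alphas, key=lambda a: distance(a, beta_init))
    beta = min(betas, key=lambda b: distance(alpha, b))
    return alpha, beta


def select_joint(alphas, betas, distance):
    """Evaluate every (alpha, beta) combination and keep the best pair.

    Needs len(alphas) * len(betas) distance evaluations.
    """
    return min(product(alphas, betas), key=lambda pair: distance(*pair))
```

The sequential form trades a possibly suboptimal pair for far fewer evaluations, which matches the text's remark that the choice depends on storage capacity and processing capability.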

[0050]

[Effects of the Invention] As described above, according to the inventions of claims 1 and 2, the device has the functions of a speaker adaptation phase and a voice recognition phase. In the speaker adaptation phase, for an input voice signal of an unknown speaker with known utterance content, the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means are made to process the signal for each of a plurality of different frequency characteristic correction coefficients and a plurality of different frequency axis conversion coefficients, so that an input voice feature amount is obtained for each coefficient; the input voice feature amount for each coefficient is collated with the standard voice feature amount having the same content as the known utterance content; and the one frequency characteristic correction coefficient and the one frequency axis conversion coefficient that give the minimum distance are selected from among the coefficients. In the voice recognition phase, for an input voice signal of unknown utterance content from the speaker who provided input in the speaker adaptation phase, the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means are made to process the signal based on the one frequency characteristic correction coefficient and the one frequency axis conversion coefficient selected in the speaker adaptation phase to obtain an input voice feature amount; this input voice feature amount is collated with the standard voice feature amounts held in the standard voice storage means; and a recognition result is output. Consequently, individual differences in vocal cord sound source characteristics and vocal tract characteristics can be corrected without losing phonological characteristics, and the utterance of an unknown speaker can be adapted well to the utterance of a standard speaker. As a result, voice recognition close to unspecified-speaker voice recognition can easily be realized using only the standard voices of one or a small number of standard speakers.

[0051] According to the invention of claim 3, the collation means is divided into first collation means, which in the speaker adaptation phase collates by a matching method suited to speaker adaptation processing using the standard voice feature amounts held in the first standard voice storage means, and second collation means, which in the voice recognition phase collates by a matching method suited to voice recognition processing using the standard voice feature amounts held in the second standard voice storage means. Speaker adaptation processing and voice recognition processing can therefore each be performed efficiently and accurately.

[0052] According to the invention of claim 4, processing selection means for switching between the speaker adaptation phase and the voice recognition phase is further provided, and the collation means switches between the two phases according to an instruction from the processing selection means. The speaker can therefore easily perform speaker adaptation processing, so-called supervised learning, at any time, such as when starting voice recognition input or when the recognition rate is poor.

[0053] According to the invention of claim 5, the collation means collates the input voice feature amount, obtained through processing by the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means using one predetermined frequency characteristic correction coefficient and one predetermined frequency axis correction coefficient, with the standard voice feature amounts held in the standard voice storage means; determines from the collation result whether the input voice signal is a voice signal of known utterance content; and selects the speaker adaptation phase when it determines that the input is a voice signal of known utterance content. Speaker adaptation processing can therefore be started automatically from voice input of unknown content made for voice recognition, the speaker is spared the trouble of selecting the speaker adaptation phase in advance, and so-called unsupervised learning can be performed.

[0054] According to the invention of claim 6, the collation means obtains the distance between the input voice feature amount and the standard voice feature amount as the collation result and selects the speaker adaptation phase only when this distance satisfies a predetermined condition. Unsupervised speaker adaptation processing can therefore be limited to cases where the recognition result is highly reliable, and inappropriate execution of speaker adaptation processing can be prevented.

[0055] According to the invention of claim 7, coefficient smoothing means is further provided for smoothing the frequency characteristic correction coefficient and/or the frequency axis conversion coefficient used in the frequency characteristic correction means and/or the frequency axis conversion means. When the speaker adaptation phase has been selected and executed, the coefficient smoothing means smooths the frequency characteristic correction coefficient and/or the frequency axis conversion coefficient using its value immediately before the speaker adaptation phase was executed and its value after the phase was executed. Speaker adaptation processing can therefore be executed stably.

[0056] According to the invention of claim 8, the coefficient smoothing means smooths the frequency characteristic correction coefficient and/or the frequency axis conversion coefficient using a smoothing coefficient determined according to the distance between the input voice feature amount and the standard voice feature amount corresponding to the collation result of the voice recognition processing. When the reliability of the recognition result is low, the coefficients being smoothed hardly change, so no separate processing is needed to decide whether to perform speaker adaptation processing.

[0057] According to the invention of claim 9, the correction of individual differences in vocal cord sound source characteristics and the correction of individual differences in vocal tract characteristics are performed serially, so speaker adaptation processing can be performed efficiently.

[0058] According to the invention of claim 10, a predetermined number of learning samples (for example, about 20 syllables) suffices for speaker adaptive learning, and a plain, easy-to-utter word set can easily be selected to obtain utterances of about 20 syllables, which is sufficient for adaptive learning.

[Brief description of drawings]

FIG. 1 is a configuration diagram of an embodiment of a voice recognition device according to the present invention.

FIG. 2 is a flowchart for explaining the processing operation in the speaker adaptation phase of the voice recognition device of FIG. 1.

FIG. 3 is a flowchart for explaining the processing operation in the voice recognition phase of the voice recognition device of FIG. 1.

FIG. 4 is a diagram showing a modification of the voice recognition device shown in FIG. 1.

FIG. 5 is a diagram showing a modification of the voice recognition device shown in FIG. 1.

FIG. 6 is a flowchart for explaining the processing operation of the voice recognition device of FIG. 5.

[Description of Reference Signs]
10 frequency characteristic correction unit
20 feature amount extraction unit
30 frequency axis conversion unit
40 voice section detection unit
50 standard voice storage unit
51 first standard voice storage unit
52 second standard voice storage unit
60 matching unit
61 first matching unit
62 second matching unit
90 phase selection unit

Claims

[Claims]

1. A voice recognition device comprising: frequency characteristic correction means for correcting the frequency characteristic of an input voice signal based on a plurality of different predetermined frequency characteristic correction coefficients; frequency axis conversion means for converting the frequency axis of the input voice signal based on a plurality of different predetermined frequency axis conversion coefficients; feature amount extraction means for extracting a feature amount of the input voice signal as an input voice feature amount; standard voice storage means holding standard voice feature amounts; and collation means for collating the input voice feature amount obtained through processing by the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means with the standard voice feature amounts held in the standard voice storage means, the device having functions of a speaker adaptation phase and a voice recognition phase, wherein, in the speaker adaptation phase, the collation means causes the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means to process an input voice signal of an unknown speaker with known utterance content for each of the plurality of different frequency characteristic correction coefficients and the plurality of different frequency axis conversion coefficients so as to obtain an input voice feature amount for each coefficient, collates the input voice feature amount for each coefficient with the standard voice feature amount having the same content as the known utterance content, and selects from among the coefficients one frequency characteristic correction coefficient and one frequency axis conversion coefficient that give the minimum distance, and wherein, in the voice recognition phase, the collation means causes the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means to process an input voice signal of unknown utterance content from the speaker who provided input in the speaker adaptation phase, based on the one frequency characteristic correction coefficient and the one frequency axis conversion coefficient selected in the speaker adaptation phase, so as to obtain an input voice feature amount, collates the input voice feature amount with the standard voice feature amounts held in the standard voice storage means, and outputs a recognition result.

2. The voice recognition device according to claim 1, wherein the standard voice storage means holds a standard voice feature amount for the speaker adaptation phase and a standard voice feature amount for the voice recognition phase, and a standard voice feature amount having the same content as the known content uttered by the speaker in the speaker adaptation phase is set.

3. The voice recognition device according to claim 1, wherein the standard voice storage means comprises first standard voice storage means holding standard voice feature amounts for the speaker adaptation phase and second standard voice storage means holding standard voice feature amounts for the voice recognition phase, and the collation means is divided into first collation means that, in the speaker adaptation phase, collates by a matching method suited to speaker adaptation processing using the standard voice feature amounts held in the first standard voice storage means, and second collation means that, in the voice recognition phase, collates by a matching method suited to voice recognition processing using the standard voice feature amounts held in the second standard voice storage means.

4. The voice recognition device according to claim 1, further comprising processing selection means for switching between the speaker adaptation phase and the voice recognition phase, wherein the collation means switches between the speaker adaptation phase and the voice recognition phase according to an instruction from the processing selection means.

5. The voice recognition device according to claim 1, wherein the collation means collates the input voice feature amount, obtained through processing by the frequency characteristic correction means, the frequency axis conversion means, and the feature amount extraction means using one predetermined frequency characteristic correction coefficient and one predetermined frequency axis correction coefficient, with the standard voice feature amounts held in the standard voice storage means, determines based on the collation result whether the input voice signal is a voice signal of known utterance content, and selects the speaker adaptation phase when it determines that the input voice signal is one of known utterance content.

6. The voice recognition device according to claim 5, wherein the collation means obtains, as the collation result, the distance between the input voice feature amount and the standard voice feature amount, and selects the speaker adaptation phase only when the distance satisfies a predetermined condition.

7. The voice recognition device according to claim 5, further comprising coefficient smoothing means for smoothing the frequency characteristic correction coefficient and/or the frequency axis conversion coefficient used in the frequency characteristic correction means and/or the frequency axis conversion means, wherein, when the speaker adaptation phase has been selected and executed, the coefficient smoothing means smooths the frequency characteristic correction coefficient and/or the frequency axis conversion coefficient using the value of that coefficient immediately before execution of the speaker adaptation phase and its value after execution of the speaker adaptation phase.

8. The voice recognition device according to claim 7, wherein the coefficient smoothing means smooths the frequency characteristic correction coefficient and/or the frequency axis conversion coefficient using a smoothing coefficient determined according to the distance between the input voice feature amount and the standard voice feature amount corresponding to the collation result of the voice recognition processing.

9. A speaker adaptation method wherein, when selecting one frequency characteristic correction coefficient and one frequency axis conversion coefficient suited to the voice of an unknown speaker from among a plurality of different frequency characteristic correction coefficients and a plurality of different frequency axis conversion coefficients, for an input voice signal of an unknown speaker with known content, the distance between the input voice feature amount for each coefficient and the standard voice feature amount having the same content as the known utterance content is first obtained for each of the plurality of coefficients of one type, either the frequency characteristic correction coefficients or the frequency axis conversion coefficients, and the coefficient giving the minimum distance is selected; then, for each of the plurality of coefficients of the other type, the distance between the input voice feature amount for each coefficient and the standard voice feature amount having the same content as the known utterance content is obtained, and the coefficient giving the minimum distance is selected.

10. The speaker adaptation method according to claim 9, wherein a predetermined number of words are used as the known utterance content uttered by the unknown speaker.