JP3231365B2 - Voice recognition device

Info

Publication number: JP3231365B2
Application number: JP25255591A
Authority: JP
Inventors: 宏之坪井
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1991-09-30
Filing date: 1991-09-30
Publication date: 2001-11-19
Anticipated expiration: 2016-11-19
Also published as: JPH0588693A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition device that can effectively improve recognition performance for spoken words, spoken sentences, and the like.

[0002]

[Prior Art] In speech recognition devices for words, sentences, and the like, recognition methods that use phonemes, syllables, or vowel-consonant-vowel sequences as the recognition unit have long been studied, because the recognition vocabulary is easy to change and syntactic and semantic information is easy to exploit.

[0003] A phoneme (on'in, also called onso in Japanese) is the smallest linguistic unit of a spoken utterance. In Japanese, phonemes include the vowels /I/, /E/, /A/, /O/, /U/; the moraic nasal /N/; the nasal consonants /m/, /n/; the plosive consonants /P/, /T/, /K/, /B/, /D/, /G/; and the semivowels /Y/, /W/.

[0004] A syllable is a consonant-vowel sequence, a consonant-semivowel-vowel sequence, or the like. The description below uses phonemes as the example, but the same applies when units such as syllables or vowel-consonant-vowel sequences are used.

[0005] Input speech is recognized first at the phoneme level and then at the word or sentence level. Phoneme recognition can be bottom-up or top-down. Bottom-up processing segments the input in time, phoneme by phoneme, using various features. Top-down processing performs no prior segmentation: word or sentence hypotheses are generated from higher-level knowledge such as the recognition vocabulary, syntax, and semantics, and segmentation and recognition are carried out simultaneously.

[0006] However, it is difficult to segment phonemes in advance with high accuracy, so top-down processing is often used. Achieving a high recognition rate with top-down processing requires a better phoneme recognition rate, and many methods have been used to that end.

[0007] For example, in the TDNN (time-delay neural network) method, phoneme data that has been segmented in advance (hereinafter, labeled phoneme data) is recognized by a neural network, and training that changes the network coefficients so that misrecognized data is recognized correctly is repeated to improve the phoneme recognition rate. However, because this method uses only pre-segmented phoneme data, top-down phoneme recognition is not necessarily performed with high accuracy. The reason is that, in phoneme error-correction training, the segmentation of the labeled phoneme data differs from the segmentation produced by top-down word or sentence recognition. The same mismatch arises when a phoneme recognition dictionary is built from labeled phoneme data that an expert has segmented in advance by visual inspection.

[0008] Attempts have therefore been made to improve the phoneme recognition rate by re-segmenting the phonemes based on the results of top-down recognition and then creating the phoneme dictionary. However, because the phoneme dictionary cannot be trained based on phoneme recognition results, neither a high phoneme recognition rate nor a high recognition rate for the input speech has been achieved.

[0009] Conventional devices have thus been developed without sufficient consideration of both phoneme dictionary learning for improved phoneme recognition and temporal segmentation in top-down recognition, and as a result they have had difficulty delivering the expected performance. In a speaker-adaptive device, which adapts to a speaker using speech registered by the user, adaptive learning can be performed through phoneme dictionary learning. However, because such devices are also developed without sufficient consideration of the temporal segmentation described above, the effect of speaker adaptation is insufficient, which in turn leads to inadequate recognition performance.

[0010] Furthermore, in devices that recognize speech in noise, the noise changes the phoneme feature vectors and, at the same time, the temporal boundaries of the phoneme features, yet no recognition device that takes the temporal segmentation described above into account has been developed. For this reason, phoneme-based large-vocabulary recognition devices, continuous speech recognition devices, and speech word processors could not be put into practical use or extended to wider applications.

[0011]

[Problems to Be Solved by the Invention] Speech is recognized first at the phoneme level and then at the word or sentence level, but improving the accuracy of phoneme recognition is difficult. This is because conventional phoneme-based speech recognition devices have not sufficiently considered both phoneme dictionary learning for improving the phoneme recognition rate and the temporal segmentation of input speech into phonemes in top-down recognition using higher-level knowledge. As a result, the phoneme recognition dictionary could not be created with high accuracy, and phoneme-based speech recognition devices could not achieve a high recognition rate.

[0012] That is, the conventional approach performs top-down phoneme recognition using a phoneme dictionary, re-segments the phonemes based on the result, creates a phoneme dictionary, and then performs phoneme recognition again with that dictionary. Because this approach cannot train the phoneme dictionary based on phoneme recognition results, neither a high phoneme recognition rate nor a high recognition rate for the input speech has been achieved.

[0013] The present invention has been made in view of these circumstances. Its object is to provide a speech recognition device that can create a high-performance phoneme dictionary by taking into account both phoneme dictionary learning and the temporal segmentation of input speech into phonemes in top-down recognition, and that can thereby secure a high recognition rate for input speech.

[0014]

[Means for Solving the Problems] To achieve the above object, the present invention is configured as follows. The invention is a speech recognition device in which phoneme feature vectors are continuously extracted by phoneme feature vector acquisition means from a time series of speech feature parameters obtained by analyzing input speech data; the resulting phoneme feature vectors are matched by phoneme matching means against a phoneme dictionary, held in phoneme dictionary storage means and composed of phoneme feature vectors for each phoneme, to obtain a likelihood time series of matched phonemes; this likelihood time series is matched by word matching means against a word dictionary held in word dictionary storage means to obtain word likelihoods; and the word with the best likelihood is obtained as the recognized word. In this device, the phoneme dictionary storage means allows the phoneme dictionary to be updated, and the device comprises: learning speech data holding means that holds learning speech data used in a word learning mode; a phoneme feature vector selection unit that, in the word learning mode, uses the learning speech data as the input speech data and, based on the phoneme segmentation information of the input obtained from the word matching means, selects the phoneme feature vectors of the learning speech data for the corresponding segments to obtain learning phoneme feature vectors; a learning phoneme feature vector storage unit that stores the selected learning phoneme feature vectors; a phoneme matching determination unit that, in a phoneme learning mode, compares the learning phoneme feature vectors in the learning phoneme feature vector storage unit with the results of matching those vectors against the phoneme dictionary in the phoneme matching means; a phoneme dictionary updating unit that updates the phoneme dictionary in the phoneme dictionary storage means according to the result from the phoneme matching determination unit; and a learning control unit that controls switching between the word learning mode and the phoneme learning mode.

[0015] Alternatively, the present invention is characterized in that the learning control unit has a control function of repeatedly executing the word learning mode and the phoneme learning mode using the updated phoneme dictionary, and of updating the phoneme dictionary. Alternatively, the present invention is characterized in that the learning control unit has a control function of controlling the number of times the word learning mode and the phoneme learning mode are repeated.

[0016]

[Operation] With this configuration, the present invention continuously extracts phoneme feature vectors from the time series of speech feature parameters obtained by analyzing the input speech data, matches these vectors against the phoneme dictionary to obtain a phoneme likelihood time series, and matches that time series against the word dictionary to output a recognition result. In addition, it obtains phoneme segmentation information for the input from the result of matching the phoneme likelihood time series against the word dictionary, selects learning phoneme feature vectors from the phoneme segmentation information and the time series of input feature parameters, matches the learning phoneme feature vectors against the phoneme dictionary, and trains the phoneme dictionary from the matching determination results and the learning phoneme feature vectors to obtain a new phoneme dictionary.

[0017]

[0018]

[0019] The learning mode consists of a "phoneme learning mode" and a "word learning mode". The word learning mode selects learning phoneme feature vectors from the result of word matching, and the phoneme learning mode trains the phoneme dictionary using the stored learning phoneme feature vectors. In the word learning mode, the temporal segmentation of the time series of input feature parameters is obtained from the result of matching against the word dictionary that represents the content of the input speech, and the feature vector corresponding to each phoneme in the input word dictionary is selected and stored. In the phoneme learning mode, each learning phoneme feature vector is matched against the phoneme dictionary to determine whether it is data needed for learning; if it is, it is used to train the phoneme dictionary, yielding a new phoneme dictionary. Learning control repeats the phoneme learning mode; then, if the learning results call for it, learning phoneme feature vectors are selected again in the word learning mode and learning continues in the phoneme learning mode. This cycle is repeated, and learning is stopped once it has been carried out sufficiently.
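The alternation between the two learning modes can be sketched as a small control loop. This is only an illustration under stated assumptions: the function names `word_learning` and `phoneme_learning` are hypothetical, and the stopping rule (stop when a phoneme learning pass makes no dictionary update) is an assumption, since the text only says that learning stops once it has been carried out sufficiently.

```python
from typing import Callable, List


def run_learning(
    word_learning: Callable[[], List],        # hypothetical: returns labeled learning vectors
    phoneme_learning: Callable[[List], int],  # hypothetical: one pass, returns number of updates
    max_outer: int = 5,                       # assumed cap on word/phoneme cycles
    max_inner: int = 10,                      # assumed cap on phoneme learning passes
) -> int:
    """Alternate the two modes and return the number of outer cycles executed."""
    rounds = 0
    for _ in range(max_outer):
        rounds += 1
        vectors = word_learning()             # word learning mode: select vectors
        total_updates = 0
        for _ in range(max_inner):            # phoneme learning mode, repeated
            updates = phoneme_learning(vectors)
            total_updates += updates
            if updates == 0:                  # assumed criterion: nothing changed
                break
        if total_updates == 0:                # dictionary is stable: stop learning
            break
    return rounds
```

The two caps correspond loosely to the control of the number of repetitions mentioned in paragraph [0015]; their values here are arbitrary.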

[0020] The word dictionary referred to here is the dictionary used for word recognition; when sentence recognition is performed, for example, a sentence recognition dictionary generated from vocabulary, syntactic, and semantic information is used instead. Likewise, "word learning mode" denotes the mode used when word recognition is performed; when sentence recognition is performed, it corresponds to a "sentence learning mode".

[0021] As described above, the present invention obtains the temporal segmentation of the time series of input feature parameters from the result of matching against the word dictionary that represents the content of the input speech, selects and stores the feature vector corresponding to each phoneme in the input word dictionary, matches these learning phoneme feature vectors against the phoneme dictionary, determines whether each is data needed for learning, and, if so, uses it to train the phoneme dictionary and obtain a new phoneme dictionary. High-performance phoneme matching therefore becomes possible in phoneme-based speech recognition.

[0022] In particular, the word learning mode obtains, from the result of matching against the word dictionary, a temporal segmentation of the time series of input feature parameters that suits the phoneme dictionary, so the feature vector corresponding to each phoneme can be extracted with high accuracy. The phoneme learning mode trains the phoneme dictionary using the learning phoneme feature vectors and the phoneme dictionary, so a high-performance phoneme dictionary can be created. Because learning is realized by successively repeating the combination of repeated phoneme learning and word learning, phoneme feature vector extraction and phoneme dictionary creation can both be performed with high accuracy at the same time.

[0023] In a speaker-adaptive recognition device, repeating phoneme learning and word learning makes it possible to create a phoneme dictionary adapted to the speaker while taking the characteristics of the speaker's pronunciation into account, so high recognition performance can be obtained.

[0024] Furthermore, in a device that recognizes speech in noise, a phoneme dictionary can be created that takes into account the noise-induced changes in both the phoneme feature vectors and the temporal boundaries of the phoneme features, so high recognition performance can be obtained.

[0025] Therefore, the present invention can provide a speech recognition device that can create a high-performance phoneme dictionary by taking into account both phoneme dictionary learning and the temporal segmentation of input speech into phonemes in top-down recognition, and that can secure a high recognition rate for input speech.

[0026]

[Embodiment] A speech recognition device according to one embodiment of the present invention is described below with reference to the drawings.

[0027] FIG. 1 is a basic schematic block diagram of an embodiment that performs word recognition. In the figure, 1 is a speech input/analysis unit, 2 is a phoneme feature vector extraction unit, 3 is a phoneme matching unit, 4 is a word matching unit, 5 is a phoneme dictionary unit, 6 is a word dictionary unit, 7 is a phoneme dictionary updating unit, 8 is a phoneme matching determination unit, 9 is a learning phoneme feature vector storage unit, 10 is a learning speech data storage unit, 11 is a learning control unit, 12 is a matching result determination output unit, and 13 is a phoneme feature vector selection unit.

[0028] SW1 to SW5 are signal path switching units. Among the paths they select, those marked A are used in recognition mode, those marked B are used in phoneme learning mode, and those marked C are used in word learning mode. Path switching is controlled by learning control unit 11.

[0029] Speech input/analysis unit 1 consists of a low-pass filter (LPF) that removes components of the speech data at 5.4 kHz and above; an A/D converter that converts the speech signal passed through the LPF into a digital signal at a sampling frequency of 12 kHz with 16-bit quantization; and an analysis unit. The analysis unit applies a 256-point DFT (discrete Fourier transform) to the digital signal from the A/D converter and, every 8 ms, derives a 16-channel filter bank output by dividing the resulting 128-point frequency spectrum into 16 frequency bands, then takes the logarithm to produce time-series data of 16-dimensional speech feature parameters. Unit 1 accepts as speech input either a speech signal input directly from a microphone or the like, or the learning speech data stored in learning speech data storage unit 10, and converts it into the time-series data of speech feature parameters for output.
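The analysis stage described above can be sketched in plain Python. The sketch assumes the signal is already low-pass filtered and sampled at 12 kHz, and it fills in details the text does not specify: no analysis window is applied, the 16 bands are equal-width groups of 8 DFT bins, band energy is the summed power, and a small constant guards the logarithm. All names are illustrative.

```python
import cmath
import math

SAMPLE_RATE = 12_000                  # Hz, from the text
FRAME_LEN = 256                       # DFT length, from the text
HOP = SAMPLE_RATE * 8 // 1000         # 8 ms frame shift = 96 samples
N_BINS = 128                          # spectrum points used, from the text
N_CHANNELS = 16                       # filter bank channels, from the text


def dft_power(frame):
    """Power of the first N_BINS DFT bins of one frame (direct O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(N_BINS)
    ]


def frame_features(frame):
    """16 log band energies; equal-width bands are an assumption."""
    power = dft_power(frame)
    width = N_BINS // N_CHANNELS      # 8 bins per channel
    return [
        math.log(sum(power[c * width:(c + 1) * width]) + 1e-10)
        for c in range(N_CHANNELS)
    ]


def analyze(samples):
    """Time series of 16-dimensional feature parameters, one vector per 8 ms."""
    return [
        frame_features(samples[s:s + FRAME_LEN])
        for s in range(0, len(samples) - FRAME_LEN + 1, HOP)
    ]
```

For a 1 kHz sine sampled at 12 kHz, the energy falls near DFT bin 21, so the third channel (index 2) carries the largest value in every frame.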

[0030] Phoneme feature vector extraction unit 2 receives the time-series data of feature parameters output by speech input/analysis unit 1, sequentially extracts from it the phoneme feature vectors to be matched against the phoneme dictionary stored in phoneme dictionary unit 5, and supplies them to phoneme matching unit 3 or phoneme feature vector selection unit 13. Phoneme matching unit 3 matches these phoneme feature vectors against the phoneme dictionary of phoneme dictionary unit 5 using a well-known matching method and outputs a likelihood time series. This output is supplied to phoneme matching determination unit 8 in phoneme learning mode (B) and to word matching unit 4 in recognition mode (A).

[0031] Phoneme dictionary unit 5 stores a phoneme dictionary holding information on the various phonemes, and the dictionary can be updated by phoneme dictionary updating unit 7.

[0032] Phoneme matching determination unit 8 compares the matching result from phoneme matching unit 3 with the phoneme name attached to the learning phoneme feature vector in learning phoneme feature vector storage unit 9. It outputs to learning control unit 11 the result of determining whether a predetermined matching condition is satisfied, for example whether the two agree, and outputs to phoneme dictionary updating unit 7 the resulting determination of whether the phoneme dictionary is to be updated.
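A minimal sketch of this determination follows. The text leaves the "predetermined matching condition" open, so the rule used here is an assumption: the condition is met when the best-scoring phoneme equals the attached label, and a vector is treated as needed for learning when the condition is not met. The function names are hypothetical.

```python
from typing import Dict


def matches_label(likelihoods: Dict[str, float], label: str) -> bool:
    """True when the best-scoring phoneme agrees with the attached phoneme name."""
    best = max(likelihoods, key=likelihoods.get)
    return best == label


def needs_update(likelihoods: Dict[str, float], label: str) -> bool:
    """Assumed rule: a mislabeled match means the dictionary should be updated."""
    return not matches_label(likelihoods, label)
```

Other conditions, such as requiring a minimum margin between the best and second-best likelihood, would fit the same interface.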

[0033] Phoneme dictionary updating unit 7 selects, according to the determination result, the feature vectors to be used for phoneme dictionary learning from learning phoneme feature vector storage unit 9, supplies them to phoneme dictionary unit 5, and updates the stored dictionary.

[0034] Word matching unit 4 receives the likelihood time series, including phoneme names, from phoneme matching unit 3, matches it against the word dictionary of word dictionary unit 6, and outputs information such as phoneme segmentation information (phoneme boundary positions), phoneme names, phoneme likelihoods, word likelihoods, and overall scores. Matching result determination output unit 12 receives these matching results from word matching unit 4 and outputs the word with the best score as the recognition result.

[0035] On receiving an instruction to start learning, learning control unit 11 controls the data flow and the processing for the word learning mode and the phoneme learning mode, determines when phoneme learning and the overall learning are complete, and thereby controls learning.

[0036] In word learning mode, phoneme feature vector selection unit 13 uses the phoneme segmentation information and phoneme names supplied by word matching unit 4 to partition the phoneme feature vectors that phoneme feature vector extraction unit 2 extracted from the learning speech data, attaches the corresponding phoneme name to each partitioned phoneme feature vector, and stores it in learning phoneme feature vector storage unit 9.
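The selection step can be illustrated with a short sketch. It assumes the segmentation information arrives as (phoneme name, start frame, end frame) triples with an exclusive end, and that every vector inside a segment is kept; the text does not say whether all vectors or only a representative one per segment are stored. The names are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[str, int, int]  # (phoneme name, start frame, end frame exclusive)


def select_vectors(
    vectors: Sequence[Sequence[float]],
    segments: Sequence[Segment],
) -> List[Tuple[str, Sequence[float]]]:
    """Label each feature vector with the phoneme whose segment contains it."""
    labeled = []
    for name, start, end in segments:
        for vector in vectors[start:end]:
            labeled.append((name, vector))
    return labeled
```

The returned list plays the role of the contents of learning phoneme feature vector storage unit 9.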

[0037] Learning speech data storage unit 10 stores the learning speech data: speech for various reference words, together with supplementary information such as the reading of each word (time-series information of phoneme names) and phoneme segmentation information.

[0038] Next, the operation of the device configured as above is described with reference to the flowchart of FIG. 2. The system has two modes, "recognition mode" and "learning", and "learning" consists of two modes, "word learning mode" and "phoneme learning mode". Recognition mode performs recognition of the input speech; it consists of steps S22 to S24 and is the normal operating mode. "Learning" is the mode for improving the accuracy of the phoneme dictionary and is realized by the pair of word learning mode and phoneme learning mode.

[0039] Word learning mode is a mode for finding out how well words can be recognized with the current phoneme dictionary, and consists of steps S31 to S33. Phoneme learning mode is a mode for relearning the phonemes based on the word recognition results from word learning mode, and consists of steps S34 to S38.

[0040] Speech is first input as a speech signal to speech input/analysis unit 1 from a speech input means such as a microphone. Unit 1 removes the high-frequency components from this input signal, digitizes it, and uses the digital data to obtain a time series of 16-dimensional speech feature parameters.

[0041] That is, speech input/analysis unit 1 consists of a low-pass filter (LPF) that removes components of the speech data at 5.4 kHz and above; an A/D converter that converts the speech signal passed through the LPF into a digital signal at a sampling frequency of 12 kHz with 16-bit quantization; and an analysis unit that applies a 256-point DFT (discrete Fourier transform), derives every 8 ms a 16-channel filter bank output by dividing the 128-point frequency spectrum into 16 frequency bands, and takes the logarithm to obtain a time series of 16-dimensional speech feature parameters. After this processing, unit 1 outputs the result as a time series of speech feature parameters.

[0042] The time series of speech feature parameters output from speech input/analysis unit 1 in this way is input to phoneme feature vector extraction unit 2 (S21).

[0043] Meanwhile, learning control unit 11 first selects one of the three modes according to an instruction and sets the operating mode of the system. The instruction can be given, for example, by an operator or system user operating a setting means (not shown). Alternatively, the system may enter the learning mode automatically when the error rate of the recognition results reaches a predetermined value, or may run recognition mode while learning is in progress. These options can be chosen as appropriate.

[0044] When recognition mode is specified, learning control unit 11 sets SW1, SW3, SW4, and SW5 to the paths marked "A". This makes the path through speech input/analysis unit 1, phoneme feature vector extraction unit 2, phoneme matching unit 3, word matching unit 4, and matching result determination output unit 12 operable.

[0045] Accordingly, in recognition mode, which recognizes arbitrary speech input from outside through a microphone or the like, phoneme feature vector extraction unit 2 receives the time series of speech feature parameters output from speech input/analysis unit 1 as described above, sequentially extracts from it the phoneme feature vectors to be matched against the phoneme dictionary stored in phoneme dictionary unit 5, and sends them to phoneme matching unit 3.

【００４６】音韻照合部３では順次入力される音韻特徴
べクトルそれぞれについて、音韻辞書部５に格納されて
いる音韻辞書と照合し、該当すると認識される音韻の尤
度時系列を出力する（Ｓ２２）。照合する方式は従来、
種々提案されている手法を適宜採用可能であり、例え
ば、マハラノビス距離、複合類似度法などにより求める
ことができる。The phoneme matching unit 3 matches each of the sequentially input phoneme feature vectors against the phoneme dictionary stored in the phoneme dictionary unit 5 and outputs a likelihood time series of the phonemes recognized as corresponding (S22). Any of the various matching methods proposed to date may be adopted as appropriate; for example, the likelihood can be obtained using the Mahalanobis distance or the composite similarity method.
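As a rough illustration of the per-frame matching in S22, the following Python sketch scores one feature vector against each phoneme entry using a squared Mahalanobis distance. It assumes a diagonal covariance per phoneme to stay short, and it negates the distance so that a larger value means a better match. The function names, the toy two-dimensional dictionary, and its values are hypothetical and do not come from the specification.

```python
def mahalanobis_sq(x, mean, variance):
    """Squared Mahalanobis distance under a diagonal-covariance assumption."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, variance))


def score_frame(x, phoneme_dict):
    """Return a likelihood-like score per phoneme (negated distance)."""
    return {
        name: -mahalanobis_sq(x, mean, variance)
        for name, (mean, variance) in phoneme_dict.items()
    }


# Hypothetical toy dictionary: phoneme name -> (mean vector, per-dimension variance).
toy_dict = {
    "A": ([1.0, 0.0], [1.0, 1.0]),
    "I": ([0.0, 1.0], [1.0, 1.0]),
}

scores = score_frame([0.9, 0.1], toy_dict)
best = max(scores, key=scores.get)  # phoneme with the highest score
```

Applying `score_frame` to every vector in the input sequence would give one possible form of the likelihood time series passed on to word matching.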

【００４７】音韻照合部３における照合で順次得られる
種々の区間分けによる認識音韻とその尤度、音韻の区分
情報等の情報は順次単語照合部４へ出力され、単語照合
部４ではこれを元に単語辞書を使用して単語照合を行な
うと共に、各単語毎にその尤度と総合点、音韻の尤度等
が算出される（Ｓ２３）。ここで云う尤度とは類似度、
距離、確率など、及びそれらを種々の方式で変換したも
のであり、照合の方式により決まるものである。The recognized phonemes obtained under various segmentations by the matching in the phoneme matching unit 3, together with their likelihoods, phoneme segmentation information, and the like, are output sequentially to the word matching unit 4. Based on this information, the word matching unit 4 performs word matching using the word dictionary and calculates, for each word, its likelihood, its overall score, the likelihoods of its phonemes, and so on (S23). The likelihood referred to here is a similarity, a distance, a probability, or the like, or a value obtained by transforming one of these in some way, and is determined by the matching method.

【００４８】単語照合部４での単語照合の結果は照合結
果判定出力部１２で判定され、総合的に最良の値を示す
単語が認識結果として出力される（Ｓ２４）。The result of word collation by the word collating unit 4 is determined by the collation result determination output unit 12, and a word showing the best overall value is output as a recognition result (S24).

【００４９】ここで、音韻特徴ベクトルの構成は、固定
次元ベクトルである。例えば、音声の１６チャネルの特
徴パラメータの時系列を時間軸方向に連続に、例えば、
６フレーム使用した１６×６＝９６次元の時間周波数ベ
クトルを使用する。また、音韻辞書部５に格納する音韻
辞書は、例えば、複合類似度法による照合では辞書作成
用のそれぞれの音韻特徴ベクトルの相関行列を作成し、
その相関行列をＫＬ展開した固有ベクトルと固有値にて
構成する。Here, the phoneme feature vector is a fixed-dimensional vector. For example, a 16 × 6 = 96-dimensional time-frequency vector is used, formed from, say, six consecutive frames along the time axis of the time series of 16-channel speech feature parameters. The phoneme dictionary stored in the phoneme dictionary unit 5 is, for example in matching by the composite similarity method, built by creating a correlation matrix from the phoneme feature vectors prepared for dictionary creation for each phoneme, and is composed of the eigenvectors and eigenvalues obtained by KL expansion of that correlation matrix.
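The construction of the 96-dimensional vector and its use with an eigenvector dictionary can be sketched in Python as follows. The stacking follows the 16-channel, six-frame example in the text. The similarity function uses a commonly cited form of the composite (multiple) similarity, S = Σ (λm/λ1)(x·φm)² / |x|², which is an assumption here because the specification does not write the formula out; the function names and the synthetic frame values are likewise hypothetical.

```python
def stack_frames(frames, start, width=6):
    """Concatenate `width` consecutive frames into one time-frequency vector."""
    window = frames[start:start + width]
    if len(window) < width:
        raise ValueError("not enough frames for a full window")
    return [value for frame in window for value in frame]


def sliding_vectors(frames, width=6):
    """Extract one stacked vector per frame position (continuous extraction)."""
    return [stack_frames(frames, s, width) for s in range(len(frames) - width + 1)]


def composite_similarity(x, eigenvalues, eigenvectors):
    """Assumed form: sum of (lambda_m / lambda_1) * (x . phi_m)^2, divided by |x|^2.

    `eigenvalues` must be sorted in descending order, paired with `eigenvectors`.
    """
    norm_sq = sum(v * v for v in x)
    lead = eigenvalues[0]
    total = 0.0
    for lam, phi in zip(eigenvalues, eigenvectors):
        projection = sum(a * b for a, b in zip(x, phi))
        total += (lam / lead) * projection ** 2
    return total / norm_sq


# Ten synthetic frames of 16 channels each (values are placeholders).
frames = [[float(t * 16 + c) for c in range(16)] for t in range(10)]
vectors = sliding_vectors(frames)  # 5 vectors, each 16 x 6 = 96-dimensional
```

With ten input frames and a six-frame window, five overlapping vectors are produced; each would be scored against the eigenvector set of every phoneme.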

【００５０】また、単語辞書部６に格納される単語辞書
は音韻をノードとするグラフにより、認識の対象である
単語の音韻のつながりを記述した構成としてある。単語
辞書の例を図３に示す。図３は「あきた（秋田）」なる
固有名詞の単語の辞書の記述例であり、図において、円
で囲んだ“Ｉ”は有声で発音される“Ｉ”を示し、円で
囲んだ“＃Ｉ”は無声化した母音である“Ｉ”を示して
いる。The word dictionary stored in the word dictionary unit 6 describes the phoneme connections of each word to be recognized as a graph whose nodes are phonemes. FIG. 3 shows an example of the word dictionary: the dictionary entry for the proper noun "Akita" (秋田). In the figure, a circled "I" denotes a voiced "I", and a circled "#I" denotes "I" as a devoiced vowel.

【００５１】すなわち、一例として示す認識単語である
“秋田（あきた）”は音韻／Ａ／，／Ｋ／，／Ｉ／，／
Ｔ／，／Ａ／の系列からなるが、通常の発音においては
／Ｋ／，／Ｉ／（つまり、“き”）における／Ｉ／は、
無声破裂音／Ｋ／，／Ｔ／の間にあるため、声帯が振動
しないまま発音を行なうことから、無声化した母音にな
る。That is, the recognition word "Akita" shown as an example consists of the phoneme sequence /A/, /K/, /I/, /T/, /A/, but in normal pronunciation the /I/ in /K/, /I/ (that is, "ki") lies between the voiceless plosives /K/ and /T/ and is therefore pronounced without vocal cord vibration, becoming a devoiced vowel.

【００５２】この無声化した母音は音響的には高い周波
数成分が多く、周波数特徴は声帯が振動して（有声）発
声された［Ｉ］とは異なる。一方、丁寧に発声された場
合には、／Ｋ／，／Ｉ／の／Ｉ／は有声で発声される。Acoustically, this devoiced vowel is rich in high-frequency components, and its spectral characteristics differ from those of [I] uttered with vocal cord vibration (voiced). When the word is uttered carefully, on the other hand, the /I/ in /K/, /I/ is voiced.
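A word dictionary entry of this kind might be represented as a small directed graph, sketched below in Python. The shape of the graph (a voiced "I" and a devoiced "#I" as alternatives between /K/ and /T/) is inferred from the description of FIG. 3 and is an assumption, as are the node identifiers `A1` and `A2` used to distinguish the two occurrences of /A/.

```python
# Hypothetical graph for "Akita": node -> list of successor nodes.
AKITA_GRAPH = {
    "start": ["A1"],
    "A1": ["K"],
    "K": ["I", "#I"],   # voiced /I/ or devoiced /#I/
    "I": ["T"],
    "#I": ["T"],
    "T": ["A2"],
    "A2": ["end"],
}

# Phoneme label carried by each node.
LABEL = {"A1": "A", "K": "K", "I": "I", "#I": "#I", "T": "T", "A2": "A"}


def phoneme_sequences(graph, node="start"):
    """Enumerate every phoneme sequence from `node` to the end node."""
    if node == "end":
        return [[]]
    sequences = []
    for successor in graph[node]:
        head = [] if successor == "end" else [LABEL[successor]]
        for tail in phoneme_sequences(graph, successor):
            sequences.append(head + tail)
    return sequences


variants = phoneme_sequences(AKITA_GRAPH)
```

Enumerating the paths yields one sequence per pronunciation variant, so a matcher can score each variant and keep the better one.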

【００５３】このような発声において、音響特徴が変化
する性質（音形規則）に基づき、認識対象の音韻系列を
グラフに表す。また、それぞれの音声セグメントについ
ての情報、例えば音声セグメントＮｎの最小の継続時間
（ｌNnのmin ）や、最大継続時間（ｌNnのmax ）におい
てのものも、単語辞書部６の単語辞書には格納されてお
り、単語照合部４において使用される。The phoneme sequences to be recognized are represented as a graph based on this property that acoustic features change in such utterances (phonological rules). Information on each speech segment, for example the minimum duration (lNn min) and the maximum duration (lNn max) of speech segment Nn, is also stored in the word dictionary of the word dictionary unit 6 and used by the word matching unit 4.

【００５４】次に図４と図５，図６に基づき、単語照合
部４の動作について説明する。単語照合部４においては
各認識対象単語Ｗk について、その音韻の系列の順に従
って、各音韻の尤度の和から当該認識対象単語Ｗk の尤
度の最大値ＬＦＷk を求める。Next, the operation of the word matching unit 4 will be described with reference to FIGS. 4, 5, and 6. For each recognition target word Wk, the word matching unit 4 obtains the maximum likelihood LFWk of that word from the sum of the likelihoods of its phonemes, following the order of its phoneme sequence.

【００５５】すなわち、単語Ｗk の照合を例にとると、
Ｗk なる単語の入力音声における音韻特徴ベクトルにつ
いて、それぞれ定められた音韻照合範囲内で範囲を変え
ながら該当すると思われる音韻を探し、音韻区分情報と
音韻名を出力し（音韻照合部）、この音韻区分情報と音
韻名を受けてその音韻の尤度や所定の算出式に基づく総
合点を求めてゆき、その中から最大尤度を示すものを該
当音韻として順次求めてゆくことにより、最大尤度を示
す音韻列の最大尤度ＬＦＷk と総合点の合計等を求め、
次にこの求めた最大尤度ＬＦＷk の音韻列に該当する単
語を照合結果として単語照合部４は出力する。ただし、
各音韻、例えば単語Ｗk のｎ番目（ｎは１，２，３，４
…）の音韻Ｎｎ（ステップＳ１）の継続時間の範囲ｌNn
のmin 〜ｌNnのmax は、予め各音韻について定められた
ものがあり（ステップＳ２）、通常の発声における音韻
Ｎｎの継続時間を表わす。That is, taking the matching of word Wk as an example: for the phoneme feature vectors of the input speech for word Wk, candidate phonemes are searched for while the range is varied within the phoneme matching range set for each, and phoneme segmentation information and phoneme names are output (phoneme matching unit). Given this segmentation information and these phoneme names, the likelihood of each phoneme and an overall score based on a predetermined formula are computed, and the candidate showing the maximum likelihood is taken in turn as the corresponding phoneme. In this way the maximum likelihood LFWk of the best phoneme sequence, the total of the overall scores, and so on are obtained, and the word matching unit 4 then outputs the word corresponding to the phoneme sequence with this maximum likelihood LFWk as the matching result. Note that for each phoneme, for example the n-th phoneme Nn of word Wk (n = 1, 2, 3, 4, ...) (step S1), a duration range lNn min to lNn max is defined in advance (step S2) and represents the duration of phoneme Nn in normal utterance.

【００５６】このように、音韻の継続時間の制限（最小
限ｌNnのmin〜最大値ｌNnのmax ，ステップＳ３）を利
用しながら動的計画法を用いて音韻間の境界位置ｔおよ
びｔ′（ｔ′は現在の探索音韻区間の一つ前の音韻区間
の末尾位置、ｔは現在の探索音韻区間の末尾側境界位
置）を決定しつつ、単語Ｗk の始端より現在の探索音韻
位置までの合計の尤度ＬＦＷk を求め（ＬＦNn,t＝ＬＦ
Nn-1,t' ＋ＰＬＦNn(t,t')但し、ＬＦNn,tはＬＦＷk で
あり、音声始端からｔまでの合計の尤度、ＬＦNn-1,t'
は音声始端からｔ´までの累積尤度、ＰＬＦNn(t,t')は
現在の音韻区間内の対数尤度の和）、その中の最大値を
示す尤度ＬＦＷk と、その場合の各音韻間の境界位置ｔ
およびｔ′を求める（ステップＳ４）。In this way, dynamic programming is applied under the phoneme duration constraint (minimum lNn min to maximum lNn max, step S3) to determine the boundary positions t and t' between phonemes (t' is the end position of the phoneme section immediately preceding the one currently being searched, and t is the end boundary of the current section) while obtaining the accumulated likelihood LFWk from the start of word Wk to the current search position: LFNn,t = LFNn-1,t' + PLFNn(t,t'), where LFNn,t corresponds to LFWk and is the total likelihood from the start of the speech to t, LFNn-1,t' is the accumulated likelihood from the start of the speech to t', and PLFNn(t,t') is the sum of the log-likelihoods within the current phoneme section. The likelihood LFWk giving the maximum value, together with the corresponding phoneme boundary positions t and t', is then obtained (step S4).

【００５７】そしてその認識対象単語Ｗk の構成音韻に
ついて全て照合され、音声終端ＴEの音韻区間における
尤度ＬＦNn，ＴE を加えて音声始端から音声終端ＴE ま
での尤度の合計値を求め、これを単語Ｗk についての尤
度ＬＦＷk とする（ステップＳ６）。単語Ｗk について
の尤度ＬＦＷk が最も高いものを用い、その場合の音韻
列を単語辞書の各単語と照合し、単語Ｗk に対する単語
辞書の各単語の類似度を総合点のかたちで求める。これ
を照合結果判定出力部１２で類似度の最大のものを判別
して、その判別した単語を認識結果として出力する。Once all the constituent phonemes of the recognition target word Wk have been matched, the likelihood LFNn,TE of the phoneme section ending at the speech end TE is added to give the total likelihood from the start of the speech to the speech end TE, which is taken as the likelihood LFWk of word Wk (step S6). Using the case in which the likelihood LFWk for word Wk is highest, the corresponding phoneme sequence is matched against each word in the word dictionary, and the similarity of each dictionary word to word Wk is obtained in the form of an overall score. The matching result determination output unit 12 then identifies the word with the highest similarity and outputs it as the recognition result.
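The duration-constrained dynamic programming of steps S1 to S6 can be illustrated with the following Python sketch. It implements the recurrence LFNn,t = LFNn-1,t' + PLFNn(t,t') over frame indices, taking the maximum over all t' allowed by the minimum and maximum duration of each phoneme, and then backtracks to recover the boundaries. The function name, the use of frame counts for durations, and the toy log-likelihood values are assumptions made for illustration; the flowcharts of FIGS. 5 and 6 may organize the computation differently.

```python
NEG_INF = float("-inf")


def align_word(frame_loglik, durations):
    """Best segmentation of the input into the word's phoneme sequence.

    frame_loglik[n][t]: log-likelihood of the word's n-th phoneme at frame t.
    durations[n]: (min_frames, max_frames) allowed for the n-th phoneme.
    Returns (total log-likelihood, end boundary of each phoneme).
    """
    n_phon = len(durations)
    n_frames = len(frame_loglik[0])

    # Prefix sums make each segment sum PLF(t, t') an O(1) lookup.
    prefix = []
    for row in frame_loglik:
        acc = [0.0]
        for value in row:
            acc.append(acc[-1] + value)
        prefix.append(acc)

    lf = [[NEG_INF] * (n_frames + 1) for _ in range(n_phon + 1)]
    back = [[None] * (n_frames + 1) for _ in range(n_phon + 1)]
    lf[0][0] = 0.0  # nothing consumed before the start of the speech

    for n in range(1, n_phon + 1):
        d_min, d_max = durations[n - 1]
        for t in range(1, n_frames + 1):
            for d in range(d_min, d_max + 1):
                t_prev = t - d
                if t_prev < 0:
                    break
                if lf[n - 1][t_prev] == NEG_INF:
                    continue
                plf = prefix[n - 1][t] - prefix[n - 1][t_prev]
                candidate = lf[n - 1][t_prev] + plf
                if candidate > lf[n][t]:
                    lf[n][t] = candidate
                    back[n][t] = t_prev

    if lf[n_phon][n_frames] == NEG_INF:
        return NEG_INF, []  # no segmentation satisfies the duration limits

    # Backtrack from the speech end to recover the phoneme boundaries.
    bounds = []
    t = n_frames
    for n in range(n_phon, 0, -1):
        bounds.append(t)
        t = back[n][t]
    bounds.reverse()
    return lf[n_phon][n_frames], bounds


# Toy input: two phonemes over four frames (placeholder values).
toy_loglik = [
    [-1.0, -1.0, -5.0, -5.0],
    [-5.0, -5.0, -1.0, -1.0],
]
score, boundaries = align_word(toy_loglik, [(1, 3), (1, 3)])
```

The returned boundaries play the role of the temporal information that the word learning mode later uses as phoneme section information.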

【００５８】[0058]

【００５９】このフローチャートの処理に従うことによ
り、図４の例では図に示したように／Ａ／，／Ｉ／，／
Ｋ／，／Ｔ／の各音韻について、その尤度の和が最大値
をとることが認識されることになり、この例の場合、／
Ａ／，／Ｉ／，／Ｋ／，／Ｔ／，／Ａ／の順でそれぞれ
最大の尤度を示す音韻が出現したことを見出したことに
なり、さらに、その時の各音韻の境界位置、すなわち、
図４に示すような時間的情報Ｋi （ｉ＝１，２，３，
…）を得ることができるので、これを単語学習モードで
音韻区間情報として使用する。Following the processing of this flowchart, in the example of FIG. 4 it is recognized, as shown in the figure, that the sum of the likelihoods is maximized for the phonemes /A/, /I/, /K/, /T/. In this example, the phonemes showing the maximum likelihood are found to appear in the order /A/, /I/, /K/, /T/, /A/, and the boundary positions of the phonemes at that time, that is, the temporal information Ki (i = 1, 2, 3, ...) shown in FIG. 4, are also obtained; this information is used as phoneme section information in the word learning mode.

【００６０】学習では図２に示すように、まず「単語学
習モード」を実行し、次に「音韻学習モード」を実行す
る。Learning proceeds as shown in FIG. 2: the "word learning mode" is executed first, followed by the "phoneme learning mode".

【００６１】「単語学習モード」では、ＳＷ１〜ＳＷ５
において“Ｃ”の経路を辿って処理が流れるように制御
される。従って、これにより、学習用音声データ格納部
１０，音声入力・分析部１，音韻特徴ベクトル抽出部
２，音韻照合部３，単語照合部４，音韻特徴ベクトル選
択部１３，学習用音韻特徴ベクトル格納部９の経路が動
作可能になる。In the "word learning mode", SW1 to SW5 are controlled so that processing flows along path "C". This enables the path through the learning speech data storage unit 10, speech input/analysis unit 1, phoneme feature vector extraction unit 2, phoneme matching unit 3, word matching unit 4, phoneme feature vector selection unit 13, and learning phoneme feature vector storage unit 9.
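The mode-dependent routing can be summarized as a lookup from operating mode to the chain of units that becomes active. The Python sketch below records the two chains stated in the description for the recognition mode and the word learning mode; the dictionary layout and function name are hypothetical, and the chain for the phoneme learning mode is omitted because it is not spelled out in the same way.

```python
# Unit numbers and names as used in the description.
UNITS = {
    1: "speech input/analysis unit",
    2: "phoneme feature vector extraction unit",
    3: "phoneme matching unit",
    4: "word matching unit",
    9: "learning phoneme feature vector storage unit",
    10: "learning speech data storage unit",
    12: "matching result determination output unit",
    13: "phoneme feature vector selection unit",
}

# Processing chain enabled by the switches in each mode.
PATHS = {
    "recognition": [1, 2, 3, 4, 12],           # switch path "A"
    "word_learning": [10, 1, 2, 3, 4, 13, 9],  # switch path "C"
}


def active_units(mode):
    """Return the names of the units on the path enabled for `mode`."""
    return [UNITS[number] for number in PATHS[mode]]


recognition_chain = active_units("recognition")
word_learning_chain = active_units("word_learning")
```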

【００６２】また、「単語学習モード」では、予め学習
用音声データのデータベースを準備する。学習用音声デ
ータベースは学習用音声データとその発声内容情報から
なるもので、学習用音声データ格納部１０に記憶されて
いる。In the "word learning mode", a database of learning voice data is prepared in advance. The learning voice database includes learning voice data and utterance content information, and is stored in the learning voice data storage unit 10.

【００６３】「単語学習モード」の処理に入ると、学習
用音声データ格納部１０に記憶されている学習用音声デ
ータは読み出されて音声入力・分折部１に入力される。
そして、学習用音声データは「認識モード」と同様にこ
の音声入力・分折部１において分折され、音声の特徴パ
ラメータの時系列データに変換される（Ｓ２１）。When the "word learning mode" is entered, the learning speech data stored in the learning speech data storage unit 10 is read out and input to the speech input/analysis unit 1. As in the "recognition mode", the learning speech data is analyzed by the speech input/analysis unit 1 and converted into time-series data of speech feature parameters (S21).

【００６４】そして、この変換された音声の特徴パラメ
ータの時系列データから音韻特徴ベクトル抽出部２は特
徴ベクトルの抽出を行ない、音韻照合部３はこの抽出さ
れた特徴ベクトルから該当する音韻の尤度の時系列を求
める（Ｓ３１）。次に、これをもとに単語照合部４は単
語辞書から該当の単語を得る。The phoneme feature vector extraction unit 2 then extracts feature vectors from the time-series data of the converted speech feature parameters, and the phoneme matching unit 3 obtains from these feature vectors the likelihood time series of the corresponding phonemes (S31). Based on this, the word matching unit 4 then obtains the corresponding word from the word dictionary.

【００６５】このようにして、学習用音声データの発声
内容から単語辞書を選択し、「認識モード」で述べた方
式と同様の方式により単語照合を行ない、該当する単語
を見付ける（Ｓ３２）。As described above, a word dictionary is selected from the utterance contents of the learning speech data, and word matching is performed by the same method as described in the "recognition mode" to find a corresponding word (S32).

【００６６】但し、「単語学習モード」においては、学
習用の音声データを用いるものであり、この音声データ
には音韻区間や音韻名、読み等のすべての情報が予め付
随情報として用意してあるので、単語辞書部６に記憶さ
れている単語辞書のすべての単語との照合を行なう必要
はなく、学習で選択した音声の単語との照合を行なうだ
けで良い。However, in the "word learning mode", voice data for learning is used, and in the voice data, all information such as phoneme sections, phoneme names, and readings are prepared in advance as accompanying information. Therefore, it is not necessary to perform matching with all the words in the word dictionary stored in the word dictionary unit 6, and it is sufficient to perform matching with the words of the speech selected by learning.

【００６７】次に単語照合部４は単語照合によって音韻
区分情報と音韻名を出力するので、音韻特徴ベクトル選
択部１３はこの照合によって得られる音韻区分情報を用
いて音韻特徴ベクトル抽出部２により抽出された学習用
音声データの音韻特徴べクトルを該当区間について選択
し、上記の音韻名を付与して学習用音韻特徴ベクトル格
納部９に学習用音韻特徴ベクトルとして格納する（Ｓ３
３）。Next, since the word matching unit 4 outputs phoneme segmentation information and phoneme names as a result of word matching, the phoneme feature vector selection unit 13 uses this segmentation information to select, for each corresponding section, the phoneme feature vectors of the learning speech data extracted by the phoneme feature vector extraction unit 2, attaches the phoneme names to them, and stores them in the learning phoneme feature vector storage unit 9 as learning phoneme feature vectors (S33).

【００６８】単語照合部４での単語照合によって得られ
る音韻区分情報を含む各種情報は学習制御部１１にも出
力され、学習の終了判定に用いられる。Various types of information including phoneme classification information obtained by word matching in the word matching unit 4 are also output to the learning control unit 11 and used to determine the end of learning.

【００６９】すなわち、学習制御部１１に学習用音声デ
ータ格納部１０からの付随情報が入力されているので、
単語照合部４は単語照合によって得られた音韻区分情報
やその他の情報から類似度がどの程度かを知ることがで
き、認識が正しく行われたか否かを判定することがで
き、学習を終了すべきか否かを決定することができる。
「単語学習モード」での処理が済むと次は「音韻学習モ
ード」に移る。「音韻学習モード」では、学習用音韻特
徴ベクトル格納部９から学習用音韻特徴ベクトルを順次
読み出し、音韻照合部３で音韻辞書との照合を行なう
（Ｓ３４）。That is, since the accompanying information from the learning speech data storage unit 10 is input to the learning control unit 11, the degree of similarity can be determined from the phoneme segmentation information and other information obtained by word matching in the word matching unit 4, whether recognition was performed correctly can be judged, and whether learning should be ended can be decided. Once processing in the "word learning mode" is complete, operation moves to the "phoneme learning mode". In the "phoneme learning mode", learning phoneme feature vectors are read out sequentially from the learning phoneme feature vector storage unit 9 and matched against the phoneme dictionary by the phoneme matching unit 3 (S34).

【００７０】音韻照合判定部８では音韻照合部３での照
合結果と学習用音韻特徴ベクトル格納部９の学習用音韻
特徴ベクトルに付与された音韻名との比較を行なって一
致するか否かの判定を行ない、また、必要ならば音韻毎
の尤度を求め、これと判定結果を学習制御部１１に出力
する。学習制御部１１では学習用音声データ格納部１０
からの付随情報を元に音韻照合判定部８からの判定結果
等の情報を勘案し、音韻学習を行なうか否かの判定を行
なう（Ｓ３５）。The phoneme matching determination unit 8 compares the matching result from the phoneme matching unit 3 with the phoneme name attached to the learning phoneme feature vector in the learning phoneme feature vector storage unit 9 to determine whether they agree, obtains the likelihood of each phoneme if necessary, and outputs these together with the determination result to the learning control unit 11. The learning control unit 11 takes into account the determination result and other information from the phoneme matching determination unit 8, on the basis of the accompanying information from the learning speech data storage unit 10, and determines whether phoneme learning is to be performed (S35).

【００７１】たとえば、判定条件として「すべての音韻
が正しく認識された」と云う条件や、「すべて正しく認
識され、しかも、２位との尤度差がＴＨ１（閾値
“１”）以上ある」と云う条件などが考えられる。この
条件に合った場合には更新を行なわずに次の学習終了判
定を行なう。Possible determination conditions include, for example, "all phonemes were recognized correctly" or "all phonemes were recognized correctly and, in addition, the likelihood difference from the second-ranked candidate is at least TH1 (threshold "1")". When the condition is satisfied, no update is performed and processing proceeds to the learning end determination.

【００７２】音韻照合判定部８においての判定の結果、
音韻学習を行なうこととなった場合には、音韻辞書更新
部７は学習制御部１１からの制御により、音韻辞書学習
に用いる特徴ベクトルを学習用音韻特徴ベクトル格納部
９から選択する（Ｓ３６）。この選択の条件として、例
えば「認識を誤った」と云う条件や「認識を誤ったか、
または正しく認識されたが２位との尤度差がＴＨ２以下
である」などの条件などが考えられる。さらに、「尤度
がＴＨ３（閾値“３”）以上のデータ」と云う条件を加
えて、信頼性の高いデータに限定することも考えられ
る。When the determination in the phoneme matching determination unit 8 results in phoneme learning being performed, the phoneme dictionary update unit 7, under the control of the learning control unit 11, selects from the learning phoneme feature vector storage unit 9 the feature vectors to be used for phoneme dictionary learning (S36). Possible selection conditions include, for example, "the vector was misrecognized" or "the vector was misrecognized, or was recognized correctly but with a likelihood difference from the second-ranked candidate of TH2 or less". A further condition, "data whose likelihood is at least TH3 (threshold "3")", may be added to restrict the selection to highly reliable data.
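The decision in S35 and the selection in S36 can be sketched together in Python. The sketch below uses the stricter of the example conditions given for each step. It assumes that the TH3 condition is applied to the likelihood of the labelled phoneme, which the text does not state explicitly, and the function names, sample format, and threshold values are hypothetical.

```python
def ranked(scores):
    """Phoneme candidates sorted from highest to lowest likelihood."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def margin(scores):
    """Likelihood difference between the first and second candidates."""
    order = ranked(scores)
    return order[0][1] - order[1][1]


def is_correct(label, scores):
    """True when the top candidate agrees with the attached phoneme name."""
    return ranked(scores)[0][0] == label


def learning_needed(samples, th1):
    """S35: skip learning only if every sample is correct with margin >= TH1."""
    return not all(
        is_correct(label, scores) and margin(scores) >= th1
        for label, scores in samples
    )


def select_for_update(samples, th2, th3):
    """S36: keep misrecognized or narrowly recognized samples that are reliable."""
    chosen = []
    for label, scores in samples:
        doubtful = (not is_correct(label, scores)) or margin(scores) <= th2
        reliable = scores[label] >= th3  # assumption: TH3 tests the labelled phoneme
        if doubtful and reliable:
            chosen.append((label, scores))
    return chosen


# Hypothetical samples: (attached phoneme name, likelihood per candidate).
samples = [
    ("A", {"A": 0.9, "I": 0.2}),    # correct, wide margin
    ("I", {"A": 0.6, "I": 0.5}),    # misrecognized
    ("K", {"K": 0.55, "T": 0.5}),   # correct, narrow margin
    ("T", {"T": 0.1, "K": 0.3}),    # misrecognized, low likelihood
]
selected = select_for_update(samples, th2=0.1, th3=0.4)
```

In this toy data the fourth sample is dropped by the TH3 condition even though it was misrecognized, which is the intended effect of restricting the update to reliable data.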

【００７３】選択された学習用音韻特徴ベクトルは、音
韻辞書更新部７で辞書更新に用いる（Ｓ３７）。たとえ
ば、複合類似度法による照合の場合には、予め用意され
ている相関行列と学習用音韻特徴ベクトルから新たな相
関行列を作成し、認識モード項での説明で述べたよう
に、ＫＬ展開を行なって更新した音韻辞書を得、さらに
続けて学習用音韻特徴ベクトルと音韻辞書の照合を行な
い、単語照合を行って「音韻学習モード」の実行を繰り
返す。The selected learning phoneme feature vectors are used by the phoneme dictionary update unit 7 to update the dictionary (S37). For example, in the case of matching by the composite similarity method, a new correlation matrix is created from the previously prepared correlation matrix and the learning phoneme feature vectors, and KL expansion is performed as described for the recognition mode to obtain an updated phoneme dictionary. The learning phoneme feature vectors are then matched against the phoneme dictionary again, word matching is performed, and the "phoneme learning mode" is repeated.
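One way to picture the update in S37 is shown in the Python sketch below. It adds the outer product of each selected vector to the existing correlation matrix and then extracts the leading eigenvalue and eigenvector by power iteration, as a stand-in for a full KL expansion. The plain additive update and its `weight` parameter are assumptions, since the specification says only that a new correlation matrix is created from the old one and the learning vectors; a practical system would compute several eigenpairs with a numerical library.

```python
def update_correlation(matrix, vectors, weight=1.0):
    """Add weight * x x^T for every learning vector x (assumed update rule)."""
    size = len(matrix)
    updated = [row[:] for row in matrix]
    for x in vectors:
        for i in range(size):
            for j in range(size):
                updated[i][j] += weight * x[i] * x[j]
    return updated


def leading_eigenpair(matrix, iterations=200):
    """Power iteration for the largest eigenvalue and its eigenvector.

    Converges for a symmetric positive semi-definite matrix provided the
    starting vector is not orthogonal to the leading eigenvector.
    """
    size = len(matrix)
    v = [1.0 / size ** 0.5] * size
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
        norm = sum(c * c for c in w) ** 0.5
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(size)) for i in range(size)]
    value = sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient
    return value, v


# Toy two-dimensional example with placeholder values.
old_matrix = [[1.0, 0.0], [0.0, 0.0]]
new_matrix = update_correlation(old_matrix, [[2.0, 0.0]])
eigenvalue, eigenvector = leading_eigenpair(new_matrix)
```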

【００７４】以上、述べた「音韻学習」を含む「学習」
の処理の終了判定は、学習制御部１１において行なう。
この終了判定制御は、「単語学習モード」で得られた音
韻区分情報を用いて行ない、例えば現在の学習がＩ回目
であるとすると、「Ｉ−１回目での音韻区分情報とＩ回
目の音韻区分情報の差がすべてスレシホールドレベルＴ
Ｈ5 秒以下である」と云う条件が考えられる。また、
「学習回数ｉが５となったならば終了する」と云う条件
も考えられる。The learning control unit 11 determines when the "learning" described above, including "phoneme learning", is to end. This end determination uses the phoneme segmentation information obtained in the "word learning mode"; for example, if the current learning pass is the I-th, one possible condition is "all differences between the phoneme segmentation information of pass I-1 and that of pass I are no greater than the threshold TH5 seconds". Another possible condition is "end when the learning count i reaches 5".

【００７５】このような終了判定条件が成立した場合に
は学習を終了する（Ｓ３８）。When such an end determination condition is satisfied, the learning is ended (S38).
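The end determination of S38 could be sketched in Python as follows. The two example conditions from the text are combined here with a logical OR, which is an assumption because the specification presents them as separate alternatives; the function name, the representation of boundaries as times in seconds, and the numeric values are hypothetical.

```python
def learning_finished(prev_bounds, curr_bounds, iteration, th5, max_iterations=5):
    """Return True when learning should stop.

    prev_bounds / curr_bounds: phoneme boundary times (seconds) from the
    previous and current word learning passes; prev_bounds is None on the
    first pass. `iteration` is the current learning count.
    """
    if iteration >= max_iterations:
        return True
    if prev_bounds is None:
        return False
    # Stop when every boundary moved by no more than TH5 seconds.
    return all(abs(a - b) <= th5 for a, b in zip(prev_bounds, curr_bounds))


stable = learning_finished([0.10, 0.25], [0.11, 0.25], iteration=2, th5=0.02)
moving = learning_finished([0.10, 0.25], [0.15, 0.25], iteration=2, th5=0.02)
capped = learning_finished([0.10, 0.25], [0.15, 0.25], iteration=5, th5=0.02)
```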

【００７６】以上のように、本システムでは学習制御部
１１は学習開始の指示により、図１に示すようにデータ
の流れを制御しながら、「単語学習モード」、「音韻学
習モード」の処理および音韻学習の終了判定、学習の終
了判定を行ない、学習の制御を行なう。As described above, in this system the learning control unit 11, on receiving an instruction to start learning, controls the flow of data as shown in FIG. 1 while carrying out the processing of the "word learning mode" and the "phoneme learning mode", the phoneme learning end determination, and the learning end determination, thereby controlling the learning.

【００７７】本実施例においては単語辞書を用いた単語
認識を行なったが、単語情報や文法情報などからなる文
認識辞書を用いた場合には文認識も可能である。さら
に、本実施例における尤度とは距離数似度，確率やそれ
らを変換したものであり、照合方式により決まるもので
ある。ただし、尤度についての説明で最大，最少と述べ
たものは距離の場合には、最少，最大と読み替えるもの
とする。In this embodiment, word recognition using a word dictionary was performed, but sentence recognition is also possible when a sentence recognition dictionary composed of word information, grammatical information, and the like is used. The likelihood in this embodiment is a distance, a similarity, a probability, or a value obtained by transforming one of these, and is determined by the matching method. Where the description of likelihood refers to a maximum or a minimum, these are to be read as minimum and maximum, respectively, when the likelihood is a distance.

【００７８】さらに、単語音声の認識では、単語照合に
先立って、音声の終了端検出を行うのも計算量削減とい
う点から効果的である。Further, in the recognition of the word voice, it is effective to detect the end of the voice prior to the word matching from the viewpoint of reducing the calculation amount.

【００７９】以上本発明システムにおける動作説明は単
語認識を例としたが、本発明は単語認識に限らず、連続
単語や文節、文の認識にも適用可能である。また、音声
発声の言語的最小単位としての音韻を用いて説明を行な
ったが、音節、母音・子音・母音、子音・母音・子音や
半音節など単位としての意味を逸脱しない範囲で適用可
能である。その他、本発明はその要旨を変更しない範囲
内で適宜変形して実施し得るものである。The operation of the system of the present invention has been described above using word recognition as an example, but the present invention is not limited to word recognition and is also applicable to the recognition of connected words, phrases, and sentences. In addition, although the description used the phoneme as the minimum linguistic unit of speech, the invention is applicable to other units such as syllables, vowel-consonant-vowel units, consonant-vowel-consonant units, and demisyllables, within a range that does not depart from the meaning of a unit. The present invention may also be modified as appropriate without departing from its gist.

【００８０】[0080]

【００８１】[0081]

【００８２】[0082]

【００８３】要するに本システムは、入力音声データを
分折して求められる音声の特徴パラメータの時系列から
音韻特徴ベクトルを連続的に抽出し、この音韻特徴ベク
トルと音韻辞書とを照合し、音韻の尤度時系列を得て、
この得た音韻尤度時系列と単語辞書を照合し、認識結果
を出力すると共に、さらに、音韻尤度時系列と単語辞書
との照合結果から入力の音韻区分情報を得て、音韻区分
情報と入力の特徴パラメータの時系列から学習用音韻特
徴ベクトルを選択し、学習用音韻特徴ベクトルと音韻辞
書を照合し、音韻照合の判定結果と学習用音韻特徴ベク
トルから音韻辞書を学習し、新しい音韻辞書を得ると云
うものである。In short, this system continuously extracts phoneme feature vectors from the time series of speech feature parameters obtained by analyzing the input speech data, matches these phoneme feature vectors against the phoneme dictionary to obtain a phoneme likelihood time series, and matches this phoneme likelihood time series against the word dictionary to output a recognition result. In addition, it obtains phoneme segmentation information for the input from the result of matching the phoneme likelihood time series against the word dictionary, selects learning phoneme feature vectors from the segmentation information and the time series of input feature parameters, matches the learning phoneme feature vectors against the phoneme dictionary, and trains the phoneme dictionary from the phoneme matching determination results and the learning phoneme feature vectors to obtain a new phoneme dictionary.

【００８４】学習モードは「音韻学習モード」と「単語
学習モード」とにより構成され、「単語学習モード」は
単語照合した結果から学習用音韻特徴ベクトルを選択す
るモードであり、また、「音韻学習モード」は記憶され
た学習用音韻特徴ベクトルを使用し音韻辞書を学習する
モードである。「単語学習モード」では、入力音声の内
容を表現する単語辞書と照合した結果から入力の特徴パ
ラメータの時系列の時間的区分を得て、入力単語辞書中
のそれぞれの音韻に対応する特徴ベクトルを選択し記憶
し、また、「音韻学習モード」では、学習用音韻特徴ベ
クトルを音韻辞書と照合し、学習に必要なデータである
か否かの判定を行ない、学習に必要なデータである場合
には音韻辞書の学習に使用し、新たな音韻辞書を得て、
学習の制御は音韻学習モードの学習を繰り返して行な
い、学習の結果から必要ならば「単語学習モード」で学
習用音韻特徴ベクトルを選択して、さらに「音韻学習モ
ード」で学習を行なうと云った処理を繰り返し、学習が
充分に行なわれたならば学習を停止させる。The learning mode consists of a "phoneme learning mode" and a "word learning mode". The "word learning mode" selects learning phoneme feature vectors from the result of word matching, and the "phoneme learning mode" trains the phoneme dictionary using the stored learning phoneme feature vectors. In the "word learning mode", the temporal segmentation of the time series of input feature parameters is obtained from the result of matching against the word dictionary that represents the content of the input speech, and the feature vector corresponding to each phoneme in the input word dictionary is selected and stored. In the "phoneme learning mode", the learning phoneme feature vectors are matched against the phoneme dictionary, it is determined whether each is data needed for learning, and data that is needed is used to train the phoneme dictionary and obtain a new phoneme dictionary. Learning is controlled by repeating the phoneme learning mode and, if the learning results call for it, selecting learning phoneme feature vectors again in the "word learning mode" and then training again in the "phoneme learning mode"; this processing is repeated, and learning is stopped once it has been carried out sufficiently.

【００８５】なお、ここで云う単語辞書とは、単語認識
を行なうためのものであり、例えば文認識を行なう場合
には語彙や構文・意味情報などから生成される文認識辞
書を用いる。また、同様に「単語学習モード」とは単語
認識を行なう場合のモードものであり、例えば文認識を
行なう場合には「文学習モード」を表わす。The word dictionary referred to here is for performing word recognition. For example, when performing sentence recognition, a sentence recognition dictionary generated from vocabulary, syntax, semantic information, and the like is used. Similarly, the "word learning mode" is a mode for performing word recognition, and for example, represents a "sentence learning mode" for performing sentence recognition.

【００８６】このように、本発明は入力音声の内容を表
現する単語辞書と照合した結果から入力の特徴パラメー
タの時系列の時間的区分を得て、入力単語辞書中のそれ
ぞれの音韻に対応する特徴ベクトルを選択し記憶し、さ
らにこの学習用音韻特徴ベクトルを音韻辞書と照合し、
学習に必要なデータであるか否かの判定を行ない、学習
に必要なデータである場合には音韻辞書の学習に使用
し、新たな音韻辞書を得るようにしたものであるから、
音韻を単位とした音声認識において高性能な音韻照合が
可能となる。As described above, the present invention obtains the temporal segmentation of the time series of input feature parameters from the result of matching against the word dictionary representing the content of the input speech, selects and stores the feature vector corresponding to each phoneme in the input word dictionary, matches these learning phoneme feature vectors against the phoneme dictionary, determines whether each is data needed for learning, and uses the data that is needed to train the phoneme dictionary and obtain a new one. High-performance phoneme matching therefore becomes possible in phoneme-based speech recognition.

【００８７】特に「単語学習モード」では単語辞書と照
合した結果から入力の特徴パラメータの時系列の音韻辞
書に適した時間的区分を得て、それぞれの音韻に対応し
た特徴ベクトルを高い精度で抽出でき、「音韻学習モー
ド」では学習用音韻特徴ベクトルと音韻辞書を用いて音
韻辞書学習を行ない、高い性能を持つ音韻辞書を作成す
ることができ、学習が「音韻学習モード」の学習の繰り
返しと「単語学習モード」の学習の組み合せを順次繰り
返すことで実現する構成としていることから、音韻の特
徴ベクトル抽出と音韻辞書の作成が同時に高精度に行な
うことができる。In particular, in the "word learning mode", a temporal segmentation of the input feature parameter time series that suits the phoneme dictionary is obtained from the result of matching against the word dictionary, so that the feature vector corresponding to each phoneme can be extracted with high accuracy. In the "phoneme learning mode", the phoneme dictionary is trained using the learning phoneme feature vectors and the phoneme dictionary, so that a high-performance phoneme dictionary can be created. Because learning is realized by successively repeating the combination of repeated "phoneme learning mode" passes and "word learning mode" passes, extraction of phoneme feature vectors and creation of the phoneme dictionary can both be carried out with high accuracy at the same time.

【００８８】また、話者適応型の認識装置において、音
韻学習と単語学習を繰り返し行なうことにより、話者の
発音法の特徴を考慮しながら、話者に適応した音韻辞書
の作成が可能となり、高い認識性能を得ることができ
る。Further, in the speaker-adaptive recognition device, by repeatedly performing phoneme learning and word learning, it is possible to create a phoneme dictionary adapted to a speaker while taking into account the characteristics of the pronunciation method of the speaker. High recognition performance can be obtained.

【００８９】さらに、雑音中の音声の認識装置におい
て、雑音による音韻の特徴ベクトルの変化と音韻特徴の
時間的区分の変化を考慮した音韻辞書の作成が可能とな
り、高い認識性能を得ることができる。Furthermore, in an apparatus for recognizing speech in noise, a phoneme dictionary can be created that takes into account noise-induced changes in the phoneme feature vectors and in the temporal segmentation of phoneme features, so that high recognition performance can be obtained.

【００９０】従って、本発明によれば、音韻辞書学習と
トップダウン的認識における入力音声の音韻への時間的
区分化の双方を考慮した高性能な音韻辞書作成が可能
で、入力音声の高い認識率を確保できるようにした音声
認識装置を提供することができる。Therefore, according to the present invention, a high-performance phoneme dictionary can be created that takes into account both phoneme dictionary learning and the temporal segmentation of input speech into phonemes in top-down recognition, and a speech recognition device that ensures a high recognition rate for input speech can be provided.

【００９１】[0091]

【発明の効果】以上説明したように本発明によれば、音
韻を単位とした音声認識において、音韻辞書学習とトッ
プダウン的認識による入力の音韻への時間的区分化の両
方を考慮した学習機能により高性能な音韻照合が可能と
なり、単語辞書を用いた入力音声の認識において高い認
識率を得ることが可能となる。As described above, according to the present invention, in speech recognition in units of phonemes, a learning function that takes into account both phoneme dictionary learning and temporal segmentation of input into phonemes by top-down recognition. Thus, high-performance phonemic collation can be performed, and a high recognition rate can be obtained in recognition of input speech using a word dictionary.

[Brief description of the drawings]

【図１】本発明の一実施例を示すブロック図。FIG. 1 is a block diagram showing one embodiment of the present invention.

【図２】本発明システムの動作を説明するためのフロー
チャート。FIG. 2 is a flowchart for explaining the operation of the system of the present invention.

【図３】本発明システムに用いる単語辞書の構成例を説
明するための図。FIG. 3 is a diagram for explaining a configuration example of a word dictionary used in the system of the present invention.

【図４】本発明システムにおける音声特徴パラメータと
音韻尤度との関係を説明するための図。FIG. 4 is a diagram for explaining the relationship between speech feature parameters and phoneme likelihood in the system of the present invention.

【図５】本発明システムに使用する単語認識の手法を説
明するフローチャート。FIG. 5 is a flowchart illustrating a word recognition technique used in the system of the present invention.

【図６】本発明システムに使用する単語認識の手法を説
明するフローチャート。FIG. 6 is a flowchart illustrating a word recognition method used in the system of the present invention.

[Explanation of symbols]

１…音声入力・分折部、２…音韻特徴ベクトル抽出部、
３…音韻照合部、４…単語照合部、５…音韻辞書部、６
…単語辞書部、７…音韻辞書更新部、８…音韻照合判定
部、９…学習用音韻特徴ベクトル格納部、１０…学習用
音声データ格納部、１１…学習制御部、１２…照合結果
判定出力部、１３…音韻特徴ベクトル選択部、ＳＷ１〜
ＳＷ５…信号の経路切り替え機能部。1: speech input/analysis unit; 2: phoneme feature vector extraction unit; 3: phoneme matching unit; 4: word matching unit; 5: phoneme dictionary unit; 6: word dictionary unit; 7: phoneme dictionary update unit; 8: phoneme matching determination unit; 9: learning phoneme feature vector storage unit; 10: learning speech data storage unit; 11: learning control unit; 12: matching result determination output unit; 13: phoneme feature vector selection unit; SW1 to SW5: signal path switching units.

フロントページの続き (58)調査した分野(Int.Cl.⁷，ＤＢ名) G10L 15/00 - 15/28 Continued from front page (58) Fields searched (Int. Cl.⁷, DB name): G10L 15/00 - 15/28

Claims

(57) [Claims]

1. A speech recognition apparatus in which phoneme feature vectors are continuously extracted by phoneme feature vector acquisition means from a time series of speech feature parameters obtained by analyzing input speech data, the phoneme feature vectors thus obtained are matched by phoneme matching means against a phoneme dictionary that is held by phoneme dictionary storage means and composed of phoneme feature vectors for each phoneme so as to obtain a likelihood time series of matching phonemes, the obtained likelihood time series is matched by word matching means against a word dictionary held by word dictionary storage means to determine the likelihood of each word, and the word having the highest likelihood is obtained as the recognized word, wherein the phoneme dictionary storage means is capable of updating the phoneme dictionary, the apparatus comprising: learning speech data holding means for holding speech data used for learning in a word learning mode; a phoneme feature vector selection unit which, in the word learning mode, selects, for the corresponding sections, phoneme feature vectors of the learning speech data from the phoneme segmentation information of the input data obtained from the word matching means using the learning speech data as the input speech data, thereby obtaining learning phoneme feature vectors; a learning phoneme feature vector storage unit for storing the selected learning phoneme feature vectors; a phoneme matching determination unit which, in a phoneme learning mode, compares the learning phoneme feature vectors in the learning phoneme feature vector storage unit with the matching result obtained by matching the learning phoneme feature vectors against the phoneme dictionary in the phoneme matching unit; a phoneme dictionary update unit for updating the phoneme dictionary in the storage means; and a learning control unit for performing switching control between the word learning mode and the phoneme learning mode.

2. The learning control section according to claim 1, wherein
The word learning mode and the phonological learning mode using a calligraphy
And a control function for updating the phonological dictionary.
The speech recognition device according to claim 1, further comprising:

3. The learning control section according to claim 1, wherein
And the number of repetitions of the phoneme learning mode
3. A control function according to claim 2, wherein the control function is provided. Voice of
Recognition device.