JPH07104779A

JPH07104779A - Voice recognizing method

Info

Publication number: JPH07104779A
Application number: JP5247186A
Authority: JP
Inventors: Yoshiaki Noda; 喜昭野田; Akihiro Imamura; 明弘今村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 1993-10-01
Filing date: 1993-10-01
Publication date: 1995-04-21

Abstract

PURPOSE: To prevent insufficient learning and to reduce the search processing required during speech recognition by constructing a subword model that uniquely accommodates ambiguous acoustic phenomena whose kana (Japanese syllabary) notation cannot be uniquely associated with a subword. CONSTITUTION: In the speech recognition device, a speech recognition process operates on the basis of speech recognition environment definitions 1 to perform speech recognition. Within the speech recognition environment definitions 1, the subword system definitions 12 hold the subword labels and the information for associating subword labels with acoustic phenomena. These definitions include subword labels that uniquely accommodate acoustic phenomena whose kana notation cannot be uniquely associated with a subword, such as 'souchi' and 'sohchi'. Hidden Markov models corresponding to these subword labels are then trained, and the same subword labels are used to perform speech recognition that absorbs the ambiguity of the utterance.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a speech recognition method using hidden Markov models (hereinafter referred to as HMMs).

[0002]

[Prior Art] Conventionally, in speech recognition methods using HMMs (hidden Markov models), speech recognition of an arbitrary vocabulary can be performed by using, as the HMM unit that represents the acoustic phenomena of speech, a unit smaller than a word (hereinafter referred to as a subword), such as a phoneme.

[0003] However, speech data corresponding to a subword can be obtained only by cutting it out of speech produced by uttering a word or sentence that contains the subword (hereinafter, this work is referred to as labeling). Because HMM-based speech recognition requires statistical learning from a large amount of speech data, the labeling work demands a great deal of time and labor.

[0004] Connection learning is one technique for alleviating this problem. It exploits the fact that the HMM corresponding to a sentence or word is formed by concatenating the HMMs of the subwords that correspond to the learning speech data (hereinafter referred to as subword HMMs). By learning the HMM of a sentence or word containing multiple subwords as a whole, each subword HMM can be learned without supplying information on the exact interval in which each subword label (the name of a subword) occurs.
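The concatenation that connection learning relies on can be pictured with a minimal sketch. It assumes each subword HMM is a left-to-right chain described only by its per-state self-loop probabilities; the table `SUBWORD_HMMS`, its values, and the function `concatenate` are hypothetical and do not come from the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical self-loop probabilities for the states of each subword HMM.
SUBWORD_HMMS: Dict[str, List[float]] = {
    "s": [0.6, 0.5, 0.7],
    "o": [0.8, 0.8, 0.8],
    "u": [0.7, 0.7, 0.7],
}


def concatenate(labels: List[str],
                hmms: Dict[str, List[float]]) -> Tuple[List[str], List[List[float]]]:
    """Chain left-to-right subword HMMs into one word-level HMM.

    Returns the owning label of every composite state and the transition
    matrix, which has one extra column for the final exit state.
    """
    owners: List[str] = []
    self_loops: List[float] = []
    for label in labels:
        for p in hmms[label]:
            owners.append(label)
            self_loops.append(p)

    n = len(self_loops)
    trans = [[0.0] * (n + 1) for _ in range(n)]
    for i, p in enumerate(self_loops):
        trans[i][i] = p            # stay in the same state
        trans[i][i + 1] = 1.0 - p  # advance to the next state (or exit)
    return owners, trans


owners, trans = concatenate(["s", "o", "u"], SUBWORD_HMMS)
print(owners)
```

Because the composite model keeps track of which subword owns each state, statistics gathered while learning the whole word can be credited back to the individual subword HMMs without knowing the label boundaries in advance.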

[0005] Consequently, a speech recognition method that uses connection learning can omit the labeling work of locating the interval of each subword label, so learning from a large amount of speech data is comparatively easy. In such a method, the speech data to be learned is converted, through a predetermined subword sequence, into the subword label string corresponding to that data, which is then used for connection learning.

[0006]

[Problems to Be Solved by the Invention] In the conventional speech recognition method described above, when the subword label string supplied to connection learning is generated from kana notation, the kana notation and the acoustic phenomenon actually uttered can become entangled, which may result in insufficient learning. For example, as shown in FIG. 1, the portion written in kana as 「そうち」 (souchi) is converted into the subword label string "s, o, u, ch, i", but depending on the speaker's habits it may be uttered as 「そーち」 (sohchi) or 「そおち」 (soochi).

[0007] Likewise, 氷 (こおり, koori, "ice") is converted into "k, o, o, r, i", but it may actually be uttered as 「こうり」 (kouri) or 「こーり」 (kohri). Thus, when kana notation is converted into a subword label string, there is an ambiguity: the kana notation alone cannot uniquely determine the subwords. In the conventional speech recognition method, as shown in FIG. 2(a), acoustic phenomena and subword labels are associated one to one, so when the kana notation and the acoustic phenomena become entangled, similar acoustic phenomena are assigned to different subword HMMs. That is, because of pronunciation ambiguity, learning for the same speech data is dispersed across multiple HMMs, which can cause insufficient learning.

[0008] Furthermore, when multiple subword labels correspond to the same speech data, speech recognition that takes this ambiguity into account must search all of those subword labels at recognition time, which increases the search processing. The present invention has been made in view of these circumstances, and its object is to provide a speech recognition method that prevents insufficient learning and reduces the amount of search processing during speech recognition.

[0009]

[Means for Solving the Problems] The speech recognition method according to the present invention is a speech recognition method in which subwords smaller than a word are set as the units for expressing acoustic phenomena with hidden Markov models, and is characterized by constructing a subword model that uniquely accommodates acoustic phenomena whose kana notation cannot be uniquely associated with a subword.

[0010]

[Operation] According to the above method, a subword model is constructed that uniquely accommodates ambiguous acoustic phenomena whose kana notation cannot be uniquely associated with a subword. All of the ambiguous acoustic phenomena can therefore be learned by a single model, and the speech data used for learning is not dispersed across multiple subword models. In addition, fewer subword models need to be searched during speech recognition that absorbs pronunciation ambiguity. That is, insufficient learning is prevented and the amount of search processing during speech recognition is reduced.

[0011]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 3 shows the functional configuration of a speech recognition device to which a speech recognition method according to an embodiment of the present invention is applied. In this device, 1 is a speech recognition environment definition holding the various definitions needed for speech recognition processing, and 2 is a speech recognition process that performs speech recognition using HMMs whose units are subwords. The device is configured so that the speech recognition process 2 operates on the basis of the speech recognition environment definition 1.

[0012] In the speech recognition environment definition 1, 11 is a feature vector definition. It holds information for selecting the analysis method (for example, LPC (Linear Predictive Coding) cepstrum) used to obtain feature vectors that capture the linguistic features of speech, and information on the dimensions of the parameters obtained by the selected analysis method. 12 is a subword system definition, which holds the subword labels and the information for associating subword labels with acoustic phenomena.
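The patent names the LPC cepstrum only as an example analysis method and does not say how it is computed. As an illustration, the sketch below uses the standard recursion that converts linear-prediction coefficients into cepstral coefficients, assuming the all-pole model H(z) = 1 / (1 - Σ a_k z^-k) and omitting the gain term c_0. The function name `lpc_to_cepstrum` is hypothetical.

```python
from typing import List, Sequence


def lpc_to_cepstrum(a: Sequence[float], n_cep: int) -> List[float]:
    """Convert predictor coefficients a_1..a_p into cepstra c_1..c_n_cep.

    a[k - 1] holds a_k for the model H(z) = 1 / (1 - sum_k a_k z^-k).
    Recursion: c_n = a_n + sum_{k=1}^{n-1} (k / n) * c_k * a_{n-k},
    with a_m treated as zero for m > p.
    """
    p = len(a)
    c: List[float] = []
    for n in range(1, n_cep + 1):
        value = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                value += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(value)
    return c


# For a first-order predictor a_1 = 0.5 the closed form is c_n = 0.5**n / n.
print(lpc_to_cepstrum([0.5], 3))
```

The number of cepstral coefficients kept (`n_cep` here) is the kind of parameter-dimension information that the feature vector definition 11 is described as holding.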

[0013] This association information specifies, for example, that the subword label "LOUL" corresponds to the kana notations 「おう」 (ou) and 「おお」 (oo), as shown in FIG. 2(b). With definition information such as the examples below, a portion that remains ambiguous in kana notation can be expressed by a single subword label.

[0014]
LgL: /g/, its nasalized form, and sounds intermediate between them
LgyL: /gy/, its nasalized form, and sounds intermediate between them
LwoL: /o/, /wo/, and sounds intermediate between them
LOUL: /o//u/, the temporal fusion of /o/ and /u/, /oo/, and sounds intermediate between them
LEIL: /e//i/, the temporal fusion of /e/ and /i/, /ee/, and sounds intermediate between them
LIUL: /i//u/, the temporal fusion of /i/ and /u/, /y//uu/, and sounds intermediate between them
LXIL: /i/, its devoiced form, and sounds intermediate between them
LXUL: /u/, its devoiced form, and sounds intermediate between them
LPL: presence or absence of a silent interval
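A minimal sketch of how such definitions might drive the conversion from kana notation to a subword label string follows. It models only the LOUL label and only the kana that appear in the examples of this description; the tables `KANA_TO_PHONEMES` and `MERGED_PAIRS`, the function `kana_to_labels`, and the greedy pair-merging rule are all assumptions made for illustration, not the conversion procedure of the patent.

```python
from typing import Dict, List, Tuple

# Hypothetical kana-to-phoneme table covering only the examples in the text.
KANA_TO_PHONEMES: Dict[str, List[str]] = {
    "そ": ["s", "o"],
    "う": ["u"],
    "お": ["o"],
    "ち": ["ch", "i"],
    "こ": ["k", "o"],
    "り": ["r", "i"],
}

# Phoneme pairs absorbed by one merged label (only LOUL is modeled here).
MERGED_PAIRS: Dict[Tuple[str, ...], str] = {
    ("o", "u"): "LOUL",
    ("o", "o"): "LOUL",
}


def kana_to_labels(kana: str) -> List[str]:
    """Convert a kana string into a subword label string with merged labels."""
    phonemes: List[str] = []
    for ch in kana:
        if ch == "ー":
            # Long-vowel mark: repeat the preceding vowel.
            phonemes.append(phonemes[-1])
        else:
            phonemes.extend(KANA_TO_PHONEMES[ch])

    labels: List[str] = []
    i = 0
    while i < len(phonemes):
        pair = tuple(phonemes[i:i + 2])
        if pair in MERGED_PAIRS:
            labels.append(MERGED_PAIRS[pair])  # ambiguous pair -> one label
            i += 2
        else:
            labels.append(phonemes[i])
            i += 1
    return labels


for word in ["そうち", "そーち", "そおち", "こおり", "こうり"]:
    print(word, kana_to_labels(word))
```

Under these assumptions, the three spellings of each word collapse to the same label string, so every utterance of the word contributes to the same LOUL model. A practical converter would also need to avoid merging vowels across word or morpheme boundaries, which this sketch ignores.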

[0015] In the definition examples above, the subword label "LOUL", for instance, corresponds to the uttered sounds /o//u/, /ou/, /oo/, and so on. Also in the speech recognition environment definition 1 of FIG. 3, 13 is an HMM structure definition, which holds information on the number of HMM states and on how the output probability density distributions are represented.

[0016] Next, in the speech recognition process 2, 15 is an analysis process that converts the input speech into feature parameters, and 16 is a learning process that estimates the HMM parameters (described later) from the feature parameters and the linguistic subword label information corresponding to them. 17 is a recognition process, which recognizes the input speech using the HMM parameters estimated by the learning process 16 and the feature parameters supplied by the analysis process 15.

[0017] In the analysis process 15, 21 is an AD (analog-to-digital) conversion unit that band-limits the input speech and converts it into digital data, and 22 is a feature parameter calculation unit that calculates the feature parameters of the input speech from the digital data output by the AD conversion unit 21.

[0018] In the learning process 16, 33 is label data comprising the subword labels defined in the subword system definition 12 and the times at which each label appears in the input speech. 23 is an initial learning unit, which performs initial learning on the basis of the label data 33 and the feature parameters supplied from the feature parameter calculation unit 22, and outputs the resulting HMM parameters.
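To make the role of the label data 33 concrete, the sketch below pools the feature frames that fall inside each labeled interval and computes a mean and variance per subword label. This is a deliberately simplified stand-in for initial learning, not the segmental k-means procedure the embodiment uses. The one-dimensional features, the `(label, start_frame, end_frame)` record format, and the function `initial_estimates` are assumptions.

```python
from statistics import fmean, pvariance
from typing import Dict, List, Sequence, Tuple

# Hypothetical record: (subword label, first frame, one past the last frame).
LabelRecord = Tuple[str, int, int]


def initial_estimates(frames: Sequence[float],
                      label_data: Sequence[LabelRecord]) -> Dict[str, Tuple[float, float]]:
    """Return a (mean, variance) pair per label from time-aligned frames."""
    pooled: Dict[str, List[float]] = {}
    for label, start, end in label_data:
        # Collect every frame that the label data assigns to this label.
        pooled.setdefault(label, []).extend(frames[start:end])
    return {label: (fmean(values), pvariance(values))
            for label, values in pooled.items()}


frames = [1.0, 1.0, 3.0, 3.0, 5.0, 7.0]
label_data = [("a", 0, 2), ("b", 2, 4), ("a", 4, 6)]
print(initial_estimates(frames, label_data))
```

The point of the sketch is only that time-aligned labels let each subword model be initialized from its own frames; the resulting estimates would then be refined by connection learning, which needs no such alignment.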

[0019] 26 is a kana notation label string conversion unit that converts the learning kana notation character string data 32, which is the kana notation of the input speech, into learning label string data 34. 24 is a connection learning unit, which performs connection learning using the HMM parameters supplied from the initial learning unit 23, the feature parameters supplied from the feature parameter calculation unit 22, and the learning label string data 34, and outputs the corresponding HMM parameters. These HMM parameters represent the HMM state transition probabilities, the output density distribution of each state, and so on, as estimated by the learning process 16, and are stored in the HMM parameter data 31.

[0020] Further, in the recognition process 17, 27 is a kana notation label string conversion unit. On the basis of the subword system definition 12, it converts the recognition kana notation character string data 35, which expresses the "permitted grammar" used for speech recognition that takes utterance ambiguity into account, into recognition label string data 36. 25 is a recognition processing unit that performs speech recognition on the basis of the recognition label string data 36, the feature parameters calculated by the feature parameter calculation unit 22, and the HMM parameter data 31, and outputs the recognition result.

[0021] In this configuration, the analysis process 15 is described first. The input speech is band-limited and converted into digital data by the AD conversion unit 21. The digital data is supplied to the feature parameter calculation unit 22, where it is analyzed according to the analysis method and parameter dimensions defined in the feature vector definition 11, yielding the feature parameters of the input speech. These feature parameters are supplied to the learning process 16 and the recognition process 17.

[0022] In the learning process 16, the feature parameters are supplied to the initial learning unit 23 and the connection learning unit 24. The initial learning unit 23 performs initial learning of the HMM for each subword label from the feature parameters and the label data 33. As learning algorithms, it uses the segmental k-means training procedure and the Forward-Backward algorithm. Details are given in L. R. Rabiner, J. G. Wilpon, and B. H. Juang, "A segmental k-means training procedure for connected word recognition" (AT&T Technical Journal, vol. 65, pp. 21-31 (1986)), and in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (IEICE (1988)). In this way, the initial learning unit 23 performs initial learning and obtains the HMM parameters.

[0023] These HMM parameters are supplied to the connection learning unit 24, where they are used for connection learning together with the feature parameters and the learning label string data 34. Details of connection learning are given, for example, in Minami, Matsuoka, and Shikano, "Evaluation of connection-learning HMMs using a speaker-independent continuous speech database" (IEICE Technical Report, SP91-113 (1992)). The connection learning unit 24 re-estimates the HMM parameters obtained by initial learning, producing the HMM parameter data 31.

[0024] Meanwhile, in the recognition processing unit 25 of the recognition process 17, the subword label string corresponding to the feature parameters supplied from the feature parameter calculation unit 22 is recognized on the basis of the recognition label string data 36, which describes the "permitted grammar", and the HMM parameter data 31. Details of the Viterbi algorithm used in this recognition are given, for example, in Seiichi Nakagawa, "Speech Recognition by Probabilistic Models" (IEICE (1988)).
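For readers unfamiliar with it, the Viterbi algorithm finds the most likely state sequence of an HMM for a given observation sequence. The sketch below is a generic log-domain version for a small HMM with discrete observations; it is not the recognizer of the embodiment, which works on feature parameters and a grammar of subword labels. The toy model and all names are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def safe_log(x: float) -> float:
    """Natural log that maps zero probability to negative infinity."""
    return math.log(x) if x > 0.0 else float("-inf")


def viterbi(init: Sequence[float],
            trans: Sequence[Sequence[float]],
            emit: Sequence[Sequence[float]],
            observations: Sequence[int]) -> Tuple[List[int], float]:
    """Return the best state path and its log probability."""
    n = len(init)
    score = [safe_log(init[s]) + safe_log(emit[s][observations[0]])
             for s in range(n)]
    backpointers: List[List[int]] = []

    for obs in observations[1:]:
        new_score: List[float] = []
        pointers: List[int] = []
        for s in range(n):
            # Best predecessor of state s at this time step.
            best = max(range(n),
                       key=lambda p: score[p] + safe_log(trans[p][s]))
            pointers.append(best)
            new_score.append(score[best] + safe_log(trans[best][s])
                             + safe_log(emit[s][obs]))
        score = new_score
        backpointers.append(pointers)

    # Trace back from the best final state.
    last = max(range(n), key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Toy two-state left-to-right HMM with two observation symbols.
init = [1.0, 0.0]
trans = [[0.5, 0.5], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.1, 0.9]]
path, log_prob = viterbi(init, trans, emit, [0, 0, 1, 1])
print(path, log_prob)
```

In a subword-based recognizer the states searched belong to the subword HMMs permitted by the grammar, so the number of label alternatives directly affects how many paths must be scored.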

[0025] As described above, because subword labels that absorb the ambiguity of utterance are defined in the subword system definition 12, kana notation can easily be converted into a subword label string. Moreover, because all of the ambiguous acoustic phenomena can be represented by a single subword label, the speech data corresponding to one subword label is not dispersed across multiple HMMs during the learning process, and a large amount of learning can be performed.

[0026] Furthermore, because all of the ambiguous acoustic phenomena can be represented by a single subword label, recognition that absorbs utterance ambiguity no longer needs to permit, as the conventional method does, both the subword label for the uttered sound /ou/ and the subword label for the uttered sound /oo/, so the computation required for the search can be reduced.
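The reduction can be illustrated by counting how many label strings a grammar must permit. The sketch below assumes a hypothetical two-word grammar entry ("souchi koori") in which each long-o position has two conventional alternatives; the slot lists and the function `expand` are made up for illustration.

```python
from itertools import product
from typing import List, Sequence


def expand(slots: Sequence[Sequence[str]]) -> List[str]:
    """List every label string permitted by a sequence of alternative slots."""
    return [" ".join(choice) for choice in product(*slots)]


# Conventional labels: each ambiguous position needs both alternatives.
conventional = [["s"], ["o u", "o o"], ["ch"], ["i"],
                ["k"], ["o u", "o o"], ["r"], ["i"]]

# Merged labels: one label covers every alternative.
merged = [["s"], ["LOUL"], ["ch"], ["i"],
          ["k"], ["LOUL"], ["r"], ["i"]]

print(len(expand(conventional)))  # 4 label strings to search
print(len(expand(merged)))        # 1 label string to search
```

With two alternatives per ambiguous position, the conventional count doubles with every such position, while the merged representation always yields a single label string.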

[0027]

[Effects of the Invention] As described above, according to the present invention, a subword model is constructed that uniquely accommodates ambiguous acoustic phenomena whose kana notation cannot be uniquely associated with a subword. All of the ambiguous acoustic phenomena can therefore be learned by a single model, and the speech data used for learning is not dispersed across multiple subword models. In addition, fewer subword models need to be searched during speech recognition that absorbs pronunciation ambiguity. The invention thus prevents insufficient learning and reduces the amount of search processing during speech recognition.

[Brief description of drawings]

FIG. 1 is a diagram showing an example of conversion from kana notation to a subword label string.

FIG. 2 is a diagram showing the correspondence among kana notation, acoustic phenomena, and subword labels.

FIG. 3 is a diagram showing the functional configuration of a speech recognition device to which a speech recognition method according to an embodiment of the present invention is applied.

[Explanation of symbols]

1: speech recognition environment definition
2: speech recognition process
11: feature vector definition
12: subword system definition
13: HMM structure definition
15: analysis process
16: learning process
17: recognition process
21: AD conversion unit
22: feature parameter calculation unit
23: initial learning unit
24: connection learning unit
25: recognition processing unit
26, 27: kana notation label string conversion units
31: HMM parameter data
32: learning kana notation character string data
33: label data
34: learning label string data
35: recognition kana notation character string data
36: recognition label string data

Claims

[Claims]

1. A speech recognition method in which subwords smaller than a word are set as the units for expressing acoustic phenomena with hidden Markov models, characterized by constructing a subword model that uniquely accommodates acoustic phenomena whose kana notation cannot be uniquely associated with a subword.