JPH09311693A

JPH09311693A - Speech recognition apparatus

Info

Publication number: JPH09311693A
Application number: JP8125420A
Authority: JP
Inventors: Nobuyuki Kono; 信幸香野
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1996-05-21
Filing date: 1996-05-21
Publication date: 1997-12-02

Abstract

PROBLEM TO BE SOLVED: To reduce the amount of speech data needed for learning (word registration) by providing a speech dictionary file, a matching determination unit that calculates a likelihood for each word model and determines recognition candidates, and a determination result output unit. SOLUTION: Speech containing word speech is input through voice input means 1-1, and a word speech cutout unit 1-2 cuts out only the word speech portion. A feature extraction unit 1-3 extracts feature data from the cut-out word speech, and a state number estimation unit 1-4 estimates, from that feature data, the number of states to use for the word speech when it is modeled by a hidden Markov model (HMM). A learning unit 1-5 obtains HMM parameters by fitting the feature data to the word model. A speech dictionary file 1-6 holds the learned HMM parameters and likelihood information, a matching determination unit 1-7 calculates the likelihood for each word model and determines the recognition candidates, and a determination result output unit 1-8 outputs the recognition result.

Description

Detailed Description of the Invention

[0001]

[Technical field of the invention] The present invention relates to a speech recognition apparatus that recognizes word speech from a specific speaker and outputs the recognition result.

[0002]

[Prior art] To explain a conventional speech recognition apparatus that recognizes word speech using a Hidden Markov Model, the method of speech recognition by a Hidden Markov Model is described first.

[0003] A Hidden Markov Model (hereinafter abbreviated as HMM) is a Markov model that has N states S₁, S₂, ..., S_N, moves from state to state at fixed intervals with a certain probability (the transition probability), and at each transition outputs one label (feature data) with a certain probability (the output probability). Regarding speech as a time series of labels (feature data), each word is uttered several times during learning to create an HMM that models those utterances. During recognition, the apparatus searches for the HMM that maximizes the probability (likelihood) of outputting the label sequence of the input speech.

[0004] A concrete description follows with reference to the drawings. FIG. 5 shows an example HMM, a simple one presented in "Speech Recognition Based on Hidden Markov Models" (Journal of the Acoustical Society of Japan, Vol. 42, No. 12, 1986). This HMM consists of three states and outputs label sequences made up of only two kinds of labels, a and b. The initial state is S₁. From S₁, the model either transitions to S₁ itself with probability 0.3 (outputting label a; label b is not output because its output probability is 0.0) or transitions to S₂ with probability 0.7 (outputting label a with probability 0.5 and label b with probability 0.5).

[0005] From state S₂, the model either transitions to S₂ itself with probability 0.2 (outputting label a or b with probabilities 0.3 and 0.7, respectively) or transitions to the final state S₃ with probability 0.8 (outputting label b; label a is not output because its output probability is 0.0). Now consider the probability (likelihood) that this HMM outputs the label sequence (sequence of feature data) abb. Only two state sequences are allowed by this HMM, S₁S₁S₂S₃ and S₁S₂S₂S₃, and their probabilities are, respectively:

0.3 * 1.0 * 0.7 * 0.5 * 0.8 * 1.0 = 0.0840
0.7 * 0.5 * 0.2 * 0.7 * 0.8 * 1.0 = 0.0392

Since either sequence is possible, the HMM outputs abb with a total probability of 0.0840 + 0.0392 = 0.1232.
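
The calculation above can be reproduced with a short Python sketch. It encodes the FIG. 5 probabilities as arcs and sums over all allowed state sequences with the forward algorithm, assuming (as the text implies) that a sequence must start in S₁ and end in the final state S₃. The names `ARCS` and `likelihood` are illustrative and do not appear in the original.

```python
# Each arc of the FIG. 5 HMM: (from_state, to_state, transition_prob, {label: output_prob}).
ARCS = [
    (1, 1, 0.3, {"a": 1.0, "b": 0.0}),
    (1, 2, 0.7, {"a": 0.5, "b": 0.5}),
    (2, 2, 0.2, {"a": 0.3, "b": 0.7}),
    (2, 3, 0.8, {"a": 0.0, "b": 1.0}),
]
INITIAL_STATE, FINAL_STATE = 1, 3  # assumption: start in S1, must end in S3


def likelihood(labels, arcs=ARCS, initial=INITIAL_STATE, final=FINAL_STATE):
    """Probability that the HMM outputs `labels`, summed over all state sequences."""
    alpha = {initial: 1.0}  # probability mass per state before any label is output
    for label in labels:
        nxt = {}
        for src, dst, p_trans, p_out in arcs:
            if src in alpha:
                # Taking an arc and emitting the label happen together.
                step = alpha[src] * p_trans * p_out.get(label, 0.0)
                nxt[dst] = nxt.get(dst, 0.0) + step
        alpha = nxt
    return alpha.get(final, 0.0)


print(round(likelihood("abb"), 4))  # 0.1232
```

The forward recursion gives the same total as enumerating the two state sequences by hand, but it scales to longer label sequences without listing every path.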

[0006] Therefore, if an HMM is learned for each word in advance, so that the state transition probabilities best suited to each word and the label output probabilities at each state transition are known, then when the label sequence of an unknown word is input, calculating the probability (likelihood) for each HMM shows which word's HMM is most likely to output that label sequence, and the word can be recognized. This is the method of speech recognition by HMM.

[0007] FIG. 6 is an example diagram showing the correspondence between a speech waveform, the time series of feature data, and the states of an HMM, for the utterance "hajime" ("beginning"). As shown, the HMM represents the time series of speech feature data with a small number of states, roughly equal to the number of phonemes in the word.

[0008] In a conventional speech recognition apparatus that recognizes word speech using HMMs, during learning, for each word to be registered in the apparatus, a small number of states, roughly equal to the number of phonemes in the word, is determined from changes in the phoneme spectrum and the like. The output probability of feature data at each state transition and the transition probabilities between states are then estimated by learning, and the word is modeled as an HMM. During recognition, the input speech is applied to all of these models, the likelihoods are calculated, and the word is recognized. HMM learning is generally said to require a large amount of speech data. However, as described in "A Study of Recognition Methods for Voice Dialing for Car Telephones" (Proceedings of the Acoustical Society of Japan, March 1991, 3-6-17), by "using, in common for every HMM, a variance calculated from the speech data of unspecified speakers independently of the recognition vocabulary", an HMM can be learned from a single utterance in the case of a speaker-dependent speech recognition apparatus.

[0009]

[Problem to be solved by the invention] When a speech recognition apparatus returns recognition results in real time, it does not wait until all of the input speech containing the word speech has been received and then compute the time series of feature data to locate the word portion. In practice, it computes feature data at each time step as the speech arrives (for example every 32 ms; this is also called analysis-frame synchronization) and judges whether a word portion is present. Recognition candidates are therefore calculated at every time step, and narrowing down the candidates becomes important in order to properly reject candidates that would cause misrecognition.

[0010] In this regard, as reported in "Spotting of Telephone Speech by HMM" (IEICE Technical Report, 1990, SP90-18), processing for narrowing down candidates is required, such as duration control that takes into account the duration of each HMM state (how many times the same state is repeated). In that report, for each word, a quadratic regression model is assumed between the word length and the duration of each HMM state, and the regression coefficients and the standard deviation of the duration are obtained. Rejection is finally decided by whether the predicted state duration and the actual duration agree within a certain tolerance.

[0011] Thus, the current approach is to obtain statistical information on HMM state transitions for each word during learning and to weight by that information during recognition. With this approach, however, unlike learning of the HMM itself, for which values obtained in advance independently of the words can be used, statistical information must be gathered for every word. A large amount of speech data, that is, many utterances, is therefore needed during learning, which increases the burden on the user and makes the apparatus inconvenient to use.

[0012] An object of the present invention is therefore to provide a speech recognition apparatus that recognizes word speech using Hidden Markov Models, in which the information needed to narrow down recognition candidates does not rely on per-word statistical information, so that the amount of speech data (number of utterances) required at learning (word registration) can be reduced.

[0013]

[Means for solving the problem] The feature data used for HMM-based recognition is mainly the cepstrum, which is the result of an inverse Fourier transform applied to the logarithm of the power spectrum of the original waveform, treating that logarithm as a waveform. The cepstrum contains two different kinds of information, vocal tract and sound source, and learning and recognition are normally performed using only the vocal tract information contained in the low-order components. Incidentally, Section 12.2, "Principle and Configuration of Speech Recognition", in Chapter 12 of "Fundamentals of Speech Information Processing" by Shuzo Saito and Kazuo Nakata (Ohmsha) states: "The driving sound source characteristics obtained by acoustic analysis include information such as the fundamental frequency and the source amplitude. These are used for preprocessing that roughly classifies phonemes, such as determining the extent of speech segments and distinguishing voiced from unvoiced sounds, and are almost never used as standard patterns."
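
As a rough illustration of the cepstrum described above, the following Python sketch computes a real cepstrum (inverse Fourier transform of the log power spectrum) with a naive DFT and splits it into low-order and high-order parts. The function names, the small constant `eps` that guards the logarithm, and the cutoff `n_low=12` are assumptions made for this example; the original does not specify a cutoff or an implementation.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform; scaled by 1/N when inverse."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * math.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def real_cepstrum(frame, eps=1e-12):
    """Inverse DFT of the log power spectrum of one analysis frame."""
    spectrum = dft(frame)
    log_power = [math.log(abs(v) ** 2 + eps) for v in spectrum]  # eps avoids log(0)
    return [c.real for c in dft(log_power, inverse=True)]


def split_cepstrum(cepstrum, n_low):
    """Split into low-order (vocal tract) and high-order (sound source) components."""
    half = len(cepstrum) // 2 + 1  # the real cepstrum is symmetric, keep one half
    return cepstrum[:n_low], cepstrum[n_low:half]


# Synthetic 64-sample frame made of two sinusoids (illustrative input only).
frame = [math.sin(2 * math.pi * 5 * t / 64) + 0.5 * math.sin(2 * math.pi * 12 * t / 64)
         for t in range(64)]
cep = real_cepstrum(frame)
low, high = split_cepstrum(cep, n_low=12)
```

A practical system would use an FFT and windowed frames of real speech; the naive DFT is used here only to keep the sketch free of external dependencies.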

[0014] The speech recognition apparatus of the present invention comprises: voice input means for inputting speech containing word speech; a word speech cutout unit that cuts out only the word speech portion from the speech containing word speech; a feature extraction unit that extracts feature data from the cut-out word speech; a state number estimation unit that estimates, from the feature data, the number of states for the word speech when it is modeled by an HMM; a learning unit that fits the feature data to a word model to obtain HMM parameters; a speech dictionary file consisting of the learned HMM parameters; a matching determination unit that calculates a likelihood for each word model and determines recognition candidates; and a determination result output unit that outputs the recognition result.

[0015] With this configuration, during learning, an HMM (referred to as the vocal tract HMM) is first obtained using the vocal tract information (the low-order components of the cepstrum). A separately prepared HMM (referred to as the sound source HMM) is then obtained using the sound source information (the high-order components of the cepstrum). During recognition, recognition is first performed with the vocal tract HMM to obtain recognition candidates. Recognition is then performed on those candidates with the sound source HMM to narrow them down.

[0016]

[Embodiments of the invention] According to the present invention, statistical information no longer has to be obtained for each word, so learning is possible with only one or two utterances, and candidates can still be narrowed down during recognition.

[0017] A speech recognition apparatus according to an embodiment of the present invention is described below with reference to the drawings.

[0018] FIG. 1 is a configuration block diagram of the speech recognition apparatus according to the embodiment of the present invention. In the figure, 1-1 is voice input means for inputting speech containing word speech; 1-2 is a word speech cutout unit that cuts out only the word speech portion from the speech containing word speech; 1-3 is a feature extraction unit that extracts feature data from the cut-out word speech; 1-4 is a state number estimation unit that estimates, from the feature data, the number of states for the word speech when it is modeled by an HMM; 1-5 is a learning unit that fits the feature data to a word model to obtain HMM parameters; 1-6 is a speech dictionary file consisting of the learned HMM parameters and likelihood information; 1-7 is a matching determination unit that calculates a likelihood for each word model and determines recognition candidates; and 1-8 is a determination result output unit that outputs the recognition result.

[0019] FIG. 2 is a circuit block diagram of the speech recognition apparatus according to the embodiment of the present invention, in which 2-1 is a microphone, 2-2 is a read-only memory (ROM), 2-3 is a central processing unit (CPU), 2-4 is a writable memory (RAM), 2-5 is a monitor, and 2-6 is a file device.

[0020] The voice input means 1-1 shown in FIG. 1 is realized by the microphone 2-1. The word speech cutout unit 1-2, the feature extraction unit 1-3, the state number estimation unit 1-4, the learning unit 1-5, and the matching determination unit 1-7 are realized as programs stored in the ROM 2-2. The speech dictionary file 1-6 is realized by the file device 2-6, and the determination result output unit 1-8 by the monitor 2-5.

[0021] FIG. 3 is a flowchart of registration in the embodiment of the present invention, and FIG. 4 is a flowchart of recognition in the same embodiment. The operation of the speech recognition apparatus configured as described above when a word speech is registered is described below with reference to the flowchart of FIG. 3.

[0022] In step (3-1), uttered speech containing word speech is input through the voice input means 1-1. In step (3-2), the word speech cutout unit 1-2 cuts the word speech out of the uttered speech. This can be done by detecting and removing the silent or low-noise portions before and after the word speech, based on the power of the speech or the like. In step (3-3), the feature extraction unit 1-3 performs feature extraction by cepstrum analysis, obtaining the cepstrum of the word speech. In step (3-4), the state number estimation unit 1-4 estimates the number of states for the word speech from the feature data extracted in step (3-3). Two values are obtained: one using the low-order components, which carry vocal tract information, and one using the high-order components, which carry sound source information. The number of states can be estimated on the basis of "On the Number of States and Number of Mixtures of HMMs in Continuous Digit Speech Recognition" (Proceedings of the Acoustical Society of Japan, March 1990).
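
The power-based cutout of step (3-2) could look roughly like the following Python sketch, which drops leading and trailing frames whose mean power is at or below a threshold. The frame length, the threshold value, and the function names are assumptions chosen for illustration; the original only says that silent or low-noise portions are detected from the power of the speech or the like.

```python
def frame_powers(samples, frame_len):
    """Mean power of each non-overlapping frame of `frame_len` samples."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def cut_out_word(samples, frame_len=4, threshold=0.01):
    """Keep the span from the first to the last frame whose power exceeds the threshold."""
    powers = frame_powers(samples, frame_len)
    active = [i for i, p in enumerate(powers) if p > threshold]
    if not active:
        return []  # no frame looked like speech
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return samples[start:end]


# Toy signal: silence, a short burst, silence (hypothetical values).
burst = [0.5, -0.5, 0.4, -0.4] * 2
signal = [0.0] * 8 + burst + [0.0] * 8
word = cut_out_word(signal)
```

Frames inside the word that happen to be quiet are kept, because only the portions before and after the word are removed.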

[0023] In step (3-5), the learning unit 1-5 learns the feature data of the word speech using an HMM with the number of states obtained in step (3-4), obtains an HMM consisting of the transition probabilities between states and the output probabilities of feature data at each transition, and stores it in the speech dictionary file 1-6. Two HMMs are obtained: the vocal tract HMM, which uses the low-order components (vocal tract information) as feature data, and the sound source HMM, which uses the high-order components (sound source information).

[0024] Next, the operation when a word speech is recognized is described with reference to the flowchart of FIG. 4. In step (4-1), uttered speech containing word speech is input through the voice input means 1-1. In step (4-2), the feature extraction unit 1-3 extracts features from the input speech. In step (4-3), the matching determination unit 1-7 uses the low-order components of the feature data of the input speech to calculate likelihoods on the vocal tract HMM of each word model read from the speech dictionary file 1-6, and determines the several word models with the highest likelihoods to be recognition candidates. For example, suppose the four words with the highest likelihoods (a1, a2, a3, a4) (not shown) are taken as recognition candidates and their likelihoods are 0.98, 0.97, 0.95, and 0.87, respectively (not shown). For convenience in the following description, these likelihoods are called the "vocal tract likelihoods".

[0025] In step (4-4), the matching determination unit 1-7 uses the high-order components of the feature data of the input speech to calculate likelihoods on the sound source HMMs of the word models that were chosen as recognition candidates in step (4-3), read from the speech dictionary file 1-6, and makes the final decision by narrowing the candidates down to those with high likelihood. For example, suppose that the likelihoods calculated on the sound source HMMs for the four words a1, a2, a3, and a4 from step (4-3) are 0.54, 0.98, 0.98, and 0.88, respectively. Calling these the "sound source likelihoods", the recognition candidates can be narrowed down as follows:

likelihood for candidate narrowing = α × (vocal tract likelihood) + β × (sound source likelihood), where α + β = 1.

[0026] These coefficients can be determined in advance by evaluating the apparatus. For example, with α = 0.8 and β = 0.2, the candidate-narrowing likelihoods in this example are:

word a1: 0.8 × 0.98 + 0.2 × 0.54 = 0.892
word a2: 0.8 × 0.97 + 0.2 × 0.98 = 0.972
word a3: 0.8 × 0.95 + 0.2 × 0.98 = 0.956
word a4: 0.8 × 0.87 + 0.2 × 0.88 = 0.872

Word a2, which ends up with the highest likelihood, becomes the recognition result. In step (4-5), the determination result output unit 1-8 notifies the user of the recognition result.
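
The weighted combination in this example can be checked with a few lines of Python. The likelihood values and the weights α = 0.8 and β = 0.2 come from the text; the function and variable names are illustrative only.

```python
def combined_likelihood(vocal_tract, sound_source, alpha=0.8, beta=0.2):
    """Candidate-narrowing likelihood: alpha * vocal tract + beta * sound source."""
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha + beta must equal 1")
    return alpha * vocal_tract + beta * sound_source


# Likelihoods of the four candidates from steps (4-3) and (4-4).
vocal_tract = {"a1": 0.98, "a2": 0.97, "a3": 0.95, "a4": 0.87}
sound_source = {"a1": 0.54, "a2": 0.98, "a3": 0.98, "a4": 0.88}

scores = {w: combined_likelihood(vocal_tract[w], sound_source[w]) for w in vocal_tract}
best = max(scores, key=scores.get)

for word, score in scores.items():
    print(word, round(score, 3))
print("recognized:", best)  # recognized: a2
```

Note that a1 has the highest vocal tract likelihood on its own, so the sound source term is what moves the decision to a2 in this example.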

[0027]

[Effects of the invention] According to the speech recognition apparatus of the present invention, statistical information does not have to be obtained for each word even when candidates are narrowed down. Learning is therefore possible with only one or two utterances, the user is spared the inconvenience of uttering a word many times when registering (learning) it, and usability is improved.

[Brief description of drawings]

FIG. 1 is a configuration block diagram of a speech recognition apparatus according to an embodiment of the present invention.

FIG. 2 is a circuit block diagram of the speech recognition apparatus according to the embodiment of the present invention.

FIG. 3 is a flowchart of registration in the embodiment of the present invention.

FIG. 4 is a flowchart of recognition in the embodiment of the present invention.

FIG. 5 is an example diagram of a Hidden Markov Model.

FIG. 6 is an example diagram showing the correspondence between a speech waveform, the time series of feature data, and the states of an HMM.

[Explanation of symbols]

1-1 Voice input means
1-2 Word speech cutout unit
1-3 Feature extraction unit
1-4 State number estimation unit
1-5 Learning unit
1-6 Speech dictionary file
1-7 Matching determination unit
1-8 Determination result output unit
2-1 Microphone
2-2 Read-only memory (ROM)
2-3 Central processing unit (CPU)
2-4 Writable memory (RAM)
2-5 Monitor
2-6 File device

Claims

[Claims]

1. A speech recognition apparatus comprising: voice input means for inputting speech containing word speech; a word speech cutout unit that cuts out only the word speech portion from the speech containing word speech; a feature extraction unit that extracts feature data from the cut-out word speech; a state number estimation unit that estimates, from the feature data, the number of states for the word speech when it is modeled by an HMM; a learning unit that fits the feature data to a word model to obtain HMM parameters; a speech dictionary file consisting of the learned HMM parameters; a matching determination unit that calculates a likelihood for each word model and determines recognition candidates; and a determination result output unit that outputs the recognition result.