JP2010145784A

JP2010145784A - Voice recognizing device, acoustic model learning apparatus, voice recognizing method, and program

Info

Publication number: JP2010145784A
Application number: JP2008323495A
Authority: JP
Inventors: Hiroyasu Ide; 博康井手
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2008-12-19
Filing date: 2008-12-19
Publication date: 2010-07-01
Anticipated expiration: 2028-12-19
Also published as: JP5315976B2

Abstract

PROBLEM TO BE SOLVED: To accurately perform speech recognition using a hidden Markov model.

SOLUTION: A feature amount extracting means 111 assigns frames to the input speech and extracts a feature amount for each frame. A cumulative likelihood calculating means 113 uses a hidden Markov model to calculate the cumulative likelihood for each state in each frame. In frames representing a consonant, the normal likelihood calculation is performed; in frames representing a vowel, the likelihood calculation is performed for each group of similar models. Once the group giving the maximum cumulative likelihood has been determined, the likelihood calculation is performed only for that group. Consonants, which differ little between speakers, are thus separated from vowels, which differ greatly, and recognition that accounts for individual differences is applied only to vowels, which increases the accuracy of speech recognition.

COPYRIGHT: (C)2010, JPO&INPIT

Description

The present invention relates to a speech recognition device, an acoustic model learning device, a speech recognition method, and a program, and in particular to a speech recognition device, an acoustic model learning device, a speech recognition method, and a program capable of performing speech recognition using a hidden Markov model with high accuracy.

If only a single sound, for example "a", needs to be recognized, machine speech recognition can be performed as follows.
First, frames (time windows) of a predetermined length are set on the speech signal waveform, and a numerical feature amount is extracted from each frame.
Next, the feature amount extracted from each frame is compared with acoustic models, which serve as standard patterns.
The sound of the acoustic model that matches the feature amount is taken as the recognition result.

For example, Japanese has five vowels: "a", "i", "u", "e", and "o". If the feature amount extracted from the waveform captured by a microphone and converted into an electrical signal matches the acoustic model for "a", then "a" is taken as the recognition result.

Here, an acoustic model is a set of feature amounts prepared in advance: one feature amount for "a", another for "i", and so on. Preparing them in this way corresponds to training the acoustic model.

For a personal recognition device, the individual user can train the device with such acoustic models. However, speech recognition must sometimes handle the voice of an unknown person, as in telephone answering at a public institution or a dictation device used by unspecified people.

For that reason, acoustic models are trained on the voices of as many people as possible. In that case each acoustic model is prepared as a numerical range, one range for "a", another for "i", and so on. Recognition is then decided by which range the feature amount obtained through the microphone falls into: "a" if it falls in the range for "a", "i" if it falls in the range for "i".

When recognizing human speech, however, a sequence of phonemes must be recognized, and even the same phoneme "a" yields different feature amounts depending on the phonemes that precede and follow it. The method described above therefore cannot be used for speech recognition.

In general, speech recognition treats a phoneme sequence as transitions from one steady state to another and assumes that these transitions form a Markov process. Using a hidden Markov model (hereinafter "HMM") as the acoustic model, a statistical method probabilistically estimates, from the speech signal, the words that produced that signal.

In this method, a likelihood is calculated that indicates which HMM outputs the observed feature amounts with the highest probability, and the word corresponding to the HMM that maximizes this probability is output as the speech recognition result. Such a speech recognition method is disclosed in, for example, Patent Document 1.

This likelihood is obtained, for example, by evaluating the Gaussian distribution of Equation (1) below.

$$P_m(Y;\mu_m,\Sigma_m)=\frac{1}{\sqrt{(2\pi)^{n}\,|\Sigma_j|}}\exp\left(-\frac{1}{2}(y_t-\mu_t)^{T}\Sigma^{-1}(y_t-\mu_t)\right) \qquad (1)$$
J: number of states, t: time
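The density in Equation (1) can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes a diagonal covariance matrix (so the determinant is a product of variances), and the function name `gaussian_density` and its parameters are hypothetical.

```python
import math

def gaussian_density(y, mean, var):
    """Gaussian density at feature vector y, assuming a diagonal covariance.

    y, mean, var are equal-length lists; var holds the diagonal of Sigma.
    """
    n = len(y)
    det = math.prod(var)  # determinant of a diagonal covariance matrix
    # Quadratic form (y - mean)^T Sigma^{-1} (y - mean) for diagonal Sigma
    quad = sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mean, var))
    return math.exp(-0.5 * quad) / math.sqrt((2 * math.pi) ** n * det)
```

In practice such densities are usually evaluated in the log domain to avoid numerical underflow; the direct form is kept here only because it mirrors Equation (1).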

The cumulative likelihood is then updated by the Viterbi algorithm, in which the likelihood calculated for each HMM is accumulated onto the maximum of the cumulative likelihood values calculated in the previous frame.

In speech recognition that performs these calculations, the HMMs are created by training on a large amount of utterance data. For speaker-independent recognition in particular, utterance data is collected broadly across age groups, genders, and so on, so that anyone's speech can be recognized.

However, while anyone's speech can be recognized, the range of values the HMM can take for each phoneme (in practice, a range of multidimensional vectors) widens, and recognition accuracy may fall as a result.

Patent Document 1: JP 2001-356790 A

The present invention has been made in view of the above circumstances, and its object is to perform speech recognition using a hidden Markov model (HMM) with high accuracy.

To achieve the above object, a speech recognition device according to a first aspect of the present invention comprises:
a storage unit that stores an acoustic model for consonant recognition trained on all speech data and a plurality of acoustic models for vowel recognition each trained on the speech data of one group;
probability calculation means for calculating the state transition probability of each phoneme of the input speech, based on feature amounts extracted from the input speech for each of a plurality of frames of predetermined length and on the acoustic models stored in the storage unit;
likelihood calculation means for accumulating the calculated state transition probabilities and calculating a likelihood for each acoustic model;
cumulative likelihood calculation means for sequentially calculating the cumulative value of the likelihoods calculated in frames preceding the current frame; and
speech recognition means for recognizing the input speech based on the cumulative likelihood calculated by the cumulative likelihood calculation means.

The acoustic models are divided in this way for the following reasons. Models are separated into consonant recognition and vowel recognition because consonants differ little between speakers, whereas vowels differ greatly because of the influence of the vocal cords. The acoustic model for vowel recognition is further divided into several models to accommodate these individual differences in vowels.

The speech recognition device desirably further comprises:
frame identification means for determining whether the speech in each frame is a vowel or a consonant; and
group determination means for determining, when the input speech is a vowel, the group on which the acoustic model for vowel recognition was trained.

This is because, once a predetermined number of vowels or more have been recognized, it is desirable to determine the group so that recognition can proceed efficiently.
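The idea of fixing the group after enough vowels have been seen can be illustrated with a small sketch. It assumes that a cumulative log-likelihood is tracked per vowel group and that the group with the largest value is chosen; the function name `choose_group`, its parameters, and the threshold handling are hypothetical and are not taken from the text.

```python
def choose_group(group_scores, vowels_seen, min_vowels):
    """Return the index of the best vowel group, or None if it is too early.

    group_scores[g] is the cumulative log-likelihood tracked for group g.
    """
    if vowels_seen < min_vowels:
        return None  # keep evaluating every group
    # Pick the group whose cumulative log-likelihood is largest
    return max(range(len(group_scores)), key=lambda g: group_scores[g])
```

Once a group index is returned, later vowel frames would only need to be scored against that group's models, which is where the efficiency gain comes from.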

To achieve the above object, an acoustic model learning device according to a second aspect of the present invention comprises:
a storage unit that stores an acoustic model for consonant recognition trained on all speech data and an acoustic model for vowel recognition for each group, trained on that group's speech data;
group number specifying means for specifying the number of groups of acoustic models for vowel recognition;
distance calculation means for calculating the distance between groups of the acoustic models for vowel recognition;
grouping means for merging the two groups separated by the shortest distance into one group; and
group number determination means for determining whether the total number of groups has become equal to or less than the specified number.

To achieve the above object, a speech recognition method according to a third aspect of the present invention is a method for improving the accuracy of speech recognition performed by a given device using acoustic models, the method comprising:
a model acquisition step of acquiring an acoustic model for consonant recognition trained on all speech data and a plurality of acoustic models for vowel recognition each trained on the speech data of one group;
a feature amount extraction step of setting a plurality of frames of predetermined length on the target speech at a predetermined period and extracting a feature amount for each frame;
a probability calculation step of calculating the state transition probability of each phoneme of the target speech based on the feature amount extracted in each frame;
a likelihood calculation step of accumulating the calculated state transition probabilities and calculating a likelihood for each acoustic model;
a cumulative likelihood calculation step of sequentially calculating the cumulative likelihood based on the likelihood calculated for each acoustic model and the maximum likelihood calculated in frames preceding the current frame; and
a speech recognition step of performing speech recognition based on the calculated cumulative likelihood.

To achieve the above object, a program according to a fourth aspect of the present invention causes a computer to function as a speech recognition device that:
stores an acoustic model for consonant recognition trained on all speech data and a plurality of acoustic models for vowel recognition each trained on the speech data of one group;
captures target speech, sets a plurality of frames of predetermined length on the captured speech at a predetermined period, and extracts a feature amount for each frame;
calculates state transition probabilities based on the feature amount extracted in each frame;
accumulates the calculated state transition probabilities and calculates a likelihood for each acoustic model;
sequentially calculates the cumulative likelihood based on the likelihood calculated for each acoustic model and the maximum likelihood calculated in frames preceding the current frame; and
performs speech recognition based on the calculated cumulative likelihood.

According to the present invention, speech recognition using a hidden Markov model (HMM) can be performed with high accuracy.

Embodiments of the present invention are described below with reference to the drawings.

(Embodiment 1)
(Speech recognition device)
FIG. 1 is a block diagram showing the configuration of a speech recognition device according to an embodiment of the present invention. As shown, the speech recognition device 100 comprises a control unit 110, an input control unit 120, an output control unit 130, a program storage unit 140, and a storage unit 150.

The control unit 110 comprises, for example, a CPU (Central Processing Unit) and a predetermined storage device serving as a work area, such as RAM (Random Access Memory). It controls each part of the speech recognition device 100 and executes the processes described later based on predetermined operation programs stored in the program storage unit 140.

The input control unit 120 comprises, for example, an ADC (Analog-to-Digital Converter) that performs sampling such as PCM (Pulse Code Modulation), and converts the analog speech signal input from a predetermined input device 12, such as a microphone, into a digital signal.

The output control unit 130 is connected to a predetermined output device 13, such as a speaker or a display device, and outputs the speech recognition results produced by the control unit 110 and similar information from the output device 13.

The program storage unit 140 comprises a predetermined storage device such as a ROM (Read Only Memory), flash memory, or hard disk device, and stores the various operation programs executed by the control unit 110. The following operation programs are stored in the program storage unit 140. Each process of the speech recognition device 100 described later is realized by the control unit 110 executing these operation programs.

(1) "Feature amount extraction program": a program that extracts the feature amounts (feature parameters) of the speech signal converted by the input control unit 120
(2) "Likelihood calculation program": a program that calculates the likelihood for each frame and also calculates the cumulative likelihood
(3) "Speech recognition program": a program that performs speech recognition based on the calculated cumulative likelihood and the acoustic models

By executing the programs stored in the program storage unit 140, the control unit 110 functions as feature amount extraction means 111, likelihood calculation means 112, cumulative likelihood calculation means 113, node creation means 114, and speech recognition means 115, as shown in FIG. 2. FIG. 2 is a functional block diagram schematically showing the functions of the control unit 110.

The feature amount extraction means 111 sets a plurality of frames of predetermined length, at a predetermined period, on the speech signal converted by the input control unit 120, and extracts the power component (feature amount) of each frame.
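Framing at a fixed period and extracting a per-frame power value can be sketched as follows. The text does not specify how the power component is computed, so this example assumes mean squared amplitude; the name `frame_powers` and the parameters `frame_len` and `hop` are hypothetical.

```python
def frame_powers(samples, frame_len, hop):
    """Split samples into frames of frame_len taken every hop samples
    and return the mean power (mean squared amplitude) of each frame."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers

# Frames of 4 samples taken every 2 samples overlap by half.
print(frame_powers([1, 1, 1, 1, 2, 2, 2, 2], frame_len=4, hop=2))
```

When `hop` is smaller than `frame_len` the frames overlap, which is the usual arrangement for speech features.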

The likelihood calculation means 112 compares the feature amount extracted for each frame with the hidden Markov models (HMMs) stored in the acoustic model storage unit 153, described later, to perform continuous phoneme recognition for each frame and calculate the state transition probability (likelihood) for each HMM. A predetermined number of states is defined in advance for each phoneme, and the probability of a transition from one state of a phoneme to another state is obtained by comparing the acquired feature amount with the HMM. For example, the phonemes of the word "Hachinohe" are "h, a, ch, i, n, o, h, e"; with three states per phoneme, they can be written "h1, h2, h3", "a1, a2, a3", "ch1, ch2, ch3", and so on. In this embodiment, the following processing assumes three states per phoneme.
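The state labelling in the "Hachinohe" example can be reproduced with a short helper. The function name `expand_states` is hypothetical; the labels follow the "h1, h2, h3" pattern given in the text.

```python
def expand_states(phonemes, n_states=3):
    """Expand each phoneme into numbered state labels, e.g. 'h' -> h1, h2, h3."""
    return [f"{p}{i}" for p in phonemes for i in range(1, n_states + 1)]

hachinohe = ["h", "a", "ch", "i", "n", "o", "h", "e"]
print(expand_states(hachinohe)[:9])
```

With eight phonemes and three states each, the word is represented by 24 states in total.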

The cumulative likelihood calculation means 113 obtains the cumulative value of the likelihood for each state in each frame, based on the likelihoods calculated so far by the likelihood calculation means 112.

Based on the grammar information stored in the grammar storage unit 154, described later, the node creation means 114 expands the candidate words acquired from the dictionary storage unit 155, described later, in association with their cumulative likelihoods.

The speech recognition means 115 acquires a candidate word as the speech recognition result, based on the cumulative likelihoods expanded by the node creation means 114, and outputs it.

The storage unit 150 comprises a storage device such as RAM (Random Access Memory), flash memory, or a hard disk device, and stores the various information required for the speech recognition processing of the speech recognition device 100. As shown in FIG. 3, the storage unit 150 comprises a speech storage unit 151, a feature storage unit 152, an acoustic model storage unit 153, a grammar storage unit 154, a dictionary storage unit 155, and a cumulative likelihood storage unit 156.

The speech storage unit 151 buffers the digital signal converted by the input control unit 120 as it arrives.

The feature storage unit 152 stores, as it is produced, information indicating the feature amount of each frame extracted by the feature amount extraction means 111 (hereinafter "feature amount data").

For the language supported by the speech recognition device 100, the acoustic model storage unit 153 stores in advance acoustic models (phoneme models) that model all the phonemes making up the speech to be recognized. In this embodiment, hidden Markov models (HMMs) are used as the acoustic models. The acoustic model storage unit 153 of this embodiment divides the HMMs into consonants and vowels and stores HMMs for consonant recognition, trained on all speech data, and HMMs for vowel recognition. The HMMs for vowel recognition consist of a plurality of HMMs, each trained on the speech data of one group.
This grouping is described below.

In human speech, vowels carry more individual variation than consonants. The HMMs are therefore divided into groups of speakers for vowels only. Specifically, distances between mel-frequency cepstral coefficients (MFCCs) are calculated, speakers whose distances are small are grouped together, and speech recognition can then be performed for each group.

Let $N$ be the number of MFCC dimensions. Suppose the language to be recognized has $V$ vowels, numbered 0 to $V-1$. Let $M$ be the number of speakers, and let $K_{m,v}$ be the number of phoneme segments of vowel $v$ uttered by the $m$-th speaker.

Let the MFCC of the $k$-th segment of vowel $v$ uttered by the $m$-th speaker be $C_{m,k,v}=\{c_{n,m,k,v}\mid n=0,\dots,N-1\}$, and define its mean $G_{m,v}=\{g_{n,m,v}\mid n=0,\dots,N-1\}$ by
$$g_{n,m,v}=\frac{1}{K_{m,v}}\sum_{k=0}^{K_{m,v}-1}c_{n,m,k,v} \qquad (2)$$

The distance $D(m_1,m_2)$ between the vowels uttered by the $m_1$-th speaker and the $m_2$-th speaker is defined by
$$D(m_1,m_2)=\sum_{v=0}^{V-1}\sum_{n=0}^{N-1}s_n\left(g_{n,m_1,v}-g_{n,m_2,v}\right)^{2} \qquad (3)$$
Here, $s_n$ is the weighting coefficient for the $n$-th MFCC dimension.
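Equations (2) and (3) translate directly into code. The sketch below uses plain Python lists; the function names `mean_mfcc` and `speaker_distance` and the data layout (one list of mean vectors per speaker, indexed by vowel) are assumptions made for illustration.

```python
def mean_mfcc(segments):
    """Equation (2): average the MFCC vectors of one speaker's segments of one vowel."""
    k = len(segments)
    return [sum(seg[n] for seg in segments) / k for n in range(len(segments[0]))]

def speaker_distance(g1, g2, weights):
    """Equation (3): weighted squared distance between two speakers.

    g1[v][n] and g2[v][n] hold the mean MFCC of vowel v, dimension n;
    weights[n] is the coefficient s_n.
    """
    return sum(
        weights[n] * (a[n] - b[n]) ** 2
        for a, b in zip(g1, g2)          # loop over vowels v
        for n in range(len(weights))     # loop over MFCC dimensions n
    )
```

A speaker's entry `g1` would be built by calling `mean_mfcc` once per vowel on that speaker's segments.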

Suppose further that gathering speakers whose distances are small has produced a group A of $M_A$ speakers and a group B of $M_B$ speakers. The distance $DG(A,B)$ between group A and group B is then defined by
$$DG(A,B)=\max_{i,j} D(ma_i,\,mb_j) \qquad (4)$$

Here, $ma_i$ ranges over $\{ma_i\mid i=0,\dots,M_A-1\}$, where $M_A$ is the number of speakers in group A, and $mb_j$ ranges over $\{mb_j\mid j=0,\dots,M_B-1\}$, where $M_B$ is the number of speakers in group B.
Through the above calculations, the vowel HMMs are divided into a predetermined number of groups.
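The grouping can be sketched as a merge loop that uses the group distance of Equation (4) and repeatedly joins the two closest groups until the requested number of groups is reached. Starting from one speaker per group is an assumption, as are the names `group_distance` and `cluster_speakers`; `dist` is a precomputed matrix of speaker distances such as $D(m_1,m_2)$, and `n_groups` is assumed to be at least 1.

```python
def group_distance(a, b, dist):
    """Equation (4): largest pairwise distance between members of groups a and b."""
    return max(dist[i][j] for i in a for j in b)

def cluster_speakers(dist, n_groups):
    """Merge the two closest groups until at most n_groups remain."""
    groups = [[m] for m in range(len(dist))]  # assumed start: one speaker per group
    while len(groups) > n_groups:
        pairs = [(x, y) for x in range(len(groups)) for y in range(x + 1, len(groups))]
        # Pair of groups with the shortest group distance
        x, y = min(pairs, key=lambda p: group_distance(groups[p[0]], groups[p[1]], dist))
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return groups
```

Taking the maximum pairwise distance as the group distance corresponds to what is commonly called complete-linkage clustering, which tends to produce compact groups.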

Instead of the distance, the magnitude of the angle between the N-dimensional vectors, $DA$, can also be used. It is obtained by
$$DA(m_1,m_2)=\sum_{v=0}^{V-1}\left[\frac{\sum_{n=0}^{N-1}g_{n,m_1,v}\,g_{n,m_2,v}}{\sqrt{\sum_{n=0}^{N-1}g_{n,m_1,v}^{2}}\;\sqrt{\sum_{n=0}^{N-1}g_{n,m_2,v}^{2}}}\right] \qquad (5)$$
In a broad sense, this too can be regarded as a distance between the $m_1$-th speaker and the $m_2$-th speaker.
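Equation (5) sums, over the vowels, the cosine of the angle between two speakers' mean MFCC vectors. A minimal sketch follows; the name `angle_measure` and the list-of-lists layout are assumptions.

```python
import math

def angle_measure(g1, g2):
    """Equation (5): sum over vowels of the cosine between mean MFCC vectors.

    g1[v] and g2[v] are the mean MFCC vectors of vowel v for the two speakers.
    """
    total = 0.0
    for a, b in zip(g1, g2):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        total += dot / norm
    return total
```

As written, this quantity grows as two speakers become more alike, so a procedure that uses it in place of $D$ would presumably treat larger values as "closer"; that reading is an interpretation and is not stated in the text.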
This concludes the description of the acoustic model storage unit 153; the remaining parts of the storage unit 150 are described next.

The grammar storage unit 154 stores a file that defines the grammar rules of the language supported by the speech recognition device 100.

The dictionary storage unit 155 stores a word dictionary in which phoneme pattern sequence information is registered for each word of the language supported by the speech recognition device 100.

The cumulative likelihood storage unit 156 stores cumulative likelihood information indicating the cumulative likelihood calculated by the cumulative likelihood calculation means 113. That is, when the cumulative likelihood calculation means 113 calculates the cumulative likelihood, the node creation means 114 expands a cumulative likelihood map such as the one shown in FIG. 4 into the cumulative likelihood storage unit 156. In the example of FIG. 4, cumulative likelihood values are expanded for each state number in each frame for the word "Kesennuma". In this embodiment, the reciprocal of the calculated cumulative likelihood is stored in the cumulative likelihood storage unit 156. Accordingly, among the cumulative likelihood values shown in FIG. 4, a smaller number indicates a larger likelihood.

The operation of the speech recognition device 100 configured as described above is explained below with reference to the drawings. Each operation described below is realized by the control unit 110 executing any or all of the programs stored in the program storage unit 140 as needed.

First, an outline of the speech recognition operation ("speech recognition processing") performed by the speech recognition device 100 according to the embodiment of the present invention is described with reference to the flowchart shown in FIG. 5. This speech recognition processing starts when speech is input from the input device 12 of the speech recognition device 100 and the speech signal digitized by the input control unit 120 is buffered in the speech storage unit 151.

First, the feature amount extraction means 111 assigns frames of predetermined length, at a predetermined period, to the speech signal buffered in the speech storage unit 151, extracts the feature amount of each frame, and stores the feature amount data in the feature storage unit 152 (step S501). The "frame number" identifying each frame is assigned starting from 0.

The likelihood calculation means 112 then sets the frame pointer (f), which designates the frame number, to the initial value 0 (step S502).

Next, the likelihood calculation means 112 determines whether a likelihood calculation was performed on the frame immediately preceding the current frame (step S503). Because step S502 designated frame 0, no likelihood calculation has been performed on a preceding frame. Processing therefore proceeds to step S601 shown in FIG. 6.

The likelihood calculation means 112 then sets the state number pointer (s), which indicates the state number in the current frame, to the initial value 0 (step S601).

The likelihood calculation means 112 also sets the likelihood for that state to the initial value 0 (step S602).

Next, the likelihood calculation means 112 performs a probability calculation using the Gaussian distributions stored in the acoustic model storage unit 153 (step S603). This calculation uses Equation (1) above; in practice, however, the distribution is a Gaussian mixture, so a weighted sum of the normal distributions of Equation (1) is computed. The likelihood is then updated with the probability calculated in step S603 (step S604). Information indicating the calculated probability and likelihood is held in a predetermined storage area such as the work area.
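The weighted sum of normal distributions mentioned for step S603 can be sketched as follows. The example assumes diagonal covariances and mixture weights that sum to 1; the name `gmm_density` and its parameters are hypothetical.

```python
import math

def gmm_density(y, weights, means, variances):
    """Weighted sum of diagonal-covariance Gaussian densities at feature vector y."""
    total = 0.0
    for w, mean, var in zip(weights, means, variances):
        # Quadratic form and normalizer of one mixture component
        quad = sum((yi - mi) ** 2 / vi for yi, mi, vi in zip(y, mean, var))
        norm = math.sqrt((2 * math.pi) ** len(y) * math.prod(var))
        total += w * math.exp(-0.5 * quad) / norm
    return total
```

Each state of each HMM would carry its own set of weights, means, and variances, and this function would be evaluated once per state per frame.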

The likelihood calculation means 112 then determines whether any further states remain in the current frame (step S605).

If further states remain in the frame (step S605: No), the state number pointer (s) is incremented by 1 (step S606), and the probability calculation and likelihood update are performed using the Gaussian distribution for the next state (steps S603 and S604).

When the probability calculation and likelihood update have been completed for all states (step S605: Yes), the cumulative likelihood calculation means 113 uses the likelihoods calculated for the states of the frame to calculate and update the cumulative likelihood for each state, for example by the Viterbi algorithm (step S607), and the node creation means 114 expands the candidate words in association with their cumulative likelihoods.

When the cumulative likelihood for the frame has been updated, the likelihood calculating means 112 determines whether any further frame remains (step S507). If a further frame remains (step S507: No), the likelihood calculating means 112 increments the frame pointer (f) by 1 (step S508) and performs the same processing from step S503 onward for the next frame.

As described above, the likelihood calculation was performed for the first frame (frame 0), so step S503 determines that a probability calculation was performed in the immediately preceding frame (step S503: Yes). In this case, the likelihood calculating means 112 refers to the cumulative likelihoods expanded in the cumulative likelihood storage unit 156 and identifies the state number with the maximum cumulative likelihood (step S504). By examining the model and state number of the part with the maximum cumulative likelihood in each frame, it can determine whether the speech in that part is a consonant or a vowel.

In the example of FIG. 4, the maximum cumulative likelihood in the 19th frame is "4939" (as described above, the reciprocal of the cumulative likelihood is taken, so the value with the smallest absolute value indicates the maximum likelihood), and the corresponding state is "k3". Because "k3" is the third state of the "k" in "kesennuma" (k・e・s・e・N・n・u・m・a), the frame is identified as a consonant.

In this way, the likelihood calculating means 112 determines whether the speech of the frame is a consonant or a vowel (step S505).
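The decision in steps S504 and S505 can be sketched as below. The sketch assumes that the stored values behave like costs (the smallest value marks the most likely state, as in FIG. 4), that state labels consist of a phoneme symbol followed by a state index, and that the vowel set is the five Japanese vowels; the function name and the label format are hypothetical.

```python
VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel inventory


def classify_frame(costs):
    """Return the best state label and whether it belongs to a vowel.

    costs maps a state label such as "k3" to its stored value, where the
    smallest value indicates the maximum cumulative likelihood.
    """
    best_state = min(costs, key=costs.get)
    phoneme = best_state.rstrip("0123456789")  # "k3" -> "k"
    kind = "vowel" if phoneme in VOWELS else "consonant"
    return best_state, kind
```

For instance, `classify_frame({"k3": 4939, "e1": 5120})` selects "k3" and reports a consonant; the value 5120 is invented here purely for illustration.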

If the speech is not a vowel (step S505: No), the process proceeds to step S601 shown in FIG. 6, described above.
If the speech is a vowel (step S505: Yes), the vowel comparison process is performed in step S506. This process is described below with reference to FIGS. 7 and 8.

In FIG. 7, the likelihood calculating means 112 first determines whether the group to be compared with the feature data has already been determined (step S701). This can be done, for example, by referring to a flag indicating that the group has been determined.

When the group has been determined (step S701: Yes), only the vowel HMMs of the determined group, out of all vowel HMMs, are considered (step S702). This can be done, for example, by setting the number of the determined group in the pointer g.

Then the initial value "0" is set in the state number pointer sg of that group (step S711), and steps S712 to S717 perform the same processing as steps S602 to S607 described above, with the vowel HMMs of that group as the comparison target.

After this processing ends, the process returns to step S507, and steps S507 to S509 are performed as already described.

If, on the other hand, the group has not been determined in step S701 (step S701: No), the processing shown in FIG. 8 is performed.

In FIG. 8, the initial value "0" is first set in the per-group counters that count the number of times each group is referenced (step S801). Then, so that all groups are processed as comparison targets, the number "1" of the first group is set in the pointer g (step S802).

Then the initial value "0" is set in the state number pointer sg of that group (step S811), and steps S812 to S817 perform the same processing as steps S712 to S717 and steps S602 to S607 described above, with the vowel HMMs of the first group as the comparison target.

When the processing of the first group ends, the likelihood calculating means 112 determines whether any further group remains (step S818). If a further group remains (step S818: No), the likelihood calculating means 112 increments the group pointer (g) by 1 (step S819) and performs the same processing from step S811 onward for the next group.

If no further group remains (step S818: Yes), the above processing ends, and the counter of the group that produced the HMM with the highest probability is incremented by 1 (step S820). It is then determined whether it is time to determine the group (step S821). This can be determined, for example, by checking whether the vowel comparison process has been performed at least a predetermined number of times.

If it is time to determine the group (step S821: Yes), the number of the group whose group counter cnt(g) shows the largest value is set as the determined group (step S822). With this setting, in the processing of the next frame, the number of the determined group is copied to the group pointer (g) in step S702 described above, and the vowel comparison process is performed only for the determined group.
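The counting scheme of steps S820 to S822 can be sketched as follows. The sketch assumes that each vowel frame yields one best score per group and that the decision is made once a fixed number of vowel comparisons has been reached; the function name, the dictionary layout, and the parameter `min_comparisons` are hypothetical.

```python
def update_group_vote(counters, group_scores, comparisons_done, min_comparisons):
    """Count one vote and return the chosen group, or None if undecided.

    counters        : dict mapping group number -> votes so far (updated in place)
    group_scores    : dict mapping group number -> best HMM probability in
                      that group for the current vowel frame
    comparisons_done: vowel comparisons completed before this frame
    min_comparisons : comparisons required before a group is fixed
    """
    # Step S820: credit the group that produced the most probable HMM.
    winner = max(group_scores, key=group_scores.get)
    counters[winner] = counters.get(winner, 0) + 1
    # Step S821: is it time to decide?
    if comparisons_done + 1 >= min_comparisons:
        # Step S822: fix the group with the largest counter.
        return max(counters, key=counters.get)
    return None
```

Once a group number is returned, later vowel frames would be scored only against that group's HMMs, which is where the reduction in computation comes from.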

After this processing ends, the process returns to step S507, and steps S507 and S508 are performed as already described. If a further frame remains (step S507: No), the likelihood calculating means 112 increments the frame pointer (f) by 1 (step S508) and performs the same processing from step S503 onward for the next frame.

If the frame is the final frame (step S507: Yes), a predetermined output process (step S509) outputs the speech recognition result. Here, the speech recognition means 115 refers to the candidate words and cumulative likelihoods expanded in the cumulative likelihood storage unit 156, traces the nodes back from the final state of the final frame to obtain the recognition result (word) to be output, and outputs it as speech or text via the output control unit 130.

As described above, speech is recognized separately as consonants and vowels; vowels are divided into groups, and the HMM is chosen from the determined group. This speeds up recognition processing while improving recognition accuracy.

Besides the method described above, which determines the vowel group once vowels have been compared at least a specified number of times, there is the following method. It is carried out by executing steps S830 to S832 shown in FIG. 9 in place of steps S820 to S822 described above.

In this method, if no further group remains in step S818 (step S818: Yes), the average likelihood of each group is calculated (step S830). It is obtained by summing the likelihoods within each group and dividing by the total number of HMMs in that group.

It is then determined whether the ratio of the average likelihood of the group with the highest average likelihood to that of the group with the second-highest average likelihood is at least a prespecified ratio (step S831).

If the ratio is at least the prespecified ratio (step S831: Yes), the group with the highest average likelihood is selected as the determined group (step S832).
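Steps S830 to S832 can be sketched as below. The sketch assumes positive, linear-domain likelihoods so that a ratio is meaningful; the function name `decide_group_by_ratio` and the parameter `min_ratio` are hypothetical.

```python
def decide_group_by_ratio(group_likelihoods, min_ratio):
    """Return the group to adopt, or None if no group stands out yet.

    group_likelihoods: dict mapping group number -> list of likelihoods,
                       one per HMM in that group (assumed positive)
    min_ratio        : required ratio of best to second-best average
    """
    # Step S830: average likelihood per group.
    averages = {g: sum(v) / len(v) for g, v in group_likelihoods.items() if v}
    ranked = sorted(averages, key=averages.get, reverse=True)
    if len(ranked) < 2:
        return ranked[0] if ranked else None
    best, second = ranked[0], ranked[1]
    # Step S831: compare the top two averages against the given ratio.
    if averages[second] > 0 and averages[best] / averages[second] >= min_ratio:
        return best  # Step S832
    return None
```

With averages of 0.5 and 0.2, for example, the ratio is 2.5, so a threshold of 2.0 selects the first group while a threshold of 3.0 leaves the decision open.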

With this processing, the timing of the decision need not be set by a separate criterion, and the number of times each group is selected need not be counted; the group to adopt follows directly from the average and maximum of the likelihoods already calculated during speech recognition.

(Embodiment 2)
(Acoustic model learning device)
Next, a device that learns the group-specific HMMs described above is described.

This device is realized using the device of FIG. 1 described above.
In FIG. 1, the acoustic model storage unit 153 (see FIG. 3) of the storage unit 150 first stores ordinary HMMs that are not separated into consonants and vowels. The control unit 110 functions as the acoustic model learning device by executing an HMM learning program (not shown) stored in the program storage unit 140.

Functionally, as shown in FIG. 10, the device includes group number designating means 116, distance calculating means 117, grouping means 118, group number determining means 119, and vowel/consonant sorting means 1110.

The group number designating means 116 specifies the number of groups of vowel recognition HMMs in a predetermined area provided in the acoustic model storage unit 153 (see FIG. 3) of the storage unit 150. This can be done by the user specifying the value through the input device 12, for example with a keyboard, and the control unit 110 storing the specified value in the storage unit 150.

The distance calculating means 117 calculates the distance between groups of vowel recognition HMMs according to Equations (3) and (4), or Equations (5) and (4), described above.

The grouping means 118 merges the two groups separated by the shortest distance among all groups into one group. This can be done by unifying the two different group numbers into one of them and subtracting 1 from the total number of groups.

The group number determining means 119 determines whether the total number of groups, which decreases as groups are merged, has become equal to or less than the specified number.
The vowel/consonant sorting means 1110 sorts the HMMs stored in a predetermined area of the storage unit 150 into vowels and consonants and stores them in separate areas.

Next, the operation of the device described above is described with reference to FIG. 11.
Before the operation starts, the group number G, which specifies the total number of speaker groups for the vowel HMMs, is assumed to have been set by the user through the input device 12 in, for example, a work area of the storage unit 150.

First, the control unit 110 collects the speech data for the HMMs from a medium on which the speakers' voices are recorded and stores it in a predetermined area of the storage unit 150 (step S901). In the initial state, the phoneme HMMs of each speaker are regarded as a separate group consisting of one person.

The vowel/consonant sorting means 1110 of the control unit 110 then initializes the person pointer m, the per-person speech data pointer n, the consonant HMM pointer s, and the vowel HMM pointer g to "0" (step S902).

Next, the vowel/consonant sorting means 1110 determines whether the n-th data item of the m-th person is a vowel or a consonant (step S903). This can be done, for example, by preparing HMMs dedicated to distinguishing vowels from consonants, obtaining the probability of a vowel and the probability of a consonant by likelihood calculation as in the speech recognition described above, and comparing the two.

If the data item is a vowel (step S903: Yes), the HMM is stored in the storage area for vowel HMMs (step S904), and the vowel HMM pointer g is incremented by 1 (step S905).

If it is a consonant (step S903: No), the HMM is stored in the storage area for consonant HMMs (step S906), and the consonant HMM pointer s is incremented by 1 (step S907).

When the processing of the first speech data item ends in this way, the vowel/consonant sorting means 1110 determines whether any further speech data remains (step S921). If further speech data remains (step S921: No), the vowel/consonant sorting means 1110 increments the speech data pointer n by 1 (step S922) and performs the same processing from step S903 onward for the next speech data item.
If no further speech data remains (step S921: Yes), the process proceeds to the processing of the next person (steps S903 to S907).
When the processing of the first person ends, the vowel/consonant sorting means 1110 determines whether any further person remains (step S908). If a further person remains (step S908: No), the vowel/consonant sorting means 1110 increments the person pointer m by 1 (step S909) and performs the same processing from step S903 onward for the next person.

If no further person remains (step S908: Yes), the process proceeds to the next step (step S910).
The group number determining means 119 then determines whether the total number of speaker groups m is equal to or less than the specified group number G.

If the total number of speaker groups m is not equal to or less than the specified group number G (step S910: No), the distance calculating means 117 calculates the distances between the vowel recognition HMMs according to Equation (3) or Equation (5) described above (step S911).

The grouping means 118 then merges the two vowel recognition HMMs determined to be separated by the shortest distance into one group (step S912). This can be done, for example, by unifying the group numbers assigned to the two HMMs to the smaller of the two.

The grouping means 118 then subtracts 1 from the total number of speaker groups m (step S913). The process then returns to step S910 described above.

If, in step S910, the total number of speaker groups m is still not equal to or less than the specified group number G (step S910: No), the distance calculating means 117 calculates the distances between the vowel recognition HMMs according to Equations (3) and (4), or Equations (5) and (4), described above (step S911). At this point, as a result of the merging described above, some groups contain the speech data of two or more speakers.

The grouping means 118 then merges the two vowel recognition HMMs, or groups of them, determined to be separated by the shortest distance into one group (step S912). This can be done, for example, by unifying the group numbers assigned to the HMMs to the smaller of the two.

This processing is repeated, and when the total number of speaker groups m becomes equal to or less than the specified group number G (step S910: Yes), the group number determining means 119 ends all processing.
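The merge loop of steps S910 to S913 can be sketched as a bottom-up clustering. The distance used here is only a stand-in: each speaker's vowel HMM is reduced to a single hypothetical mean vector, and groups are compared by the Euclidean distance between their centroids, whereas the embodiment uses Equations (3) to (5). The names `centroid` and `cluster_speakers` are likewise hypothetical.

```python
import math
from itertools import combinations


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_speakers(speaker_means, target_groups):
    """Merge the closest pair of groups until target_groups remain.

    speaker_means: one representative vector per speaker (placeholder for
                   that speaker's vowel HMM parameters)
    Returns a dict mapping group number -> list of speaker indices.
    """
    # Initial state: every speaker forms a group of one person.
    groups = {i: [i] for i in range(len(speaker_means))}

    def distance(pair):
        a, b = pair
        ca = centroid([speaker_means[s] for s in groups[a]])
        cb = centroid([speaker_means[s] for s in groups[b]])
        return math.dist(ca, cb)

    # Step S910: continue while more groups remain than specified.
    while len(groups) > max(target_groups, 1):
        # Steps S911 and S912: find the closest pair and merge it,
        # keeping the smaller group number.
        a, b = min(combinations(sorted(groups), 2), key=distance)
        groups[a].extend(groups.pop(b))  # Step S913: one group fewer
    return groups
```

Keeping the smaller group number on each merge follows the text above; any other consistent relabeling would work equally well in this sketch.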

As a result, the storage unit 150 stores the HMMs for consonant recognition, learned from all speech data, and the HMMs for vowel recognition of each speaker group, learned from the speech data of that group.

The speech recognition device 100 and the acoustic model learning device (100) of the above embodiments can be implemented not only as dedicated devices but also on general-purpose computers such as personal computers. In that case, the speech recognition device 100 and the like according to the present invention can be configured by installing on the computer a program that implements the processing described in the above embodiments. The program may be distributed by any method: for example, it can be stored on a recording medium such as a CD-ROM, or superimposed on a carrier wave and distributed through a communication medium such as the Internet.

That is, the speech recognition device and the like according to the present invention can be realized, for example, as a portable translation device, or as an application running on a personal computer, a game device, or the like, and provides highly accurate speech recognition.

In addition, the accuracy of speech recognition in an existing speech recognition device or speech recognition application can be improved by adding a program that implements the processing according to the present invention (for example, through a version upgrade).

As described above, the present invention achieves high recognition accuracy in speech recognition.

FIG. 1 is a block diagram showing the configuration of a speech recognition device according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the functions realized by the control unit of the speech recognition device shown in FIG. 1.
FIG. 3 is a diagram showing details of the storage unit shown in FIG. 1.
FIG. 4 is a diagram showing an example of the cumulative likelihood values expanded in the cumulative likelihood storage unit shown in FIG. 3.
FIG. 5 is a flowchart for explaining the "speech recognition process" according to the embodiment of the present invention.
FIG. 6 is a flowchart for explaining the "speech recognition process" according to the embodiment of the present invention.
FIG. 7 is a flowchart for explaining the "vowel comparison process" according to the embodiment of the present invention.
FIG. 8 is a flowchart for explaining the "vowel comparison process" according to the embodiment of the present invention.
FIG. 9 is a flowchart for explaining the "vowel comparison process" according to the embodiment of the present invention.
FIG. 10 is a block diagram showing the configuration of an acoustic model learning device according to the embodiment of the present invention.
FIG. 11 is a flowchart for explaining the "HMM learning process" according to the embodiment of the present invention.

Explanation of symbols

100: speech recognition device (acoustic model learning device); 111: feature amount extracting means; 112: likelihood calculating means; 113: cumulative likelihood calculating means; 114: node creation means; 115: speech recognition means; 116: group number designating means; 117: distance calculating means; 118: grouping means; 119: group number determining means; 1110: vowel/consonant sorting means; 151: speech storage unit; 152: feature storage unit; 153: acoustic model storage unit; 154: grammar storage unit; 155: dictionary storage unit; 156: cumulative likelihood storage unit

Claims

1. A speech recognition apparatus comprising:
a storage unit that stores an acoustic model for consonant recognition learned from all speech data and a plurality of acoustic models for vowel recognition learned from the speech data of each group;
probability calculating means for calculating the state transition probability of each phoneme of input speech, based on the feature quantities extracted for each of a plurality of frames of predetermined length of the input speech and on each acoustic model stored in the storage unit;
likelihood calculating means for accumulating the calculated state transition probabilities and calculating a likelihood for each acoustic model;
cumulative likelihood calculating means for sequentially calculating a cumulative value of the likelihoods calculated in the frames preceding each frame; and
speech recognition means for recognizing the input speech based on the cumulative likelihood calculated by the cumulative likelihood calculating means.

2. The speech recognition apparatus according to claim 1, further comprising:
frame identification means for determining whether the speech of each frame is a vowel or a consonant; and
group determination means for determining, when the input speech is a vowel, the group from which the acoustic model for vowel recognition was learned.

3. An acoustic model learning device comprising:
a storage unit that stores an acoustic model for consonant recognition that is learned from all speech data and an acoustic model for vowel recognition for each group that is learned from the speech data of that group;
group number specifying means for specifying the number of groups of the acoustic models for vowel recognition;
distance calculating means for calculating the distance between groups of the acoustic models for vowel recognition;
grouping means for merging the two groups separated by the shortest distance into one group; and
group number determination means for determining whether the total number of groups is equal to or less than the specified number.

4. A speech recognition method by which a predetermined device improves the accuracy of speech recognition using acoustic models, the method comprising:
a model acquisition step of acquiring an acoustic model for consonant recognition learned from all speech data and a plurality of acoustic models for vowel recognition learned from the speech data of each group;
a feature amount extraction step of setting a plurality of frames of predetermined length at a predetermined cycle for target speech and extracting a feature amount for each frame;
a probability calculating step of calculating the state transition probability of each phoneme of the target speech based on the feature amount extracted in each frame;
a likelihood calculating step of accumulating the calculated state transition probabilities and calculating a likelihood for each acoustic model;
a cumulative likelihood calculating step of sequentially calculating a cumulative likelihood based on the likelihood calculated for each acoustic model and the maximum likelihood calculated in the frames preceding each frame; and
a speech recognition step of performing speech recognition based on the calculated cumulative likelihood.

5. A program for functioning as a speech recognition device by:
storing an acoustic model for consonant recognition learned from all speech data and a plurality of acoustic models for vowel recognition learned from the speech data of each group;
capturing target speech, setting a plurality of frames of predetermined length at a predetermined cycle for the captured speech, and extracting a feature amount for each frame;
calculating state transition probabilities based on the feature amount extracted in each frame;
accumulating the calculated state transition probabilities and calculating a likelihood for each acoustic model;
sequentially calculating a cumulative likelihood based on the likelihood calculated for each acoustic model and the maximum likelihood calculated in the frames preceding each frame; and
performing speech recognition based on the calculated cumulative likelihood.