JP5975938B2

JP5975938B2 - Speech recognition apparatus, speech recognition method and program

Info

Publication number: JP5975938B2
Application number: JP2013127389A
Authority: JP
Inventors: 亮増村; 浩和政瀧; 隆伸大庭
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-06-18
Filing date: 2013-06-18
Publication date: 2016-08-23
Anticipated expiration: 2033-06-18
Also published as: JP2015001695A

Description

The present invention relates to speech recognition technology, and more particularly to a technique for performing speech recognition at high speed using a latent words language model (LWLM).

Speech recognition requires a language model for linguistic prediction. In the field of speech recognition, the N-gram language model is the one in general use. Its form is highly compatible with speech recognition decoding, and it can be trained easily whenever training text is available. Methods for training an N-gram language model are described in, for example, Non-Patent Document 1.

Various language models other than the N-gram language model have been proposed. Considering compatibility with speech recognition decoding, however, such models are very rarely used in speech recognition, because with them it is difficult to carry out the search performed during recognition in a practical amount of time.

Although the N-gram language model is widely used in speech recognition, it has problems. One is that it can learn only the linguistic phenomena that actually occur in the training text. As a concrete example, suppose the training text contains "eat apples" (りんごを食べる) but not "eat mandarin oranges" (みかんを食べる). When an N-gram language model built from that text is used to compute the probability of "eat mandarin oranges", essentially only the information about "を食べる" (the object marker followed by "eat") is used. However, "mandarin orange" and "apple" are normally similar words, so in many cases it would be better to use the information about "eat apples" when estimating the probability of "eat mandarin oranges".

One language model that exploits information about similar words is the latent words language model described in Non-Patent Document 2. It considers a latent word for each word in the training text, so that a probabilistic model can be built that takes into account, for example, that "apple" and "mandarin orange" are similar words. Non-Patent Document 2 reports that the latent words language model has higher language prediction performance than the N-gram language model.

The latent words language model has a structure similar to that of the class N-gram language model described in Non-Patent Document 3. Unlike an ordinary N-gram language model, a class N-gram language model maps words to units called classes and is expressed as an N-gram language model over the class sequence. A class is a grouping of words according to some criterion, for example a fruit-name class or a surname class. A class N-gram language model additionally holds a distribution of in-class word occurrence probabilities. For the fruit-name class, for example, the probability that "apple" occurs might be 0.2 and the probability that "mandarin orange" occurs might be 0.1.
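The class N-gram factorization described above can be sketched as follows. This is only an illustration for the bigram case: the class names, the class-bigram table, and the probability for "eat" are made-up values, and only the in-class probabilities 0.2 and 0.1 for the fruit-name class come from the text.

```python
# Hard-clustering class bigram:
#   P(w_k | w_{k-1}) = P(c_k | c_{k-1}) * P(w_k | c_k)
# All tables are hypothetical toy values, except the in-class
# probabilities 0.2 ("apple") and 0.1 ("mandarin") quoted in the text.
WORD_TO_CLASS = {"apple": "FRUIT", "mandarin": "FRUIT", "eat": "VERB"}
CLASS_BIGRAM = {("FRUIT", "VERB"): 0.5, ("VERB", "FRUIT"): 0.4}
WORD_GIVEN_CLASS = {
    ("apple", "FRUIT"): 0.2,
    ("mandarin", "FRUIT"): 0.1,
    ("eat", "VERB"): 0.3,
}


def class_bigram_prob(prev_word, word):
    """Probability of `word` after `prev_word` under the class bigram."""
    prev_class = WORD_TO_CLASS[prev_word]
    cur_class = WORD_TO_CLASS[word]
    class_prob = CLASS_BIGRAM.get((prev_class, cur_class), 0.0)
    return class_prob * WORD_GIVEN_CLASS[(word, cur_class)]
```

Because both fruit names share one class, a context seen with "apple" automatically gives "mandarin" a non-zero probability, scaled by its in-class probability.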

Non-Patent Document 1: Kenji Kita, "Language and Computation 4: Probabilistic Language Models" (言語と計算 4 確率的言語モデル), University of Tokyo Press, November 1999, pp. 57-62.
Non-Patent Document 2: K. Deschacht, J. D. Belder, M-F. Moens, "The latent words language model", Computer Speech and Language, vol. 26, pp. 384-409, 2012.
Non-Patent Document 3: Kenji Kita, "Language and Computation 4: Probabilistic Language Models" (言語と計算 4 確率的言語モデル), University of Tokyo Press, November 1999, pp. 72-74.

The latent words language model has not been used in speech recognition or machine translation. Using it directly in place of a conventional N-gram language model in speech recognition is impractical, because the computation time for decoding increases greatly. This stems from the fact that the latent words language model has the structure of a soft-clustering class N-gram language model.

The class N-gram language models commonly used in speech recognition have a hard-clustering structure, in which each word belongs to exactly one class. In a soft-clustering structure, by contrast, one word may belong to several classes, and in the latent words language model every word belongs to every class. With hard clustering, the class sequence of a given word string is uniquely determined. With soft clustering, an enormous number of class sequences exist for a given word string. Speech recognition requires a step that computes, with the language model, a probability for an arbitrary word string. With the latent words language model this step requires an enormous amount of computation, which makes its use in speech recognition impractical.
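The difference in the number of class sequences can be made concrete with a small count. The sketch below is illustrative only; the four-word vocabulary and the class label "FRUIT_OR_COPULA" are hypothetical.

```python
def count_class_sequences(words, classes_of):
    """Number of class sequences compatible with a word string.

    `classes_of` maps each word to the list of classes it may belong to.
    """
    total = 1
    for word in words:
        total *= len(classes_of[word])
    return total


vocabulary = ["apple", "mandarin", "pineapple", "desu"]  # hypothetical
sentence = ["apple", "mandarin", "pineapple", "desu"]

# Hard clustering: every word belongs to exactly one class.
hard = {w: ["FRUIT_OR_COPULA"] for w in vocabulary}
# Latent-words-style soft clustering: every word belongs to every class,
# and the classes (latent words) are the vocabulary words themselves.
soft = {w: list(vocabulary) for w in vocabulary}

hard_count = count_class_sequences(sentence, hard)
soft_count = count_class_sequences(sentence, soft)
```

With a realistic vocabulary of tens of thousands of words, the soft-clustering count grows exponentially with sentence length, which is the source of the decoding cost described above.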

When the latent words language model is used for speech recognition, finding the classes is very difficult because every word belongs to every class. The method of performing speech recognition while finding the classes is called Viterbi search. Viterbi search is a well-known technique, used in a variety of decoding problems, that here realizes joint decoding of the word sequence and the latent word sequence.

An object of the present invention is to make practical use, in speech recognition, of the latent words language model and its excellent language prediction performance. More specifically, the invention greatly alleviates the problem of the enormous computation required for Viterbi search, so that the latent words language model can be used in Viterbi search.

To solve the above problem, the speech recognition apparatus of the present invention includes a latent words language model storage unit, a baseline language model storage unit, a multiple hypothesis generation unit, a latent word sequence determination unit, a score recalculation unit, and a top hypothesis determination unit. The latent words language model storage unit stores a latent words language model obtained by learning the probability distribution of training latent word sequences, which are the latent word sequences corresponding to the observed word sequences contained in the training text, and the probability distribution of the observed words in the observed word sequences and the latent words in the latent word sequences. The baseline language model storage unit stores a baseline language model in which the probability distribution of the latent word sequences contained in the latent words language model and the probability distribution of the observed word sequences are mixed. The multiple hypothesis generation unit performs speech recognition on input speech using the baseline language model and generates a plurality of recognition-result hypotheses together with a provisional speech recognition score for each hypothesis. The latent word sequence determination unit uses the latent words language model to determine a hypothesis latent word sequence, which is the latent word sequence corresponding to a hypothesis, and obtains the joint probability of the hypothesis and the hypothesis latent word sequence. The score recalculation unit obtains a speech recognition score using the provisional speech recognition score and the joint probability. The top hypothesis determination unit determines, from the plurality of hypotheses, the speech recognition result for the input speech on the basis of the speech recognition score.
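The last three units can be read as an N-best rescoring pass. The sketch below is a minimal illustration, not the apparatus itself. It assumes scores in the log domain and a simple additive combination with a hypothetical `weight` parameter, and the callable `joint_prob`, standing in for the latent word sequence determination unit, is also an assumption. The text above states only that the provisional score and the joint probability are both used.

```python
import math


def rescore(hypotheses, joint_prob, weight=1.0):
    """Pick the top hypothesis after rescoring.

    hypotheses: list of (word_tuple, provisional_log_score) pairs, as might
        be produced by a first pass with the baseline language model.
    joint_prob: hypothetical callable returning the joint probability of a
        word tuple and its best latent word sequence.
    weight: assumed scaling factor for the latent-words term.
    """
    rescored = [
        (words, score + weight * math.log(joint_prob(words)))
        for words, score in hypotheses
    ]
    # The top hypothesis is the one with the highest recalculated score.
    return max(rescored, key=lambda item: item[1])
```

Because only the listed hypotheses are passed to `joint_prob`, the expensive latent word search runs on a handful of word sequences instead of the full search space.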

Because the speech recognition technique of the present invention performs recognition using a latent words language model with excellent language prediction performance, it achieves higher language prediction performance than speech recognition using an N-gram language model built by a conventional training method. In addition, at recognition time, the candidate recognition results are first narrowed down using a baseline language model that has the structure of an ordinary N-gram language model, and Viterbi search with the latent words language model is then performed on those candidates. Compared with performing Viterbi search directly, this greatly reduces the amount of computation, so that speech recognition using a latent words language model becomes feasible even on an ordinary computer. Therefore, according to the speech recognition technique of the present invention, speech recognition using a latent words language model with excellent language prediction performance can be performed at high speed.

FIG. 1 is a diagram for explaining a latent word sequence and an observed word sequence.
FIG. 2 is a diagram illustrating the functional configuration of the speech recognition apparatus.
FIG. 3 is a diagram illustrating the processing flow of the speech recognition method.
FIG. 4 is a diagram for explaining the data flow of the speech recognition search.

[Points of the Invention]
Prior to the description of the embodiment, the key points of the present invention are described.

This invention focuses on the fact that the latent words language model is a probabilistic generative model, and builds an ordinary word N-gram language model from pseudo training text generated according to the stochastic process of the latent words language model. In other words, the properties of the latent words language model are expressed approximately as an N-gram language model. Further, by combining this N-gram language model with an N-gram language model trained directly from the training text in the usual way, a language model that exploits the properties of both is built. For example, even if the training text contains "eat apples" but not "eat mandarin oranges", once a latent words language model has been built and text is generated from it, the text "eat mandarin oranges" can be generated. Training an N-gram language model from pseudo training text that contains "eat mandarin oranges" therefore yields an N-gram language model that takes the similarity of "mandarin orange" and "apple" into account. Moreover, once the model is expressed as a simple N-gram model, it can be mixed with an N-gram language model built in the usual way and expressed as a single language model. The mixed language model can be expected to be one in which the characteristics of the latent words language model and those of a simple N-gram language model complement each other.

In addition, to perform Viterbi search with the latent words language model at high speed, this invention first narrows down the candidate word sequences to be output and only then estimates the latent word sequence hidden behind each word sequence. An ordinary N-gram language model is used in the stage that narrows down the candidate word sequences. The latent words language model is used in the stage that estimates the latent word sequence hidden behind each word sequence. Finally, the speech recognition result is determined with reference to the joint probability of the hidden latent word sequence and the actual word sequence.

[Latent Words Language Model]
A latent word corresponds to a class in a class N-gram language model, and is the representative word of a group of words that have similar meanings or syntactic roles in a given context. FIG. 1 illustrates the relationship between latent words and observed words. An observed word is a word that appears in the training text, for example each of the individual words that make up the sentence "今日はいい天気です" ("The weather is nice today") shown in FIG. 1. In the following, a sentence formed by a sequence of observed words is called an observed word sequence. For each observed word, the latent word is expressed by the representative of the words similar to that observed word. The observed word "今日" (today), for example, corresponds to the latent word "明日" (tomorrow), which represents a group of similar words such as "明日" (tomorrow), "昨日" (yesterday), and "今日" (today).

For example, when a technical word and a general word play contextually similar roles, the latent words language model can be trained so that they have similar probabilities under a single latent word. Compared with an ordinary N-gram language model, the latent words language model can therefore avoid the data sparseness problem and can be expected to work robustly across a variety of tasks.

As described above, however, the latent words language model has the problem that the computational cost of speech recognition decoding increases greatly. This invention therefore first creates pseudo training text from the latent words language model and then builds an N-gram language model from that pseudo training text, thereby using an approximation of the latent words language model in the form of a simple N-gram language model.

[Embodiment]
An embodiment of the present invention is described in detail below. In the drawings, components having the same function are given the same reference numeral, and duplicate description is omitted.

An example of the functional configuration of the speech recognition apparatus according to the embodiment is described with reference to FIG. 2. The speech recognition apparatus 1 includes a training text storage unit 10, a latent words language model learning unit 12, a latent words language model storage unit 14, a baseline language model learning unit 16, a baseline language model storage unit 18, a multiple hypothesis generation unit 20, a latent word sequence determination unit 22, a score recalculation unit 24, and a top hypothesis determination unit 26.

The speech recognition apparatus 1 is, for example, a special apparatus configured by loading a special program into a known or dedicated computer having a central processing unit (CPU), a main storage device (random access memory, RAM), and the like. The speech recognition apparatus 1 executes each process under the control of the central processing unit, for example. Data input to the speech recognition apparatus 1 and data obtained in each process are stored in, for example, the main storage device, and the stored data is read out as needed and used in other processes. Each storage unit of the speech recognition apparatus 1 can be implemented by, for example, a main storage device such as RAM, an auxiliary storage device constituted by a hard disk, an optical disc, or a semiconductor memory element such as flash memory, or middleware such as a relational database or a key-value store. The storage units need only be logically separated and may be stored on a single physical storage device.

With reference to FIG. 3, an example of the processing flow of the speech recognition method executed by the speech recognition apparatus 1 is described in the order in which the steps are actually performed.

The training text storage unit 10 stores training text prepared in advance. The training text is the text used to train language models such as the latent words language model and ordinary N-gram language models. This invention assumes that the training text has been segmented into words.

In step S12, the latent words language model learning unit 12 reads the training text stored in the training text storage unit 10 and generates a latent words language model by learning the probability distribution of the training latent word sequence, which is the sequence of latent words corresponding to the observed word sequence (the sequence of observed words contained in the training text), and the probability distribution of the observed words and the latent words. The generated latent words language model is stored in the latent words language model storage unit 14.

The latent words language model has two probability distributions: P(h_k | h_{k-2}, h_{k-1}) and P(w_k | h_k). Here h is a latent word and w is an observed word. The latent word h is the latent variable of the model, and the observed word w is a word that actually appears in the training text. The distribution P(h_k | h_{k-2}, h_{k-1}) has the form of an ordinary word N-gram language model, and the distribution P(w_k | h_k) has the form of a unigram language model. This is the same form as an ordinary class N-gram language model, with the latent words playing the role of the classes.
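Given the two distributions, the joint probability of an observed word sequence and one particular latent word sequence is the product of the per-position terms. The sketch below illustrates this under some assumptions: the distributions are plain dictionaries, the sentence start is padded with a hypothetical "<s>" symbol, and no smoothing is applied.

```python
import math


def lwlm_log_joint(words, latents, p_trans, p_emit, bos="<s>"):
    """log P(w, h) = sum_k [log P(h_k | h_{k-2}, h_{k-1}) + log P(w_k | h_k)].

    p_trans: dict mapping (h_{k-2}, h_{k-1}, h_k) to a probability.
    p_emit:  dict mapping (h_k, w_k) to a probability.
    bos:     assumed padding symbol for the sentence start.
    """
    prev2, prev1 = bos, bos
    total = 0.0
    for word, latent in zip(words, latents):
        total += math.log(p_trans[(prev2, prev1, latent)])
        total += math.log(p_emit[(latent, word)])
        prev2, prev1 = prev1, latent
    return total
```

The probability of the word sequence alone would require summing this quantity over every possible latent word sequence, which is the expensive step that the rest of the document works to avoid.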

Training the latent words language model is the problem of estimating the latent word assigned to each observed word of the input training text. That is, given training text (a sequence of observed words) "w_1 w_2 ... w_L", where L is the total number of words in the training text, the problem is to estimate the latent words "h_1", "h_2", ..., "h_L" corresponding to the observed words "w_1", "w_2", ..., "w_L". Once this assignment has been estimated, the distribution P(h_k | h_{k-2}, h_{k-1}) can be built by training an N-gram language model on the latent word sequence "h_1 h_2 ... h_L", and the distribution P(w_k | h_k) can be built by training a unigram language model on the pairs "h_1 → w_1", "h_2 → w_2", ..., "h_L → w_L".
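Once an assignment is fixed, both distributions follow from counting. The sketch below uses plain relative frequencies with no smoothing, which is a simplification; the padding symbol "<s>" is an assumption, and practical N-gram training would normally add smoothing as covered in Non-Patent Document 1.

```python
from collections import Counter


def estimate_emission(words, latents):
    """Relative-frequency estimate of P(w | h) from aligned h_k -> w_k pairs."""
    pair_counts = Counter(zip(latents, words))
    latent_counts = Counter(latents)
    return {(h, w): n / latent_counts[h] for (h, w), n in pair_counts.items()}


def estimate_transition(latents, bos="<s>"):
    """Relative-frequency estimate of P(h_k | h_{k-2}, h_{k-1})."""
    padded = [bos, bos] + list(latents)
    trigrams = list(zip(padded, padded[1:], padded[2:]))
    trigram_counts = Counter(trigrams)
    context_counts = Counter((a, b) for a, b, _ in trigrams)
    return {
        (a, b, c): n / context_counts[(a, b)]
        for (a, b, c), n in trigram_counts.items()
    }
```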

The concrete assignment of latent words can be estimated by the method called Gibbs sampling, described in Non-Patent Document 2 and elsewhere. Let W be the set of observed words contained in the training text. Each latent word is then assigned one of the words in W. For example, if the training text (the sequence of observed words) is the four-word sentence "apple mandarin-orange pineapple desu" (りんご みかん パイン です, where です is the copula), W is the set of four words "apple", "mandarin orange", "pineapple", and "desu". In this case the candidate latent word sequences for the training text include "apple apple apple desu", "mandarin-orange apple mandarin-orange desu", and "pineapple pineapple pineapple pineapple". That is, the candidate latent word sequences are all combinations of latent words: for a sequence of M words with J kinds of latent words, there are J^M (J to the power of M) candidates. Each candidate has some probability of actually being the correct sequence. Gibbs sampling can carry out the process of determining the more plausible ones among these candidates. In Non-Patent Document 2, for example, a single latent word sequence is ultimately selected. The latent word sequence finally obtained after repeating Gibbs sampling can be used as the selected sequence.
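One Gibbs sampling step resamples a single latent word while all the others are held fixed. The sketch below is a simplified illustration, not the procedure of Non-Patent Document 2: it assumes a hypothetical callable `joint` that returns the joint probability of the fixed observed words and a trial latent word sequence, and it does not show how the model parameters are re-estimated between sweeps.

```python
import random


def gibbs_resample_position(k, latents, vocabulary, joint, rng):
    """Resample h_k given all other latent words.

    The conditional P(h_k | other latents, words) is proportional to the
    joint probability with h_k replaced by each candidate in turn.
    `joint` is a hypothetical callable taking a list of latent words.
    """
    weights = []
    for candidate in vocabulary:
        trial = latents[:k] + [candidate] + latents[k + 1:]
        weights.append(joint(trial))
    latents[k] = rng.choices(vocabulary, weights=weights, k=1)[0]
    return latents
```

Sweeping k over every position and repeating the sweep many times visits plausible assignments far more often than implausible ones, without ever enumerating all candidates.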

Taking the case where "apple apple apple desu" is finally selected as the latent word sequence in the above example, the meaning of the two probability distributions is explained. As described above, once the latent word sequence is determined, the distributions P(h_k | h_{k-2}, h_{k-1}) and P(w_k | h_k) can be obtained. In this example the latent word sequence is "apple apple apple desu", so the word "apple" serves as the representative of fruit names. The distribution P(h_k | h_{k-2}, h_{k-1}) thus captures that a fruit name often appears after a fruit name, and the distribution P(w_k | h_k) indicates that the actual realization of a fruit name is "apple", "mandarin orange", or "pineapple".

In the Gibbs sampling described in Non-Patent Document 2, the latent word sequence assignment is determined uniquely. However, it is also possible to hold several assignments, that is, to use several of the candidate latent word sequences. When B (≧2) candidates are used, the model contains B pairs of distributions P(h_k | h_{k-2}, h_{k-1}) and P(w_k | h_k). For example, if B = 2, the model holds the pair P_1(h_k | h_{k-2}, h_{k-1}) and P_1(w_k | h_k) and the pair P_2(h_k | h_{k-2}, h_{k-1}) and P_2(w_k | h_k). The value of B is set manually in advance. To make the language model more detailed, B should be increased; a value of about 10 can be regarded as reasonably detailed. When several candidate latent word sequences are used, the last B assignments obtained in Gibbs sampling can be used as the candidates.

In step S16, the baseline language model learning unit 16 reads the training text stored in the training text storage unit 10 and the latent words language model stored in the latent words language model storage unit 14, learns the probability distribution of the latent word sequences contained in the latent words language model, learns the probability distribution of the observed word sequences contained in the training text, and mixes these probability distributions to generate a baseline language model.

First, the baseline language model learning unit 16 creates an N-gram language model by counting the frequency of every N-word tuple in the training text it has read. For the training method, see Non-Patent Document 1 and similar sources. For example, a baseline language model with a trigram structure gives the probability distribution P_base(w_k | w_{k-2}, w_{k-1}), which is the probability that the word w_k appears after the word string w_{k-2}, w_{k-1}.

Next, using the latent words language model it has read, the baseline language model learning unit 16 generates a latent word sequence from the distribution P(h_k | h_{k-2}, h_{k-1}) and creates pseudo training text from that latent word sequence and the distribution P(w_k | h_k).

The pseudo training text is created, for example, as follows. First, the baseline language model learning unit 16 generates a uniform random number rand_h1 and determines the first latent word h_1 according to P(h_1 | -, -). The latent word h_1 is determined from the relation between the uniform random number and the probabilities of h_1, that is, by locating the random number within the cumulative probabilities. For example, assuming P(apple | -, -) = 0.3, P(mandarin orange | -, -) = 0.3, P(pineapple | -, -) = 0.3, and P(desu | -, -) = 0.1, the first latent word is h_1 = "apple" when rand_h1 = 0.1. Similarly, h_1 = "mandarin orange" when rand_h1 = 0.4, h_1 = "pineapple" when rand_h1 = 0.7, and h_1 = "desu" when rand_h1 = 0.95. Next, the baseline language model learning unit 16 generates a uniform random number rand_w1 and determines the observed word w_1 by referring to P(w_1 | h_1), again from the relation between the uniform random number and the probabilities. For example, assuming P(apple | h_1) = 0.3, P(mandarin orange | h_1) = 0.3, P(pineapple | h_1) = 0.3, and P(desu | h_1) = 0.1, the observed word is w_1 = "apple" when rand_w1 = 0.1. Similarly, w_1 = "mandarin orange" when rand_w1 = 0.4, w_1 = "pineapple" when rand_w1 = 0.7, and w_1 = "desu" when rand_w1 = 0.95. The determined observed word w_1 is output to the pseudo training text. The pseudo training text is generated by repeating this series of steps a predetermined number of times. The more pseudo training text is output, the better it represents the properties of the latent words language model.
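The selection rule in this example is sampling by cumulative probability. The sketch below reproduces it with the example values from the text; the function name `pick` and the English dictionary keys are illustrative choices.

```python
from itertools import accumulate


def pick(distribution, r):
    """Return the first item whose cumulative probability exceeds r.

    `distribution` maps items to probabilities in a fixed order, and r is
    a uniform random number in [0, 1).
    """
    items = list(distribution)
    for item, upper in zip(items, accumulate(distribution.values())):
        if r < upper:
            return item
    return items[-1]  # guard against floating-point round-off


# Example values for P(h_1 | -, -) taken from the text.
P_H1 = {"apple": 0.3, "mandarin": 0.3, "pineapple": 0.3, "desu": 0.1}
```

Calling `pick(P_H1, 0.4)` lands in the second interval, [0.3, 0.6), and so selects "mandarin", matching the example. In practice `r` would come from a generator such as `random.random()`, and the same function would be applied to P(w_k | h_k) to emit each observed word.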

In order to use a plurality of latent word sequence candidates, when B sets of a probability distribution P_b(h_k|h_k-2,h_k-1) and a probability distribution P_b(w_k|h_k) are held (1≦b≦B), a further uniform random number rand_p may be generated when generating the pseudo learning text, and the set of probability distributions to be used may be selected from the relationship between the reciprocal of B and the uniform random number rand_p. For example, when B=2 and rand_p=0.3, the latent word h_k and the observed word w_k are determined using the probability distribution P₁(h_k|h_k-2,h_k-1) and the probability distribution P₁(w_k|h_k).
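One possible reading of this selection, given only as a sketch, is that each of the B sets owns an interval of width 1/B on [0, 1). The function name pick_distribution_index and the interval convention are assumptions; the specification gives only the case B=2, rand_p=0.3.

```python
def pick_distribution_index(rand_p, B):
    """Return b in 1..B, assuming set b owns the interval [(b-1)/B, b/B)."""
    # int() truncates toward zero; the min() guards against rand_p == 1.0.
    return min(int(rand_p * B), B - 1) + 1

# Case given in the description: B=2 and rand_p=0.3 select the first set.
b = pick_distribution_index(0.3, 2)
```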

Subsequently, the baseline language model learning unit 16 counts the frequency of every N-word tuple occurring in the generated pseudo learning text and creates an N-gram language model. For the method of learning an N-gram language model, see Non-Patent Document 1 and the like.
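The frequency counting step might be sketched as follows. This is a minimal illustration of counting only: smoothing and sentence-boundary handling are omitted, and the function name count_ngrams and the two-sentence toy text are assumptions introduced here.

```python
from collections import Counter

def count_ngrams(sentences, n):
    """Count every run of n consecutive words in a list of tokenized sentences."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts

# Hypothetical pseudo learning text of two tokenized sentences.
pseudo_text = [["りんご", "を", "食べる"], ["みかん", "を", "食べる"]]
bigram_counts = count_ngrams(pseudo_text, 2)
```

Relative frequencies computed from such counts would then serve as the raw material for the N-gram probabilities.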

Then, the baseline language model learning unit 16 generates the baseline language model as a weighted sum of the N-gram language model generated from the learning text and the N-gram language model generated from the latent word language model. The generated baseline language model is stored in the baseline language model storage unit 18.
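A weighted sum of two models could, for instance, be formed as a linear interpolation of their probabilities. The sketch below assumes each model is a dictionary from a word to its probability in one fixed context and that the two weights sum to 1; the function name interpolate, the weight value lam, and the toy probabilities are hypothetical and are not given in the specification.

```python
def interpolate(p_text, p_latent, lam):
    """Return lam * p_text + (1 - lam) * p_latent over the union of both vocabularies."""
    keys = set(p_text) | set(p_latent)
    return {
        k: lam * p_text.get(k, 0.0) + (1.0 - lam) * p_latent.get(k, 0.0)
        for k in keys
    }

# Toy probabilities of the word preceding "を 食べる" in each model.
p_from_learning_text = {"りんご": 1.0}
p_from_latent_model = {"りんご": 0.5, "みかん": 0.5}
baseline = interpolate(p_from_learning_text, p_from_latent_model, lam=0.5)
```

In this toy case the word みかん, which never occurs in the learning text, receives a nonzero probability from the model derived from the latent word language model.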

The following describes in detail the processing that performs speech recognition using the latent word language model and the baseline language model generated as described above. The description refers as appropriate to the example shown in FIG. 4.

In step S20, a speech signal is input to the multiple hypothesis generation unit 20. The multiple hypothesis generation unit 20 reads the baseline language model from the baseline language model storage unit 18, performs speech recognition on the input speech using the baseline language model, and generates a plurality of hypotheses of speech recognition results together with a provisional speech recognition score, which is the speech recognition score for each hypothesis. The generated hypotheses are input to the latent word sequence determination unit 22. The generated hypotheses and provisional speech recognition scores are input to the score recalculation unit 24.

Because the baseline language model is an ordinary N-gram language model, any speech recognition decoding algorithm can be used. For example, a one-pass frame-synchronous beam search can be used. For details of the one-pass frame-synchronous beam search, see Kiyohiro Shikano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, and Mikio Yamamoto, "IT Text Speech Recognition System", Ohmsha, pp. 110 (Reference 1), and the like.

The multiple hypothesis generation unit 20 outputs various speech recognition results with scores. As shown in FIG. 4, the multiple hypotheses for a given input speech are, for example, "今日が晴れです：-15000", "今日は晴れです：-18000", and "今日はまれです：-20000" (roughly "today [with the particle が] is sunny", "today is sunny", and "today is rare"). The character string before the colon is the hypothesis of the speech recognition result, and the number after it is the provisional speech recognition score. The provisional speech recognition score is negative for the following reason. In speech recognition, the logarithm of a probability is generally used as the speech recognition score. The logarithm of a value greater than 0 and less than 1 is negative. Therefore, the provisional speech recognition score is a negative value.

The multiple hypotheses are called the N-best. The N-best denotes the N hypotheses of speech recognition results selected in descending order of speech recognition score. A larger speech recognition score means a better hypothesis. For example, the 10-best is the 10 highest-scoring hypotheses. The value of N is set manually, for example to 1000. Ideally this value would be very large, but in view of the amount of computation it is generally set between 1000 and 5000. Finally, the multiple hypothesis generation unit 20 obtains and outputs N pairs of a hypothesis and its provisional speech recognition score, such as "今日が晴れです：-15000" in the above example.
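Selecting the N-best from scored hypotheses might be sketched as follows. The function name n_best and the representation of a hypothesis as a (string, score) pair are assumptions of this sketch; the scores are the provisional speech recognition scores of the FIG. 4 example.

```python
import heapq

def n_best(scored_hypotheses, n):
    """Return the n highest-scoring (hypothesis, score) pairs, best first."""
    return heapq.nlargest(n, scored_hypotheses, key=lambda pair: pair[1])

hypotheses = [
    ("今日はまれです", -20000),
    ("今日が晴れです", -15000),
    ("今日は晴れです", -18000),
]
two_best = n_best(hypotheses, 2)
```

Because the scores are log probabilities, the value closest to zero is the largest and therefore ranks first.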

In step S22, the latent word sequence determination unit 22 reads the latent word language model from the latent word language model storage unit 14, uses the latent word language model to determine a hypothetical latent word sequence, which is the latent word sequence corresponding to a hypothesis, and obtains the joint probability of the hypothesis and the hypothetical latent word sequence. The obtained joint probability is paired with the latent word sequence and input to the score recalculation unit 24.

Specifically, the latent word sequence determination unit 22 can use the Viterbi algorithm. The Viterbi algorithm can determine the latent word sequence hidden behind a hypothesis and, in parallel, calculate the joint probability of the hypothesis and that latent word sequence. For details of the Viterbi algorithm, see Reference 1 above. Using the Viterbi algorithm, the latent word sequence and the joint probability can both be obtained. Here, the joint probability is called the Viterbi probability. For example, when the input hypothesis is "みかんです" ("it is a mandarin orange"), the latent word sequence is determined to be "りんごです" ("it is an apple"), and the Viterbi probability of the input hypothesis and the latent word sequence is uniquely determined, for example as 0.0005. The Viterbi probability is converted to a log probability before being output, because the original score is also based on the logarithm of a probability and the two scores should have the same meaning. If the Viterbi probability is 0.0005, the log probability (base 10) is -3.301. As shown in FIG. 4, the output for the hypothesis "今日が晴れです：-15000" is, for example, "今日は天気だ：-8". Here, the character string before the colon is the latent word sequence, and the number after it is the joint probability expressed as a log probability.
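The following sketch illustrates how a Viterbi search can return the latent word sequence and the log joint probability together. It is deliberately simplified: it conditions each latent word on only the one preceding latent word, whereas the latent word language model described here conditions on the two preceding latent words. The function names, the start symbol "<s>", and all probability values are hypothetical and were chosen only so that the result reproduces the example value 0.0005.

```python
import math

def log10(p):
    """Base-10 logarithm that maps probability zero to negative infinity."""
    return math.log10(p) if p > 0.0 else float("-inf")

def viterbi(observed, latent_vocab, trans, emit, start="<s>"):
    """Return (best latent word sequence, log10 joint probability).

    trans[(prev, cur)] = P(cur | prev); emit[(latent, word)] = P(word | latent).
    Entries missing from either table are treated as probability zero.
    """
    # best[h] = (log score of the best path ending in latent word h, that path)
    best = {start: (0.0, [])}
    for word in observed:
        new_best = {}
        for cur in latent_vocab:
            e = log10(emit.get((cur, word), 0.0))
            new_best[cur] = max(
                ((score + log10(trans.get((prev, cur), 0.0)) + e, path + [cur])
                 for prev, (score, path) in best.items()),
                key=lambda c: c[0],
            )
        best = new_best
    score, path = max(best.values(), key=lambda c: c[0])
    return path, score

# Hypothetical fragments of the two probability tables.
vocab = ["りんご", "みかん", "です"]
trans = {("<s>", "りんご"): 0.05, ("<s>", "みかん"): 0.01,
         ("りんご", "です"): 0.5, ("みかん", "です"): 0.5}
emit = {("りんご", "みかん"): 0.1, ("みかん", "みかん"): 0.4, ("です", "です"): 0.2}

latent_sequence, log_joint = viterbi(["みかん", "です"], vocab, trans, emit)
```

With these tables the path through りんご has joint probability 0.05 × 0.1 × 0.5 × 0.2 = 0.0005, which exceeds the 0.0004 of the path through みかん, so the latent word sequence りんご, です is selected.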

In step S24, the score recalculation unit 24 obtains, for each of the plurality of hypotheses, a speech recognition score using the provisional speech recognition score input from the multiple hypothesis generation unit 20 and the joint probability input from the latent word sequence determination unit 22. The speech recognition score is paired with the hypothesis and input to the first hypothesis determination unit 26.

The speech recognition score can be obtained by adding, with a weight, the provisional speech recognition score and the joint probability, which is the Viterbi log probability. That is, the speech recognition score is obtained by calculating the following formula.
(speech recognition score) = (provisional speech recognition score) + (weight) × (joint probability)

The weight is adjusted manually. For example, speech data for weight adjustment may be prepared, and the weight may be adjusted to a value that improves speech recognition performance. The weight is set, for example, to 5000. As shown in FIG. 4, if the joint probability for the hypothesis "今日が晴れです：-15000" is -8, the speech recognition score is -15000 + 5000 × (-8) = -55000.
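The recalculation for a single hypothesis can be checked with a few lines of Python. The function name rescore is an assumption; the numerical values are those of the FIG. 4 example.

```python
def rescore(provisional_score, log_joint_probability, weight):
    """Add the weighted log joint probability to the provisional score."""
    return provisional_score + weight * log_joint_probability

# Hypothesis "今日が晴れです": provisional score -15000, log joint probability -8.
score = rescore(-15000, -8, 5000)
```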

In step S26, the first hypothesis determination unit 26 determines the speech recognition result for the input speech from the plurality of hypotheses input from the score recalculation unit 24, based on the speech recognition scores. Specifically, the hypotheses are sorted in descending order of speech recognition score, and the hypothesis with the largest speech recognition score is determined to be the first hypothesis. The first hypothesis determined in this way is output as the speech recognition result. In the example of FIG. 4, the speech recognition score of "今日が晴れです" is -55000, that of "今日は晴れです" is -40500, and that of "今日はまれです" is -85000, so the hypothesis "今日は晴れです" is output as the speech recognition result.
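The determination of the first hypothesis might be sketched as a descending sort followed by taking the first element. The variable names are assumptions, and the string paired with -85000 is taken from the hypothesis list given for FIG. 4, which is also an assumption of this sketch.

```python
# (hypothesis, speech recognition score) pairs after recalculation.
rescored = [
    ("今日が晴れです", -55000),
    ("今日は晴れです", -40500),
    ("今日はまれです", -85000),
]

# Sort in descending order of score; the first entry is the first hypothesis.
ranked = sorted(rescored, key=lambda pair: pair[1], reverse=True)
first_hypothesis = ranked[0][0]
```

The hypothesis that ranked first by provisional score alone, 今日が晴れです, is overtaken here by 今日は晴れです once the joint probability is taken into account.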

With this configuration, the speech recognition apparatus 1 of the embodiment can perform speech recognition using the latent word language model at high speed. That is, because speech recognition is performed using an N-gram language model that carries the superior language prediction performance of the latent word language model, higher language prediction performance is obtained than with speech recognition using an N-gram language model constructed by a conventional, general learning method. In addition, because the candidates for the speech recognition result are first narrowed down using the N-gram language model generated on the basis of the latent word language model, and the search with the latent word language model is performed only on those candidates, the amount of computation is greatly reduced compared with simply performing a Viterbi search. This makes speech recognition using the latent word language model feasible even on an ordinary computer.

[Program, recording medium]
The present invention is not limited to the embodiment described above, and it goes without saying that modifications can be made as appropriate without departing from the spirit of the invention. The various processes described in the embodiment need not be executed in time series in the order described; they may also be executed in parallel or individually, according to the processing capability of the apparatus that executes them or as needed.

When the various processing functions of each apparatus described in the above embodiment are realized by a computer, the processing content of the functions that each apparatus should have is described by a program. By executing this program on a computer, the various processing functions of each apparatus are realized on the computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or the program transferred from the server computer in its own storage device. When executing the processing, the computer reads the program stored in its own recording medium and executes processing according to the read program. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it, or the computer may execute processing according to the received program each time the program is transferred to it from the server computer. The processing described above may also be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and acquisition of results. The program in this embodiment includes information that is provided for processing by an electronic computer and is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the processing of the computer).

In this embodiment, the apparatus is configured by executing a predetermined program on a computer, but at least part of the processing content may be realized by hardware.

1 speech recognition device
10 learning text storage unit
12 latent word language model learning unit
14 latent word language model storage unit
16 baseline language model learning unit
18 baseline language model storage unit
20 multiple hypothesis generation unit
22 latent word sequence determination unit
24 score recalculation unit
26 first hypothesis determination unit

Claims

A latent word language model storage unit for storing a latent word language model that has learned the probability distribution of a learning latent word sequence, which is a latent word sequence corresponding to an observed word sequence included in a learning text, and the probability distribution of the observed words in the observed word sequence and the latent words in the latent word sequence;
A baseline language model storage unit for storing a baseline language model in which the probability distribution of the latent word sequence included in the latent word language model and the probability distribution of the observed word sequence are mixed;
A plurality of hypothesis generators for recognizing input speech using the baseline language model and generating hypotheses of a plurality of speech recognition results and provisional speech recognition scores for each hypothesis;
A latent word sequence determining unit that determines a hypothetical latent word sequence that is a latent word sequence corresponding to the hypothesis using the latent word language model, and obtains a joint probability of the hypothesis and the hypothetical latent word sequence;
A score recalculation unit for obtaining a speech recognition score using the provisional speech recognition score and the joint probability;
A first hypothesis determination unit that determines a speech recognition result for the input speech based on the speech recognition score from the plurality of hypotheses;
A speech recognition device.

The speech recognition device according to claim 1,
wherein the latent word language model storage unit stores the latent word language model obtained by learning the N-gram probabilities of a plurality of the learning latent word sequences corresponding to the observed word sequence and the unigram probabilities of the observed words and the latent words.

The speech recognition device according to claim 1 or 2,
wherein the latent word sequence determination unit performs the determination of the hypothetical latent word sequence and the calculation of the joint probability of the hypothesis and the hypothetical latent word sequence in parallel.

The speech recognition apparatus according to any one of claims 1 to 3,
wherein the score recalculation unit obtains the speech recognition score by adding the logarithm of the joint probability, weighted with a predetermined weight, to the provisional speech recognition score.

A latent word language model has learned the probability distribution of a learning latent word sequence, which is a latent word sequence corresponding to an observed word sequence included in a learning text, and the probability distribution of the observed words in the observed word sequence and the latent words in the latent word sequence;
A baseline language model is a mixture of the probability distribution of the latent word sequence included in the latent word language model and the probability distribution of the observed word sequence;
A multiple hypothesis generation step in which a multiple hypothesis generation unit recognizes input speech using the baseline language model and generates hypotheses of a plurality of speech recognition results and a provisional speech recognition score for each hypothesis;
A latent word sequence determination step in which a latent word sequence determination unit determines a hypothetical latent word sequence, which is a latent word sequence corresponding to the hypothesis, using the latent word language model, and obtains a joint probability of the hypothesis and the hypothetical latent word sequence;
A score recalculation step in which a score recalculation unit obtains a speech recognition score using the provisional speech recognition score and the joint probability;
A first hypothesis determination step in which a first hypothesis determination unit determines a speech recognition result for the input speech from the plurality of hypotheses based on the speech recognition score;
A speech recognition method including the above steps.

A program for causing a computer to function as the speech recognition device according to claim 1.