JP6410491B2

JP6410491B2 - Pronunciation dictionary expansion system, expansion program, expansion method, acoustic model learning method, learning program, and learning system using the extended pronunciation dictionary obtained by the expansion method

Info

Publication number: JP6410491B2
Application number: JP2014132404A
Authority: JP
Inventors: 隆輝立花; 雅史西村
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2014-06-27
Filing date: 2014-06-27
Publication date: 2018-10-24
Anticipated expiration: 2034-06-27
Also published as: JP2016011995A

Description

The present invention relates to a technique for improving speech recognition accuracy on speech data that includes slowly and carefully spoken utterances, and more particularly to a technique that improves speech recognition accuracy by improving the pronunciation dictionary.

In large-vocabulary continuous speech recognition (LVCSR) systems, speech in an unusual speaking style is often input, for example an utterance that corrects a misrecognition, or singing. The utterance deformation characteristic of corrective utterances is called hyperarticulation. Examples include separating the syllables and strengthening the intonation, but the most frequently observed form is lengthening the duration of phonemes and syllables, that is, a style in which the speaker talks slowly and carefully.

In a conventional LVCSR system equipped with a probabilistic duration model, when speech in which some phonemes or syllables are lengthened is given as input, a mismatch arises between the probabilistic duration model and the lengthened durations, and the speech can no longer be recognized correctly.

Meanwhile, building an LVCSR system requires a large amount of transcribed text for the speech to be recognized. Preparing such data by hand is very costly. For this reason, text generated automatically by an LVCSR system (Pseudo Truth: PT) is now often used instead. However, the field data from which PT is obtained contains the lengthened speech described above, so it is difficult to generate accurate PT. As a result, it also becomes difficult to build an LVCSR system with high recognition accuracy.

Patent Documents 1 and 2 and Non-Patent Documents 1 to 4 describe conventional techniques for building a speech recognition system that can handle speech in an unusual speaking style.

Patent Document 1 aims to create a language model and an acoustic model that achieve robust speech recognition for emphasized utterances in a language, such as English, in which a syllable break may follow any phoneme. Patent Document 1 discloses a language model conversion device that includes preparation means for preparing a language model of a given language having syllable boundaries, and phoneme description rewriting means for rewriting, for each syllable boundary of the prepared language model, the description of the phonemes adjacent to that boundary into a predetermined form that permits insertion of a short pause. Patent Document 1 also discloses an acoustic model conversion device that includes means for preparing an acoustic model representing a pause between syllables, and phoneme model rewriting means that, referring to the acoustic model representing the pause, rewrites the description of each of a plurality of phoneme models so as to add, at the end of its state sequence, a new state corresponding to the pause and a path that bypasses the new state and reaches the end.

Patent Document 2 aims to provide a speech recognition device and a speech recognition method that can reliably recognize even those recognition vocabulary items in the recognition dictionary for which the user's spoken input is prone to utterance breaks or utterance errors. Patent Document 2 discloses a speech recognition device comprising: a recognition dictionary that stores a plurality of vocabulary items as recognition vocabulary; an analysis unit that determines whether each recognition vocabulary item stored in the recognition dictionary can be divided into a plurality of vocabulary items and divides each divisible item into a plurality of partial recognition vocabulary items; a derived recognition vocabulary generation unit that generates derived recognition vocabulary by adding predetermined phonemes to the partial recognition vocabulary items, replacing them with other vocabulary items, deleting the relevant partial recognition vocabulary items, and recombining the results; a speech input unit that accepts input of speech data; a speech detection unit that detects speech segments in the speech data accepted by the speech input unit; and a speech recognition unit that performs speech recognition on the speech data within the detected speech segments using the recognition vocabulary stored in the recognition dictionary and the derived recognition vocabulary generated by the derived recognition vocabulary generation unit.

To address hyperarticulation, Non-Patent Document 1 discloses a technique that introduces new pronunciations into the vocabulary by generating and consulting a phoneme confusion matrix, a technique that trains an acoustic model on a corpus of corrective utterances, a technique that switches models when hyperarticulation is detected, and a technique that uses an articulatory vector model whose elements are articulatory features.

To address hyperarticulation, Non-Patent Document 2 adopts a multiple-model approach and discloses a technique that combines the decoding result obtained with a model specialized for corrective utterances with the decoding result obtained with a standard model.

Non-Patent Document 3 discloses a technique that adds infants' pronunciations to the pronunciation dictionary so that, when an infant utterance is input, it is output in its standard form by way of the dictionary. It also discloses automatically adding words to the pronunciation dictionary using a pronunciation pattern table, with patterns added in order of occurrence frequency and in order of relative frequency.

Non-Patent Document 4 discloses building an acoustic model and a language model specialized for infant speech in order to improve the accuracy of child speech recognition. Non-Patent Document 4 also discloses a technique that automatically adds words to the pronunciation dictionary using a pronunciation pattern table that records the sounds that change between standard utterances and infant expressions.

However, the methods proposed in Patent Document 1 and Non-Patent Documents 1 and 2 build a new language model or acoustic model specialized for a particular speaking style. They therefore require extensive changes to an existing speech recognition system, which is undesirable in terms of expense and effort. By contrast, the methods proposed in Patent Document 2 and Non-Patent Documents 3 and 4 change only the recognition or pronunciation dictionary used in speech recognition, which is a comparatively easy change and therefore preferable.

However, Patent Document 2 targets recognition vocabulary that is prone to utterance breaks and utterance errors, and Non-Patent Documents 3 and 4 target infant speech, so none of these techniques can be applied as is to utterances in which the duration of phonemes or syllables is lengthened. Non-Patent Document 5 is listed as a reference that explains the discriminative learning used in the embodiments of the present invention.

Patent Document 1: JP 2006-243213 A
Patent Document 2: JP 2011-27971 A

Non-Patent Document 1: Hagen Soltau, "Compensating Hyperarticulation for Automatic Speech Recognition," Karlsruhe Institute of Technology, 2005.
Non-Patent Document 2: Shigeki Matsuda et al., "Speech Recognition System Robust to Noise and Speaking Styles," ICSLP 2004.
Non-Patent Document 3: Izumi Shindo and 4 others, "Study on real environment speech recognition considering infant's peculiar pronunciation variation," Proceedings of the Acoustical Society of Japan, September 2006.
Non-Patent Document 4: Izumi Shindo and 3 others, "Development and Evaluation of Infant Speech Recognition Unit in Public Guidance System," IPSJ Research Report, HI, Human Interface Study Group Report 2007(11), pp. 103-108.
Non-Patent Document 5: Xiaodong He, Li Deng, "Discriminative Learning Speech Recognition: Theory and Practice," Morgan and Claypool Publishers, August 12, 2008.

The present invention has been made in view of the above problems in the prior art. Its object is to provide an expansion method, an expansion system, and an expansion program that expand a pronunciation dictionary so that speech data including utterances with lengthened phoneme or syllable durations can be recognized correctly. A further object of the present invention is to provide a learning method, a learning system, and a learning program that train an acoustic model using the expanded pronunciation dictionary.

To solve the above problems of the prior art, the present invention provides a method, having the following features, for expanding a pronunciation dictionary that contains a plurality of vocabulary items as recognition vocabulary. The method includes the steps of: (a) a computer reading the pronunciation of each vocabulary item from the pronunciation dictionary; (b) the computer generating a new pronunciation (hereinafter, "stretched pronunciation") by lengthening a sonorant contained in the pronunciation that was read; (c) the computer determining whether a vocabulary item having the same pronunciation as the stretched pronunciation exists in the pronunciation dictionary; and (d) the computer expanding the pronunciation dictionary with the stretched pronunciation in response to a determination that no such item exists.
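
Steps (a) to (d) can be pictured with the following minimal Python sketch. It is an illustration under stated assumptions, not the claimed implementation: the dictionary is modeled as a list of (spelling, pronunciation) pairs, the function name `expand_dictionary` is hypothetical, and the stretching rule is passed in as a callable because step (b) is refined separately in the text. Recording each newly added pronunciation so that later candidates are checked against it is also a choice made for this sketch.

```python
from typing import Callable, Iterable

Token = tuple[str, str]  # (spelling, pronunciation), e.g. ("京都", "ky.o.O.t.o")


def expand_dictionary(
    tokens: list[Token],
    stretch: Callable[[str], Iterable[str]],
) -> list[Token]:
    """Return the dictionary plus the stretched pronunciations that do not collide."""
    # Every pronunciation currently in the dictionary, regardless of spelling.
    existing = {pron for _, pron in tokens}
    added: list[Token] = []
    for spelling, pron in tokens:            # step (a): read each pronunciation
        for new_pron in stretch(pron):       # step (b): generate stretched pronunciations
            if new_pron in existing:         # step (c): same pronunciation already present?
                continue
            added.append((spelling, new_pron))  # step (d): extend the dictionary
            existing.add(new_pron)           # sketch choice: later candidates see this one
    return tokens + added
```

Passing the stretching rule as an argument keeps the collision check in step (c) independent of how the sonorant is lengthened.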

Preferably, when the sonorant is a vowel, it is lengthened by inserting the corresponding long vowel after the short vowel or by repeating the same short vowel. Alternatively or additionally, when the sonorant is a nasal, it is lengthened by repeating the nasal.
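
As a rough sketch of this lengthening rule, the Python function below produces one stretched pronunciation per sonorant in a dot-separated phoneme string. Several points are assumptions made for illustration only: the phoneme inventory (lowercase short vowels, uppercase long vowels, "N" for the nasal), the name `stretch_variants`, and the decision to lengthen one sonorant at a time rather than all of them at once.

```python
# Assumed inventory: lowercase = short vowel, uppercase = corresponding long vowel.
SHORT_TO_LONG = {"a": "A", "i": "I", "u": "U", "e": "E", "o": "O"}
NASALS = {"N"}  # assumed symbol for a nasal phoneme


def stretch_variants(pron: str, use_long_vowel: bool = True) -> list[str]:
    """Lengthen each sonorant in turn and return the resulting pronunciations."""
    phones = pron.split(".")
    variants = []
    for i, phone in enumerate(phones):
        if phone in SHORT_TO_LONG:
            # Short vowel: insert the long vowel after it, or repeat the short vowel.
            extra = SHORT_TO_LONG[phone] if use_long_vowel else phone
        elif phone in NASALS:
            # Nasal: repeat the nasal.
            extra = phone
        else:
            continue  # non-sonorant (or already long vowel) is left unchanged
        variants.append(".".join(phones[: i + 1] + [extra] + phones[i + 1:]))
    return variants
```

For example, under these assumptions `stretch_variants("s.u.sh.i")` yields `["s.u.U.sh.i", "s.u.sh.i.I"]`.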

Preferably, in step (b), the computer generates the stretched pronunciation only on condition that, for the vocabulary item whose pronunciation was read, the occurrence frequency of both its spelling and its pronunciation exceeds the respective predetermined value.

More preferably, in step (a), the vocabulary items whose pronunciations are read exclude one-syllable vocabulary items.

Preferably, the method for expanding the pronunciation dictionary includes a step in which the computer deletes from the pronunciation dictionary those stretched pronunciations that have a low correct-answer rate, based on the results of speech recognition performed with the expanded pronunciation dictionary.
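
The deletion step might look like the sketch below. It assumes, purely for illustration, that recognition results have already been summarized as a (correct, total) count per stretched token; the names `prune_stretched` and `min_accuracy`, the default threshold, and the choice to keep stretched tokens that were never observed are all hypothetical.

```python
Token = tuple[str, str]  # (spelling, pronunciation)


def prune_stretched(
    dictionary: list[Token],
    stretched: set[Token],
    stats: dict[Token, tuple[int, int]],
    min_accuracy: float = 0.5,
) -> list[Token]:
    """Drop stretched tokens whose observed correct-answer rate is below min_accuracy."""
    kept = []
    for token in dictionary:
        if token in stretched:
            correct, total = stats.get(token, (0, 0))
            # Only tokens with observations can be judged; unseen ones are kept.
            if total > 0 and correct / total < min_accuracy:
                continue
        kept.append(token)  # original entries are never removed
    return kept
```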

The present invention has been described above as a method for expanding a pronunciation dictionary. However, the present invention can also be understood as a pronunciation dictionary expansion program that causes a computer to execute each step of such a method, and as a pronunciation dictionary expansion system realized by installing that program on a computer.

To solve the above problems of the prior art, the present invention also provides a method, having the following features, for training an acoustic model. The acoustic model learning method includes the steps of: (a) a computer reading the pronunciation of each vocabulary item from a pronunciation dictionary; (b) the computer generating a new pronunciation (hereinafter, "stretched pronunciation") by lengthening a sonorant contained in the pronunciation that was read; (c) the computer expanding the pronunciation dictionary with the stretched pronunciation on condition that no vocabulary item having the same pronunciation as the stretched pronunciation exists in the pronunciation dictionary; (d) the computer obtaining recognition results for speech data recognized using the expanded pronunciation dictionary; and (e) the computer training the acoustic model using the recognition results for the speech data as training data.

Preferably, the learning in step (e) is discriminative learning, and step (e) includes a step in which the computer performs alignment of the recognition results for the speech data without using the stretched pronunciations.

More preferably, step (d) includes a step in which the computer creates a list of the utterances that were recognized using a stretched pronunciation, and step (e) includes a step in which the computer weights the training data by referring to that list of utterances.
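
A small Python sketch of the utterance list and the weighting follows. The data layout (recognition results as a mapping from an utterance ID to the tokens used), the function names, and the weight values are assumptions for illustration. In particular, the text here does not state how large the weights are or in which direction they are applied, so the default of 2.0 for listed utterances is only a placeholder.

```python
Token = tuple[str, str]  # (spelling, pronunciation)


def utterances_using_stretched(
    results: dict[str, list[Token]],
    stretched_tokens: set[Token],
) -> set[str]:
    """List the utterances whose recognition result contains a stretched pronunciation."""
    return {
        utt_id
        for utt_id, tokens in results.items()
        if any(token in stretched_tokens for token in tokens)
    }


def training_weights(
    utterance_ids: list[str],
    stretched_utterances: set[str],
    stretched_weight: float = 2.0,  # placeholder value, not taken from the source
    default_weight: float = 1.0,
) -> dict[str, float]:
    """Assign a per-utterance weight by looking each utterance up in the list."""
    return {
        utt_id: stretched_weight if utt_id in stretched_utterances else default_weight
        for utt_id in utterance_ids
    }
```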

The present invention has been described above as an acoustic model learning method. However, the present invention can also be understood as an acoustic model learning program that causes a computer to execute each step of such a learning method, and as an acoustic model learning system realized by installing that program on a computer.

In the pronunciation dictionary expansion method configured as described above, for each vocabulary item in the pronunciation dictionary, a sonorant contained in its pronunciation is lengthened to generate a new pronunciation, that is, a stretched pronunciation, and the generated stretched pronunciation is added to the pronunciation dictionary on condition that no vocabulary item with the same pronunciation exists in the dictionary. This expansion method makes it possible to expand the pronunciation dictionary without the expansion itself causing new misrecognitions.

Furthermore, recognizing speech with a pronunciation dictionary expanded by the above method makes it possible to correctly recognize speech data that includes utterances with lengthened phoneme or syllable durations. In addition, because the acoustic model learning method configured as described above trains the acoustic model using such correctly recognized results as training data, it becomes possible to build an acoustic model that is robust to speech data including slowly and carefully spoken utterances. Other effects of the present invention will be understood from the description of the embodiments.

FIG. 1 is a diagram showing an example of the hardware configuration of an information processing apparatus suitable for implementing the pronunciation dictionary expansion system according to the present embodiment.
FIG. 2 is a diagram showing an example of a functional block diagram of the pronunciation dictionary expansion system according to the present embodiment.
FIG. 3(a) is a diagram showing an example of a pronunciation dictionary. FIG. 3(b) is a diagram explaining the structure of one entry in the pronunciation dictionary. FIG. 3(c) is a diagram showing an example of a language model.
FIG. 4 is a diagram explaining examples of words for which the addition of a stretched pronunciation to the pronunciation dictionary is suppressed.
FIG. 5 is a flowchart showing an example of the flow of the pronunciation dictionary expansion process according to the present embodiment.
FIG. 6 is a diagram showing an example of a functional block diagram of the acoustic model learning system according to the present embodiment.
FIG. 7(a) is a diagram explaining an example of alignment using a stretched pronunciation. FIG. 7(b) is a diagram explaining an example of alignment using the normal pronunciation.
FIG. 8 is a flowchart showing an example of the flow of the acoustic model learning process according to the present embodiment.
FIG. 9(a) is a diagram showing an example of a first experimental result of speech recognition using the present invention. FIG. 9(b) is a diagram showing an example of a second experimental result of speech recognition using the present invention.
FIG. 10 is a diagram showing the relationship between the dictionary size reduction and the error reduction rate for each option that suppresses the addition of stretched pronunciations.

Embodiments for carrying out the present invention are described in detail below with reference to the drawings. The following embodiments do not limit the invention as set out in the claims, and not all combinations of the features described in the embodiments are essential to the solution provided by the invention. The same elements are given the same reference numbers throughout the description of the embodiments.

FIG. 1 shows an exemplary hardware configuration of a computer 100 for carrying out the present invention. The computer 100 includes a CPU 102 and a main memory 104, which are connected to a bus 106. The CPU 102 is preferably based on a 32-bit or 64-bit architecture. The CPU 102 may be, for example, one of Intel's Core (TM) i series, Core (TM) 2 series, Atom (TM) series, Xeon (R) series, Pentium (R) series, or Celeron (R) series; one of AMD's (Advanced Micro Devices) A series, Phenom (TM) series, Athlon (TM) series, Turion (TM) series, or Sempron (TM); or the Power (TM) series of International Business Machines Corporation.

A display 110, for example a liquid crystal display (LCD), can be connected to the bus 106 via a display controller 108. The liquid crystal display (LCD) may be, for example, a touch panel display or a floating touch display. The display 110 can be used to show, through a suitable graphical interface, information output by software running on the computer 100.

Optionally, a storage device 114, for example a hard disk drive, and a drive 116, for example a CD, DVD, or BD drive, can be connected to the bus 106 via, for example, a SATA or IDE controller 112.

Optionally, a keyboard 120 and a mouse 122 can be connected to the bus 106 via a peripheral device controller 118, for example via a keyboard/mouse controller or a USB bus.

The storage device 114 can store, so that they can be loaded into the main memory 104: an operating system such as Windows (registered trademark) OS, UNIX (registered trademark), or MacOS (registered trademark); programs that provide a Java (registered trademark) processing environment such as J2EE, Java (registered trademark) applications, a Java (registered trademark) virtual machine (VM), and a Java (registered trademark) just-in-time (JIT) compiler; the computer program according to the present embodiment; other programs; and data.

The storage device 114 may be built into the computer 100, may be connected by a cable so that the computer 100 can access it, or may be connected over a wired or wireless network so that the computer 100 can access it.

The drive 116 can be used, as needed, to install a computer program such as an operating system or an application onto the storage device 114 from a CD-ROM, DVD-ROM, or BD 117. The computer program can also be compressed, or divided into several parts and recorded on several media.

The communication interface 126 follows, for example, the Ethernet (registered trademark) protocol. The communication interface 126 is connected to the bus 106 via a communication controller 124, serves to connect the computer 100 to a communication line 128 by wire or wirelessly, and provides the network interface layer for the TCP/IP communication protocol of the communication function of the operating system of the computer 100. The communication line may be, for example, a wired LAN environment based on a wired LAN connection standard, a wireless LAN environment based on a wireless LAN connection standard such as a Wi-Fi wireless LAN environment using IEEE802.11a/b/g/n, or a mobile phone network environment (for example, a 3G or 4G (including LTE) environment).

The computer 100 can receive data from other computers via the communication line 128 and store it on the storage device 114.

From the above description, it will be readily understood that the computer 100 can be realized by an information processing apparatus such as an ordinary personal computer, workstation, or mainframe, or by a combination of these. The components described above are examples, and not all of them are essential components of the present invention. Likewise, the computer 100 for carrying out the present invention may of course include other components such as a speaker.

FIG. 2 shows an example of a functional block diagram of the pronunciation dictionary expansion system 200 according to the present embodiment. For each word in the pronunciation dictionary that satisfies predetermined conditions, the pronunciation dictionary expansion system 200 generates a new pronunciation, that is, a stretched pronunciation, by lengthening a sonorant contained in the word's pronunciation, and expands the pronunciation dictionary by adding the generated stretched pronunciation to it. The pronunciation dictionary expansion system 200 includes a pronunciation dictionary storage unit 202, a reading unit 204, a frequency confirmation unit 206, a language model learning corpus 208, a PT data storage unit 210, a new pronunciation generation unit 212, a same-pronunciation confirmation unit 214, a candidate list storage unit 216, and a verification unit 218. The pronunciation dictionary storage unit 202, the language model learning corpus 208, the PT data storage unit 210, the candidate list storage unit 216, and the expanded pronunciation dictionary storage unit 226 included in the verification unit 218 may be physically the same storage device or may be a plurality of storage devices. Each component is described below.

The pronunciation dictionary storage unit 202 stores the pronunciation dictionary used in the LVCSR system. The pronunciation dictionary defines the vocabulary (words) to be recognized and their pronunciations (phonological information). The dictionary 300 shown in FIG. 3(a) is an example of a Japanese pronunciation dictionary. As shown in FIG. 3(a), one entry in the pronunciation dictionary is a pair consisting of a spelling written in mixed kana and kanji and a pronunciation; in this specification such a pair is called a token.

A token is described with reference to FIG. 3(b). The token "京都/ky.o.O.t.o" 300 shown in FIG. 3(b) consists of the spelling "京都" 312 and the pronunciation "ky.o.O.t.o" 314. The pronunciation 314 is a sequence of phonemes made up of a combination of vowels, consonants, and nasals. Taking the pronunciation "ky.o.O.t.o" 314 as an example, the phoneme "ky" 316a and the phoneme "t" 316d are consonants, the phonemes "o" 316b and 316e are short vowels, and the phoneme "O" 316c is a long vowel. As shown in FIG. 3(c), in the language model 320, tokens with the same spelling share an occurrence probability. For example, the occurrence probability of the token "寿司/s.u.sh.i" and that of the token "寿司/z.u.sh.i" are both 0.0012.
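
The token structure and the shared occurrence probability can be illustrated with a short Python sketch. The class and function names are hypothetical, and representing the language model as a mapping keyed by spelling is an assumption made to show why two tokens with the same spelling receive the same probability. The value 0.0012 is the one quoted in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One dictionary entry: a spelling paired with a dot-separated pronunciation."""
    spelling: str
    pronunciation: str

    def phonemes(self) -> tuple[str, ...]:
        # "ky.o.O.t.o" -> ("ky", "o", "O", "t", "o")
        return tuple(self.pronunciation.split("."))


def occurrence_probability(token: Token, language_model: dict[str, float]) -> float:
    # The lookup uses the spelling only, so pronunciation variants share one value.
    return language_model[token.spelling]


language_model = {"寿司": 0.0012}
sushi_a = Token("寿司", "s.u.sh.i")
sushi_b = Token("寿司", "z.u.sh.i")
```

Because the lookup ignores the pronunciation, adding a stretched pronunciation under an existing spelling would need no new language model entry in this sketch.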

The reading unit 204 reads the pronunciation of each word from the pronunciation dictionary stored in the pronunciation dictionary storage unit 202. However, the reading unit 204 reads pronunciations only for the words that remain after one-mora (one-syllable) words are excluded. The pronunciation dictionary contains one-mora words such as "歯", "へ", and "を". If stretched pronunciations were generated for such one-mora words by the processing described later and added to the pronunciation dictionary, the harm from the addition would outweigh its benefit and recognition accuracy would degrade considerably. The reading unit 204 therefore reads from the pronunciation dictionary only the words other than one-mora (one-syllable) words.
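
A rough Python sketch of this exclusion is shown below. The text does not say how the number of morae is determined, so the counting rule here is an assumption for illustration: every vowel symbol (short or long) and an assumed nasal symbol "N" is counted as one mora. The function names and the example pronunciation "h.a" for "歯" are likewise hypothetical.

```python
# Assumed mora-bearing symbols: short vowels, long vowels, and a nasal "N".
MORAIC = set("aiueoAIUEO") | {"N"}


def mora_count(pronunciation: str) -> int:
    """Approximate the number of morae in a dot-separated phoneme string."""
    return sum(1 for phone in pronunciation.split(".") if phone in MORAIC)


def readable_tokens(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only tokens with more than one mora, as the reading unit does."""
    return [token for token in tokens if mora_count(token[1]) > 1]
```

A production system would more likely take the mora count from the kana reading of each word; the phoneme-based count above only stands in for that.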

For each word read by the reading unit 204, the frequency confirmation unit 206 confirms that the occurrence frequency of both its notation and its pronunciation exceeds the corresponding predetermined threshold. More specifically, for the notation of each word, the frequency confirmation unit 206 counts the number of occurrences in the corpus used for language model training and determines whether the occurrence frequency exceeds a first predetermined threshold. If a language model has already been built from that corpus, the frequency confirmation unit 206 may instead make the determination by referring to the occurrence probability in the language model. The corpus for language model training, or the language model, is prepared in advance in the language model storage unit 208.

The frequency confirmation unit 206 also counts, for each word, the number of occurrences of its pronunciation in the PT data and the total number of occurrences of its notation (including occurrences of tokens with the same notation but a different pronunciation), and determines whether the ratio of the former to the latter exceeds a second predetermined threshold. Here, PT data refers to pairs of a PT, which is the result of automatic recognition of user speech, and the corresponding speech data. The PT data is prepared in advance in the PT data storage unit 210. Finally, the frequency confirmation unit 206 passes the words whose notation frequency exceeds the first predetermined threshold and whose pronunciation frequency exceeds the second predetermined threshold to the new pronunciation generation unit 212 for the processing described later.
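The two-stage frequency check described above can be sketched as follows. This is only a minimal illustration, not the patented implementation: the function name `select_tokens`, the data layout (a dictionary of language model probabilities keyed by notation, and PT counts keyed by notation and pronunciation), and the threshold values and counts in the usage lines are all assumptions made for the example.

```python
from collections import Counter


def select_tokens(tokens, notation_prob, pt_counts, threshold_a, threshold_b):
    """Return the (notation, pronunciation) tokens that pass both checks.

    tokens        : iterable of (notation, pronunciation) pairs
    notation_prob : hypothetical map notation -> language model probability
    pt_counts     : hypothetical map (notation, pronunciation) -> count in PT data
    """
    # Total PT occurrences of each notation, over all of its pronunciations.
    notation_totals = Counter()
    for (notation, _pron), count in pt_counts.items():
        notation_totals[notation] += count

    selected = []
    for notation, pron in tokens:
        # First check: the notation itself must be frequent enough.
        if notation_prob.get(notation, 0.0) <= threshold_a:
            continue
        # Second check: this pronunciation must account for a large enough
        # share of all PT occurrences of the notation.
        total = notation_totals[notation]
        if total == 0:
            continue
        if pt_counts.get((notation, pron), 0) / total > threshold_b:
            selected.append((notation, pron))
    return selected


# Illustrative values only; the counts and thresholds are made up.
tokens = [("寿司", "s.u.sh.i"), ("寿司", "z.u.sh.i")]
notation_prob = {"寿司": 0.0012}
pt_counts = {("寿司", "s.u.sh.i"): 95, ("寿司", "z.u.sh.i"): 5}
passed = select_tokens(tokens, notation_prob, pt_counts, 0.001, 0.5)
```

With these made-up numbers both tokens share the same language model probability, but only "寿司/s.u.sh.i" passes the second check, which mirrors the sushi example given for FIG. 4.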

The frequency confirmation unit 206 performs this filtering for the following reason. The present invention attempts to build an LVCSR system that is robust to speech data containing utterances in which phoneme or syllable durations are stretched, and does so by extending the pronunciation dictionary. However, as the number of words to be recognized grows, computation time at runtime increases and the recognition rate tends to fall. The stretched pronunciations added to the pronunciation dictionary should therefore be limited to those expected to improve the recognition rate. Because slow, careful speech is a special style that differs from ordinary speech, a given word is uttered in that style less often than in the ordinary style. Accordingly, among the tokens contained in the pronunciation dictionary before extension, that is, tokens in the ordinary speaking style, only those whose notation occurs frequently are made targets of the new pronunciation generation unit 212 described later. Moreover, even when a notation occurs frequently, if a particular pronunciation of it is infrequent, the harm of adding its stretched form outweighs the benefit. The frequency confirmation unit 206 therefore passes to the new pronunciation generation unit 212 only those tokens whose notation and pronunciation both have occurrence frequencies exceeding their respective predetermined thresholds.

Examples of words for which the addition of a stretched pronunciation to the pronunciation dictionary is suppressed are described with reference to FIG. 4. In the chart shown in FIG. 4, the horizontal axis is the relative occurrence frequency of the pronunciation in the PT data, and the vertical axis is the language model probability. Words in region 406 are suppressed because their occurrence probability in the language model is low, and words in region 408 are suppressed because the relative occurrence frequency of the pronunciation for the same notation is low. For example, the notation "寿司" is frequently pronounced "s.u.sh.i" but rarely pronounced "z.u.sh.i". Thus, although the tokens "寿司/s.u.sh.i" and "寿司/z.u.sh.i" have equally high language model probabilities, the frequency confirmation unit 206 passes only the token "寿司/s.u.sh.i" to the new pronunciation generation unit 212 as a processing target. The token "雨/a.m.a" indicated by reference number 402 and the token "外/s.o.t.o" indicated by reference number 403 are words for which the addition of a stretched pronunciation is suppressed by the processing of the same pronunciation confirmation unit 214 described later. The token "大/o.o" indicated by reference number 404 is a word for which the addition is suppressed by the processing of the verification unit 218 described later.

The new pronunciation generation unit 212 generates a stretched pronunciation by stretching the sonorants contained in the pronunciation of a token passed from the frequency confirmation unit 206. It then pairs the generated stretched pronunciation with the same notation as the original token and passes the pair as a new token to the same pronunciation confirmation unit 214 described later. When the sonorant is a vowel, the stretching may be performed either by inserting the corresponding long vowel after the short vowel or by repeating the same short vowel after the short vowel. When the sonorant is a nasal, the stretching may be performed by repeating the nasal.

Sonorant stretching is explained using the token "寿司/s.u.sh.i" as an example. In the pronunciation "s.u.sh.i", "u" and "i" are short vowels. The corresponding long vowel "U" is inserted after the short vowel "u", and likewise the corresponding long vowel "I" is inserted after the short vowel "i". The stretched pronunciation based on "s.u.sh.i" is thus "s.u.U.sh.i.I". Next, take the token "忍者/n.i.N.j.a" (ninja). In the pronunciation "n.i.N.j.a", "i" and "a" are short vowels and "N" is a nasal. The corresponding long vowel is inserted after each of the short vowels "i" and "a", and the same nasal is inserted after the nasal "N". The stretched pronunciation based on "n.i.N.j.a" is given as "n.i.I.N.N.j.a.a". Finally, take the token "京都/ky.o.O.t.o". In the pronunciation "ky.o.O.t.o", the first and second "o" are short vowels, but the first "o" is already followed by the long vowel "O". Nothing is therefore done for the first short vowel "o", and the long vowel "O" is inserted only after the second "o". The stretched pronunciation based on "ky.o.O.t.o" is thus "ky.o.O.t.o.O".
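The stretching rule can be illustrated with a short sketch. It rests on several assumptions that are not stated in the text: pronunciations are dot-separated strings, a long vowel is written as the upper-case form of the short vowel, and only "N" is treated as a nasal to be repeated (in the examples, "n" and "m" are left untouched). The sketch always uses the long-vowel insertion variant, so for "n.i.N.j.a" it yields "n.i.I.N.N.j.a.A" rather than the "a.a" ending listed above, which corresponds to the alternative of repeating the short vowel. The name `stretch` is hypothetical.

```python
# Assumed phoneme inventory for illustration: short vowel -> long vowel.
SHORT_TO_LONG = {"a": "A", "i": "I", "u": "U", "e": "E", "o": "O"}
# Assumed set of nasals that are repeated when stretched.
NASALS = {"N"}


def stretch(pronunciation):
    """Return a stretched variant of a dot-separated phoneme string."""
    phonemes = pronunciation.split(".")
    out = []
    for idx, ph in enumerate(phonemes):
        out.append(ph)
        nxt = phonemes[idx + 1] if idx + 1 < len(phonemes) else None
        if ph in SHORT_TO_LONG:
            long_vowel = SHORT_TO_LONG[ph]
            # Skip the insertion when the long vowel already follows,
            # as with the first "o" in "ky.o.O.t.o".
            if nxt != long_vowel:
                out.append(long_vowel)
        elif ph in NASALS:
            # A nasal is stretched by repeating it.
            out.append(ph)
    return ".".join(out)


sushi = stretch("s.u.sh.i")
kyoto = stretch("ky.o.O.t.o")
```

Under these assumptions, `sushi` and `kyoto` reproduce the stretched pronunciations given in the text for "寿司" and "京都".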

For each stretched pronunciation generated by the new pronunciation generation unit 212, the same pronunciation confirmation unit 214 determines whether a token with the same pronunciation already exists in the pronunciation dictionary. It then keeps only the stretched pronunciations for which no such token exists and stores them in the candidate list storage unit 216 as candidates for addition to the pronunciation dictionary.

This processing by the same pronunciation confirmation unit 214 prevents misrecognition by the LVCSR system at runtime. If a token with the same pronunciation as a stretched pronunciation already exists in the pronunciation dictionary and the stretched pronunciation is added anyway, an utterance can be misrecognized at runtime even though it was pronounced correctly. For example, suppose that the stretched pronunciation "Y.u.U.m.e.E" is generated for the token "夢/Y.u.m.e" (dream). If the token "有名/Y.u.U.m.e.E" (famous) exists in the pronunciation dictionary, the LVCSR system may output "夢" even though the input speech was "有名/Y.u.U.m.e.E". This function of the same pronunciation confirmation unit 214 is important for preventing such cases.
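A minimal sketch of this collision check is shown below, assuming the pronunciation dictionary is available as a list of (notation, pronunciation) pairs. The function name `filter_homophones` and the sample entries other than the dream/famous pair are hypothetical.

```python
def filter_homophones(candidates, dictionary):
    """Keep only candidates whose pronunciation is not already in the dictionary.

    candidates : list of (notation, stretched_pronunciation) pairs
    dictionary : list of (notation, pronunciation) pairs before extension
    """
    existing = {pron for _notation, pron in dictionary}
    return [(notation, pron) for notation, pron in candidates
            if pron not in existing]


dictionary = [("夢", "Y.u.m.e"), ("有名", "Y.u.U.m.e.E"), ("寿司", "s.u.sh.i")]
candidates = [("夢", "Y.u.U.m.e.E"), ("寿司", "s.u.U.sh.i.I")]
kept = filter_homophones(candidates, dictionary)
```

Here the stretched form of "夢" is dropped because it collides with "有名", while the stretched form of "寿司" is kept as a candidate.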

The verification unit 218 verifies the validity of the stretched pronunciations stored as addition candidates in the candidate list storage unit 216. For this purpose it includes a decoding unit 220, a removal unit 222, a test data storage unit 224, and an extended pronunciation dictionary storage unit 226. The test data storage unit 224 stores pairs of test speech data and reference (correct) text data. The extended pronunciation dictionary storage unit 226 stores, as its initial value, a copy of the pronunciation dictionary stored in the pronunciation dictionary storage unit 202.

The decoding unit 220 takes as input the test speech data read from the test data storage unit 224 and decodes it with reference to the pronunciation dictionary stored in the pronunciation dictionary storage unit 202 and the candidate stretched pronunciations stored in the candidate list storage unit 216. Decoding of speech data is a known technique and is not the gist of the present invention, so its description is omitted.

The removal unit 222 receives the decoding results from the decoding unit 220, compares them with the reference text data stored in the test data storage unit 224, and marks each decoding result as correct or incorrect. For each candidate stretched pronunciation, the removal unit 222 also calculates the ratio of the number of correct results to the total number of times that stretched pronunciation was decoded. The removal unit 222 then takes, from the candidates stored in the candidate list storage unit 216, only the stretched pronunciations whose ratio exceeds a predetermined threshold and adds them to the pronunciation dictionary stored in the extended pronunciation dictionary storage unit 226.
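The per-candidate accuracy check can be sketched as follows. The input format is an assumption: one record per occasion on which a candidate stretched pronunciation was used in decoding, together with a flag saying whether that result matched the reference text. The function name `accepted_candidates` and the sample records are illustrative only.

```python
from collections import defaultdict


def accepted_candidates(decode_results, threshold_c):
    """Return the candidates whose correct/total ratio exceeds threshold_c.

    decode_results : iterable of (candidate, is_correct) records, one per
                     decoded occurrence of a candidate stretched pronunciation
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for candidate, is_correct in decode_results:
        totals[candidate] += 1
        if is_correct:
            correct[candidate] += 1
    # Candidates never used in decoding have no ratio and are not accepted.
    return {c for c in totals if correct[c] / totals[c] > threshold_c}


# Made-up decoding records for two candidates.
records = [
    (("寿司", "s.u.U.sh.i.I"), True),
    (("寿司", "s.u.U.sh.i.I"), True),
    (("寿司", "s.u.U.sh.i.I"), False),
    (("大", "o.O.o.O"), False),
    (("大", "o.O.o.O"), True),
]
accepted = accepted_candidates(records, 0.6)
```

With a threshold of 0.6, the first candidate (2 of 3 correct) is accepted and the second (1 of 2 correct) is rejected.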

Note that the processing by the frequency confirmation unit 206 and the verification unit 218 described above is optional, and configurations that omit this processing are also within the technical scope of the present invention. To simplify the description of the evaluation experiments below, the function of the new pronunciation generation unit 212 is named A1, the function of the same pronunciation confirmation unit 214 is named A2, the part of the frequency confirmation unit 206 that processes the notation is named A3, the part that processes the pronunciation is named A4, and the function of the verification unit 218 is named A5.

Next, an example of the flow of the pronunciation dictionary extension processing performed by the pronunciation dictionary extension system 200 according to the present embodiment is described with reference to FIG. 5. The processing starts at step 500, where the pronunciation dictionary extension system 200 reads a token from the pronunciation dictionary. The system then determines whether the occurrence frequency of the notation of the read token in the corpus for language model training exceeds a predetermined threshold A (step 502). If the occurrence frequency of the notation is equal to or less than the threshold A (step 502: NO), the processing returns to step 500 and the system reads the next token from the pronunciation dictionary.

If, on the other hand, the occurrence frequency of the notation of the read token exceeds the threshold A (step 502: YES), the processing proceeds to step 504, where the pronunciation dictionary extension system 200 determines whether the occurrence frequency of the pronunciation of the read token in the PT data exceeds a predetermined threshold B. If the occurrence frequency of the pronunciation is equal to or less than the threshold B (step 504: NO), the processing returns to step 500 and the system reads the next token from the pronunciation dictionary.

If, on the other hand, the occurrence frequency of the pronunciation of the read token exceeds the threshold B (step 504: YES), the processing proceeds to step 506, where the pronunciation dictionary extension system 200 generates a stretched pronunciation by stretching the sonorants contained in the pronunciation of the read token. The system then determines whether the pronunciation dictionary contains a token with the same pronunciation as the generated stretched pronunciation (step 508). If such a token exists (step 508: YES), the processing returns to step 500 and the system reads the next token from the pronunciation dictionary.

If, on the other hand, the pronunciation dictionary contains no token with the same pronunciation as the generated stretched pronunciation (step 508: NO), the processing proceeds to step 510, where the pronunciation dictionary extension system 200 adds the generated stretched pronunciation to the list of candidates for addition to the pronunciation dictionary. The system then determines whether any unexamined tokens remain in the pronunciation dictionary (step 512). If unexamined tokens remain (step 512: YES), the processing returns to step 500 and the system reads the next token from the pronunciation dictionary.

If, on the other hand, no unexamined tokens remain in the pronunciation dictionary (step 512: NO), the processing proceeds to step 514, where the pronunciation dictionary extension system 200 decodes the test data with reference to the candidate list and calculates the accuracy rate of each stretched pronunciation by comparing the decoding results with the reference text data. The system then extends the pronunciation dictionary by adding to it the stretched pronunciations whose accuracy rate exceeds a predetermined threshold C (step 516). The processing then ends.

FIG. 6 shows an example of a functional block diagram of the acoustic model training system according to the present embodiment. The acoustic model training system 600 generates PT data by automatically recognizing speech data with reference to the extended pronunciation dictionary obtained by the method described above, and trains an acoustic model by discriminative training using the generated PT data as training data. The acoustic model training system 600 includes a speech data storage unit 602, a decoder 604, a first recognition result storage unit 606, an alignment execution unit 608, a second recognition result storage unit 610, a weighting execution unit 612, a third recognition result storage unit 614, and a training unit 616. The pronunciation dictionary storage unit 202 and the extended pronunciation dictionary storage unit 226 shown in FIG. 6 are the same as those shown in FIG. 2. The pronunciation dictionary storage unit 202, the extended pronunciation dictionary storage unit 226, and the first to third recognition result storage units 606, 610, and 614 may be physically the same storage device or may be a plurality of storage devices. Each component is described below.

The speech data storage unit 602 stores speech data of the target domain (field data).

The decoder 604 reads speech data from the speech data storage unit 602 and decodes it with reference to the extended pronunciation dictionary stored in the extended pronunciation dictionary storage unit 226. In doing so, the decoder 604 also creates a list of the decoding results that were obtained by reference to a stretched pronunciation and passes the list to the weighting execution unit 612 described later. The decoder 604 stores the decoding results, that is, the PT data, in the first recognition result storage unit 606. As noted above, decoding of speech data is a known technique and is not the gist of the present invention, so its description is omitted.

When speech data is decoded with a conventional pronunciation dictionary, utterances with stretched phoneme or syllable durations are not recognized correctly, so the quality of the PT data has been maintained by discarding such utterances in advance. However, an acoustic model built from PT data generated in that way cannot correctly recognize slow, careful utterances. The acoustic model training system 600 according to the present embodiment solves this problem by using the extended pronunciation dictionary, which makes it possible to recognize utterances with stretched phoneme or syllable durations correctly.

The alignment execution unit 608 aligns the PT data stored in the first recognition result storage unit 606 with reference to the pronunciation dictionary stored in the pronunciation dictionary storage unit 202, and stores the alignment results in the second recognition result storage unit 610. Ordinarily, PT data is aligned using the pronunciation dictionary that was referenced when the PT data was obtained. Here, however, because the training unit 616 described later trains the acoustic model by discriminative training, the alignment is performed with reference to the normal pronunciation dictionary rather than the extended pronunciation dictionary. If the pronunciation dictionary contained stretched pronunciations during discriminative training, the model would be trained to discriminate even between a normal pronunciation and its stretched counterpart, although both are correct. To avoid this waste, the alignment execution unit 608 performs alignment using the normal pronunciation dictionary.

FIGS. 7(a) and 7(b) show alignments for the decoding result "寿司". FIG. 7(a) shows the alignment using the stretched pronunciation "s.u.U.sh.i.I", and FIG. 7(b) shows the alignment using the normal pronunciation "s.u.sh.i". "SIL" denotes silence. An alignment with the normal pronunciation, as in FIG. 7(b), can be obtained by referring to the normal pronunciation dictionary as described above. It can also be obtained by removing the stretched pronunciations from the extended pronunciation dictionary before running the alignment program, or by converting a recognition result obtained with the extended pronunciation dictionary (for example, the sentence "寿司 (s.u.U.sh.i.I) を (o) 食べる (t.a.b.e.r.u)") into the recognition result that would be obtained with the normal pronunciation dictionary ("寿司 (s.u.sh.i) を (o) 食べる (t.a.b.e.r.u)") before running the alignment program. Alternatively, the alignment with the normal pronunciation may be obtained from the alignment of the stretched pronunciation by merging each stretched portion back into the original phoneme (in the example of FIG. 7(a), merging "u.U" into the original "u" and "i.I" into the original "i").
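The last option, merging stretched segments back into the original phonemes, might look like the following sketch. The alignment format (a list of phoneme, start frame, end frame triples), the frame numbers, and the function name `merge_stretched` are assumptions for illustration. The sketch also assumes that every inserted phoneme directly follows the phoneme it stretches and that the alignment does not begin with an inserted phoneme.

```python
def merge_stretched(alignment, normal_pron):
    """Fold inserted stretch phonemes into the preceding segment.

    alignment   : list of (phoneme, start_frame, end_frame) for the
                  stretched pronunciation, possibly including "SIL"
    normal_pron : dot-separated normal pronunciation, e.g. "s.u.sh.i"
    """
    normal = normal_pron.split(".")
    merged = []
    pos = 0  # index of the next expected phoneme in the normal pronunciation
    for phoneme, start, end in alignment:
        if phoneme == "SIL":
            # Silence segments are kept unchanged.
            merged.append((phoneme, start, end))
        elif pos < len(normal) and phoneme == normal[pos]:
            # Phoneme belongs to the normal pronunciation: keep as is.
            merged.append((phoneme, start, end))
            pos += 1
        else:
            # Inserted stretch phoneme: extend the previous segment over it.
            prev, prev_start, _prev_end = merged[-1]
            merged[-1] = (prev, prev_start, end)
    return merged


# Hypothetical frame boundaries for the stretched alignment of "寿司".
stretched = [("SIL", 0, 10), ("s", 10, 15), ("u", 15, 22), ("U", 22, 40),
             ("sh", 40, 48), ("i", 48, 55), ("I", 55, 80), ("SIL", 80, 90)]
normal_alignment = merge_stretched(stretched, "s.u.sh.i")
```

The result keeps the same overall time span, with "u" now covering the frames of "u.U" and "i" covering the frames of "i.I".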

The weighting execution unit 612 weights the alignment results read from the second recognition result storage unit 610 on the basis of the list of decoding results received from the decoder 604, and stores the weighted results in the third recognition result storage unit 614. More specifically, the weighting execution unit 612 applies a weight to those words or sentences in the alignment results that were recognized on the basis of a stretched pronunciation. Because the alignment is performed with normal pronunciations as described above, the weighting execution unit 612 uses the list of decoding results to identify the words or sentences that were recognized by reference to a stretched pronunciation.

The training unit 616 reads the weighted alignment results from the third recognition result storage unit 614 and trains the acoustic model by discriminative training. Discriminative training uses correct words (correct phoneme sequences) and incorrect words (incorrect phoneme sequences, called competing candidates), and estimates the model parameters so that the difference between the scores of the correct and incorrect phoneme sequences becomes large. Competing candidates are taken from the N-best candidates of speech recognition or extracted from the word lattice obtained from the decoder. If the pronunciation dictionary contained stretched pronunciations, then, because a stretched pronunciation resembles its normal pronunciation, pairs of a normal pronunciation and its stretched counterpart would always appear as correct and incorrect phoneme sequences, which would increase the training time needed for discrimination. For this reason, in the present invention the decoding results are aligned using the normal pronunciation dictionary, as described above. The training unit 616 outputs the trained acoustic model. For further details of discriminative training, see, for example, Non-Patent Document 5.

Next, an example of the flow of the acoustic model training processing performed by the acoustic model training system 600 is described with reference to FIG. 8. The processing starts at step 800, where the acoustic model training system 600 obtains the recognition results of speech data recognized with reference to the extended pronunciation dictionary, together with a list of the words or sentences that were recognized by reference to a stretched pronunciation. The system then aligns the recognition results using the normal pronunciation dictionary (step 802).

Next, the acoustic model training system 600 weights the alignment results on the basis of the list of words or sentences obtained in step 800 (step 804). The system then trains the acoustic model by discriminative training, using the weighted alignment results as training data (step 806). The processing then ends.

Next, evaluation experiments on speech recognition using the extended pronunciation dictionary and the acoustic model proposed by the present invention will be described with reference to FIGS. 9A, 9B and 10. The evaluation experiments described with reference to FIGS. 9A and 9B used two types of speech data: test set 1, composed of short utterances centered on noun phrases, and test set 2, which includes long sentences. In the evaluation experiment of FIG. 9A, the acoustic model was a conventional acoustic model, that is, an acoustic model constructed by discriminative training based on PT data generated using the normal pronunciation dictionary. The following five variations of the dictionary were evaluated.
Variation 0 (original voc): the normal pronunciation dictionary
Variation 1 (A1 + A2): an extended pronunciation dictionary to which stretched pronunciations not present in the pronunciation dictionary are added
Variation 2 (A1 + A2 + A3): an extended pronunciation dictionary to which stretched pronunciations are added that are not present in the pronunciation dictionary and that belong to vocabulary whose notation has a high occurrence frequency
Variation 3 (A1 + A2 + A3 + A4): an extended pronunciation dictionary to which stretched pronunciations are added that are not present in the pronunciation dictionary and that belong to vocabulary whose notation and pronunciation both have a high occurrence frequency
Variation 4 (A1 + A2 + A3 + A4 + A5): an extended pronunciation dictionary obtained by removing stretched pronunciations with a low accuracy rate from the extended pronunciation dictionary of variation 3
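As an illustration of how a dictionary like variation 1 might be built, the following sketch generates stretched pronunciations from katakana pronunciation strings and adds only those that no entry in the dictionary already has. The function names and the `dict` layout are hypothetical. The sketch also assumes that one position is stretched at a time, that a short vowel is stretched by inserting the long vowel mark "ー", and that the nasal "ン" is stretched by repetition. The claims also allow repeating the short vowel itself, which is not shown here.

```python
from typing import Dict, List, Set

SMALL_KANA = set("ァィゥェォャュョ")  # second half of a two-character mora
LONG_MARK = "ー"
NASAL = "ン"
GEMINATE = "ッ"


def stretched_pronunciations(pron: str) -> List[str]:
    """Return variants of pron with exactly one mora stretched."""
    variants = []
    for i, ch in enumerate(pron):
        nxt = pron[i + 1] if i + 1 < len(pron) else ""
        if ch == NASAL:
            # Stretch the nasal by repeating it.
            variants.append(pron[:i + 1] + NASAL + pron[i + 1:])
        elif ch in (LONG_MARK, GEMINATE):
            continue  # nothing to stretch at this position
        elif nxt not in SMALL_KANA and nxt != LONG_MARK:
            # End of a mora whose vowel is still short: insert a long mark.
            variants.append(pron[:i + 1] + LONG_MARK + pron[i + 1:])
    return variants


def expand_dictionary(lexicon: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Add stretched pronunciations that no dictionary entry already uses."""
    existing = {p for prons in lexicon.values() for p in prons}
    expanded = {word: set(prons) for word, prons in lexicon.items()}
    for word, prons in lexicon.items():
        for pron in prons:
            for variant in stretched_pronunciations(pron):
                if variant not in existing:
                    expanded[word].add(variant)
    return expanded
```

For example, with the entries 本 (ホン) and ホーン (ホーン), the variant ホーン is not added to 本 because another entry already has that pronunciation, while ホンン is added.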

Two quantities were measured in the above experiments: the kana error reduction rate and the character error reduction rate. Here, the kana error rate (KER: Katakana Error Rate) is a recognition rate metric in which the reference data and the hypothesis data are converted into katakana character strings (katakana sequences) representing their pronunciation and then matched. The character error rate (CER) is a recognition rate metric used mainly in evaluating OCR character recognition and kana-kanji conversion, and performs matching on a character-by-character basis. In this evaluation experiment, the reduction rate of each error rate was calculated relative to the recognition results obtained with the normal pronunciation dictionary of variation 0.
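The exact formulas are not given in this passage. The sketch below assumes the usual edit-distance definition of an error rate (substitutions, insertions and deletions divided by the reference length) and a relative reduction against the baseline. All function names are hypothetical. For KER, the inputs are assumed to have already been converted to katakana strings by some external reading converter.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or kana)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total edit errors divided by total reference length.

    Pass raw text for CER, or katakana readings for KER.
    """
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


def error_reduction_rate(baseline: float, system: float) -> float:
    """Relative reduction of an error rate against the baseline (variation 0)."""
    return (baseline - system) / baseline
```

Under these assumptions, a system that lowers an error rate from 0.5 to 0.25 has an error reduction rate of 0.5, that is, 50%.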

As the table in FIG. 9A shows, high error reduction rates are achieved for the extended pronunciation dictionaries of all variations 1 to 4, including those that suppress the growth in dictionary size (variations 2 and 3).

In the evaluation experiment of FIG. 9B, two acoustic models were prepared: the conventional acoustic model described above, and an acoustic model rebuilt by the acoustic model learning method according to the present embodiment. The extended pronunciation dictionary of variation 4 described above was used as the pronunciation dictionary. Here too, the combination of the normal pronunciation dictionary and the conventional acoustic model served as the baseline.

As the table in FIG. 9B shows, a sufficient error reduction rate is achieved merely by using the acoustic model rebuilt by the acoustic model learning method according to the present embodiment. When the extended pronunciation dictionary of variation 4 is used in addition to the rebuilt acoustic model, an even higher error reduction rate is achieved.

FIG. 10 is a table showing the relationship between the number of pronunciations in the dictionary and the error reduction rate for each of the extended pronunciation dictionaries of variations 1 to 4 described above. The table in FIG. 10 shows that prohibiting the addition of stretched pronunciations for vocabulary whose pronunciation or notation has a low occurrence frequency greatly suppresses the increase in the number of pronunciations while having almost no effect on the error reduction rate.
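The frequency-based restriction can be sketched as a simple filter applied before stretched pronunciations are generated. The function name, the two frequency tables, and the keying of pronunciation counts by (notation, pronunciation) pairs are assumptions made for illustration. The threshold values correspond to the first and second predetermined values recited in the claims, whose actual magnitudes are not given here.

```python
from typing import Dict, List, Tuple


def select_entries_for_stretching(
        lexicon: Dict[str, List[str]],
        notation_freq: Dict[str, int],
        pron_freq: Dict[Tuple[str, str], int],
        min_notation: int,
        min_pron: int) -> List[Tuple[str, str]]:
    """Return (notation, pronunciation) pairs that are eligible for stretching.

    notation_freq: occurrence counts of each notation in the language
        model corpus (assumed layout).
    pron_freq: occurrence counts of each (notation, pronunciation) pair in
        paired speech data and automatic recognition results (assumed layout).
    """
    selected = []
    for word, prons in lexicon.items():
        if notation_freq.get(word, 0) <= min_notation:
            continue  # notation too rare: add no stretched pronunciation
        for pron in prons:
            if pron_freq.get((word, pron), 0) > min_pron:
                selected.append((word, pron))
    return selected
```

Only the selected pairs would go on to stretched-pronunciation generation, which is how the number of added pronunciations stays small.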

The present invention has been described above using embodiments, but the technical scope of the present invention is not limited to the scope described in those embodiments. It will be apparent to those skilled in the art that various modifications or improvements can be made to the above embodiments. Embodiments incorporating such modifications or improvements are therefore naturally included in the technical scope of the present invention.

It should be noted that the operations, procedures, steps, stages and other processes of the apparatuses, systems, programs and methods shown in the claims, the specification and the drawings may be executed in any order, unless the order is explicitly indicated by expressions such as "before" or "prior to", or unless the output of an earlier process is used in a later process. It should also be noted that, even when the output of an earlier process is used in a later process, it may be possible for another process to be placed between the two, and that, even when another process is described as coming between them, it may be possible to change the order so that the earlier process is performed immediately before the later one. Even where the operation flows in the claims, the specification and the drawings are described using "first", "next", "subsequently" and the like for convenience, this does not necessarily mean that they must be carried out in that order.

Claims

A method for expanding, by a computer, a pronunciation dictionary including a plurality of vocabulary items as recognition vocabulary, the method comprising:
a step in which the computer reads the pronunciation of each vocabulary item from the pronunciation dictionary;
a step in which the computer generates a new pronunciation (hereinafter referred to as a "stretched pronunciation") by extending a short vowel or nasal sound included in the pronunciation of a vocabulary item whose pronunciation has been read, on the condition that the occurrence frequency of its notation in the language model corpus exceeds a first predetermined value and the occurrence frequency of its pronunciation in paired data of speech data and its automatic recognition results exceeds a second predetermined value;
a step in which the computer determines whether a vocabulary item having the same pronunciation as the stretched pronunciation exists in the pronunciation dictionary; and
a step in which the computer, in response to a determination that no such vocabulary item exists, expands the pronunciation dictionary using the stretched pronunciation.

The pronunciation dictionary expansion method according to claim 1, wherein the extension of the short vowel is performed by inserting a corresponding long vowel after the short vowel or by repeating the same short vowel.

The pronunciation dictionary expansion method according to claim 1, wherein the extension of the nasal sound is performed by repeating the nasal sound.

The pronunciation dictionary expansion method according to claim 1, wherein, in the reading step, the vocabulary from which the pronunciation is read is vocabulary excluding a syllable vocabulary.

The pronunciation dictionary expansion method according to claim 1, further comprising a step of deleting, from the pronunciation dictionary, stretched pronunciations having a low accuracy rate, based on the results of speech recognition performed using the expanded pronunciation dictionary.

A pronunciation dictionary expansion program that causes a computer to execute each step of the pronunciation dictionary expansion method according to any one of claims 1 to 5.

A pronunciation dictionary expansion system comprising means adapted to execute each step of the pronunciation dictionary expansion method according to any one of claims 1 to 5.

A method of learning an acoustic model by a computer, using the extended pronunciation dictionary obtained by the pronunciation dictionary expansion method according to any one of claims 1 to 5, the method comprising:
a step in which the computer obtains a recognition result of speech data that has been speech-recognized using the extended pronunciation dictionary; and
a step in which the computer learns an acoustic model using the recognition result of the speech data as training data,
wherein the learning in the learning step is discriminative training, and the learning step includes a step in which the computer executes alignment on the recognition result of the speech data without using the stretched pronunciations.

A method of learning an acoustic model by a computer, the method comprising:
a step in which the computer reads the pronunciation of each vocabulary item from a pronunciation dictionary;
a step in which the computer generates a new pronunciation (hereinafter referred to as a "stretched pronunciation") by extending a short vowel or nasal sound included in the pronunciation of a vocabulary item whose pronunciation has been read, on the condition that the occurrence frequency of its notation in the language model corpus exceeds a first predetermined value and the occurrence frequency of its pronunciation in paired data of speech data and its automatic recognition results exceeds a second predetermined value;
a step in which the computer expands the pronunciation dictionary using the stretched pronunciation, on the condition that no vocabulary item having the same pronunciation as the stretched pronunciation exists in the pronunciation dictionary;
a step in which the computer obtains a recognition result of speech data that has been speech-recognized using the expanded pronunciation dictionary; and
a step in which the computer learns the acoustic model using the recognition result of the speech data as training data.

The acoustic model learning method according to claim 9, wherein the learning in the learning step is discriminative training, and the learning step includes a step in which the computer executes alignment on the recognition result of the speech data without using the stretched pronunciations of the vocabulary.

The acoustic model learning method according to claim 10, wherein the obtaining step includes a step in which the computer creates a list of utterances recognized using the stretched pronunciations, and the learning step includes a step in which the computer weights the training data with reference to the list.

An acoustic model learning program that causes a computer to execute each step of the acoustic model learning method according to any one of claims 9 to 11.

An acoustic model learning system comprising means adapted to execute each step of the acoustic model learning method according to any one of claims 9 to 11.