JP2015121731A

JP2015121731A - Pronunciation dictionary conversion model creation device, pronunciation dictionary conversion device, method using the same, program, and storage medium for program

Info

Publication number: JP2015121731A
Application number: JP2013266469A
Authority: JP
Inventors: 亮増村; Akira Masumura; 浩和政瀧; Hirokazu Masataki; 孝典芦原; Takanori Ashihara
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-12-25
Filing date: 2013-12-25
Publication date: 2015-07-02
Anticipated expiration: 2033-12-25
Also published as: JP6125991B2

Abstract

PROBLEM TO BE SOLVED: To provide a pronunciation dictionary conversion model creation device that makes it possible to create a pronunciation dictionary suited to a speech recognition task at low cost.

SOLUTION: A pronunciation dictionary conversion label maintenance unit receives, as input, pairs of an original word constituting a word sequence and the speech data of that original word, together with pronunciation variation patterns. It recognizes the speech data using an acoustic model and a context-free grammar that takes the pronunciation variation patterns into account, and outputs as many pronunciation dictionary conversion labels as there are pronunciation variation patterns, each label being a pair of an original word and the post-variation pattern corresponding to a pronunciation variation pattern. A pronunciation dictionary conversion model learning unit then receives the pronunciation dictionary conversion labels as input and learns a pronunciation dictionary conversion model in which the conditional probability that the converted word corresponding to a post-variation pattern appears is modeled from the labels by machine learning.

Description

The present invention relates to a pronunciation dictionary conversion model creation device that creates a model for converting a pronunciation dictionary to suit a target task, to a pronunciation dictionary conversion device, to methods for these devices, to a program, and to a recording medium for the program.

A typical speech recognition system uses a language model for linguistic prediction and an acoustic model for acoustic prediction. In addition, a pronunciation dictionary, which represents the relationship between words and their pronunciations, is used to link the linguistic information to the acoustic information. This pronunciation dictionary is often referred to as a recognition dictionary. Pronunciation dictionaries are well known and are disclosed, for example, in Non-Patent Document 1.

In a pronunciation dictionary, the pronunciation of each word is basically assigned from the canonical kana reading produced by a morphological analyzer. People, however, often do not pronounce words exactly as the canonical reading prescribes. For example, the canonical reading of the word sequence 言った ("said") is いった (itta), but it is sometimes pronounced ゆった (yutta). The relationship between words and pronunciations is therefore not one-to-one, and is subject to probabilistic variation.

If a pronunciation dictionary that captures this probabilistic variation can be built, speech recognition performance can be expected to improve. Note, however, that the way pronunciation varies differs greatly from one speech recognition task to another. For example, a teacher's speech in a classroom varies in the direction of careful pronunciation. Specifically, sounds are lengthened, or pauses are inserted while speaking.

In conversation with friends, on the other hand, pronunciation is less careful, and phenomena such as dropped sounds occur easily. It is therefore important in speech recognition to build an appropriate pronunciation dictionary for each intended speech recognition task.

To build a pronunciation dictionary suited to a speech recognition task, methods have been proposed that predict the pronunciation variation of an arbitrary word from its written form and canonical reading. These methods are based on statistical learning. Statistical learning first requires training data, which here consists of pairs of speech data and the corresponding word sequence.

In the prior art, the speech data is first subjected to continuous speech recognition using only an acoustic model, which yields a phoneme sequence that includes pronunciation variation. At the same time, the word sequence is morphologically analyzed to obtain the canonical phoneme sequence. This processing produces "canonical phoneme sequence / varied phoneme sequence" data, which is then modeled statistically. A method using decision trees is disclosed, for example, in Non-Patent Document 2, and a method using neural networks is disclosed, for example, in Non-Patent Document 3.

Non-Patent Document 1: Kiyohiro Shikano et al., "IT Text Speech Recognition System" (IT Text 音声認識システム), Ohmsha, pp. 91-92.
Non-Patent Document 2: M. Riley et al., "Stochastic pronunciation modelling from hand-labelled phonetic corpora", Speech Communication, 29(2-4):209-224, November 1999.
Non-Patent Document 3: T. Fukada, T. Yoshimura, and Y. Sagisaka, "Automatic generation of multiple pronunciations based on neural networks and language statistics", Speech Communication, 27:63-73, 1999.

Because the conventional method of building a pronunciation dictionary relies on the results of continuous speech recognition of the speech data, it has the problem that a large amount of training data is needed to achieve phoneme-level modeling that covers the range of pronunciation variation. For every phoneme, three kinds of variation must be covered: substitution (a phoneme is replaced by another phoneme), insertion (a new phoneme is added), and deletion (a phoneme disappears). Phoneme-level modeling that covers all of these requires a large amount of training data to be prepared.

The present invention has been made in view of this problem. Its object is to provide a pronunciation dictionary conversion model creation device that creates, from a small amount of training data, a model for building a pronunciation dictionary, a pronunciation dictionary conversion device that uses the model, methods for these devices, a program, and a recording medium for the program.

The pronunciation dictionary conversion model creation device of the present invention comprises a pronunciation dictionary conversion label maintenance unit and a pronunciation dictionary conversion model learning unit. The pronunciation dictionary conversion label maintenance unit receives, as input, pairs of an original word constituting a word sequence and the speech data of that original word, together with pronunciation variation patterns. It recognizes the speech data using an acoustic model and a context-free grammar that takes the pronunciation variation patterns into account, and outputs as many pronunciation dictionary conversion labels as there are pronunciation variation patterns, each label being a pair of an original word and the post-variation pattern corresponding to a pronunciation variation pattern. The pronunciation dictionary conversion model learning unit receives the pronunciation dictionary conversion labels as input and learns a pronunciation dictionary conversion model in which the conditional probability that the converted word corresponding to a post-variation pattern appears is modeled from the labels by machine learning.

The pronunciation dictionary conversion device of the present invention comprises a pronunciation dictionary conversion model, a pronunciation dictionary feature conversion unit, a pronunciation variation observation unit, and a pronunciation dictionary construction unit. The pronunciation dictionary conversion model is one created by the pronunciation dictionary conversion model creation device of the present invention described above. The pronunciation dictionary feature conversion unit receives, as input, a dictionary entry in a source pronunciation dictionary to which only canonical kana readings are assigned, and constructs a canonical-reading feature vector for that entry. The pronunciation variation observation unit receives the canonical-reading feature vector as input and uses the pronunciation dictionary conversion model to obtain the probability value of each pronunciation variation pattern. The pronunciation dictionary construction unit arranges dictionary entries for each probability value of the pronunciation variation patterns and thereby constructs a pronunciation dictionary that reflects pronunciation variation.

According to the pronunciation dictionary conversion model creation device of the present invention, the pronunciation dictionary conversion model is created from the result of recognizing the speech data, in units of the speech data of the original words constituting the word sequence, using the acoustic model and the context-free grammar that takes the pronunciation variation patterns into account. A pronunciation dictionary conversion model corresponding to the pronunciation variation patterns can therefore be created from a small amount of data.

The pronunciation dictionary conversion device of the present invention uses the pronunciation dictionary conversion model described above to convert the dictionary entries of a source pronunciation dictionary, to which only canonical kana readings are assigned, into a pronunciation dictionary that reflects pronunciation variation. A pronunciation dictionary suited to the speech recognition task can therefore be built at low cost.

FIG. 1: A diagram showing a functional configuration example of the pronunciation dictionary conversion model creation device 100 of the present invention.
FIG. 2: A diagram showing the operation flow of the pronunciation dictionary conversion model creation device 100.
FIG. 3: A diagram showing an example of the context-free grammar.
FIG. 4: A diagram showing a functional configuration example of the pronunciation dictionary conversion label maintenance unit 110.
FIG. 5: A diagram showing the operation flow of the pronunciation dictionary conversion label maintenance unit 110.
FIG. 6: A diagram showing a functional configuration example of the pronunciation dictionary conversion model learning unit 140.
FIG. 7: A diagram showing the operation flow of the pronunciation dictionary conversion model learning unit 140.
FIG. 8: A diagram showing a functional configuration example of the pronunciation dictionary conversion device 200 of the present invention.
FIG. 9: A diagram showing the operation flow of the pronunciation dictionary conversion device 200.
FIG. 10: A diagram showing an example of dictionary entries in the source pronunciation dictionary.
FIG. 11: A diagram showing an example of dictionary entries in the pronunciation dictionary, reflecting pronunciation variation, produced by the pronunciation dictionary conversion device 200.

Embodiments of the present invention are described below with reference to the drawings. Identical components in different drawings are given the same reference numerals, and their description is not repeated.

FIG. 1 shows a functional configuration example of the pronunciation dictionary conversion model creation device 100 of the present invention, and FIG. 2 shows its operation flow. The pronunciation dictionary conversion model creation device 100 comprises a pronunciation dictionary conversion label maintenance unit 110, an acoustic model 120, and a pronunciation dictionary conversion model learning unit 140. The device is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute that program. The same applies to the other devices described below.

The pronunciation dictionary conversion label maintenance unit 110 receives, as input, pairs of an original word constituting a word sequence and the speech data of that original word (pair 1, pair 2, ..., pair M), together with pronunciation variation patterns. It recognizes the speech data of the original words using the acoustic model 120 and a context-free grammar that takes the pronunciation variation patterns into account, and outputs as many pronunciation dictionary conversion labels as there are pronunciation variation patterns, each label being a pair of an original word and the post-variation pattern corresponding to a pronunciation variation pattern (step S110). The word sequence is, for example, the sequence of words 今日 (kyou, "today"), は (topic particle), 晴れ (hare, "sunny"), です (desu, "is"). As for the pronunciation variation patterns, in a speech recognition task where careful pronunciation is expected (for example, recognizing a teacher's speech), three kinds can be considered: "insert a pause after each mora", "lengthen every mora", and "unchanged".

The context-free grammar is a grammar that admits only the pronunciation variation patterns; FIG. 3 shows an example. For the word sequence 今日は晴れです ("Today is sunny"), the variants obtained by applying the pronunciation variation patterns to each original word are arranged in series. Each arrow represents a selectable pronunciation path. Pronunciation dictionary conversion labels are output, one for each pair of an original word and its speech data. In the word sequence above, four pronunciation dictionary conversion labels are output. The labels may be stored temporarily as a label group 130.

The pronunciation dictionary conversion model learning unit 140 receives the pronunciation dictionary conversion labels as input and learns a pronunciation dictionary conversion model in which the conditional probability that the converted word corresponding to a post-variation pattern appears is modeled from the labels by machine learning (step S140). Examples of pronunciation dictionary conversion labels are (今日 - insert a pause after each mora), (今日 - lengthen every mora), and (今日 - unchanged). Three such labels per original word are input from the pronunciation dictionary conversion label maintenance unit 110.

The processing of the pronunciation dictionary conversion label maintenance unit 110 and the pronunciation dictionary conversion model learning unit 140 is repeated until every pair of an original word and its speech data in the word sequence has been processed (No in step S150). A control unit 150 controls the sequencing and termination of the operations of these two units. The function of the control unit 150 is a general one and is not a special technical feature of this embodiment.

Because the pronunciation dictionary conversion model learned with this configuration is obtained with the number of pronunciation variation patterns limited for each original word, creating it does not require a large amount of training data. In other words, the pronunciation dictionary conversion model can be learned from a small amount of data.

The operation of the pronunciation dictionary conversion model creation device 100 is described below in more detail, with more specific functional configuration examples of each unit.

[Pronunciation dictionary conversion label maintenance unit]
FIG. 4 shows a more specific functional configuration example of the pronunciation dictionary conversion label maintenance unit 110, and FIG. 5 shows its operation flow. The pronunciation dictionary conversion label maintenance unit 110 comprises morphological analysis means 111, context-free grammar construction means 112, context-free grammar storage means 113, maximum likelihood sequence search means 114, and label generation means 115.

The morphological analysis means 111 obtains a morphological analysis result with reading information from the word sequence (step S111). Any morphological analyzer can be used. For example, for the word sequence 今日は晴れです ("Today is sunny"), it is sufficient to obtain a result such as "今日; キョウ; noun: date/time: adverbial; は; ワ; adverbial particle; 晴れ; はれ; noun; です; デス; copula: terminal form;". Any means may be used as the morphological analysis means 111 as long as it splits the word sequence into morphemes and assigns the canonical reading to each.

The context-free grammar construction means 112 receives the morphological analysis result with reading information and the pronunciation variation patterns as input, and constructs the context-free grammar (step S112). The grammar admits only the pronunciation variation patterns defined in advance. These patterns may be defined in various ways, but are limited to patterns that vary word by word.

Here the variation patterns are taken to be the three kinds mentioned above: "insert a pause after each mora", "lengthen every mora", and "unchanged". The morphological analysis result is assumed to contain L words. For the morphological analysis result above, L = 4.

For the original word 晴れ (canonical reading ハレ), "unchanged (as given by the reading information)" yields ハレ, "insert a pause after each mora" yields ハ、レ、, and "lengthen every mora" yields ハーレー.

These three pronunciation variation patterns are considered for each word of the morphological analysis result, and a context-free grammar that admits them is constructed (FIG. 3). The number of pronunciation variation patterns is not limited to three. For example, a pattern such as "insert a geminate consonant (sokuon) after each mora" may be added, in which case the number of arrows representing the transitions between words in FIG. 3 becomes four. The constructed context-free grammar is stored in the context-free grammar storage means 113.
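The construction step above can be illustrated with a minimal Python sketch. It assumes katakana readings and a simplified mora split in which small kana attach to the preceding character; the names `split_morae`, `PATTERNS`, and `build_grammar` are hypothetical and do not appear in the original, and the list-of-dictionaries layout is only one possible way to represent the serial grammar of FIG. 3.

```python
from typing import Callable, Dict, List

# Small katakana that attach to the preceding character to form one mora
# (for example キョ). This set is an illustrative assumption.
SMALL_KANA = set("ャュョァィゥェォヮ")


def split_morae(reading: str) -> List[str]:
    """Split a katakana reading into morae using a simplified rule."""
    morae: List[str] = []
    for ch in reading:
        if ch in SMALL_KANA and morae:
            morae[-1] += ch  # small kana joins the previous mora
        else:
            morae.append(ch)
    return morae


def unchanged(reading: str) -> str:
    """Pattern 'unchanged': keep the canonical reading."""
    return reading


def pause_after_each_mora(reading: str) -> str:
    """Pattern 'insert a pause after each mora' (pause written as 、)."""
    return "".join(m + "、" for m in split_morae(reading))


def lengthen_every_mora(reading: str) -> str:
    """Pattern 'lengthen every mora' (long mark ー after each mora)."""
    return "".join(m + "ー" for m in split_morae(reading))


# Hypothetical pattern table; a fourth pattern could be added here.
PATTERNS: Dict[str, Callable[[str], str]] = {
    "unchanged": unchanged,
    "pause": pause_after_each_mora,
    "lengthen": lengthen_every_mora,
}


def build_grammar(readings: List[str]) -> List[Dict[str, str]]:
    """One slot per word; each slot lists the variants the grammar admits."""
    return [{name: fn(r) for name, fn in PATTERNS.items()} for r in readings]


grammar = build_grammar(["キョウ", "ハ", "ハレ", "デス"])
```

Each element of `grammar` corresponds to one word position, and its entries correspond to the selectable arrows for that word. Readings that already contain a long mark are not treated specially in this sketch.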

The maximum likelihood sequence search means 114 recognizes the speech data of the original words using the acoustic model 120 and the context-free grammar stored in the context-free grammar storage means 113, and outputs the maximum likelihood sequence (step S114). The maximum likelihood sequence is the sequence ŝ that, among the paths permitted by the context-free grammar, maximizes the generation probability under the acoustic model. The sequence ŝ is obtained by the following equation, in which s ranges over the paths permitted by the context-free grammar:

$$\hat{s} = \mathop{\mathrm{argmax}}_{s} P(t \mid s) \qquad (1)$$

Here t is the feature vector sequence of the input speech. In speech recognition it is common to use feature vectors obtained by, for example, mel-frequency cepstral coefficient (MFCC) analysis. The acoustic model 120 is a general one defined by hidden Markov models and Gaussian mixture distributions. If the speech data paired with the word sequence is, for example, キョウハーハ、レ、デースー, the path drawn with a thick line in FIG. 3 is selected as the maximum likelihood sequence.
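The search over grammar paths can be sketched as follows. This is a toy illustration, not the recognizer described here: a real system would score each path with the acoustic model on the feature vector sequence t, whereas this sketch substitutes a string-similarity score against a transcribed pronunciation. The function names, the hand-written grammar, and the use of the reading ハ for は (chosen to match the example utterance) are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Callable, Dict, Sequence, Tuple


def best_path(
    grammar: Sequence[Dict[str, str]],
    score: Callable[[str], float],
) -> Tuple[Tuple[str, ...], str]:
    """Enumerate every path the grammar permits and keep the best-scoring one."""
    best_names: Tuple[str, ...] = ()
    best_surface = ""
    best_score = float("-inf")
    for names in product(*(slot.keys() for slot in grammar)):
        surface = "".join(slot[name] for slot, name in zip(grammar, names))
        s = score(surface)
        if s > best_score:
            best_names, best_surface, best_score = names, surface, s
    return best_names, best_surface


# Hand-written grammar for 今日 / は / 晴れ / です with three patterns per word.
grammar = [
    {"unchanged": "キョウ", "pause": "キョ、ウ、", "lengthen": "キョーウー"},
    {"unchanged": "ハ", "pause": "ハ、", "lengthen": "ハー"},
    {"unchanged": "ハレ", "pause": "ハ、レ、", "lengthen": "ハーレー"},
    {"unchanged": "デス", "pause": "デ、ス、", "lengthen": "デースー"},
]

observed = "キョウハーハ、レ、デースー"


def toy_score(surface: str) -> float:
    """Stand-in for the acoustic likelihood: similarity to the observed string."""
    return SequenceMatcher(None, surface, observed).ratio()


patterns, surface = best_path(grammar, toy_score)
```

Exhaustive enumeration is practical only because the grammar allows a handful of patterns per word; an actual decoder would search the same constrained space with dynamic programming over acoustic frames.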

The label generation means 115 receives the morphological analysis result with reading information and the maximum likelihood sequence as input, and generates pronunciation dictionary conversion labels, each consisting of an original word and its post-variation pattern (step S115). If the maximum likelihood sequence is キョウハーハ、レ、デースー, four pronunciation dictionary conversion labels are obtained as output: "今日; キョウ; noun: date/time: adverbial; - as read", "は; ワ; adverbial particle; - every mora lengthened", "晴れ; ハレ; noun; - pause after each mora", and "です; デス; copula: terminal form; - every mora lengthened".

[Pronunciation dictionary conversion model learning unit]
FIG. 6 shows a functional configuration example of the pronunciation dictionary conversion model learning unit 140, and FIG. 7 shows its operation flow. The pronunciation dictionary conversion model learning unit 140 comprises feature vector extraction means 141 and pronunciation dictionary conversion device model parameter learning means 142.

The feature vector extraction means 141 receives a pronunciation dictionary conversion label as input and outputs a training label that pairs the feature vector extracted from the word information of the original word in the label with the post-variation pattern in the label (step S141). A training label is expressed as a feature vector x and an output label y.

The feature vector extraction means 141 first extracts a feature vector from the word information of the original word. Various features can be used; here we assume, as an example, that part-of-speech information and the mora length of the word are used. If there are five part-of-speech classes (verb, noun, adjective, adverb, other), five dimensions are reserved for the part-of-speech information alone.

If the original word is a noun, only the element for "noun" is set to 1 and the other elements are set to 0. For example, the feature vector is x = [0, 1, 0, 0, 0].

Mora length is handled in the same way. With three classes ("1 mora", "2 to 4 morae", "5 or more morae"), an original word of one mora has only the corresponding element set to 1 and the other elements set to 0. For example, the feature vector is x = [0, 1, 0, 0, 0, 1, 0, 0]. The first five elements carry the part-of-speech information and the following three carry the mora-length information.

The output label y is set, for example, to 1 for "as read", 2 for "insert a pause after each mora", and 3 for "lengthen every mora".

If the pronunciation dictionary conversion label is "今日; キョウ; noun: date/time: adverbial - as read", the training label is "x = [0, 1, 0, 0, 0, 0, 1, 0], y = 1".
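The feature layout described above (five part-of-speech elements followed by three mora-length elements) can be written as a short Python sketch. The function and constant names are hypothetical, and the English class names stand in for the part-of-speech tags a morphological analyzer would return.

```python
from typing import Dict, List

# Order follows the text: verb, noun, adjective, adverb, other.
POS_CLASSES = ["verb", "noun", "adjective", "adverb", "other"]

# Output labels as numbered in the text.
PATTERN_LABELS: Dict[str, int] = {
    "as read": 1,
    "pause after each mora": 2,
    "lengthen every mora": 3,
}


def mora_bin(mora_count: int) -> int:
    """Index of the mora-length class: 1 mora, 2 to 4 morae, 5 or more."""
    if mora_count <= 1:
        return 0
    if mora_count <= 4:
        return 1
    return 2


def feature_vector(pos: str, mora_count: int) -> List[int]:
    """Concatenate a one-hot part of speech and a one-hot mora-length class."""
    pos_part = [0] * len(POS_CLASSES)
    # Unknown tags fall back to the "other" class (an assumption of this sketch).
    pos_index = POS_CLASSES.index(pos) if pos in POS_CLASSES else POS_CLASSES.index("other")
    pos_part[pos_index] = 1
    mora_part = [0, 0, 0]
    mora_part[mora_bin(mora_count)] = 1
    return pos_part + mora_part


# Training label for 今日 (キョウ, a noun of two morae) realized as read.
x = feature_vector("noun", 2)
y = PATTERN_LABELS["as read"]
```

The same function can be reused when converting an existing dictionary, since the conversion stage builds its feature vectors by the same rule.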

The pronunciation dictionary conversion device model parameter learning means 142 receives the training labels as input and, taking the feature vector as the input feature vector, learns the pronunciation dictionary conversion model, that is, the model parameters that give the conditional probability that the output label y is produced (step S142). The model to be learned is one that can model the conditional probability P(y|x) from the input feature vector x and the output label y. Various models are possible; here, for example, a maximum entropy model is used.

The maximum entropy model is equivalent to a log-linear model and is well known. In its standard form, the maximum entropy model is expressed by the following equation, where f(x, y) denotes the feature vector formed from x for the output label y:

$$P(y \mid x) = \frac{\exp\left(w \cdot f(x, y)\right)}{\sum_{y'} \exp\left(w \cdot f(x, y')\right)} \qquad (2)$$

Here w is the model parameter. As a specific learning method, a well-known method is used, such as the one described in Reference 1 (Kenji Kita, "Language and Computation 4: Probabilistic Language Models" (言語と計算-4 確率的言語モデル), University of Tokyo Press, pp. 162-165).
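As a rough illustration of learning such a log-linear model, the sketch below trains one weight vector per output label by stochastic gradient ascent on the log-likelihood, so that P(y|x) is proportional to exp(w_y · x). This is a common textbook formulation and is not claimed to be the training procedure of Reference 1; the toy samples, learning rate, and function names are assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple


def predict_proba(weights: Sequence[Sequence[float]], x: Sequence[int]) -> List[float]:
    """Return P(y|x) for y = 1..K, with P(y|x) proportional to exp(w_y . x)."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(
    samples: Sequence[Tuple[Sequence[int], int]],
    num_classes: int,
    dim: int,
    lr: float = 0.5,
    epochs: int = 200,
) -> List[List[float]]:
    """Stochastic gradient ascent on the conditional log-likelihood."""
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(epochs):
        for x, y in samples:
            probs = predict_proba(weights, x)
            for k in range(num_classes):
                # Gradient for class k: (indicator of the true label - P(k|x)) * x
                grad = (1.0 if k == y - 1 else 0.0) - probs[k]
                for j in range(dim):
                    weights[k][j] += lr * grad * x[j]
    return weights


# Toy training labels (x, y); the label assignments are invented for illustration.
samples = [
    ([0, 1, 0, 0, 0, 0, 1, 0], 1),  # noun, 2 to 4 morae -> as read
    ([1, 0, 0, 0, 0, 0, 1, 0], 2),  # verb, 2 to 4 morae -> pause after each mora
    ([0, 0, 0, 0, 1, 1, 0, 0], 3),  # other, 1 mora      -> lengthen every mora
]
weights = train(samples, num_classes=3, dim=8)
probs = predict_proba(weights, [0, 1, 0, 0, 0, 0, 1, 0])
```

`probs` holds the three conditional probabilities for a noun of two to four morae, in the order of the output labels 1, 2, 3. In practice a regularization term would normally be added to the objective.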

Once the pronunciation dictionary conversion model has been learned, inputting a feature vector x constructed by the pronunciation dictionary feature conversion unit 220 of the pronunciation dictionary conversion device 200 (described later) makes it possible to compute the conditional probability P(1|x) that the word is pronounced as read, the conditional probability P(2|x) that a pause is inserted after each mora, and the conditional probability P(3|x) that every mora is lengthened.

[Pronunciation dictionary conversion device]
FIG. 8 shows a functional configuration example of the pronunciation dictionary conversion device 200 of the present invention, and FIG. 9 shows its operation flow. The pronunciation dictionary conversion device 200 comprises a pronunciation dictionary conversion model 210, a pronunciation dictionary feature conversion unit 220, a pronunciation variation observation unit 230, and a pronunciation dictionary construction unit 240.

The pronunciation dictionary conversion model 210 is a conversion model created by the pronunciation dictionary conversion model creation device 100 described above. The pronunciation dictionary feature conversion unit 220 receives, as input, a dictionary entry in the source pronunciation dictionary to which only canonical kana readings are assigned, and constructs a canonical-reading feature vector for that entry (step S220).

FIG. 10 shows an example of dictionary entries in an existing pronunciation dictionary. A dictionary entry is one line of FIG. 10 and consists of the word (for example 曇り, "cloudy"), its canonical reading, its part-of-speech information, and its kana pronunciation with a probability value. The canonical-reading feature vector for a dictionary entry is generated by the same rule as in the feature vector extraction means 141 described above. For example, the canonical-reading feature vector for the dictionary entry "曇り; クモリ; noun; ⇒ クモリ = 1.0" is x = [0, 1, 0, 0, 0, 0, 1, 0]. The first five elements (part-of-speech information) indicate a noun, and the following three elements (mora-length information) indicate the 2-to-4-mora class (クモリ has three morae).

The pronunciation variation observation unit 230 receives the normal reading feature vector as input and uses the pronunciation dictionary conversion model 210 to obtain the conditional probability P(y|x) of each pronunciation variation pattern (step S230). For example, given the normal reading feature vector x = [0, 1, 0, 0, 0, 0, 1, 0] of "曇り; クモリ; noun; ⇒ クモリ = 1.0", conditional probabilities such as P(1|x) = 0.65, P(2|x) = 0.23, and P(3|x) = 0.12 are obtained. The pronunciation variation observation unit 230 calculates the conditional probabilities using equation (2).
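As a rough illustration of this step, the sketch below evaluates P(y|x) with a generic log-linear (maximum entropy style) classifier: one weight row per variation pattern, normalized with an exponential. This is an assumed common form and may differ in detail from equation (2). The weights are not learned values; they are hypothetical numbers chosen only so that the output reproduces the probabilities quoted above.

```python
import math

x = [0, 1, 0, 0, 0, 0, 1, 0]  # normal reading feature vector from the example

# Hypothetical weights: the log of each target probability is placed on the
# "noun" feature (index 1); all other weights are zero.
TARGET = [0.65, 0.23, 0.12]
WEIGHTS = [[0.0] * len(x) for _ in TARGET]
for row, p in zip(WEIGHTS, TARGET):
    row[1] = math.log(p)


def conditional_probabilities(x: list[int], weights: list[list[float]]) -> list[float]:
    """Return P(y|x) for every pattern y as a normalized exponential of w_y . x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = conditional_probabilities(x, WEIGHTS)
for y, p in enumerate(probs, start=1):
    print(f"P({y}|x) = {p:.2f}")
```

In an actual system the weight rows would come from the model parameter learning described earlier rather than being set by hand.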

The pronunciation dictionary construction unit 240 constructs a pronunciation dictionary that takes pronunciation variation into account by arranging a dictionary entry for each conditional probability P(y|x) of the pronunciation variation patterns (step S240). The processing of the pronunciation dictionary feature conversion unit 220, the pronunciation variation observation unit 230, and the pronunciation dictionary construction unit 240 is repeated until all dictionary entries have been processed (No in step S250).
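The loop over steps S220 to S250 might be sketched as follows. The function name `convert_dictionary`, the tuple layouts, and the two stub callables are hypothetical; in particular, the converted entries here carry only a pattern number and its probability, since how each pattern rewrites the kana reading is not specified in this passage.

```python
from typing import Callable, Iterable

Entry = tuple[str, str, str]                      # (word, normal reading, part of speech)
ConvertedEntry = tuple[str, str, str, int, float]  # entry + pattern number + P(y|x)


def convert_dictionary(
    entries: Iterable[Entry],
    featurize: Callable[[Entry], list[int]],
    model: Callable[[list[int]], list[float]],
) -> list[ConvertedEntry]:
    """Expand every source entry into one entry per pronunciation variation pattern."""
    converted: list[ConvertedEntry] = []
    for entry in entries:                  # repeat until all entries are done (S250)
        word, reading, pos = entry
        x = featurize(entry)               # step S220
        probs = model(x)                   # step S230
        for pattern, prob in enumerate(probs, start=1):  # step S240
            converted.append((word, reading, pos, pattern, prob))
    return converted


# Stand-ins for units 220 and 230, returning the values quoted in the text.
def stub_featurize(entry: Entry) -> list[int]:
    return [0, 1, 0, 0, 0, 0, 1, 0]


def stub_model(x: list[int]) -> list[float]:
    return [0.65, 0.23, 0.12]


converted = convert_dictionary([("曇り", "クモリ", "noun")], stub_featurize, stub_model)
for row in converted:
    print(row)
```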

FIG. 11 shows an example of dictionary entries in a pronunciation dictionary, converted by the pronunciation dictionary conversion apparatus 200, that takes pronunciation variation into account. In this example, three variation patterns are arranged for each word.

The pronunciation dictionary conversion apparatus 200 converts dictionary entries in the conversion source pronunciation dictionary, to which only the regular reading kana is assigned, into a pronunciation dictionary that takes pronunciation variation into account, using the pronunciation dictionary conversion model of the present invention, which assumes a limited set of pronunciation variations. A pronunciation dictionary suitable for a speech recognition task can therefore be constructed at low cost.

In the embodiment described above, the feature vector x was represented by five elements of part-of-speech information and three elements of mora length information. This is only one example; features such as "whether the written form contains kanji" or "the difference in length between the written form and the standard reading" may be added as vector elements. Likewise, although a maximum entropy model was described as an example of machine learning, any model can be used to model the conditional probability P(y|x). For example, a neural network may be used, in which case the intermediate layer of the neural network uses a sigmoid function and the output layer uses a softmax function.
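The neural network alternative can be sketched as a single forward pass with a sigmoid intermediate layer and an output layer that normalizes scores into probabilities (softmax). The layer sizes, the random untrained weights, and all function names are assumptions for illustration only; the resulting probabilities are therefore arbitrary.

```python
import math
import random


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores: list[float]) -> list[float]:
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, w_hidden, b_hidden, w_out, b_out) -> list[float]:
    """Return P(y|x): sigmoid intermediate layer followed by a softmax output layer."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    scores = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(w_out, b_out)
    ]
    return softmax(scores)


# Hypothetical sizes: 8 input features, 4 hidden units, 3 variation patterns.
rng = random.Random(0)
N_IN, N_HIDDEN, N_OUT = 8, 4, 3
w_hidden = [[rng.uniform(-1, 1) for _ in range(N_IN)] for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [[rng.uniform(-1, 1) for _ in range(N_HIDDEN)] for _ in range(N_OUT)]
b_out = [0.0] * N_OUT

x = [0, 1, 0, 0, 0, 0, 1, 0]
probs = forward(x, w_hidden, b_hidden, w_out, b_out)
print(probs)
```

Because the output layer is normalized, the values can be used in place of the maximum entropy probabilities without changing the later dictionary construction step.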

As described above, the pronunciation dictionary conversion model creation apparatus 100 of the present invention does not perform continuous speech recognition as in the prior art, and by using a context-free grammar that assumes limited pronunciation variation, it no longer needs to handle a wide range of pronunciation variation. As a result, a robust pronunciation dictionary conversion model can be created from a small amount of data.

In addition, the pronunciation dictionary conversion apparatus 200 of the present invention uses that pronunciation dictionary conversion model to construct a pronunciation dictionary suited to the speech recognition task, which makes it possible to create a pronunciation dictionary adapted to the speech recognition task at low cost. This pronunciation dictionary is also well suited to speech synthesis that reproduces an individual's speaking habits.

When the processing means in the above apparatuses are implemented by a computer, the processing content of the functions that each apparatus should have is described by a program. By executing this program on a computer, the processing means of each apparatus are realized on the computer.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

The present invention can be used throughout both fields of speech recognition and speech synthesis.

Claims

A pronunciation dictionary conversion label maintenance unit that receives, as input, a set of original words constituting a word sequence and voice data of the original words, together with pronunciation variation patterns, performs speech recognition on the voice data using an acoustic model and a context-free grammar that takes the pronunciation variation patterns into account, and outputs, for as many as the number of the pronunciation variation patterns, pronunciation dictionary conversion labels each consisting of a pair of the original word and the pattern after variation corresponding to a pronunciation variation pattern;
A pronunciation dictionary conversion model learning unit that receives the pronunciation dictionary conversion labels as input and learns, by machine learning, a pronunciation dictionary conversion model that models the conditional probability that a converted word corresponding to the pattern after variation appears from the pronunciation dictionary conversion label;
A pronunciation dictionary conversion model creation device comprising:

In the pronunciation dictionary conversion model creation device according to claim 1,
The pronunciation dictionary conversion label maintenance unit comprises:
Morphological analysis means for obtaining, from the word sequence, a morphological analysis result with reading information;
Context-free grammar construction means for constructing a context-free grammar using the morpheme analysis result with reading information and the pronunciation variation pattern as input,
A context-free grammar storage means for storing the context-free grammar;
Maximum likelihood sequence search means for outputting a maximum likelihood sequence obtained by speech recognition of the original word speech data using an acoustic model and the context-free grammar;
Label generation means for generating a pronunciation dictionary conversion label composed of the original word and the pattern after variation, using the morphological analysis result with reading information and the maximum likelihood sequence as inputs,
A pronunciation dictionary conversion model creation device characterized by comprising:

In the pronunciation dictionary conversion model creation device according to claim 1 or 2,
The pronunciation dictionary conversion model learning unit is
Feature vector extraction means for receiving the pronunciation dictionary conversion label as input, extracting a feature vector from the word information of the original word constituting the pronunciation dictionary conversion label, and outputting a learning label in which the feature vector is the input feature vector and the pattern after variation is the output label;
Model parameter learning means for receiving the learning label as input and learning, using a maximum entropy model, a pronunciation dictionary conversion model, which consists of the model parameters for obtaining the conditional probability that the output label is output given the input feature vector;
A pronunciation dictionary conversion model creation device characterized by comprising:

A pronunciation dictionary conversion model created by the pronunciation dictionary conversion model creation device according to claim 1;
A pronunciation dictionary feature conversion unit that receives as input a dictionary entry in the conversion source pronunciation dictionary, to which only the regular reading kana is assigned, and constructs a normal reading feature vector for the dictionary entry;
A pronunciation variation observation unit that receives the normal reading feature vector as input and obtains the probability value of each pronunciation variation pattern using the pronunciation dictionary conversion model;
A pronunciation dictionary construction unit that constructs a pronunciation dictionary in which pronunciation variation is considered by arranging dictionary entries for each probability value of the pronunciation variation pattern;
A pronunciation dictionary conversion device comprising:

A pronunciation dictionary conversion label maintenance process in which a pronunciation dictionary conversion label maintenance unit receives, as input, a set of original words constituting a word sequence and voice data of the original words, together with pronunciation variation patterns, performs speech recognition on the voice data using an acoustic model and a context-free grammar that takes the pronunciation variation patterns into account, and outputs, for as many as the number of the pronunciation variation patterns, pronunciation dictionary conversion labels each consisting of a pair of the original word and the pattern after variation corresponding to a pronunciation variation pattern;
A pronunciation dictionary conversion model learning process in which a pronunciation dictionary conversion model learning unit receives the pronunciation dictionary conversion labels as input and learns, by machine learning, a pronunciation dictionary conversion model that models the conditional probability that a converted word corresponding to the pattern after variation appears from the pronunciation dictionary conversion label;
A pronunciation dictionary conversion model creation method comprising:

A pronunciation dictionary feature conversion process in which a pronunciation dictionary feature conversion unit receives as input a dictionary entry in the conversion source pronunciation dictionary, to which only the regular reading kana is assigned, and constructs a normal reading feature vector for the dictionary entry;
A pronunciation variation observation process in which a pronunciation variation observation unit receives the normal reading feature vector as input and obtains the probability value of each pronunciation variation pattern using the pronunciation dictionary conversion model created by the pronunciation dictionary conversion model creation method according to claim 5;
A pronunciation dictionary construction process in which a pronunciation dictionary construction unit constructs a pronunciation dictionary that takes pronunciation variation into account by arranging a dictionary entry for each probability value of the pronunciation variation patterns;
A pronunciation dictionary conversion method comprising:

A program for causing a computer to function as each unit of the pronunciation dictionary conversion model creation device according to any one of claims 1 to 3 and the pronunciation dictionary conversion device according to claim 4.

A computer-readable recording medium on which any one of the programs according to claim 7 is recorded.