JP2014232510A

JP2014232510A - Japanese kana character correction model learning device, Japanese kana character correction device, methods therefor, and program

Info

Publication number: JP2014232510A
Application number: JP2013114254A
Authority: JP
Inventors: Hiroko Murakami; Hideyuki Mizuno; Yusuke Ijima
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2013-05-30
Filing date: 2013-05-30
Publication date: 2014-12-11
Anticipated expiration: 2033-05-30
Also published as: JP5961586B2

Abstract

PROBLEM TO BE SOLVED: To provide a Japanese kana character correction model learning device that learns a Japanese kana character correction model, a statistical model for automatically correcting Japanese kana character errors caused by notation variability, and a Japanese kana character correction device that uses the model.

SOLUTION: The Japanese kana character correction model learning device includes an N-1 series extraction unit that takes as input a learning text in which Chinese characters are mixed with kana and extracts N-grams in which one Chinese character in the learning text is followed by N-1 consecutive hiragana characters, and an N-gram model learning unit that learns a Chinese character-kana N-gram model in which probabilities are assigned according to the appearance frequency of the N-grams and outputs that model to the outside as the Japanese kana character correction model. The Japanese kana character correction device inputs the N-gram of a target Chinese character to the Japanese kana character correction model to find its occurrence probability, corrects the Japanese kana character of the target Chinese character to one whose occurrence probability is a predetermined value or more, and outputs it.

Description

The present invention relates to a reading kana correction model learning device that generates a reading kana correction model used for automatically correcting reading kana errors, to a reading kana correction device that uses the model, and to methods and a program therefor.

Conventionally, reading kana have generally been assigned to kanji as follows: word candidates, each a triple of (word notation, part of speech, reading kana), are obtained from a word dictionary; the word sequence most appropriate as a Japanese sentence is selected on the basis of part-of-speech connections between words; and reading kana are assigned to the kanji according to the reading kana of the selected word sequence (for example, Non-Patent Document 1).

Text containing informal notation written by individuals, such as Twitter posts and blogs, exhibits notation variation: for example, small-kana substitution such as 「嬉しい」→「嬉しぃ」, or katakana substitution such as 「知らない」→「知ラナイ」. When the text to which reading kana are to be assigned contains such variation, dictionary matching fails during word sequence selection and reading kana errors occur. To reduce reading kana errors caused by notation variation, the conventional approach has been to rewrite the text by rule before word sequence selection, converting text that contains notation variation into a notation that can be matched against the dictionary.

Non-Patent Document 1: Yuji Matsumoto, et al., "Japanese Morphological Analysis System ChaSen Version 2.0 Manual," NAIST-IS-TR99012 (1999).

Because the notation variation patterns found in informal text are so diverse, many variations cannot be covered by conventional rule-based rewriting. In addition, the rules must be designed by hand, so designing rules each time a new variation pattern appears is costly.

The present invention has been made in view of this problem. Its object is to provide a reading kana correction model learning device that learns a reading kana correction model, which is a statistical model for automatically correcting reading kana errors caused by notation variation; a reading kana correction device that uses the model; and methods and a program therefor.

The reading kana correction model learning device of the present invention includes an N-1 sequence extraction unit and an N-gram model learning unit. The N-1 sequence extraction unit takes as input a learning text written in mixed kanji and kana and extracts N-grams in which one kanji character in the learning text is followed by N-1 consecutive hiragana characters. The N-gram model learning unit learns a kanji-kana N-gram model in which probabilities are assigned according to the appearance frequency of the N-grams, and outputs the kanji-kana N-gram model to the outside as the reading kana correction model.

The reading kana correction device of the present invention includes a reading kana correction model and a reading kana correction unit. The reading kana correction model is one learned by the reading kana correction model learning device described above. The reading kana correction unit extracts N-grams in which one kanji character in the input text is followed by N-1 consecutive hiragana characters, inputs the N-gram of the target kanji to the reading kana correction model to obtain the occurrence probability of the N-gram, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is a predetermined value or more, and outputs the result.

The reading kana correction model learning device of the present invention provides a reading kana correction model: an N-gram probability model over one kanji character in the learning text, its reading kana, and the N-1 characters that follow the kanji, which can be used to correct notation variation contained in text. The reading kana correction device of the present invention can use this reading kana correction model to automatically correct reading kana errors caused by notation variation in text. This reduces the cost of designing rules each time a new notation variation pattern appears.

FIG. 1 is a diagram showing an example of the functional configuration of the reading kana correction model learning device 100 of the present invention.
FIG. 2 is a diagram showing the operation flow of the reading kana correction model learning device 100.
FIG. 3 is a diagram showing an example of the functional configuration of the reading kana correction device 200 of the present invention.
FIG. 4 is a diagram showing a more specific example of the functional configuration of the reading kana correction unit 210.
FIG. 5 is a diagram showing an example of the functional configuration of the reading kana correction device 300 of the present invention.
FIG. 6 is a diagram showing a more specific example of the functional configuration of the reading kana candidate extraction unit 310.
FIG. 7 is a diagram showing an example of the functional configuration of the reading kana correction device 400 of the present invention.
FIG. 8 is a diagram showing an example of the functional configuration of the reading kana correction device 500 of the present invention.

Embodiments of the present invention are described below with reference to the drawings. Identical components in different drawings are given the same reference numerals, and their description is not repeated.

[Reading kana correction model learning device]
FIG. 1 shows an example of the functional configuration of the reading kana correction model learning device 100 of the present invention. FIG. 2 shows its operation flow. The reading kana correction model learning device 100 includes an N-1 sequence extraction unit 110, an N-gram model learning unit 120, and a control unit 130. The device is realized by loading a predetermined program into a computer composed of, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program. The same applies to the other embodiments described below.

The N-1 sequence extraction unit 110 takes as input a learning text written in mixed kanji and kana and extracts N-grams in which one kanji character in the learning text is followed by N-1 consecutive hiragana characters (step S110). Only such N-grams are used for learning. Cases where kanji appear consecutively, or where the N-1 characters following the kanji include any character other than hiragana (katakana, kanji, symbols, and so on), are excluded from learning.

Take N = 3 as an example. In the learning text 「今日は外で遊びましょうね」 (reading: キョウワソトデアソビマショウネ), the only 3-gram in which one kanji character is followed by two hiragana characters is 「遊びま」. In this example, the N-gram that is counted is the triple ([遊, アソ], ビ, マ), which combines the pair of the first-character kanji and its reading kana, (遊, アソ), with 「ビ」 and 「マ」, the readings of the two hiragana characters 「びま」 that follow the kanji. This extraction is performed over all words of the learning text and is repeated until every N-gram in which one kanji character is followed by two hiragana characters has been extracted (No in step S130). The control unit 130 controls this repetition. The control unit 130 is an ordinary unit that controls the time-series operation of each part of the reading kana correction model learning device 100 and performs no special processing. The same applies to the other embodiments.
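The extraction step can be illustrated with a minimal sketch. It assumes the learning text has already been aligned into (surface, reading) units, which the description does not detail; the function name, the data layout, and the Unicode ranges used to detect kanji and hiragana are illustrative assumptions, not part of the specification.

```python
def is_kanji(ch):
    # Assumption: CJK Unified Ideographs block is used as the kanji test.
    return "\u4e00" <= ch <= "\u9fff"


def is_hiragana(ch):
    return "\u3041" <= ch <= "\u3096"


def extract_kanji_kana_ngrams(aligned, n=3):
    """Return N-grams of the form ((kanji, reading), kana_1, ..., kana_{n-1}).

    `aligned` is a list of (surface, reading) units (hypothetical layout).
    A unit qualifies only if it is a single kanji, is not preceded by kanji,
    and is followed by n-1 single hiragana characters.
    """
    result = []
    for i, (surface, reading) in enumerate(aligned):
        if len(surface) != 1 or not is_kanji(surface):
            continue  # multi-kanji units and non-kanji units are skipped
        if i > 0 and is_kanji(aligned[i - 1][0][-1]):
            continue  # consecutive kanji are excluded from learning
        tail = aligned[i + 1:i + n]
        if len(tail) == n - 1 and all(
            len(s) == 1 and is_hiragana(s) for s, _ in tail
        ):
            result.append(((surface, reading),) + tuple(r for _, r in tail))
    return result


# The example sentence from the description, aligned by hand.
aligned = [
    ("今日", "キョウ"), ("は", "ワ"), ("外", "ソト"), ("で", "デ"),
    ("遊", "アソ"), ("び", "ビ"), ("ま", "マ"), ("し", "シ"),
    ("ょ", "ョ"), ("う", "ウ"), ("ね", "ネ"),
]
print(extract_kanji_kana_ngrams(aligned, n=3))
```

With N = 3 the only unit that qualifies is 「遊」, matching the example in the text: 「今日」 is a two-kanji unit, and 「外」 is followed by only one hiragana character before the next kanji.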

The N-gram model learning unit 120 counts the frequency of each of the N-grams extracted by the N-1 sequence extraction unit 110, learns a kanji-kana N-gram model, which is a probability model in which probabilities are assigned according to those frequencies, and outputs the kanji-kana N-gram model to the outside as the reading kana correction model 140 (step S120). Methods for learning N-gram models are well known, as described for example in Reference 1 (Kenji Kita, "Language and Computation 4: Probabilistic Language Models," University of Tokyo Press, pp. 57-62).
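As a rough sketch of the learning step, the following counts the extracted N-grams and converts the counts to probabilities. The description only says that probabilities are assigned according to frequency, so the normalization used here (relative frequency among N-grams that share the same kanji, with no smoothing) is an assumption, as are the function name and the toy counts.

```python
from collections import Counter


def train_kanji_kana_model(ngrams):
    """Map each ((kanji, reading), kana, ...) N-gram to a probability.

    Assumption: probability = count of the N-gram divided by the total
    count of all N-grams that start with the same kanji.
    """
    counts = Counter(ngrams)
    per_kanji = Counter()
    for ngram, count in counts.items():
        per_kanji[ngram[0][0]] += count
    return {
        ngram: count / per_kanji[ngram[0][0]]
        for ngram, count in counts.items()
    }


# Made-up observations for illustration only.
observed = [(("楽", "タノ"), "シ", "イ")] * 3 + [(("楽", "ラク"), "シ", "テ")]
model = train_kanji_kana_model(observed)
print(model)
```

Because the later correction step compares probabilities of N-grams that share one kanji, any per-kanji normalization cancels out in the ratio; a production model would more likely use a smoothed estimate of the kind covered in Reference 1.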

Conventional N-gram models typically learn combinations of adjacent words and are often used as language models for speech recognition and morphological analysis. The present invention is novel in that the N-gram model learns combinations of a kanji, its reading kana, and the readings that follow the kanji, and is used as a model for correcting reading kana errors.

N may be any number of 2 or more. For example, with N = 2, a kanji-kana N-gram model that considers only one character of reading following the kanji is possible. With N = 2, however, even in cases such as 「楽しい」 (タノシイ) and 「楽して」 (ラクシテ), where considering two readings after the kanji would determine the reading kana almost uniquely, only 「楽し」 can be considered, so the model shows no large probabilistic difference between the reading kana 「タノ」 and 「ラク」.

To avoid such a model, it is desirable to use a kanji-kana N-gram model with a larger N for frequently occurring kanji, for which a statistically sufficient amount of learning data is available. Even so, for kanji that occur infrequently, setting N large leaves too little learning data and causes a data sparseness problem.

Accordingly, N may be fixed at the value best suited to the learning text, or kanji-kana N-gram models with several values of N may be used together.

[Reading kana correction device]
FIG. 3 shows an example of the functional configuration of the reading kana correction device 200 of the present invention. The reading kana correction device 200 includes a reading kana correction model 140, a reading kana correction unit 210, and a control unit 230.

The reading kana correction model 140 is the kanji-kana N-gram model learned by the reading kana correction model learning device 100 described above. The kanji-kana N-gram model is, for example, a 3-gram model.

The reading kana correction unit 210 extracts N-grams in which one kanji character in the input text is followed by N-1 consecutive hiragana characters, inputs the N-gram of the target kanji to the reading kana correction model 140 to obtain the occurrence probability of the N-gram, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is a predetermined value or more, and outputs the result. For example, the reading kana correction unit 210 corrects the 3-gram ([楽, ガク], シ, イ) contained in the input text to ([楽, タノ], シ, イ), which has a higher occurrence probability, and outputs the text with the corrected reading kana to the outside. Here, the target kanji is any single kanji character in the input text that the reading kana correction device 200 treats as a correction target.

The reading kana correction unit 210 uses the occurrence probability calculated from the kanji-kana N-gram model learned by the reading kana correction model learning device 100 as the indicator for correcting reading kana errors in the input text. Given a combination of ([kanji, reading kana], N-1 readings following the kanji), the kanji-kana N-gram model can calculate its occurrence probability according to how often that combination appears in the learning text. A high probability is calculated for combinations that appear frequently in the learning text, and a low probability for combinations that appear rarely. This embodiment assumes that a reading with a high occurrence probability under the kanji-kana N-gram model is unlikely to be a reading kana error, and that a reading with a low occurrence probability is likely to be one. Reading kana errors are therefore corrected by replacing a reading kana with a low occurrence probability with one that has a high occurrence probability.

FIG. 4 shows a more specific example of the functional configuration of the reading kana correction unit 210, whose operation is now described in more detail. The reading kana correction unit 210 includes a single-kanji dictionary 211, input text reading kana occurrence probability calculation means 212, single-kanji reading kana occurrence probability calculation means 213, and reading kana determination means 214.

The single-kanji dictionary 211 is a dictionary that lists kanji appearing in Japanese text together with the candidate reading kana each kanji can take. For example, it stores entries such as 楽 (ラク), 楽しい (タノシイ), 楽して (ラクシテ), 楽しく (タノシク), and so on.

The input text reading kana occurrence probability calculation means 212 extracts N-grams in which one kanji character in the input text is followed by N-1 consecutive hiragana characters, inputs the N-gram of the target kanji to the reading kana correction model 140 learned by the reading kana correction model learning device 100, and obtains the occurrence probability P0 of that N-gram. For example, if the N-gram of the input text under consideration is ([楽, ガク], シ, イ), its occurrence probability P0 is obtained. The information on the target kanji is then output to the single-kanji reading kana occurrence probability calculation means 213.

The single-kanji reading kana occurrence probability calculation means 213 obtains one or more other reading kana candidates for the target kanji from the single-kanji dictionary 211, inputs them to the reading kana correction model 140, and obtains the occurrence probability Pk of each. When the target kanji is ([楽, ガク]), it obtains the occurrence probabilities P1, P2, and P3 of the other candidates: 楽しい (タノシイ) for k = 1, 楽して (ラクシテ) for k = 2, and 楽しく (タノシク) for k = 3.

The reading kana determination means 214 obtains the likelihood ratio Rk (= Pk/P0) between each occurrence probability Pk (k = 1, ..., n) and the occurrence probability P0, and determines the reading kana candidate whose likelihood ratio Rk is a predetermined value T or more and is the largest as the corrected reading kana of the target kanji. When no likelihood ratio Rk reaches the predetermined value T, the reading kana with occurrence probability P0 is kept as the reading kana of the target kanji. In the example where the target kanji is ([楽, ガク]), if the likelihood ratio R1 of (タノシイ) to (ガクシイ) is at least T and is the largest, the 3-gram ([楽, ガク], シ, イ) of the input text is corrected to ([楽, タノ], シ, イ) and output. The predetermined value T is set to roughly 2 to 3, so that the most likely reading candidate must be about two to three times as probable as the current reading. A likelihood ratio Rk of 1.0 or more means that a reading with a higher occurrence probability exists, but a threshold too close to 1.0 raises the risk of erroneous conversion. The predetermined value may therefore be chosen from trial results on the input text.
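The decision rule of the reading kana determination means 214 can be sketched as follows. The function name, the dictionary layout, the probability values, and the small floor that guards against P0 = 0 are all assumptions made for illustration; the description specifies only the ratio Rk = Pk/P0 and the threshold T.

```python
def decide_reading(p0, candidate_probs, threshold=2.0, floor=1e-9):
    """Return the candidate reading to switch to, or None to keep the original.

    p0              : occurrence probability of the reading in the input text
    candidate_probs : {candidate reading: Pk} from the correction model
    threshold       : predetermined value T (about 2 to 3 in the description)
    floor           : assumption, avoids division by zero for unseen N-grams
    """
    best_reading, best_ratio = None, 0.0
    for reading, pk in candidate_probs.items():
        ratio = pk / max(p0, floor)  # likelihood ratio Rk = Pk / P0
        if ratio >= threshold and ratio > best_ratio:
            best_reading, best_ratio = reading, ratio
    return best_reading


# Made-up probabilities: the input reading ガクシイ is rare in the model.
candidates = {"タノシイ": 0.6, "ラクシテ": 0.3, "タノシク": 0.1}
print(decide_reading(0.01, candidates, threshold=2.0))
```

When P0 is already comparable to every Pk, no ratio reaches T and the function returns None, which corresponds to keeping the original reading.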

FIG. 5 shows an example of the functional configuration of the reading kana correction device 300 of the present invention. The reading kana correction device 300 includes a reading kana correction model 140, a reading kana candidate extraction unit 310, a reading kana correction unit 320, a reading N-gram model 340, and a control unit 330. It differs from the reading kana correction device 200 (FIG. 3) in that the reading kana candidate extraction unit 310 outputs a plurality of reading kana candidates. The reading kana correction model 140 is the same as that of the reading kana correction device 200 (FIG. 3) of Embodiment 1.

The reading kana candidate extraction unit 310 extracts N-grams in which one kanji character in the input text is followed by N-1 consecutive hiragana characters, inputs the N-gram of the target kanji to the reading kana correction model learned by the reading kana correction model learning device to obtain the occurrence probability of the N-gram, and outputs the plurality of reading kana whose occurrence probability is a predetermined value or more as reading kana candidates for the target kanji.

FIG. 6 shows a more specific example of the functional configuration of the reading kana candidate extraction unit 310. The reading kana candidate extraction unit 310 differs from the reading kana correction unit 210 (FIG. 4) only in that it includes reading kana candidate selection means 311. The reading kana candidate selection means 311 takes as input the occurrence probability P0 of the N-gram of the target kanji, output by the input text reading kana occurrence probability calculation means 212, and the occurrence probabilities Pk (k = 1, ..., n) of the other reading kana candidates, output by the single-kanji reading kana occurrence probability calculation means 213. It obtains the likelihood ratios Rk (= Pk/P0) and outputs the plurality of reading kana candidates whose likelihood ratio Rk is the predetermined value T or more.

The reading N-gram model 340 is a model that has learned the appearance frequency of sequences of N consecutive readings in the learning text. For N = 3, in the learning text 「今日は外で遊びましょうね」 (キョウワソトデアソビマショウネ), every sequence of three consecutive readings, such as 「キョウ」, 「ョウワ」, and 「ウワソ」, is counted, and a probability is assigned according to its frequency. When a sequence of readings is input to the reading N-gram model 340, the occurrence probability of that reading can be calculated. The reading N-gram model 340 is constructed by the same well-known method as the reading kana correction model 140.
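A small sketch of how the reading N-grams in this example could be enumerated and counted is given below. The function names are hypothetical, and the plain relative-frequency estimate is an assumption; the description states only that probabilities are assigned according to frequency.

```python
from collections import Counter


def reading_ngrams(reading, n=3):
    """All runs of n consecutive kana in a katakana reading string."""
    return [reading[i:i + n] for i in range(len(reading) - n + 1)]


def train_reading_model(readings, n=3):
    """Assumption: probability = relative frequency over all observed N-grams."""
    counts = Counter(g for r in readings for g in reading_ngrams(r, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


sentence_reading = "キョウワソトデアソビマショウネ"
grams = reading_ngrams(sentence_reading, n=3)
print(grams[:3])  # the first three 3-grams named in the description
reading_model = train_reading_model([sentence_reading], n=3)
```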

The reading kana correction unit 320 refers to the reading N-gram model 340 to obtain the occurrence probability of each whole sentence containing one of the reading kana candidates, and outputs the sentence containing the candidate with the highest occurrence probability. As an example, suppose that for the kanji 「楽」 in the input text 「今日は楽しいな」 (キョウワガクシイナ), the reading kana candidate extraction unit 310 has output two reading kana candidates, 楽 (ラク) and 楽 (タノ).

In that case, the occurrence probability is calculated with the reading N-gram model 340 for each of the reading sequences of the whole input text, 「キョウワラクシイナ」 and 「キョウワタノシイナ」. In this example, 「キョウワタノシイナ」, the reading sequence with the higher occurrence probability, is output as the text with corrected reading kana.
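The sentence-level selection can be sketched as below. How the sentence probability is composed from N-gram probabilities is not spelled out in the description; this sketch assumes a sum of log probabilities over all 3-grams with a fixed floor for unseen 3-grams. The toy training readings, function names, and floor value are illustrative assumptions.

```python
import math
from collections import Counter


def reading_ngrams(reading, n=3):
    return [reading[i:i + n] for i in range(len(reading) - n + 1)]


def train_reading_model(readings, n=3):
    counts = Counter(g for r in readings for g in reading_ngrams(r, n))
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}


def sentence_log_score(reading, model, n=3, floor=1e-6):
    # Assumption: unseen N-grams receive a small floor probability.
    return sum(math.log(model.get(g, floor)) for g in reading_ngrams(reading, n))


def pick_best_reading(candidates, model, n=3):
    """Return the candidate sentence reading with the highest score."""
    return max(candidates, key=lambda r: sentence_log_score(r, model, n))


# Toy training data, made up for illustration.
toy_model = train_reading_model(["キョウワタノシイナ", "ラクシテイル"])
best = pick_best_reading(["キョウワラクシイナ", "キョウワタノシイナ"], toy_model)
print(best)
```

The two candidates have the same length, so comparing raw log scores is fair here; candidates of different lengths would need some form of length normalization.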

FIG. 7 shows an example of the functional configuration of the reading kana correction device 400 of the present invention. The reading kana correction device 400 includes a kanji-kana 2-gram model 142, a kanji-kana 3-gram model 143, a kanji-kana 4-gram model 144, a reading kana correction unit 410, and a control unit 430. It differs from the reading kana correction device 200 in that it includes a plurality of kanji-kana N-gram models 142 to 144.

The kanji-kana N-gram models 142 to 144 are probability models learned by the reading kana correction model learning device 100. The reading kana correction unit 410 extracts the 2-gram, 3-gram, and 4-gram in which one kanji character in the input text is followed by consecutive hiragana, inputs each N-gram of the target kanji to the model of the corresponding order (the kanji-kana 2-gram model 142, 3-gram model 143, or 4-gram model 144) to obtain the occurrence probability of each N-gram, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is a predetermined value or more, and outputs the result.

As described above, for frequently occurring kanji, for which a statistically sufficient amount of learning data is available, it is desirable to use a kanji-kana N-gram model with a larger N. For kanji that occur infrequently, however, a larger N leaves too little learning data and causes a data sparseness problem. The reading kana correction device 400 can solve this problem.

The reading kana correction device 400 uses a plurality of kanji-kana N-gram models together. It corrects the reading kana of the target kanji to the reading kana for which the sum of the likelihood ratios Rk_n-gram (= Pk_n-gram / P0_n-gram), calculated separately from each kanji-kana N-gram model, is a fixed value or more and is the largest, and outputs the result. For frequently occurring kanji, the reading kana correction device 400 can draw on the probabilities of models with a large N, so reading kana can be corrected with higher accuracy. For infrequently occurring kanji, it can draw on the probabilities of models with a small N, which mitigates the data sparseness problem.
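The combination of several kanji-kana N-gram models can be sketched as a sum of per-model likelihood ratios. The data layout, function name, probability values, threshold, and the floor for zero probabilities are assumptions for illustration; the description gives only the summed ratio and the "fixed value or more and largest" criterion.

```python
def decide_reading_multi(p0_by_n, pk_by_n, threshold, floor=1e-9):
    """Pick the reading whose summed likelihood ratio is >= threshold and largest.

    p0_by_n : {N: P0 from the N-gram model} for the reading in the input text
    pk_by_n : {candidate reading: {N: Pk from the N-gram model}}
    Returns None when no candidate reaches the threshold (keep the original).
    """
    best_reading, best_total = None, 0.0
    for reading, probs in pk_by_n.items():
        total = sum(
            probs.get(n, 0.0) / max(p0, floor)  # Rk for the N-gram model
            for n, p0 in p0_by_n.items()
        )
        if total >= threshold and total > best_total:
            best_reading, best_total = reading, total
    return best_reading


# Made-up probabilities for the 2-gram, 3-gram, and 4-gram models.
p0 = {2: 0.05, 3: 0.01, 4: 0.01}
pk = {
    "タノ": {2: 0.5, 3: 0.6, 4: 0.4},
    "ラク": {2: 0.45, 3: 0.05, 4: 0.0},
}
print(decide_reading_multi(p0, pk, threshold=6.0))
```

In this toy case the 2-gram model alone barely separates the two candidates, while the 3-gram and 4-gram models contribute most of the difference, which mirrors the motivation given for combining models of different orders.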

FIG. 8 shows an example of the functional configuration of the reading kana correction device 500 of the present invention. The reading kana correction device 500 includes a kanji-kana 2-gram model 142, a kanji-kana 3-gram model 143, a kanji-kana 4-gram model 144, a reading kana candidate extraction unit 510, a reading kana correction unit 320, a reading N-gram model 340, and a control unit 530. The reading kana correction device 500 combines the ideas of Embodiment 2 (the reading kana correction device 300 (FIG. 5)) and Embodiment 3 (the reading kana correction device 400 (FIG. 7)).

The reading-kana candidate extraction unit 510 extracts the 2-grams, 3-grams, and 4-grams in which hiragana appear adjacent to a single kanji character in the input text. It inputs each N-gram of the target kanji into the model of the corresponding order (the kanji-kana 2-gram model 142, the kanji-kana 3-gram model 143, or the kanji-kana 4-gram model 144) to obtain the occurrence probability of each N-gram, and outputs a plurality of reading-kana candidates for the target kanji whose occurrence probability is at least a predetermined value. As their reference numerals indicate, the reading-kana correction unit 320 and the reading N-gram model 340 are the same as those of the reading-kana correction device 300.
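
The behavior of the candidate extraction unit 510 can be sketched roughly as follows. This is an illustration under stated assumptions, not the patented implementation: the hiragana are assumed to follow the kanji, each model is assumed to map an N-gram string to a dictionary of readings and probabilities, and candidates from the three orders are merged by taking the union. The function names and toy probabilities are hypothetical.

```python
from typing import Dict, List, Tuple


def is_kanji(ch: str) -> bool:
    # CJK Unified Ideographs block (covers common kanji)
    return "\u4e00" <= ch <= "\u9fff"


def is_hiragana(ch: str) -> bool:
    # Unicode Hiragana block
    return "\u3041" <= ch <= "\u309f"


def extract_ngrams(text: str, orders=(2, 3, 4)) -> List[Tuple[int, int, str]]:
    """Return (position, n, ngram) for each kanji followed by n-1 hiragana."""
    found = []
    for i, ch in enumerate(text):
        if not is_kanji(ch):
            continue
        for n in orders:
            tail = text[i + 1:i + n]
            if len(tail) == n - 1 and all(is_hiragana(c) for c in tail):
                found.append((i, n, text[i:i + n]))
    return found


def reading_candidates(ngrams: List[Tuple[int, str]],
                       models: Dict[int, Dict[str, Dict[str, float]]],
                       threshold: float) -> List[str]:
    """Collect readings whose probability reaches the threshold in the
    model of the matching order; merge across orders (an assumption)."""
    best: Dict[str, float] = {}
    for n, ngram in ngrams:
        for reading, p in models.get(n, {}).get(ngram, {}).items():
            if p >= threshold:
                best[reading] = max(p, best.get(reading, 0.0))
    return sorted(best, key=best.get, reverse=True)


# Toy usage with invented probabilities for the kanji 降.
text = "今日は雨が降った"
grams = extract_ngrams(text)
toy_models = {
    2: {"降っ": {"ふ": 0.7, "お": 0.2, "くだ": 0.05}},
    3: {"降った": {"ふ": 0.8, "お": 0.15}},
}
for_kanji = [(n, g) for pos, n, g in grams if text[pos] == "降"]
print(reading_candidates(for_kanji, toy_models, threshold=0.1))
```

The candidates returned here would then be passed to the reading-kana correction unit 320, which chooses among them.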

According to the reading-kana correction device 500, which combines the ideas of the reading-kana correction devices 300 and 400, reading-kana correction can be performed that is less sensitive to differences in how often each kanji appears in the learning text and that is optimal for the sentence as a whole. This makes more accurate reading-kana correction possible.
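
One possible way to pick the readings that are optimal for the sentence as a whole is sketched below. The exhaustive enumeration of candidate combinations, the use of a bigram as the reading N-gram model, the floor probability, and all names and numbers are assumptions for illustration only.

```python
import itertools
import math
from typing import Dict, List, Sequence, Tuple

Bigram = Dict[Tuple[str, str], float]


def sentence_log_prob(readings: Sequence[str], bigram: Bigram,
                      floor: float = 1e-6) -> float:
    """Log probability of a reading sequence under a toy reading bigram
    model; unseen pairs get an assumed floor probability."""
    seq = ["<s>"] + list(readings)
    return sum(math.log(bigram.get((a, b), floor)) for a, b in zip(seq, seq[1:]))


def best_sentence_reading(slots: List[List[str]], bigram: Bigram) -> Tuple[str, ...]:
    """slots holds one candidate list per token (fixed tokens have a single
    candidate). Return the combination with the highest sentence score."""
    return max(itertools.product(*slots),
               key=lambda readings: sentence_log_prob(readings, bigram))


# Toy usage with invented probabilities for 雨が降った.
toy_bigram: Bigram = {
    ("<s>", "あめ"): 0.3, ("<s>", "あま"): 0.05,
    ("あめ", "が"): 0.5, ("あま", "が"): 0.1,
    ("が", "ふ"): 0.4, ("が", "お"): 0.1,
    ("ふ", "った"): 0.6, ("お", "った"): 0.2,
}
toy_slots = [["あめ", "あま"], ["が"], ["ふ", "お"], ["った"]]
print(best_sentence_reading(toy_slots, toy_bigram))
```

Enumerating every combination grows exponentially with the number of ambiguous kanji, so a practical system would more likely use dynamic programming such as a Viterbi search. The enumeration is kept here only because it is short.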

As described above, the reading-kana correction model learning device 100 of the present invention learns the kanji in the learning text, their reading kana, and the combinations of readings adjacent to those kanji, and can thereby provide a new statistical model usable as a model for correcting reading-kana errors. By using this new statistical model, the reading-kana correction devices 200, 300, 400, and 500 of the present invention can automatically correct to the right reading kana the wide variety of notation variants found in text that contains informal spellings written by individuals, such as Twitter posts and blogs. The reading-kana correction devices 200, 300, 400, and 500 of the present invention also reduce the cost, previously unavoidable, of designing a rule every time a new notation-variation pattern appears.

When the processing means of the above devices are implemented by a computer, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the processing means of each device on that computer.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. Alternatively, the program may be distributed by storing it in a storage device of a server computer and transferring it from the server computer to another computer via a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of the processing content may be implemented in hardware.

Claims

A reading-kana correction model learning device comprising:
an N-1 sequence extraction unit that takes a mixed kanji-kana learning text as input and extracts N-grams in which N-1 hiragana characters appear adjacent to a single kanji character in the learning text; and
an N-gram model learning unit that learns a kanji-kana N-gram model in which a probability is assigned according to the frequency of occurrence of each N-gram, and outputs the kanji-kana N-gram model as a reading-kana correction model.

A reading-kana correction device comprising:
a reading-kana correction model learned by the reading-kana correction model learning device according to claim 1; and
a reading-kana correction unit that extracts an N-gram in which N-1 hiragana characters appear adjacent to a single kanji character in an input text, inputs the N-gram of the target kanji into the reading-kana correction model to obtain the occurrence probability of the N-gram, and corrects the reading kana of the target kanji to a reading kana whose occurrence probability is equal to or greater than a predetermined value.

The reading-kana correction device according to claim 2, wherein the reading-kana correction unit comprises:
a single-kanji dictionary listing the kanji that appear in Japanese text and the reading kana that each kanji can take;
input-text reading-kana occurrence probability calculating means that extracts an N-gram in which N-1 hiragana characters appear adjacent to a single kanji character in the input text, inputs the N-gram of the target kanji into the reading-kana correction model learned by the reading-kana correction model learning device, and calculates the occurrence probability P0 of the N-gram;
other-reading-kana occurrence probability calculating means that obtains one or more other reading-kana candidates for the target kanji from the single-kanji dictionary, inputs the other reading-kana candidates into the reading-kana correction model, and obtains the occurrence probability Pk of each other reading-kana candidate; and
reading-kana determination means that obtains the likelihood ratio Rk between the occurrence probability Pk and the occurrence probability P0, determines as the corrected reading kana of the target kanji the reading-kana candidate whose likelihood ratio Rk is equal to or greater than a predetermined value and is the largest, and, when the likelihood ratio Rk is less than or equal to the predetermined value, determines the reading kana having the occurrence probability P0 as the reading kana of the target kanji.

A reading-kana correction device comprising:
a reading-kana correction model learned by the reading-kana correction model learning device according to claim 1;
a reading-kana candidate extraction unit that extracts an N-gram in which N-1 hiragana characters appear adjacent to a single kanji character in an input text, inputs the N-gram of the target kanji into the reading-kana correction model learned by the reading-kana correction model learning device to obtain the occurrence probability of the N-gram, and outputs a plurality of reading kana whose occurrence probability is equal to or greater than a predetermined value as reading-kana candidates for the target kanji; and
a reading N-gram model that has learned the frequency of occurrence of sequences of N consecutive readings in a learning text,
wherein an occurrence probability of a sentence containing the reading-kana candidates is obtained with reference to the reading N-gram model.

A reading-kana correction device comprising:
a kanji-kana 2-gram model, a kanji-kana 3-gram model, and a kanji-kana 4-gram model learned by the reading-kana correction model learning device according to claim 1; and
a reading-kana correction unit that extracts the 2-grams, 3-grams, and 4-grams in which hiragana characters appear adjacent to a single kanji character in an input text, inputs the N-grams of the target kanji into the kanji-kana 2-gram model, the kanji-kana 3-gram model, and the kanji-kana 4-gram model respectively to obtain the occurrence probability of each N-gram, corrects the reading kana of the target kanji to a reading whose occurrence probability is equal to or greater than a predetermined value, and outputs it.

A reading-kana correction model learning method comprising:
an N-1 sequence extraction step of taking a mixed kanji-kana learning text as input and extracting N-grams in which N-1 hiragana characters appear adjacent to a single kanji character in the learning text; and
an N-gram model learning step of learning a kanji-kana N-gram model in which a probability is assigned according to the frequency of occurrence of each N-gram, and outputting the kanji-kana N-gram model as a reading-kana correction model.

A reading-kana correction method comprising:
a reading-kana correction model learned by the reading-kana correction model learning method according to claim 6; and
a reading-kana correction step of extracting an N-gram in which N-1 hiragana characters appear adjacent to a single kanji character in an input text, inputting the N-gram of the target kanji into the reading-kana correction model to obtain the occurrence probability of the N-gram, and correcting the reading kana of the target kanji to a reading kana whose occurrence probability is equal to or greater than a predetermined value.

A program for causing a computer to function as the reading kana correction model learning device according to claim 1 and the reading kana correction device according to any one of claims 2 to 5.