JP2014232510A - Japanese kana character correction model learning device, Japanese kana character correction device, methods therefor, and program

Info

Publication number
JP2014232510A
Authority
JP
Japan
Prior art keywords
kana
reading
gram
kanji
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP2013114254A
Other languages
Japanese (ja)
Other versions
JP5961586B2 (en)
Inventor
Hiroko Murakami (村上 博子)
Hideyuki Mizuno (水野 秀之)
Yusuke Ijima (井島 勇祐)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nippon Telegraph and Telephone Corp
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2013114254A priority Critical patent/JP5961586B2/en
Publication of JP2014232510A publication Critical patent/JP2014232510A/en
Application granted granted Critical
Publication of JP5961586B2 publication Critical patent/JP5961586B2/en
Legal status: Expired - Fee Related

Abstract

PROBLEM TO BE SOLVED: To provide a reading-kana correction model learning device and a reading-kana correction device that learn and apply a reading-kana correction model, a statistical model for automatically correcting reading-kana errors caused by notation variability.
SOLUTION: The reading-kana correction model learning device includes: an N-1 sequence extraction unit that takes learning text containing a mixture of kanji and kana as input and extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the learning text; and an N-gram model learning unit that learns a kanji-kana N-gram model assigning probabilities according to the frequency of the N-grams and outputs the kanji-kana N-gram model as a reading-kana correction model. The reading-kana correction device inputs the N-gram of a target kanji into the reading-kana correction model to obtain its occurrence probability, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is at least a predetermined value, and outputs the result.

Description

The present invention relates to a reading-kana correction model learning device that generates a reading-kana correction model used for the automatic correction of reading-kana errors, to a reading-kana correction device that uses the model, and to methods and programs therefor.

Conventionally, reading kana have been assigned to kanji by obtaining word candidates consisting of (word notation, part of speech, reading kana) triples from a word dictionary, selecting the word sequence that is most appropriate as a Japanese sentence on the basis of part-of-speech connections between words, and assigning reading kana to the kanji according to the reading kana of the selected word sequence (for example, Non-Patent Document 1).

Text written by individuals, such as Twitter posts and blogs, contains informal notation with variants such as small-kana substitution (e.g., 「嬉しい」→「嬉しぃ」) and katakana substitution (e.g., 「知らない」→「知ラナイ」). When text containing such notation variants is given as input for reading-kana assignment, dictionary matching fails during word-sequence selection and reading-kana errors occur. To reduce reading-kana errors caused by notation variants, the conventional approach rewrites the text with rules before word-sequence selection, converting the text containing notation variants into notation that can be matched against the dictionary, and then performs word-sequence selection.

Non-Patent Document 1: Yuji Matsumoto, et al., "Japanese Morphological Analysis System ChaSen Version 2.0 Users Manual," NAIST-IS-TR99012 (1999).

Because the notation-variant patterns contained in informally written text are highly diverse, many variants cannot be covered by conventional rule-based text rewriting. Moreover, since the rules must be designed by hand, designing a rule every time a new notation-variant pattern appears is costly.

The present invention has been made in view of this problem, and its object is to provide a reading-kana correction model learning device that learns a reading-kana correction model, which is a statistical model for automatically correcting reading-kana errors caused by notation variants, a reading-kana correction device that uses the model, and methods and programs therefor.

The reading-kana correction model learning device of the present invention comprises an N-1 sequence extraction unit and an N-gram model learning unit. The N-1 sequence extraction unit takes learning text containing a mixture of kanji and kana as input and extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the learning text. The N-gram model learning unit learns a kanji-kana N-gram model that assigns probabilities according to the frequency of each N-gram, and outputs the kanji-kana N-gram model as the reading-kana correction model.

The reading-kana correction device of the present invention comprises a reading-kana correction model and a reading-kana correction unit. The reading-kana correction model is one learned by the reading-kana correction model learning device described above. The reading-kana correction unit extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the input text, inputs the N-gram of the target kanji into the reading-kana correction model to obtain its occurrence probability, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is at least a predetermined value, and outputs the result.

The reading-kana correction model learning device of the present invention provides a reading-kana correction model, an N-gram probability model over a single kanji character in the learning text, its reading kana, and the N-1 characters following the kanji, which can be used to correct notation variants contained in text. The reading-kana correction device of the present invention can automatically correct reading-kana errors caused by notation variants in text by using this correction model. It therefore reduces the cost of designing a rule every time a new notation-variant pattern appears.

[Brief Description of the Drawings]
FIG. 1 shows an example functional configuration of the reading-kana correction model learning device 100 of the present invention.
FIG. 2 shows the operation flow of the reading-kana correction model learning device 100.
FIG. 3 shows an example functional configuration of the reading-kana correction device 200 of the present invention.
FIG. 4 shows a more specific example functional configuration of the reading-kana correction unit 210.
FIG. 5 shows an example functional configuration of the reading-kana correction device 300 of the present invention.
FIG. 6 shows a more specific example functional configuration of the reading-kana candidate extraction unit 310.
FIG. 7 shows an example functional configuration of the reading-kana correction device 400 of the present invention.
FIG. 8 shows an example functional configuration of the reading-kana correction device 500 of the present invention.

Embodiments of the present invention are described below with reference to the drawings. Identical components appearing in multiple drawings are given the same reference numerals, and their description is not repeated.

[Reading-Kana Correction Model Learning Device]
FIG. 1 shows an example functional configuration of the reading-kana correction model learning device 100 of the present invention; its operation flow is shown in FIG. 2. The reading-kana correction model learning device 100 comprises an N-1 sequence extraction unit 110, an N-gram model learning unit 120, and a control unit 130. It is realized by loading a predetermined program into a computer comprising, for example, a ROM, a RAM, and a CPU, and having the CPU execute the program. The same applies to the other embodiments described below.

The N-1 sequence extraction unit 110 takes learning text containing a mixture of kanji and kana as input and extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the learning text (step S110). In the learning text, only N-grams in which N-1 hiragana characters follow a single kanji character are used for learning. Sequences in which kanji appear consecutively, and sequences in which the N-1 characters after the kanji include non-hiragana characters (katakana, kanji, symbols, etc.), are excluded from learning.

For example, with N=3, in the learning text 「今日は外で遊びましょうね」 (read キョウワソトデアソビマショウネ), the only place where two hiragana characters follow a single kanji character is the 「遊びま」 portion. Here, the triple combining (遊, アソ), the pair of the kanji and its reading kana, with ビ and マ, the readings of the two hiragana characters 「びま」 following the kanji, is counted as the N-gram ([遊, アソ], ビ, マ). This N-gram extraction is performed over all words of the learning text and is repeated until every N-gram in which two hiragana characters follow a single kanji character has been extracted (No in step S130). The iteration is controlled by the control unit 130, a general component that controls the time-sequential operation of each unit of the reading-kana correction model learning device 100 and performs no special processing. The same applies to the other embodiments.
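The extraction step can be made concrete with a short sketch. The following is a minimal Python sketch of the N-1 sequence extraction for N=3, assuming the learning text comes with a per-kanji reading annotation (supplied here by the hypothetical mapping `readings`); the readings of the following hiragana are obtained by converting them to katakana. The function names are chosen for this sketch and do not come from the patent.

```python
import re

KANJI = r'[\u4e00-\u9fff]'
HIRAGANA = r'[\u3041-\u3096]'

def to_katakana(hira: str) -> str:
    """Hiragana and katakana sit at a fixed offset of 0x60 in Unicode."""
    return ''.join(chr(ord(ch) + 0x60) for ch in hira)

def extract_3grams(text: str, readings: dict) -> list:
    """Extract ([kanji, reading], kana_1, kana_2) for every single kanji
    (not preceded by another kanji) followed by two hiragana characters."""
    grams = []
    for m in re.finditer(f'(?<!{KANJI})({KANJI})({HIRAGANA}{{2}})', text):
        kanji, hira = m.group(1), m.group(2)
        kana1, kana2 = to_katakana(hira)
        grams.append(((kanji, readings.get(kanji)), kana1, kana2))
    return grams

print(extract_3grams('今日は外で遊びましょうね', {'遊': 'アソ'}))
# -> [(('遊', 'アソ'), 'ビ', 'マ')]
```

On the example sentence only 「遊びま」 qualifies: the other kanji are either followed by fewer than two hiragana or adjoin another kanji, matching the exclusions described above.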

The N-gram model learning unit 120 counts the frequency of each N-gram extracted by the N-1 sequence extraction unit 110, learns a kanji-kana N-gram model, a probability model that assigns probabilities according to those frequencies, and outputs the kanji-kana N-gram model as the reading-kana correction model 140 (step S120). Methods for learning N-gram models are well known, as described for example in Reference 1 (Kenji Kita, "Language and Computation 4: Stochastic Language Models," University of Tokyo Press, pp. 57-62).
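As a minimal sketch of this learning step, the following counts the extracted N-grams and converts the counts to relative frequencies, a plain maximum-likelihood estimate; a practical model would add the smoothing techniques covered in Reference 1. The function name is chosen for this sketch.

```python
from collections import Counter

def learn_model(ngrams):
    """ngrams: iterable of ([kanji, reading], kana_1, ..., kana_N-1) tuples.
    Returns a mapping from each N-gram to its relative frequency."""
    counts = Counter(ngrams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# Usage with the 3-gram extracted above:
model = learn_model([(('遊', 'アソ'), 'ビ', 'マ')])
print(model[(('遊', 'アソ'), 'ビ', 'マ')])  # -> 1.0 for this toy corpus
```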

Conventional N-gram models learn combinations of adjacent words and are typically used as language models for speech recognition or morphological analysis. The present invention is novel in that the N-gram model learns combinations of a kanji, its reading kana, and the readings following the kanji, and is used as a model for correcting reading-kana errors.

N may be any value of 2 or more. For example, with N=2 one obtains a kanji-kana N-gram model that considers only the kanji and a single following reading character. However, with N=2, even in cases such as 「楽しい」 (タノシイ) versus 「楽して」 (ラクシテ), where considering two following reading characters would determine the reading kana almost uniquely, only 「楽し」 can be considered, so the model shows no large probabilistic difference between the readings タノ and ラク.

To avoid such a model, it is desirable to use a kanji-kana N-gram model with a larger N for frequently occurring kanji, for which a statistically sufficient amount of training data can be obtained. Even then, for kanji with low frequency, setting N large causes a data-sparseness problem because training data become insufficient.

Accordingly, N may be fixed to the value best suited to the learning text, or kanji-kana N-gram models with several values of N may be used in combination.
[Reading-Kana Correction Device]
FIG. 3 shows an example functional configuration of the reading-kana correction device 200 of the present invention. The reading-kana correction device 200 comprises a reading-kana correction model 140, a reading-kana correction unit 210, and a control unit 230.

The reading-kana correction model 140 is the kanji-kana N-gram model learned by the reading-kana correction model learning device 100 described above, for example a 3-gram model.

The reading-kana correction unit 210 extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the input text, inputs the N-gram of the target kanji into the reading-kana correction model 140 to obtain its occurrence probability, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is at least a predetermined value, and outputs the result. For example, the reading-kana correction unit 210 corrects the 3-gram ([楽, ガク], シ, イ) contained in the input text to the higher-probability ([楽, タノ], シ, イ) and outputs the text with the corrected reading kana. Here, the target kanji is any single kanji character in the input text that the reading-kana correction device 200 subjects to correction.

The reading-kana correction unit 210 uses the occurrence probabilities computed by the kanji-kana N-gram model learned by the reading-kana correction model learning device 100 as the criterion for correcting reading-kana errors in the input text. Given a combination ([kanji, reading kana], N-1 readings following the kanji), the kanji-kana N-gram model computes its occurrence probability according to the frequency with which the combination appears in the learning text: combinations that appear frequently receive high probabilities, and combinations that appear rarely receive low probabilities. This embodiment assumes that a combination with a high occurrence probability is unlikely to contain a reading-kana error, while one with a low occurrence probability is likely to contain one, and corrects reading-kana errors by replacing a reading kana with a low occurrence probability by one with a high occurrence probability.

FIG. 4 shows a more specific example functional configuration of the reading-kana correction unit 210, whose operation is described in more detail below. The reading-kana correction unit 210 comprises a single-kanji dictionary 211, an input-text reading-kana occurrence probability calculation means 212, a single-kanji reading-kana occurrence probability calculation means 213, and a reading-kana determination means 214.

The single-kanji dictionary 211 is a dictionary listing the kanji that appear in Japanese text together with the reading-kana candidates each kanji can take. For example, it stores entries such as 楽(ラク), 楽しい(タノシイ), 楽して(ラクシテ), 楽しく(タノシク), and so on.
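In code, the single-kanji dictionary can be as simple as a mapping from a kanji to its possible reading kana. The sketch below follows the 「楽」 example in the text; the entry and the name `TANKANJI_DICT` are illustrative, not an actual dictionary format.

```python
# Minimal sketch of the single-kanji dictionary 211: kanji -> candidate
# reading kana. The listed readings are illustrative, not exhaustive.
TANKANJI_DICT = {
    '楽': ['ラク', 'タノ', 'ガク'],
}
```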

The input-text reading-kana occurrence probability calculation means 212 extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the input text, and inputs the N-gram of the target kanji into the reading-kana correction model 140 learned by the reading-kana correction model learning device 100 to obtain its occurrence probability P0. For example, if the target N-gram of the input text is ([楽, ガク], シ, イ), its occurrence probability P0 is obtained. Information on the target kanji is then passed to the single-kanji reading-kana occurrence probability calculation means 213.

The single-kanji reading-kana occurrence probability calculation means 213 obtains one or more alternative reading-kana candidates for the target kanji from the single-kanji dictionary and inputs them into the reading-kana correction model 140 to obtain their occurrence probabilities Pk. With the target kanji ([楽, ガク]), the occurrence probabilities P1, P2, and P3 of the alternative candidates k=1 楽しい (タノシイ), k=2 楽して (ラクシテ), and k=3 楽しく (タノシク) are obtained.

The reading-kana determination means 214 computes the likelihood ratio Rk (= Pk/P0) between each occurrence probability Pk (k=1,…,n) and the occurrence probability P0, and determines as the corrected reading kana of the target kanji the candidate whose likelihood ratio Rk is at least a predetermined value T and is the largest; if no likelihood ratio Rk exceeds the predetermined value T, the reading kana with occurrence probability P0 is kept as the reading kana of the target kanji. In the example with target kanji ([楽, ガク]), if the likelihood ratio R1 of (タノシイ) to (ガクシイ) is the largest and is at least the predetermined value T, the 3-gram ([楽, ガク], シ, イ) of the input text is corrected to ([楽, タノ], シ, イ) and output. The predetermined value T is set so that the occurrence probability of the maximum-likelihood reading-kana candidate is roughly 2 to 3 times P0 (T = about 2 to 3). A likelihood ratio Rk of 1.0 or more means that a reading with a higher occurrence probability exists, but values too close to 1.0 increase the risk of erroneous conversion; the predetermined value may therefore be chosen by trial on the input text.
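Putting the means 212 to 214 together, the following is a minimal sketch of the likelihood-ratio decision, assuming `model` maps kanji-kana N-grams to occurrence probabilities (for instance the output of `learn_model` above) and `candidates` lists the alternative readings from the single-kanji dictionary 211. The threshold default follows the suggested range T = 2 to 3; the probability floor for unseen N-grams is an assumption of this sketch.

```python
def correct_reading(model, observed, candidates, T=2.5):
    """observed: ([kanji, reading_0], kana_1, ..., kana_N-1).
    Returns the corrected reading, or reading_0 if no Rk reaches T."""
    (kanji, reading0), *context = observed
    p0 = model.get(observed, 1e-12)       # floor so unseen N-grams divide safely
    best_reading, best_ratio = reading0, 0.0
    for reading in candidates:
        pk = model.get(((kanji, reading), *context), 0.0)
        rk = pk / p0                       # likelihood ratio Rk = Pk/P0
        if rk >= T and rk > best_ratio:
            best_reading, best_ratio = reading, rk
    return best_reading

# correct_reading(model, (('楽', 'ガク'), 'シ', 'イ'), ['タノ', 'ラク'])
# would return 'タノ' when P(([楽, タノ], シ, イ)) is sufficiently larger than P0.
```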

FIG. 5 shows an example functional configuration of the reading-kana correction device 300 of the present invention. The reading-kana correction device 300 comprises a reading-kana correction model 140, a reading-kana candidate extraction unit 310, a reading-kana correction unit 320, a reading N-gram model 340, and a control unit 330. The reading-kana candidate extraction unit 310 differs from that of the reading-kana correction device 200 (FIG. 3) in that it outputs multiple reading-kana candidates. The reading-kana correction model 140 is the same as in the reading-kana correction device 200 (FIG. 3) of the first embodiment.

The reading-kana candidate extraction unit 310 extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the input text, inputs the N-gram of the target kanji into the reading-kana correction model learned by the reading-kana correction model learning device to obtain its occurrence probability, and outputs the multiple reading kana whose occurrence probabilities are at least a predetermined value as reading-kana candidates for the target kanji.

FIG. 6 shows a more specific example functional configuration of the reading-kana candidate extraction unit 310. It differs from the reading-kana correction unit 210 (FIG. 4) only in comprising a reading-kana candidate selection means 311. The reading-kana candidate selection means 311 takes as input the occurrence probability P0 of the target kanji's N-gram output by the input-text reading-kana occurrence probability calculation means 212 and the occurrence probabilities Pk (k=1,…,n) of the other reading-kana candidates output by the single-kanji reading-kana occurrence probability calculation means 213, computes the likelihood ratios Rk (= Pk/P0), and outputs the multiple reading-kana candidates whose likelihood ratios Rk are at least the predetermined value T.
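A minimal sketch of the candidate selection, differing from `correct_reading` above only in returning every candidate whose likelihood ratio reaches the threshold rather than a single winner; having the caller fall back to the original reading when the list comes back empty is a design choice of this sketch.

```python
def select_candidates(model, observed, candidates, T=2.5):
    """Returns every alternative reading whose likelihood ratio Rk = Pk/P0
    is at least T; the caller falls back to the original reading when the
    returned list is empty."""
    (kanji, _), *context = observed
    p0 = model.get(observed, 1e-12)
    return [reading for reading in candidates
            if model.get(((kanji, reading), *context), 0.0) / p0 >= T]
```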

The reading N-gram model 340 is a model that learns the frequencies of sequences of N consecutive readings in the learning text. With N=3, for the learning text 「今日は外で遊びましょうね(キョウワソトデアソビマショウネ)」, every sequence of three consecutive reading characters, such as 「キョウ」, 「ョウワ」, and 「ウワソ」, is counted and assigned a probability according to its frequency. Given a reading sequence as input, the reading N-gram model 340 computes the occurrence probability of that reading. The reading N-gram model 340 is built in the same well-known manner as the reading-kana correction model 140.
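A minimal sketch of building the reading N-gram model: every run of N=3 consecutive reading characters is counted, in the same maximum-likelihood fashion as `learn_model` above. The function name is chosen for this sketch.

```python
from collections import Counter

def learn_reading_model(reading_text: str, n: int = 3):
    """Count every run of n consecutive reading characters and return
    a mapping from each n-gram to its relative frequency."""
    grams = [reading_text[i:i + n] for i in range(len(reading_text) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    return {gram: count / total for gram, count in counts.items()}

# learn_reading_model('キョウワソトデアソビマショウネ') counts 「キョウ」,
# 「ョウワ」, 「ウワソ」, and so on, assigning each a relative frequency.
```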

The reading-kana correction unit 320 obtains the occurrence probability of a sentence containing each of the multiple reading-kana candidates by referring to the reading N-gram model 340 and outputs the sentence containing the reading-kana candidate with the highest occurrence probability. As an example, suppose that for the kanji 「楽」 in the input text 「今日は楽しいな(キョウワガクシイナ)」, the reading-kana candidate extraction unit 310 has output the two candidates 楽(ラク) and 楽(タノ).

In that case, the occurrence probabilities of the two reading sequences of the entire input text, 「キョウワラクシイナ」 and 「キョウワタノシイナ」, are computed with the reading N-gram model. In this example, the reading sequence with the higher occurrence probability, 「キョウワタノシイナ」, is output as the text with corrected reading kana.
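A minimal sketch of this sentence-level decision, assuming `reading_model` is the mapping produced by `learn_reading_model` above: each candidate reading sequence for the whole sentence is scored by the sum of the log-probabilities of its 3-grams, and the best-scoring sequence is kept. The floor for unseen 3-grams is an assumption of this sketch.

```python
import math

def sentence_logprob(reading_model, seq, n=3, floor=1e-12):
    """Sum of log-probabilities of all n-grams in the reading sequence."""
    return sum(math.log(reading_model.get(seq[i:i + n], floor))
               for i in range(len(seq) - n + 1))

def pick_best(reading_model, candidate_sequences):
    """Return the candidate reading sequence with the highest score."""
    return max(candidate_sequences,
               key=lambda seq: sentence_logprob(reading_model, seq))

# pick_best(reading_model, ['キョウワラクシイナ', 'キョウワタノシイナ'])
# returns the higher-probability sequence, here 'キョウワタノシイナ'.
```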

FIG. 7 shows an example functional configuration of the reading-kana correction device 400 of the present invention. The reading-kana correction device 400 comprises a kanji-kana 2-gram model 142, a kanji-kana 3-gram model 143, a kanji-kana 4-gram model 144, a reading-kana correction unit 410, and a control unit 430. It differs from the reading-kana correction device 200 in comprising multiple kanji-kana N-gram models 142 to 144.

The kanji-kana N-gram models 142 to 144 are probability models learned by the reading-kana correction model learning device 100. The reading-kana correction unit 410 extracts the 2-grams, 3-grams, and 4-grams in which hiragana characters appear consecutively after a single kanji character in the input text, inputs each N-gram of the target kanji into the kanji-kana 2-gram model 142, the kanji-kana 3-gram model 143, or the kanji-kana 4-gram model 144 of the corresponding order to obtain its occurrence probability, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is at least a predetermined value, and outputs the result.

As noted above, for frequently occurring kanji, for which a statistically sufficient amount of training data can be obtained, it is desirable to use a kanji-kana N-gram model with a larger N. For kanji with low frequency, however, setting N large leads to insufficient training data and a data-sparseness problem. The reading-kana correction device 400 solves this problem.

The reading-kana correction device 400 uses multiple kanji-kana N-gram models together and corrects the reading kana of the target kanji to the reading kana for which the sum of the likelihood ratios Rk_n-gram (= Pk_n-gram / P0_n-gram), computed separately from each kanji-kana N-gram model, is at least a fixed value and is the largest, and outputs the result. With the reading-kana correction device 400, frequently occurring kanji can exploit the probabilities of models with a larger N, enabling more accurate reading-kana correction, while rarely occurring kanji can exploit the probabilities of models with a smaller N, mitigating the data-sparseness problem.
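A minimal sketch of this combination rule, assuming `models` maps each order n in {2, 3, 4} to a kanji-kana n-gram model of that order and `observed_by_n` maps n to the observed n-gram of the target kanji; the likelihood ratios from the three orders are summed and the candidate with the largest sum at or above the threshold wins. The names and the probability floor are assumptions of this sketch.

```python
def correct_multi(models, observed_by_n, candidates, threshold):
    """observed_by_n[n] = ([kanji, reading_0], kana_1, ..., kana_n-1).
    Returns the reading with the largest likelihood-ratio sum >= threshold,
    falling back to the original reading otherwise."""
    reading0 = next(iter(observed_by_n.values()))[0][1]   # original reading
    best_reading, best_sum = reading0, 0.0
    for reading in candidates:
        total = 0.0
        for n, observed in observed_by_n.items():
            (kanji, _), *context = observed
            p0 = models[n].get(observed, 1e-12)
            pk = models[n].get(((kanji, reading), *context), 0.0)
            total += pk / p0               # Rk_n-gram = Pk_n-gram / P0_n-gram
        if total >= threshold and total > best_sum:
            best_reading, best_sum = reading, total
    return best_reading
```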

FIG. 8 shows an example functional configuration of the reading-kana correction device 500 of the present invention. The reading-kana correction device 500 comprises a kanji-kana 2-gram model 142, a kanji-kana 3-gram model 143, a kanji-kana 4-gram model 144, a reading-kana candidate extraction unit 510, a reading-kana correction unit 320, a reading N-gram model 340, and a control unit 530. The reading-kana correction device 500 combines the ideas of the second embodiment (reading-kana correction device 300, FIG. 5) and the third embodiment (reading-kana correction device 400, FIG. 7).

The reading-kana candidate extraction unit 510 extracts the 2-grams, 3-grams, and 4-grams in which hiragana characters appear consecutively after a single kanji character in the input text, inputs each N-gram of the target kanji into the kanji-kana 2-gram model 142, the kanji-kana 3-gram model 143, or the kanji-kana 4-gram model 144 of the corresponding order to obtain its occurrence probability, and outputs the multiple reading-kana candidates of the target kanji whose occurrence probabilities are at least a predetermined value. The reading-kana correction unit 320 and the reading N-gram model 340 are, as their reference numerals indicate, the same as in the reading-kana correction device 300.

By combining the ideas of the reading-kana correction devices 300 and 400, the reading-kana correction device 500 performs reading-kana correction that is less sensitive to differences in kanji frequency in the learning text and is optimal over the sentence as a whole, enabling still more accurate reading-kana correction.

As described above, the reading-kana correction model learning device 100 of the present invention learns combinations of a kanji in the learning text, its reading kana, and the readings following the kanji, and provides a new statistical model that can be used for correcting reading-kana errors. Using this new statistical model, the reading-kana correction devices 200, 300, 400, and 500 of the present invention can automatically correct the wide variety of notation variants contained in informally written text by individuals, such as Twitter posts and blogs, to the correct reading kana. They thus reduce the previously required cost of designing a rule every time a new notation-variant pattern appears.

When the processing means of the above devices are realized by a computer, the processing content of the functions each device should have is described by a program, and by executing the program on the computer, the processing means of each device are realized on the computer.

The program may be distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded, or by storing the program in the storage device of a server computer and transferring it from the server computer to other computers over a network.

Each means may be configured by executing a predetermined program on a computer, or at least part of its processing content may be realized in hardware.

Claims (8)

1. A reading-kana correction model learning device comprising:
an N-1 sequence extraction unit that takes learning text containing a mixture of kanji and kana as input and extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the learning text; and
an N-gram model learning unit that learns a kanji-kana N-gram model assigning probabilities according to the frequency of the N-grams and outputs the kanji-kana N-gram model as a reading-kana correction model.
2. A reading-kana correction device comprising:
a reading-kana correction model learned by the reading-kana correction model learning device according to claim 1; and
a reading-kana correction unit that extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in input text, inputs the N-gram of the target kanji into the reading-kana correction model to obtain its occurrence probability, corrects the reading kana of the target kanji to a reading kana whose occurrence probability is at least a predetermined value, and outputs the result.
3. The reading-kana correction device according to claim 2, wherein the reading-kana correction unit comprises:
a single-kanji dictionary listing the kanji that appear in Japanese text and the reading-kana candidates each kanji can take;
an input-text reading-kana occurrence probability calculation means that extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the input text and inputs the N-gram of the target kanji into the reading-kana correction model learned by the reading-kana correction model learning device to obtain its occurrence probability P0;
a single-kanji reading-kana occurrence probability calculation means that obtains one or more other reading-kana candidates for the target kanji from the single-kanji dictionary and inputs them into the reading-kana correction model to obtain their occurrence probabilities Pk; and
a reading-kana determination means that computes the likelihood ratio Rk between each occurrence probability Pk and the occurrence probability P0, determines as the corrected reading kana of the target kanji the candidate whose likelihood ratio Rk is at least a predetermined value and is the largest, and, when no likelihood ratio Rk exceeds the predetermined value, determines the reading kana with occurrence probability P0 as the reading kana of the target kanji.
4. A reading-kana correction device comprising:
a reading-kana correction model learned by the reading-kana correction model learning device according to claim 1;
a reading-kana candidate extraction unit that extracts N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in input text, inputs the N-gram of the target kanji into the reading-kana correction model to obtain its occurrence probability, and outputs the multiple reading kana whose occurrence probabilities are at least a predetermined value as reading-kana candidates for the target kanji;
a reading N-gram model that learns the frequencies of sequences of N consecutive readings in the learning text; and
a reading-kana correction unit that obtains, by referring to the reading N-gram model, the occurrence probability of a sentence containing each reading-kana candidate and outputs the sentence containing the reading-kana candidate with the highest occurrence probability.
5. A reading-kana correction device comprising:
kanji-kana 2-gram, kanji-kana 3-gram, and kanji-kana 4-gram reading-kana correction models learned by the reading-kana correction model learning device according to claim 1; and
a reading-kana correction unit that extracts the 2-grams, 3-grams, and 4-grams in which hiragana characters appear consecutively after a single kanji character in input text, inputs each N-gram of the target kanji into the kanji-kana 2-gram model, kanji-kana 3-gram model, or kanji-kana 4-gram model of the corresponding order to obtain its occurrence probability, corrects the reading kana to a reading kana whose occurrence probability is at least a predetermined value, and outputs the result.
6. A reading-kana correction model learning method comprising:
an N-1 sequence extraction step of taking learning text containing a mixture of kanji and kana as input and extracting N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in the learning text; and
an N-gram model learning step of learning a kanji-kana N-gram model assigning probabilities according to the frequency of the N-grams and outputting the kanji-kana N-gram model as a reading-kana correction model.
7. A reading-kana correction method using a reading-kana correction model learned by the reading-kana correction model learning method according to claim 6, the method comprising:
a reading-kana correction step of extracting N-grams in which N-1 hiragana characters appear consecutively after a single kanji character in input text, inputting the N-gram of the target kanji into the reading-kana correction model to obtain its occurrence probability, correcting the reading kana of the target kanji to a reading kana whose occurrence probability is at least a predetermined value, and outputting the result.
8. A program for causing a computer to function as the reading-kana correction model learning device according to claim 1 or as the reading-kana correction device according to any one of claims 2 to 5.
JP2013114254A 2013-05-30 2013-05-30 Reading kana correction model learning device, reading kana correction device, method and program thereof Expired - Fee Related JP5961586B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP2013114254A JP5961586B2 (en) 2013-05-30 2013-05-30 Reading kana correction model learning device, reading kana correction device, method and program thereof

Publications (2)

Publication Number Publication Date
JP2014232510A 2014-12-11
JP5961586B2 2016-08-02

Family

ID=52125826

Family Applications (1)

Application Number Title Priority Date Filing Date
JP2013114254A Expired - Fee Related JP5961586B2 (en) 2013-05-30 2013-05-30 Reading kana correction model learning device, reading kana correction device, method and program thereof

Country Status (1)

Country Link
JP (1) JP5961586B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11687599B2 (en) 2019-01-31 2023-06-27 Nippon Telegraph And Telephone Corporation Data retrieving apparatus, method, and program

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000353159A (en) * 1999-06-11 2000-12-19 Nippon Telegr & Teleph Corp <Ntt> Notation-reading correspondence device, notation- reading dictionary generating method, text reading arranging device, text reading arranging method, and recording medium
JP2003132052A (en) * 2001-10-19 2003-05-09 Nippon Hoso Kyokai <Nhk> Application apparatus for phonetic transcription in kana, and program thereof
JP2007226359A (en) * 2006-02-21 2007-09-06 Nec Corp Reading evaluation method, reading evaluation device, and reading evaluation program
JP2009294913A (en) * 2008-06-05 2009-12-17 Nippon Hoso Kyokai <Nhk> Language processing apparatus and program

Also Published As

Publication number Publication date
JP5961586B2 (en) 2016-08-02

Legal Events

Date        Code  Description
2015-07-31  A621  Written request for application examination (JAPANESE INTERMEDIATE CODE: A621)
2016-04-15  A977  Report on retrieval (JAPANESE INTERMEDIATE CODE: A971007)
2016-04-19  A131  Notification of reasons for refusal (JAPANESE INTERMEDIATE CODE: A131)
2016-05-23  A521  Request for written amendment filed (JAPANESE INTERMEDIATE CODE: A523)
            TRDD  Decision of grant or rejection written
2016-06-21  A01   Written decision to grant a patent or to grant a registration (utility model) (JAPANESE INTERMEDIATE CODE: A01)
2016-06-27  A61   First payment of annual fees (during grant procedure) (JAPANESE INTERMEDIATE CODE: A61)
            R150  Certificate of patent or registration of utility model (Ref document number: 5961586; Country of ref document: JP; JAPANESE INTERMEDIATE CODE: R150)
            LAPS  Cancellation because of no payment of annual fees