JP5115239B2

JP5115239B2 - Character processing device

Info

Publication number: JP5115239B2
Application number: JP2008052216A
Authority: JP
Inventors: 博増市; 智子大熊; 大悟杉原
Original assignee: Fuji Xerox Co Ltd; Fujifilm Business Innovation Corp
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2008-03-03
Filing date: 2008-03-03
Publication date: 2013-01-09
Anticipated expiration: 2028-03-03
Also published as: JP2009211287A

Description

The present invention relates to a character processing device and a program that generate conversion rules for converting the result of text segmentation obtained by morphological analysis or the like.

Techniques for extracting keywords from text (techniques for segmenting text into words) play an important role in language processing tasks such as text retrieval and text classification. For example, if appropriate keywords can be assigned to a text, highly accurate text retrieval and text classification become possible.
The process of segmenting text into words (including assigning parts of speech) is generally called morphological analysis. For general text such as newspaper text, morphological analysis achieves accuracy that is sufficient in practice. However, for text that contains many technical terms, such as medical text, the word dictionaries required for morphological analysis are not sufficiently developed, so it is hard to say that analysis accuracy has reached a practically sufficient level.

As a technique for obtaining high analysis accuracy even for text in a field where dictionaries are not well developed, morphological analysis based on an error-driven model has been proposed (for example, Patent Document 1, Non-Patent Document 1, and Non-Patent Document 2).
In such an error-driven model, a correct corpus (text to which correct word boundary information has been added) is prepared, and the morphological analysis result for the same text is compared with the word boundaries of the correct corpus. This comparison yields rules that convert the word boundaries obtained from morphological analysis into the correct word boundaries. By applying these rules to morphological analysis results, word segmentation results of extremely high accuracy can be obtained even for text in a field where word dictionaries are not sufficiently developed.

Patent Document 1: Japanese Patent Laid-Open No. 2000-040085
Non-Patent Document 1: "Post-processing of morphological analysis using rewrite rules and context information", IPSJ Research Report, NL-126, 1998, pp. 55-62
Non-Patent Document 2: "Recognition of unregistered Chinese words based on an error-driven model", IPSJ Research Report, NL-134, 1999, pp. 123-129
Non-Patent Document 3: Okanohara, "Development of a next-generation data compression method using word extraction", [online], FY2002 Mitou (Unexplored) Software Creation Project, Mitou Youth research report, Internet <URL: http://homepage3.nifty.com/DO/okamito04.pdf>

A morphological analysis technique based on an error-driven model as described above requires that a correct corpus be prepared. Creating a correct corpus generally takes a large number of man-hours. In particular, a correct corpus for a highly specialized field such as medicine must be created by experts, so the creation cost becomes extremely high. A technique that extracts keywords using only statistics of word-likeness, without a correct corpus, has also been proposed (see, for example, Non-Patent Document 3). In that case, however, the morphological analysis dictionaries maintained for existing morphological analyzers cannot be used at all, so only low-accuracy analysis results can be obtained.

The present invention has been made in view of the above circumstances. Its object is to provide a technique for realizing a morphological analysis system that yields highly accurate word segmentation results even for text in a field where neither word dictionaries nor a correct corpus are available.

A first aspect of the present invention is a character processing device comprising: calculation means for calculating a word likelihood indicating the word-likeness of a character string consisting of one or more characters; first determination means for determining whether a concatenated character string, formed by concatenating one or more character strings, is a first concatenated character string, that is, one whose first average value (the average of the word likelihoods of the character strings in the concatenated character string) is below a predetermined first threshold; second determination means for determining whether a concatenated character string obtained by changing the positions of the character string boundaries in the first concatenated character string is a second concatenated character string, that is, one satisfying any of (i) the condition that its second average value (the average of the word likelihoods of the character strings in that concatenated character string) exceeds a predetermined second threshold, (ii) the condition that the value obtained by subtracting the first average value from the second average value exceeds a predetermined third threshold, or (iii) both of these conditions; and generation means for generating a conversion rule that converts the boundary positions in the first concatenated character string into the boundary positions in the corresponding second concatenated character string.

A second aspect of the present invention is characterized in that, in the first aspect, the conversion rule generated by the generation means includes one or more character strings located immediately before or immediately after the first concatenated character string in the sentence from which the first concatenated character string was extracted.

A third aspect of the present invention is characterized in that, in the first or second aspect, the generation means generalizes a generated conversion rule to generate a new conversion rule.

A fourth aspect of the present invention is characterized in that, in any of the first to third aspects, the character processing device comprises segmentation means for segmenting a sentence into a plurality of character strings, and the first determination means determines whether a concatenated character string, formed by concatenating one or more of the character strings produced by the segmentation means in their order in the sentence, is a first concatenated character string.

A fifth aspect of the present invention is characterized in that, in any of the first to fourth aspects, the first determination means determines whether a concatenated character string, formed by concatenating one or more character strings contained in a sentence that has been segmented into a plurality of parts in their order in the sentence, is a first concatenated character string, and the character processing device comprises: application means for applying a conversion rule to the whole of the sentence; and selection means for selecting, from a plurality of conversion rules generated by the generation means, a conversion rule for which the average word likelihood of the character strings in the sentence based on the boundary positions after the rule is applied, minus the average word likelihood of the character strings in the sentence based on the boundary positions before the rule is applied, exceeds a fourth threshold.

A sixth aspect of the present invention is a program for causing a computer to realize: a calculation function of calculating a word likelihood indicating the word-likeness of a character string consisting of one or more characters; a first determination function of determining whether a concatenated character string, formed by concatenating one or more character strings, is a first concatenated character string, that is, one whose first average value (the average of the word likelihoods of the character strings in the concatenated character string) is below a predetermined first threshold; a second determination function of determining whether a concatenated character string obtained by changing the positions of the character string boundaries in the first concatenated character string is a second concatenated character string, that is, one satisfying any of (i) the condition that its second average value (the average of the word likelihoods of the character strings in that concatenated character string) exceeds a predetermined second threshold, (ii) the condition that the value obtained by subtracting the first average value from the second average value exceeds a predetermined third threshold, or (iii) both of these conditions; and a generation function of generating a conversion rule that converts the boundary positions in the first concatenated character string into the boundary positions in the corresponding second concatenated character string.

According to the character processing device of the first aspect, conversion rules are generated on the basis of the average word likelihood of the character strings in a concatenated character string, so neither a word dictionary nor a correct corpus needs to be prepared. Consequently, even for text in a field where word dictionaries and a correct corpus are not available, using the generated conversion rules makes it possible to realize a morphological analysis system that yields highly accurate word segmentation results.

According to the character processing device of the second aspect, the situations in which a conversion rule applies are restricted by the character strings before and after the first concatenated character string, so conversion rules that are applied in more appropriate situations can be generated.

According to the character processing device of the third aspect, generalizing a conversion rule relaxes the conditions under which it applies, so conversion rules that apply more broadly can be generated.

According to the character processing device of the fourth aspect, conversion rules can be generated on the basis of the result of morphological analysis.

According to the character processing device of the fifth aspect, a more suitable conversion rule can be selected from among a plurality of generated conversion rule candidates.

According to the program of the sixth aspect, a computer can be made to function as the character processing device described above.

The present invention will now be described specifically on the basis of an embodiment.
The morphological analysis system exemplified below presupposes the use of morphological analysis means that achieves practically sufficient accuracy (about 99%) on text in a certain field and that, when applied to text in another field (for example, the medical field), achieves accuracy just short of the practical level (about 95%). It is also assumed that no correct corpus exists for text in the latter field. If a sufficient amount of correct corpus exists, it is preferable to build the morphological analysis means by learning from the correct corpus rather than using the technique of the present invention. Moreover, when the technique is applied to text in a field where only accuracy far below the practical level (for example, 90% or less) is obtained, it is difficult to raise the accuracy to a practically sufficient level with an error-driven model.

FIG. 1 is a functional block diagram of a morphological analysis system configured by applying the present invention.
The morphological analysis system of this example comprises: text storage means 1 that stores a plurality of texts in the same field (the medical field in this example); morphological analysis means 2 that performs morphological analysis on the texts stored in the text storage means 1; morphological analysis result storage means 3 that stores the morphological analysis results; word likelihood calculation means 4 that calculates a value indicating the word-likeness of an arbitrary character string (the word likelihood); conversion rule generation means 5 that generates candidate conversion rules for correcting the morphological analysis results; conversion rule selection means 6 that selects effective conversion rules from the candidates; and conversion rule application means 7 that applies conversion rules to morphological analysis results.

In this example, text in the medical field is stored in the text storage means 1. The morphological analysis means 2 performs morphological analysis on all the texts stored in the text storage means 1, segments each text into a plurality of character strings (word candidates) according to the morphological analysis dictionary, assigns a part of speech to each, and stores the result in the morphological analysis result storage means 3.
For example, when the sentence "軽度のび慢性を認める。" (roughly, "mild diffuseness is observed") is morphologically analyzed, the result shown in FIG. 2 is obtained and stored in the morphological analysis result storage means 3. According to the figure, the morphological analysis yields "軽度 (noun)", "のび (verb)", "慢性 (noun)", "を (case particle)", "認める (verb)", and "。 (period)".

When given an arbitrary character string (a character string consisting of one or more characters), the word likelihood calculation means 4 refers to the texts stored in the text storage means 1 and calculates a value indicating the word-likeness of that character string (the word likelihood).
Various definitions of word likelihood are conceivable. In this example, the following definition proposed in Non-Patent Document 3 is used as the word likelihood WT of a character string C.
WT = length * (log(totalCount) + log(count))
Here, length is the number of characters in the character string C, totalCount is the total number of words stored in the morphological analysis result storage means 3, and count is the number of times the character string C appears in all the texts stored in the text storage means 1.
Of course, the word likelihood may be defined in other ways. For example, "word length × word occurrence frequency" may be used as the word likelihood.
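The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `word_likelihood`, the two-sentence corpus, the value `total_count=11`, and the choice to score unseen strings as 0.0 are all assumptions made for the example.

```python
import math

def word_likelihood(candidate: str, texts: list[str], total_count: int) -> float:
    """Return WT = length * (log(totalCount) + log(count)) for one string."""
    # count: how often the candidate occurs in all stored texts.
    # str.count counts non-overlapping matches, which is a simplification.
    count = sum(text.count(candidate) for text in texts)
    if count == 0:
        # log(0) is undefined; scoring unseen strings as 0.0 is an assumption.
        return 0.0
    return len(candidate) * (math.log(total_count) + math.log(count))

# Hypothetical two-sentence corpus and word total, for illustration only.
texts = ["軽度のび慢性を認める。", "び慢性の病変を認める。"]
print(word_likelihood("び慢性", texts, total_count=11))
print(word_likelihood("のび", texts, total_count=11))
```

Because the score grows with both string length and frequency, a longer string that recurs across texts (here "び慢性") scores higher than a short string that appears only once.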

Next, the process by which the conversion rule generation means 5 of this example generates conversion rule candidates will be described.
In this example, the following conversion rule, proposed for instance in Non-Patent Document 1, is used as a template.
a_1 … a_K W_1 … W_n b_1 … b_L ⇒ a_1 … a_K W_1' … W_m' b_1 … b_L
Here, a_p (p = 1 … K), b_q (q = 1 … L), W_r (r = 1 … n), and W_s' (s = 1 … m) are each a word (character string), and a_1 … a_K, b_1 … b_L, W_1 … W_n, and W_1' … W_m' are each a word sequence (a concatenation of one or more character strings). That is, the rule converts W_1 … W_n into W_1' … W_m' when the word sequences before and after W_1 … W_n are a_1 … a_K and b_1 … b_L, respectively. In this example, L = K = 1. (L = K = 1 is the usual choice because of the data sparseness problem: with values of 2 or more, the situations in which a conversion rule applies become severely limited, so the rule loses generality and practical usefulness.)
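One way to picture the template is as a small record with four word sequences. The sketch below is illustrative only: the class name `ConversionRule`, its field names, and the helper `apply_rule` are hypothetical, part-of-speech tags are omitted for brevity, and the sample tokens are borrowed from the example sentence of this embodiment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ConversionRule:
    left_context: tuple[str, ...]   # a_1 ... a_K
    source: tuple[str, ...]         # W_1 ... W_n
    right_context: tuple[str, ...]  # b_1 ... b_L
    target: tuple[str, ...]         # W_1' ... W_m'

    @property
    def lhs(self) -> tuple[str, ...]:
        return self.left_context + self.source + self.right_context

    @property
    def rhs(self) -> tuple[str, ...]:
        return self.left_context + self.target + self.right_context

def apply_rule(rule: ConversionRule, tokens: list[str]) -> list[str]:
    """Rewrite every non-overlapping match of rule.lhs in tokens with rule.rhs."""
    out: list[str] = []
    i, n = 0, len(rule.lhs)
    while i < len(tokens):
        if n and tuple(tokens[i:i + n]) == rule.lhs:
            out.extend(rule.rhs)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

# K = L = 1: one context word on each side of the rewritten span.
rule = ConversionRule(("軽度",), ("のび", "慢性"), ("を",), ("の", "び慢性"))
print(apply_rule(rule, ["軽度", "のび", "慢性", "を", "認める", "。"]))
```

Keeping the context words inside both sides of the rule means a match is only rewritten when its neighbors agree, which is the restriction the template is meant to express.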

Non-Patent Document 1 presupposes the existence of a correct corpus. In this example, conversion rule candidates are instead generated without a correct corpus, by the process shown in the flowchart of FIG. 3.
The conversion rule generation means 5 receives from the word likelihood calculation means 4 the word likelihood WT of each character string in the morphological analysis results stored in the morphological analysis result storage means 3.
Then, for each word sequence W_1 … W_n obtained by scanning the morphological analysis result from the beginning (a concatenated character string formed by concatenating one or more character strings of the morphological analysis result in their order in the source sentence), it determines whether the sequence is a first concatenated character string, that is, whether the average word likelihood (WTn) of its constituent character strings is smaller than a threshold T1 (a preset non-negative real number). Each concatenated character string determined to be a first concatenated character string is extracted, together with the one word before it and the one word after it, as the left side of a conversion rule (step S11).

For example, if the average WT (WTn) of "のび" and "慢性" is smaller than T1, then "軽度 (noun) / のび (verb) / 慢性 (noun) / を (case particle)" is extracted from the above morphological analysis result as the left side of a conversion rule. In this case, a_1 … a_K = "軽度 (noun)", W_1 … W_n = "のび (verb) / 慢性 (noun)", and b_1 … b_L = "を (case particle)".
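Step S11 can be sketched as a scan over contiguous token spans. The following is a rough illustration under stated assumptions: the function name `find_low_likelihood_spans`, the span-length limit `max_n`, the decision to skip spans that lack a neighbor on either side, and all word likelihood values are hypothetical.

```python
def find_low_likelihood_spans(tokens, wt, t1, max_n=2):
    """Return (left word, span, right word) for spans whose mean WT is below t1.

    tokens: segmented sentence; wt: token -> word likelihood.
    Spans at the sentence edges are skipped because K = L = 1 requires
    one context word on each side (a simplifying assumption).
    """
    found = []
    for n in range(1, max_n + 1):
        for start in range(1, len(tokens) - n):
            span = tokens[start:start + n]
            mean_wt = sum(wt[t] for t in span) / n  # WTn
            if mean_wt < t1:
                found.append((tokens[start - 1], tuple(span), tokens[start + n]))
    return found

# Hypothetical word likelihood values, chosen only to make the example work.
tokens = ["軽度", "のび", "慢性", "を", "認める", "。"]
wt = {"軽度": 9.0, "のび": 2.0, "慢性": 4.0, "を": 8.0, "認める": 9.0, "。": 9.0}
for left, span, right in find_low_likelihood_spans(tokens, wt, t1=3.5):
    print(left, "/", " / ".join(span), "/", right)
```

With these values the span "のび / 慢性" averages 3.0, falls below T1 = 3.5, and is extracted with its neighbors "軽度" and "を", matching the left side described in the text.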

Next, all boundary candidates (concatenated character strings with different boundary positions) are enumerated for the first concatenated character string W_1 … W_n obtained in step S11 (step S12).
For example, when the first concatenated character string W_1 … W_n is "のび／慢性", the boundary candidates "のび慢性", "の／び慢性", "のび／慢性", "のび慢／性", "の／び／慢性", "のび／慢／性", and "の／び／慢／性" are obtained, as shown in FIG. 4.
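The enumeration in step S12 can be sketched by deciding, for each gap between adjacent characters, whether to place a boundary there. The function name `enumerate_segmentations` is hypothetical. Note that an exhaustive enumeration of a four-character string produces 2^3 = 8 segmentations, including the original one, whereas the list quoted in the text shows seven.

```python
from itertools import product

def enumerate_segmentations(text: str) -> list[list[str]]:
    """Return every way of cutting text into contiguous, non-empty pieces."""
    if not text:
        return []
    results = []
    # One True/False decision per gap between adjacent characters.
    for cuts in product([False, True], repeat=len(text) - 1):
        pieces, start = [], 0
        for pos, cut in enumerate(cuts, start=1):
            if cut:
                pieces.append(text[start:pos])
                start = pos
        pieces.append(text[start:])
        results.append(pieces)
    return results

for pieces in enumerate_segmentations("のび慢性"):
    print("／".join(pieces))
```

The number of candidates doubles with each added character, so in practice this exhaustive approach only suits the short spans that step S11 extracts.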

After that, for each boundary candidate, it is determined whether the candidate is a second concatenated character string, that is, whether the average word likelihood (WTm) of its constituent character strings is larger than a threshold T2 (a preset non-negative real number) and WTm − WTn is larger than a threshold T3 (a preset non-negative real number). For each boundary candidate determined to be a second concatenated character string, a conversion rule whose right side is based on that candidate is generated and added to the conversion rule candidates. If a boundary candidate does not satisfy the above condition, no conversion rule is generated for it (step S13).
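The test in step S13 can be sketched as follows. This is an illustration under assumptions: the function names, the `require_both` switch (which also covers the either-condition variant permitted by the first aspect), the treatment of unseen pieces as 0.0, and all numeric values are hypothetical.

```python
def mean_wt(pieces, wt):
    """Average word likelihood of the pieces; unseen pieces score 0.0 (assumption)."""
    return sum(wt.get(p, 0.0) for p in pieces) / len(pieces)

def is_second_concatenated(wtm, wtn, t2, t3, require_both=True):
    """Check WTm > T2 and/or WTm - WTn > T3."""
    above_t2 = wtm > t2
    gain_above_t3 = (wtm - wtn) > t3
    if require_both:
        return above_t2 and gain_above_t3
    return above_t2 or gain_above_t3

def right_side_candidates(candidates, wt, wtn, t2, t3, require_both=True):
    """Keep the boundary candidates that qualify as right sides of a rule."""
    return [p for p in candidates
            if is_second_concatenated(mean_wt(p, wt), wtn, t2, t3, require_both)]

# Hypothetical likelihoods; WTn = 3.0 is the mean for the original "のび／慢性".
wt = {"の": 8.0, "び慢性": 9.0, "のび": 2.0, "慢性": 4.0, "のび慢性": 1.0}
candidates = [["のび慢性"], ["の", "び慢性"], ["のび", "慢性"]]
print(right_side_candidates(candidates, wt, wtn=3.0, t2=5.0, t3=2.0))
```

With these values only "の／び慢性" passes: its mean of 8.5 exceeds T2 = 5.0 and improves on WTn by 5.5, which exceeds T3 = 2.0.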

For example, if the boundary candidate satisfying the above condition is "の／び慢性", the following conversion rule is obtained.
軽度 (noun) / のび (verb) / 慢性 (noun) / を (case particle)
⇒ 軽度 (noun) / の (unknown word) / び慢性 (unknown word) / を (case particle)
In other words, a conversion rule is generated that converts the boundary positions in the first concatenated character string "のび／慢性" into the boundary positions in the corresponding second concatenated character string "の／び慢性".
In this example, part-of-speech information is attached to each word in the same way as in Non-Patent Document 1. However, the part of speech of the rewritten character strings on the right side of the conversion rule cannot be identified, so they are labeled "unknown word", meaning that the part of speech is unknown.

The above processing is performed on all concatenated character strings obtained by scanning the morphological analysis results from the beginning, and conversion rules are generated. The conversion rule generation means 5 also performs the generalization of conversion rules proposed in Non-Patent Document 1. That is, it generalizes the conversion rules obtained by the above processing to generate new conversion rules and adds them to the conversion rule candidates (step S14).

One form of generalization is a conversion rule that targets the parts of speech of W_1 … W_n rather than the words themselves. That is, instead of a conversion rule that applies only when the specific character strings W_1 … W_n match, the rule applies when the parts of speech corresponding to W_1 … W_n match. Another form is a conversion rule that ignores b_1 … b_L (deletes it from the rule), so that the rule applies regardless of the character strings that follow W_1 … W_n. These are only examples, and various generalization techniques that can increase the generality of conversion rules may be adopted.
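The two generalizations mentioned above can be sketched on the matching side of a rule. Everything here is a hypothetical illustration: the `Rule` record, the use of `None` as a wildcard, and the helper names are assumptions, and the sketch deliberately leaves the right side untouched because the text does not specify how the right side of a part-of-speech rule is expressed.

```python
from dataclasses import dataclass, replace
from typing import Optional

# (surface, part of speech); None acts as a wildcard in a pattern.
Token = tuple[Optional[str], Optional[str]]

@dataclass(frozen=True)
class Rule:
    left: tuple[Token, ...]     # a_1 ... a_K
    source: tuple[Token, ...]   # W_1 ... W_n
    right: tuple[Token, ...]    # b_1 ... b_L
    target: tuple[Token, ...]   # W_1' ... W_m'

def generalize_to_pos(rule: Rule) -> Rule:
    """Match W_1 ... W_n by part of speech only, ignoring the surface strings."""
    return replace(rule, source=tuple((None, pos) for _, pos in rule.source))

def drop_right_context(rule: Rule) -> Rule:
    """Ignore b_1 ... b_L so the rule applies whatever follows W_1 ... W_n."""
    return replace(rule, right=())

def matches(pattern: tuple[Token, ...], tokens: tuple[Token, ...]) -> bool:
    """True when every pattern element agrees with the token at that position."""
    return len(pattern) == len(tokens) and all(
        (ps is None or ps == s) and (pp is None or pp == p)
        for (ps, pp), (s, p) in zip(pattern, tokens)
    )

rule = Rule(
    left=(("軽度", "noun"),),
    source=(("のび", "verb"), ("慢性", "noun")),
    right=(("を", "case particle"),),
    target=(("の", "unknown word"), ("び慢性", "unknown word")),
)
general = generalize_to_pos(rule)
print(matches(general.source, (("伸び", "verb"), ("急性", "noun"))))
print(drop_right_context(rule).right)
```

Both helpers return new rules rather than modifying the original, so the specific rule and its generalized variants can all be kept as separate candidates.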

As a result of the above processing, a plurality of conversion rule candidates may be generated for the same first concatenated character string. In this example, as shown in FIG. 5, those that satisfy a certain condition are selected from among them to form the final set of conversion rules.
Specifically, the conversion rule selection means 6 picks one of the conversion rule candidates generated by the conversion rule generation means 5 (step S21) and applies that conversion rule to the morphological analysis results stored in the morphological analysis result storage means 3. It then determines whether the value (WTc), obtained by subtracting the average word likelihood (WTb) of the character strings in the morphological analysis result storage means 3 before the rule is applied from the average word likelihood (WTa) of the character strings defined by the new boundary positions, is larger than a fourth threshold (a preset non-negative real number). A conversion rule that satisfies this condition is selected and added to the final set of conversion rules (step S22). This is repeated for all conversion rule candidates (step S23). In other words, the candidates whose application produces a certain improvement in the average word likelihood are selected.
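The selection loop of steps S21 to S23 can be sketched as below. This is a simplified illustration: rules are plain (left side, right side) tuples without part-of-speech tags, the helper names and the name `t4` for the fourth threshold are hypothetical, a single sentence stands in for the stored analysis results, and all likelihood values are made up.

```python
def replace_all(tokens, lhs, rhs):
    """Rewrite every non-overlapping occurrence of lhs in tokens with rhs."""
    out, i, n = [], 0, len(lhs)
    while i < len(tokens):
        if n and tuple(tokens[i:i + n]) == tuple(lhs):
            out.extend(rhs)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def mean_wt(tokens, wt):
    """Average word likelihood; unseen tokens score 0.0 (assumption)."""
    return sum(wt.get(t, 0.0) for t in tokens) / len(tokens)

def select_rules(candidates, tokens, wt, t4):
    """Keep rules whose gain WTc = WTa - WTb exceeds t4; return (lhs, rhs, WTc)."""
    wtb = mean_wt(tokens, wt)                      # before applying any rule
    selected = []
    for lhs, rhs in candidates:                    # S21, repeated via S23
        wta = mean_wt(replace_all(tokens, lhs, rhs), wt)
        wtc = wta - wtb
        if wtc > t4:                               # S22
            selected.append((lhs, rhs, wtc))
    return selected

tokens = ["軽度", "のび", "慢性", "を", "認める", "。"]
wt = {"軽度": 9, "のび": 2, "慢性": 4, "を": 8, "認める": 9, "。": 9,
      "の": 8, "び慢性": 9, "のび慢性": 1}
lhs = ("軽度", "のび", "慢性", "を")
candidates = [(lhs, ("軽度", "の", "び慢性", "を")),
              (lhs, ("軽度", "のび慢性", "を"))]
for rule in select_rules(candidates, tokens, wt, t4=1.0):
    print(rule)
```

Here the split into "の／び慢性" raises the sentence average by about 1.83 and is kept, while the merge into "のび慢性" gains only about 0.37 and is dropped at T4 = 1.0.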

The conversion rule application means 7 applies each conversion rule in the conversion rule set obtained from the conversion rule selection means 6 to the result of morphologically analyzing an arbitrary text with the morphological analysis means 2, and thereby obtains the final morphological analysis result. In this example, conversion rules are applied in descending order of WTc, and processing ends when no conversion rules remain to be applied.
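The ordering described above can be sketched as a sort followed by one pass per rule. The representation of a rule as a (left side, right side, WTc) tuple, the helper names, and the WTc values are assumptions made for illustration.

```python
def replace_all(tokens, lhs, rhs):
    """Rewrite every non-overlapping occurrence of lhs in tokens with rhs."""
    out, i, n = [], 0, len(lhs)
    while i < len(tokens):
        if n and tuple(tokens[i:i + n]) == tuple(lhs):
            out.extend(rhs)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out

def apply_rule_set(tokens, rules):
    """Apply (lhs, rhs, wtc) rules once each, largest WTc first."""
    current = list(tokens)
    for lhs, rhs, _wtc in sorted(rules, key=lambda r: r[2], reverse=True):
        current = replace_all(current, lhs, rhs)
    return current

lhs = ("軽度", "のび", "慢性", "を")
rules = [
    (lhs, ("軽度", "のび慢性", "を"), 0.37),      # hypothetical WTc
    (lhs, ("軽度", "の", "び慢性", "を"), 1.83),  # hypothetical WTc
]
print(apply_rule_set(["軽度", "のび", "慢性", "を", "認める", "。"], rules))
```

Because the rule with the larger WTc runs first and rewrites the span, the competing rule with the same left side no longer matches, so the ordering also resolves conflicts between rules.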

As described above, in this example conversion rules are generated on the basis of the average word likelihood of the character strings in a concatenated character string, so neither a word dictionary nor a correct corpus needs to be prepared. Consequently, conversion rules for correcting morphological analysis results can be generated even for text in a field, such as the medical field, where word dictionaries and a correct corpus are not available. Applying these conversion rules to morphological analysis results realizes a morphological analysis system that yields highly accurate word segmentation results.

For each boundary candidate obtained by changing the boundary positions in the first concatenated character string, the conversion rule generation means 5 of this example determines the candidate to be a second concatenated character string when WTm (the average word likelihood of the candidate) is larger than the threshold T2 and the value obtained by subtracting WTn (the average word likelihood of the first concatenated character string) from WTm is larger than the threshold T3. However, a candidate may instead be determined to be a second concatenated character string when it satisfies either one of these conditions.
The point is simply that a candidate can be determined to be a second concatenated character string when the average word likelihood shows a certain improvement after the boundary positions are changed. For example, when only the first condition is used, the threshold T2 may be set to the threshold T1 plus a predetermined value (a non-negative real number).

In this example, the comparison checks whether the average word likelihood is larger (or smaller) than a threshold, but it may instead check whether the average is greater than or equal to (or less than or equal to) the threshold. For this reason, in the present application, being larger than a threshold or being greater than or equal to it is expressed as "exceeding the threshold", and being smaller than a threshold or being less than or equal to it is expressed as "being below the threshold".

Extensions of the function of the conversion rule selection means 6 of this example will now be described.
For example, a user interface may be provided that allows conversion rules to be manually added to, or chosen from, the selection result of the conversion rule selection means 6. This makes it possible to generate and select conversion rules using expert knowledge, and thus to build a more accurate morphological analysis system.

As another example, when the conversion rule selection means 6 selects conversion rules, it may select the next conversion rule using not the initial morphological analysis result but the morphological analysis result to which the already selected conversion rules have been applied. That is, the first selected conversion rule (the optimal conversion rule, which improves the average word likelihood the most) is applied to the morphological analysis result produced by the morphological analysis means 2, each of the other conversion rules is then applied to that result to select the next conversion rule, and this is repeated recursively. Selecting conversion rules recursively in this way makes it possible to select them more suitably.
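A minimal Python sketch of this repeated selection over a fixed pool of candidate rules is shown below, written as a loop rather than a recursion. It rests on several assumptions that are not in the specification: a rule is represented as a pair (token sequence before, token sequence after), `apply_rule` replaces every matching run of tokens, selection stops when the best remaining gain no longer exceeds `fourth_threshold` (a name borrowed from the claims), and the likelihood values are hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

# A rule rewrites one run of tokens into another run with the same characters.
Rule = Tuple[Tuple[str, ...], Tuple[str, ...]]


def apply_rule(tokens: Sequence[str], rule: Rule) -> List[str]:
    """Replace every occurrence of rule[0] in tokens with rule[1]."""
    before, after = rule
    if not before:
        raise ValueError("the left-hand side of a rule must not be empty")
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + len(before)]) == before:
            out.extend(after)
            i += len(before)
        else:
            out.append(tokens[i])
            i += 1
    return out


def avg_likelihood(tokens: Sequence[str], likelihood: Callable[[str], float]) -> float:
    return mean([likelihood(t) for t in tokens])


def select_rules_iteratively(
    tokens: Sequence[str],
    candidates: Sequence[Rule],
    likelihood: Callable[[str], float],
    fourth_threshold: float = 0.0,
) -> Tuple[List[Rule], List[str]]:
    """Pick the best rule, apply it, then re-evaluate the rest on the new result."""
    current = list(tokens)
    remaining = list(candidates)
    selected: List[Rule] = []
    while remaining:
        base = avg_likelihood(current, likelihood)
        gains = [
            (avg_likelihood(apply_rule(current, r), likelihood) - base, r)
            for r in remaining
        ]
        best_gain, best_rule = max(gains, key=lambda g: g[0])
        if best_gain <= fourth_threshold:
            break  # no remaining rule improves the average enough
        selected.append(best_rule)
        current = apply_rule(current, best_rule)
        remaining.remove(best_rule)
    return selected, current


# Hypothetical likelihoods and rules, for illustration only.
toy = {"ab": 0.2, "cd": 0.2, "abc": 0.9, "d": 0.5, "e": 0.3, "f": 0.3, "ef": 0.8}
lik = lambda s: toy.get(s, 0.0)
r1: Rule = (("ab", "cd"), ("abc", "d"))
r2: Rule = (("e", "f"), ("ef",))

selected, final_tokens = select_rules_iteratively(["ab", "cd", "e", "f"], [r2, r1], lik)
```

Because each round measures gains against the segmentation produced by the rules chosen so far, a rule that overlaps with or is made redundant by an earlier choice is scored on its actual remaining benefit rather than on its benefit against the initial analysis.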

As a further example, when the conversion rule selection means 6 selects conversion rules, it may use not the initial morphological analysis result but the morphological analysis result to which the already selected conversion rules have been applied, and may additionally select the next conversion rule from conversion rule candidates that the conversion rule generation means 5 generates based on that result. That is, the first selected conversion rule is applied to the morphological analysis result produced by the morphological analysis means 2, the conversion rule generation means 5 generates new conversion rule candidates based on that result, the next conversion rule is selected from among them, and this is repeated recursively. Generating and selecting conversion rules recursively in this way makes it possible to select them more suitably.
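The difference from selecting out of a fixed pool is that candidates are regenerated from the current segmentation in every round. The sketch below illustrates this with a deliberately simplified generator; it is an assumption-laden illustration, not the generation procedure of the embodiment. In particular, `generate_candidates` looks only at adjacent token pairs, tries only "no break" or a single break at another position, and uses only the first condition (average above `t2`); `max_rounds`, `fourth_threshold`, and the likelihood values are likewise hypothetical.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Rule = Tuple[Tuple[str, ...], Tuple[str, ...]]
Likelihood = Callable[[str], float]


def avg(tokens: Sequence[str], lik: Likelihood) -> float:
    return mean([lik(t) for t in tokens])


def apply_rule(tokens: Sequence[str], rule: Rule) -> List[str]:
    """Replace every occurrence of rule[0] in tokens with rule[1]."""
    before, after = rule
    out: List[str] = []
    i = 0
    while i < len(tokens):
        if tuple(tokens[i:i + len(before)]) == before:
            out.extend(after)
            i += len(before)
        else:
            out.append(tokens[i])
            i += 1
    return out


def generate_candidates(tokens: Sequence[str], lik: Likelihood,
                        t1: float, t2: float) -> List[Rule]:
    """Simplified generator: re-split adjacent pairs whose average is below t1."""
    rules: List[Rule] = []
    for i in range(len(tokens) - 1):
        first = (tokens[i], tokens[i + 1])
        if avg(first, lik) >= t1:
            continue  # not a first concatenated string
        text = "".join(first)
        # Alternatives: no break at all, or one break at each possible position.
        alternatives = [(text,)] + [(text[:k], text[k:]) for k in range(1, len(text))]
        for alt in alternatives:
            if alt != first and avg(alt, lik) > t2 and (first, alt) not in rules:
                rules.append((first, alt))
    return rules


def generate_and_select(tokens: Sequence[str], lik: Likelihood, t1: float, t2: float,
                        fourth_threshold: float = 0.0,
                        max_rounds: int = 10) -> Tuple[List[Rule], List[str]]:
    """Alternate between generating candidates and selecting the best one."""
    current = list(tokens)
    selected: List[Rule] = []
    for _ in range(max_rounds):
        candidates = generate_candidates(current, lik, t1, t2)
        if not candidates:
            break
        base = avg(current, lik)
        best = max(candidates, key=lambda r: avg(apply_rule(current, r), lik))
        if avg(apply_rule(current, best), lik) - base <= fourth_threshold:
            break
        selected.append(best)
        current = apply_rule(current, best)
    return selected, current


# Hypothetical likelihoods, for illustration only.
toy = {"ab": 0.2, "cd": 0.2, "abc": 0.9, "d": 0.5, "e": 0.2, "de": 0.9}
lik = lambda s: toy.get(s, 0.0)

selected, final_tokens = generate_and_select(["ab", "cd", "e"], lik, t1=0.4, t2=0.5)
```

In this toy run, the rule that merges "d" and "e" cannot be generated from the initial segmentation, because the token "d" exists only after the first rule has been applied. Regenerating candidates each round is what makes such follow-on rules reachable.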

FIG. 6 shows the main hardware configuration of the character processing device that constitutes the morphological analysis system according to this example.
That is, the character processing device of this example is a computer having hardware resources such as: a CPU 11 that performs various arithmetic operations; a RAM 12 that serves as the work area of the CPU 11; a ROM 13 that stores a basic control program; an HDD 14 that stores various data and the program for realizing the functions according to the present invention; an input/output I/F 15, which is an interface to devices such as a liquid crystal display that displays information to the user and a mouse and keyboard that accept input from the user; and a communication I/F 16, which is an interface for communicating with other devices.

The program according to the present invention is read from the HDD 14, loaded into the RAM 12, and executed by the CPU 11, whereby each functional means of the character processing device according to the present invention is realized by the computer. In this example, the morphological analysis means 2 constitutes the delimiting means, the word likelihood calculation means 4 constitutes the calculating means, the conversion rule generation means 5 constitutes the first determination means, the second determination means, and the generating means, the conversion rule selection means 6 constitutes the selection means, and the conversion rule application means 7 constitutes the applying means.

The program according to the present invention is provided to those who practice the invention, for example, by distributing an external storage medium such as a CD-ROM on which the program is stored, or by delivering it over a network. The functional means according to the present invention are not limited to being realized by a software configuration as in this example; each may instead be implemented as a dedicated hardware module. Nor are the functional means limited to being provided in a single computer as in this example; they may be distributed across a plurality of computers.

FIG. 1 is a functional block diagram of a morphological analysis system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating a morphological analysis result according to an embodiment of the present invention.
FIG. 3 is a diagram showing conversion rule generation processing according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating break candidates according to an embodiment of the present invention.
FIG. 5 is a diagram showing conversion rule selection processing according to an embodiment of the present invention.
FIG. 6 is a hardware configuration diagram of a character processing device according to an embodiment of the present invention.

Explanation of symbols

1: text storage means, 2: morphological analysis means, 3: morphological analysis result storage means, 4: word likelihood calculation means, 5: conversion rule generation means, 6: conversion rule selection means, 7: conversion rule application means

Claims

A character processing device comprising:
calculating means for calculating a word likelihood indicating how word-like a character string composed of one or more characters is;
first determination means for determining whether a concatenated character string obtained by concatenating one or more character strings is a first concatenated character string, in which a first average value, being the average of the word likelihoods of the character strings in the concatenated character string, falls below a predetermined first threshold;
second determination means for determining whether a concatenated character string obtained by changing the break positions in the first concatenated character string is a second concatenated character string that satisfies either one or both of a condition that a second average value, being the average of the word likelihoods of the character strings in that concatenated character string, exceeds a predetermined second threshold, and a condition that the value obtained by subtracting the first average value from the second average value exceeds a predetermined third threshold; and
generating means for generating a conversion rule that converts the break positions in the first concatenated character string into the break positions in the corresponding second concatenated character string.

The character processing device according to claim 1, wherein the conversion rule generated by the generating means includes one or more character strings located immediately before or immediately after the first concatenated character string in the sentence from which the first concatenated character string was extracted.

The character processing device according to claim 1, wherein the generating means generates a new conversion rule by generalizing a generated conversion rule.

The character processing device according to any one of claims 1 to 3, further comprising delimiting means for dividing a sentence into a plurality of character strings,
wherein the first determination means determines whether a concatenated character string, obtained by concatenating one or more of the character strings delimited by the delimiting means in the order in which they appear in the sentence, is a first concatenated character string.

The character processing device according to claim 1, wherein the first determination means determines whether a concatenated character string, obtained by concatenating, in the order in which they appear in the sentence, one or more character strings contained in a sentence divided into a plurality of character strings, is a first concatenated character string,
the character processing device further comprising:
applying means for applying a conversion rule to the whole sentence; and
selection means for selecting, from the plurality of conversion rules generated by the generating means, a conversion rule for which the value obtained by subtracting the average word likelihood of the character strings in the sentence under the break positions before application of the conversion rule from the average word likelihood of the character strings in the sentence under the break positions after application of the conversion rule exceeds a fourth threshold.

A program for causing a computer to realize:
a calculation function of calculating a word likelihood indicating how word-like a character string composed of one or more characters is;
a first determination function of determining whether a concatenated character string obtained by concatenating one or more character strings is a first concatenated character string, in which a first average value, being the average of the word likelihoods of the character strings in the concatenated character string, falls below a predetermined first threshold;
a second determination function of determining whether a concatenated character string obtained by changing the break positions in the first concatenated character string is a second concatenated character string that satisfies either one or both of a condition that a second average value, being the average of the word likelihoods of the character strings in that concatenated character string, exceeds a predetermined second threshold, and a condition that the value obtained by subtracting the first average value from the second average value exceeds a predetermined third threshold; and
a generation function of generating a conversion rule that converts the break positions in the first concatenated character string into the break positions in the corresponding second concatenated character string.