JP2011096245A - Kanji compound word dividing method and kanji compound word dividing device

Info

Publication number: JP2011096245A
Application number: JP2010222057A
Authority: JP
Inventors: Tomonori Goto; 智範後藤; Sadahiro Umeki; 定博梅木
Original assignee: Kanagawa University
Current assignee: Kanagawa University
Priority date: 2009-09-30
Filing date: 2010-09-30
Publication date: 2011-05-12
Anticipated expiration: 2030-09-30
Also published as: JP2014149869A; JP5750815B2; JP5648956B2

Abstract

PROBLEM TO BE SOLVED: To provide a kanji compound word dividing method and a kanji compound word dividing device that can correctly divide, with extremely high accuracy, a kanji compound word consisting of a continuous kanji string in a Japanese document, and that raise the reliability of each divided kanji string to a level suitable for practical use.

SOLUTION: The kanji compound word dividing method divides the target kanji compound word by referring to two dictionaries. The first is a Japanese dictionary that records basic words, which serve as the basis for dividing a kanji compound word consisting of a continuous kanji string, in association with the part of speech of each basic word. The second is a word division pattern dictionary that associates each division pattern, which indicates the sequence of character counts of the kanji strings obtained by dividing a kanji compound word, with the part-of-speech string patterns that occur for that division pattern, where a part-of-speech string pattern indicates the sequence of parts of speech of those kanji strings. The entries of the word division pattern dictionary are classified and recorded by the number of characters in the kanji compound word.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a kanji compound word dividing method and a kanji compound word dividing device capable of dividing, with extremely high accuracy, kanji compound words composed of continuous kanji strings in a Japanese document.

In Japanese documents, the main concepts and themes are often expressed by kanji idioms or by noun phrases that contain kanji idioms.

Kanji compound words are highly specialized and specific and carry high information value, so the need to divide them appropriately is growing. However, a kanji compound word composed of a continuous kanji string of several characters (for example, five characters) or more has a very complicated structure, and dividing it with high accuracy is not easy.

As a method for dividing kanji compound words, Patent Document 1, for example, discloses a compound word dividing device and a compound word dividing method. As word division processing, the method sets the number of characters in the kanji string portion of the input word and clears the frequency information array, the word division index array, and the division identifier array. Then, from the word-end frequency and word-head frequency at each character boundary, which are set on the basis of a dictionary holding the frequencies with which two-character kanji strings appear at the head and at the end of words, it sets a basic word division index (geometric mean or arithmetic mean) and an affix division index (frequency difference or frequency sum) for each character boundary. Using these indices, it divides the word into two-character kanji word bases and one-character affixes (prefixes or suffixes).

In Patent Document 1, division is performed on the basis of three sets of data: a frequency information array, a word division index array, and a division identifier array. First, the length of the target kanji idiom is set. Two indices are used: a character position, which indicates the position of each character counted from the beginning, and a character boundary position, which indicates a boundary between characters counted from the beginning. The leading character boundary position is set to 0. For each character boundary position, the word-end frequency of the preceding two-character kanji string and the word-head frequency of the following two-character kanji string are set in the frequency storage array f[I, n] (I = 1, 2; n = 0, ..., N). Starting from character position p (= 1) and shifting one character at a time, each two-character kanji string in the target kanji idiom (p = 1, ..., N−1) is collated with the dictionary, and the two corresponding frequencies are set. On the basis of these two frequencies, a basic word division index (w[1, i]) and an affix division index (w[2, i]) are set and stored in the word division index array.

Patent Document 1 proposes several formulas for these indices.
(a) Sum and difference
w[1, i] = f[1, i] + f[2, i]
w[2, i] = f[1, i] − f[2, i]
(b) Geometric mean, and frequency difference normalized by the frequency sum
w[1, i] = (f[2, i] · f[1, i]) / 2
w[2, i] = (f[1, i] + f[2, i]) / (f[1, i] − f[2, i])
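
The two index pairs can be sketched in Python as below. This is a minimal sketch, not the implementation of Patent Document 1: the function names are hypothetical, and the second function follows the label of (b) (a geometric mean, and a difference normalized by the sum) rather than the expressions printed under (b), which are a product divided by 2 and a sum divided by a difference. That reading, and the choice to return 0 when both frequencies are 0, are assumptions.

```python
import math


def indices_sum_difference(f1, f2):
    """Formula set (a): w1 is the sum of the two frequencies, w2 their difference."""
    return f1 + f2, f1 - f2


def indices_geometric_normalized(f1, f2):
    """Reading of (b) by its label: geometric mean and (f1 - f2) / (f1 + f2)."""
    w1 = math.sqrt(f1 * f2)
    total = f1 + f2
    # Assumption: the normalized difference is taken as 0 when both frequencies are 0.
    w2 = (f1 - f2) / total if total else 0.0
    return w1, w2


print(indices_sum_difference(145, 1930))
print(indices_geometric_normalized(4, 9))
```

Both functions take the two frequencies of a single character boundary, so applying them to a whole word means calling them once per boundary.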

In addition to these indices, Patent Document 1 proposes a pseudo-probability index and a product of probabilities as basic word division indices, and the normalized differences of these as affix division indices.

In Patent Document 1, division boundaries are determined from the magnitudes of the two indices described above, the basic word division index (Cut-W) and the affix division index (Cut-P). First, the target kanji string is divided into two partial kanji strings at the i-th boundary, where the basic word division index is largest. Each partial kanji string is then divided in two again, and this is repeated recursively until every partial kanji string is four characters or shorter. Next, each partial kanji string of three or more characters is divided according to the affix division index: when the index is positive, the string is divided into a prefix and a basic word; when it is negative, into a basic word and a suffix.

Patent Document 1 explains the division process using 「対共産圏輸出統制委員会」 (a committee for controlling exports to the Communist bloc) as an example. The two kinds of occurrence frequency for two-character kanji strings are calculated from one year of newspaper articles (120 MB). The two-character kanji strings that make up the idiom, with their (word-head frequency, word-end frequency), are 「委員」 (1930, 2972), 「員会」 (3, 7594), 「共産」 (1735, 217), 「産圏」 (0, 15), 「制委」 (0, 1), 「対共」 (24, 0), 「統制」 (99, 145), and 「輸出」 (1529, 900). When formula (b) above is used as the basic word division index (w[1, i]) with these frequencies, the result is 「対／共産圏／輸出／統制／委／員会」 (1735, 151.4, 28.5, 529.0, 1.7). Here "／" indicates a division boundary, and the numbers in parentheses are the word division index values. The affix division boundaries and their values are 「対／共産／圏／輸出／統制／委／員会」 (+1, −1, −0.98, −0.80, +0.86, +0.5, −1). The string is first divided at the boundary after the eighth character, which has the maximum value 529.0, giving the two partial kanji strings 「対共産圏輸出統制」 and 「委員会」. The former has four or more characters and is further divided into 「対共産圏輸出」 and 「統制」; the latter has three characters and is not divided further. 「対共産圏輸出」 is then divided into 「対共産圏」 and 「輸出」. Finally, 「対共産圏」 and 「委員会」 are divided on the basis of the affix division index: 「対」, which has a positive value, is identified as a prefix, and 「圏」 and 「会」, which have negative values, are identified as suffixes.
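
The recursive division in this example can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure of Patent Document 1 itself: the boundary index is taken to be the geometric mean of the word-end frequency of the two characters before a boundary and the word-head frequency of the two characters after it, two-character strings missing from the table count as frequency 0, and the names `FREQ`, `boundary_index`, and `split` are hypothetical. It is not meant to reproduce every index value listed in the paragraph, and it omits the affix division stage.

```python
import math

# (word-head frequency, word-end frequency) of each two-character string, as listed in the text.
FREQ = {
    "委員": (1930, 2972), "員会": (3, 7594), "共産": (1735, 217), "産圏": (0, 15),
    "制委": (0, 1), "対共": (24, 0), "統制": (99, 145), "輸出": (1529, 900),
}


def boundary_index(word, k):
    """Geometric mean of the end frequency before boundary k and the head frequency after it."""
    before = word[k - 2:k] if k >= 2 else ""
    after = word[k:k + 2]
    end_freq = FREQ.get(before, (0, 0))[1]   # assumption: unlisted strings count as 0
    head_freq = FREQ.get(after, (0, 0))[0]
    return math.sqrt(end_freq * head_freq)


def split(word, scores, start=0, end=None, max_len=4):
    """Cut word[start:end] at its highest-scoring boundary until parts have <= max_len characters."""
    end = len(word) if end is None else end
    if end - start <= max_len:
        return [word[start:end]]
    cut = max(range(start + 1, end), key=lambda k: scores[k])
    return split(word, scores, start, cut, max_len) + split(word, scores, cut, end, max_len)


WORD = "対共産圏輸出統制委員会"
scores = {k: boundary_index(WORD, k) for k in range(1, len(WORD))}
print(split(WORD, scores))
```

Under these assumptions the first cut falls after the eighth character and the parts come out in the order described above: 「対共産圏」, 「輸出」, 「統制」, 「委員会」.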

Prior research on the division of kanji compound words outside the patent literature includes, for example, a method that focuses on dependency relations (Non-Patent Document 1), a method based on connection probabilities between word bases (Non-Patent Document 2), a method that uses semantic co-occurrence probabilities between nouns (Non-Patent Document 3), and a method that uses context information (Non-Patent Document 4).

Method using dependency analysis (Non-Patent Document 1)
Non-Patent Document 1 proposes an automatic division method that focuses on the dependency relations between the word bases of a kanji compound word. It adopts four principles: a dependency runs from an earlier word to a later word; each word depends on only one word; a single word may receive dependencies from several words; and dependencies must not cross. Words are classified into three parts of speech (numerals, affixes, and general words), and dependency rules are defined for each part of speech.

In Non-Patent Document 1, division consists of four steps. Step 1 performs morphological analysis, generates all division patterns, and calculates the number of basic words in each. Step 2 obtains the number of dependencies in each division pattern. Step 3 performs dependency analysis and obtains, for each pattern, the difference between the number of word bases and the number obtained in Step 2. Step 4 takes the division pattern with the smallest difference as the solution of automatic division. If Step 4 cannot determine a unique solution, the selection is made by word usage frequency.

In Non-Patent Document 1, for example, 「畜産物価格安定法」 (Livestock Products Price Stabilization Law) is divided as follows. Five division patterns of the word, patterns 1 to 5, are generated. The number of basic words is 4 for pattern 1 and 5 for each of patterns 2 to 5 (Step 1). The number of dependencies is 1 for pattern 1, 2 for pattern 2, 1 for pattern 3, 3 for pattern 4, and 2 for pattern 5 (Step 2). The difference in the number of word bases is 4 − 1 = 3 for pattern 1, 5 − 2 = 3 for pattern 2, 5 − 0 = 5 for pattern 3, 5 − 3 = 2 for pattern 4, and 5 − 2 = 3 for pattern 5. The minimum value in Step 3 is 3, and the division solution for 「畜産物価格安定法」 is obtained as a result.

Method based on connection probabilities between word bases (Non-Patent Document 2)
Non-Patent Document 2 regards a kanji compound word as the output of a Markov model, represents it with a state transition model, and proposes an automatic division method that uses the transition probabilities between the basic words that make up the word. It expresses a kanji idiom in the form (prefix) basic word (suffix), obtains the transition probability from the initial state to the final state, and takes the pattern that maximizes it as the solution. The transition probabilities are estimated with Bayesian posterior probability estimation, which obtains initial probabilities and the probabilities at each iteration; a "state transition probability estimation algorithm" is applied to training data to calculate the transition probabilities between the basic words in that data.

In Non-Patent Document 2, idiom division proceeds as follows: a transition diagram of the short-unit model of the kanji compound word is generated (Step 1), each state transition probability is obtained (Step 2), and the candidate with the largest state transition probability is taken as the solution (Step 3).

In Non-Patent Document 2, for example, 「太陽熱発電」 (solar thermal power generation) is divided as follows. The transition probability is 0.0175 for division solution 1, 0.056 for division solution 2, 0.036 for division solution 3, and 0.012 for division solution 4. Division solutions 2 and 3 have the same division positions, but 「熱」 is treated as a suffix in solution 2 and as a prefix in solution 3, so two division patterns share the same division positions. Non-Patent Document 2 reports an evaluation experiment using this method on 2,500 kanji idioms of 3 to 10 characters.
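
Applying Step 3 to the four probabilities listed above amounts to taking the candidate with the largest value. The short sketch below shows only that selection; the variable names are illustrative assumptions, and the probabilities are the ones given in the text.

```python
# Transition probabilities of the four division solutions, as listed in the text.
transition_probability = {1: 0.0175, 2: 0.056, 3: 0.036, 4: 0.012}

# Step 3: the solution number with the largest transition probability.
best_solution = max(transition_probability, key=transition_probability.get)
print(best_solution)
```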

Method based on semantic co-occurrence information between nouns (Non-Patent Document 3)
Non-Patent Document 3 classifies the basic words that make up kanji compound words into semantic categories, proposes a division method that uses the co-occurrence frequencies between categories, and reports division experiments.

In Non-Patent Document 3, division is performed by the following procedure. As preparation, the kanji compound words in the training data are manually divided into basic words, and each basic word is assigned a class from a class system defined in advance. The target kanji compound word is then collated with the basic words and divided (Step 1); all division patterns are obtained in this step. Next, the basic words are collated with a semantic classification dictionary, class numbers are assigned, and the possible class strings are obtained (Step 2). Then all dependency class strings are obtained on the basis of the dependency rules between classes (Step 3). Finally, a priority is calculated for each dependency pattern using the proposed priority calculation method, and the dependency pattern with the highest priority is taken as the solution (Step 4).

In Non-Patent Document 3, for example, 「歩行者通路」 (pedestrian passage) is divided as follows. In Step 1, the target kanji compound word is collated with the basic words, producing two division candidates. In Step 2, the basic words are collated with the semantic classification dictionary, class numbers are assigned, and the possible class strings are obtained: 「歩行［133］者［110:120］通路［147］」 and 「歩［119:133:145］行者［124］通路［147］」, where ":" indicates that several classes exist. On the basis of the dependency rules between classes, a total of 10 dependency class strings are obtained: [[133:110], 147], [133], [110:147], ..., [[119:124], 147], ..., [145, [124:147]] (Step 3). When the priority of each class string is calculated, [[133:110], 147] has the maximum priority, 1.36, and is the class string of the solution, from which the division solution of 「歩行者通路」 is obtained (Step 4). Non-Patent Document 3 reports an evaluation experiment using this method on 3,008 kanji idioms of four or more characters.

Method using context information (Non-Patent Document 4)
Non-Patent Document 4 proposes three methods, each with its own formula, based on co-occurrence information between basic words, and evaluates them experimentally: (a) the co-occurrence ratio, which is the modification ratio between basic words within an idiom; (b) mutual information, an index calculated from the ratio of co-occurrence; and (c) priority, an index that takes into account both the mutual information of (b) and the frequency of nouns in the text.

In Non-Patent Document 4, division is performed by the following procedure. First, the target kanji compound word is collated with the basic words and divided (Step 1); all division patterns are obtained at this stage. Next, the index described above is calculated for each division pattern (Step 2). The pattern with the largest value of the index is the division solution.

In Non-Patent Document 4, for example, for 「砂糖類価格安定」 (sugar price stabilization), index (a), the co-occurrence ratio, takes the values 0, 0, 0, ..., 0.10, ..., 0.25 over the division candidates, and the candidate with the largest value is the division solution. Non-Patent Document 4 reports an evaluation experiment using these methods on 100 kanji idioms each of 5, 7, and 10 characters.

Patent Document 1: JP 2002-259370 A

Non-Patent Document 1: Masahiro Miyazaki, "Automatic segmentation of compound words using dependency analysis", Transactions of the Information Processing Society of Japan, Vol. 25, No. 6, 970-979 (1984)
Non-Patent Document 2: Takeda, Fujisaki, "Automatic division of kanji compound words by statistical methods", Transactions of the Information Processing Society of Japan, Vol. 28, No. 9, 952-961 (1987)
Non-Patent Document 3: Yoshiyuki Kobayashi, Takenobu Tokunaga, Hozumi Tanaka, "Analysis of compound nouns using semantic co-occurrence information between nouns", Journal of Natural Language Processing, Vol. 3, No. 1, 29-43 (1996)
Non-Patent Document 4: Dongli Han, Koichi Kato, Teiji Furugori, "Division of multi-character compound words using context information", IEICE Technical Report, Vol. 101, No. 40, 29-34 (2001)

The quantitative indices used to divide the target idiom differ among Patent Document 1 and Non-Patent Documents 1 to 4, but all of them are calculated from the occurrence frequencies of basic words in a large set of kanji idioms. These documents share a common feature: although long kanji idioms actually have a syntactic structure, information about the structure of the idiom, that is, the composition pattern of its basic words, is not taken into account at all.

However, Patent Document 1 and Non-Patent Documents 1 to 4 share a common problem: dividing a kanji compound word generally requires a large amount of computation to generate division candidates, and if the target idiom contains a basic word that is not registered in the dictionary, the quantitative index cannot be calculated and division is theoretically impossible. In addition, for Non-Patent Documents 2 to 4, the inventors of the present application carried out evaluation experiments; the number of target idioms used in performance evaluation is only about 300 to 3,000, and division accuracy drops sharply as idioms become longer.

From the above, when the target is a large volume of documents, such as academic or patent databases or web documents on the Internet, it is easy to infer that the division accuracy obtained in the performance evaluations of Patent Document 1 and Non-Patent Documents 1 to 4 would fall considerably, and these methods are far from a level at which they could be put to practical use.

An object of the present invention is to provide a kanji compound word dividing method and a kanji compound word dividing device that can correctly divide, with extremely high accuracy, a kanji compound word composed of a continuous kanji string in a Japanese document, and in which the reliability of each divided kanji string is raised to a level suitable for practical use.

As a result of intensive study aimed at solving the above problem, the inventors of the present invention found that the above object is achieved by, among others, a kanji compound word dividing method that divides the target kanji compound word by referring to (i) a Japanese dictionary that records basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, in association with the part of speech of each basic word, and (ii) a word division pattern dictionary that associates each division pattern, indicating the sequence of character counts of the kanji strings obtained by dividing a kanji compound word, with the part-of-speech string patterns that occur for that division pattern, each indicating the sequence of parts of speech of those kanji strings, and records them classified by the number of characters of the kanji compound word. This finding led to the present invention.

Specifically, the kanji compound word dividing method of the present invention divides a kanji compound word by referring to a Japanese dictionary and a word division pattern dictionary. The Japanese dictionary associates each basic word, which serves as the basis for dividing a kanji compound word composed of a continuous kanji string, with the part of speech of that basic word, and records both the basic word and the part of speech classified by the number of characters of the basic word. The word division pattern dictionary associates each division pattern, which indicates the sequence of character counts of the kanji strings obtained by dividing the kanji compound word, with those part-of-speech string patterns that occur for that division pattern, where a part-of-speech string pattern represents the sequence of parts of speech of the kanji strings obtained by the division; it records both the division patterns and the part-of-speech string patterns classified by the number of characters of the kanji compound word.

Alternatively, the kanji compound word dividing method of the present invention divides a kanji compound word by referring to a Japanese dictionary that associates each basic word, which serves as the basis for dividing a kanji compound word composed of a continuous kanji string, with the part of speech of that basic word and records both, without requiring classification by the number of characters of the basic word, together with a word division pattern dictionary that associates each division pattern, which indicates the sequence of character counts of the kanji strings obtained by dividing the kanji compound word, with those part-of-speech string patterns that occur for that division pattern, and records both classified by the number of characters of the kanji compound word.
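
One possible in-memory layout for the two dictionaries is sketched below. It is only an illustration: every entry, every part-of-speech label, and the names `japanese_dictionary`, `word_division_patterns`, `part_of_speech`, and `patterns_for` are hypothetical and are not taken from the patent.

```python
# All entries below are hypothetical placeholders, not data from the patent.

# Japanese dictionary: basic word -> part of speech, grouped by character count.
japanese_dictionary = {
    1: {"法": "suffix", "対": "prefix"},
    2: {"価格": "noun", "安定": "noun", "輸出": "noun"},
    3: {"畜産物": "noun"},
}

# Word division pattern dictionary: compound length -> list of entries, each pairing
# a division pattern with the part-of-speech string patterns recorded for it.
word_division_patterns = {
    5: [
        {"pattern": (2, 2, 1), "pos_patterns": [("noun", "noun", "suffix")]},
        {"pattern": (1, 2, 2), "pos_patterns": [("prefix", "noun", "noun")]},
    ],
}


def part_of_speech(word):
    """Look up a basic word in the group for its character count; None if absent."""
    return japanese_dictionary.get(len(word), {}).get(word)


def patterns_for(compound):
    """Division patterns recorded for compounds of this length."""
    return [entry["pattern"] for entry in word_division_patterns.get(len(compound), [])]


print(part_of_speech("価格"), patterns_for("価格安定法"))
```

Grouping by character count means a lookup only touches the entries of the relevant length, which is one way to read the "classified by the number of characters" wording.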

In the kanji compound word dividing method of the present invention, a configuration including the following steps can be adopted. In a first step, starting from the first kanji of the kanji compound word or from the kanji immediately after the most recently determined division position, kanji strings of each extraction length are extracted in turn, following a preset order of extraction lengths, and each extracted kanji string is collated with the basic words by referring to the Japanese dictionary. In a second step, when a basic word matching a kanji string extracted in the first step is found, it is checked whether any kanji follows the matched kanji string in the compound word; if so, the position between the last character of the matched kanji string and the kanji immediately after it is determined as a division position of the kanji compound word, and processing returns to the first step. In a third step, when no basic word is found that matches any of the kanji strings extracted for all of the preset extraction lengths, the single extracted kanji is defined as a one-character unknown word that does not exist in the Japanese dictionary; if any kanji follows it, the position between that kanji and the kanji immediately after it is determined as a division position, and processing returns to the first step.

In the kanji compound word dividing method of the present invention, a configuration can also be adopted in which the preset order of extraction lengths is the descending order of the character counts of the basic words recorded in the Japanese dictionary (hereinafter sometimes referred to as "Method 1").
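
The first to third steps with the Method 1 ordering can be sketched as a greedy longest-match loop. This is a minimal sketch under assumptions: the dictionary is a plain mapping from basic word to part of speech, its entries are hypothetical, and the function name `segment_longest_first` and the `(string, is_known)` output format are illustrative choices.

```python
def segment_longest_first(compound, dictionary):
    """Greedy left-to-right division, trying the longest extraction length first.

    Returns a list of (kanji_string, is_known) pairs; a character that starts
    no dictionary word becomes a one-character unknown word.
    """
    # Extraction lengths: the distinct basic-word lengths, longest first.
    lengths = sorted({len(word) for word in dictionary}, reverse=True)
    segments = []
    pos = 0
    while pos < len(compound):
        for n in lengths:
            candidate = compound[pos:pos + n]
            if len(candidate) == n and candidate in dictionary:
                segments.append((candidate, True))   # second step: match found
                pos += n
                break
        else:
            # Third step: no extraction length matched, so mark one unknown character.
            segments.append((compound[pos], False))
            pos += 1
    return segments


# Hypothetical dictionary entries (basic word -> part of speech).
dictionary = {"畜産": "noun", "畜産物": "noun", "価格": "noun", "安定": "noun", "法": "suffix"}
print(segment_longest_first("畜産物価格安定法", dictionary))
```

Because the longest length is tried first, 「畜産物」 is preferred over 「畜産」 in this toy dictionary.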

Furthermore, the kanji compound word dividing method of the present invention can also adopt a configuration that further includes a fourth step of concatenating two or more of the one-character unknown words, and a configuration that further includes a fifth step of concatenating adjacent kanji strings that include a one-character unknown word.
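
The fourth step can be sketched as a single pass that joins runs of adjacent unknown characters. The sketch assumes a segment list of `(kanji_string, is_known)` pairs, which is an illustrative format rather than one defined in the text; the fifth step is not covered because the text does not specify which adjacent strings are joined.

```python
def merge_unknown_runs(segments):
    """Concatenate consecutive unknown segments into a single unknown segment."""
    merged = []
    for text, known in segments:
        if not known and merged and not merged[-1][1]:
            # The previous segment is also unknown: extend it.
            merged[-1] = (merged[-1][0] + text, False)
        else:
            merged.append((text, known))
    return merged


# Hypothetical input: two one-character unknown words between two known words.
segments = [("価格", True), ("甲", False), ("乙", False), ("安定", True)]
print(merge_unknown_runs(segments))
```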

The kanji compound word dividing method of the present invention may also adopt a configuration comprising: a sixth step of provisionally dividing the kanji compound word into a plurality of kanji strings, with reference to the word division pattern dictionary, taking the division patterns one at a time in descending order of frequency of occurrence, and then collating all of the provisionally divided kanji strings against the basic words with reference to the Japanese dictionary; a seventh step of, when matching basic words are found for all of the kanji strings provisionally divided in the sixth step, determining the break positions for dividing the kanji compound word according to that division pattern; and an eighth step of, when no matching basic word is found for at least one of the kanji strings provisionally divided in the sixth step, designating each kanji string not present in the Japanese dictionary as an unknown word and checking whether provisional division has been performed for all division patterns, returning to the sixth step if it has not and, if it has, determining the break positions according to the division pattern for which the number of unknown words is smallest and the frequency of occurrence is highest (hereinafter sometimes referred to as "Method 2").
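The pattern-driven procedure of Method 2 can be sketched as follows. The function name `split_method2`, the set-based dictionary, and the representation of division patterns as tuples of character counts already sorted by descending frequency are assumptions for illustration. The sketch also assumes that ties in the number of unknown words are broken in favor of the more frequent pattern, and that a compound with no applicable pattern is returned undivided; the text does not specify the latter case.

```python
def split_method2(compound, dictionary, patterns):
    """Try division patterns in descending frequency order.

    Returns (segments, unknown_count). `patterns` is a hypothetical list
    of tuples such as (2, 2, 2), most frequent first.
    """
    best = None  # (unknown_count, segments) of the best pattern so far
    for pattern in patterns:
        if sum(pattern) != len(compound):
            continue  # pattern belongs to a different compound length
        # Sixth step: provisional division according to the pattern.
        segments, pos = [], 0
        for n in pattern:
            segments.append(compound[pos:pos + n])
            pos += n
        unknown = sum(1 for s in segments if s not in dictionary)
        if unknown == 0:
            return segments, 0  # seventh step: every piece is a basic word
        # Strict '<' keeps the earlier (more frequent) pattern on ties.
        if best is None or unknown < best[0]:
            best = (unknown, segments)
    if best is None:
        return [compound], 1  # assumption: no pattern applies
    return best[1], best[0]  # eighth step
```

Because the loop returns as soon as a pattern with no unknown words is found, frequent patterns such as 2·2·2 are checked first and rare patterns are only reached when the frequent ones fail.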

The kanji compound word dividing method of the present invention may also adopt a configuration comprising: a ninth step of taking, as the extraction start position (the position of the first character of the kanji string to be extracted from the kanji compound word), either the beginning of the kanji compound word or the position of the most recent extraction start character as changed from the beginning of the kanji compound word, and extracting from the kanji compound word a kanji string of the set extraction length starting at that extraction start position; a tenth step of determining whether a changed flag is assigned to any kanji of the kanji string extracted in the ninth step and, if so, changing the extraction start character to the one located one character after the extraction start position and returning to the ninth step; an eleventh step of collating the kanji string extracted in the ninth step against the basic words with reference to the Japanese dictionary; a twelfth step of, when a basic word matching the kanji string extracted in the ninth step is found in the eleventh step, determining the position between the end of the matched kanji string and the kanji immediately after it as a break position for dividing the kanji compound word, changing the flag assigned to each kanji constituting the matched kanji string, and then checking whether at least as many kanji as the extraction length follow the matched kanji string in the kanji compound word, wherein, if fewer kanji than the extraction length follow, the extraction length is reduced by one and the extraction start character is reset to the beginning of the kanji compound word before returning to the ninth step, and, if at least as many kanji as the extraction length follow, the extraction start character is changed to the one located the extraction length after the extraction start position before returning to the ninth step; a thirteenth step of, when no basic word matching the kanji string extracted in the ninth step is found in the eleventh step, checking whether any kanji follows the unmatched kanji string in the kanji compound word, wherein, if no kanji follows, the extraction length is reduced by one and the extraction start character is reset to the beginning of the kanji compound word before returning to the ninth step, and, if a kanji follows, the extraction start character is changed to the one located one character after the extraction start position before returning to the ninth step; and a fourteenth step of, when changed flags are assigned to all the kanji constituting the kanji compound word or when the set extraction length reaches 0, finalizing the break positions determined in the twelfth step as the break positions for dividing the kanji compound word (hereinafter sometimes referred to as "Method 3").
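The flag-based scan of Method 3 can be sketched as below. This is an interpretation under stated assumptions, not the patent's own code: the function name `split_method3`, the set-based dictionary, and the initial extraction length `max_len=4` are hypothetical. The sketch records matched spans instead of break positions, treats each unflagged kanji as a one-character unknown word when building the result, and shortens the extraction length when a window would run past the end of the compound, a case the text does not spell out.

```python
def split_method3(compound, dictionary, max_len=4):
    """Scan with a shrinking window; returns (segment, is_known) pairs."""
    n = min(max_len, len(compound))      # current extraction length
    flagged = [False] * len(compound)    # True once a kanji belongs to a match
    spans = {}                           # start index -> end index of a match
    start = 0                            # extraction start position
    while n > 0 and not all(flagged):
        end = start + n
        if end > len(compound):
            # Assumption: window overruns the compound, so shorten and restart.
            n -= 1
            start = 0
        elif any(flagged[start:end]):
            start += 1                   # tenth step: skip consumed kanji
        elif compound[start:end] in dictionary:
            spans[start] = end           # twelfth step: record the match
            for i in range(start, end):
                flagged[i] = True
            if len(compound) - end < n:
                n -= 1
                start = 0
            else:
                start = end
        elif end == len(compound):
            n -= 1                       # thirteenth step: nothing follows
            start = 0
        else:
            start += 1                   # thirteenth step: slide by one kanji
    # Fourteenth step: assemble matches; unflagged kanji become unknown words.
    segments, i = [], 0
    while i < len(compound):
        if i in spans:
            segments.append((compound[i:spans[i]], True))
            i = spans[i]
        else:
            segments.append((compound[i], False))
            i += 1
    return segments
```

Unlike the left-to-right greedy scan of Method 1, this loop looks for the longest basic words anywhere in the compound before falling back to shorter ones.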

Further, in the kanji compound word dividing method of the present invention, the twelfth step may additionally, when a basic word matching the kanji string extracted in the ninth step is found in the eleventh step, assign a part of speech to the matched kanji string in accordance with the Japanese dictionary before determining the break position and changing the flags assigned to the kanji constituting the matched kanji string, and the fourteenth step may additionally designate each kanji to which no changed flag is assigned as a one-character unknown word.

The first kanji compound word dividing device of the present invention comprises: a Japanese dictionary in which basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, are associated with their parts of speech, and in which both are recorded classified by the character count of the basic word; a word division pattern dictionary in which division patterns, each indicating the sequence of character counts of the kanji strings formed by dividing a kanji compound word, are associated with those part-of-speech sequence patterns (each representing the sequence of parts of speech of the kanji strings formed by dividing a kanji compound word) that occur for the division pattern, and in which both are recorded classified by the character count of the kanji compound word; extraction and collation means for sequentially extracting kanji strings of the extraction length, following a preset order of extraction lengths, starting from the first kanji of the kanji compound word or from the kanji immediately after the most recently determined break position, and collating each extracted kanji string against the basic words with reference to the Japanese dictionary; break determining means for, when a basic word matching the kanji string extracted by the extraction and collation means is found, assigning a part of speech to the matched kanji string in accordance with the Japanese dictionary and, if a kanji follows the matched kanji string, determining the position between the end of the matched kanji string and the kanji immediately after it as a break position for dividing the kanji compound word; and unknown word break determining means for, when no basic word matching any of the kanji strings extracted for all the preset extraction lengths by the extraction and collation means is found, designating the single extracted kanji as a one-character unknown word not present in the Japanese dictionary and, if a kanji follows that single kanji, determining the position between that kanji and the kanji immediately after it as a break position for dividing the kanji compound word.

The first kanji compound word dividing device of the present invention may further include unknown word concatenating means for concatenating two or more of the one-character unknown words, may further include adjacent word concatenating means for concatenating adjacent kanji strings that include a one-character unknown word, and may further include break position finalizing means for finalizing the determined break positions, with reference to the word division pattern dictionary, as the break positions for dividing the kanji compound word.

The second kanji compound word dividing device of the present invention comprises: a Japanese dictionary in which basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, are associated with their parts of speech, and in which both are recorded classified by the character count of the basic word; a word division pattern dictionary in which division patterns, each indicating the sequence of character counts of the kanji strings formed by dividing a kanji compound word, are associated with those part-of-speech sequence patterns (each representing the sequence of parts of speech of the kanji strings formed by dividing a kanji compound word) that occur for the division pattern, and in which both are recorded classified by the character count of the kanji compound word; provisional division and collation means for provisionally dividing the kanji compound word into a plurality of kanji strings, with reference to the word division pattern dictionary, taking the division patterns one at a time in descending order of frequency of occurrence, and then collating all of the provisionally divided kanji strings against the basic words with reference to the Japanese dictionary; division determining means for, when matching basic words are found for all of the kanji strings provisionally divided by the provisional division and collation means, assigning parts of speech to all of the matched kanji strings in accordance with the Japanese dictionary and determining the division positions for dividing the kanji compound word according to that division pattern; and unknown word division determining means for, when no matching basic word is found for at least one of the kanji strings provisionally divided by the provisional division and collation means, designating each kanji string not present in the Japanese dictionary as an unknown word and, when this is the case for every division pattern, determining the division positions for dividing the kanji compound word according to the division pattern for which the number of unknown words is smallest and the frequency of occurrence is highest.

The second kanji compound word dividing device of the present invention may further include division position finalizing means for finalizing the determined division positions, with reference to the word division pattern dictionary, as the division positions for dividing the kanji compound word.

The third kanji compound word dividing device of the present invention comprises: a Japanese dictionary in which basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, are associated with their parts of speech, and in which both are recorded classified by the character count of the basic word; kanji string extraction processing means for taking, as the extraction start position (the position of the first character of the kanji string to be extracted from the kanji compound word), either the beginning of the kanji compound word or the position of the most recent extraction start character as changed from the beginning of the kanji compound word, and extracting from the kanji compound word a kanji string of the set extraction length starting at that extraction start position; flag assignment determination processing means for determining whether a changed flag is assigned to any kanji of the kanji string extracted by the kanji string extraction processing means and, if so, changing the extraction start character to the one located one character after the extraction start position and returning to the kanji string extraction processing means; basic word collation processing means for collating the kanji string extracted by the kanji string extraction processing means against the basic words with reference to the Japanese dictionary; first collation result processing means for, when the basic word collation processing means finds a basic word matching the extracted kanji string, assigning a part of speech to the matched kanji string in accordance with the Japanese dictionary, determining the position between the end of the matched kanji string and the kanji immediately after it as a break position for dividing the kanji compound word, changing the flag assigned to each kanji constituting the matched kanji string, and then checking whether at least as many kanji as the extraction length follow the matched kanji string in the kanji compound word, wherein, if fewer kanji than the extraction length follow, the extraction length is reduced by one and the extraction start character is reset to the beginning of the kanji compound word before returning to the kanji string extraction processing means, and, if at least as many kanji as the extraction length follow, the extraction start character is changed to the one located the extraction length after the extraction start position before returning to the kanji string extraction processing means; second collation result processing means for, when the basic word collation processing means finds no basic word matching the extracted kanji string, checking whether any kanji follows the unmatched kanji string in the kanji compound word, wherein, if no kanji follows, the extraction length is reduced by one and the extraction start character is reset to the beginning of the kanji compound word before returning to the kanji string extraction processing means, and, if a kanji follows, the extraction start character is changed to the one located one character after the extraction start position before returning to the kanji string extraction processing means; and break position finalization processing means for, when changed flags are assigned to all the kanji constituting the kanji compound word or when the set extraction length reaches 0, finalizing the break positions determined by the first collation result processing means as the break positions for dividing the kanji compound word and designating each kanji to which no changed flag is assigned as a one-character unknown word.

By using the present invention, kanji compound words contained in Japanese documents can be divided correctly with extremely high accuracy, and the reliability of the resulting words is very high. This has the advantage of improving, relative to conventional techniques, the accuracy not only of morphological analysis and syntactic analysis but also of Web search engines, speech recognition, character recognition, kana-kanji conversion, and the like.

The present invention also has the advantage of increasing, relative to conventional techniques, the speed of dividing kanji compound words contained in Japanese documents, of morphological analysis, and of syntactic analysis.

The present invention is therefore, unlike conventional techniques, suitable for practical use.

A conceptual diagram illustrating one embodiment of the basic configuration of the kanji compound word dividing device of the present invention.

A conceptual diagram illustrating another embodiment of the basic configuration of the kanji compound word dividing device of the present invention.

A conceptual diagram illustrating another embodiment of the basic configuration of the kanji compound word dividing device of the present invention.

A flow diagram illustrating one example of the process of dividing a kanji compound word using the kanji compound word dividing method of the present invention.

A flow diagram illustrating another example of the process of dividing a kanji compound word using the kanji compound word dividing method of the present invention.

A flow diagram illustrating another example of the process of dividing a kanji compound word using the kanji compound word dividing method of the present invention.

A diagram showing the procedure of the experiment evaluating the division accuracy of Method 1, Method 2, and Method 3 of the kanji compound word dividing method of the present invention.

A graph showing the probability of success when kanji compound words are divided using Method 1, Method 2, and Method 3 of the kanji compound word dividing method of the present invention.

The present invention is described in further detail below. The kanji compound word dividing device of the present invention divides a kanji compound word composed of a continuous kanji string into words with reference to a Japanese dictionary and a word division pattern dictionary.

The first kanji compound word dividing device 10 of the present invention comprises a Japanese dictionary 1, a word division pattern dictionary 2, extraction and collation means 11, break determining means 12, unknown word determining means 13, unknown word concatenating means 14, adjacent word concatenating means 15, and break position finalizing means 16 (FIG. 1).

In the Japanese dictionary 1, each basic word is recorded in association with its part of speech.

A basic word is the unit that serves as the basis for dividing a kanji compound word. It is sometimes called a word base and carries an independent meaning on its own. For example, for the kanji compound word 「技術文献」 (technical literature), 「技術」 (technology) and 「文献」 (literature) are the basic words. Most basic words are used on their own in sentences, but some are used only as components of compounds, such as prefixes (for example, the 「本」 in 「本手法」) and suffixes (for example, the 「的」 in 「数量的」). The basic words used are obtained, for example, by extracting words of one to four characters from Kojien (広辞苑), Sanseido Kokugo Jiten (三省堂国語辞典), the Kadokawa Thesaurus (角川類義語辞典), the EB Dictionary of Scientific and Technical Terms (ＥＢ科学技術用語大辞典), the Dictionary of Electrical and Electronic Information Terms (電気・電子情報用語辞典), the Dictionary of Computer Terms (コンピュータ用語辞典), and the like, removing duplicates, and further excluding proper nouns, Buddhist terms, classical idioms, chemical substance names, and the like.

Examples of parts of speech include the following nine: noun (名詞), verb (動詞), sa-hen noun (サ変名詞; hereinafter "サ変"), adjectival-verb stem (形容動詞語幹; hereinafter "形動"), adjective stem (形容詞語幹; hereinafter "形容"), prefix (接頭辞; hereinafter "接頭"), suffix (接尾辞; hereinafter "接尾"), adverb (副詞), and numeral (数詞). Parts of speech other than these nine may be added as appropriate. When a basic word has more than one part of speech, they are listed joined by "−" (for example, 「下」 is "接尾−接頭", that is, suffix and prefix).

The Japanese dictionary 1 may, for example, record each basic word in association with its character count, its number of parts of speech, and its parts of speech. Specifically, 「記入」 (entry) is recorded as 記入・２・１・サ変, 「材料」 (material) as 材料・２・１・名詞, 「直交」 (orthogonal) as 直交・２・１・サ変, and 「下」 (under) as 下・１・２・接尾−接頭. The four fields may be arranged in the order basic word, character count, number of parts of speech, parts of speech, or in any other order.
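As a rough illustration of this record layout, the sketch below parses one record of the form "word・character count・number of parts of speech・parts of speech". The names `Entry` and `parse_entry`, the fixed field order, and the consistency check that raises `ValueError` are assumptions added for the example; the text only gives the sample records.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    word: str
    length: int
    pos: tuple  # one or more part-of-speech labels


FIELD_SEP = "\u30fb"             # the "・" between fields
POS_SEP = "[\u2212\uff0d-]"      # "−", "－" or "-" between parts of speech


def parse_entry(record):
    """Parse one dictionary record (hypothetical helper)."""
    word, length, pos_count, pos_field = record.split(FIELD_SEP)
    pos = tuple(re.split(POS_SEP, pos_field))
    # int() accepts full-width digits such as "２".
    if int(length) != len(word) or int(pos_count) != len(pos):
        raise ValueError("inconsistent record: " + record)
    return Entry(word, int(length), pos)
```

Grouping the parsed entries by `length` would give the per-character-count classification that the dictionary description mentions.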

In the word division pattern dictionary 2, each division pattern is associated with the part-of-speech sequence patterns that occur for it, and both are recorded classified by the character count of the kanji compound word (for example, 6 to 10 characters).

The word division pattern dictionary 2 is built, for example, by extracting headwords from Kojien, Sanseido Kokugo Jiten, the Kadokawa Thesaurus, the EB Dictionary of Scientific and Technical Terms, the Dictionary of Electrical and Electronic Information Terms, the Dictionary of Computer Terms, and the like, selecting only the kanji compound words composed of continuous kanji strings, removing duplicates and short kanji compound words of up to four characters, further excluding proper nouns, Buddhist terms, classical idioms, chemical substance names, and the like, and classifying the remainder by the character count of the kanji compound word (for example, 6 to 10 characters).

A division pattern is the sequence of character counts of the kanji strings formed by dividing a kanji compound word, and is usually written with numerals. In theory there are 2^(n−1) possible division patterns, where n is the number of characters in the kanji compound word, since each of the n−1 gaps between adjacent characters either is or is not a break. In practice, however, the patterns are heavily skewed toward a few specific ones, and not all of the 2^(n−1) patterns actually occur among the kanji compound words to be divided.
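To make the theoretical count concrete, the sketch below enumerates every division pattern of an n-character compound by deciding, for each gap between adjacent characters, whether it is a break. It reads the count in the text as 2^(n−1); the function name `division_patterns` and the tuple representation are assumptions for illustration.

```python
from itertools import product


def division_patterns(n):
    """Return all division patterns of an n-character compound (n >= 1)."""
    patterns = []
    # One boolean per gap between adjacent characters: True means "break here".
    for gaps in product((False, True), repeat=n - 1):
        pattern, run = [], 1
        for is_break in gaps:
            if is_break:
                pattern.append(run)  # close the current kanji string
                run = 1
            else:
                run += 1             # extend the current kanji string
        pattern.append(run)
        patterns.append(tuple(pattern))
    return patterns


if __name__ == "__main__":
    print(len(division_patterns(6)))  # number of patterns for 6 characters
```

The enumeration includes the undivided pattern (n,) and the fully divided pattern (1, 1, ..., 1).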

The applicants selected only the kanji compound words from the 36,107 headwords of the Kadokawa Thesaurus (1989), the 136,949 headwords of Kojien (1996), the 133,381 headwords of the EB Dictionary of Scientific and Technical Terms (1991), the 27,984 headwords of the Dictionary of Electrical and Electronic Information Terms (1997), and the 7,979 headwords of the Dictionary of Computer Terms (1990); removed duplicates and short kanji compound words of up to four characters; further excluded proper nouns, Buddhist terms, classical idioms, chemical substance names, and the like; and analyzed the division patterns of the resulting kanji compound words of six to ten characters (12,951 six-character, 6,527 seven-character, 3,216 eight-character, 666 nine-character, and 286 ten-character kanji compound words).
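A pattern analysis of this kind could be tallied along the following lines, given compounds that have already been divided into words. The function name `build_pattern_table`, the input format, and the sample words in the usage comment are hypothetical and are not the applicants' data; part-of-speech sequence patterns are omitted.

```python
from collections import Counter, defaultdict


def build_pattern_table(segmented_compounds):
    """Count division patterns, grouped by compound length.

    `segmented_compounds` is an iterable of word lists, for example
    [["技術", "文献", "検索"], ...] (made-up sample input).
    Returns {compound_length: Counter({pattern: occurrences})}.
    """
    table = defaultdict(Counter)
    for words in segmented_compounds:
        pattern = tuple(len(w) for w in words)
        table[sum(pattern)][pattern] += 1
    return table
```

For a given length, `table[length].most_common()` lists the patterns in descending order of occurrences, which is the order in which Method 2 tries them.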

Table 1 shows an example of the division patterns for six-character kanji compound words, together with the number of occurrences of each division pattern and the number of part-of-speech sequence patterns that occur for it.

Table 1 shows that, for six-character kanji compound words, three-part divisions (57%) and four-part divisions (42%) together account for 99% of the total, and that a great many kanji compound words contain two-character words. Among three-part divisions, the pattern 2·2·2 accounts for 98%; among four-part divisions, the patterns made up of two one-character words and two two-character words (1·1·2·2, 1·2·1·2, 1·2·2·1, 2·1·2·1, 2·2·1·1) account for 99.9%.

Table 2 shows an example of the division patterns, the number of occurrences of each division pattern, and the number of part-of-speech string patterns found in each division pattern, for seven-character kanji compound words.

Table 2 shows that for seven-character kanji compound words, four-way division (82%) and five-way division (16%) together account for 98% of the total. Among four-way divisions, the division pattern 2・2・2・1 accounts for 43% and the division pattern 2・1・2・2 for 30%, and other division patterns such as 2・2・1・2 and 1・2・2・2 also occur frequently. Among five-way divisions, the three division patterns 2・1・1・2・1, 1・2・1・2・1, and 1・1・2・2・1 account for 73%.

Table 3 shows an example of the division patterns, the number of occurrences of each division pattern, and the number of part-of-speech string patterns found in each division pattern, for eight-character kanji compound words.

Table 3 shows that for eight-character kanji compound words, four-way division (40%) and five-way division (57%) together account for roughly 97% of the total. Among four-way divisions, the division pattern 2・2・2・2 accounts for 92%. Among five-way divisions, the division patterns made up of two one-character words and three two-character words account for more than 99%, although their frequencies differ widely from pattern to pattern. Whereas six-character kanji compound words were mostly divided into three parts, eight-character kanji compound words show a higher share of five-way division. This suggests that as a kanji compound word gets longer, division patterns that include a one-character word such as an affix partway through become more likely than division patterns made up only of two-character words.

Table 4 shows an example of the division patterns, the number of occurrences of each division pattern, and the number of part-of-speech string patterns found in each division pattern, for nine-character kanji compound words.

Table 4 shows that for nine-character kanji compound words, five-way division (58%) and six-way division (40%) together account for 98% of the total. Among five-way divisions, the division pattern 2・2・2・2・1 accounts for 32%, but a fair number of division patterns occur only once. The top four division patterns (2・2・2・2・1, 2・1・2・1・2・1, 2・1・2・2・2, 2・2・1・2・2) account for 59% of the total. Although this is partly because the data set is small, occurrences are clearly concentrated in a few division patterns.

Table 5 shows an example of the division patterns, the number of occurrences of each division pattern, and the number of part-of-speech string patterns found in each division pattern, for ten-character kanji compound words.

Table 5 shows that for ten-character kanji compound words, partly because few such compounds were available, the top four division patterns (2・1・2・2・2・1, 2・2・2・2・2, 2・2・1・2・2・1, 2・1・2・1・2・2) account for 55% of the total. The top three division patterns made up of two one-character words and four two-character words alone account for 57% of six-way divisions and 39% of the total.

As an overall tendency, the number of segments for almost every kanji compound word is either the character count divided by 2 and rounded to the nearest integer, or that value plus 1. Division patterns made up entirely of two-character words, such as 2・2・2, 2・2・2・2, and 2・2・2・2・2, occur frequently, as do division patterns that contain many two-character words, such as 2・2・2・1. As the kanji compound word gets longer, however, the share of any single division pattern tends to fall (for example, 2・2・2・2・2 and 2・1・2・2・2・1 for ten characters). The number of distinct division patterns observed grows with the character count up to eight characters, but from eight characters onward the number of kanji compound words in the data decreases, so the number of division patterns does not become enormous.
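The segment-count tendency described above can be written as a small helper. This is only an illustrative sketch: the function name `expected_division_counts` is hypothetical, and it assumes that "rounded" means rounding half up, which is consistent with the percentages reported for 6 to 10 characters.

```python
def expected_division_counts(n_chars: int) -> tuple:
    """Return the two segment counts that cover almost all compounds of n_chars kanji.

    Assumption: n/2 is rounded half up, so 7 characters give 4 (and 4 + 1 = 5).
    """
    half_rounded = (n_chars + 1) // 2  # n/2 rounded half up for positive integers
    return half_rounded, half_rounded + 1


if __name__ == "__main__":
    for n in range(6, 11):
        print(n, expected_division_counts(n))
```

For the character counts analyzed here, the helper reproduces the pairs named in the text: three- and four-way division for six characters, four- and five-way division for seven and eight characters, and five- and six-way division for nine and ten characters.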

When the division pattern is 2・2・2, the part-of-speech string patterns that occur are, for example, in descending order of frequency (where "sa-hen" denotes a noun that forms a verb with する, "na-adjective" denotes 形動, and "adjective" denotes 形容): noun/noun/noun, noun/sa-hen/noun, noun/noun/sa-hen, sa-hen/sa-hen/noun, sa-hen/noun/noun, noun/sa-hen/sa-hen, sa-hen/noun/sa-hen, na-adjective/noun/noun, na-adjective/sa-hen/noun, sa-hen/sa-hen/sa-hen, noun/na-adjective/noun, na-adjective/noun/sa-hen, noun/na-adjective/sa-hen, sa-hen/na-adjective/noun, noun/sa-hen/na-adjective, na-adjective/sa-hen/sa-hen, sa-hen/na-adjective/sa-hen, noun/noun/verb, noun/noun/na-adjective, na-adjective/na-adjective/noun, noun/verb/noun, verb/sa-hen/noun, verb/noun/noun, noun/verb/sa-hen, na-adjective/na-adjective/sa-hen, noun/numeral/noun, na-adjective/verb/noun, sa-hen/noun/verb, sa-hen/sa-hen/na-adjective, noun/sa-hen/verb, noun/na-adjective/na-adjective, sa-hen/verb/noun, na-adjective/noun/verb, na-adjective/noun/na-adjective, verb/sa-hen/sa-hen, verb/noun/sa-hen, sa-hen/noun/na-adjective, prefix/noun/noun, na-adjective/verb/sa-hen, na-adjective/sa-hen/na-adjective, and sa-hen/verb/sa-hen.

When the division pattern is 2・1・2・1, the part-of-speech string patterns that occur are, for example, in descending order of frequency (where "sa-hen" denotes a noun that forms a verb with する, "na-adjective" denotes 形動, and "adjective" denotes 形容): noun/suffix/noun/noun, noun/noun/sa-hen/noun, noun/suffix/sa-hen/noun, noun/noun/noun/noun, sa-hen/suffix/noun/noun, sa-hen/suffix/sa-hen/noun, noun/suffix/sa-hen/suffix, noun/suffix/noun/suffix, sa-hen/noun/sa-hen/noun, sa-hen/noun/noun/noun, noun/noun/sa-hen/suffix, sa-hen/suffix/noun/suffix, sa-hen/suffix/sa-hen/suffix, noun/prefix/noun/noun, noun/verb/sa-hen/noun, noun/noun/noun/suffix, sa-hen/noun/sa-hen/suffix, na-adjective/noun/noun/noun, noun/verb/noun/noun, noun/prefix/sa-hen/noun, na-adjective/suffix/noun/noun, noun/prefix/noun/suffix, na-adjective/noun/sa-hen/noun, sa-hen/noun/noun/suffix, noun/prefix/sa-hen/suffix, noun/noun/verb/noun, noun/adjective/noun/noun, noun/suffix/na-adjective/suffix, noun/suffix/noun/verb, noun/noun/na-adjective/noun, noun/suffix/noun/adjective, noun/suffix/na-adjective/noun, na-adjective/suffix/sa-hen/noun, sa-hen/suffix/na-adjective/suffix, na-adjective/noun/noun/suffix, na-adjective/noun/sa-hen/suffix, noun/numeral/noun/noun, sa-hen/suffix/noun/adjective, noun/sa-hen/sa-hen/noun, sa-hen/suffix/na-adjective/noun, na-adjective/suffix/sa-hen/suffix, na-adjective/prefix/noun/noun, sa-hen/sa-hen/sa-hen/noun, sa-hen/prefix/sa-hen/suffix, noun/adjective/sa-hen/noun, na-adjective/prefix/sa-hen/noun, sa-hen/adjective/noun/noun, noun/noun/noun/verb, noun/verb/noun/suffix, sa-hen/prefix/noun/noun, sa-hen/suffix/verb/noun, sa-hen/suffix/noun/verb, noun/prefix/na-adjective/suffix, noun/numeral/sa-hen/noun, noun/suffix/adjective/noun, verb/suffix/noun/noun, noun/sa-hen/noun/noun, noun/noun/noun/adjective, noun/conjunction/sa-hen/suffix, noun/suffix/verb/noun, noun/adjective/sa-hen/suffix, sa-hen/noun/verb/noun, na-adjective/adjective/noun/noun, noun/sa-hen/noun/suffix, sa-hen/adjective/sa-hen/noun, sa-hen/noun/noun/sa-hen, verb/noun/noun/noun, sa-hen/verb/noun/noun, sa-hen/noun/na-adjective/noun, noun/prefix/na-adjective/noun, noun/noun/sa-hen/verb, na-adjective/verb/noun/noun, na-adjective/suffix/na-adjective/suffix, na-adjective/verb/sa-hen/noun, na-adjective/adjective/sa-hen/noun, sa-hen/sa-hen/noun/noun, na-adjective/noun/na-adjective/noun, 動 (as printed)/sa-hen/noun/noun, na-adjective/numeral/sa-hen/noun, sa-hen/sa-hen/sa-hen/suffix, noun/adjective/noun/verb, noun/verb/sa-hen/verb, na-adjective/sa-hen/noun/suffix, verb/noun/sa-hen/noun, sa-hen/prefix/noun/adjective, sa-hen/prefix/noun/suffix, sa-hen/adjective/noun/verb, sa-hen/verb/sa-hen/suffix, sa-hen/prefix/sa-hen/noun, noun/verb/verb/noun, noun/suffix/sa-hen/verb, na-adjective/adjective/noun/suffix, verb/noun/verb/suffix, sa-hen/prefix/na-adjective/suffix, and na-adjective/noun/na-adjective/suffix.

The word division pattern dictionary 2 may record, for each character count of kanji compound word (for example, 6 to 10 characters) and in descending order of the frequency (number of occurrences) of the division pattern among kanji compound words, the following items in association with one another: a word division pattern comprising, for example, the character count of the kanji compound word, the number of segments, and the division pattern; the rank of the word division pattern; the frequency of the division pattern; and the part-of-speech string pattern of all the kanji strings obtained by dividing with that division pattern.

A word division pattern can be written, for example, in the order character count, number of segments, division pattern: 6P (character count of the kanji compound word) 3B (number of segments) 222 (division pattern), 6P4B2121, 7P4B2221, 7P5B21121, and so on.

Examples of records in the word division pattern dictionary 2 include 6P3B222 1 2066 noun/noun/noun, 6P3B222 1 1725 noun/sa-hen/noun, 6P3B222 1 838 noun/noun/sa-hen, 6P4B2121 2 698 noun/suffix/noun/noun, 6P3B222 1 520 sa-hen/sa-hen/noun, 6P3B222 1 507 sa-hen/noun/noun, 6P3B222 1 429 noun/sa-hen/sa-hen, and 6P4B2121 2 281 noun/noun/sa-hen/noun. In this case, after being loaded into main memory, the word division pattern dictionary 2 is organized into a structure consisting of each word division pattern and the plural part-of-speech string patterns that exist for the division pattern contained in that word division pattern. The four items (the word division pattern, the rank of the word division pattern, the frequency of the division pattern, and the part-of-speech string pattern of all the kanji strings obtained by dividing with the division pattern) may be arranged in that order or in any other order.
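As an illustration of how such records might be read and then regrouped in memory, the following sketch parses one record and builds a lookup from word division pattern to its part-of-speech string patterns. Several points are assumptions, not part of the disclosure: the four fields are taken to be separated by whitespace (as in the English rendering of the examples), the original Japanese part-of-speech labels joined by "・" are used, every segment length is a single digit, and the names `PatternRecord`, `parse_record`, and `build_index` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class PatternRecord(NamedTuple):
    n_chars: int        # character count of the compound (the "P" field)
    n_parts: int        # number of segments (the "B" field)
    lengths: tuple      # division pattern, e.g. (2, 2, 2)
    rank: int           # rank of the word division pattern
    frequency: int      # number of occurrences recorded in the entry
    pos_pattern: tuple  # part-of-speech string pattern


def parse_record(line: str) -> PatternRecord:
    """Parse one record such as '6P3B222 1 2066 名詞・名詞・名詞'."""
    code, rank, freq, pos = line.split()
    p_part, rest = code.split("P")
    b_part, digits = rest.split("B")
    lengths = tuple(int(d) for d in digits)  # assumes single-digit segment lengths
    if len(lengths) != int(b_part) or sum(lengths) != int(p_part):
        raise ValueError(f"inconsistent word division pattern: {code}")
    return PatternRecord(int(p_part), int(b_part), lengths,
                         int(rank), int(freq), tuple(pos.split("・")))


def build_index(lines):
    """Group part-of-speech string patterns under each word division pattern."""
    index = defaultdict(list)
    for line in lines:
        rec = parse_record(line)
        index[(rec.n_chars, rec.lengths)].append(rec.pos_pattern)
    return dict(index)


if __name__ == "__main__":
    sample = [
        "6P3B222 1 2066 名詞・名詞・名詞",
        "6P3B222 1 1725 名詞・サ変・名詞",
        "6P4B2121 2 698 名詞・接尾・名詞・名詞",
    ]
    for key, patterns in build_index(sample).items():
        print(key, patterns)
```

Keying the index by character count and division pattern keeps both lookups used later (does the division pattern exist, and does the part-of-speech string pattern exist under it) to a single dictionary access.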

The extraction and matching means 11 sequentially extracts kanji strings of the extraction length, following a preset order of extraction lengths, starting from the first kanji of the kanji compound word or from the kanji immediately after the most recently determined break position, and matches each extracted kanji string against the basic words by referring to the Japanese dictionary 1.

When a basic word matching the kanji string extracted by the extraction and matching means 11 is found, the break determining means 12 assigns a part of speech to the matching kanji string in accordance with the Japanese dictionary 1 and, if kanji remain after the matching kanji string, determines the position between the end of that string and the kanji immediately following it as a break position for dividing the kanji compound word.

When no basic word is found that matches any of the kanji strings extracted by the extraction and matching means 11 for all of the preset extraction lengths, the unknown word determining means 13 defines the single extracted kanji as a one-character unknown word that is not present in the Japanese dictionary 1 and, if kanji remain after it, determines the position between that kanji and the kanji immediately following it as a break position for dividing the kanji compound word.
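The interplay of means 11 to 13 can be sketched as a left-to-right scan. This is a minimal illustration under stated assumptions: the Japanese dictionary 1 is modeled as a plain mapping from basic word to part of speech, the preset order of extraction lengths is taken to be (2, 1), which this passage does not specify, and the function name `greedy_split` and the label "unknown" are hypothetical.

```python
def greedy_split(compound, dictionary, lengths=(2, 1)):
    """Split a kanji compound from left to right.

    dictionary: mapping of basic word -> part of speech (stand-in for dictionary 1).
    lengths: preset order of extraction lengths (an assumed order).
    Returns a list of (kanji string, part of speech) pairs.
    """
    segments = []
    pos = 0
    while pos < len(compound):
        for n in lengths:
            candidate = compound[pos:pos + n]
            if len(candidate) == n and candidate in dictionary:
                # Means 12: a basic word matched, so tag it and break after it.
                segments.append((candidate, dictionary[candidate]))
                break
        else:
            # Means 13: nothing matched for any length, so take one kanji
            # as a one-character unknown word.
            candidate = compound[pos]
            segments.append((candidate, "unknown"))
        pos += len(candidate)
    return segments


if __name__ == "__main__":
    toy_dictionary = {"情報": "noun", "処理": "sa-hen", "技術": "noun", "者": "suffix"}
    print(greedy_split("情報処理技術者", toy_dictionary))
```

The break positions are implicit in the returned list: each boundary between consecutive pairs corresponds to one break position determined by means 12 or means 13.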

The unknown word concatenation means 14 concatenates two or more one-character unknown words. Even when two or more one-character unknown words exist, this processing need not always be performed; it may be performed only when the option for concatenating unknown words is enabled.

In the unknown word concatenation means 14, for example, when the p-th kanji string of the kanji compound word is an unknown word defined by the unknown word determining means 13, the (p+1)-th and subsequent kanji strings are checked for unknown words, the k consecutive unknown words starting from the p-th kanji string are concatenated into a concatenated unknown word, and the break position determined by the unknown word determining means 13 is moved to between the end of the concatenated unknown word and the kanji immediately following it. The Japanese dictionary 1 may also be searched for the concatenated unknown word. If the concatenated unknown word is present in the Japanese dictionary 1, a part of speech is assigned to it and the break position is moved as described above; if it is not present, the unknown words may be left unconcatenated.
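The basic form of this concatenation, without the optional dictionary lookup, might look like the sketch below. It assumes the segmentation is held as a list of (kanji string, part of speech) pairs in which one-character unknown words carry the label "unknown"; that representation and the name `join_unknown_runs` are hypothetical.

```python
def join_unknown_runs(segments):
    """Merge each run of consecutive unknown words into one concatenated unknown word."""
    merged = []
    for surface, pos in segments:
        if pos == "unknown" and merged and merged[-1][1] == "unknown":
            # Extend the previous unknown word, which removes the break between them.
            merged[-1] = (merged[-1][0] + surface, "unknown")
        else:
            merged.append((surface, pos))
    return merged


if __name__ == "__main__":
    before = [("情報", "noun"), ("甲", "unknown"), ("乙", "unknown"), ("処理", "sa-hen")]
    print(join_unknown_runs(before))
```

The variant that consults the dictionary would additionally look up each merged string and either assign its part of speech or undo the merge.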

The adjacent word concatenation means 15 concatenates adjacent kanji strings that include a one-character unknown word. Even when such adjacent kanji strings exist, this processing need not always be performed; it may be performed only when the option for concatenating adjacent kanji strings that include an unknown word is enabled.

In the adjacent word concatenation means 15, for example, when the p-th kanji string of the kanji compound word is an unknown word defined by the unknown word determining means 13, the p-th and (p+1)-th kanji strings are concatenated into a first adjacent word, and the Japanese dictionary 1 is searched for it. If the first adjacent word is present in the Japanese dictionary 1, a part of speech is assigned to it and the break position determined by the unknown word determining means 13 is moved to between the end of the first adjacent word and the kanji immediately following it. If the first adjacent word is not present in the Japanese dictionary 1, the p-th and (p-1)-th kanji strings are concatenated into a second adjacent word, and the Japanese dictionary 1 is searched for it. If the second adjacent word is present in the Japanese dictionary 1, a part of speech is assigned to it and the break position determined by the unknown word determining means 13 is moved to between the end of the second adjacent word and the kanji immediately following it. If the second adjacent word is not present in the Japanese dictionary 1 either, no adjacent kanji strings are concatenated.
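The forward-then-backward lookup can be sketched as follows. The sketch assumes a list of (kanji string, part of speech) pairs with unknown words labeled "unknown" and a mapping from basic word to part of speech standing in for the Japanese dictionary 1; the name `join_adjacent` is hypothetical.

```python
def join_adjacent(segments, dictionary):
    """Try to absorb each unknown word into a neighbor that forms a dictionary word."""
    segs = list(segments)
    i = 0
    while i < len(segs):
        surface, pos = segs[i]
        if pos == "unknown":
            # First adjacent word: p-th string followed by the (p+1)-th string.
            if i + 1 < len(segs):
                forward = surface + segs[i + 1][0]
                if forward in dictionary:
                    segs[i:i + 2] = [(forward, dictionary[forward])]
                    i += 1
                    continue
            # Second adjacent word: (p-1)-th string followed by the p-th string.
            if i > 0:
                backward = segs[i - 1][0] + surface
                if backward in dictionary:
                    segs[i - 1:i + 1] = [(backward, dictionary[backward])]
                    continue  # index i now points at the next unprocessed segment
        i += 1
    return segs


if __name__ == "__main__":
    toy_dictionary = {"研究": "sa-hen", "研究所": "noun"}
    print(join_adjacent([("研究", "sa-hen"), ("所", "unknown")], toy_dictionary))
```

If neither combined string is found, the unknown word is left where it is, matching the last case in the description.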

The break position confirming means 16 refers to the word division pattern dictionary 2 and confirms the break positions determined by the break determining means 12, the unknown word determining means 13, the unknown word concatenation means 14, and the adjacent word concatenation means 15 as the break positions at which the kanji compound word is divided.

In the break position confirming means 16, as a first stage, the division patterns registered in the word division pattern dictionary 2 for the character count of the kanji compound word being divided are searched to determine whether any division pattern matches the sequence of character counts of the kanji strings delimited by the determined break positions. If such a division pattern exists in the word division pattern dictionary 2, then as a second stage the part-of-speech string patterns registered for that character count are searched to determine whether any of them matches the sequence of parts of speech of the kanji strings obtained by dividing at the determined break positions. If such a part-of-speech string pattern exists in the word division pattern dictionary 2, the determined break positions are confirmed as the break positions at which the kanji compound word is divided.

If no division pattern in the word division pattern dictionary 2 matches the sequence of character counts of the kanji strings delimited by the determined break positions, an output marker indicating that there is no matching division pattern may be attached. Likewise, if no part-of-speech string pattern in the word division pattern dictionary 2 matches the sequence of parts of speech of the kanji strings obtained by dividing at the determined break positions, an output marker indicating that there is no matching part-of-speech string pattern may be attached.
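The two-stage check and the optional output markers can be sketched as one function. The sketch assumes the word division pattern dictionary 2 has been regrouped into a mapping from (character count, division pattern) to a set of part-of-speech string patterns; that layout, the function name `confirm_breaks`, and the three marker strings are hypothetical.

```python
def confirm_breaks(segments, pattern_index):
    """Check a candidate segmentation against the word division pattern dictionary.

    segments: list of (kanji string, part of speech) pairs.
    pattern_index: {(n_chars, lengths): set of part-of-speech tuples}.
    Returns "confirmed", or a marker naming the stage that failed.
    """
    lengths = tuple(len(surface) for surface, _ in segments)
    n_chars = sum(lengths)

    # Stage 1: is the sequence of character counts a registered division pattern?
    pos_patterns = pattern_index.get((n_chars, lengths))
    if pos_patterns is None:
        return "no-division-pattern"

    # Stage 2: is the sequence of parts of speech registered for that pattern?
    if tuple(pos for _, pos in segments) not in pos_patterns:
        return "no-pos-pattern"

    return "confirmed"


if __name__ == "__main__":
    index = {(6, (2, 2, 2)): {("noun", "noun", "noun"), ("noun", "sa-hen", "noun")}}
    candidate = [("情報", "noun"), ("処理", "sa-hen"), ("技術", "noun")]
    print(confirm_breaks(candidate, index))
```

Returning a distinct marker for each stage mirrors the two separate output markers mentioned in the text and makes it visible whether the shape of the division or its part-of-speech sequence was unseen.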

Next, the second kanji compound word dividing device of the present invention is described. Matters that are the same as in the kanji compound word dividing device described above are omitted. The kanji compound word dividing device 20 comprises the Japanese dictionary 1, the word division pattern dictionary 2, a provisional division matching means 21, a division determining means 22, an unknown word division determining means 23, and a division position confirming means 24 (FIG. 2).

The provisional division matching means 21 refers to the word division pattern dictionary 2, provisionally divides the kanji compound word into a plurality of kanji strings according to each division pattern in turn, in descending order of division pattern frequency, and then matches all of the provisionally divided kanji strings against the basic words by referring to the Japanese dictionary 1.

When matching basic words are found for all of the kanji strings provisionally divided by the provisional division matching means 21, the division determining means 22 assigns parts of speech to all of those kanji strings in accordance with the Japanese dictionary 1, and determines the division positions at which the kanji compound word is divided according to the division pattern for which basic words matching all of the provisionally divided kanji strings were found.

When no matching basic word is found for any one of the kanji strings provisionally divided by the provisional division matching means 21, the unknown word division determining means 23 defines each kanji string that is not present in the Japanese dictionary 1 as an unknown word. When, for every division pattern, at least one of the provisionally divided kanji strings has no matching basic word, the division positions at which the kanji compound word is divided are determined according to the division pattern that yields the fewest unknown words and, among such patterns, has the highest frequency. A step that checks whether provisional division has been carried out for all division patterns may also be provided. In that case, if provisional division has not yet been carried out for all division patterns, processing returns to the provisional division matching means 21; once it has been carried out for all division patterns, the division positions are determined according to the division pattern that yields the fewest unknown words and has the highest frequency.

The division position confirming means 24 refers to the word division pattern dictionary 2 and confirms the determined division positions as the division positions at which the kanji compound word is divided.

In the division position confirming means 24, the part-of-speech string patterns registered in the word division pattern dictionary 2 for the character count of the kanji compound word being divided are searched to determine whether any of them matches the sequence of parts of speech of the kanji strings obtained by dividing at the determined division positions. If such a part-of-speech string pattern exists in the word division pattern dictionary 2, the determined division positions are confirmed as the break positions at which the kanji compound word is divided.
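The pattern-driven selection performed by means 21 to 23 can be sketched as below; the final part-of-speech check of means 24 is left out. The sketch assumes the division patterns for the relevant character count are supplied as a list of length tuples already sorted from most to least frequent, and that the Japanese dictionary 1 is a mapping from basic word to part of speech. The name `split_by_patterns` and the label "unknown" are hypothetical.

```python
def split_by_patterns(compound, dictionary, patterns):
    """Choose a segmentation by trying division patterns in frequency order.

    patterns: list of length tuples, most frequent first.
    Returns a list of (kanji string, part of speech) pairs, or None if no
    pattern fits the length of the compound.
    """
    best = None  # (number of unknown words, tagged segments)
    for lengths in patterns:
        if sum(lengths) != len(compound):
            continue
        pieces, start = [], 0
        for n in lengths:
            pieces.append(compound[start:start + n])
            start += n
        tagged = [(w, dictionary.get(w, "unknown")) for w in pieces]
        n_unknown = sum(1 for _, pos in tagged if pos == "unknown")
        if n_unknown == 0:
            return tagged  # means 22: every piece is a basic word
        # Means 23: keep the fewest unknown words; the strict comparison keeps
        # the earlier, more frequent pattern when counts tie.
        if best is None or n_unknown < best[0]:
            best = (n_unknown, tagged)
    return best[1] if best else None


if __name__ == "__main__":
    toy_dictionary = {"情報": "noun", "処理": "sa-hen", "技術": "noun", "者": "suffix"}
    print(split_by_patterns("情報処理技術者", toy_dictionary, [(2, 2, 2, 1), (2, 1, 2, 2)]))
```

Because the patterns are visited in frequency order, the first fully matched pattern is also the most frequent one that matches, so the loop can stop as soon as it finds one.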

Next, the third kanji compound word dividing device of the present invention is described. Matters that are the same as in the kanji compound word dividing devices described above are omitted. The kanji compound word dividing device 30 comprises a kanji string extraction processing means 31, a flag assignment determination processing means 32, a basic word matching processing means 33, a first matching result processing means 34, a second matching result processing means 35, and a break position confirmation processing means 36 (FIG. 3).

The kanji string extraction processing means 31 takes as the extraction start position (the position of the first character of the kanji string to be extracted from the kanji compound word) either the head of the kanji compound word or the position of the most recent extraction start character as updated from the head, and extracts from the kanji compound word a kanji string of the set extraction length beginning at the extraction start position.

The flag assignment determination processing means 32 determines whether a changed flag has been assigned to any kanji in the kanji string extracted by the kanji string extraction processing means 31. If a changed flag has been assigned to any of those kanji, it moves the extraction start character one character toward the end of the compound from the extraction start position and returns to the kanji string extraction processing means 31.

The basic word matching processing means 33 refers to the Japanese dictionary 1 and matches the kanji string extracted by the kanji string extraction processing means 31 against the basic words.

When the basic word matching processing means 33 finds a basic word that matches the kanji string extracted by the kanji string extraction processing means 31, the first matching result processing means 34 assigns a part of speech to the extracted kanji string in accordance with the Japanese dictionary 1, determines the position between the end of the extracted kanji string and the kanji immediately following it as a break position for dividing the kanji compound word, and changes the flag assigned to each kanji making up the extracted kanji string. It then checks whether at least as many kanji as the extraction length remain after the extracted kanji string in the kanji compound word. If fewer kanji than the extraction length remain, it reduces the extraction length by one, resets the extraction start character to the head of the kanji compound word, and returns to the kanji string extraction processing means 31. If at least as many kanji as the extraction length remain, it moves the extraction start character toward the end of the compound by the extraction length and returns to the kanji string extraction processing means 31.

When the basic word collation processing means 33 finds no basic word that matches the kanji string extracted by the kanji string extraction processing means 31, the second collation result processing means 35 checks whether any kanji remain after the extracted kanji string in the kanji compound word. If none remain, it reduces the number of extracted characters by one, resets the extraction start character to the beginning of the kanji compound word, and returns to the kanji string extraction processing means 31. If kanji do remain, it moves the extraction start character one character toward the end of the word and returns to the kanji string extraction processing means 31.

When a changed flag has been assigned to every kanji constituting the kanji compound word, or when the set number of extracted characters has reached 0, the delimiter position finalization processing means 36 finalizes the delimiter positions determined by the first collation result processing means 34 as the positions at which the kanji compound word is divided, and treats each kanji to which no changed flag has been assigned as a one-character unknown word.

The kanji compound word dividing method of the present invention divides a kanji compound word, composed of a continuous kanji string, into words by referring to the Japanese dictionary and the word division pattern dictionary. Methods 1 to 3 are described below as examples.

In Method 1, the numbers of extracted characters are tried in descending order of the lengths of the basic words recorded in the Japanese dictionary. For example, if the basic words are 1 to 4 characters long, the lengths recorded in the Japanese dictionary 1 are, in descending order, 4, 3, 2 and 1 characters. Four characters are therefore extracted from the kanji compound word first and collated against the four-character basic words. If there is no match, three characters are extracted and collated against the three-character basic words. If there is still no match, two characters are extracted and collated against the two-character basic words, and if there is again no match, one character is extracted and collated against the one-character basic words.

With Method 1, the six-character kanji compound word 「遠隔早期警戒」 ("remote early warning") is divided by the following procedure. Here N denotes the position, counted from the beginning of the kanji compound word, of the first character of the kanji string being extracted, and L denotes the number of characters in that extracted kanji string.

Starting from the beginning of the kanji compound word (N = 1, 「遠」) (S101), four characters (L = 4) (S102) are extracted (「遠隔早期」) (S103) and collated against the four-character basic words in the Japanese dictionary (S104). Because 「遠隔早期」 is not among the four-character basic words (S104/No), three characters (L = 3) (S105/No, S106) are extracted from the beginning (「遠隔早」) (S103) and collated against the three-character basic words (S104). Because 「遠隔早」 is not among the three-character basic words (S104/No), two characters (L = 2) (S105/No, S106) are extracted from the beginning (「遠隔」) (S103) and collated against the two-character basic words (S104). Because 「遠隔」 is among the two-character basic words (S104/Yes), processing proceeds from the first step to the second step: a part of speech is assigned to the kanji string 「遠隔」 (遠隔 (adjectival noun) 早期警戒) (S107), and the position between 「隔」, the last character of the matched string 「遠隔」, and the kanji 「早」 immediately after it is determined as a delimiter position (遠隔 (adjectival noun) | 早期警戒) (S109).

Here N = 1 and L = 2, so N becomes 1 + 2 = 3 (S110). This equals the length of the kanji compound word minus 3 (6 − 3 = 3) (S111/No), so four characters (L = 4) (S112) are next extracted starting immediately after the preceding delimiter position (N = 1 + 2 = 3, 「早」) (S110), giving 「早期警戒」 (S103), and collated against the four-character basic words in the Japanese dictionary (S104). Because 「早期警戒」 is not among the four-character basic words (S104/No), three characters (L = 3) (S105/No, S106) are extracted from the same position (「早期警」) (S103) and collated against the three-character basic words (S104). Because 「早期警」 is not among the three-character basic words (S104/No), two characters (L = 2) (S105/No, S106) are extracted from the same position (「早期」) (S103) and collated against the two-character basic words (S104). Because 「早期」 is among the two-character basic words (S104/Yes), processing proceeds from the first step to the second step: a part of speech is assigned to the kanji string 「早期」 (遠隔 (adjectival noun) | 早期 (adjectival noun) 警戒) (S107), and the position between 「期」, the last character of the matched string 「早期」, and the kanji 「警」 immediately after it is determined as a delimiter position (遠隔 (adjectival noun) | 早期 (adjectival noun) | 警戒) (S109).

Here N = 3 and L = 2, so N becomes 3 + 2 = 5 (S110). This is greater than the length of the kanji compound word minus 3 (6 − 3 = 3) (S111/Yes) but smaller than the length of the kanji compound word (6) (S113/No), so two characters (L = 6 − 5 + 1 = 2) (S114) are extracted starting immediately after the preceding delimiter position (N = 3 + 2 = 5, 「警」) (S110), giving 「警戒」, and collated against the two-character basic words (S104). Because 「警戒」 is among the two-character basic words, processing proceeds from the first step to the second step and a part of speech is assigned to the kanji string 「警戒」 (遠隔 (adjectival noun) | 早期 (adjectival noun) | 警戒 (verb)) (S107). At this point every kanji string has a part of speech and all delimiter positions have been determined. Since N = 5 and L = 2, N becomes 5 + 2 = 7 (S110), which means that no kanji remain after the matched kanji string (N = 5 + 2 > 6) (S113/Yes).
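The Method 1 walkthrough above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `segment_longest_match`, the toy dictionary `BASIC_WORDS` and the English part-of-speech labels are all assumptions made for the example, and positions are 0-based rather than the 1-based N used in the text.

```python
# Toy Japanese dictionary: basic word -> part of speech (hypothetical contents).
BASIC_WORDS = {
    "遠隔": "adjectival noun",
    "早期": "adjectival noun",
    "警戒": "verb",
}


def segment_longest_match(compound, dictionary, max_len=4):
    """Greedy longest-match split in the spirit of Method 1.

    Returns a list of (kanji_string, part_of_speech) pairs; a part of speech
    of None marks a one-character unknown word.
    """
    segments = []
    n = 0  # 0-based extraction start position
    while n < len(compound):
        # Try the longest extraction length first, limited by what remains.
        for length in range(min(max_len, len(compound) - n), 0, -1):
            candidate = compound[n:n + length]
            if candidate in dictionary:
                segments.append((candidate, dictionary[candidate]))
                n += length  # continue right after the new delimiter position
                break
        else:
            # No basic word of any length matched: one-character unknown word.
            segments.append((compound[n], None))
            n += 1
    return segments


print(segment_longest_match("遠隔早期警戒", BASIC_WORDS))
```

With the toy dictionary, the call reproduces the division 遠隔 | 早期 | 警戒 traced in the text.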

In the case above there are no unknown words at all, so neither the fourth step (unknown word concatenation), which joins two or more one-character unknown words, nor the fifth step (adjacent word concatenation), which joins adjacent kanji strings that include a one-character unknown word, is needed.

If, however, 「早期」, 「早」 and 「期」 are all absent from the Japanese dictionary (S105/Yes), 「早」 and 「期」 are defined as unknown words in the third step (S108), and the state once all delimiter positions have been determined is 遠隔 | 早 (unknown) | 期 (unknown) | 警戒 (S109/Yes). Applying the unknown word concatenation of the fourth step then merges the consecutive unknown words into a single unknown word, giving 遠隔 | 早期 (unknown) | 警戒.

Likewise, if 「早期」 and 「期」 are absent from the Japanese dictionary, 「期」 is defined as an unknown word in the third step (S108), and the state once all delimiter positions have been determined is 遠隔 | 早 | 期 (unknown) | 警戒 (S109/Yes). Applying the adjacent word concatenation of the fifth step then merges the adjacent kanji strings that include the one-character unknown word into a single unknown word, giving 遠隔 | 早期 (unknown) | 警戒.
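The two concatenation steps can be illustrated with a short sketch. The function names are hypothetical, and the rule used for the fifth step (merging a one-character unknown word with a neighbouring one-character string) is an assumption inferred from the single example in the text; the source does not spell out the exact condition here.

```python
def join_unknown_runs(segments):
    """Fourth step (sketch): merge consecutive unknown words into one."""
    merged = []
    for text, pos in segments:
        if pos is None and merged and merged[-1][1] is None:
            merged[-1] = (merged[-1][0] + text, None)
        else:
            merged.append((text, pos))
    return merged


def join_adjacent_single(segments):
    """Fifth step (sketch, assumed rule): merge two neighbouring
    one-character strings when at least one of them is unknown."""
    merged = []
    for text, pos in segments:
        prev = merged[-1] if merged else None
        if (prev is not None and len(text) == 1 and len(prev[0]) == 1
                and (pos is None or prev[1] is None)):
            merged[-1] = (prev[0] + text, None)  # result is an unknown word
        else:
            merged.append((text, pos))
    return merged


# 遠隔 | 早 (unknown) | 期 (unknown) | 警戒
case_4 = [("遠隔", "adjectival noun"), ("早", None), ("期", None), ("警戒", "verb")]
# 遠隔 | 早 | 期 (unknown) | 警戒; "noun" for 早 is a placeholder label
case_5 = [("遠隔", "adjectival noun"), ("早", "noun"), ("期", None), ("警戒", "verb")]

print(join_unknown_runs(case_4))
print(join_adjacent_single(case_5))
```

Both calls yield 遠隔 | 早期 (unknown) | 警戒, matching the two outcomes described above.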

Method 2 provisionally divides the kanji compound word into several kanji strings based on the information in the word division pattern dictionary, and then collates every provisionally divided kanji string against the basic words in the Japanese dictionary. With Method 2, the six-character kanji compound word 「遠隔早期警戒」 is divided by the following procedure.

Among the division patterns recorded in the word division pattern dictionary for six-character kanji compound words, the most frequent is 2·1·2·1 and the second most frequent is 2·2·2. Accordingly, in the sixth step, the first (i = 1) division pattern 2·1·2·1 (S201) is used to provisionally divide 「遠隔早期警戒」 into 「遠隔／早／期警／戒」 (S202), and the resulting kanji strings are collated, starting from the first, against the one-character and two-character basic words in the Japanese dictionary (S203).

Among the provisionally divided kanji strings, no matching basic word is found for 「期警」 (S203/No). In the eighth step, the kanji string absent from the Japanese dictionary (「期警」) is therefore defined as an unknown word (遠隔／早／期警 (unknown)／戒) (S204). After confirming that not all division patterns have yet been tried (S205/Yes), processing returns to the sixth step.

Next, the second (i = 1 + 1) division pattern 2·2·2 (S208) is used to provisionally divide 「遠隔早期警戒」 into 「遠隔／早期／警戒」 (S202), and the resulting kanji strings are collated, starting from the first, against the two-character basic words in the Japanese dictionary (S203).

All of the provisionally divided kanji strings exist in the Japanese dictionary, that is, a matching basic word is found for every one of them (S203/Yes). In the seventh step, a part of speech is therefore assigned to every kanji string (遠隔 (adjectival noun) | 早期 (adjectival noun) | 警戒 (verb)) (S210), and the delimiter positions of the division pattern for which every provisionally divided kanji string matched a basic word (2·2·2) are determined as the positions at which the kanji compound word is divided (遠隔 (adjectival noun) | 早期 (adjectival noun) | 警戒 (verb)).
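A minimal sketch of the pattern-driven procedure of Method 2 follows. The names `split_by_pattern`, `segment_by_patterns` and `PATTERNS_BY_LENGTH`, the toy dictionary and the choice to return `None` when no pattern fits are assumptions for illustration; the text above does not state what happens once every pattern has failed.

```python
# Toy Japanese dictionary (hypothetical contents and labels).
BASIC_WORDS = {
    "遠隔": "adjectival noun",
    "早期": "adjectival noun",
    "警戒": "verb",
}

# Division patterns per compound length, most frequent first.
# Only the two six-character patterns named in the text are listed.
PATTERNS_BY_LENGTH = {6: [(2, 1, 2, 1), (2, 2, 2)]}


def split_by_pattern(compound, pattern):
    """Provisionally divide `compound` into pieces of the given lengths."""
    pieces, start = [], 0
    for size in pattern:
        pieces.append(compound[start:start + size])
        start += size
    return pieces


def segment_by_patterns(compound, dictionary, patterns_by_length):
    """Try patterns in frequency order; accept the first whose pieces
    are all basic words. Returns None if no pattern fits (assumption)."""
    for pattern in patterns_by_length.get(len(compound), []):
        pieces = split_by_pattern(compound, pattern)
        if all(piece in dictionary for piece in pieces):
            return [(piece, dictionary[piece]) for piece in pieces]
    return None


print(split_by_pattern("遠隔早期警戒", (2, 1, 2, 1)))
print(segment_by_patterns("遠隔早期警戒", BASIC_WORDS, PATTERNS_BY_LENGTH))
```

The first pattern produces the piece 「期警」, which is not a basic word, so the sketch falls through to 2·2·2 as in the walkthrough.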

Method 3 determines the division positions by moving the extraction position through the kanji compound word step by step, based on the basic words contained in the Japanese dictionary. With Method 3, the eight-character kanji compound word 「良性副腎皮質腫瘍」 ("benign adrenocortical tumor") is divided by the following procedure. Here the collation direction is from front to back.

The kanji compound word to be divided, 「良性副腎皮質腫瘍」, is collated against the basic words of the Japanese dictionary in order of length, for example four-character basic words, then three-character, two-character and one-character basic words. The following variables are prepared: the number of characters to extract from the kanji compound word, that is, the length of the basic words being collated (Lw); the length of the kanji compound word (Len); the extraction start position (Pos); and the analysis state of each kanji constituting the kanji compound word (Flag). Flag represents the analysis state of each individual kanji: 0 is the initial state and indicates that no extracted kanji string has matched a basic word in the Japanese dictionary, while a value of, for example, 1 to 4 indicates the length (Lw) of the basic word that matched the extracted kanji string. In the initial state the Flag of every character is 0. Because the collation direction is from front to back, the initial extraction start position is the beginning of the kanji compound word (Pos = 1), and the length Len of the kanji compound word 「良性副腎皮質腫瘍」 is 8 (S301).

First, four characters (Lw = 4), the number of extracted characters set initially, are extracted from the kanji compound word starting at its beginning (Pos = 1), and it is determined whether a flag other than 0 has been assigned to any of the kanji constituting the extracted kanji string (S302). Because the Flags of all four (= Lw) kanji from the beginning of the kanji compound word are 0 (S302/Yes), the extracted kanji string 「良性副腎」 is collated against the four-character basic words in the Japanese dictionary (S303). Because no four-character basic word matches 「良性副腎」 (S303/No), the extraction start character is moved one character toward the end of the word (Pos = 1 + 1 = 2) (S308). The sum of the new extraction start position and the basic word length (Pos + Lw) is 6, which does not exceed the length of the kanji compound word, 8 (S309/No), so four characters (Lw = 4) are extracted from the new extraction start position (Pos = 2) and the flag check is repeated (S302). Because the Flags of all the kanji in the extracted kanji string are 0 (S302/Yes), the extracted kanji string 「性副腎皮」 is collated against the four-character basic words in the Japanese dictionary (S303). Because no four-character basic word matches 「性副腎皮」 (S303/No), the extraction start character (Pos = 2) is moved one character toward the end of the word (Pos = 2 + 1 = 3) (S308). The sum Pos + Lw is now 7, which does not exceed the length of the kanji compound word, 8 (S309/No), so four characters (Lw = 4) are extracted from the new extraction start position (Pos = 3) and the flag check is repeated (S302). Because the Flags of all the kanji in the extracted kanji string are 0 (S302/Yes), the extracted kanji string 「副腎皮質」 is collated against the four-character basic words in the Japanese dictionary (S303).

Because a four-character basic word matches 「副腎皮質」 (S303/Yes), the part of speech (noun) is assigned to 「副腎皮質」 (S304), the position between the end of the extracted kanji string and the kanji immediately after it is determined as a delimiter position, and the Flag of each of the four kanji of 「副腎皮質」 is set to 4 (S305). The Flags of the kanji constituting the kanji compound word are not yet all greater than 0 (S306/No), so the extraction start character is moved four characters toward the end of the word (S307), making the extraction start position 7 (Pos = 3 + 4). The sum of the new extraction start position and the basic word length (Pos + Lw) is now 11, which exceeds the length of the kanji compound word, 8 (S309/Yes), so the number of extracted characters is reduced by one to three, and the length of the basic words to collate (Lw) becomes 3 (S310).

Because the number of extracted characters is not 0 (S311/No), the extraction start character is reset to the beginning of the kanji compound word (Pos = 1) (S312). Three characters (Lw = 3), the new number of extracted characters, are extracted from the beginning of the kanji compound word (Pos = 1), and it is determined whether a flag other than 0 has been assigned to any of the kanji constituting the extracted kanji string (S302). Of the three characters from the beginning of the kanji compound word, 「副」 has a Flag of 4 (S302/No), so the extraction start position becomes 2 (S308). The sum Pos + Lw is now 5, which does not exceed the length of the kanji compound word, 8 (S309/No), so three characters (Lw = 3) are extracted from the new extraction start position (Pos = 2) and the flag check is repeated (S302). Of these three characters, 「副」 and 「腎」 have a Flag of 4 (S302/No), so the extraction start position becomes 3 (S308). Exactly the same steps are then repeated until the extraction start position reaches 5. When the extraction start position is 6, the sum Pos + Lw becomes 9, which exceeds the length of the kanji compound word, 8 (S309/Yes), so the number of extracted characters is reduced by one to two, and the length of the basic words to collate (Lw) becomes 2 (S310).

Because the number of extracted characters is not 0 (S311/No), the extraction start character is reset to the beginning of the kanji compound word (Pos = 1) (S312). Two characters (Lw = 2), the new number of extracted characters, are extracted from the beginning of the kanji compound word (Pos = 1), and it is determined whether a flag other than 0 has been assigned to any of the kanji constituting the extracted kanji string (S302). Because the Flags of both (= Lw) kanji from the beginning of the kanji compound word are 0 (S302/Yes), the extracted kanji string 「良性」 is collated against the two-character basic words in the Japanese dictionary (S303). Because a two-character basic word matches 「良性」 (S303/Yes), the part of speech (noun) is assigned to 「良性」 (S304), the position between the end of the extracted kanji string and the kanji immediately after it is determined as a delimiter position, and the Flag of each of the two kanji of 「良性」 is set to 2 (S305).

Thereafter, processing proceeds through steps 306, 307, 309 and 302. The result of step 302 is No, so processing proceeds to step 308, where the extraction start position is moved one character toward the end of the word and becomes 3. The loop of steps 309, 302 and 308 is then repeated. When the extraction start position is 6, two characters (Lw = 2) are extracted from that position (Pos = 6); because the Flags of both (= Lw) extracted kanji are 0 (S302/Yes), the extracted kanji string 「腫瘍」 is collated against the two-character basic words in the Japanese dictionary (S303). Because a two-character basic word matches 「腫瘍」 (S303/Yes), the part of speech (noun) is assigned to 「腫瘍」 (S304) and the Flag of each of the two kanji of 「腫瘍」 is set to 2 (S305). At this stage the Flag of every character is 2 or 4 (S306/Yes), so the division of the kanji compound word ends (良性 (noun) | 副腎皮質 (noun) | 腫瘍 (noun)).

In the processing above, if the compound contains a single kanji that is not among the one-character basic words in the Japanese dictionary, its Flag remains 0, step 311 eventually becomes true (S311/Yes), and processing ends. In that case, each single kanji whose Flag is 0 is judged to be an unknown word.
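The flag-based scan of Method 3 can be sketched as below. This is an illustrative reading of the walkthrough, not the patented flowchart itself: the function name `segment_with_flags` and the toy dictionary are assumptions, positions are 0-based, and the loop bound `pos + lw <= len(compound)` is a choice made for this sketch rather than a transcription of the Pos + Lw test in step S309.

```python
# Toy Japanese dictionary (hypothetical contents).
BASIC_WORDS = {"良性": "noun", "副腎皮質": "noun", "腫瘍": "noun"}


def segment_with_flags(compound, dictionary, max_len=4):
    """Scan the compound once per word length, longest first, marking
    matched characters with the matched length (the Flag of the text)."""
    n = len(compound)
    flags = [0] * n   # 0 = not yet covered by any basic word
    words = {}        # start index -> (kanji_string, part_of_speech)
    for lw in range(min(max_len, n), 0, -1):
        pos = 0
        while pos + lw <= n:
            window_free = not any(flags[pos:pos + lw])
            candidate = compound[pos:pos + lw]
            if window_free and candidate in dictionary:
                flags[pos:pos + lw] = [lw] * lw
                words[pos] = (candidate, dictionary[candidate])
                pos += lw   # jump past the matched word
            else:
                pos += 1    # slide the window by one character
        if all(flags):
            break           # every character is covered
    # Assemble the result; characters still flagged 0 are unknown words.
    segments, i = [], 0
    while i < n:
        if i in words:
            segments.append(words[i])
            i += len(words[i][0])
        else:
            segments.append((compound[i], None))
            i += 1
    return segments


print(segment_with_flags("良性副腎皮質腫瘍", BASIC_WORDS))
```

On the example compound, the four-character pass marks 「副腎皮質」 first, and the two-character pass then picks up 「良性」 and 「腫瘍」, in the same order as the walkthrough.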

(1) Division accuracy evaluation experiment 1
To measure the division accuracy of Method 1, Method 2 and Method 3 objectively, an evaluation experiment was performed according to the procedure shown in FIG. 6. Specifically, kanji compound words of 6 to 10 characters taken from a dictionary (6 characters: 7776 words, 7 characters: 4315 words, 8 characters: 2086 words, 9 characters: 1117 words, 10 characters: 543 words) were recorded in a kanji idiom file. For the 15837 kanji compound words recorded in the kanji idiom file, an automatic word division program was used to execute each of Method 1, Method 2 and Method 3 described above, dividing the kanji compound words and assigning parts of speech to the divided kanji strings. The Japanese dictionary and the word division pattern dictionary were files in the formats described above, and word division patterns for different character counts were kept from being compared with one another. The results were then compared by a judgment program against kanji compound words that had been divided by hand in advance, to check whether each division succeeded.
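As a rough illustration of the comparison step, the sketch below computes the success rate per compound length from a hand-divided reference and a program output. The function name `accuracy_by_length` and the two-item sample data are hypothetical; only the per-length word counts are taken from the text, and no measured accuracies are reproduced here.

```python
from collections import defaultdict

# Word counts per compound length as stated in the text.
WORD_COUNTS = {6: 7776, 7: 4315, 8: 2086, 9: 1117, 10: 543}
TOTAL_WORDS = sum(WORD_COUNTS.values())  # 15837, matching the stated total


def accuracy_by_length(gold, predicted):
    """Fraction of compounds whose predicted division equals the
    hand-made division, grouped by compound length."""
    totals, hits = defaultdict(int), defaultdict(int)
    for compound, reference in gold.items():
        totals[len(compound)] += 1
        if predicted.get(compound) == reference:
            hits[len(compound)] += 1
    return {length: hits[length] / totals[length] for length in totals}


# Hypothetical sample: one correct and one incorrect division.
gold = {
    "遠隔早期警戒": ["遠隔", "早期", "警戒"],
    "良性副腎皮質腫瘍": ["良性", "副腎皮質", "腫瘍"],
}
predicted = {
    "遠隔早期警戒": ["遠隔", "早期", "警戒"],
    "良性副腎皮質腫瘍": ["良性", "副腎", "皮質", "腫瘍"],
}
print(TOTAL_WORDS, accuracy_by_length(gold, predicted))
```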

Table 6 shows the results of dividing the kanji compound words with each of Method 1, Method 2 and Method 3. FIG. 7 plots the same results, with the number of characters in the kanji compound word on the horizontal axis and the probability of a successful division on the vertical axis.

The results show that, for all of Methods 1 to 3, kanji compound words were divided successfully with a very high probability of roughly 90% or more, apart from a few exceptions (Method 2 and Method 3 for 10-character kanji compound words). This demonstrates that the present invention can correctly divide kanji compound words contained in Japanese documents with very high accuracy, and that the resulting words are highly reliable.

(2) Division accuracy evaluation experiment 2
The division accuracy for kanji compound words of 6 to 10 characters was also obtained for the methods of Non-Patent Documents 2 to 4. Table 7 shows the division accuracy of Method 1 of the present invention and of the methods of Non-Patent Documents 2 to 4. Note, however, that the characteristics of the kanji compound words being divided are not the same for Method 1 of the present invention and for Non-Patent Documents 2 to 4.

Table 7 shows that Method 1 of the present invention is the most accurate for every length of kanji compound word. With Method 1 of the present invention, the division accuracy is 95% or more even for 10-character kanji compound words, whereas the methods of Non-Patent Documents 2 to 4 reach 94% or less at best. Furthermore, Method 1 of the present invention was evaluated on a total of 15,000 kanji compound words, a set several times larger than those used in Non-Patent Documents 2 to 4. Compared with Non-Patent Documents 2 to 4, the method of the present invention is therefore clearly the most effective in relative terms, not only for academic and patent databases but also for large-scale data such as the vast number of web pages on the Internet.

The present invention is useful not only for morphological analysis and syntactic analysis but also for, for example, web search engines, speech recognition, character recognition and kana-kanji conversion.

DESCRIPTION OF SYMBOLS
1 Japanese dictionary
2 Word division pattern dictionary
10 Kanji compound word dividing device
11 Extraction and collation means
12 Delimiter determination means
13 Unknown word delimiter determination means
14 Unknown word concatenation means
15 Adjacent word concatenation means
16 Delimiter position finalization means
20 Kanji compound word dividing device
21 Provisional division collation means
22 Division determination means
23 Unknown word division determination means
24 Division position finalization means
30 Kanji compound word dividing device
31 Kanji string extraction processing means
32 Flag assignment determination processing means
33 Basic word collation processing means
34 First collation result processing means
35 Second collation result processing means
36 Delimiter position finalization processing means

Claims

A kanji compound word dividing method characterized by dividing a kanji compound word to be divided with reference to: a Japanese dictionary in which a basic word, serving as the basis for dividing a kanji compound word composed of a continuous kanji string, is associated with the part of speech corresponding to that basic word, and in which both are recorded classified by the number of characters of the basic word; and a word division pattern dictionary in which a division pattern, representing the sequence of character counts of the kanji strings formed by dividing a kanji compound word, is associated with those part-of-speech sequence patterns, representing the sequence of parts of speech of those kanji strings, that occur for the division pattern, and in which both the division pattern and the part-of-speech sequence pattern are recorded.

2. A kanji compound word dividing method characterized by dividing a kanji compound word to be divided with reference to: a Japanese dictionary in which basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, are associated with the parts of speech corresponding to them, and in which both the basic words and the parts of speech are recorded; and a word division pattern dictionary in which each division pattern, indicating the arrangement of the character counts of the kanji strings formed by dividing a kanji compound word, is associated with those part-of-speech string patterns, each indicating the arrangement of the parts of speech corresponding to the kanji strings formed by dividing the kanji compound word, that occur for that division pattern, and in which both the division patterns and the part-of-speech string patterns are classified by the number of characters of the kanji compound word and recorded.
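As an illustration only, the two dictionaries recited in claims 1 and 2 could be held in memory roughly as follows. The function names, the tuple encoding of patterns, and the sample entries used with them are assumptions made for this sketch; the claims do not prescribe a storage format.

```python
from collections import defaultdict


def build_japanese_dictionary(entries):
    """Group (basic word, part of speech) pairs by the character count of the word."""
    by_length = defaultdict(dict)
    for word, part_of_speech in entries:
        by_length[len(word)][word] = part_of_speech
    return dict(by_length)


def build_pattern_dictionary(observations):
    """Map compound length -> division pattern -> set of part-of-speech patterns.

    Each observation pairs a division pattern such as (2, 2, 2) with the
    part-of-speech string pattern seen for it, e.g. ("noun", "noun", "noun").
    """
    patterns = defaultdict(lambda: defaultdict(set))
    for lengths, pos_pattern in observations:
        if len(lengths) != len(pos_pattern):
            raise ValueError("one part of speech is needed per kanji string")
        # The compound length is the sum of the character counts in the pattern.
        patterns[sum(lengths)][tuple(lengths)].add(tuple(pos_pattern))
    return {total: dict(by_pattern) for total, by_pattern in patterns.items()}
```

Classifying the Japanese dictionary by word length (claim 1) lets a lookup for an n-character string touch only the n-character bucket, and classifying the pattern dictionary by compound length (claim 2) does the same for candidate division patterns.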

3. The kanji compound word dividing method according to claim 1 or 2, characterized by comprising:
a first step of sequentially extracting, according to preset extraction character counts, a kanji string of the extraction character count starting from the first kanji of the kanji compound word or from the kanji immediately after the most recently determined delimiter position, and collating the extracted kanji string with the basic words by referring to the Japanese dictionary;
a second step of, when a basic word matching the kanji string extracted in the first step is found, checking whether a kanji follows the matched kanji string in the kanji compound word and, when one does, determining the position between the end of the matched kanji string and the kanji immediately after it as a delimiter position at which the kanji compound word is divided, and returning to the first step; and
a third step of, when no matching basic word is found for the kanji strings extracted with any of the preset extraction character counts, treating one extracted kanji as a one-character unknown word that does not exist in the Japanese dictionary and, when a kanji follows the extracted kanji, determining the position between the extracted kanji and the next kanji as a delimiter position at which the kanji compound word is divided, and returning to the first step.

4. The kanji compound word dividing method according to claim 3, wherein the preset number of extracted characters is set in descending order of the number of characters of the basic words recorded in the Japanese dictionary.
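The longest-match procedure of claims 3 and 4 might be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `split_longest_match`, the flat word-to-part-of-speech mapping, and the `"unknown"` label are all hypothetical.

```python
def split_longest_match(compound, dictionary, lengths=None):
    """Split `compound` into (kanji string, part of speech) pairs.

    `dictionary` maps basic words to parts of speech. `lengths` holds the
    extraction character counts; by default they are taken from the
    dictionary and tried in descending order (claim 4).
    """
    if lengths is None:
        lengths = sorted({len(word) for word in dictionary}, reverse=True)
    pieces = []
    start = 0  # first kanji, or the kanji just after the last delimiter
    while start < len(compound):
        for count in lengths:
            candidate = compound[start:start + count]
            # Skip counts that run past the end of the compound.
            if len(candidate) == count and candidate in dictionary:
                pieces.append((candidate, dictionary[candidate]))
                start += count  # delimiter after the matched string
                break
        else:
            # No count produced a match: one-character unknown word.
            pieces.append((compound[start], "unknown"))
            start += 1
    return pieces
```

For example, with a dictionary containing 情報処理 and 装置, the call `split_longest_match("情報処理装置", dictionary)` tries the four-character string first and so keeps 情報処理 whole before matching 装置.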

5. The kanji compound word dividing method according to claim 3, wherein the kanji compound word dividing method further includes a fourth step of concatenating two or more one-character unknown words.
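A possible reading of the fourth step is that runs of consecutive one-character unknown words are joined into a single unknown word. The sketch below assumes that reading, and assumes a hypothetical input format of (kanji string, label) pairs in which unknown words carry the label `"unknown"`; neither detail comes from the claim.

```python
from itertools import groupby


def merge_unknown_runs(pieces):
    """Join each run of adjacent unknown pieces into one unknown word."""
    merged = []
    for is_unknown, run in groupby(pieces, key=lambda piece: piece[1] == "unknown"):
        run = list(run)
        if is_unknown:
            # Concatenate the kanji of the whole run into a single unknown word.
            merged.append(("".join(text for text, _ in run), "unknown"))
        else:
            merged.extend(run)
    return merged
```

A lone unknown character passes through unchanged, so the function can be applied to any segmentation result without first checking whether unknown words are present.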

6. The kanji compound word dividing method according to claim 3, further including a fifth step of concatenating adjacent kanji strings that include the one-character unknown word.

7. The kanji compound word dividing method according to claim 1 or 2, characterized by comprising:
a sixth step of referring to the word division pattern dictionary, provisionally dividing the kanji compound word into a plurality of kanji strings according to the division patterns, taken one at a time in descending order of appearance frequency, and then collating every provisionally divided kanji string with the basic words by referring to the Japanese dictionary;
a seventh step of, when a matching basic word is found for every kanji string provisionally divided in the sixth step, determining the delimiter positions at which the kanji compound word is divided according to the division pattern for which every provisionally divided kanji string matched a basic word; and
an eighth step of, when no matching basic word is found for at least one of the kanji strings provisionally divided in the sixth step, determining each kanji string that does not exist in the Japanese dictionary to be an unknown word and checking whether provisional division has been performed for all division patterns, returning to the sixth step when it has not, and, when it has, determining the delimiter positions at which the kanji compound word is divided according to the division pattern that has the fewest unknown words and the highest appearance frequency.
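The sixth to eighth steps can be sketched as follows, assuming the caller has already sorted the division patterns by descending appearance frequency. The function name, the tuple encoding of patterns, and the fallback of returning the undivided compound when no pattern applies are assumptions of this sketch.

```python
def split_by_patterns(compound, dictionary, patterns_by_frequency):
    """Divide `compound` using division patterns, most frequent first.

    `patterns_by_frequency` is a list of tuples of character counts,
    e.g. [(2, 2, 2), (4, 2)], ordered by descending appearance frequency.
    """
    best = None  # (number of unknown words, pieces) of the best pattern so far
    for pattern in patterns_by_frequency:
        if sum(pattern) != len(compound):
            continue  # pattern belongs to a different compound length
        pieces, start = [], 0
        for count in pattern:  # provisional division
            pieces.append(compound[start:start + count])
            start += count
        unknown = sum(1 for piece in pieces if piece not in dictionary)
        if unknown == 0:
            return pieces  # every string matched a basic word
        # Strict "<" keeps the earlier, more frequent pattern on ties.
        if best is None or unknown < best[0]:
            best = (unknown, pieces)
    # Assumed fallback: leave the compound undivided if no pattern applied.
    return best[1] if best is not None else [compound]
```

Because the patterns arrive in frequency order, keeping the first pattern that reaches a given unknown-word count is one way to prefer the most frequent pattern among those with the fewest unknown words.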

8. The kanji compound word dividing method according to claim 1 or 2, characterized by comprising:
a ninth step of taking, as the extraction start position, which is the position of the first character of the kanji string to be extracted from the kanji compound word, either the beginning of the kanji compound word or the position of the most recently set extraction head character, and extracting from the kanji compound word a kanji string of the set extraction character count starting at the extraction start position;
a tenth step of determining whether a changed flag has been assigned to any kanji of the kanji string extracted in the ninth step and, when it has, changing the extraction head character to the character one position after the extraction start position and returning to the ninth step;
an eleventh step of referring to the Japanese dictionary and collating the kanji string extracted in the ninth step with the basic words;
a twelfth step of, when a basic word matching the kanji string extracted in the ninth step is found in the eleventh step, determining the position between the end of the matched kanji string and the kanji immediately after it as a delimiter position at which the kanji compound word is divided, changing the flag assigned to each kanji of the matched kanji string, and then checking whether at least as many kanji as the extraction character count follow the matched kanji string in the kanji compound word; when fewer kanji than the extraction character count follow it, reducing the extraction character count by one, resetting the extraction head character to the beginning of the kanji compound word, and returning to the ninth step; and when at least as many kanji as the extraction character count follow it, changing the extraction head character to the character located the extraction character count after the extraction start position and returning to the ninth step;
a thirteenth step of, when no basic word matching the kanji string extracted in the ninth step is found in the eleventh step, checking whether a kanji follows the unmatched kanji string in the kanji compound word; when no kanji follows it, reducing the extraction character count by one, resetting the extraction head character to the beginning of the kanji compound word, and returning to the ninth step; and when a kanji follows it, changing the extraction head character to the character one position after the extraction start position and returning to the ninth step; and
a fourteenth step of, when the changed flag has been assigned to every kanji of the kanji compound word or when the set extraction character count reaches 0, finalizing the delimiter positions determined in the twelfth step as the delimiter positions at which the kanji compound word is divided.

9. The kanji compound word dividing method according to claim 8, wherein, in the twelfth step, when a basic word matching the kanji string extracted in the ninth step is found in the eleventh step, the part of speech is assigned to the matched kanji string according to the Japanese dictionary after the delimiter position at which the kanji compound word is divided has been determined and before the flag assigned to each kanji of the matched kanji string is changed, and wherein, in the fourteenth step, a kanji to which the changed flag has not been assigned is determined to be a one-character unknown word.
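The flag-driven scan of claims 8 and 9 might look like the sketch below. Several details are assumptions rather than claim language: the names are hypothetical, a window that would run past the end of the compound is treated like the "no kanji follows" case, and the output is assembled from the matched spans instead of from a separate list of delimiter positions.

```python
def split_with_flags(compound, dictionary, max_count):
    """Scan with a shrinking window; flag kanji already covered by a match."""
    flags = [False] * len(compound)  # "changed flag" for each kanji
    matched = {}                     # start index -> (word, part of speech)
    count, start = max_count, 0
    while count > 0 and not all(flags):          # fourteenth-step exit test
        end = start + count
        if end > len(compound):                  # assumed: window overran
            count, start = count - 1, 0
            continue
        if any(flags[start:end]):                # tenth step
            start += 1
            continue
        candidate = compound[start:end]          # ninth step
        if candidate in dictionary:              # eleventh and twelfth steps
            matched[start] = (candidate, dictionary[candidate])
            flags[start:end] = [True] * count
            if len(compound) - end < count:
                count, start = count - 1, 0
            else:
                start = end
        elif end >= len(compound):               # thirteenth step, no kanji left
            count, start = count - 1, 0
        else:                                    # thirteenth step, shift by one
            start += 1
    pieces, i = [], 0
    while i < len(compound):
        if i in matched:
            pieces.append(matched[i])
            i += len(matched[i][0])
        else:                                    # unflagged kanji (claim 9)
            pieces.append((compound[i], "unknown"))
            i += 1
    return pieces
```

Unlike a strictly left-to-right longest match, this scan completes a full pass at each window size before shrinking it, so a long basic word in the middle of the compound is claimed before shorter words on either side of it.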

10. A kanji compound word dividing device characterized by comprising:
a Japanese dictionary in which basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, are associated with the parts of speech corresponding to them, classified by the number of characters of the basic word, and recorded together with those parts of speech;
a word division pattern dictionary in which each division pattern, indicating the arrangement of the character counts of the kanji strings formed by dividing a kanji compound word, is associated with those part-of-speech string patterns, each indicating the arrangement of the parts of speech corresponding to the kanji strings formed by dividing the kanji compound word, that occur for that division pattern, and in which both the division patterns and the part-of-speech string patterns are classified by the number of characters of the kanji compound word and recorded;
extraction collating means for sequentially extracting, according to preset extraction character counts, a kanji string of the extraction character count starting from the first kanji of the kanji compound word or from the kanji immediately after the most recently determined delimiter position, and collating the extracted kanji string with the basic words by referring to the Japanese dictionary;
delimiter determining means for, when a basic word matching the kanji string extracted by the extraction collating means is found, assigning the part of speech to the matched kanji string according to the Japanese dictionary and, when a kanji follows the matched kanji string, determining the position between the end of the matched kanji string and the kanji immediately after it as a delimiter position at which the kanji compound word is divided; and
unknown word delimiter determining means for, when no matching basic word is found for the kanji strings extracted by the extraction collating means with any of the preset extraction character counts, treating one extracted kanji as a one-character unknown word that does not exist in the Japanese dictionary and, when a kanji follows the extracted kanji, determining the position between the extracted kanji and the next kanji as a delimiter position at which the kanji compound word is divided.

11. The kanji compound word dividing device according to claim 10, further comprising unknown word concatenating means for concatenating two or more one-character unknown words.

12. The kanji compound word dividing device according to claim 10 or 11, further comprising adjacent word concatenating means for concatenating adjacent kanji strings that include the one-character unknown word.

13. The kanji compound word dividing device according to any one of claims 10 to 12, further comprising delimiter position determining means for finalizing, with reference to the word division pattern dictionary, the determined delimiter position as the delimiter position at which the kanji compound word is divided.
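The claim does not spell out how the word division pattern dictionary is consulted when a delimiter position is finalized. One plausible reading, shown purely as an assumption, is a lookup that checks whether the resulting division pattern and part-of-speech string pattern are recorded together. The function name and the nested-dictionary layout (compound length, then division pattern, then a set of part-of-speech patterns) are hypothetical.

```python
def is_recorded_division(pieces, pattern_dictionary):
    """Return True if the division of `pieces` is recorded in the dictionary.

    `pieces` is a list of (kanji string, part of speech) pairs, and
    `pattern_dictionary` maps compound length -> division pattern ->
    set of part-of-speech string patterns.
    """
    lengths = tuple(len(text) for text, _ in pieces)
    pos_pattern = tuple(part_of_speech for _, part_of_speech in pieces)
    # Look only in the bucket for this compound length.
    by_pattern = pattern_dictionary.get(sum(lengths), {})
    return pos_pattern in by_pattern.get(lengths, set())
```

Under this reading, a division whose pattern is not recorded could be flagged for fallback handling; what that handling would be is outside the claim text.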

14. A kanji compound word dividing device characterized by comprising:
a Japanese dictionary in which basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, are associated with the parts of speech corresponding to them, classified by the number of characters of the basic word, and recorded together with those parts of speech;
a word division pattern dictionary in which each division pattern, indicating the arrangement of the character counts of the kanji strings formed by dividing a kanji compound word, is associated with those part-of-speech string patterns, each indicating the arrangement of the parts of speech corresponding to the kanji strings formed by dividing the kanji compound word, that occur for that division pattern, and in which both the division patterns and the part-of-speech string patterns are classified by the number of characters of the kanji compound word and recorded;
provisional division collating means for referring to the word division pattern dictionary, provisionally dividing the kanji compound word into a plurality of kanji strings according to the division patterns, taken one at a time in descending order of appearance frequency, and then collating every provisionally divided kanji string with the basic words by referring to the Japanese dictionary;
division determination means for, when a matching basic word is found for every kanji string provisionally divided by the provisional division collating means, assigning parts of speech according to the Japanese dictionary to all the provisionally divided kanji strings that matched basic words, and determining the division positions at which the kanji compound word is divided according to the division pattern for which every provisionally divided kanji string matched a basic word; and
unknown word division determining means for, when no matching basic word is found for at least one of the kanji strings provisionally divided by the provisional division collating means, determining each kanji string that does not exist in the Japanese dictionary to be an unknown word and, when every division pattern leaves at least one provisionally divided kanji string without a matching basic word, determining the division positions at which the kanji compound word is divided according to the division pattern that has the fewest unknown words and the highest appearance frequency.

15. The kanji compound word dividing device according to claim 14, further comprising division position determination means for finalizing, with reference to the word division pattern dictionary, the determined division position as the division position at which the kanji compound word is divided.

16. A kanji compound word dividing device characterized by comprising:
a Japanese dictionary in which basic words, which serve as the basis for dividing a kanji compound word composed of a continuous kanji string, are associated with the parts of speech corresponding to them, classified by the number of characters of the basic word, and recorded together with those parts of speech;
kanji string extraction processing means for taking, as the extraction start position, which is the position of the first character of the kanji string to be extracted from the kanji compound word, either the beginning of the kanji compound word or the position of the most recently set extraction head character, and extracting from the kanji compound word a kanji string of the set extraction character count starting at the extraction start position;
flag assignment determination processing means for determining whether a changed flag has been assigned to any kanji of the kanji string extracted by the kanji string extraction processing means and, when it has, changing the extraction head character to the character one position after the extraction start position and returning to the kanji string extraction processing means;
basic word collation processing means for referring to the Japanese dictionary and collating the kanji string extracted by the kanji string extraction processing means with the basic words;
first collation result processing means for, when the basic word collation processing means finds a basic word matching the kanji string extracted by the kanji string extraction processing means, assigning the part of speech to the matched kanji string according to the Japanese dictionary, determining the position between the end of the matched kanji string and the kanji immediately after it as a delimiter position at which the kanji compound word is divided, changing the flag assigned to each kanji of the matched kanji string, and then checking whether at least as many kanji as the extraction character count follow the matched kanji string in the kanji compound word; when fewer kanji than the extraction character count follow it, reducing the extraction character count by one, resetting the extraction head character to the beginning of the kanji compound word, and returning to the kanji string extraction processing means; and when at least as many kanji as the extraction character count follow it, changing the extraction head character to the character located the extraction character count after the extraction start position and returning to the kanji string extraction processing means;
second collation result processing means for, when the basic word collation processing means finds no basic word matching the kanji string extracted by the kanji string extraction processing means, checking whether a kanji follows the unmatched kanji string in the kanji compound word; when no kanji follows it, reducing the extraction character count by one, resetting the extraction head character to the beginning of the kanji compound word, and returning to the kanji string extraction processing means; and when a kanji follows it, changing the extraction head character to the character one position after the extraction start position and returning to the kanji string extraction processing means; and
delimiter position determination processing means for, when the changed flag has been assigned to every kanji of the kanji compound word or when the set extraction character count reaches 0, finalizing the delimiter positions determined by the first collation result processing means as the delimiter positions at which the kanji compound word is divided, and determining each kanji to which the changed flag has not been assigned to be a one-character unknown word.