JPH0289157A - Japanese language morpheme analyzing system

Info

Publication number: JPH0289157A
Application number: JP63239956A
Authority: JP
Inventors: Shiyouichi Sasabe; 佐々部　昭一
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 1988-09-26
Filing date: 1988-09-26
Publication date: 1990-03-29

Abstract

PURPOSE: To enable highly accurate analysis by uniform processing, including the word segmentation of compound words made up of consecutive kanji (Chinese characters), by selecting a word string through a decision based on the cumulative degree of connection or the mean degree of connection of each candidate word string. CONSTITUTION: A word dictionary search unit 6 retrieves from a word dictionary 5 the words that match starting from the head of the search character string. A morphological connection grammar characteristic correspondence table search unit 10 uses the morphological information of each retrieved word to look up its connection grammar information in a morphological connection grammar characteristic correspondence table 9 and attaches that information to the word. A connection matrix table search unit 11 looks up each of these words in a connection matrix table 12, using the word's forward connection grammatical characteristic and the backward connection grammatical characteristic of the word preceding it, and obtains the degree of connection between the two words. When there are plural candidates, a word string candidate setting unit 3 selects the word string through a decision based on the cumulative degree of connection of each candidate word string or its mean degree of connection per word. Highly accurate morphological analysis can thus be performed by uniform processing, even for the word segmentation of compound words made up of consecutive kanji.

Description

[Detailed Description of the Invention]

[Technical Field]

The present invention relates to morphological analysis of Japanese text, and in particular to processing that divides text into morphemes (words) easily and accurately by using the two-bunsetsu longest method together with the degree of connection between words.

[Prior Art]

Generally known methods for morphologically analyzing Japanese text and extracting morphemes include the longest match method, the two-bunsetsu longest method, and the minimum bunsetsu count method. They are used in kana-kanji conversion in word processors and similar systems. However, these methods have low accuracy in the word segmentation of compound words made up of consecutive kanji. To compensate for this drawback, recent approaches apply a special analysis to such compounds, using statistical methods or the semantic classification of words. The statistical methods require statistics to be collected and processed in advance for a large number of kanji character types, and using word meanings requires an appropriate semantic classification to be assigned in advance to each word in the word dictionary.

Conventionally, the connection matrix has been used within the above methods to judge whether two consecutive words can connect, or to select one word from among several candidate words that follow a single preceding word.

Thus, to perform accurate morphological analysis of Japanese text with the conventional approach, an analysis method such as the longest match method, the two-bunsetsu longest method, or the minimum bunsetsu count method must be combined with the special compound-word processing described above. The analysis therefore cannot be carried out by uniform processing, and the cost is generally high.

[Purpose]

The present invention was made in view of the circumstances described above. In particular, each element of the connection matrix table holds a degree of connection, weighted to express how easily two items connect grammatically. This degree of connection is used not only to judge whether two consecutive words can connect, but also to select the bunsetsu candidate whose word sequence (word string) is appropriate. The object is to allow accurate morphological analysis by uniform processing, including the word segmentation of compound words made up of consecutive kanji, and to lower the cost of the processing as a whole.

[Constitution]

To achieve the above object, the present invention applies to processing that divides an input Japanese sentence into words or morphemes by matching it against the words in a word dictionary prepared in advance. It comprises: a morphological connection grammar characteristic correspondence table that associates a word's morphological information (part of speech, conjugation form, and the like) with its connection grammar information (a forward connection grammatical characteristic, which describes the word's grammatical behavior toward the word immediately before it, and a backward connection grammatical characteristic, which describes its behavior toward the word immediately after it); a connection matrix that gives the degree of connection between a backward connection grammatical characteristic and a forward connection grammatical characteristic; dictionary search processing that obtains the morphological information of words by matching the input character string against the word character strings in the word dictionary; correspondence table search processing that looks up connection grammar information from the morphological information extracted by the dictionary search; connection matrix search processing that looks up the degree of connection between the backward connection grammatical characteristic of the preceding word and the forward connection grammatical characteristic of the following word; and connection judgment processing that judges whether words can connect on the basis of the degree of connection obtained by the connection matrix search.

The input Japanese character string is searched against the dictionary sequentially from its head to extract connectable word string candidates. When there are plural candidates, a word string is selected by a judgment that uses at least the cumulative degree of connection of the candidate word string or its mean degree of connection per word. More specifically, the invention includes word string candidate determination processing that extracts connectable word string candidates in the same way and, when there are plural candidates, determines the word string by the two-bunsetsu longest criterion. When this processing does not determine the word string uniquely, the word string is selected by a judgment that uses the cumulative degree of connection of the two-bunsetsu candidate word strings or their mean degree of connection per word. The invention is described below on the basis of an embodiment.
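The two measures named above, cumulative degree of connection and mean degree of connection per word, can be illustrated with a minimal Python sketch. The candidate segmentations and every numeric degree below are hypothetical, as are the function names; the sketch also assumes that each word carries the degree of connection to the word before it (including the first word, relative to the start of the text), which the source does not spell out.

```python
from typing import Callable, List, Tuple

# One candidate word string: (surface form, degree of connection to the
# preceding word). All degrees here are invented for illustration.
Candidate = List[Tuple[str, int]]


def cumulative_connection(candidate: Candidate) -> int:
    """Sum of the degrees of connection over all words in the candidate."""
    return sum(degree for _, degree in candidate)


def mean_connection(candidate: Candidate) -> float:
    """Cumulative degree of connection divided by the number of words."""
    return cumulative_connection(candidate) / len(candidate)


def select_by(candidates: List[Candidate],
              measure: Callable[[Candidate], float]) -> Candidate:
    """Pick the candidate with the largest value of the given measure."""
    return max(candidates, key=measure)


# Two hypothetical segmentations of the same four-kanji compound.
cand_a: Candidate = [("東京", 3), ("都庁", 4)]
cand_b: Candidate = [("東", 3), ("京都", 1), ("庁", 2)]

best = select_by([cand_a, cand_b], mean_connection)
print([surface for surface, _ in best])
```

With these made-up values both measures prefer the same candidate; in general they can disagree, because the mean normalizes by the number of words.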

FIG. 1 is a block diagram for explaining one embodiment of the present invention. In the figure, 1 is a Japanese character string input unit, 2 is a search character string setting unit, 3 is a word string candidate setting unit, 4 is a word string candidate selection unit, 5 is a word dictionary, 6 is a word dictionary search unit, 7 is a connection judgment unit, 8 is a result output unit, 9 is a morphological connection grammar characteristic correspondence table, 10 is a morphological connection grammar characteristic correspondence table search unit, 11 is a connection matrix table search unit, and 12 is a connection matrix table.

From the Japanese character string entered through the Japanese character string input unit 1, the search character string setting unit 2 sets the character string needed for dictionary search as the dictionary search character string and outputs it to the word dictionary search unit 6. The word dictionary search unit 6 retrieves from the word dictionary 5 the words that match starting from the head of the search character string. For each word obtained, the correspondence table search unit 10 uses the word's morphological information to look up connection grammar information in the morphological connection grammar characteristic correspondence table and attaches it to the word. The connection matrix table search unit 11 looks up the connection matrix table 12 for each of these words, using the word's forward connection grammatical characteristic and the backward connection grammatical characteristic of the word preceding it, and obtains the degree of connection between the two words. The connection judgment unit 7 treats a degree of connection of 0 as not connectable and rejects such words; the remaining words are output, with their degrees of connection attached, to the word string candidate setting unit 3 as candidates. In the word string candidate setting unit 3, when word strings exceed the prescribed selection criterion and there are plural candidates, the word string candidate selection unit 4 selects an appropriate candidate, which becomes the word string candidate. Here the selection criterion is, for example in the two-bunsetsu longest method, that the word string has reached a length that makes up two bunsetsu.
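The flow through units 6, 10, 11, and 7 can be sketched roughly as follows. This is an illustrative Python sketch under stated assumptions, not the patent's implementation: the dictionary entries, the feature labels (`N_front`, `BOS`, and so on), the matrix values, and the function name `enumerate_candidates` are all hypothetical, each surface form is given only one part of speech, and the whole string is segmented at once rather than two bunsetsu at a time.

```python
from typing import Dict, List, Tuple

# Word dictionary (5): surface form -> morphological information (hypothetical).
WORD_DICT: Dict[str, str] = {
    "東": "noun", "東京": "noun", "京都": "noun", "都": "suffix", "へ": "particle",
}

# Correspondence table (9): morphological information ->
# (forward connection characteristic, backward connection characteristic).
FEATURES: Dict[str, Tuple[str, str]] = {
    "noun": ("N_front", "N_back"),
    "suffix": ("S_front", "S_back"),
    "particle": ("P_front", "P_back"),
}

# Connection matrix table (12): (backward characteristic of the preceding word,
# forward characteristic of this word) -> degree of connection. Missing pairs
# count as 0, i.e. not connectable. "BOS" marks the start of the text.
CONNECT: Dict[Tuple[str, str], int] = {
    ("BOS", "N_front"): 3, ("N_back", "N_front"): 1, ("N_back", "S_front"): 4,
    ("N_back", "P_front"): 5, ("S_back", "P_front"): 5,
}

Candidate = List[Tuple[str, int]]  # (surface form, degree of connection)


def enumerate_candidates(text: str, prev_back: str = "BOS") -> List[Candidate]:
    """List every segmentation of text whose adjacent words all connect."""
    if not text:
        return [[]]
    results: List[Candidate] = []
    for end in range(1, len(text) + 1):
        surface = text[:end]
        pos = WORD_DICT.get(surface)                 # dictionary search (6)
        if pos is None:
            continue
        front, back = FEATURES[pos]                  # table search (10)
        degree = CONNECT.get((prev_back, front), 0)  # matrix search (11)
        if degree == 0:                              # connection judgment (7)
            continue
        for rest in enumerate_candidates(text[end:], back):
            results.append([(surface, degree)] + rest)
    return results


for cand in enumerate_candidates("東京都へ"):
    print(cand)
```

Keeping the degree of connection alongside each word means later selection steps can compare candidates without repeating the matrix lookup.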

In the two-bunsetsu longest method, the first bunsetsu of the first-ranked candidate word string chosen by the candidate selection unit is fixed as an analysis result, and the word strings whose words form that same bunsetsu are set as the next word string candidates. Analysis proceeds in this way, and when the entire input character string has been analyzed, the analysis result is output to the result output unit 8.

Except for the candidate selection unit, a known configuration such as that of the two-bunsetsu longest method can be used, provided that the result of the connection matrix search can be retained for each word.

Next, an example of the processing of the candidate selection unit is described with reference to the flowchart of FIG. 2. Plural word string candidates are input to this processing, which narrows them down to one according to the selection conditions shown in the figure. First, the word strings with the largest number of characters are selected from among the candidates. If plural candidates remain, the number of words in each word string is compared and the word strings with the fewest words are selected. If plural candidates still remain, the sums of all the degrees of connection between the words in each word string are compared and the word string with the largest sum is selected. Here the degree of connection between words is the value obtained by the connection matrix search described above: 0 indicates that connection is not possible, and a larger value indicates that the words connect or combine more easily and are more likely to be connected. If plural candidates remain even then, the sums of the frequencies of the words in each word string are compared and the word string with the largest sum is selected.
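The four-step narrowing described above can be sketched in Python as a cascade of filters. The data layout, the names `keep_best` and `select_candidate`, and all sample values are hypothetical. The sketch also assumes a non-empty candidate list and returns the first remaining candidate if all four steps still tie, a case the source does not address.

```python
from typing import Callable, List, Tuple

# (surface form, degree of connection to the preceding word, word frequency)
Word = Tuple[str, int, int]
Candidate = List[Word]


def keep_best(cands: List[Candidate], key: Callable[[Candidate], int],
              largest: bool) -> List[Candidate]:
    """Keep only the candidates that share the best value of key."""
    pick = max if largest else min
    target = pick(key(c) for c in cands)
    return [c for c in cands if key(c) == target]


def select_candidate(cands: List[Candidate]) -> Candidate:
    """Narrow the candidates step by step until one remains."""
    steps = [
        (lambda c: sum(len(w[0]) for w in c), True),   # 1. most characters
        (lambda c: len(c), False),                     # 2. fewest words
        (lambda c: sum(w[1] for w in c), True),        # 3. largest cumulative degree of connection
        (lambda c: sum(w[2] for w in c), True),        # 4. largest cumulative frequency
    ]
    for key, largest in steps:
        if len(cands) == 1:
            break
        cands = keep_best(cands, key, largest)
    return cands[0]  # assumption: first one wins if still tied


# Hypothetical candidates with equal character and word counts.
cand_a: Candidate = [("東京", 3, 10), ("都", 4, 5), ("へ", 5, 50)]
cand_b: Candidate = [("東", 3, 2), ("京都", 1, 8), ("へ", 5, 50)]

chosen = select_candidate([cand_b, cand_a])
print([w[0] for w in chosen])
```

Here the first two steps cannot separate the candidates, so the cumulative degree of connection (12 against 9) decides.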

The selection by the candidate selection unit may also be made by defining a determination formula. In the simplest case, the word string with the largest value of Y in the following formula is selected.

Y = w_1 X_1 + w_2 X_2 + w_3 X_3 + w_4 X_4

where
X_1: number of characters in the word string
X_2: number of words in the word string
X_3: cumulative degree of connection of the word string
X_4: cumulative frequency of the word string
w_1 to w_4: weighting coefficients

[Effects]

As is clear from the above description, according to the present invention, in processing that divides an input Japanese sentence into words or morphemes by matching it against the words in a word dictionary prepared in advance, when there are plural word string candidates the word string is selected by a judgment that uses the cumulative degree of connection or the mean degree of connection of the candidate word strings. The word segmentation of compound words made up of consecutive kanji can therefore also be analyzed accurately by uniform processing, and the processing as a whole can be realized at low cost.
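The determination formula Y = w_1 X_1 + w_2 X_2 + w_3 X_3 + w_4 X_4 given earlier can be sketched in Python as follows. The source does not state any values for the weighting coefficients; the defaults below are purely hypothetical, and the negative weight on the word count is an assumption chosen to mirror the flowchart's preference for fewer words. The function name `score` and the sample data are likewise invented.

```python
from typing import List, Sequence, Tuple

# (surface form, degree of connection to the preceding word, word frequency)
Word = Tuple[str, int, int]
Candidate = List[Word]


def score(cand: Candidate,
          w: Sequence[float] = (1.0, -1.0, 1.0, 0.1)) -> float:
    """Weighted sum Y of the four quantities X1..X4 for one candidate."""
    x1 = sum(len(surface) for surface, _, _ in cand)  # X1: number of characters
    x2 = len(cand)                                    # X2: number of words
    x3 = sum(degree for _, degree, _ in cand)         # X3: cumulative degree of connection
    x4 = sum(freq for _, _, freq in cand)             # X4: cumulative frequency
    return w[0] * x1 + w[1] * x2 + w[2] * x3 + w[3] * x4


# Hypothetical candidates for the same input.
cand_a: Candidate = [("東京", 3, 10), ("都", 4, 5), ("へ", 5, 50)]
cand_b: Candidate = [("東", 3, 2), ("京都", 1, 8), ("へ", 5, 50)]

best = max([cand_a, cand_b], key=score)
print([surface for surface, _, _ in best])
```

Unlike a step-by-step narrowing, a single weighted score lets a strong value in one quantity offset a weak value in another, so the choice of weights determines how the criteria trade off.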

[Brief explanation of drawings]

FIG. 1 is a block diagram for explaining one embodiment of the present invention, and FIG. 2 is a flowchart for explaining an example of the processing of the candidate selection unit.

1... Japanese character string input unit, 2... Search character string setting unit, 3... Word string candidate setting unit, 4... Word string candidate selection unit, 5... Word dictionary, 6... Word dictionary search unit, 7... Connection judgment unit, 8... Result output unit, 9... Morphological connection grammar characteristic correspondence table, 10... Morphological connection grammar characteristic correspondence table search unit, 11... Connection matrix table search unit, 12... Connection matrix table.

Claims

1. A Japanese morphological analysis method for processing that divides an input Japanese sentence into words or morphemes by matching it against the words in a word dictionary prepared in advance, comprising: a morphological connection grammar characteristic correspondence table that associates the morphological information of a word with its connection grammar information; a connection matrix that gives the degree of connection between a forward connection grammatical characteristic and a backward connection grammatical characteristic; dictionary search processing that obtains the morphological information of words by matching the input character string against the word character strings in the word dictionary; correspondence table search processing that looks up connection grammar information from the morphological information extracted by the dictionary search; connection matrix search processing that looks up the degree of connection between the backward connection grammatical characteristic of the preceding word and the forward connection grammatical characteristic of the following word; and connection judgment processing that judges whether words can connect on the basis of the degree of connection obtained by the connection matrix search; wherein the input Japanese character string is searched against the dictionary sequentially from its head to extract connectable word string candidates and, when there are plural candidates, a word string is selected by a judgment that uses at least the cumulative degree of connection of the candidate word strings or the mean degree of connection per word of the candidate word strings.