JP2684138B2

JP2684138B2 - Japanese morphological analysis system and headline extraction method

Info

Publication number: JP2684138B2
Application number: JP4273836A
Authority: JP
Inventors: 秀憲青沢; 朗高木
Original assignee: 株式会社シーエスケイ
Priority date: 1992-09-17
Filing date: 1992-09-17
Publication date: 1997-12-03
Anticipated expiration: 2012-12-03
Also published as: JPH06103309A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of Industrial Application] The present invention relates to Japanese morphological analysis performed as part of processing such as parsing, and in particular to a Japanese morphological analysis system characterized by the method it uses to cut out, from an input sentence, the substring corresponding to a morpheme to be identified.

[0002]

[Prior Art] A morphological analysis system used in a Japanese parser or the like generally comprises a dictionary storing various information about morphemes and a dictionary search unit that searches the dictionary. The system divides an input Japanese sentence into appropriate morphemes (words) while referring to the information stored in the dictionary, attaches predetermined information to each morpheme, and passes the result to a syntactic analysis system or the like.

[0003] Conventionally, the dictionary search unit of such a morphological analysis system generally searched by the first character of a morpheme heading. The advantages of this search method were that few searches are needed and that the processing algorithm can be kept relatively simple.

[0004]

[Problems to Be Solved by the Invention] However, conventional morphological analysis systems based on the dictionary search method described above were not necessarily efficient. In a large dictionary with many registered words, many words inevitably share the same first character (there are far fewer kinds of kana than of kanji, so the tendency is especially pronounced in dictionaries with many kana headings), and when the search is made by the first character of the heading, the amount of morpheme information returned at once becomes enormous. This mainly caused the following problems.

[0005] 1) Insufficient memory for the output information. The more information is retrieved at once, the more the main memory area runs short, which could slow down the analysis.

[0006] 2) Wasted matching against the output information. As the amount of information retrieved at once grows, the matching work inevitably grows with it. The more useless search results there are, the more matching is wasted, which could degrade analysis efficiency.

[0007] 3) Retrieval of meaningless information in morphological analysis that uses the longest-match method. With this search method, if the input sentence is "ABCDEFG" and the correct first morpheme to be identified is "ABCD", the search returns not only "ABCD" but also any shorter morpheme that begins with "A", such as "ABC" or "AB". In a morphological analysis system that uses the longest-match method, a common morpheme segmentation technique, to obtain a single optimal segmentation, these shorter morphemes never need to be output as analysis results, so retrieving them is meaningless and wasteful.

[0008] 4) Retrieval of impossible information. When the input sentence is "ABCDEFG", morphemes longer than the input, such as "ABCDEFGHIJ", are retrieved. Likewise, even when the input sentence is "ABCD, FG", with a delimiter or similar character after "D", morphemes that are longer than could possibly match, such as "ABCDFGD", are retrieved.

[0009] 5) Wasted search time for the output information. Because a single search retrieves every matching morpheme, the number of searches is small, but the useless retrieval described in 2) to 4) above also costs useless search time.
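The problems listed above can be made concrete with a toy example. The following is a minimal sketch, assuming a hypothetical in-memory list of headings and hypothetical helper names (`search_by_first_char`, `classify`); it only illustrates how much of a first-character search result is unusable and is not code from any conventional system.

```python
# Toy illustration of first-character dictionary search (hypothetical data).
DICTIONARY = ["A", "AB", "ABC", "ABCD", "ABCDEFGHIJ", "AXY", "B", "BCD"]


def search_by_first_char(first_char, dictionary=DICTIONARY):
    """Return every heading that starts with first_char, as a first-character search would."""
    return [h for h in dictionary if h.startswith(first_char)]


def classify(sentence, hits):
    """Split the hits into headings that prefix the sentence and headings that cannot match."""
    matching = [h for h in hits if sentence.startswith(h)]
    impossible = [h for h in hits if not sentence.startswith(h)]
    longest = max(matching, key=len) if matching else None
    return longest, matching, impossible


hits = search_by_first_char("A")
longest, matching, impossible = classify("ABCDEFG", hits)
print(len(hits), longest, impossible)
```

Of the six headings returned for "A", only "ABCD" is needed under the longest-match method; "A", "AB", and "ABC" correspond to problem 3), and "ABCDEFGHIJ" corresponds to problem 4).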

[0010] All of the problems above stem from searching the dictionary by the first character of the morpheme heading, so they can be eliminated by searching the dictionary with the morpheme heading specified in full spelling.

[0011] However, it is usually difficult to determine efficiently the range of the substring of the input Japanese sentence that corresponds to one morpheme heading (hereinafter, this substring is called the "heading-corresponding character string"). Consequently, morphological analysis that is based on the longest-match method and searches the dictionary by full spelling ends up cutting out candidate heading-corresponding character strings by trial and error, longest range first, and searching the dictionary for each one. The number of dictionary searches therefore grows at an accelerating rate as the sentence gets longer.
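The trial-and-error behavior described here can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `naive_longest_match`, the `max_len` cap, and the handling of unknown characters are hypothetical, and ending inflection is ignored entirely.

```python
def naive_longest_match(sentence, dictionary, max_len=8):
    """Segment by longest match, looking up every candidate substring in full spelling.

    Returns the segmentation and the number of dictionary lookups performed.
    A character for which nothing is found is emitted as-is (a simplification).
    """
    lookups = 0
    pos = 0
    result = []
    while pos < len(sentence):
        limit = min(max_len, len(sentence) - pos)
        for length in range(limit, 0, -1):  # longest candidate first
            lookups += 1
            candidate = sentence[pos:pos + length]
            if candidate in dictionary:
                break
        else:
            candidate = sentence[pos]  # nothing found: treat one character as unknown
        result.append(candidate)
        pos += len(candidate)
    return result, lookups


print(naive_longest_match("ABCDEFG", {"ABCD", "EFG"}))
```

Every candidate length that is tried costs one full-spelling lookup, whether or not a heading of that length could exist, which is the cost the invention sets out to reduce.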

[0012] Because each dictionary search physically takes a certain amount of time regardless of how many searches are made, adopting full-spelling dictionary search could require an enormous analysis time for long sentences. In practice, therefore, it was common to adopt dictionary search by the first character of the heading after all, and to try to improve analysis efficiency by refining the matching procedure and the like.

[0013] The problem with full-spelling dictionary search described above arises because the range of the heading-corresponding character string cannot be determined efficiently.

[0014] The task, therefore, is to adopt a method that searches the dictionary with the morpheme heading specified in full spelling while also providing a means of determining the range of the heading-corresponding character string efficiently, thereby reducing the number of dictionary searches.

[0015] The Japanese morphological analysis system of the present invention presupposes that the input sentence is Japanese. Its object is to solve the above problems by cutting out heading-corresponding character strings efficiently, using information such as the lengths of the inflected heading strings, that is, the strings that the headings of the morphemes registered in the dictionary can take when their endings are inflected.

[0016]

[Means for Solving the Problems] To achieve the above object, the Japanese morphological analysis system of the present invention comprises a dictionary in which morpheme information can be searched by the full spelling of a morpheme heading. The system cuts out substrings of the input Japanese character string in order, starting from a predetermined range at the head of the sentence, transforms the ending of each substring, and identifies the morpheme corresponding to the cut-out substring while searching the dictionary with the heading, in the form registered in the dictionary, produced by that ending transformation. The system is characterized by comprising: a heading length existence determination table that stores, for all inflected heading strings that all morphemes registered in the dictionary can take, information on the lengths of those strings, kept separately at least for each group of inflected heading strings sharing the same first character; and a heading cutout unit that, when a substring is to be cut out from the Japanese character string, refers to the information in the heading length existence determination table corresponding to at least the first character of the substring, and cuts out the substring only when it determines that the dictionary contains a morpheme able to take an inflected heading string that matches the substring in at least its first character and also in its length.

[0017] In the above invention, the heading length existence determination table is characterized in that all inflected heading strings that all morphemes registered in the dictionary can take are divided into groups at least by their first character, and in that the table stores, for each group, heading length existence information, created by tallying the lengths of the inflected heading strings in the group, for determining whether an inflected heading string of a particular length belongs to the group, together with the maximum length of the inflected heading strings belonging to the group.

[0018] The heading cutout method of the invention is used in a Japanese morphological analysis system that cuts out substrings of the input Japanese character string in order, starting from a predetermined range at the head of the sentence, transforms their endings, and identifies the morpheme corresponding to each cut-out substring while searching the dictionary with the full-spelling heading, in the form registered in the dictionary, produced by the ending transformation. The method is used to cut out, from the Japanese character string, the substring covering the range that corresponds to one morpheme heading, and is characterized by comprising: a step of retrieving, when a substring is to be cut out from the Japanese character string, the information corresponding to that substring from a heading length existence determination table in which information on the lengths of all inflected heading strings that all morphemes registered in the dictionary can take is stored, distinguished at least by the first character of the inflected heading string; a step of determining, by reference to the retrieved information, whether the dictionary contains a morpheme able to take an inflected heading string that matches the substring in at least its first character and also in its length; and a step of cutting out the substring only when it is determined that the dictionary contains a morpheme able to take an inflected heading string that may match the substring.

[0019]

[Embodiment] An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a morphological analysis system according to one embodiment of the present invention.

[0020] As illustrated, the morphological analysis system of this embodiment comprises: a dictionary 10 in which morphemes are registered together with various information about them and which can be searched by the full spelling of a morpheme heading; a heading cutout unit 2 that cuts out a character string of a predetermined range from the input sentence as a presumed heading-corresponding character string; an ending inflection unit 3 that changes the ending of the heading-corresponding character string cut out by the heading cutout unit 2 to produce a heading in the form registered in the dictionary 10 (hereinafter, an "ending-inflected heading"); a dictionary search unit 4 that searches the dictionary 10 for the ending-inflected heading produced by the ending inflection unit 3; a connection determination unit 5 that determines whether adjacent morphemes can be connected; and a morpheme identification control unit 1 that controls the heading cutout unit 2, the ending inflection unit 3, the dictionary search unit 4, and the connection determination unit 5. The system also comprises a heading length existence determination table 21, referred to in the processing of the heading cutout unit 2, and an ending inflection table 31, referred to by the ending inflection unit 3.

[0021] The dictionary 10 stores morpheme information. Each morpheme heading is registered in its terminal form if the morpheme inflects, and as is if it does not. The morpheme information stored in the dictionary can be retrieved by specifying the full spelling of the morpheme heading.

[0022] The morpheme identification control unit 1 focuses on the characters of the input sentence in order from the head of the sentence, and identifies morphemes in order from the head of the sentence through the series of processes performed by the heading cutout unit 2 through the connection determination unit 5. The morpheme identification control unit 1 includes a backtrack processing unit, an unknown word processing unit, and a morpheme determination unit (none of which are shown).

[0023] When the preceding morpheme has been identified, or when processing of the input sentence is just beginning, the morpheme identification control unit 1 newly focuses on the character following the range already determined, and passes the heading cutout unit 2 the character string from the focused character onward together with the position of the focused character. When a morpheme containing the focused character could not be identified because the dictionary search unit 4 found no suitable morpheme, or because the connection determination unit 5 could not connect any morpheme, the control unit again passes the heading cutout unit 2 the character string from the focused character onward and the position of the focused character.

[0024] If, on the other hand, the heading cutout unit 2 reports that no cutout is possible, the control unit starts backtrack processing. If backtracking succeeds, it focuses on the character following the range identified by backtrack processing. If backtracking fails, it starts unknown word processing and then focuses on the character following the range identified by unknown word processing.
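The branching performed by the morpheme identification control unit 1 can be summarized as a small decision function. This is only a sketch: the function name `next_step`, its parameters, and the action labels are hypothetical, and the backtrack and unknown word processing themselves are not modeled.

```python
def next_step(cutout_possible, morpheme_found=False, connectable=False,
              backtrack_succeeded=False):
    """Return a label for the control unit's next action at the current focus position."""
    if not cutout_possible:
        # No candidate can be cut out: try backtracking first, and fall back
        # to unknown word processing if backtracking fails.
        if backtrack_succeeded:
            return "advance_after_backtrack"
        return "unknown_word_processing"
    if not morpheme_found or not connectable:
        # Dictionary search or connection check failed: ask the cutout unit
        # for another candidate at the same focus position.
        return "retry_cutout_same_position"
    # A morpheme was identified: focus moves to the character after it.
    return "advance_to_next_character"
```

After unknown word processing, focus likewise moves to the character following the range that processing identified.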

[0025] The heading cutout unit 2 receives from the morpheme identification control unit 1 the character string from the focused character of the input sentence onward and the position of the focused character in the input sentence. On receiving them, the heading cutout unit 2 assumes that the focused character is the first character of a heading-corresponding character string, and cuts out, from the character string passed by the morpheme identification control unit 1 (the string from the focused character onward), a substring of a predetermined range that is a plausible heading-corresponding character string.

[0026] The heading cutout unit 2 cuts out character strings in principle according to the commonly used longest-match method (which gives priority to the longest morpheme). To reduce useless cutouts, this embodiment additionally uses information, tallied by a predetermined procedure, on the lengths of all the strings that the headings of all morphemes registered in the dictionary can take when their endings are inflected (hereinafter, these strings are called "inflected heading strings"). With this information the unit efficiently determines the range corresponding to one morpheme heading and cuts out the character string in that range as the heading-corresponding character string.

[0027] To realize this, the heading cutout unit 2 has a heading length existence determination table 21. For the newly focused character, that is, for the heading-corresponding character string about to be cut out, the unit looks up the relevant information in the heading length existence determination table 21, using at least the first character and the length of the heading-corresponding character string as keys.

[0028] The heading length existence determination table 21 divides all inflected heading strings that all morphemes registered in the dictionary can take into groups at least by their first character, and stores for each group heading length existence information created by tallying the lengths of the inflected heading strings belonging to that group. Each group also stores the maximum length of the inflected heading strings belonging to it. The heading length existence information stored for each group makes it possible to determine which lengths, in characters, of inflected heading strings belong to the group.

[0029] Accordingly, when a heading-corresponding character string is about to be cut out, referring to the heading length existence information of the group corresponding to its first character (and any further key characters) shows whether an inflected heading string of the same length as the candidate belongs to that group. In other words, it can be determined whether the dictionary contains a morpheme able to take an inflected heading string that matches the candidate in at least its first character and also in its length. The heading length existence determination table 21 is small enough to be held in main memory.
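The gating described here can be sketched as a candidate generator that proposes substrings longest first and skips every length for which no inflected heading string exists in the group. The representation below (a Python dict holding a set of lengths and a maximum length per first character) and the name `cutout_candidates` are assumptions made for illustration; the lengths 2, 3, and 4 for "報" follow the example given in this description.

```python
# Hypothetical per-group information: the lengths of inflected heading strings
# that exist for each first character, plus the maximum of those lengths.
TABLE = {
    "報": {"lengths": {2, 3, 4}, "max_len": 4},
}


def cutout_candidates(text, table=TABLE):
    """Yield substrings at the head of text that are worth looking up, longest first."""
    if not text:
        return
    group = table.get(text[0])
    if group is None:
        return  # no heading starts with this character: cutout impossible
    longest = min(group["max_len"], len(text))
    for length in range(longest, 0, -1):
        if length in group["lengths"]:
            yield text[:length]


print(list(cutout_candidates("報道された")))
```

For the input "報道された", only the candidates of length 4, 3, and 2 are produced; lengths 5 and 1 never reach the ending inflection unit or the dictionary search.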

[0030] The heading length existence determination table 21 is created in advance as follows and is loaded into main memory before the series of morphological analysis processes begins. First, lengths are collected for every morpheme registered in the dictionary: for a heading that inflects, the lengths of all inflected heading strings obtained by inflecting the heading registered in the dictionary (the terminal-form heading); for a heading that does not inflect, the length of the heading string itself.

[0031] For example, if the only morphemes beginning with "報" registered in the dictionary 10 were "報いる", "報酬", and "報道する", the information on the inflected heading strings corresponding to "報" would be as shown in Table 1.

[0032]

[Table 1]

[0033] Next, the inflected heading strings are grouped at least by first character, the lengths of the inflected heading strings present in each group and the maximum of those lengths are determined, and the results are stored in the heading length existence determination table 21. In Table 1, the inflected heading strings beginning with "報" have lengths of only 2, 3, and 4 characters, so the information in the heading length existence determination table corresponding to "報" is as shown in Table 2.

[0034]

[Table 2]

[0035] The heading length existence information in Table 2 is expressed in 8 bits, and each bit corresponds to a character count of the inflected heading strings belonging to the group (here, the group whose first character is "報"). A bit is 1 if the dictionary 10 contains a morpheme able to take an inflected heading string of that length belonging to the group, and 0 otherwise.

[0036] In Table 2, for example, the inflected heading strings corresponding to "報" have lengths of only 2, 3, and 4 characters, so only the 2nd, 3rd, and 4th bits from the left are 1 and all the others are 0.
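The construction of the 8-bit heading length existence information can be sketched as follows. The function name `build_table`, the dict layout, and in particular the list of inflected forms are assumptions for illustration; the forms are typical inflections of the three example words and are not a reproduction of Table 1.

```python
from collections import defaultdict

BITS = 8  # width of the heading length existence information in the example


def build_table(inflected_strings, bits=BITS):
    """Group strings by first character; record a length bitmap and the maximum length."""
    table = defaultdict(lambda: {"bits": 0, "max_len": 0})
    for s in inflected_strings:
        entry = table[s[0]]
        n = len(s)
        if n <= bits:
            entry["bits"] |= 1 << (bits - n)  # k-th bit from the left <-> length k
        entry["max_len"] = max(entry["max_len"], n)
    return dict(table)


# Assumed inflected forms of 報いる, 報酬, 報道する (illustrative only).
forms = ["報い", "報いる", "報いれ", "報いろ", "報酬",
         "報道し", "報道する", "報道すれ", "報道せ"]
table = build_table(forms)
print(format(table["報"]["bits"], "08b"), table["報"]["max_len"])
```

With these forms the script prints `01110000 4`: the 2nd, 3rd, and 4th bits from the left are set and the maximum length is 4, which agrees with the description of the "報" group.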

[0037] When the heading length existence information is set to occupy 8 bits, an inflected heading string of 9 or more characters cannot be represented in it. In that case, if for example the maximum length is 10, the information is interpreted as though virtual 9th and 10th bits were both 1.

[0038] For example, if the only heading beginning with "オ" registered in the dictionary 10 were "オープン・ファンド" (9 characters), the information in the heading length existence determination table 21 corresponding to "オ" would be as shown in Table 3.

[0039]

[Table 3]
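The rule for lengths that exceed the bit field can be sketched as a lookup function. The name `length_may_exist` is hypothetical, and the values used for the "オ" group in the usage note (an all-zero bit field with a maximum length of 9) are an assumption inferred from the text, since the table itself is not reproduced here.

```python
BITS = 8  # width of the heading length existence information


def length_may_exist(bits_value, max_len, length, bits=BITS):
    """Return True if an inflected heading string of this length may exist in the group."""
    if length < 1 or length > max_len:
        return False
    if length <= bits:
        # k-th bit from the left corresponds to length k.
        return bool(bits_value & (1 << (bits - length)))
    # Lengths beyond the bit field are treated as present, up to max_len.
    return True
```

Under the assumed values for "オ", `length_may_exist(0b00000000, 9, 9)` is true while lengths 1 to 8 and length 10 are rejected. The price of the compact field is that, for a group with a maximum length of 10, lengths 9 and 10 are both accepted even if only one of them actually occurs.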

[0040] The size of the heading length existence information field in the heading length existence determination table 21 can be set freely. If many of the inflected heading strings that the registered morphemes can take are 9 characters or longer, the field may be made larger, for example 12 or 16 bits. If the field is set to 16 bits, inflected heading strings of 17 or more characters are handled in the same way as described above.

[0041] As with the verb "くる", the first character of an inflected heading string may differ from the first character of the morpheme heading registered in the dictionary 10 (the terminal-form heading or the like). If only "くる" were registered in the dictionary 10, the lengths of the inflected heading strings corresponding to "くる" and the heading length existence information would be as shown in Tables 4 and 5.

[0042]

[Table 4]

[0043]

[Table 5]

[0044] In Table 5, if a heading other than "くる", for example "こども" (3 characters), is also registered in the dictionary 10, the 3rd bit from the left of the heading length existence information corresponding to "こ" in Table 5 becomes 1. FIG. 2 shows an example of the heading length existence determination table 21 created as described above, for the case where the information is grouped by the first character.
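The "くる" case shows why the table is keyed on the first character of each inflected string rather than on the dictionary heading. The sketch below assumes a hypothetical helper `length_bits` and an assumed list of inflected forms based on the usual school-grammar stems of "くる"; the forms actually listed in Table 4 may differ.

```python
def length_bits(strings, bits=8):
    """Return {first_char: bitmap}, where the k-th bit from the left marks length k."""
    table = {}
    for s in strings:
        if len(s) <= bits:
            table[s[0]] = table.get(s[0], 0) | (1 << (bits - len(s)))
    return table


# Assumed inflected forms of くる (illustrative; Table 4 may list others).
kuru_forms = ["こ", "き", "くる", "くれ", "こい"]
before = length_bits(kuru_forms)
after = length_bits(kuru_forms + ["こども"])
for ch in "くこき":
    print(ch, format(before[ch], "08b"), format(after[ch], "08b"))
```

Although only "くる" is registered, entries appear under "く", "こ", and "き". Adding "こども" changes only the "こ" entry, setting its 3rd bit from the left, as the description states.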

[0045] Grouping the inflected heading strings more finely, for example by their first two characters rather than by the first character alone as above, reduces the number of inflected heading strings in each group, so whether an inflected heading string of a given length belongs to a group can be determined more strictly. Grouping by the first three characters is stricter still, and ultimately the strictest determination would be to hold every inflected heading string in main memory and check for an exact match. Doing so, however, could consume an enormous amount of memory and is difficult to realize.

[0046] In this embodiment, therefore, to keep the heading length existence determination table 21 held in main memory small while allowing a stricter determination, grouping is in principle by the first character, and only when the first character is hiragana is it combined with a coarse classification of the second character.

【００４７】Specifically, when the first character is hiragana, heading length existence information is set for each combination of (a) the individual first character and (b) a class of the second character. If the second character is hiragana, it is consolidated into rows of the kana table such as the "あ" row and the "か" row; if it is not hiragana (for example, kanji), it is consolidated by character type, such as "kanji" or "katakana".
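The grouping rule can be sketched as a function that derives a table key from the text at the focused position. This is only an illustration: the row membership of voiced and small kana, the Unicode ranges used to detect each script, and the names `group_key` and `second_char_class` are assumptions, since the source does not spell them out.

```python
# Assumed assignment of hiragana to rows; the source names only the
# "あ行" and "か行" rows as examples.
HIRAGANA_ROWS = {
    "あ行": "あいうえおぁぃぅぇぉ",
    "か行": "かきくけこがぎぐげご",
    "さ行": "さしすせそざじずぜぞ",
    "た行": "たちつてとだぢづでどっ",
    "な行": "なにぬねの",
    "は行": "はひふへほばびぶべぼぱぴぷぺぽ",
    "ま行": "まみむめも",
    "や行": "やゆよゃゅょ",
    "ら行": "らりるれろ",
    "わ行": "わをん",
}


def is_hiragana(ch):
    return "\u3041" <= ch <= "\u3096"


def is_katakana(ch):
    return "\u30a1" <= ch <= "\u30fa" or ch == "ー"


def is_kanji(ch):
    return "\u4e00" <= ch <= "\u9fff"


def second_char_class(ch):
    """Consolidate the second character into a coarse class."""
    if is_hiragana(ch):
        for row, members in HIRAGANA_ROWS.items():
            if ch in members:
                return row
        return "other hiragana"
    if is_kanji(ch):
        return "漢字"
    if is_katakana(ch):
        return "片仮名"
    return "other"


def group_key(text):
    """Key by first character; add the second-character class only for hiragana."""
    first = text[0]
    if is_hiragana(first) and len(text) >= 2:
        return (first, second_char_class(text[1]))
    return (first,)


print(group_key("の状況"))  # ('の', '漢字')
print(group_key("では"))    # ('で', 'は行')
print(group_key("今の状"))  # ('今',)
```

Because the second character is reduced to roughly a dozen classes, the number of keys per hiragana first character stays small.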

【００４８】This is based on the following general observations: (1) the average length of a dictionary heading (or heading inflection character string) is about two characters, so referring to the first two characters permits a fairly strict determination; (2) more hiragana headings are registered in the dictionary than headings in other character types; (3) a change of character type is an effective place to draw a distinction; and (4) there are far fewer kinds of hiragana than of kanji and the like, so setting information for combinations of the first two characters adds little information and has little effect on memory use.

【００４９】As noted above, the unit at which information is set in the heading length existence determination table 21 can be changed freely; where memory constraints need not be considered, the groups may be made finer. Conversely, merely holding in main memory a heading length existence determination table 21 grouped by the first character alone is already sufficiently effective.

【００５０】FIG. 3 shows an example of the heading length existence determination table 21 in which groups are subdivided by the type of the second character only when the first character of the heading is hiragana.

【００５１】The heading cutout unit 2 refers to the heading length existence determination table 21 created as described above and retrieves the heading length existence information corresponding to the newly focused character (together with the following character, where applicable). Then, within the range available for cutting (a range that does not extend past the end of the sentence or the like), it begins cutting out heading-corresponding character strings in order from the longest, in accordance with the longest-match method.

【００５２】For example, suppose the input sentence is "報道しなければならない" and the heading cutout unit 2 is invoked with attention on "報". Ordinarily the strings "報道しなければならない", "報道しなければならな", "報道しなければなら", and so on down to "報道しな" and "報道し" would be cut out in turn. In this embodiment, if the information for "報" in the heading length existence determination table 21 is as shown in Table 2 above, referring to it shows that heading inflection character strings beginning with "報" are at most four characters long, so the four-character "報道しな" is cut out first and "報道し" next.
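The effect on the example above can be sketched as follows. The set of lengths assumed for "報" is hypothetical (the source states only that no string beginning with "報" exceeds four characters), and `candidate_strings` is an illustrative name.

```python
def candidate_strings(sentence, start, allowed_lengths):
    """Yield cutout candidates longest first, restricted to lengths that exist."""
    remaining = len(sentence) - start
    for n in sorted(allowed_lengths, reverse=True):
        if n <= remaining:  # never cut past the end of the sentence
            yield sentence[start:start + n]


sentence = "報道しなければならない"

# Without the table: every length from the full sentence down to one character.
naive = list(candidate_strings(sentence, 0, range(1, len(sentence) + 1)))

# With the table: assumed lengths 1 to 4 for strings beginning with "報".
pruned = list(candidate_strings(sentence, 0, {1, 2, 3, 4}))

print(len(naive))  # 11
print(pruned)      # ['報道しな', '報道し', '報道', '報']
```

In this sketch the seven longest candidates are never generated, so no dictionary search is spent on them.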

【００５３】When no heading inflection character string begins with a particular character, that is, when no morpheme that can take a heading inflection character string beginning with that character is registered in the dictionary 10, the maximum length in the heading length existence determination table 21 for that character is 0. For example, if no morpheme that can take a heading inflection character string beginning with "る" is registered in the dictionary 10, the information for "る" in the heading length existence determination table 21 is as shown in Table 6.

【００５４】[0054]

【表６】 [Table 6]

【００５５】Accordingly, when a heading-corresponding character string beginning with a particular character is about to be cut out, if the maximum length in the heading length existence determination table 21 for that character is 0, it can be determined immediately that no morpheme that can take a heading inflection character string beginning with that character is registered in the dictionary 10. In that case cutting is declared impossible without a single heading-corresponding character string beginning with that character being cut out (or searched for in the dictionary), so unregistered words can be detected efficiently.
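A minimal sketch of this early exit, assuming each table entry is a pair of a bit string and a maximum length; the entries and the name `longest_candidate` are illustrative, not taken from Table 6.

```python
NO_ENTRY = ("", 0)  # used for characters that have no entry at all


def longest_candidate(sentence, start, table):
    """Return the longest permitted cutout, or None when nothing can be cut."""
    bits, max_len = table.get(sentence[start], NO_ENTRY)
    if max_len == 0:
        # No registered morpheme starts with this character, so the caller
        # can skip cutting and dictionary search entirely.
        return None
    limit = min(max_len, len(sentence) - start)
    for n in range(limit, 0, -1):
        if bits[n - 1] == "1":
            return sentence[start:start + n]
    return None


# Hypothetical entries: nothing begins with "る"; "報" allows lengths 1 to 4.
table = {"る": ("0000", 0), "報": ("1111", 4)}

print(longest_candidate("るすばん", 0, table))                # None
print(longest_candidate("報道しなければならない", 0, table))  # 報道しな
```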

【００５６】The heading cutout unit 2 saves information on the range of the character string it has cut out, namely information from which the currently focused character in the input sentence, the focused position, the length of the cut-out string, and the like can be determined. It uses this information to distinguish whether it should cut out a shorter heading-corresponding character string beginning at the same focused position, or the longest heading-corresponding character string beginning at a newly focused position.

【００５７】When the heading cutout unit 2 succeeds in cutting out a heading-corresponding character string of one or more characters from the character string passed from the morpheme identification control unit 1, it passes that string to the ending inflection unit 3. When it cannot, it reports that cutting is impossible and returns control to the morpheme identification control unit 1.

【００５８】The ending inflection unit 3 receives the heading-corresponding character string from the heading cutout unit 2. It has an ending inflection table 31 as shown in FIG. 4, looks up the trailing hiragana portion (one to four characters) of the heading-corresponding character string in that table, and creates the ending-inflected headings inferred from it (headings in the form registered in the dictionary 10). A heading that does not inflect is, of course, itself treated as one ending-inflected heading. FIG. 5 shows examples of creating ending-inflected headings.
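The lookup can be sketched as below. The rules in `ENDING_TABLE` are invented for illustration and are not the contents of FIG. 4; they are chosen only so that "で" yields "で", "でる", and "だ", and "の" yields "の" and "だ", matching the candidates named in the worked example of this description.

```python
# Hypothetical rules: trailing hiragana suffix -> dictionary-form replacements.
ENDING_TABLE = {
    "で": ["でる", "だ"],
    "の": ["だ"],
    "し": ["する"],
}


def is_hiragana(ch):
    return "\u3041" <= ch <= "\u3096"


def ending_inflected_headings(s):
    """Return candidate dictionary headings for a cut-out string."""
    results = [s]  # the string itself counts as a non-inflecting heading

    # Length of the trailing hiragana run, capped at four characters.
    k = 0
    while k < min(4, len(s)) and is_hiragana(s[-1 - k]):
        k += 1

    for n in range(1, k + 1):
        suffix = s[-n:]
        for replacement in ENDING_TABLE.get(suffix, []):
            candidate = s[:-n] + replacement
            if candidate not in results:
                results.append(candidate)
    return results


print(ending_inflected_headings("で"))      # ['で', 'でる', 'だ']
print(ending_inflected_headings("の"))      # ['の', 'だ']
print(ending_inflected_headings("報道し"))  # ['報道し', '報道する']
```

Each returned candidate costs one dictionary search, which is why the number of searches exceeds the number of cutouts in the analysis examples.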

【００５９】After the above processing, the ending inflection unit 3 gathers the ending-inflected headings obtained, however many there are, into one list and passes it to the dictionary search unit 4.

【００６０】The dictionary search unit 4 searches the dictionary 10 for every ending-inflected heading passed to it and retrieves the morpheme information of the morphemes matching each heading. If suitable morphemes are found, it sends the list of retrieved morpheme information to the concatenation determination unit 5. If none is found, processing returns to the heading cutout unit 2 via the morpheme identification control unit 1.

【００６１】Based on the information passed from the dictionary search unit 4, the concatenation determination unit 5 judges whether adjacent morphemes can be connected and identifies a suitable morpheme. If one is identified, its information is passed to the morpheme identification control unit 1. If none can be identified, processing returns to the heading cutout unit 2 via the morpheme identification control unit 1.
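The interplay of the units described above can be sketched as a single loop. This is a simplified model under stated assumptions: the dictionary is a plain set, the inflection and connection checks are passed in as stand-in functions, the table maps a first character to a set of lengths, and an uncuttable character is simply emitted as-is (the source does not specify that fallback here). All names and sample data are hypothetical.

```python
def analyze(sentence, length_table, dictionary, inflect, can_connect):
    """Cut, inflect, search, and check connection, falling back to shorter cuts."""
    pos, morphemes, searches = 0, [], 0
    while pos < len(sentence):
        allowed = length_table.get(sentence[pos], set())
        found = None
        for n in sorted(allowed, reverse=True):    # heading cutout (longest first)
            if pos + n > len(sentence):
                continue
            piece = sentence[pos:pos + n]
            hits = []
            for heading in inflect(piece):         # ending inflection
                searches += 1                      # one dictionary search each
                if heading in dictionary:
                    hits.append(heading)
            hits = [h for h in hits if can_connect(morphemes, h)]
            if hits:
                found = (n, hits[0])
                break
        if found is None:
            morphemes.append(sentence[pos])        # assumed fallback for this sketch
            pos += 1
        else:
            morphemes.append(found[1])
            pos += found[0]
    return morphemes, searches


# Toy data, assumed for illustration only.
length_table = {"今": {1, 2, 3}, "の": {1}, "状": {2, 3}}
dictionary = {"今", "の", "状況"}

result = analyze(
    "今の状況",
    length_table,
    dictionary,
    inflect=lambda s: [s],           # stand-in: no inflection
    can_connect=lambda prev, h: True,  # stand-in: always connectable
)
print(result)  # (['今', 'の', '状況'], 5)
```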

【００６２】Next, the operation of this embodiment is described in detail using a concrete example sentence. FIGS. 6 and 7 show the morphological analysis of the sentence "今の状況では彼女が優勢だ" ("In the present situation she has the upper hand") by this embodiment. This analysis example uses a heading length existence determination table 21 that is grouped by the combination described above, covering up to the first two characters, when the first character is hiragana, and by the first character alone otherwise.

【００６３】First, under the control of the morpheme identification control unit 1, the heading cutout unit 2 refers to the heading length existence determination table 21 (FIG. 3) and, because heading inflection character strings beginning with the first character "今" are at most three characters long, cuts out the heading-corresponding character string "今の状". The ending inflection unit 3, the dictionary search unit 4, and the concatenation determination unit 5 process this string, but no morpheme can be identified, and processing returns to the heading cutout unit 2.

【００６４】The heading cutout unit 2 then deletes the last character of "今の状" and cuts out "今の" as a new heading-corresponding character string. No morpheme can be identified for this string in the subsequent processing either, and processing again returns to the heading cutout unit 2. After the same processing is repeated, the morpheme "今" is identified. Attention then moves to the next character, "の", and cutting of its heading-corresponding character string begins.

【００６５】Because the first character of the remaining string is the hiragana "の", the heading cutout unit 2 looks as far as the second character "状" and refers to the heading length existence determination table 21 for the string "の状". Since heading inflection character strings beginning with "の + kanji" have a maximum length of 0, it does not cut out "の状"; since a heading inflection character string corresponding to "の" alone (one character) does exist, it cuts out the heading-corresponding character string "の". The ending inflection unit 3, the dictionary search unit 4, and the concatenation determination unit 5 then process this string and identify the morpheme "の" or "だ". Cutting of the next heading-corresponding character string then begins.

【００６６】The heading cutout unit 2 refers to the heading length existence determination table 21 for "状", the first character of the remaining string, and, because heading inflection character strings beginning with "状" are at most three characters long, cuts out the heading-corresponding character string "状況で". No morpheme can be identified for this string in the subsequent processing, and processing returns to the heading cutout unit 2.

【００６７】The heading cutout unit 2 then deletes the last character of "状況で" and cuts out "状況" as a new heading-corresponding character string. The subsequent processing identifies the morpheme "状況". Cutting of the next heading-corresponding character string then begins.

【００６８】Because the first character of the remaining string is the hiragana "で", the heading cutout unit 2 looks as far as the second character "は" and refers to the heading length existence determination table 21 for the string "では". Since heading inflection character strings beginning with "で + は row" have a maximum length of 0, it does not cut out "では" and instead cuts out the heading-corresponding character string "で". The ending inflection unit 3, the dictionary search unit 4, and the concatenation determination unit 5 then process this string and identify the morpheme "で", "でる", or "だ" (although "でる" and "だ" are later rejected as not connectable). Cutting of the next heading-corresponding character string then begins.

【００６９】The heading cutout unit 2, the ending inflection unit 3, the dictionary search unit 4, and the concatenation determination unit 5 repeat the same processing and identify the morphemes "は", "彼女", "が", and "優勢だ".

【００７０】In the course of the above processing, as shown in FIGS. 6 and 7, the heading cutout unit 2 cuts out 12 heading-corresponding character strings, and 24 dictionary searches are performed with the ending-inflected headings derived from them.

【００７１】Next, FIGS. 8 to 10 show an example in which the same sentence as in the above analysis example, "今の状況では彼女が優勢だ", is analyzed using the heading length existence determination table 21 grouped by the first character only (FIG. 2).

【００７２】In this analysis example, when heading-corresponding character strings beginning with "の", "で", "は", or "が" are cut out, the heading length existence determination table 21 is consulted by the first character only, as illustrated. More heading-corresponding character strings are therefore cut out, and more dictionary searches performed, than in the analysis example of FIGS. 6 and 7.

【００７３】When cutting out heading-corresponding character strings beginning with "の", the heading cutout unit 2 cuts out "の状況では", because heading inflection character strings beginning with "の" are at most five characters long. No morpheme can be identified for this string in the subsequent processing, and processing returns to the heading cutout unit 2. Following the longest-match method, the same processing is then repeated in turn for the strings "の状況で", "の状況", "の状", and "の", after which the morpheme "の" or "だ" is identified. Cutting of the next heading-corresponding character string then begins.

【００７４】When cutting out heading-corresponding character strings beginning with "で", the heading cutout unit 2 cuts out "では彼女が", because heading inflection character strings beginning with "で" are at most five characters long. No morpheme can be identified for this string in the subsequent processing, and processing returns to the heading cutout unit 2. Following the longest-match method, the same processing is then repeated in turn for the strings "では彼", "では", and "で". As the heading length existence determination table 21 of FIG. 2 shows, no four-character heading inflection character string begins with "で", so "では彼女" is not cut out.
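The skipped length in this step can be reproduced with a bit-string lookup. The bit string "11101" is inferred from the text (lengths 1, 2, 3, and 5 are tried for "で", length 4 is not); the function name is illustrative.

```python
def candidates_from_bits(sentence, start, bits):
    """Yield cutouts longest first, skipping lengths whose bit is '0'."""
    remaining = len(sentence) - start
    for n in range(min(len(bits), remaining), 0, -1):
        if bits[n - 1] == "1":
            yield sentence[start:start + n]


sentence = "今の状況では彼女が優勢だ"
start = sentence.index("で")

# Lengths 1, 2, 3, and 5 exist for "で"; length 4 does not.
cuts = list(candidates_from_bits(sentence, start, "11101"))
print(cuts)  # ['では彼女が', 'では彼', 'では', 'で']
```

The same function also covers the end-of-sentence case: the candidate length is capped by the number of characters remaining, so a maximum of five yields a four-character first cut when only four characters are left.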

【００７５】After the above processing, the morpheme "で", "でる", or "だ" is identified (although "でる" and "だ" are later rejected as not connectable). Cutting of the next heading-corresponding character string then begins.

【００７６】Similarly, "は彼女が優" is cut out first for strings beginning with "は", and "が優勢だ" is cut out first for strings beginning with "が". Processing then proceeds by the longest-match method and the morphemes are identified. Heading inflection character strings beginning with "が" are at most five characters long, but only four characters remain before the end of the sentence, so "が優勢だ" is the first string cut out.

【００７７】In the course of the above processing, as shown in FIGS. 8 to 10, the heading cutout unit 2 cuts out 26 heading-corresponding character strings, and 44 dictionary searches are performed with the ending-inflected headings corresponding to them.

【００７８】Next, FIGS. 11 to 16 show an example in which the same sentence, "今の状況では彼女が優勢だ", is analyzed without using the heading length existence determination table 21. In this case, as illustrated, the heading cutout unit 2 cuts out 57 heading-corresponding character strings, and 92 dictionary searches are performed with the ending-inflected headings corresponding to them.

【００７９】Compared with this, the analysis example of FIGS. 6 and 7 saves 45 cutouts of heading-corresponding character strings and 68 dictionary searches. Even the less efficient analysis example of FIGS. 8 to 10 saves 31 cutouts and 48 dictionary searches, so that both the number of cutouts and the number of dictionary searches are roughly halved relative to the case in which the heading length existence determination table 21 is not used.
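The savings quoted above follow directly from the counts given for the three analysis examples; the short calculation below reproduces them. Only the variable and function names are introduced here.

```python
# Counts stated for the analysis without the table (FIGS. 11 to 16).
BASELINE = {"cutouts": 57, "searches": 92}

# Counts stated for the two analyses that use the table.
RUNS = {
    "FIGS. 6-7 (two-character grouping)": {"cutouts": 12, "searches": 24},
    "FIGS. 8-10 (first-character grouping)": {"cutouts": 26, "searches": 44},
}


def savings(run):
    """Operations avoided relative to the baseline."""
    return {key: BASELINE[key] - run[key] for key in BASELINE}


for name, run in RUNS.items():
    ratio = {key: round(run[key] / BASELINE[key], 3) for key in BASELINE}
    print(name, savings(run), ratio)
# FIGS. 6-7: saves 45 cutouts and 68 searches (about 21% and 26% of baseline remain)
# FIGS. 8-10: saves 31 cutouts and 48 searches (about 46% and 48% of baseline remain)
```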

【００８０】The example sentence used in this analysis is relatively short. In general longer sentences are analyzed, so the reduction in the number of cutouts and dictionary searches is correspondingly greater.

【００８１】[0081]

[Effects of the Invention] As described above, the Japanese morphological analysis system and heading cutout method of the present invention include the heading length existence determination table 21 and determine the range of the heading-corresponding character string to be cut out by referring to it. Heading-corresponding character strings that can never be identified as morphemes are therefore not cut out needlessly, the range of the heading-corresponding character string is determined efficiently, and the number of dictionary searches is greatly reduced. This makes it practical to adopt a method in which the dictionary is searched by specifying the morpheme heading in full spelling.

【００８２】In addition, by referring to the heading length existence determination table 21, lengths for which no heading inflection character string exists can be detected immediately, and it can also be determined immediately that no morpheme that can take a heading inflection character string beginning with a particular character is registered in the dictionary.

【００８３】Furthermore, because referring to the heading length existence determination table 21 reduces the wasted dictionary searches themselves, the present invention still improves analysis efficiency even if all the dictionary information could be held in main memory, given that a dictionary search takes a certain amount of time in any case.

[Brief description of the drawings]

【図１】FIG. 1 is a block diagram showing the configuration of a morphological analysis system according to an embodiment of the present invention.

【図２】FIG. 2 is a diagram showing an example of the heading length existence determination table 21 of FIG. 1.

【図３】FIG. 3 is a diagram showing an example of the heading length existence determination table 21 of FIG. 1.

【図４】FIG. 4 is a diagram showing the ending inflection table of FIG. 1.

【図５】FIG. 5 is a diagram showing an example of the creation of ending-inflected headings by the ending inflection unit of FIG. 1.

【図６】FIG. 6 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図７】FIG. 7 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図８】FIG. 8 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図９】FIG. 9 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図１０】FIG. 10 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図１１】FIG. 11 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図１２】FIG. 12 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図１３】FIG. 13 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図１４】FIG. 14 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図１５】FIG. 15 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

【図１６】FIG. 16 is a diagram showing an example of analysis processing by the morphological analysis system of FIG. 1.

[Description of reference numerals]

1 Morpheme identification control unit
2 Heading cutout unit
3 Ending inflection unit
4 Dictionary search unit
5 Concatenation determination unit
10 Dictionary
21 Heading length existence determination table
31 Ending inflection table

Claims

(57) [Claims]

1. A Japanese morphological analysis system that comprises a morpheme information dictionary searchable by the full spelling of a morpheme heading, cuts out partial character strings of an input Japanese sentence in order, starting from a partial character string of a predetermined range on the sentence-head side, and identifies the morpheme corresponding to each cut-out partial character string while searching the dictionary with a heading, in the form registered in the dictionary, created by transforming the word ending, the system being characterized by comprising: a heading length existence determination table that stores, for all heading inflection character strings that can be taken by all morphemes registered in the dictionary, information on the lengths of the heading inflection character strings, at least for each group of heading inflection character strings having the same first character; and a heading cutout unit that, when a partial character string is to be cut out from the character string of the Japanese sentence, uses the information in the heading length existence determination table corresponding to at least the first character of that partial character string, and cuts out the partial character string only when it is determined that a morpheme that can take a heading inflection character string matching the partial character string in at least the first character and in length is registered in the dictionary.

2. The Japanese morphological analysis system according to claim 1, wherein the heading length existence determination table stores, for all heading inflection character strings that can be taken by all morphemes registered in the dictionary, with those strings divided into groups at least according to their first character: for each group, heading inflection character string length information that is created by aggregating the lengths of the heading inflection character strings in the group and is used to determine whether a heading inflection character string of a specific length exists in the group; and, for each group, the maximum length of the heading inflection character strings belonging to the group.

3. A heading cutout method used in a Japanese morphological analysis system that cuts out partial character strings of an input Japanese sentence in order, starting from a partial character string of a predetermined range on the sentence-head side, and identifies the morpheme corresponding to each cut-out partial character string while searching a dictionary with a full-spelling heading, in the form registered in the dictionary, created by transforming the word ending, the method being used to cut out, from the character string of the Japanese sentence, a partial character string of a range corresponding to the heading of one morpheme, and being characterized by comprising: a step of, when a partial character string is to be cut out from the character string of the Japanese sentence, extracting the information corresponding to the partial character string to be cut out from a heading length existence determination table that stores information on the lengths of the heading inflection character strings for all heading inflection character strings that can be taken by all morphemes registered in the dictionary; a step of determining, using the information extracted from the heading length existence determination table, whether a morpheme that can take a heading inflection character string matching the partial character string to be cut out in at least the first character and in length is registered in the dictionary; and a step of cutting out the partial character string only when it is determined that such a morpheme is registered in the dictionary.