JP2002269083A

JP2002269083A - Morpheme analysis system

Info

Publication number: JP2002269083A
Application number: JP2001066028A
Authority: JP
Inventors: Atsushi Ito (伊藤 篤)
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2001-03-09
Filing date: 2001-03-09
Publication date: 2002-09-20

Abstract

PROBLEM TO BE SOLVED: To perform morphological analysis without removing tags, that is, document description information, from a document, and to make it easy to determine from the morphological analysis result where the tags were located. SOLUTION: A sentence division unit 11 divides an input document into sentences and phrases, and a dictionary lookup range determination unit 12 extracts the words with which a dictionary lookup unit 13 looks up a morphological analysis dictionary 30. When the input document contains document description information, a language unit level determination unit 20 determines the language unit level of that information according to the levels registered in a language unit level table 23, and the morphological analysis operation for the input document is changed according to the determined level. An optimal path determination unit 14 then determines the best morphological analysis result.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a morphological analysis system applied to language processing, syntactic analysis, and the like, and in particular to a morphological analysis system applied to language analysis systems that handle tagged documents such as HTML, XML, and SGML.

[0002]

[Prior art] Japanese Unexamined Patent Publication No. H9-16594, "Document morphological analysis apparatus", discloses a morphological analysis technique that removes the tags from a tagged document such as HTML, XML, or SGML, performs morphological analysis, and outputs the result together with the tags. This technique first recognizes sentence boundaries by detecting the tags contained in the document, then removes the tags from the document and performs ordinary morphological analysis, and finally merges the morphological analysis result with the information about the tags and outputs it. Morphological analysis can thus be carried out even on a tagged document.

[0003]

[Problem to be solved by the invention] However, because the morphological analysis technique disclosed in the prior art removes the tags first, it is difficult to determine afterwards the positional relationship between the morphological analysis result and the tags.

[0004] For example, consider the following input document: 「土曜日に<hinshi name="固有名詞">花見</hinshi>に行<br>きましょう<p>ね、いいかな」 (roughly, "Let's go to Hanami on Saturday" followed by "Hey, is that all right?"; the attribute value 固有名詞 means "proper noun").

[0005] Here, <hinshi>, </hinshi>, <br>, and <p> are tags. The <hinshi> and </hinshi> tags specify the part of speech of the word enclosed between them; in the example above they indicate that 「花見」 (hanami) is a proper noun (in this example, the name of a bar). The <br> tag specifies a physical line break, so a line break occurs where <br> is inserted. The <p> tag marks a paragraph boundary, so the text is split into two paragraphs, above and below, at the position where <p> is inserted.

[0006] The input document above is therefore an example of a text made up of two paragraphs: one that extends an invitation, "Let's go to the bar 'Hanami' on Saturday" (where 'Hanami' is a proper noun, the name of the bar), and one that asks for confirmation, "Hey, is that all right?".

[0007] If the tags are removed from this input document first, as in the morphological analysis of Japanese Unexamined Patent Publication No. H9-16594, the text becomes 「土曜日に花見に行きましょうね、いいかな」. The morphological analysis of this text should yield 「土曜日 (noun) / に (adverbial particle) / 花見 (proper noun) / に (adverbial particle) / 行き (verb, continuative form) / ましょ (auxiliary verb) / う (sentence-final particle) / ね (interjection) / 、 (comma) / いい (adjective) / か (sentence-final particle) / な (sentence-final particle)」. Because the tags are missing, however, it instead yields 「土曜日 (noun) / に (adverbial particle) / 花 (noun) / 見 (verb, continuative form) / に (adverbial particle) / 行き (verb, continuative form) / ましょ (auxiliary verb) / う (sentence-final particle) / ね (sentence-final particle) / 、 (comma) / いい (adjective) / か (sentence-final particle) / な (sentence-final particle)」.

[0008] That is, although 「花見」 should be analyzed as a proper noun, it is split into the common noun 「花」 (flower) and the continuative form of the verb 「見」 (to see). Moreover, although 「ね」 should be analyzed as an interjection at the head of the text of the next paragraph, it is interpreted as a sentence-final particle attached to the end of the first text. The text is thus analyzed as meaning "Let's go to see the flowers on Saturday, shall we? Is that all right?", which is incorrect. It also becomes difficult afterwards to decide where in the morphological analysis result each tag belongs.
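The loss of positional information described above can be illustrated with a short Python sketch. It is not taken from the cited publication; the regular expression `TAG` and the variable names are assumptions made for illustration. Deleting the tags reproduces the tag-free text of paragraph [0007], while recording each tag with its character offset keeps the information that the prior-art approach discards.

```python
import re

# Assumption: a tag is any "<...>" run without a nested ">".
TAG = re.compile(r"<[^>]+>")

doc = '土曜日に<hinshi name="固有名詞">花見</hinshi>に行<br>きましょう<p>ね、いいかな'

# Prior-art style: delete every tag before analysis.
stripped = TAG.sub("", doc)

# Tag-preserving style: remember each tag with its 0-based character offset
# in the input document, so results can be mapped back to tag positions.
tags = [(m.start(), m.group()) for m in TAG.finditer(doc)]

print(stripped)
for offset, tag in tags:
    print(offset, tag)
```

Once `stripped` has been analyzed, nothing in it says where `<p>` or `<hinshi>` used to be, which is the difficulty the invention addresses.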

[0009] The present invention has been made in view of these problems. Its object is to perform morphological analysis without removing the tags, that is, the document description information, from the document, and to make the position where each tag was located easy to determine from the analysis result. A further object is to use the tags to improve the analysis rate of morphological analysis.

[0010]

[Means for solving the problem] The invention according to claim 1 is a morphological analysis system that has at least a morphological analysis dictionary, performs morphological analysis of an input document by looking up, in the morphological analysis dictionary, words that match words contained in the input document, and outputs an optimal morphological analysis result, characterized in that, when the input document contains document description information, the system identifies the language unit level of that document description information and changes the morphological analysis operation for the input document according to the identified language unit level.

[0011] The invention according to claim 2 is the morphological analysis system according to claim 1, characterized in that, when the identified language unit level of the document description information is a low level, at or below the character level, the system skips over the document description information when extracting the words to be looked up in the morphological analysis dictionary.

[0012] The invention according to claim 3 is the morphological analysis system according to claim 1, characterized in that, when the identified language unit level of the document description information is at or above the morpheme level, the system does not look up the morphological analysis dictionary using a character string that spans the document description information.

[0013] The invention according to claim 4 is the morphological analysis system according to claim 1, characterized in that, when the identified language unit level of the document description information is at or above the sentence level, the system divides the document at the document description information and performs morphological analysis on each resulting part.

[0014] The invention according to claim 5 is the morphological analysis system according to any one of claims 1 to 4, characterized in that the document description information is an SGML and/or XML and/or HTML tag.

[0015] The invention according to claim 6 is the morphological analysis system according to any one of claims 1 to 4, characterized in that the document description information is a layout control character such as a line break.

[0016] The invention according to claim 7 is the morphological analysis system according to any one of claims 1 to 4, characterized in that the document description information is a delimiter character such as a period or a comma.

[0017] The invention according to claim 8 is a morphological analysis method that uses at least a morphological analysis dictionary, performs morphological analysis of an input document by looking up, in the morphological analysis dictionary, words that match words contained in the input document, and outputs an optimal morphological analysis result, characterized in that, when the input document contains document description information, the method identifies the language unit level of that document description information and changes the morphological analysis operation for the input document according to the identified language unit level.

[0018] The invention according to claim 9 is a program recording medium that stores, as a computer-readable program executable by a computer, a morphological analysis method that uses at least a morphological analysis dictionary, performs morphological analysis of an input document by looking up, in the morphological analysis dictionary, words that match words contained in the input document, and outputs an optimal morphological analysis result, wherein, when the input document contains document description information, the method identifies the language unit level of that document description information and changes the morphological analysis operation for the input document according to the identified language unit level.

[0019]

[Embodiments of the invention] FIG. 1 shows a functional block diagram of a morphological analysis system according to the present invention, as one embodiment. In FIG. 1, the morphological analysis system has a morphological analysis unit 10, a language unit level determination unit 20, and a morphological analysis dictionary 30. The morphological analysis unit 10 has a sentence division unit 11, a dictionary lookup range determination unit 12, a dictionary lookup unit 13, and an optimal path determination unit 14. The sentence division unit 11 divides the input document into sentences and phrases on the basis of the document description information (tags, layout control characters, and delimiter characters) contained in the document.

[0020] The dictionary lookup range determination unit 12 extracts, from each phrase produced by the sentence division unit 11 (the text of each section contained in a sentence), the words that serve as the units for looking up the morphological analysis dictionary 30. If document description information is extracted during word extraction, it is handed over to the language unit level determination unit 20 and is not used as a word for looking up the morphological analysis dictionary 30. The dictionary lookup unit 13 looks up the morphological analysis dictionary 30 using the words extracted by the dictionary lookup range determination unit 12 and obtains the notation, part of speech, cost, and so on of each word. The optimal path determination unit 14 outputs the optimal morphological analysis result on the basis of the dictionary lookup results that the dictionary lookup unit 13 obtained for each word making up the document.

[0021] The language unit level determination unit 20, for its part, has a language unit level comparison unit 21, a document description information table 22, and a language unit level table 23. The language unit level determination unit 20 determines the language unit level of document description information contained in the document. It determines whether the document description information passed from the sentence division unit 11 or the dictionary lookup range determination unit 12 is registered in the document description information table 22. If it is registered, the language unit level comparison unit 21 determines, on the basis of the information registered in the language unit level table 23, whether, for example, its language unit level denotes a unit at or above the sentence.
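The two-table lookup performed by the language unit level determination unit 20 can be sketched in Python as follows. This is only an illustration under stated assumptions: the three-value `Level` enum, the regular expressions standing in for table 22, and the level assignments standing in for table 23 are hypothetical, chosen to match how `<br>`, `<hinshi>`, and `<p>` are treated in the worked example, and are not the contents of the patent's tables.

```python
import re
from enum import IntEnum
from typing import Optional


class Level(IntEnum):
    """Language unit levels, ordered from smallest to largest (assumed ordering)."""
    CHARACTER = 1
    WORD = 2
    SENTENCE = 3


# Stand-in for document description information table 22:
# regular expression -> normalized name of the document description information.
DESCRIPTION_TABLE = [
    (re.compile(r"</?hinshi\b[^>]*>"), "hinshi"),
    (re.compile(r"<br\s*/?>"), "br"),
    (re.compile(r"<p\s*/?>"), "p"),
]

# Stand-in for language unit level table 23: name -> language unit level.
LEVEL_TABLE = {"br": Level.CHARACTER, "hinshi": Level.WORD, "p": Level.SENTENCE}


def level_of(token: str) -> Optional[Level]:
    """Return the level of `token`, or None if it is not registered in table 22."""
    for pattern, name in DESCRIPTION_TABLE:
        if pattern.fullmatch(token):
            return LEVEL_TABLE[name]
    return None


def is_at_least(token: str, threshold: Level) -> bool:
    """Role of comparison unit 21: is the token registered with level >= threshold?"""
    level = level_of(token)
    return level is not None and level >= threshold


print(is_at_least("<p>", Level.SENTENCE))
print(level_of('<hinshi name="固有名詞">'))
```

Keeping the pattern table and the level table separate mirrors the split between tables 22 and 23, so a level can be changed without touching the patterns.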

[0022] A document input to the morphological analysis system is passed to the morphological analysis unit 10 and analyzed sequentially from the beginning of the document. When a word separated out by the sentence division unit 11 is document description information, corresponding to the tags described above, the language unit level determination unit 20 identifies its language unit level, and the morphological analysis procedure is changed according to the result.

[0023] Next, examples of the inventions according to claims 1 to 9 are described below, using the input document cited in the section [Problem to be solved by the invention].

[0024] That is, suppose the document 「土曜日に<hinshi name="固有名詞">花見</hinshi>に行<br>きましょう<p>ね、いいかな」 is input. It represents a text of two paragraphs: one that extends an invitation, "Let's go to the bar 'Hanami' (the name of a bar) on Saturday", and one that asks for confirmation, "Hey, is that all right?". Morphological analysis is then carried out by the steps below. In the present invention, control information that governs the structure of a document, such as tags, is called "document description information", and that term is used in the following description.

[0025] First, the sentence division unit 11 scans the input document sequentially from the beginning. When it finds document description information that the language unit level determination unit 20 determines to have a language unit level at or above the sentence, it divides the document into sentences at the position of that document description information (step S1). Here, the <p> tag marking the paragraph boundary is found and the document is divided into two texts, before and after it. As shown in Table 1, the two texts are assigned the sentence numbers "1" and "2". The "position" column of Table 1 gives the position of the first character of the text for each sentence number, as the number of characters from the beginning of the input document. Although it is not shown in the input document above, document description information marking the end of the input document is also assumed to be appended to the end of the last sentence.

[0026]

[Table 1]
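Step S1 can be sketched in Python as below. This is a minimal illustration, not the patent's implementation: the function names, the assumption that only `<p>` is registered at the sentence level or above, and the use of 0-based character offsets for the "position" column are all choices made here, so the printed positions need not match the values in Table 1. The sentence-level tag is kept at the end of the sentence it closes, consistent with the later statement that the document description information ending a unit is attached to its last word.

```python
import re

TAG = re.compile(r"<[^>]+>")
SENTENCE_LEVEL_TAGS = {"p"}  # assumption: only <p> is at the sentence level or above


def tag_name(tag: str) -> str:
    """Extract the element name, e.g. '<hinshi name=..>' -> 'hinshi', '</p>' -> 'p'."""
    m = re.match(r"</?\s*(\w+)", tag)
    return m.group(1).lower() if m else ""


def split_sentences(doc: str):
    """Step S1 sketch: cut after each sentence-level tag, keeping the tag in place.

    Returns (sentence number, position, text) rows; position is the 0-based
    offset of the sentence's first character in the input document.
    """
    rows, start = [], 0
    for m in TAG.finditer(doc):
        if tag_name(m.group()) in SENTENCE_LEVEL_TAGS:
            rows.append((len(rows) + 1, start, doc[start:m.end()]))
            start = m.end()
    if start < len(doc):  # trailing text after the last sentence-level tag
        rows.append((len(rows) + 1, start, doc[start:]))
    return rows


doc = '土曜日に<hinshi name="固有名詞">花見</hinshi>に行<br>きましょう<p>ね、いいかな'
sentences = split_sentences(doc)
for number, position, text in sentences:
    print(number, position, text)
```

Because every row keeps its offset and the tags stay inside the text, concatenating the rows reproduces the input document exactly.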

[0027] Next, morphological analysis is performed on the text of each sentence number (step S2), the "processing accompanying analysis" is applied for the document description information contained in each text (step S3), and finally the morphological analysis result is output (step S4).

[0028] Next, the morphological analysis of a sentence in step S2 is described. The sentence division unit 11 scans the sentence with sentence number "1" sequentially from the beginning. When it finds document description information that the language unit level determination unit 20 determines to be at or above the word level and below the sentence level, it further divides the sentence at that position into the phrases of individual sections (step S2.1). For sentence number "1", for example, the document description information <hinshi name="固有名詞"> and </hinshi>, which delimit units smaller than a sentence and at least as large as a word, are found, and the sentence is subdivided into the phrases of three sections. As shown in Table 2, the sections are assigned the section numbers "1", "2", and "3". The "position" column of Table 2 gives the position of the first character of the text for each section number, as the number of characters from the beginning of the input document.

[0029]

[Table 2]

[0030] Next, the following is executed for the phrase of each section produced in step S2.1. The current position (p) is set to the position of the first character of the phrase (step S2.2). The dictionary lookup range determination unit 12 extracts the words that start at that character position (step S2.3) and passes them to the dictionary lookup unit 13, which looks up the morphological analysis dictionary 30 and obtains the notation, part of speech, and cost of each word (step S2.4). If document description information is detected as such a word, the language unit level determination unit 20 determines its language unit level; if it is at the word level, the "processing accompanying analysis" described later is applied and no dictionary lookup is performed for that document description information.

[0031] If, on the other hand, the document description information is not at the word level but at or below the character level, it is skipped over when the word is extracted; the word is passed to the dictionary lookup unit 13, which looks up the morphological analysis dictionary 30 and obtains the notation, part of speech, and cost of the word. The current position (p) is then advanced to the next character position (step S2.5) and, if the phrase of the section has not yet ended (step S2.6), processing returns to step S2.3 and the next word is extracted.
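The skipping of character-level document description information during steps S2.3 and S2.4 can be sketched in Python as follows. Everything specific here is an assumption for illustration: `<br>` is taken to be the only character-level tag, and `DICTIONARY` is a toy stand-in for the morphological analysis dictionary 30 with made-up parts of speech and costs. The point of the sketch is that the tag is ignored when forming the lookup key, while the reported length still covers it, so position and length continue to address the original input.

```python
import re

CHAR_LEVEL_TAG = re.compile(r"<br\s*/?>")  # assumption: <br> is the only character-level tag

# Toy stand-in for morphological analysis dictionary 30: surface -> (part of speech, cost).
DICTIONARY = {
    "に": ("particle", 10),
    "行き": ("verb", 10),
    "き": ("verb", 30),
    "ましょ": ("auxiliary verb", 10),
    "う": ("particle", 10),
}
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)


def candidates_at(text: str, p: int):
    """Steps S2.3/S2.4 sketch: dictionary words starting at offset p of `text`.

    Character-level tags are skipped while collecting characters, but the
    returned length still counts them, so (position, length) addresses the
    original text including the tag.
    """
    found, surface, i = [], "", p
    while i < len(text) and len(surface) < MAX_WORD_LEN:
        m = CHAR_LEVEL_TAG.match(text, i)
        if m:  # skip the tag without adding it to the lookup key
            i = m.end()
            continue
        surface += text[i]
        i += 1
        if surface in DICTIONARY:
            pos, cost = DICTIONARY[surface]
            found.append({"surface": surface, "position": p, "length": i - p,
                          "pos": pos, "cost": cost})
    return found


text = "に行<br>きましょう"
for word in candidates_at(text, 1):
    print(word["surface"], text[word["position"]:word["position"] + word["length"]])
```

For the text 「に行<br>きましょう」, looking up from the character 「行」 finds the key 「行き」 even though `<br>` sits between its two characters.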

[0032] For example, when word extraction (step S2.3) and dictionary lookup (step S2.4) are repeated for the phrase with section number "2" in Table 2, the phrase is broken down into the two words 「花」 and 「見</hinshi>」, as shown in Table 3, and each is looked up in the dictionary to obtain its notation, part of speech, and cost. The cost here is the morphological analysis cost, which is registered in the dictionary for each word. The document description information that marks the end of a section is attached to the last word, and the cost of the last word in a section is calculated by adding the cost of the document description information to the cost obtained by dictionary lookup. In Table 3, the "position" column gives the start position of the word as the number of characters from the beginning of the document, and the "length" column gives the number of characters in the word.

[0033]

[Table 3]

[0034] Likewise, when word extraction (step S2.3) and dictionary lookup (step S2.4) are repeated for the phrase with section number "3" in Table 2, the phrase is broken down into the five words 「に」, 「行<br>き」, 「き」, 「ましょ」, and 「う<p>」, as shown in Table 4, and each is looked up in the dictionary to obtain its notation, part of speech, and cost.

[0035]

[Table 4]

[0036] In practice, however, each word of a phrase can be recovered from its position and length in the input text, so the words need not be stored in the form shown in Tables 3 and 4.

[0037] Next, the "processing accompanying analysis" (step S3) is described. It is applied when, during the word extraction of step S2.3, document description information whose language unit level is the word level is detected.

[0038] For example, the sentence with sentence number "1" in the example above is divided into three phrases, or sections, in step S2.1. As shown in Table 2, the phrase with section number "1" ends with the document description information <hinshi name="固有名詞">, so "固有名詞" (proper noun) has been pushed onto the stack. The dictionary lookup results for the two words 「花」 and 「見</hinshi>」, produced for the phrase with section number "2" by repeating steps S2.3 and S2.4, are therefore changed to the part of speech indicated by the document description information pushed onto the stack for section number "1" (step S3). As a result, the dictionary lookup result for section number "2" is corrected as shown in Table 5: the two words are merged into the single proper noun 「花見</hinshi>」. As Table 5 shows, the part of speech is set to "proper noun", as specified in the document description information, and the notation is 「花見」 as it stands. For simplicity, the cost is always set to "10".

[0039]

[Table 5]
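The correction in step S3 can be sketched in Python as follows. The function name, the record layout, and the 0-based section positions are assumptions made for this illustration; only the behavior comes from the description: a part of speech named by an opening <hinshi> tag is pushed onto a stack, the next section's lookup results are replaced by one merged word with that part of speech, and its cost is fixed at 10. The sketch handles only one tag at the end of each section, as in the worked example.

```python
import re

OPEN_TAG = re.compile(r'<hinshi\s+name="([^"]+)">$')  # section ends with an opening tag
CLOSE_TAG = re.compile(r"</hinshi>$")                   # section ends with the closing tag
ANY_TAG = re.compile(r"<[^>]+>")
FIXED_COST = 10  # the description always uses 10 "for simplicity"


def apply_pos_tags(sections, lookups):
    """Step S3 sketch. `sections` is a list of (position, text) pairs and
    `lookups` holds the dictionary lookup results for each section.
    Returns the corrected lookup results, one list per section."""
    stack, corrected = [], []
    for (position, text), words in zip(sections, lookups):
        if stack:  # a part of speech is pending from the previous section
            words = [{
                "surface": ANY_TAG.sub("", text),  # notation without the tag
                "position": position,
                "length": len(text),               # length in the input, tag included
                "pos": stack[-1],
                "cost": FIXED_COST,
            }]
        if CLOSE_TAG.search(text) and stack:
            stack.pop()
        opened = OPEN_TAG.search(text)
        if opened:
            stack.append(opened.group(1))
        corrected.append(words)
    return corrected


sections = [(0, '土曜日に<hinshi name="固有名詞">'), (24, "花見</hinshi>")]
lookups = [
    [{"surface": "土曜日"}, {"surface": "に"}],
    [{"surface": "花"}, {"surface": "見"}],
]
print(apply_pos_tags(sections, lookups)[1])
```

Sections that are not preceded by an opening tag pass through unchanged, so ordinary dictionary lookup results are kept wherever no part of speech was specified.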

[0040] Finally, in step S4, the optimal combination of morphemes is determined from the cost of each word obtained by dictionary lookup, and the morphological analysis result is output. The optimal morphemes can be found by ordinary morphological analysis methods, which are not described here. The morphological analysis described above yields the result shown in Table 6.

[0041]

[Table 6]
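The description leaves step S4 to ordinary methods. As one possible illustration only, and not the method of the patent, the Python sketch below selects the cheapest chain of candidate words covering a span by dynamic programming over character offsets. It uses word costs alone; practical analyzers usually also weigh costs for connecting adjacent words, which are omitted here. The candidate records and their costs are hypothetical.

```python
def best_path(candidates, start, end):
    """Return (total cost, words) for the cheapest chain of candidates that
    covers the offsets [start, end) without gaps, or None if none exists.

    Each candidate is a dict with "position", "length" (> 0) and "cost".
    """
    best = {start: (0, [])}  # offset -> (cheapest cost so far, words used)
    for pos in range(start, end):
        if pos not in best:
            continue  # no chain of words ends exactly at this offset
        cost_so_far, path = best[pos]
        for word in candidates:
            if word["position"] != pos:
                continue
            nxt = pos + word["length"]
            total = cost_so_far + word["cost"]
            if nxt <= end and (nxt not in best or total < best[nxt][0]):
                best[nxt] = (total, path + [word])
    return best.get(end)


# Hypothetical candidates for a two-character span.
candidates = [
    {"surface": "花", "position": 0, "length": 1, "cost": 20},
    {"surface": "見", "position": 1, "length": 1, "cost": 20},
    {"surface": "花見", "position": 0, "length": 2, "cost": 10},
]
cost, words = best_path(candidates, 0, 2)
print(cost, [w["surface"] for w in words])
```

With these made-up costs the single word 「花見」 is preferred over the pair 「花」 and 「見」, which is the kind of outcome a low fixed cost for a tagged word is meant to favor.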

[0042] As described above, morphological analysis can be performed even when the document contains document description information (tags), and where the document description information (tags) was located can easily be determined from the morphological analysis result. Furthermore, providing tags such as <hinshi> improves the analysis rate. Although this embodiment uses XML, SGML, and HTML tags as examples of document description information, specific one-byte layout control characters (for example, line feed codes) and specific delimiter characters (for example, periods and commas) can be processed in the same way.

[0043] Table 7 shows an example of the document description information table 22 in FIG. 1. The language unit level of document description information at any position in the document can be obtained by looking it up in the document description information table 22 shown in Table 7. The document description information registered in the "document description information" column of the table is assumed to be registered as normalized regular expressions. The registered document description information may be SGML and/or XML and/or HTML tags, layout control characters such as line breaks, or delimiter characters such as periods and commas.

[0044]

[Table 7]

[0045] Table 8 shows an example of the language unit level table 23 in FIG. 1. The language unit level table 23 is consulted, for example, when step S1 determines whether the language unit level of document description information is at or above the "sentence" level. As shown in Table 8, when the language unit level of the document description information is a low level, at or below the character level, the document description information is skipped over when extracting the words to be looked up in the morphological analysis dictionary. When the language unit level is at or above the morpheme level, the morphological analysis dictionary is not looked up using a character string that spans the document description information. When the language unit level is at or above the sentence level, the document is divided at the document description information and morphological analysis is performed on each part.

【００４６】[0046]

【表８】 [Table 8]
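
The three level-dependent behaviors described for the language unit level table can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the three-level scale, the tag-to-level assignments in `TABLE`, and the helper names `match_info`, `split_segments`, and `lookup_keys` are all hypothetical, and an actual dictionary lookup is replaced by simply listing the candidate keys.

```python
import re

# Hypothetical levels, low to high.
CHARACTER, MORPHEME, SENTENCE = 1, 2, 3

# Hypothetical assignments of tags to levels (Table 8 is not reproduced here).
TABLE = [
    (re.compile(r"</?b>"), CHARACTER),
    (re.compile(r"</?a>"), MORPHEME),
    (re.compile(r"</?p>"), SENTENCE),
]


def match_info(text, pos):
    """Return (level, end_index) if description information starts at pos."""
    for pattern, level in TABLE:
        m = pattern.match(text, pos)
        if m:
            return level, m.end()
    return None


def split_segments(text):
    """Level >= SENTENCE: divide the document at the description information."""
    segments, start, pos = [], 0, 0
    while pos < len(text):
        hit = match_info(text, pos)
        if hit and hit[0] >= SENTENCE:
            if pos > start:
                segments.append(text[start:pos])
            start = pos = hit[1]
        elif hit:
            pos = hit[1]  # lower-level information stays inside the segment
        else:
            pos += 1
    if start < len(text):
        segments.append(text[start:])
    return segments


def lookup_keys(segment, start):
    """Candidate dictionary keys that begin at `start`."""
    keys, key, pos = [], "", start
    while pos < len(segment):
        hit = match_info(segment, pos)
        if hit is None:
            key += segment[pos]
            keys.append(key)
            pos += 1
        elif hit[0] <= CHARACTER:
            pos = hit[1]  # character level or lower: skip it
        else:
            break         # morpheme level or higher: do not look up across it
    return keys


for seg in split_segments("<p>東<b>京</b>都</p><p>大阪</p>"):
    print(seg, lookup_keys(seg, 0))
```

Because the tags are never removed from the segments, character offsets in the analysis result still refer to the original document, which is what makes it possible to relate each morpheme to the surrounding tags afterwards.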

【００４７】[0047]

[Effects of the Invention] As described above, the morphological analysis system according to the present invention provides the following effects.
・Morphological analysis of a document can be performed while the document description information (that is, tags and control characters) in the document is preserved as it is.
・The positional relationship between the morphological analysis result and the document description information (that is, tags and control characters) can be easily known.
・The analysis rate of morphological analysis can be improved by using the document description information (that is, tags and control characters).

[Brief description of the drawings]

【図１】 FIG. 1 is a functional block diagram showing one embodiment of the morphological analysis system according to the present invention.

[Explanation of symbols]

10: morphological analysis unit, 11: sentence division unit, 12: dictionary lookup range determination unit, 13: dictionary lookup unit, 14: optimal path determination unit, 20: language unit level determination unit, 21: language unit level comparison unit, 22: document description information table, 23: language unit level table, 30: morphological analysis dictionary.

Claims

[Claims]

1. A morphological analysis system that has at least a morphological analysis dictionary, performs morphological analysis of an input document by looking up, in the morphological analysis dictionary, words that match words contained in the input document, and outputs an optimal morphological analysis result, wherein, when the input document contains document description information, the system identifies the language unit level of the document description information and changes the operation of morphological analysis on the input document according to the identified language unit level.

2. The morphological analysis system according to claim 1, wherein, when the identified language unit level of the document description information is a low level at or below the character level, the system skips the document description information when looking up the morphological analysis dictionary and extracts the words to be looked up.

3. The morphological analysis system according to claim 1, wherein, when the identified language unit level of the document description information is at or above the morpheme level, the system does not look up the morphological analysis dictionary with a character string that spans the document description information.

4. The morphological analysis system according to claim 1, wherein, when the identified language unit level of the document description information is at or above the sentence level, the system divides the document at the document description information and performs morphological analysis on each resulting segment.

5. The morphological analysis system according to claim 1, wherein the document description information is an SGML, XML, and/or HTML tag.

6. The morphological analysis system according to claim 1, wherein the document description information is a layout control character including a line feed.

7. The morphological analysis system according to claim 1, wherein the document description information is a delimiter such as a period or a comma.

8. A morphological analysis method that uses at least a morphological analysis dictionary, performs morphological analysis of an input document by looking up, in the morphological analysis dictionary, words that match words contained in the input document, and outputs an optimal morphological analysis result, wherein, when the input document contains document description information, the language unit level of the document description information is identified and the operation of morphological analysis on the input document is changed according to the identified language unit level.

9. A computer-readable program recording medium on which is recorded a computer program that causes a computer to execute a morphological analysis method that uses at least a morphological analysis dictionary, performs morphological analysis of an input document by looking up, in the morphological analysis dictionary, words that match words contained in the input document, and outputs an optimal morphological analysis result, wherein, when the input document contains document description information, the language unit level of the document description information is identified and the operation of morphological analysis on the input document is changed according to the identified language unit level.