JP2001265763A

JP2001265763A - Morpheme analytic method and recording medium with recorded morpheme analytic program

Info

Publication number: JP2001265763A
Application number: JP2000082551A
Authority: JP
Inventors: Hisako Asano; 久子浅野; Hisashi Obara; 永小原
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2000-03-23
Filing date: 2000-03-23
Publication date: 2001-09-28
Anticipated expiration: 2020-03-23
Also published as: JP3379643B2

Abstract

PROBLEM TO BE SOLVED: To improve word recognition accuracy in morphological analysis. SOLUTION: In word dictionary search process 2-1, Japanese text is input, the word dictionary is searched using word dictionary 5 and grammar rules 4, all connectable word sequences are selected, and word chain candidate sequence 2-2 is created. In word selection process 2-3, when word information has been attached to the input Japanese text 1 or is registered in setting information 6, a word that satisfies the specified word information is selected while giving priority to the part-of-speech chains of the predefined grammar rules 4, or is newly generated if no such word exists. Thus, for a character string for which word information is specified, a word that satisfies the specified word information and has the most probable values for the unspecified word information is recognized as the first solution. The word for which word information was given is registered in setting information 6 if so specified, and Japanese word information sequence 3 is output.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the invention] The present invention relates to a Japanese morphological analysis method that segments Japanese text into words and assigns information such as readings and parts of speech.

[0002]

[Related art] Word recognition in Japanese morphological analysis generally proceeds as follows. A word dictionary and grammar rules are used to extract words that satisfy connectable part-of-speech chains; priorities are assigned to the part-of-speech chains and the candidates are narrowed down by part-of-speech chain; where homographs exist, the candidates are further narrowed down by word priority; and finally the word chain that becomes the first solution is extracted.

[0003] Established methods for assigning priorities include the minimum-bunsetsu-number method, which prefers chains with fewer bunsetsu (phrases) (Yoshimura et al., "Morphological analysis of unsegmented Japanese sentences using the minimum-bunsetsu-number method", Transactions of the Information Processing Society of Japan, Vol. 24, No. 1, 1983); the minimum cost method, which assigns costs to word adjacency rules and prefers the chain with the smallest total cost (Hisamitsu et al., "A proposal of morphological analysis by the minimum connection cost method and an evaluation of its computational complexity", IEICE Technical Report, NLC90-8, 1990); and techniques that combine them.

[0004] With these conventional methods, however, the word with the most common reading is selected as the first solution even when several readings are possible, so local proper nouns (for example, the full names of colleagues) may not be recognized correctly. For example, ordinary morphological analysis normally recognizes 「中島健司」 with its most common readings, 「中島（なかじま）健司（けんじ）」 (Nakajima Kenji), but the correct recognition may actually be 「中島（なかしま）健司（たけし）」 (Nakashima Takeshi).

[0005] In addition, many proper nouns share their written form with common nouns, and it is difficult to tell which is meant without considering the context (e.g. 「野原（＝地名）の町長」, "the mayor of Nohara (place name)", versus 「ススキの野原（＝一般名詞）」, "a field (common noun) of pampas grass"). Ordinary morphological analysis therefore generally fixes such a word to one of the two parts of speech.

[0006]

[Problems to be solved by the invention] Conventional Japanese morphological analysis methods cannot assign the correct reading to words with uncommon readings. When the analysis is used for text-to-speech synthesis, for example, this causes problems such as misread personal names.

[0007] Furthermore, a word whose part of speech is difficult to determine is fixed to one part of speech. For example, even when the text is known to describe a particular product, the product name is treated as a common noun.

[0008] An object of the present invention is to provide a morphological analysis method that improves word recognition accuracy.

[0009]

[Means for solving the problems] The present invention searches a word dictionary using the word dictionary and grammar rules, selects all connectable word sequences, and creates a word chain candidate sequence. Next, when word information has been attached to the input Japanese text or is registered in the setting information, a word that satisfies the specified word information is selected while giving priority to the part-of-speech chains of the predefined grammar rules, or is newly generated if no such word exists. As a result, for a character string for which word information is specified, a word that satisfies the specified word information and has the most probable values for the unspecified word information is recognized as the first solution. The word for which word information was given is registered in the setting information if so specified, and a Japanese word information sequence is output.

[0010]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings.

[0011] FIG. 1 is a flowchart of a morphological analysis method according to one embodiment of the present invention. Japanese text 1 is input, word dictionary search process 2-1 is performed to output word chain candidate sequence 2-2, and word selection process 2-3 is performed to output Japanese word information sequence 3. These steps use grammar rules 4, word dictionary 5 and setting information 6.

[0012] Japanese text 1 is arbitrary Japanese text, to which word information may have been attached. Word information here means the information that should be registered in word dictionary 5; it generally comprises many items, such as written form, part of speech, reading, semantic attributes and accent type. Because the written form is necessarily contained in the Japanese text 1 being analyzed, the attached word information consists of items other than the written form.

[0013] The pattern used to attach word information can be chosen freely through the word information specification pattern in setting information 6. In FIG. 2(a), for example, the character string to which word information is attached is enclosed in "｛｝", its part of speech in "＜＞", and its reading in "（）". When specifying word information, the target character string and at least one item of word information are mandatory, but not every item registered in the word information specification pattern needs to be given. In the example of FIG. 2, both the part of speech and the reading are specified for the target character string 「中島」, whereas only the reading is specified for 「健司」.

[0014] When word information is specified through such a character-string pattern, any character used by the pattern that is itself to be subjected to morphological analysis must be preceded by "＼". For example, to have 「欧州連合（ＥＵ）」 ("European Union (EU)") analyzed under the word information specification pattern of FIG. 2(a), the input Japanese text is written as 「欧州連合＼（ＥＵ＼）」.
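The inline pattern and its escape rule can be sketched as a small parser. This is only an illustration under assumptions: FIG. 2(a) is not reproduced in the text, so the order of the optional fields (part of speech before reading) and the names `Annotation` and `parse_annotated` are hypothetical, not taken from the patent.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ESCAPE = "＼"  # escape character named in the description


@dataclass
class Annotation:
    start: int               # offset of the target string in the plain text
    surface: str             # target character string
    pos: Optional[str]       # specified part of speech, if any
    reading: Optional[str]   # specified reading, if any


def parse_annotated(text: str) -> Tuple[str, List[Annotation]]:
    """Strip the markup and return (plain_text, annotations).

    Assumed layout: ｛target｝, then optional ＜part of speech＞,
    then optional （reading）.
    """
    plain: List[str] = []
    annotations: List[Annotation] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == ESCAPE and i + 1 < len(text):
            plain.append(text[i + 1])      # escaped pattern character stays literal
            i += 2
        elif ch == "｛":
            end = text.index("｝", i)
            surface = text[i + 1:end]
            i = end + 1
            pos = reading = None
            if i < len(text) and text[i] == "＜":
                end = text.index("＞", i)
                pos, i = text[i + 1:end], end + 1
            if i < len(text) and text[i] == "（":
                end = text.index("）", i)
                reading, i = text[i + 1:end], end + 1
            if pos is None and reading is None:
                raise ValueError("at least one item of word information is required")
            annotations.append(Annotation(len(plain), surface, pos, reading))
            plain.extend(surface)
        else:
            plain.append(ch)
            i += 1
    return "".join(plain), annotations
```

Calling `parse_annotated("欧州連合＼（ＥＵ＼）")` yields the plain text 「欧州連合（ＥＵ）」 with no annotations, while `｛健司｝（たけし）` yields one annotation carrying only a reading.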

[0015] FIG. 2(b) shows word information specified in XML. The target character string is the content of a <word> element, its reading is given by the yomi attribute of the <word> element, and its part of speech by the hinshi attribute. As in FIG. 2(a), it is sufficient to specify at least one attribute on the <word> element.
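The XML form could be read with a standard XML parser, as in the following sketch. The enclosing `<text>` root element, the sample sentence and the function name `parse_xml_annotated` are assumptions made for illustration; only the `<word>` element and its `yomi` and `hinshi` attributes come from the description.

```python
import xml.etree.ElementTree as ET


def parse_xml_annotated(xml_text):
    """Return (plain_text, annotations) for <word>-annotated input.

    Each annotation is (offset, surface, info), where info holds whichever
    of the yomi (reading) and hinshi (part of speech) attributes were given.
    """
    root = ET.fromstring(xml_text)
    plain = root.text or ""
    annotations = []
    for child in root:
        if child.tag == "word":
            surface = child.text or ""
            info = {k: child.get(k) for k in ("yomi", "hinshi")
                    if child.get(k) is not None}
            if not info:
                raise ValueError("<word> needs at least one attribute")
            annotations.append((len(plain), surface, info))
            plain += surface
        else:
            plain += "".join(child.itertext())  # other elements: keep their text
        plain += child.tail or ""
    return plain, annotations


# Hypothetical input; the <text> wrapper is not part of the patent.
SAMPLE = ('<text><word yomi="なかしま" hinshi="固有名詞：姓">中島</word>'
          '<word yomi="たけし">健司</word>さん</text>')
```

Keeping the attributes in a dictionary means a word with only `yomi` and a word with both attributes are handled by the same code path.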

[0016] Returning to FIG. 1, morphological analysis 2 is performed sentence by sentence.

[0017] In word dictionary search process 2-1, all connectable word chains described by grammar rules 4 are extracted from word dictionary 5, and word chain candidate sequence 2-2 is output.

[0018] Word chain candidate sequence 2-2 consists of part-of-speech chains and words. In FIG. 5, for example, the items joined by lines are part-of-speech chains (part-of-speech names are shown in <>), and each part-of-speech chain contains one or more words.

[0019] Next, word selection process 2-3 is described in detail with reference to FIG. 3.

[0020] In step 10, priorities are assigned to word chain candidate sequence 2-2, and the word chain with the highest priority (the highest-priority words within the highest-priority part-of-speech chain) is taken as the first solution.

[0021] In step 11, it is determined whether Japanese text 1 contains a character string to which word information has been given, or a word registered in the registered word information of setting information 6 (hereinafter both are called a "target character string"). If one exists, processing moves to step 12; if there are several target character strings, the processing from step 12 onward is performed on each of them in turn. If none exists, processing ends.

[0022] In step 12, it is determined whether a word that exactly matches the target character string exists in the first-priority part-of-speech chain obtained in step 10. If so, processing moves to step 16; if not, to step 13.

[0023] "Exact match" is explained more specifically here.

[0024] "Exact match" covers not only a word whose boundaries coincide exactly with those of the target character string and which satisfies all of the specified word information, but also a word whose boundaries differ from those of the target character string yet which satisfies all of the specified word information.

[0025] For example, suppose that 「｛山田｝（やまだ）太郎」 is input as Japanese text 1 using the specification pattern of FIG. 2(a) (that is, only the reading of 「山田」 is specified), and that the word 「山田太郎」 (reading: やまだたろう) is contained in the first part-of-speech chain. The reading of this word begins with 「やまだ」 and no other word information is specified, so the word is treated as an exact match.

[0026] Conversely, suppose that 「｛山田太郎｝（やまだたろう）」 is input as Japanese text 1 and that the word chain 「山田」 (reading: やまだ) + 「太郎」 (reading: たろう) in word chain candidate sequence 2-2 is contained in the first part-of-speech chain. The written forms of the two words together match the target character string, and so do their readings, so this word chain is treated as an exactly matching word (chain).
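The "exact match" test of step 12 can be sketched as follows. This is one possible reading of the definition, not the patent's implementation: the `Word` record and the function `exactly_matches` are hypothetical, and the two cross-boundary cases are handled only for a reading-only specification, because the description gives examples for that situation alone.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Word:
    start: int     # offset of the word in the sentence
    surface: str   # written form
    reading: str
    pos: str       # part of speech


def exactly_matches(chain: List[Word], start: int, surface: str,
                    reading: Optional[str] = None,
                    pos: Optional[str] = None) -> bool:
    """Check a target character string against the first part-of-speech chain."""
    end = start + len(surface)
    # Same boundaries: every specified item must agree.
    for w in chain:
        if w.start == start and w.surface == surface:
            if (reading is None or w.reading == reading) and \
               (pos is None or w.pos == pos):
                return True
    if pos is None and reading is not None:
        # A longer word starting at the target whose reading begins
        # with the specified reading (the 山田 / 山田太郎 case).
        for w in chain:
            if (w.start == start and len(w.surface) > len(surface)
                    and w.surface.startswith(surface)
                    and w.reading.startswith(reading)):
                return True
        # Several consecutive words that together span the target
        # (the 山田太郎 / 山田 + 太郎 case).
        inside = sorted((w for w in chain
                         if start <= w.start and w.start + len(w.surface) <= end),
                        key=lambda w: w.start)
        if (len(inside) > 1
                and "".join(w.surface for w in inside) == surface
                and "".join(w.reading for w in inside) == reading):
            return True
    return False
```

With a chain holding only 「中島」 read なかじま, a target 「中島」 with the reading なかしま is not an exact match, which is what sends processing on to step 13.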

[0027] In step 13, it is determined whether word chain candidate sequence 2-2 contains a word whose boundaries coincide with those of the target character string and, if a part of speech is specified, whose part of speech also matches. If so, processing moves to step 14; if not, to step 15.

[0028] In step 14, when a part of speech is specified for the target character string and there is a word that matches that part of speech and also satisfies the other specified word information, that word is selected. If the other specified word information does not match, the word is duplicated together with its word chain and the other specified word information is overwritten on the copy.

[0029] When no part of speech is specified for the target character string, the word belonging to the highest-priority word chain among the words found in step 13 is duplicated together with its part-of-speech chain, and the other specified word information is overwritten on the copy.

[0030] In step 15, when a part of speech is specified for the target character string, the part-of-speech chains that have word boundaries at the boundaries of the target character string are examined. Among those that can connect to the specified part of speech at those boundaries, the highest-priority chain is taken, and the part-of-speech chains before and after it are copied; the target character string is treated as a single word and filled with the specified word information. Unspecified word information is generated from the words of the part-of-speech chain used as the basis (if there are several, from the one with the highest word priority); for example, all their readings are concatenated. If no connectable chain exists, a chain with an unknown word is generated.

[0031] When no part of speech is specified, the highest-priority word chain among the part-of-speech chains that have word boundaries at the boundaries of the target character string is copied, and the target character string is treated as a single word and filled with the specified word information. Unspecified word information is generated from the information of the original words.

[0032] In step 16, the priority of the word (chain) handled in the preceding processing is raised so that it is selected in first place.

[0033] Returning to FIG. 1, grammar rules 4 describe the grammar used to generate part-of-speech chains and to assign priorities.

[0034] Word dictionary 5 registers the words recognized by morphological analysis together with their word information.

[0035] Setting information 6 consists of three parts: the word information specification pattern (see FIG. 2), which describes how target character strings and the word information given to them are specified in Japanese text 1; the word storage information, which indicates whether word information, once specified, is to be kept as registered word information; and the registered word information, which stores the written form of each word for which word information was specified together with the specified word information.

[0036] If the word storage information of setting information 6 is on in step 17, the word for which word information was specified is registered in the registered word information in step 18; if it is off, the word is not registered. Deleting word information from the registered word information cancels the specification.
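The registration behaviour of steps 17 and 18, and the lookup of registered words in later sentences (step 11), might be modelled as below. The class `SettingInfo` and its method names are hypothetical; the patent does not describe a concrete data structure.

```python
class SettingInfo:
    """Minimal stand-in for the word storage flag and registered word information."""

    def __init__(self, store_words=True):
        self.store_words = store_words   # word storage information (on/off)
        self.registered = {}             # written form -> specified word information

    def register(self, surface, info):
        """Steps 17-18: keep the specification only when storage is on."""
        if self.store_words:
            self.registered[surface] = dict(info)
            return True
        return False

    def unregister(self, surface):
        """Deleting the entry cancels the specification."""
        self.registered.pop(surface, None)

    def find_targets(self, sentence):
        """Step 11, registered-word part: list (offset, surface, info) hits."""
        targets = []
        for surface, info in self.registered.items():
            pos = sentence.find(surface)
            while pos != -1:
                targets.append((pos, surface, info))
                pos = sentence.find(surface, pos + len(surface))
        return sorted(targets, key=lambda t: t[0])
```

Once 「中島」 has been registered with the reading なかしま, any later sentence containing 「中島」 yields a target character string without further markup, until the entry is removed.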

[0037] Japanese word information sequence 3 is the word chain sequence finally selected. It may be narrowed down to the first solution only, or the top several word chain sequences may be presented.

[0038] FIG. 4 shows an example of Japanese text 1 that uses the word information specification pattern of FIG. 2(a).

[0039] Word selection process 2-3 is described below using the example of FIG. 4. Here, priorities are assigned first by the adjacency cost of parts of speech, and word cost is used to choose among several words with the same adjacency cost (for both adjacency cost and word cost, smaller values are preferred).
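The two-level priority used in this example (total adjacency cost first, word cost second) can be illustrated with a few lines of code. The data layout, the function name `first_solution` and every cost value are invented for the sketch; the patent gives no numeric costs, and the second candidate chain is purely hypothetical.

```python
def first_solution(pos_chains):
    """Step 10: take the part-of-speech chain with the smallest total
    adjacency cost, then the cheapest word in each of its slots."""
    best_chain = min(pos_chains, key=lambda chain: chain["adjacency_cost"])
    return [min(slot, key=lambda word: word["word_cost"])
            for slot in best_chain["slots"]]


# Made-up candidates for 「中島健司さん」.
CANDIDATES = [
    {"adjacency_cost": 12, "slots": [
        [{"surface": "中島", "pos": "固有名詞：姓", "reading": "なかじま", "word_cost": 10}],
        [{"surface": "健司", "pos": "固有名詞：名", "reading": "けんじ", "word_cost": 10},
         {"surface": "健司", "pos": "固有名詞：名", "reading": "たけし", "word_cost": 30}],
        [{"surface": "さん", "pos": "人名接尾辞", "reading": "さん", "word_cost": 5}],
    ]},
    {"adjacency_cost": 40, "slots": [
        [{"surface": "中", "pos": "名詞", "reading": "なか", "word_cost": 15}],
        [{"surface": "島", "pos": "名詞", "reading": "しま", "word_cost": 15}],
        [{"surface": "健司", "pos": "固有名詞：名", "reading": "けんじ", "word_cost": 10}],
        [{"surface": "さん", "pos": "人名接尾辞", "reading": "さん", "word_cost": 5}],
    ]},
]
```

Under these made-up costs the selection prefers the reading けんじ for 「健司」, which is the default that the specified word information later overrides.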

[0040] Grammar rules 4 are assumed to state that "proper noun: place name" cannot connect to "proper noun: given name", whereas a symbol can connect to a proper noun and a proper noun to a symbol. Connection to an unknown word is allowed only when no other connectable word exists.

[0041] The word storage information of setting information 6 is assumed to be on.

[0042] The first sentence contains no target character string specification, so words are selected by the ordinary priority assignment of step 10 and processing ends.

[0043] The second sentence contains two target character string specifications: the reading 「なかしま」 for 「中島」 and the reading 「たけし」 for 「健司」. FIG. 5(a) shows word chain candidate sequence 2-2 for this case. The flow of FIG. 3 is explained in detail below, taking 「中島」 as the example.

[0044] In step 10, the solid-line part-of-speech chain (the first part-of-speech chain), whose total adjacency cost is small, is followed, and for 「健司」 the word with the lower word cost, read 「けんじ」, is chosen. Priority is thus assigned so that the first solution is: [space] (symbol) / 中島 (proper noun: family name; reading なかじま) / 健司 (proper noun: given name; reading けんじ) / さん (personal-name suffix; reading さん).

[0045] Processing moves from step 11 to step 12. The only word 「中島」 in the first part-of-speech chain is the one read 「なかじま」 (there is none read 「なかしま」), so processing moves to step 13.

[0046] In step 13, no part of speech is specified for the target character string 「中島」 (so the part-of-speech condition is satisfied) and a word 「中島」 exists, so processing moves to step 14.

[0047] In step 14, the highest-priority word chain is 「中島」 (reading: なかじま; part of speech: proper noun: family name), so this word is duplicated, its reading is replaced with the specified 「なかしま」, and processing moves to step 16 (see FIG. 6(a)).

[0048] In step 16, 中島 (proper noun: family name; reading なかしま) is made the first solution, and processing moves to step 17.

[0049] Processing moves from step 17 to step 18, and 「中島」 (reading: なかしま) is registered in the registered word information.

[0050] For 「健司」, a word that exactly matches the given reading 「たけし」 (「健司」, reading: たけし) exists in the first part-of-speech chain at step 12, so it is selected and processing moves to step 16.

[0051] In step 16, 「健司」 (reading: たけし) is selected as the first-priority word in place of 「健司」 (reading: けんじ).

[0052] Finally, the following word chain is output as the first solution: [space] (symbol) / 中島 (proper noun: family name; reading なかしま) / 健司 (proper noun: given name; reading たけし) / さん (personal-name suffix; reading さん).
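The handling of the two readings in this walk-through (an existing word is promoted for 「健司」, a copy with an overwritten reading is created for 「中島」) can be condensed into one helper. It is a simplified sketch covering only a reading specification on words that share the target's boundaries; `Word`, `apply_specified_reading` and the cost values are assumptions, not part of the patent.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Word:
    surface: str
    pos: str
    reading: str
    cost: int   # word cost; smaller is preferred (values are made up)


def apply_specified_reading(slot: List[Word], reading: str) -> List[Word]:
    """Return the alternatives for one position with the word that
    satisfies the specified reading moved to first place.

    slot: alternative words sharing the boundaries of the target string.
    """
    matching = [w for w in slot if w.reading == reading]
    if matching:
        # Step 12: a word with the specified reading already exists.
        chosen = min(matching, key=lambda w: w.cost)
        rest = [w for w in slot if w is not chosen]
    else:
        # Step 14: duplicate the highest-priority word and overwrite its reading.
        best = min(slot, key=lambda w: w.cost)
        chosen = replace(best, reading=reading)
        rest = list(slot)
    # Step 16: the chosen word becomes the first solution.
    return [chosen] + rest
```

The original words stay in the list behind the promoted one, so lower-ranked solutions remain available if more than the first solution is presented.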

[0053] In the third sentence, the sentence-initial 「中島」 matches a word registered in the registered word information, so the word given the reading 「なかしま」 becomes the first solution by the same flow as in the second sentence.

[0054] The flow of FIG. 3 is next explained in detail for 「インタースペース」, for which a part of speech is specified. FIG. 5(b) shows word chain candidate sequence 2-2 for this portion.

[0055] In step 10, the solid-line part-of-speech chain with the small total adjacency cost becomes the first solution (it is also the word chain, because every position holds a single word).

[0056] Processing moves from step 11 to step 12. The first part-of-speech chain contains no word that exactly matches the part of speech given to 「インタースペース」, namely proper noun, so processing moves to step 13.

[0057] In step 13, there is no word 「インタースペース」 with the given part of speech, proper noun, so processing moves to step 15.

[0058] In step 15, the front boundary of 「インター」 (prefix) and the rear boundary of 「スペース」 (noun), which form the word chain of the first solution, coincide with the boundaries before and after the target character string 「インタースペース」, and grammar rules 4 allow a symbol (「) to connect to a proper noun and a proper noun to connect to a symbol (」). The part-of-speech chain of the first solution is therefore copied and connected to the new word 「インタースペース」 (FIG. 6(b)). The reading of 「インタースペース」 is obtained by joining the readings of 「インター」 (prefix) and 「スペース」 (noun), giving 「いんたーすぺーす」.
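The generation of the new word in step 15 can be sketched as follows. The connection table `ALLOWED` contains only the two rules stated for this example, and `Word` and `generate_word` are hypothetical names; returning `None` when the part of speech cannot connect stands in for the unknown-word fallback, which is not modelled here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Word:
    surface: str
    pos: str
    reading: str


# Connections assumed allowed in the example: symbol -> proper noun -> symbol.
ALLOWED = {("記号", "固有名詞"), ("固有名詞", "記号")}


def generate_word(surface: str, pos: str, base_words: List[Word],
                  left_pos: str, right_pos: str,
                  reading: Optional[str] = None) -> Optional[Word]:
    """Build one new word spanning the target character string.

    base_words are the first-solution words covering the same span; an
    unspecified reading is produced by concatenating their readings.
    """
    if "".join(w.surface for w in base_words) != surface:
        raise ValueError("base words do not span the target character string")
    if (left_pos, pos) not in ALLOWED or (pos, right_pos) not in ALLOWED:
        return None   # not connectable; unknown-word handling is out of scope
    if reading is None:
        reading = "".join(w.reading for w in base_words)
    return Word(surface, pos, reading)
```

For the example, the base words 「インター」 (いんたー) and 「スペース」 (すぺーす) between two symbols give a proper noun 「インタースペース」 read いんたーすぺーす.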

[0059] In step 16, the previous first part-of-speech chain is replaced, where the bold-line part-of-speech chain has been created, by that bold-line chain, and インタースペース (proper noun; reading いんたーすぺーす) becomes the first solution.

[0060] Processing moves from step 17 to step 18, and 「インタースペース」 (part of speech: proper noun) is registered in the registered word information.

[0061] The fourth sentence contains no target character string specification, so words are selected by the ordinary priority assignment of step 10 and processing ends.

[0062] FIG. 7 shows the configuration used when the document analysis method of the present invention is carried out on a computer such as a personal computer.

[0063] Input device 21 is an input device, such as a keyboard, for entering Japanese text 1. Storage devices 22, 23 and 24 store grammar rules 4, word dictionary 5 and setting information 6, respectively. Storage device 25 is a hard disk. Output device 26 is an output device, such as a printer or display, to which the Japanese word information sequence is output. Recording medium 27 is a recording medium, such as a floppy disk, CD-ROM or magneto-optical disk, on which a document analysis program comprising word dictionary search 2-1 and word selection 2-3 is recorded. Data processing device 28 includes a CPU and various interfaces; it reads the document analysis program from recording medium 27 and executes it.

[Effects of the invention] As described above, the present invention performs morphological analysis that allows word information such as reading, part of speech and accent type to be specified for some words in the input text. A word that satisfies the specified word information is selected while giving priority to the part-of-speech chains of the predefined grammar rules, or is newly generated if none exists, so that a word satisfying the specified word information and having the most probable values for the unspecified word information is recognized as the first solution. A word, once specified, is retained, and words that satisfy the specification continue to be recognized until it is cancelled. Actively supplying known information in this way improves word recognition accuracy.

[Brief description of the drawings]

FIG. 1 is a diagram showing a morphological analysis method according to an embodiment of the present invention.

FIG. 2 is a diagram showing examples of word information specification patterns and corresponding examples of Japanese text.

FIG. 3 is a flowchart showing word selection process 2-3.

FIG. 4 is a diagram showing an example of Japanese text.

FIG. 5 is a diagram showing examples of word chain candidate sequences and the priorities given by ordinary priority assignment.

FIG. 6 is a diagram showing examples of adding generated words to a word chain candidate sequence.

FIG. 7 is a block diagram of an apparatus for carrying out the morphological search method of the present invention.

[Explanation of symbols]

1: Japanese text; 2: morphological analysis; 2-1: word dictionary search process; 2-2: word chain candidate sequence; 2-3: word selection process; 3: Japanese word information sequence; 4: grammar rules; 5: word dictionary; 6: setting information; 10 to 18: steps; 21: input device; 22 to 25: storage devices; 26: output device; 27: recording medium; 28: data processing device

Claims

[Claims]

1. A morphological analysis method for dividing Japanese text into words and adding word information, the method comprising: a word dictionary search step of receiving the Japanese text, extracting all connectable word chains described by grammar rules, which describe the grammar for generating part-of-speech chains and assigning priorities, from a word dictionary in which the words recognized by morphological analysis and their word information are registered, and creating and outputting a word chain candidate sequence; and a word selecting step of, when word information has been attached to a character string in the input Japanese text, or the character string is registered in setting information that describes the target character strings to which word information is attached in Japanese text and the method of specifying the word information to be given, selecting a word that satisfies the specified word information while giving priority to the part-of-speech chains of the predefined grammar rules, or newly generating such a word if none exists, thereby recognizing as the first solution, for the character string for which word information is specified, a word that satisfies the specified word information and has the most probable values for the unspecified word information, registering the word for which word information was given in the setting information if so specified, and outputting a Japanese word information sequence.

2. The method according to claim 1, wherein the word selecting step stores a character string for which word information has once been specified in the input Japanese text when the same specification is to be applied in subsequent text, and deletes the stored information when the specification is to be cancelled.

3. The method according to claim 1, wherein the word selecting step includes:
a first step of assigning priorities to the word chain candidate string and taking the chain with the highest priority as the first solution;
a second step of determining whether there is a target character string, that is, a character string to which word information has been added or a word registered in the registered word information of the setting information;
a third step of determining whether a word that exactly matches the target character string exists in the part-of-speech chain with the highest priority;
a fourth step of determining, when no word exactly matches the target character string, whether the word chain candidate string contains a word whose word boundaries match those of the target character string and, when a part of speech is specified, whose part of speech also matches;
a fifth step of, when there is a word whose word boundaries match those of the target character string: if a part of speech is specified for the target character string and a word with the matching part of speech also satisfies the other specified word information, selecting that word; if the other specified word information differs, duplicating that word together with its word chain and overwriting the other specified word information; and if no part of speech is specified for the target character string, duplicating the word of the highest-priority word chain found in the fourth step together with its part-of-speech chain and overwriting the other specified word information;
a sixth step of: if a part of speech is specified, duplicating, from among the part-of-speech chains that have the boundaries of the target word string as word boundaries, the preceding and following part-of-speech chains of the highest-priority chain that can connect to the specified part of speech at those boundaries, treating the target word string as one word filled in with the specified word information, generating the unspecified word information from the information of the word of the original part-of-speech chain (the word with the highest priority if there are several), and generating a chain with an unknown word if there is no connectable word; and if no part of speech is specified, duplicating the highest-priority word sequence from among the part-of-speech chains that have the boundaries of the target word string as word boundaries, treating the target word string as one word filled in with the specified word information, and generating the unspecified word information from the information of the original word; and
a step of raising the priority of the words handled in the third, fifth, and sixth steps so that they are selected as the first choice.
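
The third to sixth steps of claim 3 can be sketched in simplified form as follows. This is an illustration under several assumptions, not the claimed procedure itself: candidate chains are assumed to arrive already ordered from highest to lowest priority, the sixth step is assumed to apply when the fourth step finds no boundary-matching word, connectability checks and unknown-word generation are omitted, and a fused word is assumed to inherit its reading by concatenation and its part of speech from the last fused word. The names `Word`, `select_chain`, `overwrite`, and `merge_span` are hypothetical.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Word:
    surface: str
    pos: str
    reading: str
    start: int  # character offset of the word in the text

    @property
    def end(self) -> int:
        return self.start + len(self.surface)


Chain = List[Word]


def satisfies(word: Word, info: Dict[str, str]) -> bool:
    """True if the word carries every designated piece of word information."""
    return all(getattr(word, key) == value for key, value in info.items())


def overwrite(chain: Chain, word: Word, info: Dict[str, str]) -> Chain:
    """Duplicate `chain`, replacing `word` with a copy carrying `info`."""
    return [replace(w, **info) if w is word else w for w in chain]


def merge_span(chain: Chain, start: int, end: int,
               info: Dict[str, str]) -> Optional[Chain]:
    """Fuse the words covering [start, end) into one newly generated word."""
    inside = [w for w in chain if start <= w.start and w.end <= end]
    if not inside or inside[0].start != start or inside[-1].end != end:
        return None  # this chain has no word boundaries at the target span
    fused = Word(
        surface="".join(w.surface for w in inside),
        pos=inside[-1].pos,                          # assumption: last word's POS
        reading="".join(w.reading for w in inside),  # assumption: concatenation
        start=start,
    )
    fused = replace(fused, **info)  # designated information wins
    before = [w for w in chain if w.end <= start]
    after = [w for w in chain if w.start >= end]
    return before + [fused] + after


def select_chain(candidates: List[Chain], start: int, end: int,
                 info: Dict[str, str]) -> Chain:
    """`candidates` must be ordered from highest to lowest priority."""
    def spans(w: Word) -> bool:
        return w.start == start and w.end == end

    # Third step: an exact match already in the highest-priority chain.
    if any(spans(w) and satisfies(w, info) for w in candidates[0]):
        return candidates[0]
    # Fourth step: boundary-matching words (and POS, if one is designated).
    matches = [(chain, w) for chain in candidates for w in chain
               if spans(w) and ("pos" not in info or w.pos == info["pos"])]
    if matches:
        # Fifth step: select a fully matching word, else duplicate and overwrite.
        if "pos" in info:
            for chain, w in matches:
                if satisfies(w, info):
                    return chain
        chain, w = matches[0]
        return overwrite(chain, w, info)
    # Sixth step (simplified): generate a new word from the best usable chain.
    for chain in candidates:
        merged = merge_span(chain, start, end, info)
        if merged is not None:
            return merged
    raise ValueError("no candidate chain has word boundaries at the target span")
```

In this sketch the chain that is returned stands in for the chain whose priority would be raised so that it is selected as the first choice.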

4. The method according to claim 3, wherein the word selecting step further comprises a step of registering a word for which the word information is specified in the registered word information when the word storage information of the setting information is on.

5. A recording medium storing a morphological analysis program that divides a Japanese text into words and adds word information, the program causing a computer to execute: a word dictionary search process of receiving the Japanese text and, using grammar rules that describe how part-of-speech chains are generated and assigned priorities and a word dictionary in which the words identified by morphological analysis and their word information are registered, extracting all connectable word chains from the word dictionary and creating and outputting a word chain candidate string; and a word selection process of, for a character string to which word information has been added in the input Japanese text, or which is registered in setting information describing the target character strings to which word information is to be added in Japanese text and how the word information to be added is specified, selecting a word that satisfies the specified word information while taking into account the priorities of the part-of-speech chains of the grammar rules prescribed in advance, or generating a new word if no such word exists, so that, for a character string whose word information is specified, a word that satisfies the specified word information and has the most probable information for the unspecified word information is recognized as the first solution, registering the word to which the word information has been added in the setting information in accordance with the specification, and outputting a Japanese word information sequence.

6. The recording medium according to claim 5, wherein the word selection process includes a procedure of saving the information on a character string whose word information has once been specified in the input Japanese text when the same specification is to be applied in the subsequent text, and deleting the information when the specification is canceled.

7. The recording medium according to claim 5, wherein the word selection process includes:
a first procedure of assigning priorities to the word chain candidate string and taking the chain with the highest priority as the first solution;
a second procedure of determining whether there is a target character string, that is, a character string to which word information has been added or a word registered in the registered word information of the setting information;
a third procedure of determining whether a word that exactly matches the target character string exists in the part-of-speech chain with the highest priority;
a fourth procedure of determining, when no word exactly matches the target character string, whether the word chain candidate string contains a word whose word boundaries match those of the target character string and, when a part of speech is specified, whose part of speech also matches;
a fifth procedure of, when there is a word whose word boundaries match those of the target character string: if a part of speech is specified for the target character string and a word with the matching part of speech also satisfies the other specified word information, selecting that word; if the other specified word information differs, duplicating that word together with its word chain and overwriting the other specified word information; and if no part of speech is specified for the target character string, duplicating the word of the highest-priority word chain found in the fourth procedure together with its part-of-speech chain and overwriting the other specified word information;
a sixth procedure of: if a part of speech is specified, duplicating, from among the part-of-speech chains that have the boundaries of the target word string as word boundaries, the preceding and following part-of-speech chains of the highest-priority chain that can connect to the specified part of speech at those boundaries, treating the target word string as one word filled in with the specified word information, generating the unspecified word information from the information of the word of the original part-of-speech chain (the word with the highest priority if there are several), and generating a chain with an unknown word if there is no connectable word; and if no part of speech is specified, duplicating the highest-priority word sequence from among the part-of-speech chains that have the boundaries of the target word string as word boundaries, treating the target word string as one word filled in with the specified word information, and generating the unspecified word information from the information of the original word; and
a procedure of raising the priority of the words handled in the third, fifth, and sixth procedures so that they are selected as the first choice.

8. The recording medium according to claim 7, wherein the word selection process further includes a procedure of registering a word for which word information is specified in the registered word information when the word storage information of the setting information is on.