JPH10198676A

JPH10198676A - Device and method for Japanese morpheme analysis

Info

Publication number: JPH10198676A
Application number: JP9003462A
Authority: JP
Inventors: Hitomi Kinoshita (木下 ひとみ)
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1997-01-13
Filing date: 1997-01-13
Publication date: 1998-07-31

Abstract

PROBLEM TO BE SOLVED: To resolve the ambiguity of kana-only sentences, which are highly ambiguous, by displaying the candidates whenever ambiguity arises in token division or kanji conversion and having the user indicate the correct one among them. SOLUTION: A mixed kanji-kana candidate sentence display unit 10 shows the user the structure built by a token list creation unit 9. A user instruction unit 11 has the user indicate the correct answer among the candidates displayed by a token candidate display unit 8 and the mixed kanji-kana candidate sentence display unit 10. A morphological analysis control unit 12 controls the mixed kanji-kana candidate sentence display unit 10 and the user instruction unit 11 and outputs the morphological information of the sentence entered through an input unit 1. A storage unit 14 stores the sentence entered through the input unit 1, the search results of a dictionary search unit 3, the token data produced by a token division unit 4, the data created by a kanji conversion unit 7 and the token list creation unit 9, the correct-answer information indicated through the user instruction unit 11, and the analysis results of the morphological analysis control unit 12.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a Japanese morphological analysis device and a Japanese morphological analysis method that output morphological information for a Japanese sentence input as a character string.

[0002]

[Description of the Related Art] Processing Japanese, as in the kana-kanji conversion of a word processor or in machine translation, first requires morphological analysis. Morphological analysis usually divides a character string into morphemes (the smallest units that carry meaning, hereinafter called tokens) while searching a dictionary that stores vocabulary information keyed by word (token division), and attaches morphological information (part of speech, inflection, and so on) to each token. Methods for morphological analysis include the minimum bunsetsu (phrase unit) count method, the leftmost longest match method, and the minimum cost method, and these methods are used to resolve ambiguity. None of them is perfect, however, and each can lead to a misinterpretation. This is especially pronounced for text written in kana.

[0003] For example, consider machine translation of the Japanese input sentence 「かれがくるまでまつ。」 (kare ga kurumade matsu) into English. Two English sentences can be obtained from this input:
1. He waits in a car.
2. I wait until he comes.
Written as mixed kanji-kana sentences, these are
1. 彼が車で待つ。 (He waits in a car.)

[0004] 2. 彼が来るまで待つ。 (I wait until he comes.)
Kana-kanji conversion therefore admits two interpretations. Evaluating the two sentences with the three methods mentioned above gives the following:
Minimum bunsetsu (phrase unit) count method: the count is 4 for both sentences, so they tie.
Leftmost longest match method: 「彼が車で待つ。」
Minimum connection cost method: the result depends on how costs are assigned. If "noun + particle > verb + particle", the result is 「彼が来るまで待つ。」; if "noun + particle < verb + particle", it is 「彼が車で待つ。」
Whichever method is adopted, it relies heavily on heuristics, and misinterpretation is unavoidable when processing natural language, which can express a wide variety of situations.
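As an illustration of why the leftmost longest match method commits to the "car" reading, here is a minimal Python sketch. The toy dictionary and the function name `longest_match_split` are assumptions made for this example; they are not taken from the patent.

```python
# Toy dictionary of kana readings. The entries are illustrative assumptions,
# not the dictionary described in the patent.
TOY_DICTIONARY = {"かれ", "が", "くる", "くるま", "まで", "で", "まつ", "。"}


def longest_match_split(text, dictionary):
    """Greedy leftmost longest match: at each position take the longest entry."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest substring first and shrink until an entry matches.
        for end in range(len(text), pos, -1):
            if text[pos:end] in dictionary:
                tokens.append(text[pos:end])
                pos = end
                break
        else:
            raise ValueError(f"no dictionary entry matches at position {pos}")
    return tokens


print(longest_match_split("かれがくるまでまつ。", TOY_DICTIONARY))
# ['かれ', 'が', 'くるま', 'で', 'まつ', '。']
```

Because 「くるま」 is longer than 「くる」, the greedy rule never considers the split 「くる」 + 「まで」, which is the reading 「彼が来るまで待つ。」.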

[0005]

[Problems to be Solved by the Invention] With such conventional methods, an English sentence can be determined uniquely when the input is written in mixed kanji and kana. For solid-written input (kana-only text written without word breaks), however, it is difficult to arrive at the correct answer unless the surrounding context is used. Moreover, context-understanding technology has not yet reached a practical level.

[0006] An object of the present invention is to solve the above problems and to provide a Japanese morphological analysis device and a Japanese morphological analysis method that can obtain the correct interpretation even for solid-written text with multiple interpretations (kanji conversion candidates).

[0007]

[Means for Solving the Problems] The Japanese morphological analysis device of the invention according to claim 1 comprises: input means for inputting a Japanese sentence as a character string; a dictionary group storing the readings of Japanese words, their kanji notation, part-of-speech information, and the vocabulary information needed for morphological analysis; token division means for dividing the character string input by the input means into tokens by referring to the dictionary group; kanji conversion means for converting a divided token into kanji when the token is in hiragana; display means for displaying the candidates when ambiguity arises in token division or kanji conversion; and instruction means for having the user indicate the correct answer among the displayed candidates.

[0008] This configuration makes it possible to realize a Japanese morphological analysis device and a Japanese morphological analysis method that can obtain the correct interpretation even for solid-written text with multiple interpretations (kanji conversion candidates).

[0009]

[Embodiments of the Invention] The invention of claim 1 comprises: input means for inputting a Japanese sentence as a character string; a dictionary group storing the readings of Japanese words, their kanji notation, part-of-speech information, and the vocabulary information needed for morphological analysis; token division means for dividing the character string input by the input means into tokens by referring to the dictionary group; kanji conversion means for converting a divided token into kanji when the token is in hiragana; display means for displaying the candidates when ambiguity arises in token division or kanji conversion; and instruction means for having the user indicate the correct answer among the displayed candidates. With this configuration, the ambiguity of kana-only sentences, which are highly ambiguous, can be resolved.

[0010] The invention of claim 2 comprises connection check means for examining the connection between a divided token and its neighboring tokens, and means for displaying to the user kanji conversion candidates that take this connection into account. With this configuration, only the kanji conversion candidates that connect correctly are displayed to the user.

[0011] The invention of claim 3 comprises means for displaying whole sentences as candidates, when ambiguity is to be shown to the user, after morphological analysis of the entire input sentence has finished. With this configuration, text made up of several sentences can be analyzed as a batch.

[0012] The invention of claim 4 comprises means for displaying candidates token by token, when ambiguity is to be shown to the user, at the moment the ambiguity arises. With this configuration, each ambiguity can be resolved as soon as it arises, and using that result in the subsequent morphological analysis makes the analysis efficient.

[0013] The invention of claim 5 includes: a step of inputting a Japanese sentence as a character string; a step of dividing the input character string into tokens by referring to a dictionary group that stores the readings of Japanese words, their kanji notation, part-of-speech information, and the vocabulary information needed for morphological analysis; a step of converting a divided token into kanji when the token is in hiragana; a step of displaying the candidates when ambiguity arises in token division or kanji conversion; and a step of having the user indicate the correct answer among the displayed candidates. With this configuration, the ambiguity of kana-only sentences, which are highly ambiguous, can be resolved.

[0014] The invention of claim 6 includes a step of examining the connection between a divided token and its neighboring tokens, and a step of displaying to the user kanji conversion candidates that take this connection into account. With this configuration, only the kanji conversion candidates that connect correctly are displayed to the user.

[0015] The invention of claim 7 includes a step of displaying whole sentences as candidates, when ambiguity is to be shown to the user, after morphological analysis of the entire input sentence has finished. With this configuration, text made up of several sentences can be analyzed as a batch.

[0016] The invention of claim 8 includes a step of displaying candidates token by token, when ambiguity is to be shown to the user, at the moment the ambiguity arises. With this configuration, each ambiguity can be resolved as soon as it arises, and using that result in the subsequent morphological analysis makes the analysis efficient.

[0017] (Embodiment) An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a functional block diagram of a Japanese morphological analysis device according to an embodiment of the present invention. FIG. 2 is a circuit block diagram of the device. FIG. 3 shows the flow of the sequential type of processing. FIG. 4 shows the flow of the dictionary search. FIG. 5 shows the flow of the batch type of processing. FIG. 6 shows an example of the dictionary data. FIG. 7 shows an example of the connection table. FIG. 8 shows a display example for the batch type.

[0018] In FIG. 1, reference numeral 1 denotes an input unit through which the user enters a solid-written sentence (a sentence in hiragana only). Reference numeral 2 denotes a dictionary group in which vocabulary information is registered with character strings as keys. FIG. 6 shows an example of the dictionary data used here. As FIG. 6 shows, the key strings are written in hiragana. The dictionary stores kanji notation information and the morpheme information for each kanji notation. The morpheme information consists of the part of speech, the conjugation type, the conjugated form, and the connection information; the kanji notation information and the morpheme information together are called vocabulary information. The datum that follows the key gives the number of entries with the same kana notation. Among entries with the same notation, however, only the first entry stores this count; the others store 0.
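The entry layout described above can be sketched as a small Python data structure. This is only an illustration: FIG. 6 is not reproduced in this text, so the class name `DictEntry`, the field names, the conjugation-type labels, and the ordering of the sample entries are assumptions. The connection values for 「くる」 and the preceding-connection value for 「くるま」 follow the worked example later in the description; the following-connection value for 「くるま」 is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictEntry:
    key: str        # hiragana key used for the search
    count: int      # entries sharing this key; 0 on all but the first of them
    kanji: str      # kanji notation
    pos: str        # part of speech
    conj_type: str  # conjugation type ("" if the word does not inflect)
    conj_form: str  # conjugated form ("" if the word does not inflect)
    prev_conn: int  # preceding-connection information (column index)
    next_conn: int  # following-connection information (row index)


SAMPLE_ENTRIES = [
    DictEntry("くる", 4, "繰る", "verb", "godan", "terminal", 2, 2),
    DictEntry("くる", 0, "繰る", "verb", "godan", "attributive", 2, 4),
    DictEntry("くる", 0, "来る", "verb", "ka-irregular", "terminal", 2, 2),
    DictEntry("くる", 0, "来る", "verb", "ka-irregular", "attributive", 2, 4),
    # next_conn for this entry is an assumed placeholder value.
    DictEntry("くるま", 1, "車", "noun", "", "", 1, 0),
]


def entries_with_key(entries, key):
    """Return the group of entries for `key`, using the count on its first entry.

    Assumes entries that share a key are stored contiguously.
    """
    for i, entry in enumerate(entries):
        if entry.key == key and entry.count > 0:
            return entries[i:i + entry.count]
    return []


print([e.kanji for e in entries_with_key(SAMPLE_ENTRIES, "くる")])
```

Storing the count only on the first entry lets a reader of the dictionary know how many same-key entries follow without scanning ahead.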

[0019] Reference numeral 3 denotes a dictionary search unit that searches the dictionary group 2 using a hiragana character string as the key. Reference numeral 4 denotes a token division unit that cuts out tokens using the results from the dictionary search unit 3. Reference numeral 5 denotes a connection table that defines whether adjacent tokens can connect. FIG. 7 shows an example of the connection table 5. As FIG. 7 shows, the connection table 5 is structured as an array. Its rows are called following-connection information (後接情報) and its columns preceding-connection information (前接情報). The dictionary registers each entry's preceding-connection information (a column index) and following-connection information (a row index) as its connection information.

[0020] Reference numeral 6 denotes a connection check unit that consults the connection table 5 to check whether a token cut out by the token division unit 4 can connect to the token preceding it. The connection table 5 is read as follows: if the cell where the following-connection information of the preceding token (the row) meets the preceding-connection information of the following token (the column) is 1, the adjacent tokens can connect; if it is 0, they cannot. Reference numeral 7 denotes a kanji conversion unit that converts every token found connectable by the connection check unit 6 into kanji notation by referring to its dictionary data. Reference numeral 8 denotes a token candidate display unit that shows the user the kanji candidates obtained by the kanji conversion unit 7. Reference numeral 9 denotes a token list creation unit that converts every token found connectable by the connection check unit 6 into kanji by referring to its dictionary data and builds a structure like the one shown in FIG. 8 (called a token list).
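The table lookup can be sketched in Python as follows. FIG. 7 is not reproduced in this text, so the table size and every cell other than the four stated in the worked examples later in the description (row 0 column 1, row 0 column 2, row 2 column 3, row 4 column 3) are assumed placeholders. The names `CONNECTION_TABLE` and `can_connect` are hypothetical.

```python
# 5 x 5 is an assumed size. Cells not mentioned in the description default to 0
# here purely as placeholders.
SIZE = 5
CONNECTION_TABLE = [[0] * SIZE for _ in range(SIZE)]
CONNECTION_TABLE[0][1] = 1  # row 0, column 1: connectable
CONNECTION_TABLE[0][2] = 1  # row 0, column 2: connectable
CONNECTION_TABLE[2][3] = 1  # row 2, column 3: connectable
CONNECTION_TABLE[4][3] = 0  # row 4, column 3: not connectable


def can_connect(prev_token_next_conn, next_token_prev_conn, table=CONNECTION_TABLE):
    """Row = following-connection info of the preceding token,
    column = preceding-connection info of the following token."""
    return table[prev_token_next_conn][next_token_prev_conn] == 1


print(can_connect(0, 2))  # particle が (row 0) followed by くる (column 2)
print(can_connect(4, 3))  # attributive くる (row 4) followed by まで (column 3)
```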

[0021] Reference numeral 10 denotes a mixed kanji-kana candidate sentence display unit that shows the user the structure built by the token list creation unit 9. Reference numeral 11 denotes a user instruction unit through which the user indicates the correct answer among the candidates displayed by the token candidate display unit 8 and the mixed kanji-kana candidate sentence display unit 10. Reference numeral 12 denotes a morphological analysis control unit that controls the token division unit 4, the connection check unit 6, the kanji conversion unit 7, the token candidate display unit 8, the token list creation unit 9, the mixed kanji-kana candidate sentence display unit 10, and the user instruction unit 11, and outputs the morphological information of the sentence entered through the input unit 1. Reference numeral 13 denotes a control unit that controls the input unit 1 and the morphological analysis control unit 12. Reference numeral 14 denotes a storage unit that stores the sentence entered through the input unit 1, the search results of the dictionary search unit 3, the token data produced by the token division unit 4, the data created by the kanji conversion unit 7 and the token list creation unit 9, the correct-answer information indicated by the user through the user instruction unit 11, and the analysis results of the morphological analysis control unit 12.

[0022] FIG. 2 is a circuit block diagram of the Japanese morphological analysis device. Reference numeral 21 denotes a keyboard (including a mouse), 22 a cathode ray tube display (hereinafter, CRT), 23 a central processing unit (hereinafter, CPU), 24 a random access memory (hereinafter, RAM), and 25 a read-only memory (hereinafter, ROM) that stores control programs and the like. The input unit 1 and the user instruction unit 11 are realized by the keyboard 21, the token candidate display unit 8 and the mixed kanji-kana candidate sentence display unit 10 by the CRT 22, and the storage unit 14 by the RAM 24. The connection table 5 is stored in the ROM 25, and the dictionary group 2 is stored in the RAM 24, the ROM 25, or a secondary storage device. The dictionary search unit 3, the token division unit 4, the connection check unit 6, the kanji conversion unit 7, the token list creation unit 9, the morphological analysis control unit 12, and the control unit 13 are realized by the CPU 23 executing the programs stored in the ROM 25 while exchanging data with the RAM 24 and the ROM 25.

[0023] The operation of the Japanese morphological analysis device of this embodiment, configured as described above, is explained below with reference to the flowcharts of FIGS. 3, 4, and 5.

[0024] FIG. 3 shows the flow of processing in which, each time ambiguity arises in token division or kanji conversion, the ambiguity is displayed and the user is asked to indicate the correct answer. First, in step S1, a Japanese sentence is entered through the input unit 1, one sentence at a time. Here the kana-only (solid-written) sentence 「かれがくるまでまつ。」 is assumed to have been entered.

[0025] In step S2, the variable pos, which indicates the position (character number) being processed in the sentence, is initialized. The first character, 「か」, has character number 0. In step S3, pos is checked against the number of input characters (10 here); if pos has not reached it, processing moves to step S4, and if it has, morphological analysis of the sentence ends.

[0026] In step S4, the dictionary is searched. FIG. 4 shows the dictionary search process. First, in step D1, the first dictionary entry is read into the variable dic, and 0 is stored in the variable dNum, which counts the dictionary entries found. In step D2, whether a dictionary entry exists is checked; if one exists, processing moves to step D3, and if not, the dictionary search ends. When the search ends, the dictionary search unit 3 returns the dictionary entries found and their number (dNum) to the token division unit 4.

[0027] In step D3, the length of the headword of the dictionary entry read into dic is obtained and stored in the variable len. In step D4, the len characters of the input string starting at character position pos are compared with the dictionary headword. In step D5, the result of the comparison in step D4 is checked; if they match, processing moves to step D6, and if not, to step D8.

[0028] In step D6, the matching dictionary entry is added to result, the area that holds the search results to be passed to the token division unit 4, and in step D7 the counter dNum is incremented by 1. In step D8, the next dictionary entry is read into dic, and processing returns to step D2.
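Steps D1 to D8 amount to a linear scan that keeps every entry whose headword matches the input at position pos. A minimal Python sketch follows, with the step labels as comments. The tuple layout of the toy dictionary and the name `search_dictionary` are assumptions for illustration; the variable names dic, result, and dNum (written `d_num` here) follow the description.

```python
# Toy dictionary: (headword, kanji notation, description). The layout and
# ordering are assumptions; the entries are those named in the worked example.
TOY_DICTIONARY = [
    ("か", "蚊", "noun"),
    ("か", "か", "particle"),
    ("かれ", "彼", "noun"),
    ("が", "が", "particle"),
    ("くる", "繰る", "verb, terminal form"),
    ("くる", "繰る", "verb, attributive form"),
    ("くる", "来る", "verb, terminal form"),
    ("くる", "来る", "verb, attributive form"),
    ("くるま", "車", "noun"),
]


def search_dictionary(text, pos, dictionary):
    """Return (result, d_num): all entries whose headword matches text at pos."""
    result = []
    d_num = 0                                   # D1: counter starts at 0
    for dic in dictionary:                      # D1/D8: read entries in order; D2: stop at the end
        headword = dic[0]
        length = len(headword)                  # D3: headword length
        if text[pos:pos + length] == headword:  # D4/D5: compare and branch
            result.append(dic)                  # D6: keep the matching entry
            d_num += 1                          # D7: count it
    return result, d_num


sentence = "かれがくるまでまつ。"
print(search_dictionary(sentence, 0, TOY_DICTIONARY)[1])  # 3
print(search_dictionary(sentence, 3, TOY_DICTIONARY)[1])  # 5
```

With this toy dictionary the counts agree with the description: three matches at position 0 and five at position 3.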

[0029] In the first dictionary search, pos is 0 and the dictionary headword is 「か」, which matches the one character 「か」 starting at character position 0 of the input sentence. Here the first through third entries of the dictionary match. In step S5, the result from the dictionary search unit 3 is checked to see whether any matching headword was found; if so, processing moves to step S6, and if not, an analysis error is reported in step S14 and morphological analysis ends.

[0030] In step S6, the connection with the preceding token is checked. At the start of a sentence, some parts of speech can begin the sentence and others cannot. The dictionary search has just returned three entries:
「か」 (蚊, mosquito): noun
「か」: particle
「かれ」 (彼, he): noun
Because a particle cannot begin a sentence, the candidates here are 蚊 and 彼. The connection check using the connection table is explained with pos = 3 as the example. At this point the immediately preceding token is the particle 「が」, whose following-connection information is 0. The dictionary search in step S4 found the following five entries, each of which is checked for connectability.

[0031] 1. 「くる」 (繰る, to wind): verb, terminal form; preceding-connection information: 2. Row 0, column 2 of the connection table is 1, so the token can connect.

[0032] 2. 「くる」 (繰る, to wind): verb, attributive form; preceding-connection information: 2. Row 0, column 2 of the connection table is 1, so the token can connect.

[0033] 3. 「くる」 (来る, to come): verb, terminal form; preceding-connection information: 2. Row 0, column 2 of the connection table is 1, so the token can connect.

[0034] 4. 「くる」 (来る, to come): verb, attributive form; preceding-connection information: 2. Row 0, column 2 of the connection table is 1, so the token can connect.

[0035] 5. 「くるま」 (車, car): noun; preceding-connection information: 1. Row 0, column 1 of the connection table is 1, so the token can connect.

[0036] In step S7, whether more than one token is connectable is checked; if so, processing moves to step S8, and if not, to step S10. Here all five are connectable, so all five candidates are displayed in step S8, and in step S9 the user indicates the correct one.

[0037] In step S10, whether exactly one token is connectable is checked; if so, processing moves to step S11, and if not (that is, none is connectable), an analysis error is reported in step S14 and morphological analysis ends. In step S11, the token indicated by the user, or the only token found connectable in the connection check, is stored in the storage unit 14 as the analysis result. In step S12, the following-connection information of the token stored as the analysis result is stored in the variable con, which holds the following-connection information of the preceding token.
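The branching in steps S7 to S11 and S14 can be summarized in a short Python sketch. The callback `ask_user` is a hypothetical stand-in for the token candidate display unit 8 and the user instruction unit 11, and `choose_token` is a name invented for this example.

```python
def choose_token(connectable, ask_user):
    """Pick the token to record as the analysis result for the current position.

    connectable: candidates that passed the connection check (step S6)
    ask_user:    hypothetical callback that shows the candidates and returns
                 the one the user indicates (steps S8 and S9)
    """
    if len(connectable) > 1:       # S7: several candidates, so ask the user
        return ask_user(connectable)
    if len(connectable) == 1:      # S10: exactly one, accept it as is (S11)
        return connectable[0]
    raise ValueError("analysis error: no connectable token")  # S14


# Usage: a scripted "user" that always picks the fifth candidate (index 4).
candidates = ["繰る (terminal)", "繰る (attributive)",
              "来る (terminal)", "来る (attributive)", "車"]
print(choose_token(candidates, lambda shown: shown[4]))
```

The user is consulted only when the connection check leaves more than one candidate, which keeps the number of prompts down.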

[0038] In step S13, pos is incremented by 1 and processing returns to step S3. FIG. 5 shows the flow of processing in which, after analysis of the whole sentence has finished, candidates are displayed sentence by sentence and the user is asked to indicate the correct answer. Steps T1 to T4 in FIG. 5 are the same as steps S1 to S4 in FIG. 3.

[0039] In step T5, the result of the dictionary search is used to check the connection between the tokens found and the preceding tokens. The check itself is the same as in step S6 of FIG. 3, but here a preceding token that cannot connect to any of the following tokens (the tokens found by the dictionary search) is deleted from the analysis result. For example, when pos is 5, there are four preceding tokens:
「くる」 (繰る): verb, terminal form; following-connection information: 2
「くる」 (繰る): verb, attributive form; following-connection information: 4
「くる」 (来る): verb, terminal form; following-connection information: 2
「くる」 (来る): verb, attributive form; following-connection information: 4
and one following token candidate:
「まで」: particle; preceding-connection information: 3
Checking each pair, row 2, column 3 of the connection table is 1 (connectable) and row 4, column 3 is 0 (not connectable). Of the four preceding tokens, the two attributive forms, 「くる」 (繰る) and 「くる」 (来る), both with following-connection information 4, are therefore deleted from the analysis result.
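The pruning in step T5 can be sketched in Python using the values from the example above. The dictionary-style token records, the table size, and the function name `prune_preceding` are assumptions for illustration; only the two table cells named in the text are meaningful, and all other cells are placeholders.

```python
def prune_preceding(preceding, following, table):
    """Keep only the preceding tokens that can connect to at least one
    following candidate (row = next_conn, column = prev_conn)."""
    return [
        p for p in preceding
        if any(table[p["next_conn"]][f["prev_conn"]] == 1 for f in following)
    ]


# Assumed 5 x 5 table; only the cells stated in the text are set.
table = [[0] * 5 for _ in range(5)]
table[2][3] = 1  # row 2, column 3: connectable
table[4][3] = 0  # row 4, column 3: not connectable

preceding = [
    {"kanji": "繰る", "form": "terminal", "next_conn": 2},
    {"kanji": "繰る", "form": "attributive", "next_conn": 4},
    {"kanji": "来る", "form": "terminal", "next_conn": 2},
    {"kanji": "来る", "form": "attributive", "next_conn": 4},
]
following = [{"kanji": "まで", "prev_conn": 3}]

for token in prune_preceding(preceding, following, table):
    print(token["kanji"], token["form"])
```

Only the two terminal forms survive, matching the outcome described for pos = 5.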

[0040] Next, in step T6, whether any connectable token exists is checked; if one exists, processing moves to step T7, and if not, an analysis error is reported in step T13 and morphological analysis ends. In step T7, the tokens found connectable are added to the token list shown in FIG. 8.

[0041] In steps T8 to T11, the following-connection information of the connectable tokens is stored in the array con. First, in step T8, 0 is stored in the counter i. Next, in step T9, whether i has exceeded the number of connectable tokens is checked; if not, processing moves to step T10, and if so, to step T12. In step T10, the following-connection information of the i-th connectable token is stored in the i-th element of con. In step T11, the counter i is incremented by 1 and processing returns to step T9. In step T12, the counter pos is incremented by 1 and processing returns to step T3. Finally, in step T14, the analysis result (the token list) is displayed and the user is asked to indicate the correct path.

[0042] Performing morphological analysis in this way yields the correct interpretation even for solid-written text with multiple interpretations (kanji conversion candidates).

[0043] Although the description here has been limited to kana-only sentences, the same technique can also be applied to the hiragana portions of a mixed kanji-kana sentence.

[0044]

[Effects of the Invention] As described above, the present invention makes it possible to realize a Japanese morphological analysis device and a Japanese morphological analysis method that can obtain the correct interpretation even for solid-written text with multiple interpretations (kanji conversion candidates).

[Brief description of the drawings]

【図１】本発明の一実施の形態における日本語形態素解
析装置の機能ブロック図FIG. 1 is a functional block diagram of a Japanese morphological analyzer according to an embodiment of the present invention.

【図２】本発明の一実施の形態における日本語形態素解
析装置の回路ブロック図FIG. 2 is a circuit block diagram of a Japanese morphological analyzer according to one embodiment of the present invention.

【図３】本発明の一実施の形態における日本語形態素解
析装置の逐次型の処理のフローチャートFIG. 3 is a flowchart of a sequential processing of the Japanese morphological analyzer according to the embodiment of the present invention;

【図４】本発明の一実施の形態における日本語形態素解
析装置の辞書検索のフローチャートFIG. 4 is a flowchart of the dictionary search performed by the Japanese morphological analyzer according to one embodiment of the present invention.

【図５】本発明の一実施の形態における日本語形態素解
析装置の一括型のフローチャートFIG. 5 is a flowchart of the batch-type processing of the Japanese morphological analyzer according to one embodiment of the present invention.

【図６】本発明の一実施の形態における日本語形態素解
析装置の辞書データの一例を示した図FIG. 6 is a diagram showing an example of dictionary data of the Japanese morphological analyzer according to the embodiment of the present invention.

【図７】本発明の一実施の形態における日本語形態素解
析装置の接続テーブルの一例を示した図FIG. 7 is a diagram showing an example of the connection table of the Japanese morphological analyzer according to one embodiment of the present invention.

【図８】本発明の一実施の形態における日本語形態素解
析装置の一括型の表示例を示した図FIG. 8 is a diagram showing a display example for the batch-type processing of the Japanese morphological analyzer according to one embodiment of the present invention.

[Explanation of symbols]

１入力部２辞書群３辞書検索部４トークン分割部５接続テーブル６接続チェック部７漢字変換部８トークン候補表示部９トークンリスト作成部１０漢字仮名混じり候補文表示部１１ユーザ指示部１２形態素解析制御部１３制御部１４記憶部２１キーボード２２ＣＲＴ２３ＣＰＵ２４ＲＡＭ２５ＲＯＭ

1 Input unit
2 Dictionary group
3 Dictionary search unit
4 Token division unit
5 Connection table
6 Connection check unit
7 Kanji conversion unit
8 Token candidate display unit
9 Token list creation unit
10 Kanji/kana mixed candidate sentence display unit
11 User instruction unit
12 Morphological analysis control unit
13 Control unit
14 Storage unit
21 Keyboard
22 CRT
23 CPU
24 RAM
25 ROM

Claims

[Claims]

1. A Japanese morphological analyzer comprising: input means for inputting a Japanese sentence as a character string; a dictionary group storing readings of Japanese words, kanji notations, part-of-speech information, and vocabulary information necessary for morphological analysis; token dividing means for dividing the character string input from the input means into tokens by referring to the dictionary group; kanji converting means for converting a divided token into kanji if the token is in hiragana; display means for displaying candidates when ambiguity arises in token division or kanji conversion; and instructing means for having a user indicate the correct answer from among the displayed candidates.

2. The Japanese morphological analyzer according to claim 1, further comprising: a connection check unit for checking the connection relationship between each divided token and the tokens before and after it; and a unit for displaying kanji conversion candidates to the user in consideration of the connection relationship.

3. The Japanese morphological analyzer according to claim 1, further comprising means for displaying candidate sentences after morphological analysis of the input sentence is completed, when presenting the ambiguity to the user.

4. The Japanese morphological analyzer according to claim 1, further comprising means for displaying candidates in units of tokens at the point where the ambiguity arises, when presenting the ambiguity to the user.

5. A Japanese morphological analysis method comprising: a step of inputting a Japanese sentence as a character string; a step of dividing the input character string into tokens by referring to a dictionary group storing readings of Japanese words, kanji notations, part-of-speech information, and vocabulary information necessary for morphological analysis; a step of converting a divided token into kanji if the token is in hiragana; a step of displaying candidates when ambiguity arises in token division or kanji conversion; and a step of having a user indicate the correct answer from among the displayed candidates.

6. The Japanese morphological analysis method according to claim 5, further comprising the steps of: checking the connection relationship between each divided token and the tokens before and after it; and displaying kanji conversion candidates to the user in consideration of the connection relationship.

7. The Japanese morphological analysis method according to claim 5, further comprising the step of displaying candidate sentences after morphological analysis of the input sentence is completed, when presenting the ambiguity to the user.

8. The Japanese morphological analysis method according to claim 5, further comprising the step of displaying candidates in units of tokens at the point where the ambiguity arises, when presenting the ambiguity to the user.