JPH09223141A

JPH09223141A - Device and method for analyzing Japanese sentences

Info

Publication number: JPH09223141A
Application number: JP8030339A
Authority: JP
Inventors: Tomoyuki Tada; 多田　　智之; Hidenobu Kaneoka; 秀信金岡; Toshihiro Fujinami; 稔弘藤並
Original assignee: Omron Corp; Omron Tateisi Electronics Co
Current assignee: Omron Corp
Priority date: 1996-02-19
Filing date: 1996-02-19
Publication date: 1997-08-26
Anticipated expiration: 2016-02-19
Also published as: JP3728789B2

Abstract

PROBLEM TO BE SOLVED: To provide a Japanese sentence analysis device and method capable of accurately detecting, within a character string that includes an unregistered word, only the characters that form the unregistered word, and of registering the word formed by the detected character string. SOLUTION: A morphological analysis part 3 divides an input character string into words by using a dictionary file 4 in which data indicating the character string of each word and the attributes of the word are registered. When the divided words include a word that is one character long, a word of a prescribed part of speech, or a word that is unlikely to form a compound word, a word candidate detection part 5 detects the character string joining that word to the adjacent preceding or succeeding word as an unregistered word (word candidate), and the word candidate is provisionally registered in the dictionary file 4. A word candidate verification part 7 then verifies the validity of the provisionally registered word candidate. A word candidate whose validity has been verified is formally registered by a word candidate formal registration part 10.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a Japanese sentence analysis device and a Japanese sentence analysis method that detect words contained in an input Japanese sentence that are not listed in a dictionary, in particular words written in katakana, and register the detected words.

[0002]

[Prior Art] Preprocessing for machine translation includes a process called morphological analysis. Put simply, morphological analysis uses a dictionary to split an input Japanese sentence into phrases and words. The problem is that if the input sentence contains a word that is not in the dictionary (hereinafter, an "unregistered word"), morphological analysis cannot be performed accurately, and machine translation consequently fails as well. To warn in advance, during preprocessing, of places where translation is likely to fail, unregistered words must therefore be detected accurately.

[0003] Unregistered words are also often new words that serve as keywords in a document. Techniques such as automatic keyword creation (index creation) for document retrieval therefore need to register unregistered words as keywords. In other words, to register as keywords those unregistered words that are absent from the dictionary, the unregistered words must be detected accurately.

[0004] Because unregistered words are in most cases katakana character strings, conventional detection has treated as the unregistered word the entire character string in which characters of the same type (katakana) are contiguous with the string for which dictionary lookup failed. For example, for the character string "インタラプタ" (interrupter), where "イン" and "タラ" are registered in the dictionary and "プタ" is not, one method detects "インタラプタ" as an unregistered word (see Proceedings of the 36th National Convention of the Information Processing Society of Japan (first half of 1988), pp. 1231-1232, "Pre-editing System for Japanese-English Machine Translation (2): Method of Detecting Morphological Ambiguity"). Another method, given the character string "ニューステーションホテル" (New Station Hotel), ignores the dictionary lookup results such as "ニュー", "ニュース", and "ホテル" and detects the entire katakana string "ニューステーションホテル" as an unknown word (an unregistered word in the terms of this application) (see Proceedings of the 47th National Convention of the Information Processing Society of Japan (second half of 1993), pp. 3-159 to 3-160, "Unknown Word Estimation Mechanism in Japanese Morphological Analysis Introducing a Selective Dictionary Lookup Mechanism").

[0005]

[Problems to be Solved by the Invention] With the methods described above, however, when the katakana character string is a compound word made up of several words, the compound word itself is detected as an unregistered word. For example, given the character string "ファイナンシャルシステム" (financial system), in which "ファイナンシャル" is an unregistered word joined to "システム" to form a compound, the dictionary lookup results for registered words such as "ファイ" and "システム" are ignored (both are assumed to be registered in the dictionary file), and the whole of "ファイナンシャルシステム" is detected as an unregistered word. Other compounds containing the unregistered word "ファイナンシャル", such as "ファイナンシャルバンキング", "ファイナンシャルセンター", "ファイナンシャルアドバイザー", and "ファイナンシャルプランナー", are likewise each detected as a separate unregistered word ("バンキング", "センター", "アドバイザー", "プランナー", and so on are assumed to be registered in the dictionary). As a result, the processing load in machine translation preprocessing increases, and redundant keywords (compound words) are created for document retrieval.

[0006] One could instead detect only the portion that does not match any registered word. With this approach, however, if part of the unregistered word to be detected happens to match a registered word, the matching part is cut away and an inappropriate character string is detected as the unregistered word. For the character string "ファイナンシャルシステム" in the example above, removing the parts that match the registered words "ファイ" and "システム" leaves "ナンシャル", a string that is not valid as a word, and this is detected as the unregistered word. A further problem is that when the character string of an unregistered word happens to coincide with a concatenation of several registered words, no unregistered word is detected at all. For example, if the three words "カリ", "マン", and "タン" are registered, the unregistered word "カリマンタン" (Kalimantan) is not detected.
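The failure modes of the two conventional approaches can be illustrated with a minimal Python sketch. It is not taken from the cited papers: the function names and the `(surface, is_registered)` tuple format are assumptions made for illustration, and the segmentations are written out by hand from the examples in the text.

```python
# Each segment is (surface, is_registered); the format is a hypothetical
# stand-in for the output of a dictionary-based morphological analyzer.

def whole_run_method(segments):
    """Conventional approach 1: if any lookup failed, report the whole katakana run."""
    if any(not registered for _, registered in segments):
        return ["".join(surface for surface, _ in segments)]
    return []

def unmatched_part_method(segments):
    """Conventional approach 2: report only the parts that matched no registered word."""
    return [surface for surface, registered in segments if not registered]

financial = [("ファイ", True), ("ナンシャル", False), ("システム", True)]
kalimantan = [("カリ", True), ("マン", True), ("タン", True)]

print(whole_run_method(financial))       # the whole compound is reported
print(unmatched_part_method(financial))  # only the invalid fragment is reported
print(whole_run_method(kalimantan))      # nothing is reported
```

Neither function recovers "ファイナンシャル" or "カリマンタン", which is the gap the invention addresses.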

[0007] An object of the present invention is to provide a Japanese sentence analysis device and a Japanese sentence analysis method that can accurately detect, from a katakana character string containing an unregistered word, only the character string that forms the unregistered word, and can register the detected unregistered word.

[0008] A further object of the present invention is to provide a Japanese sentence analysis device and a Japanese sentence analysis method that can verify the validity of a detected unregistered word and cancel an unregistered word that was detected and registered in error.

[0009]

[Means for Solving the Problems] The Japanese sentence analysis device according to claim 1 of the present invention comprises a dictionary file in which the character strings of words and data indicating the attributes of those words are registered, and morphological analysis means for performing morphological analysis that divides an input character string into words using the dictionary file. The device is characterized by further comprising: unregistered word detection means which, when a contiguous katakana character string in the input character string has been divided into a plurality of words by the morphological analysis and the divided words include a word whose character string length is one character, detects as an unregistered word the character string that joins that word to the adjacent word before or after it; and unregistered word registration means for registering the detected unregistered word.

[0010] In this configuration, the morphological analysis means performs morphological analysis that divides the input character string into words using the dictionary file. When, as a result, a contiguous katakana character string has been divided into a plurality of words and the divided words include a one-character word, the unregistered word detection means detects as an unregistered word the character string that joins this one-character word to the adjacent word before or after it. The unregistered word registration means then registers the detected unregistered word.

[0011] The Japanese sentence analysis device according to claim 2 of the present invention comprises a dictionary file in which the character strings of words and data indicating the attributes of those words are registered, and morphological analysis means for performing morphological analysis that divides an input character string into words using the dictionary file. The device is characterized by further comprising: unregistered word detection means which, when a contiguous katakana character string in the input character string has been divided into a plurality of words by the morphological analysis and the divided words include a word of a predetermined part of speech, detects as an unregistered word the character string that joins that word to the adjacent word before or after it; and unregistered word registration means for registering the detected unregistered word.

[0012] In this configuration, the morphological analysis means performs morphological analysis that divides the input character string into words using the dictionary file. When, as a result, a contiguous katakana character string has been divided into a plurality of words and the divided words include a word of a predetermined part of speech, the character string that joins that word to the adjacent word before or after it is detected as an unregistered word. The unregistered word registration means then registers the detected unregistered word.

[0013] The Japanese sentence analysis device according to claim 3 of the present invention comprises a dictionary file in which the character strings of words and data indicating the attributes of those words are registered, and morphological analysis means for performing morphological analysis that divides an input character string into words using the dictionary file. The data indicating the attributes includes data indicating whether the corresponding word is a word that is unlikely to form a compound word. The device is characterized by further comprising: unregistered word detection means which, when a contiguous katakana character string in the input character string has been divided into a plurality of words by the morphological analysis and the divided words include a word that is unlikely to form a compound word, detects as an unregistered word the character string that joins that word to the adjacent word before or after it; and unregistered word registration means for registering the detected unregistered word.

[0014] In this configuration, the morphological analysis means performs morphological analysis that divides the input character string into words using the dictionary file. When, as a result, a contiguous katakana character string has been divided into a plurality of words and the divided words include a word that is unlikely to form a compound word, the unregistered word detection means detects as an unregistered word the character string that joins that word to the adjacent word before or after it. The unregistered word registration means then registers the detected unregistered word.
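The three trigger conditions of claims 1 to 3 (a one-character word, a word of a predetermined part of speech, a word unlikely to form a compound) can be sketched together in Python. This is a sketch under stated assumptions rather than the claimed implementation: the `Word` class, the `rarely_compounds` flag, and the default `trigger_pos` set are hypothetical names and values, and joining on both sides whenever both neighbors exist is one reading of "before or after".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Word:
    surface: str                    # character string of the word
    pos: str = "noun"               # part of speech (attribute data)
    rarely_compounds: bool = False  # hypothetical flag: unlikely to form a compound

def is_trigger(word, trigger_pos):
    """True if the word meets any of the three conditions of claims 1 to 3."""
    return (len(word.surface) == 1
            or word.pos in trigger_pos
            or word.rarely_compounds)

def detect_candidates(words, trigger_pos=frozenset({"interjection"})):
    """Join each trigger word to its neighbors within one contiguous katakana run."""
    candidates = []
    for i, word in enumerate(words):
        if not is_trigger(word, trigger_pos):
            continue
        if i > 0:                    # neighbor before the trigger word
            candidates.append(words[i - 1].surface + word.surface)
        if i + 1 < len(words):       # neighbor after the trigger word
            candidates.append(word.surface + words[i + 1].surface)
    return candidates

run = [Word("イ"), Word("リオ"), Word("モテ")]
print(detect_candidates(run))
```

Keeping the three conditions in a single predicate mirrors how claims 4 and 5 combine them, although the text does not prescribe any particular code structure.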

[0015] The Japanese sentence analysis device according to claim 4 of the present invention is characterized in that the unregistered word detection means includes means which, when the words obtained by the morphological analysis include a word of a predetermined part of speech, detects as an unregistered word the character string that joins that word to the adjacent word before or after it.

[0016] This configuration combines claim 1 and claim 2.

[0017] The Japanese sentence analysis device according to claim 5 of the present invention is characterized in that the data indicating the attributes includes data indicating whether the corresponding word is a word that is unlikely to form a compound word, and in that the unregistered word detection means includes means which, when the words obtained by the morphological analysis include a word that is unlikely to form a compound word, detects as an unregistered word the character string that joins that word to the adjacent word before or after it.

[0018] This configuration combines two or more of claims 1, 2, and 3.

[0019] The Japanese sentence analysis device according to claim 6 of the present invention is characterized by comprising validity verification means which, when the words obtained by the morphological analysis include a word whose character string matches an unregistered word detected by the unregistered word detection means, verifies whether that unregistered word is valid as a word by comparing it against predetermined rules.

[0020] In this configuration, when the words obtained by the morphological analysis include a word whose character string matches an unregistered word that was detected by the unregistered word detection means and has been registered, the validity verification means verifies whether that unregistered word is valid as a word.

[0021] The Japanese sentence analysis methods of the present invention described in claims 7 to 12 express the Japanese sentence analysis devices described in claims 1 to 6 as methods.

[0022]

[Embodiments of the Invention] FIG. 1 is a block diagram showing the functions of a Japanese sentence analysis device that is an embodiment of the present invention. The Japanese sentence analysis device 1 comprises a text data storage unit 2, a morphological analysis unit 3, a dictionary file 4, a word candidate detection unit 5, a word candidate registration unit 6, a word candidate verification unit 7, a word candidate deletion unit 8, a registered word verification unit 9, and a word candidate formal registration unit 10. The text data storage unit 2 stores the text data to be processed. The morphological analysis unit 3 performs morphological analysis on the text data stored in the text data storage unit 2. The dictionary file 4 stores the character string of each word in association with the attributes of that word (part of speech and so on). The word candidate detection unit 5 detects, based on the result of the morphological analysis performed by the morphological analysis unit 3, unregistered words that are not registered in the dictionary file 4 as word candidates. The word candidate registration unit 6 provisionally registers the word candidates detected by the word candidate detection unit 5 in the dictionary file 4. The word candidate verification unit 7 verifies the validity of the word candidates provisionally registered in the dictionary file 4. The word candidate deletion unit 8 deletes provisionally registered word candidates whose validity was not verified. The registered word verification unit 9 verifies the validity of words registered in the dictionary file 4 that were detected within a character string that may contain an unregistered word. The word candidate formal registration unit 10 formally registers a provisionally registered word candidate once its validity has been verified.
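The lifecycle of an entry in the dictionary file 4 (provisional registration by unit 6, formal registration by unit 10, deletion by unit 8) can be pictured with a small Python sketch. The class name, method names, and status strings below are illustrative assumptions; the text does not describe how the dictionary file is actually stored.

```python
class DictionaryFile:
    """Hypothetical in-memory stand-in for the dictionary file 4."""

    def __init__(self, entries):
        # entries: mapping of surface string -> part of speech, all formally registered
        self._entries = {s: (pos, "formal") for s, pos in entries.items()}

    def register_provisional(self, surface, pos="noun"):
        """Role of the word candidate registration unit 6."""
        self._entries.setdefault(surface, (pos, "provisional"))

    def promote(self, surface):
        """Role of the word candidate formal registration unit 10."""
        pos, _ = self._entries[surface]
        self._entries[surface] = (pos, "formal")

    def purge_provisional(self):
        """Role of the word candidate deletion unit 8: drop unverified candidates."""
        self._entries = {s: e for s, e in self._entries.items() if e[1] == "formal"}

    def status(self, surface):
        entry = self._entries.get(surface)
        return entry[1] if entry else None

dictionary = DictionaryFile({"ファイ": "noun", "システム": "noun"})
dictionary.register_provisional("ファイナンシャル")
dictionary.register_provisional("ナンシャル")
dictionary.promote("ファイナンシャル")   # assume verification succeeded for this one
dictionary.purge_provisional()
print(dictionary.status("ファイナンシャル"), dictionary.status("ナンシャル"))
```

Tracking the status next to the part of speech keeps provisional candidates visible to later lookups, which the reappearance check relies on.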

[0023] FIG. 2 is a flowchart showing the processing of the Japanese sentence analysis device that is an embodiment of the present invention. The series of processes performed by the Japanese sentence analysis device 1 is first described briefly. The Japanese sentence analysis device 1 takes the text data to be processed into the text data storage unit 2 and stores it (n1). The morphological analysis unit 3 takes in the text data one character-string unit at a time, each unit being delimited by punctuation marks (n2). Having taken in a punctuation-delimited unit of text data, the morphological analysis unit 3 performs morphological analysis on it using the dictionary file 4 (n3). This analysis divides the unit taken in at n2 into morphemes. The Japanese sentence analysis device 1 determines whether the result of the morphological analysis performed at n3 includes a word written as a katakana character string (n4). If no katakana word is included, the device determines whether all the text data stored in the text data storage unit 2 at n1 has been processed, that is, whether any unprocessed text data remains (n11), and returns to n2 if unprocessed text data remains.

[0024] If a katakana word is included, the word candidate detection unit 5 determines whether the katakana words include an unregistered word (n5). If an unregistered word is included, the word candidate detection unit 5 performs a first word candidate detection and registration process (n6), which detects word candidates from the entire katakana character string, including the katakana words contiguous with the unregistered word, and provisionally registers the detected word candidates in the dictionary file 4. When n5 determines that no unregistered word is included, or when the process at n6 has completed, the device determines whether there is a place where several katakana words registered in the dictionary file 4 are contiguous (n7). If registered katakana words are contiguous somewhere, the device performs a second word candidate detection and registration process (n8), which detects word candidates from the entire katakana character string formed by joining these registered words and provisionally registers the detected word candidates in the dictionary file 4.

[0025] The word candidate verification unit 7 also determines whether the words produced by the morphological analysis include a word with the same katakana character string as a word candidate provisionally registered in the dictionary file 4 at n6 or n8, that is, whether a word candidate provisionally registered at n6 or n8 has reappeared within another character string (n9). If a word candidate has reappeared, the word candidate verification unit 7 executes a word candidate validity verification process (n10), which verifies the validity of the reappeared word candidate provisionally registered in the dictionary file 4.

[0026] The device then determines at n11 whether any unprocessed text data remains, and if so repeats the processes of n2 to n10 described above. If no unprocessed text data remains, the device deletes all unnecessary word candidates provisionally registered in the dictionary file 4 (such as word candidates whose validity was not verified) and completes the processing (n12).
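The branching of steps n2 to n12 can be summarized as a control-flow skeleton in Python. This is only a sketch of the flowchart's structure: `steps` is a hypothetical object whose method names are invented here, and splitting on the Japanese comma and full stop is an assumption about what counts as punctuation.

```python
import re

def run_pipeline(text, steps):
    """Drive steps n2 to n12 of FIG. 2; `steps` supplies hypothetical hook methods."""
    units = [u for u in re.split("[、。]", text) if u]      # n2: punctuation-delimited units
    for unit in units:                                       # n11: loop until no unit remains
        words = steps.analyze(unit)                          # n3: morphological analysis
        if not steps.has_katakana_word(words):               # n4
            continue
        if steps.has_unregistered_word(words):               # n5
            steps.first_detection(words)                     # n6
        if steps.has_adjacent_registered_katakana(words):    # n7
            steps.second_detection(words)                    # n8
        if steps.candidate_reappears(words):                 # n9
            steps.verify_candidates(words)                   # n10
    steps.purge_unverified()                                 # n12
```

Step n1 (loading the text into storage) is represented simply by the `text` argument.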

[0027] The processing described above is now explained in detail. At n1, the text data storage unit 2 takes in and stores a series of text data that is to undergo morphological analysis (in units such as a file, a record, a directory, a hard disk, a period of time, or a data volume such as 100 MB).

[0028] At n2 and n3, the morphological analysis unit 3 sequentially extracts punctuation-delimited character-string units from the series of text data stored in the text data storage unit 2 and performs morphological analysis on them. The morphological analysis unit 3 then divides each analyzed unit into words and outputs the words. For example, suppose that the text data to be analyzed contains the katakana character string "ファイナンシャルシステム" (financial system). Suppose also that, as shown in FIG. 3(A), the character strings "ファイ" and "システム" are registered as words in the dictionary file 4, while "ナンシャル", "ファイナンシャル", and "ファイナンシャルシステム" are not. When the morphological analysis unit 3 analyzes the katakana character string "ファイナンシャルシステム", dictionary lookup fails at "ナ", "ン", "シャ", and "ル", as shown in FIG. 3(B), and "ファイ" and "システム" are detected as words whose part of speech is noun. The morphological analysis unit 3 then treats the katakana character string "ナンシャル", formed by joining the consecutive lookup failures "ナ", "ン", "シャ", and "ル", as a single unregistered word, and, as shown in FIG. 3(C), outputs the three words "ファイ", "ナンシャル", and "システム" as the result of analyzing "ファイナンシャルシステム". The part of speech of "ナンシャル" is output as "unregistered word".
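The behavior shown in FIG. 3 can be reproduced with a minimal Python sketch. The text does not say which segmentation algorithm the morphological analysis unit 3 uses, so greedy longest-match lookup is assumed here; the function name `analyze` and the `"unregistered"` label are likewise illustrative. The sketch works per character, so "シ" and "ャ" fail separately rather than as the single unit "シャ", but the merged result is the same.

```python
UNREGISTERED = "unregistered"

def analyze(text, dictionary):
    """Greedy longest-match segmentation (an assumed strategy).

    dictionary maps surface strings to parts of speech. Characters for which
    lookup fails are merged with neighboring failures into one unregistered word.
    """
    result = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i that is in the dictionary.
        match = next((text[i:j] for j in range(len(text), i, -1)
                      if text[i:j] in dictionary), None)
        if match is not None:
            result.append((match, dictionary[match]))
            i += len(match)
        elif result and result[-1][1] == UNREGISTERED:
            # Extend the run of consecutive lookup failures.
            result[-1] = (result[-1][0] + text[i], UNREGISTERED)
            i += 1
        else:
            result.append((text[i], UNREGISTERED))
            i += 1
    return result

dictionary = {"ファイ": "noun", "システム": "noun"}
for surface, pos in analyze("ファイナンシャルシステム", dictionary):
    print(surface, pos)
```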

[0029] At n4, the device determines whether the morphological analysis result output by the morphological analysis unit 3 includes a word written as a katakana character string. If the result includes no katakana word, the processes of n5 to n10 are skipped and n11 determines whether any unprocessed text data remains. If the result does include a katakana word, the processes of n5 to n10 are performed.
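The check at n4 can be sketched as a small Python helper. The text does not say how katakana is recognized; the sketch assumes that a katakana word is one whose characters all fall in the Unicode Katakana block (U+30A0 to U+30FF, which includes the long-vowel mark "ー"), and the function names are illustrative.

```python
def is_katakana_word(surface):
    """True if every character lies in the Unicode Katakana block (assumed criterion)."""
    return bool(surface) and all("\u30a0" <= ch <= "\u30ff" for ch in surface)

def has_katakana_word(words):
    """n4: does the analysis result contain at least one katakana word?"""
    return any(is_katakana_word(surface) for surface in words)

print(has_katakana_word(["新しい", "システム", "を"]))
print(has_katakana_word(["日本語", "の", "文"]))
```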

[0030] At n5, the word candidate detection unit 5 determines whether the morphological analysis result includes an unregistered word. In this embodiment, the result is judged to include an unregistered word when it contains a word whose part of speech was output as "unregistered word". In the example above, the result contains the word "ナンシャル", whose part of speech is "unregistered word", so n5 determines that an unregistered word is included.

[0031] When n5 determines that an unregistered word is included, the word candidate detection unit 5 executes the first word candidate detection and registration process, which detects word candidates from the entire katakana character string contiguous with the unregistered word and provisionally registers the detected word candidates in the dictionary file 4. Here, the entire katakana character string contiguous with the unregistered word means the katakana character string that includes the katakana words running on before and/or after the unregistered word (the words contiguous with it). In the example above, "ファイナンシャルシステム" is the entire katakana character string contiguous with the unregistered word. It is assumed that no katakana word adjoins "ファイ" in front or "システム" behind.

【００３２】ここで、図４を参照しながら第１の単語候
補検出、登録処理を詳細に説明する。図４は第１の単語
候補検出、登録処理の流れを示すフローチャートであ
る。このカタカナ文字列の未登録単語に連接するカタカ
ナ文字列全体を１つの単語とし、品詞を名詞として辞書
ファイル４に登録する（ｎ２１）。上記している例では
「ファイナンシャルシステム」が単語（品詞は名詞）と
して辞書ファイル４に登録される。つぎに、ｎ２１で辞
書ファイル４に登録した単語のカタカナ文字列中に含ま
れる未登録単語が１文字の単語であるかどうかを判定す
る（ｎ２２）。ここで、１文字の単語でなければこの未
登録単語を単語候補の構成要素として検出する（ｎ２
３）。未登録単語が１文字であり、この未登録単語の前
にカタカナ文字列の登録単語が連接しているとこの登録
単語と未登録単語とをつないだ文字列からなる単語を単
語候補の構成要素として検出し、また、この未登録単語
の前または後ろにカタカナ文字列の登録単語が連接して
いるとこの登録単語と未登録単語とをつないだ文字列か
らなる単語を単語候補の構成要素として検出する（ｎ２
４）。例えば、「イリオモテ」と言う文字列に対して、
形態素解析結果が「イ」が未登録語、「リオ」「モテ」
が登録語である場合、「イ」が１文字の未登録であるの
で後ろの登録語「リオ」とつながれた「イリオ」が単語
候補の構成要素として検出される。なお、形態素解析部
３で未登録語である「イ」の前に検出している単語はカ
タカナ文字列ではないとする。また、この１文字の未登
録単語の前後両方にカタカナ文字列の登録単語が連接し
ている場合には、前に連接する登録単語とつながれた単
語候補の構成要素と、後ろに連接する登録単語とつなが
れた単語候補の構成要素とを検出する。Here, the first word candidate detection and registration process will be described in detail with reference to FIG. 4, which is a flowchart showing the flow of this process. First, the entire katakana character string connected to the unregistered word is treated as one word and registered in the dictionary file 4 with its part of speech set to noun (n21). In the example above, 「ファイナンシャルシステム」 is registered in the dictionary file 4 as a word (part of speech: noun). Next, it is determined whether the unregistered word contained in the katakana character string registered in n21 is a one-character word (n22). If it is not a one-character word, the unregistered word itself is detected as a constituent element of a word candidate (n23). If the unregistered word is one character long and a registered katakana word is connected before it, the word formed by joining that registered word and the unregistered word is detected as a constituent element of a word candidate; likewise, if a registered katakana word is connected before or after the unregistered word, the word formed by joining that registered word and the unregistered word is detected as a constituent element of a word candidate (n24). For example, suppose that morphological analysis of the character string 「イリオモテ」 yields 「イ」 as an unregistered word and 「リオ」 and 「モテ」 as registered words. Because 「イ」 is a one-character unregistered word, 「イリオ」, formed by joining it to the following registered word 「リオ」, is detected as a constituent element of a word candidate. It is assumed here that the word detected by the morphological analysis unit 3 immediately before the unregistered word 「イ」 is not a katakana character string. When registered katakana words are connected both before and after the one-character unregistered word, two constituent elements are detected: one joined to the preceding registered word and one joined to the following registered word.
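Steps n22 to n24 can be sketched as follows. This is only a minimal illustration under stated assumptions: the `Morph` record and the function name are hypothetical, and the list passed in is assumed to hold one contiguous katakana run, so every neighbour is a katakana word. The sketch follows the "join to the preceding and to the following registered word" reading given for the one-character case.

```python
from typing import List, NamedTuple


class Morph(NamedTuple):
    """One word produced by morphological analysis (hypothetical record)."""
    surface: str      # katakana surface string
    registered: bool  # True if the word is in the dictionary file


def components_for_unregistered(words: List[Morph], i: int) -> List[str]:
    """Constituent elements contributed by the unregistered word words[i]."""
    word = words[i]
    # n22 -> n23: an unregistered word longer than one character is kept as is.
    if len(word.surface) != 1:
        return [word.surface]
    # n24: a one-character word is joined to each adjacent registered word.
    found: List[str] = []
    if i > 0 and words[i - 1].registered:
        found.append(words[i - 1].surface + word.surface)
    if i + 1 < len(words) and words[i + 1].registered:
        found.append(word.surface + words[i + 1].surface)
    return found
```

For the 「イリオモテ」 example, calling the function on the unregistered word 「イ」 yields the single constituent element 「イリオ」. The source does not say what happens to a one-character unregistered word with no registered neighbour; the sketch simply returns an empty list in that case.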

【００３３】そして、形態素解析において、未登録単語
を含むカタカナ文字列全体から検出されている登録単語
（辞書ファイル４に登録されている単語）の正当性の対
象となる登録単語の正当性検証処理を行う（ｎ２５）。
ここでは、検証する登録単語が、複合語を形成しうる単
語であれば正当性がある、複合語を形成しえない単語で
あれば正当性がない、とする。上記した「ファイナンシ
ャルシステム」という文字列の例では「ファイ」と「シ
ステム」との２つの登録単語が正当性の検証対象とな
る。単語の正当性の検証は登録単語検証部９で行われ
る。単語の正当性は以下に示す〜のルールに基づい
て検証される。単語の文字列長によるルールこのルールでは、複合語ではない単語の文字列中に、登
録単語と一致する文字列が含まれる可能性は、登録単語
の文字列長が長くなるにつれて低下するという理由か
ら、この実施の形態では、(1) 文字列長が４文字以上の
単語であれば正当性のある単語、(2) 文字列長が２また
は３文字の単語であれば正当性の有無を判定できない単
語、(3) １文字であれば正当性がない単語、であるとす
る。単語の品詞によるルールこのルールでは以下に示す品詞の働きに基づいて正当性
を検証する。感動詞は、他の単語を修飾したり、他の単
語に修飾されたりする性質がないため、複合語の構成単
語とはならない。副詞は、他の単語を修飾したり、他の
単語に修飾されたりする性質がないため、複合語の構成
単語とはならない。サ行変格活用以外の動詞は、複合語
の構成単語とならない。接頭辞は、複合語の最後に来る
ことはない。接尾辞は、複合語の先頭にくることはな
い。連濁は、複合語の先頭にくることはない。以上の理
由から、本実施の形態では(1) 単語が感動詞、副詞、サ
行変格活用以外の動詞、のいずれかであれば、正当性の
ない単語、(2) 単語が接頭辞で、且つ、該単語の後ろに
カタカナ文字列が連接していないと、正当性のない単
語、(3) 単語が接尾辞、連濁で、且つ、該単語の前にカ
タカナ文字列が連接していないと、正当性のない単語、
(4) 上記(1)(2)(3) のいずれにも該当しないと、正当性
の有無を判定できない単語、であるとする。単語の性質によるルールこのルールでは、単語毎にその性質を、複合語を形成す
る可能性の多い単語、複合語を形成する可能性の少ない
単語、どちらでもない単語（以下、有用な性質を持たな
い単語、と言う。）、のいずれかに設定しておき、(1)
単語の性質が複合語を形成する可能性の多い単語であれ
ば、正当性のある単語、(2) 単語の性質が複合語を形成
する可能性の少ない単語であれば、正当性のない単語、
(3) 単語の性質が有用な性質を持たない単語であれば、
正当性の有無を判定できない単語、であるとする。な
お、この単語毎に性質を種類分けはする方法としては、
複数の文献等から、単語毎に形成された複合語の数、一
致する文字列を含む独立した単語（複合語でない単語）
の数等の統計を取り、この統計に基づいて単語の性質を
設定すればよい。また、人手による作業でこの統計を取
ってもよいし、自動的に統計を取って単語の性質を設定
するようにしてもよい。自動的にこの統計を取って単語
の性質を設定する処理については後述する。Then, validity verification is performed on the registered words (words registered in the dictionary file 4) that morphological analysis detected within the entire katakana character string containing the unregistered word (n25). Here, a registered word under verification is regarded as valid if it can form a compound word and as invalid if it cannot. In the 「ファイナンシャルシステム」 example above, the two registered words 「ファイ」 and 「システム」 are subject to verification. Verification is performed by the registered word verification unit 9 on the basis of the three rules described below.
Rule based on the character string length of the word
The likelihood that the character string of a non-compound word contains a character string matching a registered word falls as the registered word becomes longer. For this reason, in this embodiment, (1) a word of four or more characters is a valid word, (2) a word of two or three characters is a word whose validity cannot be determined, and (3) a one-character word is an invalid word.
Rule based on the part of speech of the word
This rule verifies validity on the basis of how each part of speech behaves, as follows. An interjection neither modifies other words nor is modified by them, so it does not become a constituent word of a compound word. An adverb neither modifies other words nor is modified by them, so it does not become a constituent word of a compound word. Verbs other than sa-row irregular conjugation verbs do not become constituent words of compound words. A prefix never comes at the end of a compound word. A suffix never comes at the beginning of a compound word. A rendaku (sequentially voiced) form never comes at the beginning of a compound word. For these reasons, in this embodiment, (1) a word that is an interjection, an adverb, or a verb other than a sa-row irregular conjugation verb is an invalid word, (2) a word that is a prefix with no katakana character string connected after it is an invalid word, (3) a word that is a suffix or a rendaku form with no katakana character string connected before it is an invalid word, and (4) a word to which none of (1) to (3) applies is a word whose validity cannot be determined.
Rule based on the property of the word
Under this rule, each word is assigned in advance one of three properties: a word likely to form a compound word, a word unlikely to form a compound word, or a word that is neither (hereinafter, a word with no useful property). Then (1) a word likely to form a compound word is a valid word, (2) a word unlikely to form a compound word is an invalid word, and (3) a word with no useful property is a word whose validity cannot be determined. To classify words in this way, statistics such as, for each word, the number of compound words it forms and the number of independent (non-compound) words that contain a matching character string can be gathered from multiple documents, and the property of the word set on the basis of those statistics. The statistics may be gathered manually, or they may be gathered automatically and used to set the property of each word. The process of gathering the statistics automatically and setting the property of each word is described later.

【００３４】図５は、ｎ２５における登録単語の正当性
検証処理のフローチャートである。この処理は、最初に
文字列長によるルールから単語の正当性を検証する。正
当性を検証する登録単語の文字列長が、４文字以上、２
または３文字、１文字、のいずれであるかを判定する
（ｎ４１、ｎ４２）。ここで、文字列長が４文字以上で
あればｎ４８において正当性のある単語と判定する。文
字列長が１文字であればｎ４９において正当性のない単
語と判定する。文字列長が２または３文字であれば、単
語の文字列長によるルールからは、該単語の正当性を検
証できないとして、単語の品詞による正当性の検証を行
う。FIG. 5 is a flowchart of the registered word validity verification process in n25. This process first verifies the validity of the word using the rule based on character string length. It is determined whether the character string length of the registered word under verification is four or more characters, two or three characters, or one character (n41, n42). If the length is four or more characters, the word is determined to be valid in n48. If the length is one character, the word is determined to be invalid in n49. If the length is two or three characters, the validity of the word cannot be determined from the length rule, so verification proceeds to the rule based on the part of speech.

【００３５】ここでは、単語の品詞が感動詞、副詞、サ行変格活用以外の動
詞、であるか、単語の品詞が接頭辞で且つ後ろにカタカナ文字列が続
いていないか、単語の品詞が接尾辞または連濁で且つ前にカタカナ文
字列が続いていないか、を判定し（ｎ４３〜ｎ４５）、この〜のいずれかに
該当する単語であれば、ｎ４９で正当性のない単語と判
定する。また、この〜のいずれにも該当しない単語
であれば、この単語の品詞によるルールからは該単語の
正当性が検証できないとして、以下の単語の性質による
正当性の検証を行う。Here, it is determined whether (1) the part of speech of the word is an interjection, an adverb, or a verb other than a sa-row irregular conjugation verb, (2) the part of speech is a prefix and no katakana character string follows the word, or (3) the part of speech is a suffix or a rendaku form and no katakana character string precedes the word (n43 to n45). If any of (1) to (3) applies, the word is determined to be invalid in n49. If none of them applies, the validity of the word cannot be determined from the part-of-speech rule, so verification proceeds to the rule based on the property of the word, described next.

【００３６】上記したように、単語毎に、複合語を形成
する可能性の多い単語、複合語を形成する可能性の少な
い単語、有用な性質を持たない単語、のいずれかの性質
が設定されている。検証する単語の性質が上記したいず
れに設定されているかを判定し（ｎ４６、ｎ４７）、複
合語を形成する可能性の多い単語であればｎ４８で正当
性のある単語と判定する。また、複合語を形成する可能
性の少ない単語であればｎ４９で正当性のない単語と判
定する。また、有用な性質を持たない単語であれば正当
性を検証できない単語と判定する（ｎ５０）。以上のよ
うに、この処理では登録単語が正当性のある単語、正当
性のない単語、または、正当性の検証できない単語のい
ずれかに判定される。なお、上記した実施の形態では、
単語の文字列長によるルール、単語の品詞によるルー
ル、単語の性質によるルール、の３つで単語の正当性を
検証しているが、上記した任意のルール１つまたは２つ
を組み合わせて単語の正当性を検証するようにしてもよ
い。As described above, each word is assigned one of three properties: a word likely to form a compound word, a word unlikely to form a compound word, or a word with no useful property. It is determined which of these properties is set for the word under verification (n46, n47). A word likely to form a compound word is determined to be valid in n48, and a word unlikely to form a compound word is determined to be invalid in n49. A word with no useful property is determined to be a word whose validity cannot be determined (n50). In this way, the process classifies each registered word as a valid word, an invalid word, or a word whose validity cannot be determined. Although the embodiment described above verifies validity using all three rules (character string length, part of speech, and property of the word), validity may instead be verified using any one of these rules or any combination of two.
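The cascade of FIG. 5 (length rule, then part-of-speech rule, then property rule) can be sketched roughly as below. This is an illustration, not the patented implementation: the `Verdict` enum, the part-of-speech tag strings (`"interjection"`, `"verb_non_sahen"`, `"rendaku"` and so on) and the `tendency` labels are all hypothetical names chosen for the sketch.

```python
from enum import Enum


class Verdict(Enum):
    VALID = "valid"
    INVALID = "invalid"
    UNDETERMINED = "undetermined"


# Parts of speech that, per the text, never take part in a compound word.
NEVER_IN_COMPOUND = {"interjection", "adverb", "verb_non_sahen"}


def verify_word(surface: str, pos: str, tendency: str,
                katakana_before: bool, katakana_after: bool) -> Verdict:
    """Classify one registered word following the order of FIG. 5."""
    # Length rule (n41, n42).
    if len(surface) >= 4:
        return Verdict.VALID            # n48
    if len(surface) == 1:
        return Verdict.INVALID          # n49
    # Part-of-speech rule (n43 to n45), reached for 2- or 3-character words.
    if pos in NEVER_IN_COMPOUND:
        return Verdict.INVALID
    if pos == "prefix" and not katakana_after:
        return Verdict.INVALID
    if pos in {"suffix", "rendaku"} and not katakana_before:
        return Verdict.INVALID
    # Property rule (n46, n47).
    if tendency == "likely":
        return Verdict.VALID            # n48
    if tendency == "unlikely":
        return Verdict.INVALID          # n49
    return Verdict.UNDETERMINED         # n50
```

With these assumptions, 「システム」 (four characters) comes out valid on length alone, while a three-character word tagged as a noun with no useful property stays undetermined.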

【００３７】単語候補検出部５は，登録単語の正当性検
証処理で、正当性があると判定された単語を単語候補の
構成要素としては検出しない（ｎ２６→ｎ３０）。ま
た、正当性がないと判定された単語であれば、前にカタ
カナ文字列の単語が連接していると、この単語とをつな
いだ文字列からなる単語を単語候補の構成要素として検
出する（ｎ２９）。また、後ろにカタカナ文字列の単語
が連接しているとこの単語とつないだ文字列からなる単
語を単語候補の構成要素として検出する（ｎ２９）。正
当性が検証されなかった単語であれば、その単語を単語
候補の構成要素として検出する（ｎ２８）。The word candidate detection unit 5 does not detect a word that the validity verification process determined to be valid as a constituent element of a word candidate (n26 → n30). For a word determined to be invalid, if a katakana word is connected before it, the word formed by joining the two is detected as a constituent element of a word candidate (n29); likewise, if a katakana word is connected after it, the word formed by joining the two is detected as a constituent element (n29). A word whose validity could not be determined is itself detected as a constituent element of a word candidate (n28).

【００３８】例えば、図６（Ａ）に示すように、「ファ
イナンシャルシステム」と言う文字列に対して、形態素
解析によって「ファイ」「システム」が登録語、「ナン
シャル」が未登録語とする結果であれば、未登録語であ
る「ナンシャル」の文字列長は１文字ではないので、単
語候補の構成要素として検出される。登録単語である
「ファイ」は文字列長、単語の品詞、および、その性質
からも正当性が検証されない単語であるので、単語候補
の構成要素として検出される（「ファイ」は有用な性質
を持たない単語であるとする。）。また、登録単語であ
る「システム」は文字列長が４文字であるので、文字列
長によるルールによって正当性がある単語と判定され、
単語候補の構成要素として検出されない。したがって、
この例では、「ファイ」と「ナンシャル」の２つが単語
候補の構成要素として検出される。また、図６（Ｂ）に
示すように、「インフレーター」と言う文字列に対し
て、形態素解析の結果が「イン」「フレー」を登録語、
「ター」を未登録語とするものであれば、未登録語であ
る「ター」の文字列長は１文字ではないので単語候補の
構成要素として検出される。「イン」は文字列長、単語
の品詞、および、その性質からも正当性が検証されない
単語であるので、単語候補の構成要素として検出される
（「イン」は有用な性質を持たない単語であるとす
る。）。また、感動詞「フレー」は単語の品詞によるル
ールによって正当性のない単語と判定されるので、前に
隣合う単語「イン」とつなげた「インフレー」と後ろに
隣合う単語「ター」とつなげた「フレーター」が単語候
補の構成要素として検出される。したがって、この例で
は、「イン」「インフレー」「フレーター」「ター」の
４つが単語候補の構成要素として検出される。また、図
６（Ｃ）に示すように、「イリオモテ」と言う文字列に
対して、形態素解析の結果が「イ」が未登録語、「リ
オ」「モテ」が登録語とするものであれば、未登録語で
ある「イ」の文字列長は１文字であるので、その後ろに
隣合う単語「リオ」とつながる。また、下一段動詞であ
る「モテ」は単語の品詞によるルールによって正当性の
ない単語と判定され、前に隣合う単語「リオ」とつなが
る。ここで、「リオ」にはすでに「イ」が接続されてい
るので、「イリオモテ」が単語候補の構成要素として検
出される。さらに、図６（Ｄ）に示すように、「インタ
ラプタ」言う文字列に対して、形態素解析の結果が「イ
ン」「タラ」が登録語「プタ」が未登録語とするもので
あれば、未登録語である「プタ」の文字列長は１文字で
はないの単語候補の構成要素として検出される。「イ
ン」「タラ」は文字列長、単語の品詞、および、その性
質からも正当性が検証されない単語であるので、その単
語が単語候補の構成要素として検出される（「イン」
「タラ」は有用な性質を持たない単語であるとす
る。）。したがって、この例では、「イン」「タラ」
「プタ」の３つが単語候補の構成要素として検出され
る。For example, as shown in FIG. 6(A), suppose that morphological analysis of the character string 「ファイナンシャルシステム」 yields 「ファイ」 and 「システム」 as registered words and 「ナンシャル」 as an unregistered word. The unregistered word 「ナンシャル」 is not one character long, so it is detected as a constituent element of a word candidate. The registered word 「ファイ」 is a word whose validity cannot be determined from its character string length, part of speech, or property, so it is also detected as a constituent element (「ファイ」 is assumed to be a word with no useful property). The registered word 「システム」 is four characters long, so the length rule determines it to be valid and it is not detected as a constituent element. In this example, therefore, the two constituent elements 「ファイ」 and 「ナンシャル」 are detected.
As shown in FIG. 6(B), suppose that morphological analysis of the character string 「インフレーター」 yields 「イン」 and 「フレー」 as registered words and 「ター」 as an unregistered word. The unregistered word 「ター」 is not one character long, so it is detected as a constituent element. 「イン」 is a word whose validity cannot be determined from its character string length, part of speech, or property, so it is detected as a constituent element (「イン」 is assumed to be a word with no useful property). The interjection 「フレー」 is determined to be invalid by the part-of-speech rule, so 「インフレー」, formed by joining it to the preceding adjacent word 「イン」, and 「フレーター」, formed by joining it to the following adjacent word 「ター」, are detected as constituent elements. In this example, therefore, the four constituent elements 「イン」, 「インフレー」, 「フレーター」, and 「ター」 are detected.
As shown in FIG. 6(C), suppose that morphological analysis of the character string 「イリオモテ」 yields 「イ」 as an unregistered word and 「リオ」 and 「モテ」 as registered words. The unregistered word 「イ」 is one character long, so it is joined to the following adjacent word 「リオ」. 「モテ」, a lower ichidan (shimo-ichidan) verb, is determined to be invalid by the part-of-speech rule and is joined to the preceding adjacent word 「リオ」. Because 「イ」 has already been joined to 「リオ」, 「イリオモテ」 is detected as the constituent element.
Further, as shown in FIG. 6(D), suppose that morphological analysis of the character string 「インタラプタ」 yields 「イン」 and 「タラ」 as registered words and 「プタ」 as an unregistered word. The unregistered word 「プタ」 is not one character long, so it is detected as a constituent element. 「イン」 and 「タラ」 are words whose validity cannot be determined from their character string length, part of speech, or property, so they are themselves detected as constituent elements (「イン」 and 「タラ」 are assumed to be words with no useful property). In this example, therefore, the three constituent elements 「イン」, 「タラ」, and 「プタ」 are detected.
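How a registered word contributes constituent elements once its validity verdict is known (n26 to n30) might be sketched like this. The function name and the verdict strings are hypothetical. The sketch covers only the simple cases of FIG. 6(A), (B), and (D) and does not attempt the merging seen in FIG. 6(C), where two joins share the same neighbouring word.

```python
from typing import List


def components_for_registered(surfaces: List[str], i: int, verdict: str) -> List[str]:
    """Constituent elements contributed by the registered word surfaces[i].

    `surfaces` lists the words of one contiguous katakana run in order;
    `verdict` is "valid", "invalid" or "undetermined" (hypothetical labels).
    """
    if verdict == "valid":
        return []                      # n26 -> n30: contributes nothing
    if verdict == "undetermined":
        return [surfaces[i]]           # n28: the word itself
    # Invalid word (n29): join it to each adjacent katakana word.
    found: List[str] = []
    if i > 0:
        found.append(surfaces[i - 1] + surfaces[i])
    if i + 1 < len(surfaces):
        found.append(surfaces[i] + surfaces[i + 1])
    return found
```

Applied to the FIG. 6(B) split 「イン」「フレー」「ター」, the invalid word 「フレー」 produces 「インフレー」 and 「フレーター」, and the undetermined word 「イン」 produces itself.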

【００３９】このようにして検出された単語候補の構成
要素および単語候補の構成要素で連接するものの組み合
わせを、単語候補として作成する（ｎ３１）。例えば、
図６（Ａ）に示す例では、「ファイ」「ナンシャル」
「ファイナンシャル」の３つが単語候補として作成され
る。また、図６（Ｂ）に示す例では、「イン」「インフ
レー」「フレーター」「ター」「インフレータ」が単語
候補として作成される。図６（Ｃ）に示す例では、「イ
リオモテ」が単語候補として作成される。図６（Ｄ）に
示す例では「イン」「タラ」「プタ」「インタラ」「タ
ラプタ」「インタラプタ」が単語候補として作成され
る。なお、連接していない単語「イン」と「プタ」をつ
ないだ「インプタ」という単語候補は作成されない。そ
して、ｎ３１で作成された単語候補で且つ辞書ファイル
４に登録されていない文字列の単語候補を、辞書ファイ
ル４に仮登録する（ｎ３２）。仮登録された単語候補の
品詞は「候補」に設定される。また、単語候補の仮登録
においては、この単語候補が切り出された元の文字列の
単語（ｎ２１で登録された単語）を登録した辞書ファイ
ル４内の位置を示すデータ（ポインタ）も同時に登録す
る。図７に単語候補が登録された辞書ファイル４の例を
示す。図６（Ａ）に示す例では、「ファイ」はすでに辞
書に登録されているので「ナンシャル」「ファイナンシ
ャル」の２つが単語候補として登録され、品詞は候補に
設定されている。また、これらの単語候補は切り出され
た元の文字列の単語「ファイナンシャルシステム」が登
録されている辞書ファイル４内の位置を示すデータ（ポ
インタ８）が付加されて辞書ファイル４に登録される。The constituent elements detected in this way, together with every combination of constituent elements that are connected to one another, are created as word candidates (n31). For example, in the case of FIG. 6(A), the three word candidates 「ファイ」, 「ナンシャル」, and 「ファイナンシャル」 are created. In FIG. 6(B), 「イン」, 「インフレー」, 「フレーター」, 「ター」, and 「インフレーター」 are created. In FIG. 6(C), 「イリオモテ」 is created. In FIG. 6(D), 「イン」, 「タラ」, 「プタ」, 「インタラ」, 「タラプタ」, and 「インタラプタ」 are created. No word candidate 「インプタ」 is created, because 「イン」 and 「プタ」 are not connected to each other. Then, those word candidates created in n31 whose character strings are not already registered in the dictionary file 4 are provisionally registered in the dictionary file 4 (n32). The part of speech of a provisionally registered word candidate is set to "candidate". Along with each provisionally registered word candidate, data (a pointer) indicating the position in the dictionary file 4 of the word for the original character string from which the candidate was cut out (the word registered in n21) is also registered. FIG. 7 shows an example of the dictionary file 4 with word candidates registered. In the example of FIG. 6(A), 「ファイ」 is already registered in the dictionary, so only 「ナンシャル」 and 「ファイナンシャル」 are registered as word candidates, with their part of speech set to "candidate". These word candidates are registered with data (pointer 8) indicating the position in the dictionary file 4 of 「ファイナンシャルシステム」, the original character string from which they were cut out.
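One way to model the combination step n31 is to treat every constituent element as a character span within the katakana string and to chain spans whose end meets another span's start. The span representation and the function name `build_candidates` are assumptions made for this sketch; the source only states that connected constituent elements are combined.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def build_candidates(text: str, spans: List[Span]) -> Set[str]:
    """Return the constituent elements plus every chain of adjacent ones."""
    base = set(spans)
    found = set(base)
    frontier = set(base)
    while frontier:
        grown = set()
        for start, end in frontier:
            for next_start, next_end in base:
                # Two elements are "connected" when one ends where the next begins.
                if next_start == end:
                    merged = (start, next_end)
                    if merged not in found:
                        found.add(merged)
                        grown.add(merged)
        frontier = grown
    return {text[s:e] for s, e in found}


# FIG. 6(A): constituent elements 「ファイ」 (0-3) and 「ナンシャル」 (3-8).
candidates = build_candidates("ファイナンシャルシステム", [(0, 3), (3, 8)])
# n32: only strings not yet in the dictionary are provisionally registered.
dictionary = {"ファイ", "システム", "ファイナンシャルシステム"}
provisional = candidates - dictionary
```

Because only touching spans are chained, the FIG. 6(D) elements 「イン」 and 「プタ」 are never joined directly, which matches the statement that 「インプタ」 is not created.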

【００４０】以下、ｎ２１で登録した単語の品詞を名詞
とした理由について簡単に説明する。カタカナの未登録
語の発生源は大きく分けて以下に示す〜の３つであ
ると考えられる。外来語の動詞、形容詞、名詞がカタカナ表記された日
本語となる場合（図８（Ａ）参照）外来語の動詞は日本語のサ行変格活用の動詞の語幹とな
り、サ行変格活用の動詞の語幹は名詞として使われてい
る。また、外来語の形容詞は日本語の形容動詞になる。
さらに、外来語としても形容詞と名詞の両方の性質をも
つものがカタカナ表記されることが多い。これらの理由
から、この発生源から発生するカタカナ未登録語が名詞
である確率が非常に高いといえる。日本語で難しい漢字や強調したい単語などがカタカナ
表記された場合（図８（Ｂ）参照）この発生源から発生するカタカナ未登録語は上記したよ
うにサ行変格活用の動詞、形容動詞、名詞に加えて文法
的に「名詞」と同様に扱われる固有名詞がほとんどであ
るといえる。したがって、この発生源から発生するカタ
カナ未登録語も名詞である確率が非常に高いといえる。外来語の擬音語や擬態語を転用、外来語の短縮、また
は、和声カタカナ語から発生する場合（図８（Ｃ）参
照）この場合には、その品詞がいろいろあって、どの品詞が
多いということは一概に言うことはできないが、統計的
に言って、このような発生源から発生するカタカナ未登
録語の出現の頻度は非常に少ない。以上の〜の理由
から、カタカナ未登録語の品詞を名詞とすることが最適
である考えられるからである。The reason why the part of speech of the word registered in n21 is set to noun is briefly explained below. Unregistered katakana words are considered to arise from three main sources, (1) to (3) below.
(1) Verbs, adjectives, and nouns of foreign origin that become Japanese words written in katakana (see FIG. 8(A)). A foreign verb becomes the stem of a Japanese sa-row irregular conjugation verb, and such stems are used as nouns. A foreign adjective becomes a Japanese adjectival noun (na-adjective). In addition, foreign words that act as both adjectives and nouns are often written in katakana. For these reasons, an unregistered katakana word from this source is very likely to be a noun.
(2) Japanese words written in katakana because their kanji are difficult or because the writer wants to emphasize them (see FIG. 8(B)). Almost all unregistered katakana words from this source are, as above, sa-row irregular conjugation verbs, adjectival nouns, or nouns, or else proper nouns, which are treated grammatically in the same way as nouns. An unregistered katakana word from this source is therefore also very likely to be a noun.
(3) Words derived from foreign onomatopoeic or mimetic words, abbreviations of foreign words, and Japanese-coined katakana words (see FIG. 8(C)). These have various parts of speech and no single one can be said to dominate, but statistically, unregistered katakana words from this source appear very rarely.
For reasons (1) to (3), setting the part of speech of an unregistered katakana word to noun is considered the best choice.

【００４１】ｎ７では、形態素解析の結果から複数のカ
タカナ文字列の登録単語が連接している箇所があるかど
うかを判定する。ここで、複数のカタカナ文字列の登録
単語が連接している箇所があれば、ｎ８で第２の単語候
補検出、登録処理が実行される。図９は、第２の単語候
補検出、登録処理の流れを示すフローチャートである。
形態素解析結果において、複数のカタカナ文字列の登録
単語が連接する例としては「カリマンタン」「カードシ
ステム」等の文字列がある。「カリマンタン」という文
字列の形態素解析結果を図１０（Ａ）に示し、「カード
システム」という文字列に対する形態素解析結果を図１
０（Ｂ）に示す。「カリマンタン」と言う文字列は、形
態素解析で「カリ」「マン」「タン」という３つの登録
単語が連接する文字列であると判定される。「カードシ
ステム」と言う文字列は形態素解析で「カード」「シス
テム」という２つの登録単語が連接する文字列であると
判定される。In n7, it is determined from the result of the morphological analysis whether there is a place where two or more registered katakana words are connected to one another. If there is, the second word candidate detection and registration process is executed in n8. FIG. 9 is a flowchart showing the flow of this process. Examples of character strings that morphological analysis divides into a run of connected registered katakana words include 「カリマンタン」 ("Kalimantan") and 「カードシステム」 ("card system"). FIG. 10(A) shows the morphological analysis result for 「カリマンタン」, and FIG. 10(B) shows the result for 「カードシステム」. Morphological analysis determines 「カリマンタン」 to be a string of three connected registered words, 「カリ」, 「マン」, and 「タン」, and determines 「カードシステム」 to be a string of two connected registered words, 「カード」 and 「システム」.

【００４２】登録単語検証部９が各登録単語に対して、
単語の正当性を検証する（ｎ５１、ｎ５２）。この単語
の正当性は上記した図５に示した処理で検証される。そ
して、正当性のない単語が検出されているか（ｎ５
３）、または、正当性の検証できない単語が連接して検
出されているかを判定する（ｎ５４）。ここで、正当性
のない単語が検出されておらず、且つ、正当性の検証で
きない単語が連接していなければ、未登録単語が含まれ
ている可能性が無いとして処理を完了する。正当性のな
い単語が検出されている場合、または、正当性を検証で
きない単語が連接して検出されている場合には、以下の
処理が行われる。The registered word verification unit 9 verifies the validity of each registered word (n51, n52), using the process shown in FIG. 5 described above. It is then determined whether an invalid word has been detected (n53) or whether words whose validity cannot be determined have been detected next to one another (n54). If no invalid word has been detected and no words whose validity cannot be determined are connected to one another, it is judged that there is no possibility that an unregistered word is included, and the process ends. If an invalid word has been detected, or if words whose validity cannot be determined are connected to one another, the following processing is performed.

【００４３】このカタカナ文字列全体を１つの単語と
し、品詞を名詞として辞書ファイル４に登録する（ｎ５
５）。正当性の検証できない単語を、単語候補の構成要
素として検出する（ｎ５６）。また、正当性がないと判
定された単語は、前にカタカナ文字列の登録単語が連接
しているとこの登録単語とつないだ文字列からなる単語
を単語候補の構成要素として検出し、また、後ろにカタ
カナ文字列の登録単語が連接しているとこの登録単語を
つないだ文字列からなる単語を単語候補の構成要素とし
て検出する（ｎ５７）。そして、検出された単語候補の
構成要素を組み合わせて単語候補を作成し（ｎ５８）、
作成された単語候補で且つ辞書ファイル４に登録されて
いない文字列の単語候補を、辞書ファイル４に仮登録す
る（ｎ５９）。仮登録された単語候補の品詞は候補に設
定される。また、この単語候補が切り出された元の文字
列の単語（ｎ５５で登録された単語）が登録されている
辞書ファイル４内の位置も記憶される。The entire katakana character string is treated as one word and registered in the dictionary file 4 with its part of speech set to noun (n55). Each word whose validity cannot be determined is detected as a constituent element of a word candidate (n56). For each word determined to be invalid, if a registered katakana word is connected before it, the word formed by joining the two is detected as a constituent element of a word candidate, and if a registered katakana word is connected after it, the word formed by joining the two is likewise detected (n57). The detected constituent elements are then combined to create word candidates (n58), and those created word candidates whose character strings are not already registered in the dictionary file 4 are provisionally registered in the dictionary file 4 (n59). The part of speech of a provisionally registered word candidate is set to "candidate". The position in the dictionary file 4 of the word for the original character string from which the candidate was cut out (the word registered in n55) is also stored.

【００４４】例えば、「カリマンタン」という文字列を
形態素解析した結果の「カリ」「マン」「タン」の３つ
の登録単語が全て正当性の検証できない単語であったと
する。この場合、ｎ５５で「カリマンタン」の品詞を名
詞として辞書ファイル４に登録する。また、「カリ」
「マン」「タン」が単語候補の構成要素として検出さ
れ、「カリマン」「マンタン」が単語候補として仮登録
される。なお、連接していない単語「カリ」「タン」を
つないだ「カリタン」という単語は単語候補として作成
されない。また、「カードシステム」という文字列の形
態素解析の結果である「システム」は上記した文字列に
よるルールから正当性のある単語と判定される。したが
って、正当性のない単語が検出されておらず、且つ、正
当性の検証できない単語も連接しないので、未登録単語
が含んでいる可能性が無いと判定され、ｎ５５以降処理
が行われない。For example, suppose that the three registered words 「カリ」, 「マン」, and 「タン」 obtained by morphological analysis of 「カリマンタン」 are all words whose validity cannot be determined. In this case, 「カリマンタン」 is registered in the dictionary file 4 in n55 with its part of speech set to noun. 「カリ」, 「マン」, and 「タン」 are detected as constituent elements of word candidates, and 「カリマン」 and 「マンタン」 are provisionally registered as word candidates. No word candidate 「カリタン」 is created, because 「カリ」 and 「タン」 are not connected to each other. In the case of 「カードシステム」, on the other hand, the word 「システム」 in the morphological analysis result is determined to be valid by the character string length rule described above. No invalid word is detected and no words whose validity cannot be determined are connected to one another, so it is judged that there is no possibility that an unregistered word is included, and the processing from n55 onward is not performed.
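The test that decides whether the second process continues past n53 and n54 reduces to a small predicate over the per-word verdicts. The function name and the verdict strings below are hypothetical, and the verdict assumed for 「カード」 in the usage note is an assumption, since the source only states the verdict for 「システム」.

```python
from typing import List


def may_contain_unregistered(verdicts: List[str]) -> bool:
    """Decide whether a run of registered katakana words needs candidate detection.

    `verdicts` holds "valid", "invalid" or "undetermined" for each word, in order.
    """
    # n53: any invalid word is enough.
    if "invalid" in verdicts:
        return True
    # n54: two undetermined words next to each other are also enough.
    return any(a == b == "undetermined" for a, b in zip(verdicts, verdicts[1:]))
```

For 「カリマンタン」, with all three words undetermined, the predicate is true and processing continues at n55. For 「カードシステム」, taking 「カード」 as undetermined and 「システム」 as valid, it is false and the process ends.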

【００４５】すなわち、この実施の形態では、形態素解
析の結果に正当性のない単語が含まれている場合、また
は、正当性が検証できない単語が連接している場合に、
カタカナ文字列中に未登録語含まれている可能性がある
と判断し、その他の場合であればカタカナ文字列中に未
登録語含まれている可能性がないと判断している。そし
て、カタカナ文字列中に未登録語含まれている可能性が
あると判断した場合には、単語候補を作成し、これを辞
書ファイル４に仮登録している。In other words, in this embodiment, the katakana character string is judged to possibly contain an unregistered word when the morphological analysis result includes an invalid word or when words whose validity cannot be determined are connected to one another; in all other cases it is judged that there is no possibility that the katakana character string contains an unregistered word. When the katakana character string is judged to possibly contain an unregistered word, word candidates are created and provisionally registered in the dictionary file 4.

【００４６】なお、この第２の単語候補検出、登録処理
における単語の正当性の検証において、上記したルール
では厳しすぎて、正当性のある単語を正当性のない単語
であると判定してしまうケースも想定される。このよう
な場合には、辞書ファイル４に登録されている複数の単
語からなる複合語が、未登録単語として登録されてしま
うという問題が生じる恐れもある。このため、この第２
の単語候補検出、登録処理における、上記した単語の正
当性を検証する単語の文字列長によるルールを以下のよ
うに変更してもよい。In the validity verification performed in this second word candidate detection and registration process, the rules described above may be too strict, so that a valid word is sometimes determined to be invalid. In that case, a compound word made up of several words already registered in the dictionary file 4 could itself end up being registered as if it were an unregistered word. For this reason, in the second word candidate detection and registration process, the rule based on character string length may be changed as follows.

【００４７】(1) 文字列長が３文字以上の単語であれば
正当性のある単語、(2) 文字列長が２文字の単語であれ
ば正当性の有無を判定できない単語、(3) １文字であれ
ば正当性がない単語、であるとする。このように、変更
することで辞書ファイル４に複数の登録単語からなる複
合語が登録される可能性を減少させることができる。(1) A word of three or more characters is a valid word, (2) a word of two characters is a word whose validity cannot be determined, and (3) a one-character word is an invalid word. This change reduces the likelihood that a compound word made up of several registered words is registered in the dictionary file 4.

【００４８】ｎ９では、形態素解析された結果に辞書フ
ァイル４に仮登録されている単語候補が含まれているか
（単語候補が再出現したか）どうかを判定している。こ
こで、単語候補が再出現したと判定すると、ｎ１０の単
語候補の正当性検証処理が実行される。図１１は、単語
候補の正当性検証処理を示すフローチャートである。最
初に、再出現した単語候補に連接するがカタカナ文字列
全体が、該単語候補を辞書ファイル４に仮登録したとき
に切り出した文字列と一致しているかどうかを判定する
（ｎ６１）。すなわち、「ファイナンシャルシステム」
という文字列から切り出された「ファイナンシャル」と
いう単語が辞書ファイル４に仮登録されている場合、再
度同じ文字列から「ファイナンシャル」という単語候補
が切り出されたのかどうかを判定する。ｎ６１で、単語
候補が切り出された文字列と同一であると判定すると、
単語候補の正当性を正確に検証ができないとして処理を
完了する。In n9, it is determined whether the morphological analysis result includes a word candidate that is provisionally registered in the dictionary file 4, that is, whether a word candidate has reappeared. If a word candidate has reappeared, the word candidate validity verification process of n10 is executed. FIG. 11 is a flowchart showing this process. First, it is determined whether the entire katakana character string connected to the reappearing word candidate matches the character string from which that word candidate was cut out when it was provisionally registered in the dictionary file 4 (n61). For example, when 「ファイナンシャル」, cut out from the character string 「ファイナンシャルシステム」, is provisionally registered in the dictionary file 4, this step checks whether the word candidate 「ファイナンシャル」 has simply been cut out of the same character string again. If n61 finds that the character string is the same as the one from which the candidate was cut out, the validity of the word candidate cannot be verified reliably, and the process ends.

【００４９】ｎ６１で文字列が同一でないと判定する
と、この文字列の形態素解析された結果に単語候補が２
つ以上含まれているかどうかを判定する（ｎ６２）。ｎ
６２で単語候補が２つ以上含まれている場合には、単語
候補の正当性の検証ができないと判定して処理を完了す
る。一方、このカタカナ文字列中に単語候補が１つしか
含まれていない場合には、各登録単語に対して上記した
図５に示す正当性の検証処理を行う（ｎ６３、ｎ６
４）。そして、全ての登録単語が正当性のある単語とし
て判定されなければ（ｎ６５）、単語候補の正当性が検
証できないとして処理を完了する。全ての登録単語の正
当性が検証されれば、該単語候補は正当性があると判定
して、辞書ファイル４に該単語候補を正式に登録する
（ｎ６６）。単語候補を辞書ファイル４に正式に登録す
る処理は、その品詞を候補から名詞に変更する処理であ
る。単語候補正式登録部１０がこの仮登録されている単
語候補を正式に登録する処理を行う。If n61 finds that the character strings are not the same, it is determined whether the morphological analysis result for this character string includes two or more word candidates (n62). If it does, the validity of the word candidate is judged to be unverifiable and the process ends. If the katakana character string includes only one word candidate, the validity verification process shown in FIG. 5 is performed on each registered word in the string (n63, n64). Unless every registered word is determined to be valid (n65), the validity of the word candidate is judged to be unverifiable and the process ends. If every registered word is determined to be valid, the word candidate is determined to be valid and is formally registered in the dictionary file 4 (n66). Formal registration consists of changing the part of speech of the word candidate from "candidate" to noun. This is performed by the word candidate formal registration unit 10.

[0050] For example, suppose that the character string "ファイナンシャルシステム" ("financial system") has been detected and that "ファイナンシャル" ("financial") and "ナンシャル" have been provisionally registered in the dictionary file 4 as word candidates. In this case, morphological analysis of the character string "ファイナンシャルアドバイザ" ("financial advisor") yields the result shown in FIG. 12(A). The result could conceivably be the one shown in FIG. 12(B), but this does not happen when a common morphological analysis technique is used, such as the longest-match method (which gives priority to the longest word) or the minimum clause count method (which minimizes the number of words into which the string is divided). "アドバイザ" ("advisor") is then judged to be a valid word on the basis of its character string length. As a result, "ファイナンシャル" is also judged to be a valid word candidate, and its part of speech in the dictionary file 4 is changed from "candidate" to "noun". "ファイナンシャル" is thereby formally registered in the dictionary file 4.
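To illustrate why the longest-match method prefers the longer word, here is a minimal greedy segmenter. The dictionary contents are hypothetical (the entry "ファイ" is added only to create a competing shorter split, and is not taken from FIG. 12), and a real morphological analyzer would also use part-of-speech connection rules.

```python
def longest_match(text: str, words: set[str]) -> list[str]:
    """Greedy left-to-right segmentation that always takes the longest
    dictionary word starting at the current position."""
    result: list[str] = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest span first
            if text[i:j] in words:
                result.append(text[i:j])
                i = j
                break
        else:
            result.append(text[i])  # no entry found: emit one character
            i += 1
    return result


# Hypothetical dictionary containing both the long word and shorter pieces.
words = {"ファイ", "ナンシャル", "ファイナンシャル", "アドバイザ"}
print(longest_match("ファイナンシャルアドバイザ", words))
# → ['ファイナンシャル', 'アドバイザ']
```

Because the eight-character entry is tried before any shorter prefix, the split into "ファイ" and "ナンシャル" is never chosen.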

[0051] When the above processing has been completed for the entire series of text data stored in n1, a process is executed to delete unnecessary words that are provisionally registered in the dictionary file 4. FIG. 14 is a flowchart showing the flow of this unnecessary word deletion process. First, all words registered in the dictionary file 4 whose part of speech is "candidate" are detected and deleted (n71 to n73). This removes every word candidate that was provisionally registered but whose character string never appeared again, as well as every word candidate whose validity was not verified. For example, in the dictionary file 4 shown in FIG. 13, "ナンシャル", "インフレー", "フレーター" and so on are deleted (see FIG. 15(A)). Next, for each word that stores a pointer indicating the position of its original character string, the word registered at the position designated by the pointer is deleted, and the pointer is deleted with it (n74 to n77). This removes the katakana character strings from which words were cut out, verified as valid by the above processing, and formally registered in the dictionary file 4. Such a katakana character string is a compound word, so it causes no problem if it is not registered in the dictionary file 4. For example, "ファイナンシャルシステム" is deleted from the dictionary file 4 shown in FIG. 15(A) (see FIG. 15(B)), but because "ファイナンシャル" and "システム" ("system") are registered as words, deleting "ファイナンシャルシステム" causes no problem.
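The two deletion passes (n71 to n73, then n74 to n77) might look like the sketch below. The dictionary layout is an assumption made for illustration: each entry is a plain dict with a `pos` field, and the pointer to the original character string is modeled as an optional `source` key holding that string.

```python
def delete_unnecessary_words(dictionary: dict[str, dict]) -> None:
    """Remove leftover candidates, then the compounds that pointers refer to."""
    # n71 to n73: drop every entry whose part of speech is still "candidate".
    leftovers = [s for s, e in dictionary.items() if e["pos"] == "candidate"]
    for surface in leftovers:
        del dictionary[surface]
    # n74 to n77: for each surviving entry that carries a pointer, delete
    # the original compound string and the pointer itself.
    for entry in list(dictionary.values()):
        source = entry.pop("source", None)
        if source is not None:
            dictionary.pop(source, None)


# Hypothetical dictionary state before cleanup.
dictionary = {
    "ファイナンシャルシステム": {"pos": "noun"},
    "ファイナンシャル": {"pos": "noun", "source": "ファイナンシャルシステム"},
    "ナンシャル": {"pos": "candidate", "source": "ファイナンシャルシステム"},
    "システム": {"pos": "noun"},
}
delete_unnecessary_words(dictionary)
print(sorted(dictionary))
```

After the call only "ファイナンシャル" and "システム" remain, which mirrors the example in the text: the compound can still be analyzed from its two registered parts.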

[0052] As described above, the present invention can accurately detect an unregistered word in a katakana compound word formed by concatenating an unregistered word and a registered word, and register it in the dictionary file 4. Moreover, even when the character string of an unregistered word coincides with a katakana character string formed by concatenating a plurality of registered words, the unregistered word can be accurately detected and registered in the dictionary file 4. Furthermore, because erroneously registered words are ultimately deleted, no unnecessary words remain registered in the dictionary file 4.

[0053] Next, a process is described that automatically gathers statistics and assigns each word one of three properties: a word that often forms compound words, a word that rarely forms compound words, or a word with no useful property. FIG. 16 is a flowchart showing this word property determination process. Here, as shown in FIG. 17, the dictionary file 4 has an area that stores the property of each word. In the figure, property 1 denotes a word that often forms compound words, property 2 a word that rarely forms compound words, and property 3 a word with no useful property. Statistical data in which compound words and words (including unregistered words that are not registered in the dictionary file 4) are recorded is prepared as the data from which statistics are gathered (see FIG. 18). In the compound words, the symbol "・" is inserted between words as a word delimiter.

[0054] When a word for which statistics are to be gathered (hereinafter, the target word) is selected and input in n81, all words that contain it as a substring are detected (n82). For example, if the target word is "イズム" ("ism"), then "イズム", "エゴイズム" ("egoism"), "ダダイズム" ("Dadaism"), "ヒロイズム" ("heroism") and "ヘブライズム" ("Hebraism") are detected from the dictionary file 4 shown in FIG. 17; if the target word is "マネー" ("money"), then "マネー", "マネージ" ("manage"), "マネージメント" ("management"), "マネージャ" and "マネージャー" (both "manager") are detected. Words are then extracted one at a time from the statistical data (n83) and subjected to the determination described below. From a compound word, each word delimited by "・" is extracted separately. For example, if the statistical data contains the compound word "イズム・グループ" ("ism group"), it is extracted as the two words "イズム" and "グループ".
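Steps n82 and n83 amount to a substring filter and a split on the delimiter. A small sketch, with hypothetical function names and sample data that is not taken from FIG. 17 or FIG. 18:

```python
def words_containing(target: str, dictionary_words: list[str]) -> list[str]:
    # n82: every registered word that has the target as a substring
    # (the target itself is included when it is registered).
    return [w for w in dictionary_words if target in w]


def extract_words(statistical_data: list[str], delimiter: str = "・") -> list[str]:
    # n83: compound words are split at the delimiter; plain words pass through.
    return [part for item in statistical_data for part in item.split(delimiter)]


print(words_containing("マネー", ["マネー", "マネージャ", "イズム"]))
print(extract_words(["イズム・グループ", "マネー"]))
```

The first call keeps "マネー" and "マネージャ"; the second yields "イズム", "グループ" and "マネー" as three separate words.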

[0055] It is determined whether the word extracted in n83 contains the target word as a substring (n84); if it does not, the process returns to n83 and the next word is extracted. If the target word is contained as a substring, it is determined whether the word extracted in n83 matches the target word exactly, that is, whether their character string lengths are equal (n85). If the lengths are equal, a counter a (not shown) is incremented by 1 (n86). If the lengths are not equal (that is, the word extracted in n83 is longer than the target word), it is determined whether the same word as the one extracted in n83 was detected in n82, that is, whether an exactly matching word is registered in the dictionary file 4 (n87). If an exactly matching word is registered in the dictionary file 4, the process returns to n83 and the next word is extracted. This is the case, for example, when "エゴイズム" is registered in the dictionary file 4 and the word extracted in n83 is also "エゴイズム". If no exactly matching word is registered in the dictionary file 4, a counter b (not shown) is incremented by 1 (n88). When the processing of n86 or n88 is complete, the process returns to n83 and the next word is extracted. Counter a and counter b are both set to 0 when the target word is selected in n81. When every word in the statistical data has been extracted and the processing from n83 onward has been completed (n89), the word property determination described below is performed (n90), and the resulting property is registered in the dictionary file 4 as the property of the target word (n91).
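The counting loop of n83 to n89 can be sketched as below. The function name and the sample data are hypothetical and are not the contents of FIG. 17 or FIG. 18; `detected` stands for the set of dictionary words found in n82.

```python
def count_occurrences(target: str, detected: set[str], extracted: list[str]) -> tuple[int, int]:
    """Return (a, b) for one target word."""
    a = b = 0  # both counters start at 0 when the target is chosen (n81)
    for word in extracted:              # n83: next word from the statistics
        if target not in word:          # n84: target is not a substring
            continue
        if len(word) == len(target):    # n85: equal length means exact match
            a += 1                      # n86
        elif word not in detected:      # n87: longer word absent from dictionary
            b += 1                      # n88
    return a, b


# Hypothetical data for illustration only.
detected = {"イズム", "エゴイズム"}
extracted = ["イズム", "グループ", "エゴイズム", "ヒロイズム", "ダダイズム"]
print(count_occurrences("イズム", detected, extracted))
# → (1, 2)
```

Here "イズム" itself raises a, "エゴイズム" is skipped because it is already a dictionary word, and the two unregistered words containing "イズム" raise b.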

[0056] Using the dictionary file 4 shown in FIG. 17 and the statistical data shown in FIG. 18, with "イズム" and "マネー" as target words, the above processing yields the following count values for counter a and counter b.
"イズム": a = 1, b = 6
"マネー": a = 5, b = 0
As described above, the property of a word is determined to be one of three: a word that often forms compound words, a word that rarely forms compound words, or a word with no useful property. In this embodiment, if a/(a+b) > 0.8 holds, the word is judged likely to stand as an independent word within a compound word; if b/(a+b) > 0.8 holds, the word is judged unlikely to stand as an independent word within a compound word; otherwise the word is judged to have no useful property.
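Applying the two threshold tests to the counts quoted above gives a quick check of the rule. The function and its label strings are illustrative wording, and the handling of a + b = 0 is an assumption, since the text does not cover that case.

```python
def classify(a: int, b: int, threshold: float = 0.8) -> str:
    """Classify a target word from its counters a and b."""
    total = a + b
    if total == 0:
        # Assumption: with no observations, no property can be inferred.
        return "no useful property"
    if a / total > threshold:
        return "likely independent"
    if b / total > threshold:
        return "unlikely independent"
    return "no useful property"


print(classify(1, 6))  # "イズム": b/(a+b) = 6/7, about 0.857
print(classify(5, 0))  # "マネー": a/(a+b) = 5/5 = 1.0
```

With these counts, "イズム" falls in the "unlikely independent" class and "マネー" in the "likely independent" class.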

[0057] In n90, the above calculation is carried out using the count values of counter a and counter b obtained by the above processing, and the property of the word is determined. In n91, this property is registered in the dictionary file 4 as the property of the target word, and the process ends.

[0058] As described above, in this embodiment the property of a word is set on the basis of statistics, so word properties can be set objectively.

[0059]

[Effects of the Invention] As described above, according to the present invention, when a contiguous katakana character string contains an unregistered word, the extent of the character string that forms the unregistered word is accurately identified, so that the unregistered word can be detected and registered. In addition, words that were erroneously detected and registered in the dictionary file are ultimately deleted, so the dictionary file does not grow unnecessarily large.

[Brief description of drawings]

FIG. 1 is a block diagram showing the functions of a Japanese sentence analysis device according to an embodiment of the present invention.

FIG. 2 is a flowchart showing the processing of the Japanese sentence analysis device of this embodiment.

FIG. 3 is a diagram showing the concept of morphological analysis.

FIG. 4 is a flowchart showing the flow of the first word candidate detection and registration process.

FIG. 5 is a flowchart of the registered word validity verification process.

FIG. 6 is a diagram showing examples of the constituent elements of detected word candidates and of the word candidates that are created.

FIG. 7 is a diagram showing a dictionary file in which word candidates are registered.

FIG. 8 is a diagram explaining the sources of unregistered katakana words.

FIG. 9 is a flowchart showing the flow of the second word candidate detection and registration process.

FIG. 10 is a diagram showing examples of the constituent elements of detected word candidates and of the word candidates that are created.

FIG. 11 is a flowchart showing the word candidate validity verification process.

FIG. 12 is a diagram showing morphological analysis results for a character string that contains a word candidate.

FIG. 13 is a diagram showing the dictionary file after provisionally registered word candidates have been formally registered.

FIG. 14 is a flowchart showing the flow of the unnecessary word deletion process.

FIG. 15 is a diagram showing the dictionary file after unnecessary words have been deleted.

FIG. 16 is a flowchart showing the word property determination process.

FIG. 17 is a diagram showing a dictionary file that stores the properties of words.

FIG. 18 is a diagram showing the statistical data.

[Explanation of symbols]

1 - Japanese sentence analysis device
2 - Text data storage unit
3 - Morphological analysis unit
4 - Dictionary file
5 - Word candidate detection unit
6 - Word candidate registration unit
7 - Word candidate verification unit
8 - Word candidate deletion unit
9 - Registered word verification unit
10 - Word candidate formal registration unit

Claims

[Claims]

1. A Japanese sentence analysis device comprising: a dictionary file in which the character string of each word and data indicating attributes of the word are registered; morphological analysis means for performing morphological analysis that divides an input character string into words using the dictionary file; unregistered word detecting means for detecting, when a contiguous katakana character string in the input character string is divided into a plurality of words by the morphological analysis and the divided words include a word whose character string length is one character, a character string connecting that word and the adjacent word before or after it as an unregistered word; and unregistered word registering means for registering the detected unregistered word.

2. A Japanese sentence analysis device comprising: a dictionary file in which the character string of each word and data indicating attributes of the word are registered; morphological analysis means for performing morphological analysis that divides an input character string into words using the dictionary file; unregistered word detecting means for detecting, when a contiguous katakana character string in the input character string is divided into a plurality of words by the morphological analysis and the divided words include a word of a predetermined part of speech, a character string connecting that word and the adjacent word before or after it as an unregistered word; and unregistered word registering means for registering the detected unregistered word.

3. A Japanese sentence analysis device comprising: a dictionary file in which the character string of each word and data indicating attributes of the word are registered; and morphological analysis means for performing morphological analysis that divides an input character string into words using the dictionary file, wherein the data indicating the attributes includes data indicating whether or not the corresponding word is a word that is unlikely to form a compound word, the device further comprising: unregistered word detecting means for detecting, when a contiguous katakana character string in the input character string is divided into a plurality of words by the morphological analysis and the divided words include a word that is unlikely to form a compound word, a character string connecting that word and the adjacent word before or after it as an unregistered word; and unregistered word registering means for registering the detected unregistered word.

4. The Japanese sentence analysis device according to claim 1, wherein the unregistered word detecting means includes means for detecting, when the words obtained by the division into a plurality of words by the morphological analysis include a word of a predetermined part of speech, a character string connecting that word and the adjacent word before or after it as an unregistered word.

5. The data indicating the attribute includes data indicating whether or not the corresponding word is a word that is unlikely to form a compound word, and the unregistered word detection means detects a plurality of words by the morphological analysis. If a word divided into words contains a word that is unlikely to form a compound word, a character string connecting the word and the word adjacent to the word before or after the word is detected as an unregistered word. 3. Means including means.
Or the Japanese sentence analysis device described in any one of 4.

6. The Japanese sentence analysis device according to claim 1, further comprising validity verifying means for verifying, when the words obtained by the division into a plurality of words by the morphological analysis include a word whose character string matches the unregistered word detected by the unregistered word detecting means, whether or not the unregistered word is valid as a word by checking it against a predetermined rule.

7. A Japanese sentence analysis method comprising: performing morphological analysis that divides an input character string into words using a dictionary file in which the character string of each word and data indicating attributes of the word are registered; when a contiguous katakana character string in the input character string is divided into a plurality of words by the morphological analysis and the divided words include a word whose character string length is one character, detecting a character string connecting that word and the adjacent word before or after it as an unregistered word; and registering the detected unregistered word.

8. A Japanese sentence analysis method comprising: performing morphological analysis that divides an input character string into words using a dictionary file in which the character string of each word and data indicating attributes of the word are registered; when a contiguous katakana character string in the input character string is divided into a plurality of words by the morphological analysis and the divided words include a word of a predetermined part of speech, detecting a character string connecting that word and the adjacent word before or after it as an unregistered word; and registering the detected unregistered word.

9. A Japanese sentence analysis method comprising: performing morphological analysis that divides an input character string into words using a dictionary file in which the character string of each word and data indicating attributes, including data indicating whether or not the corresponding word is a word that is unlikely to form a compound word, are registered; when a contiguous katakana character string in the input character string is divided into a plurality of words by the morphological analysis and the divided words include a word that is unlikely to form a compound word, detecting a character string connecting that word and the adjacent word before or after it as an unregistered word; and registering the detected unregistered word.

10. The Japanese sentence analysis method according to claim 7, wherein, when the words obtained by the division into a plurality of words by the morphological analysis include a word of a predetermined part of speech, a character string connecting that word and the adjacent word before or after it is detected as an unregistered word.

11. The data indicating the attribute includes data indicating whether or not a corresponding word is a word that is unlikely to form a compound word, and is included in a word divided into a plurality of words by the morphological analysis. 8. If a word that is unlikely to form a compound word is included, a character string that connects the word and a word adjacent to the word before or after the word is detected as an unregistered word. , 8 or 10 Japanese sentence analysis method.

12. The Japanese sentence analysis method according to claim 7, wherein, when the words obtained by the division into a plurality of words by the morphological analysis include a word whose character string matches the detected unregistered word, it is verified whether or not the unregistered word is valid as a word.