JPH06301722A - Morpheme analyzing device and keyword extracting device

Info

Publication number: JPH06301722A
Application number: JP5086154A
Authority: JP
Inventors: Satoru Ito; 哲伊藤; 泰治〆木; Taiji Shimeki; Kazuhiro Kayashima; 一弘萱嶋; Toshio Niwa; 寿男丹羽; Hidetsugu Maekawa; 英嗣前川
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1993-04-13
Filing date: 1993-04-13
Publication date: 1994-10-28
Anticipated expiration: 2018-01-07
Also published as: JP3361563B2

Abstract

PURPOSE: To provide a morphological analysis device that can efficiently process, as words, character strings of consecutive katakana or Roman letters, and a keyword extraction device that can efficiently extract single-word and compound-word keywords. CONSTITUTION: The device comprises morphological analysis means 2, which extracts each string of consecutive katakana or Roman letters in a Japanese document input through document input means 1 as a noun-equivalent word and extracts noun words of other character types using a dictionary 5; compound word creation means 3, which generates every compound word that can be formed from consecutive noun words or noun-equivalent words as constituent words; and keyword evaluation means 4, which calculates an evaluation value of keyword likelihood based on the appearance frequency of these words and compound words in the document and on their general occurrence frequency.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a morphological analysis device and a keyword extraction device for applications that require the creation of dictionaries used in word processors, Japanese document analysis devices, and the like, and the extraction of meaningful character strings such as keywords.

[0002]

[Prior Art] In Japanese-language analysis systems, including keyword extraction devices, morphological analysis is performed using the words in the input text that are registered in a dictionary as clues, in order to summarize the document or extract words that serve as keywords. However, when the text contains an unknown word that is not registered in the dictionary, it is difficult to perform correct morphological analysis, or to form compound words, for the portion that contains the unknown word.

[0003] For the processing of unknown words, and in particular unknown words written in katakana, there is a conventional document analysis method disclosed in Japanese Unexamined Patent Publication No. 62-90760. When a katakana string that is not registered in the dictionary appears in the text, this method treats the katakana portion as an independent word.

[0004] FIG. 6 is a block diagram of a document analysis device that illustrates this processing of unknown katakana words. In FIG. 6, 65 is a dictionary search unit that looks up words in the text using a dictionary; 61 is a katakana unknown-word processing unit that treats a katakana term in the text as a single word when it is not registered as a word in the dictionary; 62 is a DP matching unit that performs DP matching; 63 is a post-processing unit that integrates the results produced by the dictionary search unit 65, the DP matching unit 62, and the katakana unknown-word processing unit 61; and 64 is a document analysis management unit that performs preprocessing, which splits the input text at, for example, punctuation marks and symbols, and integrates the data after post-processing.

[0005] FIG. 7 is a flowchart showing the operation of the document analysis device described above. The document analysis method for katakana characters is described below with reference to this flowchart.
(a) When the character string of the input text is input to the document analysis management unit 64, it is divided into analysis units and sent out (step S1). Here, an analysis unit is a character string delimited by punctuation marks, symbols, or the like.
(b) The dictionary search unit 65 performs a dictionary lookup on each analysis unit. If there is a match, the result is sent to the DP matching unit 62; if there is no match, the unit is sent to the katakana unknown-word processing unit 61 (step S2).
(c) When the dictionary lookup yields a match, the DP matching unit 62 determines the optimal combination of the extracted words that matches the analysis unit (step S3). If the result is judged inappropriate, it is returned to the dictionary search unit 65; if appropriate, it is sent to the post-processing unit 63.
(d) The dictionary search unit 65 sends the analysis unit returned from the DP matching unit 62 to the katakana unknown-word processing unit 61.
(e) The katakana unknown-word processing unit 61 treats as a single word the katakana portion of any analysis unit sent to it because the dictionary lookup found no match or because matching failed (step S4), and sends it to the post-processing unit 63.
(f) The post-processing unit 63 integrates the data, such as words, sent from the DP matching unit 62 and the katakana unknown-word processing unit 61, and returns it to the document analysis management unit 64 (step S6).

[0006] Document analysis of words written in katakana proceeds step by step through the above procedure.

[0007] Next, the technical term extraction system disclosed in Japanese Unexamined Patent Publication No. 3-116374, which creates compound-word keywords from frequency information, is described. This system extracts all compound words from the text, uses the appearance frequency of the constituent words of the compound words and the number of word combinations to extract particular constituent words from the compound words as technical-term characteristic bases, and treats every compound word containing such a base as a keyword.

[0008] FIG. 8 is a block diagram showing the configuration of this conventional technical term extraction system. The system consists of a morphological analysis unit 82 that takes an input document 81 as input, a compound word extraction unit 83 that extracts all compound words in the input document 81, a technical-term characteristic base extraction unit 84 that extracts particular constituent words from the extracted compound words, and a technical term extraction unit 85 that extracts keywords based on those constituent words. These units operate in sequence.

[0009] First, the morphological analysis unit 82 performs morphological analysis on an input document 81 from any field, identifying word units and parts of speech. The compound word extraction unit 83 then extracts from the morphological analysis result every sequence of nouns, adjectival nouns, and affixes as a compound word, thereby extracting all compound words in the input document 81. For each extracted compound word, the technical-term characteristic base extraction unit 84 uses an evaluation value to judge whether its main base, which is one of its constituent words, is a base that forms technical terms, and extracts only such bases. The technical term extraction unit 85 then extracts as technical terms only those compound words whose base was judged to be a technical-term characteristic base.

[0010] An outline of the processing by the technical-term characteristic base extraction unit 84 is described with reference to FIG. 9. For the compound words in the compound word list 86 extracted by the compound word extraction unit 83, the frequency of the main base located at the end of each word is first calculated (step S11). Next, the average number of combinations of each main base within the input document 81 is calculated (step S12). This is the number of modifiers that attach in front of a word when it serves as the main base, computed per base. An evaluation value is then calculated from the main-base frequency and the average number of combinations obtained in this way, and if it is at or above a certain threshold, the word is judged to be a technical-term characteristic base (step S14).

[0011]

[Problems to Be Solved by the Invention] However, the former conventional method, which matches katakana strings against a dictionary, has the following problem. Katakana strings in Japanese documents, such as katakana renderings of English words and Japanese-coined English, often have slightly different spellings despite sharing the same origin. As a result, many katakana words and alphabetic strings end up being treated as unregistered words. For example, the English word "Fuzzy", which denotes vagueness, is written in katakana as 「ファジー」, 「ファジイ」, 「ファジィ」, and so on, but a dictionary rarely contains all of these spellings. Registering every such subtle spelling variation in the dictionary, or matching strings of katakana or alphabetic characters against the dictionary, is therefore largely wasted effort.

[0012] The latter conventional keyword extraction method has the problem that the words extracted as technical-term characteristic bases do not necessarily represent the characteristics of the field. This is because the constituent words of compound technical terms include many words that are in general use regardless of field, as well as words from fields entirely different from the target field. Such words have little value as keywords on their own.

[0013] Furthermore, because every compound word containing a particular common constituent word chosen as the main base is treated as a keyword, an error in selecting that constituent word causes inappropriate keywords consisting of many unnecessary words to be selected. In addition, no main base can be created when the text contains no compound words or when the compound words share no common constituent word.

[0014] Moreover, people often choose compound words that contain many unknown words, such as new or coined words, as the keywords or technical terms of a specific field. With conventional compound word creation means, however, it is difficult to recognize a compound word as a single compound word if even one of its constituent words is unknown.

[0015] In view of these problems in conventional document processing, the object of the present invention is to provide a morphological analysis device that can efficiently process, as words, character strings of consecutive katakana or alphabetic characters, and a keyword extraction device that can efficiently extract single-word and compound-word keywords.

[0016]

[Means for Solving the Problems] The invention of claim 1 is a morphological analysis device comprising a dictionary that stores at least the grammatical information needed for analysis, and morphological analysis means that extracts at least noun words and noun-equivalent words from input Japanese text using the dictionary, wherein the morphological analysis means extracts character strings consisting of consecutive katakana or alphabetic characters in the input Japanese text as noun-equivalent words.

[0017] The invention of claim 2 is a keyword extraction device comprising: a dictionary that stores at least the grammatical information needed for analysis; morphological analysis means that extracts at least noun words and noun-equivalent words from input Japanese text using the dictionary; a compound-word keyword candidate creation unit that enumerates, as keyword candidates, the compound words formed by combining two or more of the words or equivalent words that constitute a compound word made up of consecutive noun words or noun-equivalent words extracted by the morphological analysis means; compound word evaluation means that determines an evaluation value for each compound-word keyword candidate enumerated by the candidate creation unit from its cumulative frequency in the input Japanese text, taking into account the frequency with which the candidates overlap one another; and keyword determination means that determines keywords based on those evaluation values.

[0018]

[Operation] In the present invention, the dictionary stores at least the grammatical information needed for analysis, and the morphological analysis means extracts at least noun words from the input Japanese text using the dictionary and also extracts character strings consisting of consecutive katakana or alphabetic characters in the input Japanese text as noun-equivalent words.

[0019] Also in the present invention, the dictionary stores at least the grammatical information needed for analysis; the morphological analysis means extracts at least noun words and noun-equivalent words from the input Japanese text using the dictionary; the compound-word keyword candidate creation unit enumerates, as keyword candidates, the compound words formed by combining two or more of the words or equivalent words that constitute a compound word made up of consecutive extracted noun words or noun-equivalent words; the compound word evaluation means determines an evaluation value for each enumerated compound-word keyword candidate from its cumulative frequency, taking into account the frequency with which the candidates overlap one another; and the keyword determination means determines keywords based on those evaluation values.

[0020]

[Embodiments] The present invention is described below with reference to the drawings, which show embodiments of the invention.

[0021] FIG. 1 is a block diagram showing a keyword extraction device according to one embodiment of the present invention. The keyword extraction device includes, as morphological analysis means 2, a morphological analysis device that is another embodiment of the present invention. The morphological analysis means 2 extracts words from the character string input from document input means 1, identifies their parts of speech, and extracts the nouns. Connected to the morphological analysis means 2 are compound word creation means 3, which creates compound words by combining the extracted nouns; keyword evaluation means 4, which evaluates the words and compound words extracted from the document together as keywords; and a dictionary 5, which is used for word extraction, part-of-speech identification, and the like. As shown for example in FIG. 5, the dictionary 5 comprises, in addition to word information such as hiragana and kanji word strings and part-of-speech information, a storage unit 51 that stores the occurrence frequency of noun words in general text (hereinafter, the general occurrence frequency), a storage unit 52 that stores the occurrence frequency in a specific field, and a selection unit 53 that selects between the storage units 51 and 52 according to the input document.

[0022] The compound word creation means 3 described above constitutes the compound-word keyword candidate creation unit, and the keyword evaluation means 4 constitutes the compound word evaluation means and the keyword determination means.

[0023] An outline of the processing by the morphological analysis means 2 is described with reference to FIG. 2. As shown in FIG. 2, the morphological analysis means 2 is configured to perform the processing of an unknown word extraction unit 21 and a noun extraction unit 22 in that order. For the character string input by document input means 1, the unknown word extraction unit 21 determines the character type of each character, and extracts and registers each run of consecutive katakana or alphabetic characters as a noun-equivalent word. Strings of kanji or hiragana are passed unchanged to the noun extraction unit 22, and the remaining punctuation marks and symbols are discarded.
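The character-type split performed by the unknown word extraction unit 21 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function name `split_by_character_type` and the Unicode ranges chosen for each character class are assumptions, since the source does not specify code-point ranges.

```python
import re

# Assumed character classes; the source does not define exact ranges.
KATAKANA = r"[\u30A1-\u30FA\u30FC]"                    # katakana plus the long-vowel mark
LATIN = r"[A-Za-z\uFF21-\uFF3A\uFF41-\uFF5A]"          # ASCII and full-width letters
KANJI_HIRAGANA = r"[\u3041-\u3096\u3005\u4E00-\u9FFF]"  # hiragana, iteration mark, CJK ideographs

TOKEN = re.compile(
    f"(?P<kata>{KATAKANA}+)|(?P<latin>{LATIN}+)|(?P<other>{KANJI_HIRAGANA}+)"
)


def split_by_character_type(text):
    """Return (noun_equivalents, strings_for_dictionary_lookup).

    Runs of katakana or alphabetic characters become noun-equivalent words;
    runs of kanji/hiragana are kept for the dictionary-based noun extraction;
    everything else (punctuation, symbols) is dropped.
    """
    noun_equivalents = []
    for_dictionary = []
    for match in TOKEN.finditer(text):
        if match.lastgroup == "other":
            for_dictionary.append(match.group())
        else:
            noun_equivalents.append(match.group())
    return noun_equivalents, for_dictionary


nouns, rest = split_by_character_type("ファジィ制御はFuzzy理論の応用である。")
print(nouns)  # katakana and alphabetic runs
print(rest)   # kanji/hiragana runs passed on to dictionary lookup
```

Because katakana and alphabetic runs never reach the dictionary, spelling variants such as 「ファジー」 and 「ファジィ」 are each picked up without needing dictionary entries.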

[0024] Next, the noun extraction unit 22 identifies words and their parts of speech by checking the kanji and hiragana strings it receives against the strings registered in the dictionary 5. Only words whose part of speech is judged to be noun are registered, together with their general occurrence frequencies. The dictionary 5 stores the character strings of hiragana and kanji words and their part-of-speech information and, for nouns, also stores the general occurrence frequency of the word in the storage unit 51 (see FIG. 5). Here, the "general occurrence frequency" is the probability that the word or phrase appears in general documents.

[0025] The appearance frequency in the input document of each noun word and noun-equivalent word registered in this way is counted, and these words become single-word keyword candidates. Consecutive nouns and noun-equivalent words are sent to the compound word creation means 3.

[0026] Processing strings of particular character types separately in this way not only makes katakana and alphabetic strings, which tend to be treated as unknown words, available as keyword candidates, but also avoids looking up in the dictionary 5 strings of character types that are unlikely to be registered there, which makes word extraction efficient.

[0027] Next, an outline of the processing by the compound word creation means 3 is described with reference to FIG. 3. As shown in FIG. 3, the compound word creation means 3 is configured to perform the processing of a compound word candidate creation unit 31 and a compound word evaluation unit 32 in that order. The compound word candidate creation unit 31 takes the consecutive noun words extracted by the morphological analysis means 2 as constituent words, enumerates every compound word candidate that can be formed from them, and counts the co-occurrence frequency of each candidate. Here, the co-occurrence frequency is the frequency with which a word appears adjacent to particular other words, that is, the appearance frequency of the compound word. For example, if the nouns "A", "B", and "C" are used consecutively in a text "~ABC~AB~AB~", the compound word candidates created and their co-occurrence frequencies are "ABC": 1, "AB": 3, and "BC": 1. This processing makes it possible, in particular, to create and evaluate compound words made up of constituent words that are shared among compound word candidates. Even when the dictionary noun "A" is itself a compound of the dictionary words "a1" and "a2", the candidates may be created as described above, but it is also possible to create them from the four constituent words "a1", "a2", "B", and "C".
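The enumeration and counting in the "~ABC~AB~AB~" example can be sketched as below. The sketch assumes that "all compound word candidates that can be created" means every contiguous sequence of two or more constituent words within a run of consecutive nouns, which is what the example implies; the function name `count_compound_candidates` and the tuple representation are illustrative choices.

```python
from collections import Counter


def count_compound_candidates(noun_runs):
    """Count co-occurrence frequencies of compound word candidates.

    noun_runs: list of runs, each a list of consecutive nouns or
    noun-equivalent words as found in the text.
    Returns a Counter mapping each candidate (a tuple of constituent
    words, length >= 2) to the number of times it occurs.
    """
    counts = Counter()
    for run in noun_runs:
        n = len(run)
        for start in range(n):
            # end is exclusive; start + 2 guarantees at least two words
            for end in range(start + 2, n + 1):
                counts[tuple(run[start:end])] += 1
    return counts


# The text "~ABC~AB~AB~" contains three runs of consecutive nouns.
runs = [["A", "B", "C"], ["A", "B"], ["A", "B"]]
print(dict(count_compound_candidates(runs)))
# → {('A', 'B'): 3, ('A', 'B', 'C'): 1, ('B', 'C'): 1}
```

Note that "AB" is counted three times because the occurrence inside "ABC" is included, matching the frequencies given in the text.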

[0028] Next, the compound word evaluation unit 32 compares the compound word candidates created by the compound word candidate creation unit 31. When a compound word X forms part of the constituent words of another compound word Y, and the value obtained by subtracting the co-occurrence frequency of Y from the co-occurrence frequency of X is at or below a certain threshold (≥ 0), candidate X is deleted from the compound word candidates. In the example text "~ABC~AB~AB~" above, the compound word candidate creation unit 31 creates "ABC", "AB", and "BC". Since "AB" and "BC" are contained in "ABC", subtracting the co-occurrence frequency of "ABC", which is 1, from their co-occurrence frequencies of 3 and 1 gives 2 and 0 respectively. If the threshold is 0, the candidate "BC" is removed from the compound word candidates. This is because, at equal frequency, "ABC" is considered to be a keyword that reflects the content better than "BC".
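The pruning rule of the compound word evaluation unit 32 can be sketched as follows. The source describes the comparison for one pair X, Y; how to handle a candidate contained in several longer candidates is not stated, so this sketch assumes X is dropped if the condition holds for any containing Y. The function names are hypothetical.

```python
def is_contiguous_part(short, long):
    """True if tuple `short` occurs as a contiguous slice of the longer tuple `long`."""
    if len(short) >= len(long):
        return False
    width = len(short)
    return any(long[i:i + width] == short for i in range(len(long) - width + 1))


def prune_compound_candidates(counts, threshold=0):
    """Drop candidate X when some longer candidate Y contains X and
    freq(X) - freq(Y) <= threshold (assumed: any one such Y suffices)."""
    kept = {}
    for cand, freq in counts.items():
        absorbed = any(
            is_contiguous_part(cand, other) and freq - other_freq <= threshold
            for other, other_freq in counts.items()
        )
        if not absorbed:
            kept[cand] = freq
    return kept


counts = {("A", "B", "C"): 1, ("A", "B"): 3, ("B", "C"): 1}
print(prune_compound_candidates(counts, threshold=0))
# → {('A', 'B', 'C'): 1, ('A', 'B'): 3}
```

With threshold 0, "BC" is removed because it never appears outside "ABC", while "AB" survives because it appears twice on its own.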

[0029] If the appearance frequency of a compound word candidate is at or above a set threshold, the candidate is recognized as a compound-word keyword candidate. These thresholds can be set as appropriate for the application.

[0030] Next, an outline of the processing by the keyword evaluation means 4 is described with reference to FIG. 4. As shown in FIG. 4, the keyword evaluation means 4 includes an evaluation value calculation unit 41, which receives from the morphological analysis means 2 the character string, appearance frequency, and general occurrence frequency of each noun keyword candidate, and from the compound word creation means 3 each compound-word keyword candidate and its co-occurrence frequency.

[0031] First, depending on the application, when a noun keyword candidate is a constituent word of a compound-word keyword candidate, the evaluation value calculation unit 41 subtracts the co-occurrence frequency of that compound word from the appearance frequency of the noun. Words whose appearance frequency falls to or below a certain threshold as a result are excluded from the keyword candidates. This removes from the candidates words that are rarely used on their own in the document.

[0032] Next, the evaluation value calculation unit 41 calculates, from the frequency information of the words and compound words, an evaluation value that indicates how likely each is to be a keyword. Basically, as shown in Equation 1, the evaluation value E of keyword likelihood is the appearance frequency f in the text multiplied by a constant a, minus the general occurrence frequency F multiplied by a constant b. The constants a and b, which are the weighting coefficients for the appearance frequency and the general occurrence frequency, are set as appropriate for the application. For compound words, the general occurrence frequency is taken to be 0. The appearance frequency used is normalized by the length of the text to which the method is applied. Although the general occurrence frequency of a compound word is set to 0 in this example, it may instead be set based on the length of the keyword string, the occurrence frequencies of the constituent words, their character types, their number, and so on.

[0033]

[Equation 1] E = f·a − F·b (where a > 0, b ≥ 0)

Among the evaluation values calculated in this way by the keyword evaluation means 4, items whose appearance frequency in the document is high relative to their general occurrence frequency are more likely to be keywords. Keywords may be taken as a fixed number of items from the top of the ranking, or all items whose evaluation value is at or above a certain threshold may be selected. Such thresholds are likewise set as appropriate according to the size and purpose of the documents to which the method is applied.
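A minimal sketch of Equation 1 and the two selection strategies (top N or score threshold) is given below. The normalization of the appearance count by a document length in words is an assumption; the source only says the appearance frequency is normalized by the length of the text. The function names, the default weights a = b = 1, and the sample figures are illustrative, not values from the source.

```python
def keyword_score(appearances, doc_length, general_freq=0.0, a=1.0, b=1.0):
    """Equation 1: E = f*a - F*b, with f = appearances / doc_length (assumed normalization)."""
    if a <= 0 or b < 0:
        raise ValueError("Equation 1 requires a > 0 and b >= 0")
    f = appearances / doc_length
    return f * a - general_freq * b


def rank_keywords(candidates, doc_length, top_n=None, min_score=None, a=1.0, b=1.0):
    """candidates maps a word or compound to (appearance count, general occurrence frequency).

    Compound words are passed with a general occurrence frequency of 0.
    Returns (word, score) pairs sorted by descending score.
    """
    scored = [
        (word, keyword_score(count, doc_length, general_freq, a, b))
        for word, (count, general_freq) in candidates.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    if min_score is not None:
        scored = [item for item in scored if item[1] >= min_score]
    return scored if top_n is None else scored[:top_n]


# Hypothetical figures for a 100-word document.
candidates = {
    "ファジィ": (4, 0.0),       # noun-equivalent word, not in the dictionary
    "制御": (5, 0.03),          # dictionary noun with a general occurrence frequency
    "ファジィ制御": (3, 0.0),   # compound word, general occurrence frequency taken as 0
}
for word, score in rank_keywords(candidates, doc_length=100, top_n=2):
    print(word, round(score, 3))
```

In this example the common noun 「制御」 appears most often in the document but ranks last, because its general occurrence frequency is subtracted.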

[0034] Because single-word and compound-word keyword candidates are evaluated together in this way and keywords of appropriate length are extracted, the document content can be reflected with a smaller number of keywords.

[0035] Furthermore, when keywords are to be extracted for a specific field, the noun occurrence frequency selection unit 53 provided in the dictionary 5 selects the storage unit 52 for occurrence frequencies in the specific field, as shown in FIG. 5. Selecting the storage unit 52 allows the occurrence frequencies of nouns in the specific field to be used in place of the general noun occurrence frequencies, so that keywords that better reflect the document content can be extracted.

[0036] The general occurrence frequency is the probability that a word appears in text from a wide variety of fields, whereas the occurrence frequency for a specific field can easily be collected by analyzing in advance the appearance frequencies of noun words in documents from the target field. For example, to extract keywords from a particular article in a newspaper, the appearance frequencies of nouns across that whole newspaper may be used; for politics, statistics may be compiled from past political articles only.
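Collecting a field-specific occurrence table and choosing between tables can be sketched as below. This assumes the documents have already been reduced to lists of nouns and that the occurrence frequency is a simple relative frequency; the source fixes neither detail. The names `build_occurrence_table` and `select_table`, and the fallback to a "general" table, are hypothetical.

```python
from collections import Counter


def build_occurrence_table(tokenized_documents):
    """Relative frequency of each noun over a collection of documents.

    tokenized_documents: iterable of noun lists, one list per document.
    """
    counts = Counter(noun for doc in tokenized_documents for noun in doc)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {noun: count / total for noun, count in counts.items()}


def select_table(tables, field_name, default="general"):
    """Pick the table for field_name, falling back to the general table (assumed behavior)."""
    return tables.get(field_name, tables[default])


# Hypothetical noun lists taken from past political articles.
politics = build_occurrence_table([["選挙", "政党", "選挙"], ["政党", "国会"]])
tables = {"general": {"選挙": 0.01, "政党": 0.01}, "politics": politics}
print(select_table(tables, "politics")["選挙"])  # field-specific value
print(select_table(tables, "sports")["選挙"])    # falls back to the general table
```

A word such as 「選挙」 that is frequent in political articles then receives a larger F in Equation 1, so it is less likely to be picked as a keyword for any single political article.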

[0037] When occurrence frequency information is used or collected, the choice of which field's occurrence frequency to use may be made by a person according to the application. It can also be automated by checking the words extracted from the text targeted for keyword extraction or frequency collection against a thesaurus dictionary or the like and estimating the field.
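One simple way to automate the field estimation mentioned above is a majority vote over the fields a thesaurus assigns to the extracted words. The sketch below assumes a hypothetical thesaurus represented as a plain mapping from word to field name; a real thesaurus dictionary would be richer, and the voting rule itself is an assumption.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def estimate_field(words: Iterable[str], thesaurus: Dict[str, str]) -> Optional[str]:
    """Guess the field of a text from the words extracted from it.

    `thesaurus` maps a word to a field label (hypothetical format).
    Returns None when no extracted word is found in the thesaurus.
    """
    votes = Counter(thesaurus[w] for w in words if w in thesaurus)
    if not votes:
        return None
    # The field with the most votes wins; ties are broken by field name
    # so that the result is deterministic.
    return min(votes.items(), key=lambda kv: (-kv[1], kv[0]))[0]
```

The estimated field could then drive the choice of frequency table instead of a manual decision.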

[0038] Furthermore, by registering the obtained keyword information as a keyword dictionary, it can be used in more advanced Japanese document analysis systems.

[0039] As described above, the morpheme analysis means 2 extracts katakana and alphabetic character strings as they are as noun-equivalent words, so that words made up of those character types, which tend to be treated as unknown words, and compound words containing them can efficiently be made keyword candidates. In addition, the simple keyword evaluation means 4, which evaluates keyword candidates from the appearance frequency of a word in the text, its general occurrence frequency, and the co-occurrence frequency of compound words, evaluates and extracts word and compound-word keywords at the same time. As a result, keywords consisting of words that include unknown katakana or alphabetic words, and of compound words made up of valid constituent words, can be created and extracted automatically from an input document. This requires neither a large-capacity dictionary in which every subtle notational variant of katakana words and the like is registered, nor a detailed analysis of the dependency relations between the constituent words of compound words, such as a technical term characteristic word base extraction means.
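The extraction of consecutive katakana or alphabetic strings as noun-equivalent words can be illustrated with a regular expression. This is a minimal sketch under several assumptions: the function name is hypothetical, katakana and Latin runs are extracted as separate strings rather than merged, full-width Latin letters are accepted, and the katakana middle dot (・) is not treated as part of a run.

```python
import re
from typing import List

# One run of katakana (letters ァ-ヶ plus the long-vowel mark ー), or one run
# of Latin letters (ASCII or full-width). The exact character sets are an
# assumption; the text only says "katakana or English characters".
_RUN = re.compile(r"[ァ-ヶー]+|[A-Za-zＡ-Ｚａ-ｚ]+")


def extract_noun_equivalents(text: str) -> List[str]:
    """Return every maximal katakana or Latin-letter run, in order of appearance."""
    return _RUN.findall(text)
```

Because the runs are taken as they appear, notational variants of a katakana word need no dictionary entry, which is the point the paragraph above makes.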

[0040] In the above embodiment, the morpheme analysis device of the present invention is incorporated as the morpheme analysis means 2 of the keyword extraction device, but the morpheme analysis means may instead be implemented by a conventional method. Alternatively, the components other than the morpheme analysis means 2 may be implemented by conventional methods.

[0041] The method of calculating the keyword evaluation value shown in the above embodiment is not limited to the method described above, as long as an evaluation value that better reflects the document content can be calculated.

[0042] In the above embodiment, the storage unit for occurrence frequencies in a specific field covers a single field, but this is not a limitation; storage units may be provided for various specific fields.

[0043] In the above embodiment, each processing means is implemented in dedicated hardware, but this is not a limitation; the same functions may be realized in software using a computer.

【００４４】[0044]

[Effects of the Invention] As is apparent from the above description, in the present invention the morpheme analysis means extracts character strings consisting of consecutive katakana or alphabetic characters in the input Japanese sentence as noun-equivalent words. This has the advantage that strings of consecutive katakana or alphabetic characters can be processed as words efficiently.

[0045] The present invention also comprises a compound word evaluation unit that determines an evaluation value of each compound word candidate from the cumulative frequency of the compound word keyword candidates in the input Japanese sentence, taking their mutual overlapping frequency into account, and keyword determining means that determines keywords based on those values. This has the advantage that keywords can be extracted efficiently.

[Brief description of drawings]

FIG. 1 is a block diagram showing a keyword extraction device according to one embodiment of the present invention.

FIG. 2 is a block diagram outlining the morpheme analysis means of the embodiment.

FIG. 3 is a block diagram outlining the compound word creating means of the embodiment.

FIG. 4 is a block diagram outlining the keyword evaluation means of the embodiment.

FIG. 5 is a block diagram outlining the dictionary of the embodiment.

FIG. 6 is a block diagram for explaining conventional processing of unknown katakana words.

FIG. 7 is a flowchart showing the document analysis sequence of the conventional example in FIG. 6.

FIG. 8 is a block diagram for explaining a conventional technical term extraction system.

FIG. 9 is a flowchart showing the technical term characteristic word base processing of the conventional example in FIG. 8.

[Explanation of symbols]

1 Document input means
2 Morpheme analysis means
3 Compound word creating means
4 Keyword evaluation means
5 Dictionary
21 Unknown word extraction unit
22 Noun extraction unit
31 Compound word candidate creation unit
32 Compound word evaluation unit
41 Evaluation value calculation unit
53 Selection unit

Continuation of the front page
(72) Inventor: Toshio Niwa, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture
(72) Inventor: Hidetsugu Maekawa, c/o Matsushita Electric Industrial Co., Ltd., 1006 Oaza Kadoma, Kadoma City, Osaka Prefecture

Claims

[Claims]

1. A morpheme analysis device comprising: a dictionary storing at least grammatical information necessary for analysis; and morpheme analysis means for extracting at least noun words and noun-equivalent words from an input Japanese sentence using the dictionary, wherein the morpheme analysis means extracts a character string consisting of consecutive katakana or alphabetic characters in the input Japanese sentence as a noun-equivalent word.

2. A keyword extraction device comprising: a dictionary storing at least grammatical information necessary for analysis; morpheme analysis means for extracting at least noun words and noun-equivalent words from an input Japanese sentence using the dictionary; a compound word keyword candidate creation unit that enumerates, as keyword candidates, the noun words or noun-equivalent words constituting a compound word made up of consecutive noun words or noun-equivalent words extracted by the morpheme analysis means, or compound words formed by combining a plurality of those words or equivalent words; compound word evaluation means for determining an evaluation value of each compound word keyword candidate, taking into consideration the mutual overlapping frequency, in the input Japanese sentence, of the compound word keyword candidates enumerated by the compound word keyword candidate creation unit; and keyword determining means for determining keywords based on the evaluation values.

3. The keyword extraction device according to claim 2, wherein, in taking the mutual overlapping frequency into consideration, the appearance frequency of a compound word keyword candidate that is included in another compound word keyword candidate is reduced by subtraction.

4. The keyword extraction device according to claim 2, wherein the compound word evaluation means calculates an evaluation value of keyword likelihood across the noun words, the noun-equivalent words, and the compound word keyword candidates together, based on the occurrence frequency of the noun words in general sentences, the appearance frequency of the noun words and noun-equivalent words in the input Japanese sentence, and the appearance frequency of the compound word keyword candidates, and in doing so lowers the evaluation of constituent words included in a compound word keyword candidate.

5. The keyword extraction device according to claim 4, wherein the dictionary stores the occurrence frequency of noun words in sentences of a predetermined specific field, and further comprises a selection unit for switching, according to the intended use, between the occurrence frequency in general sentences and the occurrence frequency in sentences of the specific field.