JP2011022809A

JP2011022809A - Important word extraction method, device, program, and recording medium

Info

Publication number: JP2011022809A
Application number: JP2009167476A
Authority: JP
Inventors: Yugo Nishikawa; 侑吾西川; Osamu Nakagawa; 修中川; Haruo Nishimura; 治男西村; Naoyuki Ito; 直之伊藤; Noriyuki Kobayashi; 宣幸小林; Junpei Kobayashi; 潤平小林; Naoyuki Tamura; 直之田村
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2009-07-16
Filing date: 2009-07-16
Publication date: 2011-02-03
Anticipated expiration: 2029-07-16
Also published as: JP5499546B2

Abstract

PROBLEM TO BE SOLVED: To provide a method for setting a compound word as an important word representing a feature of a document without using compound word rules or important word knowledge.
SOLUTION: An important word extraction method using a synonym database is performed by a procedure including: a document input step of acquiring document data; a compound word extraction step of extracting a compound word composed of a plurality of morphemes from the document data; a similar compound word creation step of creating a similar compound word by using one morpheme constituting the compound word as a search key, searching the synonym database, and replacing that morpheme with the synonym obtained; an appearance frequency totaling step of totaling the appearance frequencies of the compound word and the similar compound word in the document data; and an important word setting and presenting step of setting the compound word as an important word of the document when those appearance frequencies satisfy an important word setting condition, and presenting the compound word.
COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a method, an apparatus, a program, and a recording medium for extracting important words that represent the features of a document.
The invention is particularly useful when a word formed by concatenating morphemes extracted from a document written in natural language is to be extracted as an important word representing the features of that document.

With the spread of the Internet and computers, a vast number of electronic documents is accumulating. To make use of these documents, it is becoming increasingly important to extract the words that represent the features of a document (= important words).
Using important words as clues, a user of the documents (hereinafter, an analyst) can classify and search documents efficiently.
When the content of a document concerns a specialized subject, complex concepts are often expressed by a combination of several words (= a compound word), so it is effective to use compound words as the important words of a document.

A method for extracting compound words from a document and a method for determining important words are therefore described below.
<< A. Compound word extraction >>
To extract compound words, a document written in natural language is divided into morphemes by a morphological analysis program. Compound word rules, which define how words may be combined according to their parts of speech, attributes, and the like, are then applied to the divided morphemes to create compound words in which a plurality of morphemes are connected.
For example, Non-Patent Document 1 discloses a technique for dividing Japanese sentences into word units (= morphemes) and assigning parts of speech.
Patent Document 1 discloses a technique for creating compound words that carry morphological analysis information, such as part of speech and reading, by applying compound word rules to the morphemes of a document.
Here, morphological analysis is a widely used basic technology of natural language processing that, roughly speaking, divides a sentence written in natural language into a sequence of morphemes, the smallest meaningful units of the language, and identifies the part of speech of each.
<< B. Determination of important words >>
To select important words from among compound words, the importance of each compound word is calculated using important word knowledge, such as the positions in a document where important words appear, together with information on the appearance frequency of the compound word, and the important words are determined accordingly.
For example, Patent Document 2 discloses a technique for determining important words using important word knowledge such as the place where an important word is laid out and the font size of the important word.

Patent Document 1: JP 2008-287406 A (paragraphs 0081 to 0088, FIG. 6)
Patent Document 2: JP 2006-309347 A (paragraphs 0059 to 0062)

Non-Patent Document 1: Morphological analysis system ChaSen (茶筅) [retrieved June 20, 2008], Internet URL: http://chasen-legacy.sourceforge.jp/

As described above, in << A. Compound word extraction >>, when compound words are extracted from a document using compound word rules, compound words that do not conform to the rules cannot be extracted, and a compound word that cannot be extracted cannot be set as an important word of the document.
In << B. Determination of important words >>, important words must be analyzed in advance to acquire important word knowledge before important words can be determined.

The present invention has been made to solve the above problems, and its object is to provide a method, an apparatus, a program, and a recording medium that can set a compound word as an important word representing the features of a document without using compound word rules or important word knowledge.

The present invention solves the above problem by the means described in the following aspects.
That is, the first invention of the present application is an important word extraction method using a synonym database, characterized by being performed by a procedure including:
a document input step of acquiring document data;
a compound word extraction step of extracting a compound word composed of a plurality of morphemes from the document data;
a similar compound word creation step of creating a similar compound word by using one morpheme constituting the compound word as a search key, searching the synonym database, and replacing that morpheme with the synonym obtained;
an appearance frequency totaling step of totaling the appearance frequencies in the document data using the compound word and the similar compound word; and
an important word setting and presenting step of setting the compound word as an important word of the document and presenting it if the appearance frequencies of the compound word and the similar compound word satisfy an important word setting condition.

In this way, compound words can be extracted from a document without using compound word rules. The part of speech of a compound word is not limited to nouns, verbs, and the like, and may be any part of speech. The morphemes constituting the compound word may likewise be of any part of speech.
In addition, by applying the important word determination condition to the appearance frequency of similar compound words (= words in which part of the compound word has been replaced with a synonym), it can easily be determined whether the compound word is an important word (= a word representing the features of the document).

The second invention of the present application is the important word extraction method according to the first invention, characterized in that the important word setting condition is that
the appearance frequency of the compound word is two, and the appearance frequency of the similar compound word is zero.

The third invention of the present application is an important word extraction device comprising:
a storage unit having a synonym storage area that stores a synonym database;
document input means for acquiring document data;
compound word extraction means for extracting a compound word composed of a plurality of morphemes from the document data;
synonym acquisition means for searching the synonym database using one morpheme constituting the compound word as a search key and acquiring a synonym;
similar compound word creation means for creating a similar compound word by replacing the morpheme used as the search key with the acquired synonym;
appearance frequency totaling means for totaling the appearance frequencies in the document data using the compound word and the similar compound word; and
important word setting and presenting means for setting the compound word as an important word of the document and presenting it if the appearance frequencies of the compound word and the similar compound word satisfy an important word setting condition.

The fourth invention of the present application is the important word extraction device according to the third invention, characterized in that the important word setting condition is that
the appearance frequency of the compound word is two or more, and the appearance frequency of the similar compound word is zero.

The fifth invention of the present application is a computer program that, when installed on a computer, causes the computer to operate as the important word extraction device according to the third or fourth invention.

The sixth invention of the present application is a computer-readable recording medium on which the computer program according to the fifth invention is recorded.

According to the present invention:
(1) A compound word in which a plurality of morphemes are connected can be extracted from a document.
(2) By measuring the appearance frequency of similar compound words (= words in which part of the compound word has been replaced with a synonym), it can easily be determined whether the compound word is a word representing the features of the document.
The present invention therefore has the effect that whether a compound word is an important word representing the features of a document can easily be identified by evaluating appearance frequencies using the compound word and synonyms.

FIG. 1 is a diagram illustrating an overview of the compound word extraction system 1 according to an embodiment of the present invention. (Example 1)
FIG. 2 illustrates the rough procedure of the work and processing of the compound word extraction system 1.
FIG. 3 is a diagram for explaining a display example of document data.
FIG. 4 is a diagram for explaining the details of the synonym database.
FIG. 5 is a diagram for explaining the meanings of words and synonyms.
FIG. 6 is a diagram for explaining the detailed procedure of the compound word extraction process.
FIG. 7 is a diagram for explaining compound words.
FIG. 8 is a diagram for explaining the detailed procedure of the similar compound word generation process.
FIG. 9 is a detailed configuration diagram of the important word extraction device 100.

Embodiments of the present invention are described below in more detail with reference to the drawings.

FIG. 1 is a diagram illustrating an overview of the compound word extraction system 1 according to an embodiment of the present invention.
The compound word extraction system 1 consists of an important word extraction device 100 and a web server device 300 connected over a network.

The important word extraction device 100 is a personal computer that has an existing morphological analysis program and on which a user agent (for example, a web browser, a crawler, or a mail user agent) and the dedicated programs described later are installed.

The web server device 300 is a server computer that has an existing server program (for example, a web server program that provides web pages to user agents).
The document data provided by the web server device 300 includes web pages (including blogs and electronic bulletin boards).
The web server device 300 may also have a mail server program and provide net news.

FIG. 2 illustrates the rough procedure of the work and processing of the compound word extraction system 1.
(1) The analyst specifies a document on the Internet (such as a web page on the web server device 300), for example by entering its URL (Uniform Resource Locator).
(2) The important word extraction device 100 acquires the document data 391 specified by the URL from the web server device 300.
The important word extraction device 100 may instead be configured to acquire the document data 391 from an external recording medium (CD, DVD, or semiconductor memory).
(3) The important word extraction device 100 extracts compound words composed of a plurality of morphemes from the document data 391.
The compound words may be of any part of speech. Morphological analysis information such as part of speech and reading need not be associated with the compound words or with the morphemes that constitute them. (Details are described later.)
(4) The important word extraction device 100 uses the synonym database to obtain a synonym of a morpheme constituting a compound word, and replaces that morpheme in the compound word with the synonym to generate a similar compound word.
(5) The important word extraction device 100 totals the appearance frequencies of the compound word and the similar compound words in the document and, if the frequencies satisfy the important word setting condition (for example, the compound word appears two or more times and the similar compound words appear zero times), sets the compound word as an important word of the document and presents it.
(6) Using the important words as clues, the analyst arrives at the documents needed.

FIG. 3 is a diagram for explaining the important words of a document.

FIG. 3(a) is a display example of document data 391 in which a compound word is judged to be an important word.
The displayed document data 391 contains the compound word 「我輩は猫である」 ("I am a cat").
The figure shows that similar compound words created with a synonym of the morpheme 「猫」 ("cat"), for example 「我輩は犬である」 ("I am a dog"), created with the synonym 「犬」 ("dog"), do not appear in this document.

This display shows that the compound word 「我輩は猫である」 appears a predetermined number of times (for example, two or more) while the similar compound words (for example, 「我輩は犬である」) appear zero times, so the important word setting condition is satisfied and the compound word 「我輩は猫である」 is an important word.

FIG. 3(b) is a display example of document data 391 that contains both a compound word and similar compound words.
The displayed document data 391 contains the compound word 「日本の食料自給率」 ("Japan's food self-sufficiency rate") and the similar compound words 「アメリカの食料自給率」 ("America's food self-sufficiency rate") and 「韓国の食料自給率」 ("Korea's food self-sufficiency rate"). That is, because the synonyms of the morpheme 「日本」 ("Japan") are 「アメリカ」 ("America") and 「韓国」 ("Korea"), the similar compound words generated from 「日本の食料自給率」 are 「アメリカの食料自給率」 and 「韓国の食料自給率」.

This display shows that, although the compound word 「日本の食料自給率」 appears a predetermined number of times (for example, two or more), the similar compound words 「アメリカの食料自給率」 and 「韓国の食料自給率」 appear at least once, so the compound word 「日本の食料自給率」 is not an important word.

Next, consider whether the compound word 「食料自給率」 ("food self-sufficiency rate") is an important word. The appearance frequency of 「食料自給率」 (two or more) and the appearance frequency of its similar compound words, for example 「食べ物自給率」 (with 「食料」 replaced by 「食べ物」, another word for "food") and 「食料調達率」 ("food procurement rate") (zero), satisfy the important word setting condition. The compound word 「食料自給率」 is therefore set as an important word of this document.
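The two judgments above can be reproduced with a minimal Python sketch. It assumes that appearance frequency is counted by plain substring matching on the raw text, which the description does not specify. The function name `is_important_word`, the parameter `min_count`, and the sample text are hypothetical and introduced only for illustration.

```python
def is_important_word(text, compound, similar_compounds, min_count=2):
    """Example setting condition: compound appears >= min_count times
    and no similar compound word appears at all."""
    # Assumption: frequency = number of non-overlapping substring matches.
    compound_count = text.count(compound)
    similar_count = sum(text.count(s) for s in similar_compounds)
    return compound_count >= min_count and similar_count == 0


# Hypothetical sample text, made up for illustration only.
sample_text = (
    "日本の食料自給率は低い。"
    "日本の食料自給率とアメリカの食料自給率を比べる。"
)

# A similar compound word (アメリカの食料自給率) appears once.
japan_result = is_important_word(
    sample_text, "日本の食料自給率",
    ["アメリカの食料自給率", "韓国の食料自給率"])

# No similar compound word of 食料自給率 appears.
food_result = is_important_word(
    sample_text, "食料自給率", ["食べ物自給率", "食料調達率"])
```

In this sketch `japan_result` is false because one similar compound word is present, while `food_result` is true because the compound word appears at least twice and none of its similar compound words appear.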

The compound words in the above description were nouns or verbs, but the part of speech is not limited to these and may be an adjective, an adverb, or the like. Likewise, the morphemes constituting the compound words were nouns, verbs, particles, and so on, but they are not limited to these and may be adjectives, adverbs, or the like.
Note that extracting compound words in this way requires neither the compound word rules of the technique of Patent Document 1 nor the important word knowledge of the technique of Patent Document 2.

FIG. 4 explains the details of the synonym database.
The synonym database 191 is a database for retrieving synonyms that have the same meaning as a given morpheme.
The synonym database 191 includes a meaning table 1911 and a synonym table 1912.
The meaning table 1911 is tabular data that associates a word 1914 (primary key) with a meaning word 1913.
The synonym table 1912 is tabular data that associates a meaning word 1917 (primary key) with a set 1915 of synonyms 1916.

The synonym database 191 has been described using a relational database as an example, but a network database may be used instead.
As the synonym database 191, for example, 『日本語語彙大系』 (Nihongo Goi-Taikei, Iwanami Shoten), supervised by NTT Communication Science Laboratories, may be used.

The meaning words 1913 and the set of synonyms 1916 are now described with reference to FIG. 5.

The meaning words 1913 form a systematized vocabulary (for example, a vocabulary organized as a tree structure).
In the tree structure of meaning words 1913 illustrated in FIG. 5, the meaning words 「生物」 ("organism") and 「物質」 ("substance") are associated below the meaning word 「名詞」 ("noun"). Below 「生物」, the meaning words 「動物」 ("animal") and 「植物」 ("plant") are associated. Below 「動物」, the meaning words 「哺乳類」 ("mammal"), 「爬虫類」 ("reptile"), and 「鳥類」 ("bird") are associated.

The tree structure of these example meaning words indicates that the subordinate meanings of 「名詞」 ("noun") 1913 are 「生物」 ("organism") 1913 and 「物質」 ("substance") 1913, that the subordinate meanings of 「生物」 1913 are 「動物」 ("animal") 1913 and 「植物」 ("plant") 1913, and that the subordinate meanings of 「動物」 1913 are 「哺乳類」 ("mammal") 1913, 「爬虫類」 ("reptile") 1913, and 「鳥類」 ("bird") 1913.
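As an illustration only, the example hierarchy of meaning words can be written down as a small Python mapping from each meaning word to its subordinate meaning words. The names `MEANING_TREE` and `path_to` are hypothetical, and the description does not state how the hierarchy is actually stored.

```python
# Example hierarchy from FIG. 5: meaning word -> subordinate meaning words.
MEANING_TREE = {
    "名詞": ["生物", "物質"],             # noun -> organism, substance
    "生物": ["動物", "植物"],             # organism -> animal, plant
    "動物": ["哺乳類", "爬虫類", "鳥類"],  # animal -> mammal, reptile, bird
}


def path_to(meaning, root="名詞"):
    """Return the chain of meaning words from root down to meaning,
    or an empty list if meaning is not in the tree."""
    if root == meaning:
        return [root]
    for child in MEANING_TREE.get(root, []):
        sub_path = path_to(meaning, child)
        if sub_path:
            return [root] + sub_path
    return []


mammal_path = path_to("哺乳類")
```

Here `mammal_path` walks from 「名詞」 through 「生物」 and 「動物」 down to 「哺乳類」, mirroring the subordinate relations described above.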

Next, the set of synonyms 1916 is described.
The set of synonyms 1916 illustrated in FIG. 5 contains, for example, 「犬」 ("dog") 1915, 「猫」 ("cat") 1915, and 「牛」 ("cow") 1915. This set of synonyms 1915 is associated with the meaning word 1913 「哺乳類」 ("mammal").

This example set of synonyms 1915 indicates the following: the meaning of the synonym 1915 「猫」 ("cat") is, for example, 「哺乳類」 ("mammal") 1913, so the other synonyms 1915 associated with 「哺乳類」 1913 are 「犬」 ("dog") 1915, 「牛」 ("cow") 1915, and so on.
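The two-table lookup can be sketched in Python with dictionaries standing in for the meaning table 1911 and the synonym table 1912. This is a minimal sketch under the assumption that each word maps to exactly one meaning word; the names `MEANING_TABLE`, `SYNONYM_TABLE`, and `lookup_synonyms` are hypothetical, and the entries are limited to the example given in the text.

```python
# Stand-in for the meaning table 1911: word -> meaning word.
MEANING_TABLE = {"猫": "哺乳類", "犬": "哺乳類", "牛": "哺乳類"}

# Stand-in for the synonym table 1912: meaning word -> set of synonyms.
SYNONYM_TABLE = {"哺乳類": ["犬", "猫", "牛"]}


def lookup_synonyms(word):
    """Return the other words that share the meaning word of `word`."""
    meaning = MEANING_TABLE.get(word)
    if meaning is None:
        return []  # word not registered in the meaning table
    # Exclude the search key itself from its own synonym set.
    return [s for s in SYNONYM_TABLE.get(meaning, []) if s != word]


cat_synonyms = lookup_synonyms("猫")
```

With these entries, `cat_synonyms` contains 「犬」 and 「牛」, matching the example in which 「猫」 resolves to 「哺乳類」 and then to the other members of that set.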

FIG. 6 is a diagram for explaining the detailed procedure of the compound word extraction process.
(3) Compound word extraction process
(3-1) The important word extraction device 100 performs morphological analysis on the document data 391 and divides it into morphemes.
(3-2) The important word extraction device 100 takes N consecutive morphemes from the resulting morpheme sequence and concatenates them to form a compound word (N = 2, 3, ..., n, where n is a predetermined integer). Because the morphemes constituting a compound word are taken mechanically as consecutive runs from the morpheme sequence, the parts of speech of the morphemes do not matter.

Compound words are now explained with reference to FIG. 7.
FIG. 7(a) illustrates a phrase that forms part of the document data 391.
The example phrase is the novel title 「我輩は猫である」 ("I Am a Cat").
FIG. 7(b) illustrates the divided morphemes.
The example shown is 「我輩／は／猫／で／ある」, where ／ is a symbol marking a morpheme boundary.
FIG. 7(c) illustrates the compound words obtained by taking N consecutive morphemes and concatenating them.
When N = 2, the example compound words are 「吾輩は」, 「は猫」, 「猫で」, and 「である」.
When N = 3, the example compound words are 「吾輩は猫」, 「は猫で」, and 「猫である」.
When N = 4, the example compound words are 「吾輩は猫で」 and 「は猫である」.
When N = 5, the example compound word is 「吾輩は猫である」.
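The enumeration in FIG. 7(c) corresponds to taking every run of N consecutive morphemes. A minimal Python sketch follows; it assumes the morphemes have already been produced by a morphological analyzer, and the function name `extract_compound_words` and the parameter `max_n` (the predetermined integer n) are hypothetical.

```python
def extract_compound_words(morphemes, max_n):
    """Concatenate every run of N consecutive morphemes, 2 <= N <= max_n."""
    compounds = []
    for n in range(2, max_n + 1):
        # Slide a window of n morphemes over the sequence.
        for start in range(len(morphemes) - n + 1):
            compounds.append("".join(morphemes[start:start + n]))
    return compounds


# Morphemes of the example phrase, already segmented.
morphemes = ["吾輩", "は", "猫", "で", "ある"]
compounds = extract_compound_words(morphemes, max_n=5)
```

For the five morphemes of the example this yields ten compound words: four for N = 2, three for N = 3, two for N = 4, and one for N = 5. No part-of-speech information is consulted at any point.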

FIG. 8 is a diagram for explaining the detailed procedure of the similar compound word generation process.
(4) Similar compound word generation process
(4-1) The important word extraction device 100 searches the synonym database using one morpheme constituting the compound word as a search key and acquires the synonyms corresponding to that morpheme.
Within the synonym database 191, the search key morpheme is first used to look up the meaning table and obtain the corresponding meaning code, and the meaning code is then used to look up the synonym table and obtain the corresponding synonyms.
(4-2) The important word extraction device 100 replaces the corresponding morpheme in the morpheme sequence with an acquired synonym to generate a similar compound word.
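Steps (4-1) and (4-2) can be sketched in Python as follows. The description speaks of using one morpheme as the search key; trying every morpheme position in turn, as done here, is an assumption of this sketch. The function name `generate_similar_compounds`, the callback `synonyms_of`, and the toy synonym data are hypothetical.

```python
def generate_similar_compounds(morphemes, synonyms_of):
    """Replace one morpheme at a time with each of its synonyms.

    morphemes   -- the morphemes that make up one compound word
    synonyms_of -- callable returning a list of synonyms for a morpheme
    """
    similar = []
    for index, morpheme in enumerate(morphemes):
        for synonym in synonyms_of(morpheme):       # step (4-1): look up synonyms
            replaced = morphemes[:index] + [synonym] + morphemes[index + 1:]
            similar.append("".join(replaced))       # step (4-2): substitute and join
    return similar


# Toy synonym data limited to the example in the text.
toy_synonyms = {"猫": ["犬", "牛"]}

similar_compounds = generate_similar_compounds(
    ["吾輩", "は", "猫", "で", "ある"],
    lambda morpheme: toy_synonyms.get(morpheme, []),
)
```

Only 「猫」 has synonyms in the toy data, so the sketch produces 「吾輩は犬である」 and 「吾輩は牛である」. Morphemes without registered synonyms contribute nothing.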

FIG. 9 is a detailed configuration diagram of the important word extraction device 100.
The important word extraction device 100 includes a CPU 101, a display unit 103, an input unit 102, a network communication unit 104, a storage unit 109, and dedicated programs.
The CPU 101, the display unit 103, the input unit 102, the network communication unit 104, and the storage unit 109 are connected by a bus 199.

The CPU 101 is a central processing unit.
The display unit 103 is a liquid crystal display device or an organic EL display device.
The input unit 102 is a mouse or a keyboard. The input unit 102 may also be a soft keyboard (= touch panel) displayed on the display unit 103.
The network communication unit 104 is a LAN adapter.

The storage unit 109 is a semiconductor memory or a magnetic memory.
The storage unit 109 has a synonym database storage area 1091, a document storage area 1092, and a morpheme storage area 1093, and stores an operating system 185, a morphological analysis program 180, a web browser 381, and the dedicated programs.

The synonym database storage area 1091 stores the synonym database 191.
The document storage area 1092 stores the document data 391.
The morpheme storage area 1093 stores morpheme sequence data 195.

The storage unit 109 may also store user agents such as a crawler or a mail user agent.

The operating system 185 is basic software that manages and controls the hardware of the important word extraction device 100 (for example, the CPU 101, the input unit 102, the display unit 103, the network communication unit 104, the storage unit 109, and the bus 199) and provides services that allow application software (for example, the dedicated programs) to use this hardware.
The morphological analysis program 180 is an existing program that, when given text data in a predetermined format as input, outputs morphological analysis data in which the divided morphemes are associated with part-of-speech and reading information.
The web browser 381 is an existing program for viewing web pages (for example, document data) on Internet websites.

In addition, the device includes document input means 110, compound word extraction means 120, synonym extraction means 135, similar compound word creation means 130, appearance frequency totaling means 140, and important word setting and presenting means 150. Each of these means is realized by its own dedicated program and functions when the CPU 101 interprets and executes that program.

The document input means 110 acquires the document data 391 from the web server device 300.

The compound word extraction means 120 calls the morphological analysis program 180, passes it the document data 391 as input data, receives the morphological analysis data that the program outputs, and creates compound words from the morphemes in that data.
The details of the processing of the compound word extraction means 120 were described in the section explaining the detailed flow of (3) <<Compound word extraction processing>> in FIG. 5.

The synonym extraction means 135 searches the synonym database 191, using one morpheme constituting the compound word as a search key, and acquires a synonym.
The details of the processing of the synonym extraction means 135 were described in the section explaining the detailed flow of (4) <<Similar compound word generation processing>> in FIG. 6.

The similar compound word creation means 130 creates a similar compound word by replacing one morpheme of the original compound word with the acquired synonym.
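The synonym lookup and replacement described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation in the specification: the dictionary `SYNONYM_DB` is a hypothetical in-memory stand-in for the synonym database 191, the function names are invented for the example, compound words are represented as tuples of morphemes, and trying every morpheme position in turn is an assumption (the text only says that one morpheme is used as the search key).

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the synonym database 191.
SYNONYM_DB: Dict[str, List[str]] = {
    "car": ["automobile", "vehicle"],
    "insurance": ["assurance"],
}


def lookup_synonyms(morpheme: str, db: Dict[str, List[str]]) -> List[str]:
    """Search the synonym database using one morpheme as the search key."""
    return db.get(morpheme, [])


def create_similar_compounds(
    compound: Tuple[str, ...], db: Dict[str, List[str]]
) -> List[Tuple[str, ...]]:
    """Replace exactly one morpheme at a time with each of its synonyms."""
    similar: List[Tuple[str, ...]] = []
    for i, morpheme in enumerate(compound):
        for synonym in lookup_synonyms(morpheme, db):
            # Keep every other morpheme unchanged; swap only position i.
            similar.append(compound[:i] + (synonym,) + compound[i + 1:])
    return similar


similar = create_similar_compounds(("car", "insurance"), SYNONYM_DB)
```

A morpheme with no entry in the database simply contributes no similar compound words, so a compound word whose morphemes have no synonyms yields an empty list.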

The appearance frequency totaling means 140 totals the appearance frequencies of the compound word and the similar compound words in the document data 391.
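A minimal sketch of this totaling step is shown below. It assumes the document data is available as a plain string and that a simple non-overlapping substring count is an acceptable measure of appearance frequency; the specification does not state how matches are counted, and the function name `count_appearances` and the sample text are hypothetical.

```python
from typing import Dict, Iterable


def count_appearances(document: str, words: Iterable[str]) -> Dict[str, int]:
    """Count non-overlapping occurrences of each word in the document text."""
    return {word: document.count(word) for word in words}


# Hypothetical sample document and candidate words.
document = "car insurance rates vary. Compare car insurance before buying."
counts = count_appearances(document, ["car insurance", "automobile insurance"])
```

Here the compound word and one similar compound word are counted against the same text, which is the input the important word setting step needs.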

When the appearance frequencies of the compound word and the similar compound word satisfy a predetermined condition, the important word setting presentation means 150 sets the compound word as an important word of the document and presents it.
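The claims give one concrete condition: the compound word appears two or more times and the similar compound word appears zero times. The check can be sketched as below. The function name and parameters are hypothetical, and requiring that every similar compound word has a count of zero when several exist is an assumption, since the claim speaks of the similar compound word in the singular.

```python
from typing import Iterable


def is_important_word(
    compound_count: int,
    similar_counts: Iterable[int],
    min_compound_count: int = 2,
) -> bool:
    """Return True if the compound word meets the important word condition.

    Assumed reading of the condition: the compound word appears at least
    `min_compound_count` times and no similar compound word appears at all.
    """
    return compound_count >= min_compound_count and all(
        count == 0 for count in similar_counts
    )


frequent_and_unique = is_important_word(2, [0, 0])
frequent_but_varied = is_important_word(3, [0, 1])
too_rare = is_important_word(1, [0])
```

The intuition behind this reading is that a compound word the author repeats without ever substituting a synonym is likely a fixed term characteristic of the document.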

DESCRIPTION OF SYMBOLS
1 Compound word extraction system
100 Important word extraction device
110 Document input means
120 Compound word extraction means
130 Similar compound word creation means
135 Synonym extraction means
140 Appearance frequency totaling means
150 Important word setting presentation means
180 Morphological analysis program
191 Synonym database
300 Document server device
391 Document data

Claims

An important word extraction method using a synonym database, characterized by being performed in a procedure including:
a document input step of acquiring document data;
a compound word extraction step of extracting a compound word composed of a plurality of morphemes from the document data;
a similar compound word creation step of creating a similar compound word by replacing one morpheme constituting the compound word with a synonym obtained by searching the synonym database using that morpheme as a search key;
an appearance frequency totaling step of totaling appearance frequencies in the document data using the compound word and the similar compound word; and
an important word setting presentation step of setting the compound word as an important word of the document and presenting it when the appearance frequencies of the compound word and the similar compound word satisfy an important word setting condition.

The important word extraction method according to claim 1, characterized in that the important word setting condition is that the appearance frequency of the compound word is two or more and the appearance frequency of the similar compound word is zero.

An important word extraction device comprising:
a storage unit having a synonym storage area that stores a synonym database;
a document input means for acquiring document data;
a compound word extraction means for extracting a compound word composed of a plurality of morphemes from the document data;
a synonym acquisition means for searching the synonym database, using one morpheme constituting the compound word as a search key, and acquiring a synonym;
a similar compound word creation means for creating a similar compound word by replacing the morpheme used as the search key with the acquired synonym;
an appearance frequency totaling means for totaling appearance frequencies in the document data using the compound word and the similar compound word; and
an important word setting presentation means for setting the compound word as an important word of the document and presenting it when the appearance frequencies of the compound word and the similar compound word satisfy an important word setting condition.

The important word extraction device according to claim 3, characterized in that the important word setting condition is that the appearance frequency of the compound word is two or more and the appearance frequency of the similar compound word is zero.

A computer program that, when installed in a computer, causes the computer to operate as the important word extraction device according to claim 3 or claim 4.

A computer-readable recording medium on which the computer program according to claim 5 is recorded.