JP2011165087A

JP2011165087A - Important word extraction device, important word extraction method and important word extraction program

Info

Publication number: JP2011165087A
Application number: JP2010029405A
Authority: JP
Inventors: Mariko Kawaba (川場 真理子); Toru Hirano (平野 徹); Hisako Asano (浅野 久子)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-02-12
Filing date: 2010-02-12
Publication date: 2011-08-25
Anticipated expiration: 2030-02-12
Also published as: JP5331023B2

Abstract

PROBLEM TO BE SOLVED: To provide an important word extraction technique that selects important words on a basis other than appearance frequency.

SOLUTION: Important word candidates, each a compound of one or more nouns or a proper noun, are extracted from an input text. Action expressions, which appear when the creator of the input text describes his or her own actions, are extracted from the input text as feature words. One or more features representing the properties of each important word candidate are extracted for that candidate. An importance score is calculated from the features on the basis of a classification rule determined in advance by machine learning, and the important words are determined from the importance scores. When the features are extracted, dependency structure information indicating whether a candidate depends on an action expression is assigned to that candidate as a feature.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a technique for extracting particularly important words from among the words appearing in a text.

Non-Patent Document 1 is known as a conventional technique for extracting particularly important words from among the words appearing in a text. Non-Patent Document 1 mainly uses the appearance frequency of words in the text. For example, for the text of FIG. 1, "chocolate", which has the highest appearance frequency, is obtained as the important word.

[Non-Patent Document 1] Hiroya Takamura and Manabu Okumura, "最大被覆問題とその変種による文書要約モデル" (Document summarization model based on the maximum coverage problem and its variants), Transactions of the Japanese Society for Artificial Intelligence, 2008, Vol. 23, No. 6, pp. 505-513

However, the prior art has the problem that it sometimes cannot properly obtain important words on any basis other than appearance frequency.

For example, when the important words are meant to be the words that the text creator (hereinafter the "author") is interested in, the prior art may fail to select them. This is because an author rarely repeats the word of interest many times in a text and instead tends to use a more general word frequently. As in FIG. 1, when an author who likes the kind of chocolate called "truffle" writes about it, the author uses the word "chocolate" far more often than "truffle". The prior art therefore selects "chocolate" as the important word rather than "truffle", the word the author actually cares about. Thus, appearance frequency alone can find the topic of a text but cannot find the word the author is interested in.

To solve the above problem, the important word extraction technique according to the present invention extracts from an input text important word candidates, each a compound of one or more nouns or a proper noun; extracts from the input text, as feature words, action expressions that appear when the creator of the input text describes his or her actions; extracts, for each important word candidate, one or more features representing the properties of that candidate; calculates an importance score from the features on the basis of a classification rule determined in advance by machine learning; and determines the important words from the importance scores. When the features are extracted, dependency structure information indicating whether a candidate depends on an action expression is assigned to that candidate as a feature.

By using as a feature whether a word depends on an action expression, the present invention makes it possible to select important words more flexibly.

FIG. 1 shows an example of input text before morphological analysis.
FIG. 2 shows a configuration example of the important word extraction device.
FIG. 3 shows the processing flow of the important word extraction device.
FIG. 4 shows an example of input text after morphological analysis.
FIG. 5 shows an example of the important word candidates and their features stored in the storage unit.
FIG. 6 shows (A) an example of the feature words stored in the storage unit and (B) an example of the action expression rules stored in the storage unit.
FIG. 7 shows an example of a dependency structure.
FIG. 8 is a block diagram illustrating the hardware configuration of the important word extraction device.

This embodiment exploits a lexical characteristic of Japanese: words that depend on an action expression and words that receive (are modified by) a demonstrative are likely to be important words. It also exploits the characteristic that demonstratives tend to appear in a sentence containing an important word and in the sentences before and after it.

These lexical characteristics are taken into account by using them as features for machine learning.
Embodiments of the present invention are described in detail below.

<Important word extraction device 100>
The important word extraction device 100 according to Embodiment 1 is described with reference to FIGS. 2 and 3. The important word extraction device 100 includes an input unit 101, a storage unit 103, an important word candidate extraction unit 110, a feature word extraction unit 120, a feature extraction unit 130, and a classifier 140.

The important word extraction device 100 takes morphologically analyzed input text T as input and outputs important words J(p) and their importance scores score(J(p)), where p = 1, 2, ..., P and P is the number of selected important words.
<Input unit 101 and storage unit 103>
The important word extraction device 100 receives the morphologically analyzed input text via the input unit 101 (s101). For example, the sentence "いつも色々なお店で買うけど、今日は銀座によったので、あのＰ社のトリュフを購入。" ("I always buy them at various shops, but today I stopped by Ginza, so I bought those truffles from Company P.") is input as morphologically analyzed text in the form shown in FIG. 4. The input unit 101 is, for example, an input interface through which data is input.

However, input text T' that has not been morphologically analyzed (see FIG. 1) may also be input to the important word extraction device 100. In that case, the input text T' is passed via the input unit 101 to a morphological analysis unit (not shown), which segments T' into words using an existing morphological analysis method, assigns a part of speech to each word, obtains the morphologically analyzed input text T (see FIG. 4), and outputs it to the important word candidate extraction unit 110 and the feature word extraction unit 120.

The storage unit 103 stores and reads out, one by one, the data that is input and output and the data produced during computation; each computation proceeds in this way. However, data need not necessarily be stored in the storage unit 103 and may instead be passed directly between units.
<Important word candidate extraction unit 110>
The important word candidate extraction unit 110 takes the morphologically analyzed input text T, as in FIG. 4, as input, extracts from it the important word candidates j(1), ..., j(M), each a compound of one or more nouns or a proper noun (s110), and outputs them to the feature extraction unit 130 and the storage unit 103. Here M is the number of distinct important word candidates contained in the input text T, so the candidates do not overlap.

For example, the important word candidate extraction unit 110 takes the morphologically analyzed text T as input, extracts expressions denoting specific entities such as person names, place names, and organization names using an existing named entity extraction method (see Reference 1), and outputs them as important word candidates.
[Reference 1] Kenji Imamura, Kuniko Saito, Hisako Asano, "Basic Japanese Analysis Technology as a Foundation for Knowledge Extraction from Text", NTT Technical Journal, The Telecommunications Association, 2008.6, pp. 20-23
The named entity type (person name, place name, brand name, etc.) may be attached to each extracted candidate. In addition, sequences of one or more consecutive nouns are extracted and also output as important word candidates. For example, when the text of FIG. 4 is the input, the candidates output are the named entities "きょう (date)", "銀座 (place name)", and "Ｐ社 (organization)", together with "トリュフ (noun)" and "お店 (noun)", the latter formed by joining the consecutive nouns "お (noun prefix) / 店 (noun)". The label in parentheses is the named entity type. For example, when the result of morphologically analyzing the input text of FIG. 1 is the input, the extracted candidates are stored in the storage unit 103 as in the first column of FIG. 5.
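The consecutive-noun part of this step can be sketched as follows. This is a minimal illustration, not the patented implementation: the part-of-speech tag names (`noun`, `noun-prefix`, and so on) and the token list are assumptions made for the example, and named entity extraction (Reference 1) is left out entirely.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (surface form, part-of-speech tag)

# Hypothetical tag names; a real morphological analyzer has its own tag set.
NOUN_TAGS = {"noun", "noun-prefix"}


def extract_candidates(tokens: List[Token]) -> List[str]:
    """Return distinct candidates formed from maximal runs of consecutive nouns."""
    candidates: List[str] = []
    run: List[str] = []
    # A sentinel token flushes a run that reaches the end of the text.
    for surface, pos in tokens + [("", "END")]:
        if pos in NOUN_TAGS:
            run.append(surface)
            continue
        if run:
            phrase = "".join(run)
            # M counts distinct candidates, so repeated phrases are kept once.
            if phrase not in candidates:
                candidates.append(phrase)
            run = []
    return candidates


# Hand-written stand-in for morphologically analyzed text.
tokens = [("お", "noun-prefix"), ("店", "noun"), ("で", "particle"),
          ("買う", "verb"), ("銀座", "noun"), ("に", "particle"),
          ("トリュフ", "noun"), ("を", "particle"),
          ("お", "noun-prefix"), ("店", "noun")]

print(extract_candidates(tokens))  # ['お店', '銀座', 'トリュフ']
```

The second occurrence of "お店" is dropped, which mirrors the statement that the candidates do not overlap.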

<Feature word extraction unit 120>
The feature word extraction unit 120 extracts action expressions and demonstratives from the morphologically analyzed input text T as feature words h(1), ..., h(N) (s120) and outputs them to the feature extraction unit 130 and the storage unit 103. Here N is the number of feature words contained in the input text T. For example, when the result of morphologically analyzing the input text of FIG. 1 is the input, the extracted feature words are stored in the storage unit 103 as in FIG. 6(A). An action expression is a word that appears when the author describes his or her own actions. Action expressions are mainly past-tense and progressive forms of verbs denoting voluntary actions, and nouns denoting actions. They also include, for example, past-tense forms of adjectives that express impressions presumably gained from the writer's experience. Action expressions can be described, for example, by rules such as those in FIG. 6(B) (see Reference 2).
[Reference 2] Kayo Ikeda, Katsuyoshi Tanabe, Hidenori Okuda, "Extraction of Experience Information from Blogs Using Experience Expressions as Clues", Proceedings of the 18th IEICE Data Engineering Workshop (DEWS2007), 2007
For example, the feature word extraction unit 120 includes a demonstrative extraction unit and an action expression extraction unit (neither shown). The storage unit 103 stores all demonstratives in advance (for example, "これ" (this), "この" (this), "あれ" (that)), and the demonstrative extraction unit extracts the demonstratives contained in the input text T by consulting the storage unit 103.

The storage unit 103 either stores all action expressions in advance (for example, "買った" (bought), "使った" (used)) or stores rules for action expressions such as those in FIG. 6(B).
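The list-based variant of feature word extraction can be sketched as a plain dictionary lookup. The two word sets below are small illustrative subsets chosen for the example, not the stored lists themselves, and the sketch assumes each action expression already arrives as a single word; the rule-based variant of FIG. 6(B) is not reproduced because the rules are not given in the text.

```python
# Illustrative subsets only; the device is described as storing full lists in advance.
DEMONSTRATIVES = {"これ", "この", "あれ", "あの"}
ACTION_EXPRESSIONS = {"買った", "使った", "購入"}


def extract_feature_words(words):
    """Return (word, kind) pairs for demonstratives and action expressions, in text order."""
    found = []
    for word in words:
        if word in DEMONSTRATIVES:
            found.append((word, "demonstrative"))
        elif word in ACTION_EXPRESSIONS:
            found.append((word, "action"))
    return found


words = ["あの", "Ｐ社", "の", "トリュフ", "を", "購入"]
print(extract_feature_words(words))
# [('あの', 'demonstrative'), ('購入', 'action')]
```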

The action expression extraction unit either extracts the action expressions contained in the input text T by consulting the action expressions themselves stored in the storage unit 103, or consults the action expression rules stored in the storage unit 103 and extracts from the text T the words that match the rules.
<Feature extraction unit 130>
The feature extraction unit 130 extracts, for each important word candidate, one or more features representing the properties of that candidate (s130). A feature is one unit of the information used for the predetermined analysis in the classifier (for example, type, frequency, weight, dependency structure information, demonstrative information, title information) and denotes a property of an important word candidate. In machine learning, modeling a phenomenon in terms of such features makes it possible to learn the dependencies between rules as probabilistic behavior.

For example, the feature extraction unit 130 includes a frequency assigning unit 131, a weight assigning unit 133, a dependency structure information assigning unit 135, a demonstrative information assigning unit 137, and a title information assigning unit 139.
(Frequency assigning unit 131)
The frequency assigning unit 131 takes the input text T and the important word candidates j(m) (m = 1, 2, ..., M) as input, counts the appearance frequency (word frequency) of each candidate j(m) in the input text T, assigns that frequency to the candidate as feature α1(j(m)) (s131), and outputs it. When the morphologically analyzed document of FIG. 1 is the input text, the word frequency of each candidate is as shown in the second column of FIG. 5.
(Weight assigning unit 133)
The weight assigning unit 133 assigns a predetermined weight for each important word candidate j(m) to that candidate as feature α2(j(m)), as in the fourth column of FIG. 5 (s133), and outputs it.
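The frequency feature α1 and the weight feature α2 can be sketched as a count and a table lookup. This is a minimal sketch under stated assumptions: every candidate is treated as a single token, the weight table and its values are invented for illustration, and a fallback weight of 0 is used for words missing from the table.

```python
from collections import Counter


def frequency_feature(words, candidates):
    """alpha1: number of times each candidate occurs in the tokenized text."""
    counts = Counter(words)
    return {c: counts[c] for c in candidates}


def weight_feature(candidates, weight_table, default=0):
    """alpha2: predetermined weight per candidate, with a fallback for unknown words."""
    return {c: weight_table.get(c, default) for c in candidates}


words = ["チョコレート", "が", "好き", "チョコレート", "と", "トリュフ"]
candidates = ["チョコレート", "トリュフ"]
weight_table = {"チョコレート": 120}  # hypothetical external statistic

print(frequency_feature(words, candidates))    # {'チョコレート': 2, 'トリュフ': 1}
print(weight_feature(candidates, weight_table))  # {'チョコレート': 120, 'トリュフ': 0}
```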

For example, the storage unit 103 stores word weights in advance. The weight is externally obtained information such as how often a word has been used as a search query in a search engine, or its rank when words are ordered from most to least frequently used in search queries. The rank of a word when words are ordered by how often they appear in blogs, newspapers, and the like may also be used as the weight. The weights need not be stored in the storage unit 103. For example, if the important word extraction device 100 can access, over a communication line or the like, an external site that provides word search rankings, it may access the site, obtain the ranking for each important word candidate, and use that as the weight. In this case too, the weight is regarded as predetermined by the site. If the word corresponding to an important word candidate is found neither in the storage unit 103 nor on the site, the weight may be set to a predetermined value (for example, 0).
(Dependency structure information assigning unit 135)
The dependency structure information assigning unit 135 takes the important word candidates j(m), the feature words h(n) (n = 1, 2, ..., N), and the morphologically analyzed input text T as input, assigns dependency structure information to each candidate as feature α3(j(m)) (s135), and outputs it. Dependency structure information indicates whether a candidate depends on an action expression and whether it receives a demonstrative.

For example, the dependency structure information assigning unit 135 uses an existing dependency parsing method (see Reference 3) to group the important word candidates into bunsetsu (phrase units) and extract the dependency structure between the phrases.
[Reference 3] Makoto Nagao, "Natural Language Processing", Iwanami Lecture Series on Software Science 15, Iwanami Shoten, April 1996
It then outputs, as dependency structure information associated with each important word candidate as in the third column of FIG. 5, the value
・ 2 if the candidate receives a demonstrative extracted by the feature word extraction unit
・ 1 if the candidate depends on an action expression extracted by the feature word extraction unit
・ 0 for any other candidate
"Depends on an action expression" means serving as the subject, object, or adverbial of the action expression, and "receives a demonstrative" means being modified by the demonstrative. The values 0 to 2 may be replaced by other numbers or character strings as long as the three cases can be distinguished.
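The 2/1/0 assignment above can be sketched once dependency edges are available. The edge list below is a hand-written stand-in for parser output, not the actual structure of FIG. 7, and checking the demonstrative case before the action expression case is an assumption: the text does not say which value wins when both apply.

```python
def dependency_feature(candidates, edges, demonstratives, actions):
    """alpha3 per candidate; edges are (dependent, head) pairs from a dependency parse."""
    feature = {}
    for c in candidates:
        if any(dep in demonstratives and head == c for dep, head in edges):
            feature[c] = 2  # modified by a demonstrative
        elif any(dep == c and head in actions for dep, head in edges):
            feature[c] = 1  # depends on an action expression
        else:
            feature[c] = 0
    return feature


# Hypothetical edges for "きょうは ... あのＰ社のトリュフを購入".
edges = [("あの", "Ｐ社"), ("Ｐ社", "トリュフ"), ("トリュフ", "購入"), ("きょう", "購入")]
candidates = ["きょう", "Ｐ社", "トリュフ"]

print(dependency_feature(candidates, edges, {"あの"}, {"購入"}))
# {'きょう': 1, 'Ｐ社': 2, 'トリュフ': 1}
```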

For example, when the input is "きょう (noun) / は (adverbial particle) / 銀座 (noun) / に (case particle) / よ (verb stem) / っ (verb inflectional ending) / た (verb suffix) / ので (conjunctive suffix) / あの (adnominal) / Ｐ社 (noun) / の (case particle) / トリュフ (noun) / を (case particle) / 購入 (noun)", a dependency structure such as that in FIG. 7 is extracted.

The purpose of this is to identify, among the important word candidates obtained by the important word candidate extraction unit, those that depend on action expressions such as "買った" (bought) and "つかった" (used) extracted by the feature word extraction unit, and those that receive a demonstrative. For example, the results in the third column of FIG. 5 show that "きょう" and "トリュフ", whose dependency structure information is 1, are candidates that depend on an action expression, and that "Ｐ社", whose dependency structure information is 2, is a candidate that receives a demonstrative.
(Demonstrative information assigning unit 137)
The demonstrative information assigning unit 137 takes the demonstratives among the feature words, the important word candidates j(m), and the morphologically analyzed input text T as input, assigns demonstrative information to each candidate as feature α4(j(m)), as in the fifth column of FIG. 5 (s137), and outputs it. Demonstrative information indicates whether a demonstrative is present in the sentence containing the candidate and whether one is present in the sentences before and after that sentence. In FIG. 5, the value of the demonstrative information is
・ 2 if a demonstrative is present in the sentences before and after the sentence containing the candidate
・ 1 if a demonstrative is present in the sentence containing the candidate
・ 0 otherwise
Any other values or character strings may be used as long as these three states can be distinguished.
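The sentence-level check can be sketched as below. Two points are assumptions rather than statements from the text: a demonstrative in either neighboring sentence is treated as sufficient for the value 2, and the adjacent-sentence case is checked first, following the order in which the values are listed. The sentences are invented for the example.

```python
def demonstrative_feature(candidates, sentences, demonstratives):
    """alpha4 per candidate; sentences is a list of token lists in document order."""
    has_demo = [any(w in demonstratives for w in s) for s in sentences]
    feature = {}
    for c in candidates:
        value = 0
        for i, sentence in enumerate(sentences):
            if c not in sentence:
                continue
            # Flags for the sentence just before and just after sentence i.
            neighbours = has_demo[max(i - 1, 0):i] + has_demo[i + 1:i + 2]
            if any(neighbours):
                value = 2
                break
            if has_demo[i]:
                value = 1
        feature[c] = value
    return feature


sentences = [
    ["チョコレート", "が", "好き"],
    ["あの", "Ｐ社", "の", "トリュフ", "を", "購入"],
    ["コーヒー", "と", "食べた"],
    ["紅茶", "も", "好き"],
]
candidates = ["チョコレート", "トリュフ", "コーヒー", "紅茶"]

print(demonstrative_feature(candidates, sentences, {"あの"}))
# {'チョコレート': 2, 'トリュフ': 1, 'コーヒー': 2, '紅茶': 0}
```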

Also, if for example demonstratives are especially likely to appear in the sentence after (or before) a sentence containing an important word, the feature values need not follow the above rule. That is, separate feature values may be defined for four cases: a demonstrative is present in the sentence before the one containing the candidate, in the sentence after it, in the sentence containing it, or none of these. Alternatively, values may be defined for the three cases that remain when "a demonstrative is present in the sentence before the one containing the candidate" is excluded. The feature may also indicate whether a demonstrative is present within two or more sentences before and after the sentence containing the candidate, rather than only the one sentence on either side.
(Title information assigning unit 139)
The title information assigning unit 139 takes the title and the important word candidates j(m) as input, assigns title information to each candidate as feature α5(j(m)), as in the sixth column of FIG. 5 (s139), and outputs it. Title information indicates whether the candidate is contained in the title.

For example, when the input text has a title, the title information assigning unit 139 assigns as title information
・ 1 to a candidate contained in the title
・ 0 to a candidate not contained in the title
If the title of FIG. 1 is "コーヒーのお供" ("a companion to coffee"), the result is as in the sixth column of FIG. 5. When the input text has no title, the title information assigning unit may skip its processing, may assign title information indicating that there is no title (for example, 2), or may assign the same title information as for a candidate not contained in the title (for example, 0). The title information assigning unit 139 is not essential.
<Classifier 140>
The classifier 140 takes the features α1(j(m)) to α5(j(m)) as input and calculates an importance score score(j(m)) from the features on the basis of a classification rule determined in advance by machine learning (see Reference 4).
[Reference 4] Shunichi Amari, Hideki Aso, Koji Tsuda, Noboru Murata, "Statistics of Pattern Recognition and Learning: New Concepts and Methods", Iwanami Shoten, April 2003
It then determines the important words J(p) from the importance scores score(j(m)) (s140) and outputs the important words J(p) and their importance scores score(J(p)) (s150). A candidate may be taken as an important word when its importance score is at or above a fixed value, or when its importance score ranks above a predetermined position within a single document. The top few percent of candidates may also be taken as important words.
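The three selection options (fixed threshold, top rank, top percentage) can be sketched as follows. The scores are invented for illustration, and rounding the percentage up to at least one candidate is an assumption, since the text does not say how a fractional count is handled.

```python
import math


def select_by_threshold(scores, threshold):
    """Keep candidates whose importance score is at or above a fixed value."""
    return [word for word, s in scores.items() if s >= threshold]


def select_top_rank(scores, k):
    """Keep the k highest-scoring candidates within one document."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


def select_top_percent(scores, percent):
    """Keep the top given percentage of candidates (rounded up, at least one)."""
    k = max(1, math.ceil(len(scores) * percent / 100))
    return select_top_rank(scores, k)


# Hypothetical importance scores for one document.
scores = {"トリュフ": 0.9, "チョコレート": 0.6, "銀座": 0.2, "きょう": 0.1}

print(select_by_threshold(scores, 0.5))  # ['トリュフ', 'チョコレート']
print(select_top_rank(scores, 1))        # ['トリュフ']
print(select_top_percent(scores, 50))    # ['トリュフ', 'チョコレート']
```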

For example, the classifier 140 takes as input the feature vector α(j(m)) = [α1(j(m)), α2(j(m)), α3(j(m)), α4(j(m)), α5(j(m))] of each important word candidate j(m) and calculates the importance score on the basis of a model (classification rule) f() created in advance.
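One way a model f() could be learned in advance from hand-labeled candidates and then applied is sketched below with a small logistic regression written in plain Python. The choice of logistic regression is an assumption made for illustration; the text only refers to known machine learning. The training vectors and labels are invented, and treating the 0/1/2 codes as ordinary numbers is a simplification.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train(samples, labels, epochs=500, lr=0.1):
    """Fit weights and bias by stochastic gradient descent on the log loss."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def score(model, x):
    """Importance score f(alpha) in [0, 1] for one feature vector."""
    w, b = model
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical training data: [frequency, weight, dependency, demonstrative, title],
# with label 1 for candidates marked by hand as important.
samples = [[5, 3, 0, 0, 1], [1, 1, 1, 1, 0], [4, 2, 0, 2, 0],
           [1, 0, 2, 1, 0], [2, 1, 0, 0, 0], [1, 2, 1, 2, 1]]
labels = [0, 1, 0, 1, 0, 1]

model = train(samples, labels)
print(score(model, [1, 1, 1, 1, 0]) > score(model, [5, 3, 0, 0, 1]))  # True
```

A linear model like this can only learn that a feature pushes the score up or down; capturing the idea that particular frequency or weight ranges are more likely to mark an important word would need binned features or a non-linear learner.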

score(j(m)) = f(α(j(m)))
The importance scores of all important word candidates are obtained, and the important words are determined from those scores. By exploiting lexical characteristics of Japanese, important words (for example, the words the author is interested in) can be output.
<How to create the classification rule>
The classification rule f() is learned in advance by machine learning. Specifically, the training data consists of the important word candidates extracted for a training text collection together with the features assigned to each candidate by the feature extraction unit, and labels assigned by hand in advance to the candidates in the training text collection that mark the important words (for example, the words the author is interested in). Using known machine learning as described in Reference 1 and elsewhere, a classification rule for selecting the important words from among the feature-annotated candidates is learned.
<Effect>
By using as a feature whether a candidate depends on an action expression and selecting important words by machine learning, important words can be selected more flexibly than with the prior art, which selected them on the basis of appearance frequency alone.
In addition, using as a feature whether a candidate receives a demonstrative makes it possible to select important words still more flexibly and appropriately.

Exploiting the lexical characteristic of Japanese that demonstratives tend to appear in a sentence containing an important word and in the sentences before and after it makes it possible to select important words more appropriately.

The word frequency used in the prior art is also an important factor in determining important words, so using it as a machine learning feature allows important words to be selected more appropriately. However, whereas the prior art determined important words on the assumption that words with a high appearance frequency are important, this embodiment does not simply regard frequent words as important. Instead it assumes that there are appearance frequencies at which a word is likely to be important and learns them by machine learning. This configuration allows important words to be selected more appropriately.

Weights such as ranks are also an important factor in determining important words, so using them as a machine learning feature allows important words to be selected more appropriately. As with frequency, a word with a high (heavy) weight is not simply regarded as important; the weights at which a word is likely to be important are learned by machine learning.

Important word candidates included in the title are considered more likely to be important words, so using title information as a machine learning feature allows important words to be selected more appropriately.

By selecting important words in consideration of how these features relate to one another, important words can be selected more flexibly and appropriately than in the prior art.
<Hardware configuration>
As illustrated in FIG. 8, the important word extraction device 100 of this example includes a CPU (Central Processing Unit) 11, an input unit 12, an output unit 13, an auxiliary storage device 14, a ROM (Read Only Memory) 15, a RAM (Random Access Memory) 16, and a bus 17.

The CPU 11 in this example includes a control unit 11a, a calculation unit 11b, and a register 11c, and executes various calculation processes according to the programs read into the register 11c. The input unit 12 is an input interface through which data is input, a keyboard, a mouse, or the like, and the output unit 13 is an output interface through which data is output, a display, a printer, or the like. The auxiliary storage device 14 is, for example, a hard disk or a semiconductor memory, and stores various data and a program for causing a computer to function as the important word extraction device 100. The above-described program and data are loaded into the RAM 16 and used by the CPU 11 and other components. The bus 17 connects the CPU 11, the input unit 12, the output unit 13, the auxiliary storage device 14, the ROM 15, and the RAM 16 so that they can communicate with one another. Specific examples of such hardware include personal computers, server devices, and workstations.
<Program structure>
As described above, the auxiliary storage device 14 stores the programs for executing the processes of the important word extraction device 100 of this embodiment. The programs constituting the important word extraction program may be written as a single program sequence, or at least some of them may be stored in a library as separate modules.
<Cooperation between hardware and program>
The CPU 11 loads the above-described program and data stored in the auxiliary storage device 14 into the RAM 16 in accordance with the OS program that has been read. The addresses in the RAM 16 at which the program and data are written are stored in the register 11c of the CPU 11. The control unit 11a of the CPU 11 sequentially reads the addresses stored in the register 11c, reads the program and data from the areas of the RAM 16 indicated by those addresses, causes the calculation unit 11b to execute in sequence the operations indicated by the program, and stores the results in the register 11c.

FIG. 2 is a block diagram illustrating the functional configuration of the important word extraction device 100, which is configured by the CPU 11 reading and executing the above-described program in this manner.

Here, the storage unit 103 corresponds to any one of the auxiliary storage device 14, the RAM 16, the register 11c, and other buffer memory or cache memory, or to a storage area that uses them in combination. The important word candidate extraction unit 110, the feature word extraction unit 120, the feature extraction unit 130, and the classifier 140 are configured by causing the CPU 11 to execute the important word extraction program.
<Modification>
In this embodiment, the important word extraction device 100 outputs important words and importance scores, but it only needs to output at least the important words.

The important word candidate extraction unit 110 may assign to each extracted important word candidate, as a feature, its type as a named entity (for example, person name or place name). In that case, the classifier 140 can also use this type as a machine learning feature. With such a configuration, for example, when an author tends to treat place names rather than person names as important words, place names become more likely to be selected as important words.

In this embodiment, the feature extraction unit 130 extracts five features: frequency, weight, dependency structure information, instruction word information, and title information. However, as long as at least the part of the dependency structure information that indicates whether a candidate is an important word candidate related to an action expression is used as a feature, important words can be selected more flexibly than in the prior art.
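How a classifier might turn these features into an importance score can be sketched as follows. This is only an illustration under stated assumptions: the source says the classification rule is determined beforehand by machine learning but this passage does not name the model, so a logistic model is assumed here, and the feature names, weights, bias, and threshold are all hypothetical values rather than learned ones.

```python
import math

# Hypothetical weights standing in for a classification rule learned
# beforehand; the values are illustrative and not taken from the source.
WEIGHTS = {
    "frequency": 0.3,
    "weight": 0.2,
    "depends_on_action": 1.5,
    "instruction_word": 0.8,
    "in_title": 1.0,
}
BIAS = -2.0


def importance_score(features):
    """Return a score in (0, 1) from a dict of numeric features (logistic model)."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def select_important_words(candidates, threshold=0.5):
    """candidates maps each important word candidate to its feature dict."""
    scored = {word: importance_score(f) for word, f in candidates.items()}
    return {word: score for word, score in scored.items() if score >= threshold}
```

A feature dict may contain any subset of the five features, which mirrors the point that the dependency feature alone is the minimum required; omitted features simply contribute nothing to the score.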

In this embodiment, the case has been described in which the dependency structure information assigning unit 135 assigns one piece of dependency structure information to each important word candidate. However, when two pieces of dependency information (for example, 1 and 2) can be assigned to one important word candidate, both may be assigned. When the same dependency information could be assigned two or more times to one important word candidate, it may be handled in the same way as when it is assigned only once. The same applies to the instruction word information assigned by the instruction word information assigning unit 137.
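Assigning several pieces of dependency information to one candidate, while treating repeats as a single assignment, can be sketched as a multi-hot encoding. The integer codes 1 and 2 follow the example in the text; what each code denotes is not restated here, and the function name `encode_dependency_info` is hypothetical.

```python
def encode_dependency_info(labels, known_labels=(1, 2)):
    """Multi-hot encode the dependency labels assigned to one candidate.

    Repeated labels collapse to a single 1, so assigning the same label
    several times is handled exactly like assigning it once.
    """
    present = set(labels)
    return [1 if label in present else 0 for label in known_labels]
```

For example, `encode_dependency_info([1, 2])` marks both slots, while `encode_dependency_info([1, 1, 1])` marks only the first.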

Beyond simply extracting the important words of a single text document, the present invention can be used to extract important words from a plurality of text documents and to determine from them the keywords currently attracting attention. It can also be used, for example, when creating a summary from a text document as in Non-Patent Document 1.
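The multi-document use mentioned above could be sketched by counting, for each important word, the number of documents from which it was extracted. Ranking by document count is an assumption made for illustration; the source does not specify how the keywords currently attracting attention are determined, and the function name `trending_keywords` is hypothetical.

```python
from collections import Counter


def trending_keywords(important_words_per_document, top_n=3):
    """Rank important words by the number of documents they were extracted from."""
    counts = Counter()
    for words in important_words_per_document:
        counts.update(set(words))  # count each word at most once per document
    return counts.most_common(top_n)
```

The input is one list of extracted important words per document, as output by the important word extraction device.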

100 important word extraction device
110 important word candidate extraction unit
120 feature word extraction unit
130 feature extraction unit
131 frequency assigning unit
133 weight assigning unit
135 dependency structure information assigning unit
137 instruction word information assigning unit
139 title information assigning unit
140 classifier

Claims

An important word candidate extraction unit that extracts, from an input text, important word candidates each of which is a collocation of one or more nouns or a proper noun;
A feature word extraction unit that extracts from the input text, as a feature word, an action expression that appears when the creator of the input text describes his or her actions;
A feature extraction unit that extracts, for each important word candidate, one or more features representing properties of the important word candidate;
A classifier that calculates an importance score using the features based on a classification rule determined in advance by machine learning, and determines an important word from the importance score;
The feature extraction unit includes:
A dependency structure information assigning unit that assigns to the important word candidate, as a feature, dependency structure information indicating whether or not it is an important word candidate related to the action expression,
An important word extraction device characterized by that.

The important word extraction device according to claim 1,
The feature word extraction unit extracts an instruction word as a feature word in addition to the action expression,
The dependency structure information indicates whether or not the candidate is an important word candidate that receives the instruction word, in addition to whether or not it is an important word candidate related to the action expression,
The dependency structure information assigning unit assigns the dependency structure information as a feature to the important word candidate.
An important word extraction device characterized by that.

The important word extraction device according to claim 1 or 2,
The feature word extraction unit extracts an instruction word as a feature word in addition to the action expression,
The feature extraction unit further includes:
An instruction word information assigning unit that assigns to the important word candidate, as a feature, instruction word information indicating whether or not an instruction word exists in the sentence including the important word candidate and whether or not an instruction word exists in the sentences before and after that sentence,
An important word extraction device characterized by that.

The important word extraction device according to any one of claims 1 to 3,
The feature extraction unit further includes:
A frequency assigning unit that obtains the appearance frequency of the important word candidate from the input text and assigns it to the important word candidate as a feature;
An important word extraction device characterized by that.

The important word extraction device according to any one of claims 1 to 4,
The feature extraction unit further includes a weight assigning unit that assigns a predetermined weight for the important word candidate to the important word candidate as a feature.
An important word extraction device characterized by that.

The important word extraction device according to any one of claims 1 to 5,
The feature extraction unit further includes a title information assigning unit that uses the title of the input text and the important word candidate to assign to the important word candidate, as a feature, title information indicating whether or not the important word candidate is included in the title,
An important word extraction device characterized by that.

The important word extraction device according to any one of claims 1 to 6,
The important word candidate extraction unit, when extracting the important word candidate, assigns to the important word candidate, as a feature, its type as a named entity.
An important word extraction device characterized by that.

An important word candidate extraction step of extracting, from an input text, important word candidates each of which is a collocation of one or more nouns or a proper noun;
A feature word extraction step of extracting from the input text, as a feature word, an action expression that appears when the creator of the input text describes his or her actions;
A feature extraction step of extracting, for each important word candidate, one or more features representing properties of the important word candidate;
A classification step of calculating an importance score using the features based on a classification rule determined in advance by machine learning, and determining an important word from the importance score;
The feature extraction step includes:
A dependency structure information assigning step of assigning to the important word candidate, as a feature, dependency structure information indicating whether or not it is an important word candidate related to the action expression,
An important word extraction method characterized by this.

The important word extraction method according to claim 8,
The feature word extraction step extracts an instruction word as a feature word in addition to the action expression,
The dependency structure information indicates whether or not the candidate is an important word candidate that receives the instruction word, in addition to whether or not it is an important word candidate related to the action expression,
The dependency structure information assigning step assigns the dependency structure information as a feature to the important word candidate.
An important word extraction method characterized by this.

A program for causing a computer to function as the important word extraction device according to claim 1.